arXiv Papers with Code in Human-Computer Interaction (January 2025)

Abstract:
Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or operate only at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.

Paperid: 2, https://arxiv.org/pdf/2506.21862.pdf GitHub

Abstract:
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multi-choices benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
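
The idea of grouping tokens into non-overlapping semantic regions can be illustrated with a small sketch. This is not the LLaVA-Scissor implementation: the cosine-similarity graph, the `threshold` value, the mean-pooled representative, and all function names are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def connected_components(tokens, threshold=0.9):
    """Group token indices whose similarity graph (edge if cosine >= threshold) is connected."""
    parent = list(range(len(tokens)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if cosine(tokens[i], tokens[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups = {}
    for i in range(len(tokens)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def compress(tokens, threshold=0.9):
    """Replace each component by the mean of its tokens (an assumed pooling choice)."""
    compressed = []
    for comp in connected_components(tokens, threshold):
        dim = len(tokens[comp[0]])
        compressed.append(
            [sum(tokens[i][d] for i in comp) / len(comp) for d in range(dim)]
        )
    return compressed


tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
print(len(compress(tokens)))  # four tokens collapse into two representatives
```

Because every token belongs to exactly one component, the representatives cover the whole token set without overlap, which is the property the abstract emphasizes. A two-step spatial and temporal variant would apply the same grouping first within frames and then across them.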

Paperid: 3, https://arxiv.org/pdf/2506.21604.pdf GitHub

Abstract:
Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that optimal modality weighting (30% text, 15% image, 25% caption, and 30% OCR) improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction, a vital consideration for reliable enterprise AI. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
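
To make the reported weighting concrete, the sketch below combines per-modality relevance scores with the 30/15/25/30 weights from the abstract. The abstract does not say how the weights are applied, so the linear combination, the renormalization over missing modalities, and the names `WEIGHTS` and `fuse_scores` are all assumptions.

```python
# Weights taken from the abstract; they sum to 1.0.
WEIGHTS = {"text": 0.30, "image": 0.15, "caption": 0.25, "ocr": 0.30}


def fuse_scores(scores, weights=WEIGHTS):
    """Weighted average of per-modality scores.

    Modalities missing from `scores` are skipped and the remaining
    weights are renormalized (an assumed policy, not from the paper).
    """
    present = {m: w for m, w in weights.items() if m in scores}
    total = sum(present.values())
    if total == 0:
        raise ValueError("scores contains no known modality")
    return sum(scores[m] * w for m, w in present.items()) / total


# Hypothetical relevance scores for one retrieved document.
doc = {"text": 0.8, "image": 0.4, "caption": 0.6, "ocr": 0.9}
print(round(fuse_scores(doc), 3))
```

With all four modalities present, the result is simply 0.30·text + 0.15·image + 0.25·caption + 0.30·OCR; a text-only baseline corresponds to passing only the `text` score.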

Paperid: 4, https://arxiv.org/pdf/2506.21490.pdf GitHub

Abstract:
Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action, making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop human proxy agents on a large-scale human dataset that serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. The code is available at https://github.com/FLAIROx/ah2ac2.

Paperid: 5, https://arxiv.org/pdf/2506.21319.pdf GitHub

Abstract:
Current multimodal large language models (MLLMs), while effective in natural image understanding, struggle with visualization understanding due to their inability to decode the data-to-visual mapping and extract structured information. To address these challenges, we propose SimVec, a novel simplified vector format that encodes chart elements such as mark type, position, and size. The effectiveness of SimVec is demonstrated by using MLLMs to reconstruct chart information from SimVec formats. Then, we build a new visualization dataset, SimVecVis, to enhance the performance of MLLMs in visualization understanding, which consists of three key dimensions: bitmap images of charts, their SimVec representations, and corresponding data-centric question-answering (QA) pairs with explanatory chain-of-thought (CoT) descriptions. We finetune state-of-the-art MLLMs (e.g., MiniCPM and Qwen-VL), using SimVecVis with different dataset dimensions. The experimental results show that it leads to substantial performance improvements of MLLMs with good spatial perception capabilities (e.g., MiniCPM) in data-centric QA tasks. Our dataset and source code are available at: https://github.com/VIDA-Lab/SimVecVis.

Paperid: 6, https://arxiv.org/pdf/2506.19352.pdf GitHub

Abstract:
Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.

Paperid: 7, https://arxiv.org/pdf/2506.18786.pdf GitHub

Abstract:
Cybersickness remains a critical barrier to the widespread adoption of Virtual Reality (VR), particularly in scenarios involving intense or artificial motion cues. Among the key contributors is excessive optical flow: perceived visual motion that, when unmatched by vestibular input, leads to sensory conflict and discomfort. While previous efforts have explored geometric or hardware-based mitigation strategies, such methods often rely on predefined scene structures, manual tuning, or intrusive equipment. In this work, we propose U-MAD, a lightweight, real-time, AI-based solution that suppresses perceptually disruptive optical flow directly at the image level. Unlike prior handcrafted approaches, this method learns to attenuate high-intensity motion patterns from rendered frames without requiring mesh-level editing or scene-specific adaptation. Designed as a plug-and-play module, U-MAD integrates seamlessly into existing VR pipelines and generalizes well to procedurally generated environments. The experiments show that U-MAD consistently reduces average optical flow and enhances temporal stability across diverse scenes. A user study further confirms that reducing visual motion leads to improved perceptual comfort and alleviated cybersickness symptoms. These findings demonstrate that perceptually guided modulation of optical flow provides an effective and scalable approach to creating more user-friendly immersive experiences. The code will be released at https://github.com/XXXXX (upon publication).

Paperid: 8, https://arxiv.org/pdf/2506.18269.pdf GitHub

Abstract:
This study introduces Co-Persona, a methodological framework bridging large-scale social media analysis with authentic user understanding through systematic integration of Large Language Models and expert validation. Through a case study of B.Co, a Chinese manufacturer, we investigated the application of Co-Persona in bedside lamp development. Our methodology analyzed over 38 million posts from Xiao Hongshu, employing multi-stage data processing combining advanced NLP with expert validation. Analysis revealed five user personas derived from bedtime behaviors: Health Aficionados, Night Owls, Interior Decorators, Child-care Workers, and Workaholics, each showing unique pre-sleep activities and product preferences. Findings demonstrate that Co-Persona enhances manufacturers' ability to process large datasets while maintaining user understanding. The methodology provides structured approaches for targeted marketing and product strategies. The research contributes to the theoretical understanding of data-driven persona development and to practical applications in consumer-driven innovation. Code and data are available at https://github.com/INFPa/LLMwithPersona.

Paperid: 9, https://arxiv.org/pdf/2506.16643.pdf GitHub

Abstract:
Nonverbal visual symbols and displays play an important role in communication when humans and robots work collaboratively. However, few studies have investigated how different types of nonverbal cues affect objective task performance, especially in a dynamic environment that requires real-time decision-making. In this work, we designed a collaborative navigation task where the user and the robot each had only partial information about the map, so users were forced to communicate with the robot to complete the task. We conducted our study in a public space and recruited 37 participants who randomly passed by our setup. Each participant collaborated with a robot utilizing either animated anthropomorphic eyes and animated icons, or static anthropomorphic eyes and static icons. We found that participants who interacted with a robot with animated displays reported the greatest level of trust and satisfaction; that participants interpreted static icons the best; and that participants with a robot with static eyes had the highest completion success. These results suggest that while animation can foster trust with robots, human-robot communication can be optimized by the addition of familiar static icons that may be easier for users to interpret. We published our code, designed symbols, and collected results online at: https://github.com/mattufts/huamn_Cozmo_interaction.

Paperid: 10, https://arxiv.org/pdf/2506.15928.pdf GitHub

Abstract:
This paper presents an evaluation framework for agentic AI systems in mission-critical negotiation contexts, addressing the need for AI agents that can adapt to diverse human operators and stakeholders. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluated how personality traits and AI agent characteristics influence LLM-simulated social negotiation outcomes, a capability essential for a variety of applications involving cross-team coordination and civil-military interactions. Experiment 1 employs causal discovery methods to measure how personality traits impact price bargaining negotiations, through which we found that Agreeableness and Extraversion significantly affect believability, goal achievement, and knowledge acquisition outcomes. Sociocognitive lexical measures extracted from team communications detected fine-grained differences in agents' empathic communication, moral foundations, and opinion patterns, providing actionable insights for agentic AI systems that must operate reliably in high-stakes operational scenarios. Experiment 2 evaluates human-AI job negotiations by manipulating both simulated human personality and AI system characteristics, specifically transparency, competence, and adaptability, demonstrating how AI agent trustworthiness impacts mission effectiveness. These findings establish a repeatable evaluation methodology for experimenting with AI agent reliability across diverse operator personalities and human-agent team dynamics, directly supporting operational requirements for reliable AI systems. Our work advances the evaluation of agentic AI workflows by moving beyond standard performance metrics to incorporate social dynamics essential for mission success in complex operations.

Paperid: 11, https://arxiv.org/pdf/2506.15860.pdf GitHub

Abstract:
Visual analysis of relational data is essential for many real-world analytics tasks, with layout quality being key to interpretability. However, existing layout algorithms often require users to navigate complex parameters to express their intent. We present a user-guided force-directed layout approach that enables intuitive control through freehand sketching. Our method uses classical image analysis techniques to extract structural information from sketches, which is then used to generate positional constraints that guide the layout process. We evaluate the approach on various real and synthetic graphs ranging from small to medium scale, demonstrating its ability to produce layouts aligned with user expectations. An implementation of our method along with documentation and a demo page is freely available on GitHub at https://github.com/sciluna/uggly.

Paperid: 12, https://arxiv.org/pdf/2506.15085.pdf GitHub

Abstract:
Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with "expressive" joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text-to-speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji-prompting to allow fine-grained control of expressivity on a phrase level and use the lightweight Matcha-TTS backbone to generate speech in real time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech-to-speech interactive agent. We found that using varied emoji prompting improved the perception and expressivity of speech over a long period in a storytelling task, but expressive voice was not preferred in the assistant use case.

Paperid: 13, https://arxiv.org/pdf/2506.14799.pdf GitHub

Abstract:
Recent advances in AI have made automated analysis of complex media content at scale possible while generating actionable insights regarding character representation along such dimensions as gender and age. Past works focused on quantifying representation from audio/video/text using AI models, but without having the audience in the loop. We ask, even if character distributions along demographic dimensions are available, how useful are they to the general public? Do people actually trust the numbers generated by AI models? Our work addresses these open questions by proposing a new AI-based character representation tool and performing a thorough user study. Our tool has two components: (i) an analytics extraction model based on the Contrastive Language Image Pretraining (CLIP) foundation model that analyzes visual screen data to quantify character representation across age and gender; (ii) a visualization component designed for presenting the analytics to a lay audience. The user study seeks empirical evidence on the usefulness and trustworthiness of the AI-generated results for carefully chosen movies presented in the form of our visualizations. We found that participants were able to understand the analytics in our visualizations, and deemed the tool 'overall useful'. Participants also indicated a need for more detailed visualizations to include more demographic categories and contextual information of the characters. Participants' trust in AI-based gender and age models is seen to be moderate to low, although they were not against the use of AI in this context. Our tool including code, benchmarking, and the user study data can be found at https://github.com/debadyuti0510/Character-Representation-Media.

Paperid: 14, https://arxiv.org/pdf/2506.13776.pdf GitHub

Abstract:
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines

Paperid: 15, https://arxiv.org/pdf/2506.13326.pdf GitHub

Abstract:
Data visualization generation using Large Language Models (LLMs) has shown promising results but often produces suboptimal visualizations that require human intervention for improvement. In this work, we introduce VIS-Shepherd, a specialized Multimodal Large Language Model (MLLM)-based critic to evaluate and provide feedback for LLM-generated data visualizations. At the core of our approach is a framework to construct a high-quality visualization critique dataset, where we collect human-created visualization instances, synthesize corresponding LLM-generated instances, and construct high-quality critiques. We conduct both model-based automatic evaluation and human preference studies to evaluate the effectiveness of our approach. Our experiments show that even small (7B parameters) open-source MLLM models achieve substantial performance gains by leveraging our high-quality visualization critique dataset, reaching levels comparable to much larger open-source or even proprietary models. Our work demonstrates significant potential for MLLM-based automated visualization critique and indicates promising directions for enhancing LLM-based data visualization generation. Our project page: https://github.com/bopan3/VIS-Shepherd.

Paperid: 16, https://arxiv.org/pdf/2506.13079.pdf GitHub

Abstract:
Reinforcement Learning from Human Feedback has recently achieved significant success in various fields, and its performance is highly related to feedback quality. While much prior work acknowledged that human teachers' characteristics would affect human feedback patterns, little work has closely investigated the actual effects. In this work, we designed an exploratory study investigating how human feedback patterns are associated with human characteristics. We conducted a public space study with two long-horizon tasks and 46 participants. We found that feedback patterns are correlated not only with task statistics, such as rewards, but also with participants' characteristics, especially robot experience and educational background. Additionally, we demonstrated that human feedback value can be more accurately predicted with human characteristics compared to only using task statistics. All human feedback and characteristics we collected, and the code for our data collection and for predicting human feedback more accurately, are available at https://github.com/AABL-Lab/CHARM

Paperid: 17, https://arxiv.org/pdf/2506.12524.pdf GitHub

Abstract:
Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments. Our code implementations can be found at https://github.com/eye-tracking-for-physiological-sensing/EyeLoRiN.
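
A spike-suppressing median filter of the general kind described can be sketched as follows. This is a simplified stand-in, not the paper's Motion-Aware Median Filtering: the `window` and `max_jump` parameters, the rule of replacing only samples that deviate strongly from the local median, and the function name `despike` are assumptions for illustration.

```python
from statistics import median


def despike(signal, window=5, max_jump=10.0):
    """Replace isolated spikes with the local median; leave other samples untouched.

    A sample is treated as a spike when it differs from the median of its
    surrounding window by more than `max_jump` (same units as the signal).
    """
    half = window // 2
    out = list(signal)
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        local = median(signal[lo:hi])
        if abs(signal[i] - local) > max_jump:
            out[i] = local
    return out


# Hypothetical horizontal gaze coordinates with one blink-like spike.
gaze_x = [1, 2, 3, 100, 5, 6, 7]
print(despike(gaze_x))  # [1, 2, 3, 5, 5, 6, 7]
```

Filtering only the samples that exceed the threshold, rather than smoothing every sample, is one simple way to keep gradual gaze movement intact while removing blink-like outliers.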

Paperid: 18, https://arxiv.org/pdf/2506.10974.pdf GitHub

Abstract:
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.

Paperid: 19, https://arxiv.org/pdf/2506.10164.pdf GitHub

Abstract:
Developing literacy with unfamiliar data visualization techniques such as Parallel Coordinate Plots (PCPs) can be a significant challenge for students. We adopted the Revised Bloom's taxonomy to instruct students on PCPs using Mastery Learning in the classroom. To evaluate Mastery Learning's impact, we conducted an intervention in a Data Visualization course to teach students about PCPs using the Revised Bloom's taxonomy with and without Mastery Learning. Based on our intervention, we found that while students in both groups performed similarly on the first two (Remember, Understand) modules, the students in the Mastery Learning group performed better on modules that required more advanced thinking (Analyze, Evaluate) and demonstrated a better comprehension of PCPs. We provide all the materials developed, including the six-module Bloom's Taxonomy PCP literacy (BTPL) test, for full reproducibility on our website at https://vis-graphics.github.io/PCP-Literacy-Test/.

Paperid: 20, https://arxiv.org/pdf/2506.04867.pdf GitHub

Abstract:
We propose a method that enables large language models (LLMs) to control embodied agents by directly mapping continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as Gpt-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.

Paperid: 21, https://arxiv.org/pdf/2506.03310.pdf GitHub

Abstract:
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length...); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared "preference space". Reader vectors cluster into two profiles: 'surface-focused readers' (mainly non-experts), who prioritize readability and textual richness; and 'holistic readers' (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader's preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.

Paperid: 22, https://arxiv.org/pdf/2506.02911.pdf GitHub

Abstract:
Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at https://github.com/ncbi-nlp/cell-o1.
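
The batch-level framing can be illustrated with a small metric sketch. The abstract does not define batch-level accuracy, so the code assumes a batch counts as correct only when every cell in it receives the right label. The function names, the uniqueness check, and the example labels are hypothetical.

```python
def labels_are_unique(predicted):
    """True if no cell type is assigned to more than one cell in the batch."""
    return len(set(predicted)) == len(predicted)


def batch_level_accuracy(batches):
    """Fraction of batches whose predictions match the gold labels exactly.

    `batches` is a list of (predicted_labels, gold_labels) pairs, one label
    per cell, in the same cell order.
    """
    if not batches:
        return 0.0
    correct = sum(1 for pred, gold in batches if list(pred) == list(gold))
    return correct / len(batches)


# Two hypothetical batches: the first fully correct, the second with one error.
batches = [
    (["T cell", "B cell", "NK cell"], ["T cell", "B cell", "NK cell"]),
    (["T cell", "NK cell", "Monocyte"], ["T cell", "B cell", "Monocyte"]),
]
print(batch_level_accuracy(batches))  # 0.5
```

Under this definition a single mislabeled cell fails the whole batch, which would help explain why batch-level scores can be much lower than per-cell accuracy.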

Paperid: 23, https://arxiv.org/pdf/2506.02380.pdf GitHub

Abstract:
3D Gaussian Splatting (3DGS) is an emerging media representation that reconstructs real-world 3D scenes in high fidelity, enabling 6-degrees-of-freedom (6-DoF) navigation in virtual reality (VR). However, developing and evaluating 3DGS-enabled applications and optimizing their rendering performance require realistic user navigation data. Such data is currently unavailable for photorealistic 3DGS reconstructions of real-world scenes. This paper introduces EyeNavGS, the first publicly available 6-DoF navigation dataset featuring traces from 46 participants exploring twelve diverse, real-world 3DGS scenes. The dataset was collected at two sites using Meta Quest Pro headsets, recording the head pose and eye gaze data for each rendered frame during free world standing 6-DoF navigation. For each of the twelve scenes, we performed careful scene initialization to correct for scene tilt and scale, ensuring a perceptually comfortable VR experience. We also release our open-source SIBR viewer software fork with record-and-replay functionalities and a suite of utility tools for data processing, conversion, and visualization. The EyeNavGS dataset and its accompanying software tools provide valuable resources for advancing research in 6-DoF viewport prediction, adaptive streaming, 3D saliency, and foveated rendering for 3DGS scenes. The EyeNavGS dataset is available at: https://symmru.github.io/EyeNavGS/.

Paperid: 24, https://arxiv.org/pdf/2506.01391.pdf GitHub

Abstract:
The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lacks semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize in unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooking the growing diversity of non-English applications such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching 96.9% Type-Match and 91.3% Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoints, and evaluation data.

Paperid: 25, https://arxiv.org/pdf/2506.01077.pdf GitHub

Abstract:
Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (GeForce RTX 3060). Extensive subjective and objective evaluations on the ZEGGS and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching

Paperid: 26, https://arxiv.org/pdf/2506.00791.pdf GitHub

Abstract:
Drama-in-education is an interdisciplinary instructional approach that integrates subjects such as language, history, and psychology. Its core component is playwriting. Based on need-finding interviews with 13 teachers, we found that current general-purpose AI tools cannot effectively assist teachers and students during playwriting. Therefore, we propose CO-OPERA, a collaborative playwriting tool integrating generative artificial intelligence capabilities. In CO-OPERA, users can both expand their thinking through discussions with a tutor and converge their thinking by operating agents to generate script elements. Additionally, the system allows for iterative modifications and regenerations based on user requirements. A system usability test conducted with middle school students shows that CO-OPERA helps users focus on the overall logical development of the narrative during playwriting. Our playwriting examples and raw data for qualitative and quantitative analysis are available at https://github.com/daisyinb612/CO-OPERA.

Paperid: 27, https://arxiv.org/pdf/2506.00454.pdf GitHub

Abstract:
Dysarthria, a motor speech disorder, affects intelligibility and requires targeted interventions for effective communication. In this work, we investigate automated mispronunciation feedback by collecting a dysarthric speech dataset from six speakers reading two passages, annotated by a speech therapist with temporal markers and mispronunciation descriptions. We design a three-stage framework for explainable mispronunciation evaluation: (1) overall clarity scoring, (2) mispronunciation localization, and (3) mispronunciation type classification. We systematically analyze pretrained Automatic Speech Recognition (ASR) models in each stage, assessing their effectiveness in dysarthric speech evaluation (Code available at: https://github.com/augmented-human-lab/interspeech25_speechtherapy, Supplementary webpage: https://apps.ahlab.org/interspeech25_speechtherapy/). Our findings offer clinically relevant insights for automating actionable feedback for pronunciation assessment, which could enable independent practice for patients and help therapists deliver more effective interventions.

Paperid: 28, https://arxiv.org/pdf/2505.24301.pdf GitHub

Abstract:
Reconstructing visual stimuli from EEG signals is a crucial step in realizing brain-computer interfaces. In this paper, we propose a transformer-based EEG signal encoder integrating the Discrete Wavelet Transform (DWT) and a gating mechanism. Guided by the feature alignment and category-aware fusion losses, this encoder is used to extract features related to visual stimuli from EEG signals. Subsequently, with the aid of a pre-trained diffusion model, these features are reconstructed into visual stimuli. To verify the effectiveness of the model, we conducted EEG-to-image generation and classification tasks using the THINGS-EEG dataset. To address the limitations of quantitative analysis at the semantic level, we combined WordNet-based classification and semantic similarity metrics to propose a novel semantic-based score, emphasizing the ability of our model to transfer neural activities into visual representations. Experimental results show that our model significantly improves semantic alignment and classification accuracy, achieving a maximum single-subject accuracy of 43% and outperforming other state-of-the-art methods. The source code and supplementary material are available at https://github.com/zes0v0inn/DWT_EEG_Reconstruction/tree/main.
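
As a rough illustration of the Discrete Wavelet Transform mentioned above, the following minimal sketch applies one level of the Haar wavelet to a short signal. The choice of the Haar wavelet, the function name `haar_dwt`, and the sample values are assumptions made for illustration; the abstract does not state which wavelet family or decomposition depth the encoder uses.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT (illustrative helper, not the paper's code).

    Returns (approximation, detail) coefficient lists, each half the
    length of the input. The input length must be even.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass branch: scaled pairwise sums capture the slow trend.
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    # High-pass branch: scaled pairwise differences capture fast changes.
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


# Hypothetical four-sample segment of a single EEG channel.
approx, detail = haar_dwt([4.0, 6.0, 10.0, 12.0])
```

Because the Haar transform is orthonormal, the squared coefficients sum to the squared samples, so the two branches split the signal energy between a coarse and a fine scale.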

Paperid: 29, https://arxiv.org/pdf/2505.24255.pdf GitHub

Abstract:
Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory-of-mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain-of-thought (CoT) and varying ToM levels, and examined their decision-making processes across diverse LLMs, including reasoning models like o3-mini and DeepSeek-R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, and different roles in the game benefit from different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making. The code used for our experiments can be found at https://github.com/Stealth-py/UltimatumToM.
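
For readers unfamiliar with the ultimatum game, a minimal sketch of a single round follows. It shows only the standard payoff rule (a rejected offer leaves both players with nothing). The function name, the threshold-based responder, and the amounts are illustrative assumptions and are not taken from the paper's experimental code.

```python
def ultimatum_round(total, offer, min_acceptable):
    """Play one round of the ultimatum game (hypothetical helper).

    The proposer offers `offer` out of `total` to the responder, who
    accepts only if the offer meets their threshold `min_acceptable`.
    Returns (proposer_payoff, responder_payoff).
    """
    if not 0 <= offer <= total:
        raise ValueError("offer must lie between 0 and total")
    if offer >= min_acceptable:
        # Accepted: the pot is split as proposed.
        return total - offer, offer
    # Rejected: neither player receives anything.
    return 0, 0


# A proposer keeping most of the pot, facing a responder who expects more.
greedy_vs_fair = ultimatum_round(total=100, offer=10, min_acceptable=40)
# An even split offered to the same responder.
fair_vs_fair = ultimatum_round(total=100, offer=50, min_acceptable=40)
```

In the first call the offer is below the responder's threshold, so both payoffs are zero; in the second the split is accepted.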

Paperid: 30, https://arxiv.org/pdf/2505.23856.pdf GitHub

Abstract:
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approximately 120 times faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.

Paperid: 31, https://arxiv.org/pdf/2505.23183.pdf GitHub

Abstract:
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.

Paperid: 32, https://arxiv.org/pdf/2505.22863.pdf GitHub

Abstract:
Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a base score proposed by the related original paper. The code is available at https://github.com/myxp-lyp/Depression-detection.git
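
The two error metrics named above can be computed as in the following minimal sketch. The function names and the example scores are illustrative assumptions; they are not results or code from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more than MAE does."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical severity scores for four interviews (made-up values).
true_scores = [4, 10, 17, 2]
predicted_scores = [6, 9, 13, 2]

print(mae(true_scores, predicted_scores))   # 1.75
print(rmse(true_scores, predicted_scores))  # about 2.29
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions are far off, which is why the two are usually reported together.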

Paperid: 33, https://arxiv.org/pdf/2505.22809.pdf GitHub

Abstract:
Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.

Paperid: 34, https://arxiv.org/pdf/2505.21966.pdf GitHub

Abstract:
We introduce MapStory, an LLM-powered animation prototyping tool that generates editable map animation sequences directly from natural language text by leveraging a dual-agent LLM architecture. Given a user-written script, MapStory automatically produces a scene breakdown, which decomposes the text into key map animation primitives such as camera movements, visual highlights, and animated elements. Our system includes a researcher agent that accurately queries geospatial information by leveraging an LLM with web search, enabling automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine-tune parameters of these primitive blocks through an interactive timeline editor. We detail the system's design and architecture, informed by formative interviews with professional animators and by an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map-centric stories.

Paperid: 35, https://arxiv.org/pdf/2505.20679.pdf GitHub

Abstract:
Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation's nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at https://github.com/danushkhanna/self-percept.

Paperid: 36, https://arxiv.org/pdf/2505.19897.pdf GitHub

Abstract:
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

Paperid: 37, https://arxiv.org/pdf/2505.19335.pdf GitHub

Abstract:
Large language models are designed to encode general purpose knowledge about the world from Internet data. Yet, a wealth of information falls outside this scope -- ranging from personal preferences to organizational policies, from community-specific advice to up-to-date news -- that users want models to access but remains unavailable. In this paper, we propose a knowledge ecosystem in which end-users can create, curate, and configure custom knowledge modules that are utilized by language models, such as ChatGPT and Claude. To support this vision, we introduce Knoll, a software infrastructure that allows users to make modules by clipping content from the web or authoring shared documents on Google Docs and GitHub, add modules that others have made, and rely on the system to insert relevant knowledge when interacting with an LLM. We conduct a public deployment of Knoll reaching over 200 users who employed the system for a diverse set of tasks including personalized recommendations, advice-seeking, and writing assistance. In our evaluation, we validate that using Knoll improves the quality of generated responses.

Paperid: 38, https://arxiv.org/pdf/2505.18829.pdf GitHub GitHub

Abstract:
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at https://github.com/agiresearch/LiteCUA, and it is also integrated into the AIOS main branch as part of AIOS at https://github.com/agiresearch/AIOS.

Paperid: 39, https://arxiv.org/pdf/2505.18175.pdf GitHub

Abstract:
Electroencephalography-based Emotion Recognition (EEG-ER) has become a growing research area in recent years. Analyzing 216 papers published between 2018 and 2023, we uncover that the field lacks a unified evaluation protocol, which is essential to fairly define the state of the art, compare new approaches, and track the field's progress. We report the main inconsistencies between the evaluation protocols in use, which are related to ground truth definition, evaluation metric selection, data splitting types (e.g., subject-dependent or subject-independent) and the use of different datasets. Capitalizing on this state-of-the-art research, we propose a unified evaluation protocol, EEGain (https://github.com/EmotionLab/EEGain), which enables an easy and efficient evaluation of new methods and datasets. EEGain is a novel open source software framework, offering the capability to compare - and thus define - state-of-the-art results. EEGain includes standardized methods for data pre-processing, data splitting, evaluation metrics, and the ability to load the six most relevant datasets (i.e., AMIGOS, DEAP, DREAMER, MAHNOB-HCI, SEED, SEED-IV) in EEG-ER with only a single line of code. In addition, we have assessed and validated EEGain using these six datasets on the four most common publicly available methods (EEGNet, DeepConvNet, ShallowConvNet, TSception). This is a significant step to make research on EEG-ER more reproducible and comparable, thereby accelerating the overall progress of the field.
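
To make the subject-independent splitting mentioned above concrete, here is a minimal leave-one-subject-out sketch. It is a generic illustration only: the function name and the toy subject labels are assumptions, and this is not the EEGain API.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject.

    `subject_ids[i]` is the subject that recorded trial `i`. All trials of
    the held-out subject go to the test set, so no subject appears in both
    the training and the test data of a fold.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Five hypothetical trials recorded from three subjects.
trial_subjects = ["s1", "s1", "s2", "s3", "s3"]
folds = list(leave_one_subject_out(trial_subjects))
# folds[1] holds out "s2": train on trials 0, 1, 3, 4 and test on trial 2.
```

A subject-dependent protocol would instead split the trials within each subject, which typically yields higher scores and is one reason results under the two protocols are not directly comparable.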

Paperid: 40, https://arxiv.org/pdf/2505.18156.pdf GitHub

Abstract:
Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt-based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate LLMs. The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML-based simulation tests. A Python tool supports easy execution of prompt-based test cases. This paper outlines the framework's structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community-driven foundation for securing language models.

Paperid: 41, https://arxiv.org/pdf/2505.17241.pdf GitHub

Abstract:
Generative artificial intelligence (GenAI) is increasingly used to support a wide range of human tasks, yet empirical evidence on its effect on creativity remains scattered. Can GenAI generate ideas that are creative? To what extent can it support humans in generating ideas that are both creative and diverse? In this study, we conduct a meta-analysis to evaluate the effect of GenAI on the performance in creative tasks. For this, we first perform a systematic literature search, based on which we identify n = 28 relevant studies (m = 8214 participants) for inclusion in our meta-analysis. We then compute standardized effect sizes based on Hedges' g. We compare different outcomes: (i) how creative GenAI is; (ii) how creative humans augmented by GenAI are; and (iii) the diversity of ideas by humans augmented by GenAI. Our results show no significant difference in creative performance between GenAI and humans (g = -0.05), while humans collaborating with GenAI significantly outperform those working without assistance (g = 0.27). However, GenAI has a significant negative effect on the diversity of ideas for such collaborations between humans and GenAI (g = -0.86). We further analyze heterogeneity across different GenAI models (e.g., GPT-3.5, GPT-4), different tasks (e.g., creative writing, ideation, divergent thinking), and different participant populations (e.g., laypeople, business, academia). Overall, our results position GenAI as an augmentative tool that can support, rather than replace, human creativity, particularly in tasks benefiting from ideation support.
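
For reference, Hedges' g for a single two-group study can be computed as in the minimal sketch below, which uses the pooled standard deviation and the common approximation of the small-sample correction factor. The function name and the input numbers are illustrative assumptions and are not drawn from the studies in the meta-analysis.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Hedges' g: bias-corrected standardized mean difference.

    `_t` denotes the treatment group (e.g., humans with GenAI) and `_c`
    the control group. Positive values favor the treatment group.
    """
    df = n_t + n_c - 2
    # Pooled standard deviation across both groups.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    cohens_d = (mean_t - mean_c) / pooled_sd
    # Common approximation of the small-sample correction factor J.
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return correction * cohens_d


# Hypothetical study: 20 participants per group, creativity rated on one scale.
g = hedges_g(mean_t=5.5, mean_c=5.0, sd_t=1.0, sd_c=1.0, n_t=20, n_c=20)
print(round(g, 2))  # 0.49
```

The correction shrinks the estimate slightly toward zero, and its influence fades as sample sizes grow; a meta-analysis then combines such per-study values with weights based on their variances.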

Paperid: 42, https://arxiv.org/pdf/2505.15946.pdf GitHub

Abstract:
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.

Paperid: 43, https://arxiv.org/pdf/2505.15596.pdf GitHub

Abstract:
This project examines the prospect of using AI-generated feedback as suggestions to expedite and enhance human instructors' feedback provision. In particular, we focus on understanding the teaching assistants' perspectives on the quality of AI-generated feedback and how they may or may not utilize AI feedback in their own workflows. We situate our work in a foundational college Economics class, which has frequent short essay assignments. We developed an LLM-powered feedback engine that generates feedback on students' essays based on grading rubrics used by the teaching assistants (TAs). To ensure that TAs can meaningfully critique and engage with the AI feedback, we had them complete their regular grading jobs. For a randomly selected set of essays that they had graded, we used our feedback engine to generate feedback and displayed the feedback as in-text comments in a Word document. We then performed think-aloud studies with 5 TAs over 20 1-hour sessions to have them evaluate the AI feedback, contrast the AI feedback with their handwritten feedback, and share how they envision using the AI feedback if it were offered as suggestions. The study highlights the importance of providing detailed rubrics for AI to generate high-quality feedback for knowledge-intensive essays. TAs considered that using AI feedback as suggestions during their grading could expedite grading, enhance consistency, and improve overall feedback quality. We discuss the importance of decomposing the feedback generation task into steps and presenting intermediate results, in order for TAs to use the AI feedback.

Paperid: 44, https://arxiv.org/pdf/2505.15364.pdf GitHub

Abstract:
Auditory attention detection (AAD) aims to detect the target speaker in a multi-talker environment from brain signals such as electroencephalography (EEG), and has made great progress. However, most AAD methods solely utilize attention mechanisms sequentially and overlook valuable multi-scale contextual information within EEG signals, limiting their ability to capture long-short range spatiotemporal dependencies simultaneously. To address these issues, this paper proposes a multi-scale hybrid attention network (MHANet) for AAD, which consists of the multi-scale hybrid attention (MHA) module and the spatiotemporal convolution (STC) module. Specifically, MHA combines channel attention and multi-scale temporal and global attention mechanisms. This effectively extracts multi-scale temporal patterns within EEG signals and captures long-short range spatiotemporal dependencies simultaneously. To further improve the performance of AAD, STC utilizes temporal and spatial convolutions to aggregate expressive spatiotemporal representations. Experimental results show that the proposed MHANet achieves state-of-the-art performance across three datasets with three times fewer trainable parameters than the most advanced model. Code is available at: https://github.com/fchest/MHANet.

Paperid: 45, https://arxiv.org/pdf/2505.14664.pdf GitHub

Abstract:
Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
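
The idea of kernel regression over a projection space can be sketched with a plain Nadaraya-Watson estimator, shown below. This is a simplified stand-in: it uses a fixed Gaussian kernel with a hand-picked bandwidth, whereas the abstract describes adaptive generalized kernels optimized jointly with the projection. The function name, the 2D points, and the metric values are all illustrative assumptions.

```python
import math


def kernel_regression(points, values, query, bandwidth=1.0):
    """Nadaraya-Watson estimate of a metric at a 2D query location.

    `points` are already-projected 2D coordinates and `values` the metric
    (e.g., an alignment score) attached to each point. Nearby points
    contribute more to the estimate than distant ones.
    """
    weights = []
    for x, y in points:
        sq_dist = (x - query[0]) ** 2 + (y - query[1]) ** 2
        weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Two hypothetical projected samples with made-up metric values.
projected = [(0.0, 0.0), (2.0, 0.0)]
scores = [0.2, 0.4]

# Halfway between the two samples, the estimate is their average.
midpoint_estimate = kernel_regression(projected, scores, query=(1.0, 0.0))
```

Evaluating such an estimator on a grid of query locations gives a smooth metric landscape that can be drawn as a contour or heat map behind the scatter plot.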

Paperid: 46, https://arxiv.org/pdf/2505.14633.pdf GitHub

Abstract:
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.

Paperid: 47, https://arxiv.org/pdf/2505.13931.pdf GitHub

Abstract:
Recent advancements in robotics have underscored the need for effective collaboration between humans and robots. Traditional interfaces often struggle to balance robot autonomy with human oversight, limiting their practical application in complex tasks like mobile manipulation. This study aims to develop an intuitive interface that enables a mobile manipulator to autonomously interpret user-provided sketches, enhancing user experience while minimizing burden. We implemented a web-based application utilizing machine learning algorithms to process sketches, making the interface accessible on mobile devices for use anytime, anywhere, by anyone. In the first validation, we examined natural sketches drawn by users for 27 selected manipulation and navigation tasks, gaining insights into trends related to sketch instructions. The second validation involved comparative experiments with five grasping tasks, showing that the sketch interface reduces workload and enhances intuitiveness compared to conventional axis control interfaces. These findings suggest that the proposed sketch interface improves the efficiency of mobile manipulators and opens new avenues for integrating intuitive human-robot collaboration in various applications.

Paperid: 48, https://arxiv.org/pdf/2505.12727.pdf GitHub

Abstract:
Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma. Our corpus is openly available at https://github.com/HanMeng2004/Mental-Health-Stigma-Interview-Corpus.

Paperid: 49, https://arxiv.org/pdf/2505.12516.pdf GitHub

Abstract:
As see-through Mixed Reality Head-Mounted Displays (MRHMDs) proliferate, their usage is gradually shifting from controlled, private settings to spontaneous, public contexts. While location-based augmented reality mobile games such as Pokemon GO have been successful, the embodied interaction afforded by MRHMDs moves play beyond phone-based screen-tapping toward co-located, bodily, movement-based play. In anticipation of widespread MRHMD adoption, major technology companies have teased concept videos envisioning urban streets as vast mixed reality playgrounds (imagine Harry Potter-style wizard duels in city streets), which we term Immersive Mixed Reality Street Play (IMRSP). However, few real-world studies examine such scenarios. Through empirical, in-the-wild studies of our research-through-design game probe, Multiplayer Omnipresent Fighting Arena (MOFA), deployed across diverse public venues, we offer initial insights into the social implications, challenges, opportunities, and design recommendations of IMRSP. The MOFA framework, which includes three gameplay modes ("The Training," "The Duel," and "The Dragon"), is open-sourced at https://github.com/realitydeslab/mofa.

Paperid: 50, https://arxiv.org/pdf/2505.11612.pdf GitHub

Abstract:
Psychiatric disorders affect millions globally, yet their diagnosis faces significant challenges in clinical practice due to subjective assessments and accessibility concerns, leading to potential delays in treatment. To help address this issue, we present Heart2Mind, a human-centered contestable psychiatric disorder diagnosis system using wearable electrocardiogram (ECG) monitors. Our approach leverages cardiac biomarkers, particularly heart rate variability (HRV) and R-R intervals (RRI) time series, as objective indicators of autonomic dysfunction in psychiatric conditions. The system comprises three key components: (1) a Cardiac Monitoring Interface (CMI) for real-time data acquisition from Polar H9/H10 devices; (2) a Multi-Scale Temporal-Frequency Transformer (MSTFT) that processes RRI time series through integrated time-frequency domain analysis; (3) a Contestable Diagnosis Interface (CDI) combining Self-Adversarial Explanations (SAEs) with contestable Large Language Models (LLMs). Our MSTFT achieves 91.7% accuracy on the HRV-ACC dataset using leave-one-out cross-validation, outperforming state-of-the-art methods. SAEs successfully detect inconsistencies in model predictions by comparing attention-based and gradient-based explanations, while LLMs enable clinicians to validate correct predictions and contest erroneous ones. This work demonstrates the feasibility of combining wearable technology with Explainable Artificial Intelligence (XAI) and contestable LLMs to create a transparent, contestable system for psychiatric diagnosis that maintains clinical oversight while leveraging advanced AI capabilities. Our implementation is publicly available at: https://github.com/Analytics-Everywhere-Lab/heart2mind.
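
As a rough illustration of the cardiac biomarkers mentioned above, the following sketch computes standard time-domain HRV statistics from an R-R interval series. It is not the Heart2Mind pipeline or its MSTFT model; the function name `hrv_features` and the choice of features (SDNN, RMSSD) are assumptions made for illustration only.

```python
import math


def hrv_features(rri_ms):
    """Compute basic time-domain HRV features from R-R intervals (milliseconds).

    Hypothetical helper for illustration; not part of the Heart2Mind code base.
    """
    if len(rri_ms) < 3:
        raise ValueError("need at least three R-R intervals")

    n = len(rri_ms)
    mean_rr = sum(rri_ms) / n

    # SDNN: sample standard deviation of the R-R intervals.
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rri_ms) / (n - 1))

    # RMSSD: root mean square of successive differences.
    diffs = [b - a for a, b in zip(rri_ms, rri_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    return {
        "mean_rr_ms": mean_rr,
        "mean_hr_bpm": 60000.0 / mean_rr,
        "sdnn_ms": sdnn,
        "rmssd_ms": rmssd,
    }


if __name__ == "__main__":
    print(hrv_features([800, 810, 790, 800]))
```

Features like these summarize a window of RRI data; a learned model such as the one described in the abstract would instead consume the time series itself.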

Paperid: 51, https://arxiv.org/pdf/2505.11146.pdf GitHub

Abstract:
The ability to imitate realistic facial expressions is essential for humanoid robots engaged in affective human-robot communication. However, the lack of datasets containing diverse humanoid facial expressions with proper annotations hinders progress in realistic humanoid facial expression imitation. To address these challenges, we introduce X2C (Anything to Control), a dataset featuring nuanced facial expressions for realistic humanoid imitation. With X2C, we contribute: 1) a high-quality, high-diversity, large-scale dataset comprising 100,000 (image, control value) pairs. Each image depicts a humanoid robot displaying a diverse range of facial expressions, annotated with 30 control values representing the ground-truth expression configuration; 2) X2CNet, a novel human-to-humanoid facial expression imitation framework that learns the correspondence between nuanced humanoid expressions and their underlying control values from X2C. It enables facial expression imitation in the wild for different human performers, providing a baseline for the imitation task, showcasing the potential value of our dataset; 3) real-world demonstrations on a physical humanoid robot, highlighting its capability to advance realistic humanoid facial expression imitation. Code and Data: https://lipzh5.github.io/X2CNet/

Paperid: 52, https://arxiv.org/pdf/2505.10348.pdf GitHub

Abstract:
Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spatio-Temporal Enhancement Nested Network (ListenNet) for AAD. The ListenNet has three key components: Spatio-temporal Dependency Encoder (STDE), Multi-scale Temporal Enhancement (MSTE), and Cross-Nested Attention (CNA). The STDE reconstructs dependencies between consecutive time windows across channels, improving the robustness of dynamic pattern extraction. The MSTE captures temporal features at multiple scales to represent both fine-grained and long-range temporal patterns. In addition, the CNA integrates hierarchical features more effectively through novel dynamic attention mechanisms to capture deep spatio-temporal correlations. Experimental results on three public datasets demonstrate the superiority of ListenNet over state-of-the-art methods in both subject-dependent and challenging subject-independent settings, while reducing the trainable parameter count by approximately 7 times. Code is available at: https://github.com/fchest/ListenNet.

Paperid: 53, https://arxiv.org/pdf/2505.09938.pdf GitHub

Abstract:
Designing and evaluating personalized and proactive assistant agents remains challenging due to the time, cost, and ethical concerns associated with human-in-the-loop experimentation. Existing Human-Computer Interaction (HCI) methods often require extensive physical setup and human participation, which introduces privacy concerns and limits scalability. Simulated environments offer a partial solution but are typically constrained by rule-based scenarios and still depend heavily on human input to guide interactions and interpret results. Recent advances in large language models (LLMs) have introduced the possibility of generative agents that can simulate realistic human behavior, reasoning, and social dynamics. However, their effectiveness in modeling human-assistant interactions remains largely unexplored. To address this gap, we present a generative agent-based simulation platform designed to simulate human-assistant interactions. We identify ten prior studies on assistant agents that span different aspects of interaction design and replicate these studies using our simulation platform. Our results show that fully simulated experiments using generative agents can approximate key aspects of human-assistant interactions. Based on these simulations, we are able to replicate the core conclusions of the original studies. Our work provides a scalable and cost-effective approach for studying assistant agent design without requiring live human subjects. We will open source both the platform and collected results from the experiments on our website: https://dash-gidea.github.io/.

Paperid: 54, https://arxiv.org/pdf/2505.08245.pdf GitHub

Abstract:
The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.

Paperid: 55, https://arxiv.org/pdf/2505.06386.pdf GitHub

Abstract:
Embedding projections are popular for visualizing large datasets and models. However, people often encounter "friction" when using embedding visualization tools: (1) barriers to adoption, e.g., tedious data wrangling and loading, scalability limits, no integration of results into existing workflows, and (2) limitations in possible analyses, without integration with external tools to additionally show coordinated views of metadata. In this paper, we present Embedding Atlas, a scalable, interactive visualization tool designed to make interacting with large embeddings as easy as possible. Embedding Atlas uses modern web technologies and advanced algorithms -- including density-based clustering, and automated labeling -- to provide a fast and rich data analysis experience at scale. We evaluate Embedding Atlas with a competitive analysis against other popular embedding tools, showing that Embedding Atlas's feature set specifically helps reduce friction, and report a benchmark on its real-time rendering performance with millions of points. Embedding Atlas is available as open source to support future work in embedding-based analysis.

Paperid: 56, https://arxiv.org/pdf/2505.06291.pdf GitHub

Abstract:
While foundation models excel in text, image, and video domains, the critical biological signals, particularly electroencephalography (EEG), remain underexplored. EEG benefits neurological research with its high temporal resolution, operational practicality, and safety profile. However, low signal-to-noise ratio, inter-subject variability, and cross-paradigm differences hinder the generalization of current models. Existing methods often employ simplified strategies, such as a single loss function or a channel-temporal joint representation module, and suffer from a domain gap between pretraining and evaluation tasks that compromises efficiency and adaptability. To address these limitations, we propose the Adaptive Large Foundation model for EEG signal representation (ALFEE) framework, a novel hybrid transformer architecture with two learning stages for robust EEG representation learning. ALFEE employs a hybrid attention that separates channel-wise feature aggregation from temporal dynamics modeling, enabling robust EEG representation with variable channel configurations. A channel encoder adaptively compresses variable channel information, a temporal encoder captures task-guided evolution, and a hybrid decoder reconstructs signals in both temporal and frequency domains. During pretraining, ALFEE optimizes task prediction, channel and temporal mask reconstruction, and temporal forecasting to enhance multi-scale and multi-channel representation. During fine-tuning, a full-model adaptation with a task-specific token dictionary and a cross-attention layer boosts performance across multiple tasks. After 25,000 hours of pretraining, extensive experimental results on six downstream EEG tasks demonstrate the superior performance of ALFEE over existing models. Our ALFEE framework establishes a scalable foundation for biological signal analysis with implementation at https://github.com/xw1216/ALFEE.

Paperid: 57, https://arxiv.org/pdf/2505.06064.pdf GitHub

Abstract:
Electromyography (EMG)-based gesture recognition is a promising approach for designing intuitive human-computer interfaces. However, while these systems typically perform well in controlled laboratory settings, their usability in real-world applications is compromised by declining performance during real-time control. This decline is largely due to goal-directed behaviors that are not captured in static, offline scenarios. To address this issue, we use Context Informed Incremental Learning (CIIL), marking its first deployment in an object-manipulation scenario, to continuously adapt the classifier using contextual cues. Nine participants without upper limb differences completed a functional task in a virtual reality (VR) environment involving transporting objects with life-like grips. We compared two scenarios: one where the classifier was adapted in real-time using contextual information, and the other using a traditional open-loop approach without adaptation. The CIIL-based approach not only enhanced task success rates and efficiency, but also reduced the perceived workload by 7.1%, despite causing a 5.8% reduction in offline classification accuracy. This study highlights the potential of real-time contextualized adaptation to enhance user experience and usability of EMG-based systems for practical, goal-oriented applications, crucial elements towards their long-term adoption. The source code for this study is available at: https://github.com/BiomedicalITS/ciil-emg-vr.

Paperid: 58, https://arxiv.org/pdf/2505.02780.pdf GitHub

Abstract:
Pathologists rely on gigapixel whole-slide images (WSIs) to diagnose diseases like cancer, yet current digital pathology tools hinder diagnosis. The immense scale of WSIs, often exceeding 100,000 × 100,000 pixels, clashes with the limited views traditional monitors offer. This mismatch forces constant panning and zooming, increasing pathologist cognitive load, causing diagnostic fatigue, and slowing pathologists' adoption of digital methods. PathVis, our mixed-reality visualization platform for Apple Vision Pro, addresses these challenges. It transforms the pathologist's interaction with data, replacing cumbersome mouse-and-monitor navigation with intuitive exploration using natural hand gestures, eye gaze, and voice commands in an immersive workspace. PathVis integrates AI to enhance diagnosis. An AI-driven search function instantly retrieves and displays the top five similar patient cases side-by-side, improving diagnostic precision and efficiency through rapid comparison. Additionally, a multimodal conversational AI assistant offers real-time image interpretation support and aids collaboration among pathologists across multiple Apple devices. By merging the directness of traditional pathology with advanced mixed-reality visualization and AI, PathVis improves diagnostic workflows, reduces cognitive strain, and makes pathology practice more effective and engaging. The PathVis source code and a demo video are publicly available at: https://github.com/jaiprakash1824/Path_Vis

Paperid: 59, https://arxiv.org/pdf/2505.01753.pdf GitHub

Abstract:
Educational videos have become increasingly relevant in today's learning environments. While prior research in laboratory studies has provided valuable insights, analyzing real-world interaction data can enhance our understanding of authentic user behavior. Previous studies have investigated technical aspects, such as the influence of cuts on pausing behavior, but the impact of visual complexity remains understudied. In this paper, we address this gap and propose a novel approach centered on visual complexity, defined as the number of visually distinguishable and meaningful elements in a video frame, such as mathematical equations, chemical formulas, or graphical representations. Our study introduces a fine-grained taxonomy of visual objects in educational videos, expanding on previous classifications. Applying this taxonomy to 25 videos from physics and chemistry, we examine the relationship between visual complexity and user behavior, including pauses, in-video navigation, and session dropouts. The results indicate that increased visual complexity, especially of textual elements, correlates with more frequent pauses, rewinds, and dropouts. The results offer a deeper understanding of how video design affects user behavior in real-world scenarios. Our work has implications for optimizing educational videos, particularly in STEM fields. We make our code publicly available (https://github.com/TIBHannover/from_formulas_to_figures).

Paperid: 60, https://arxiv.org/pdf/2504.20114.pdf GitHub

Abstract:
Retrieval-augmented generation (RAG) systems face significant challenges in multi-hop question answering (MHQA), where complex queries require synthesizing information across multiple document chunks. Existing approaches typically rely on iterative LLM-based query rewriting and routing, resulting in high computational costs due to repeated LLM invocations and multi-stage processes. To address these limitations, we propose TreeHop, an embedding-level framework without the need for LLMs in query refinement. TreeHop dynamically updates query embeddings by fusing semantic information from prior queries and retrieved documents, enabling iterative retrieval through embedding-space operations alone. This method replaces the traditional "Retrieve-Rewrite-Vectorize-Retrieve" cycle with a streamlined "Retrieve-Embed-Retrieve" loop, significantly reducing computational overhead. Moreover, a rule-based stop criterion is introduced to further prune redundant retrievals, balancing efficiency and recall rate. Experimental results show that TreeHop rivals advanced RAG methods across three open-domain MHQA datasets, achieving comparable performance with only 5%-0.4% of the model parameter size and reducing the query latency by approximately 99% compared to concurrent approaches. This makes TreeHop a faster and more cost-effective solution for deployment in a range of knowledge-intensive applications. For reproducibility purposes, codes and data are available here: https://github.com/allen-li1231/TreeHop-RAG.
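
The "Retrieve-Embed-Retrieve" loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not TreeHop's implementation: the function names are hypothetical, `update_query` is a hand-written stand-in for the learned embedding update (it keeps the part of the retrieved document that the query did not already cover), and the stop rule shown here (prune branches that only re-retrieve known documents) is an assumed example of a rule-based criterion.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve(query, corpus, k):
    """Return the ids of the k documents most similar to the query embedding."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)
    return ranked[:k]


def update_query(query, doc_vec):
    """Assumed stand-in for a learned update: keep what the document adds beyond the query."""
    q_norm = math.sqrt(sum(x * x for x in query))
    if q_norm == 0:
        return list(doc_vec)
    q_hat = [x / q_norm for x in query]
    overlap = sum(d * q for d, q in zip(doc_vec, q_hat))
    residual = [d - overlap * q for d, q in zip(doc_vec, q_hat)]
    r_norm = math.sqrt(sum(x * x for x in residual))
    return [x / r_norm for x in residual] if r_norm > 0 else residual


def multi_hop_retrieve(query, corpus, hops=2, k=1):
    """Iterate retrieval purely in embedding space, with no LLM rewriting step."""
    seen = []
    frontier = [query]
    for _ in range(hops):
        next_frontier = []
        for q in frontier:
            for doc_id in retrieve(q, corpus, k):
                if doc_id in seen:
                    continue  # assumed stop rule: skip redundant retrievals
                seen.append(doc_id)
                next_frontier.append(update_query(q, corpus[doc_id]))
        if not next_frontier:
            break  # nothing new was found, so stop early
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    # Toy 3-D "embeddings": doc "a" matches the query and also points toward doc "b".
    corpus = {"a": [0.8, 0.6, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.0, 0.0, 1.0]}
    print(multi_hop_retrieve([1.0, 0.0, 0.0], corpus, hops=3, k=1))
```

In this toy corpus the first hop finds "a", the updated query then points toward "b", and the third hop terminates because no new document is retrieved.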

Paperid: 61, https://arxiv.org/pdf/2504.19838.pdf GitHub

Abstract:
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.

Paperid: 62, https://arxiv.org/pdf/2504.18271.pdf GitHub GitHub

Abstract:
Antenna modeling is a time-consuming and complex process, decreasing the speed of antenna analysis and design. In this paper, a large language model (LLM)-enabled antenna modeling method, called LEAM, is presented to address this challenge. LEAM enables automatic antenna model generation based on language descriptions via prompt input, images, descriptions from academic papers, patents, and technical reports (either one or multiple). The effectiveness of LEAM is demonstrated by three examples: a Vivaldi antenna generated from a complete user description, a slotted patch antenna generated from an incomplete user description and the operating frequency, and a monopole slotted antenna generated from images and descriptions scanned from the literature. For all the examples, correct antenna models are generated in a few minutes. The code can be accessed via https://github.com/TaoWu974/LEAM.

Paperid: 63, https://arxiv.org/pdf/2504.18010.pdf GitHub

Abstract:
Recent advances in autonomous system simulation platforms have significantly enhanced the safe and scalable testing of driving policies. However, existing simulators do not yet fully meet the needs of future transportation research, particularly in enabling effective human-AI collaboration and modeling socially-aware driving agents. This paper introduces Sky-Drive, a novel distributed multi-agent simulation platform that addresses these limitations through four key innovations: (a) a distributed architecture for synchronized simulation across multiple terminals; (b) a multi-modal human-in-the-loop framework integrating diverse sensors to collect rich behavioral data; (c) a human-AI collaboration mechanism supporting continuous and adaptive knowledge exchange; and (d) a digital twin framework for constructing high-fidelity virtual replicas of real-world transportation environments. Sky-Drive supports diverse applications such as autonomous vehicle-human road users interaction modeling, human-in-the-loop training, socially-aware reinforcement learning, personalized driving development, and customized scenario generation. Future extensions will incorporate foundation models for context-aware decision support and hardware-in-the-loop testing for real-world validation. By bridging scenario generation, data collection, algorithm training, and hardware integration, Sky-Drive has the potential to become a foundational platform for the next generation of human-centered and socially-aware autonomous transportation systems research. The demo video and code are available at: https://sky-lab-uw.github.io/Sky-Drive-website/

Paperid: 64, https://arxiv.org/pdf/2504.17960.pdf GitHub GitHub

Abstract:
Gait disorders are commonly observed in older adults, who frequently experience various issues related to walking. Additionally, researchers and clinicians extensively investigate mobility related to gait in typically and atypically developing children, athletes, and individuals with orthopedic and neurological disorders. Effective gait analysis enables the understanding of the causal mechanisms of mobility and balance control of patients, the development of tailored treatment plans to improve mobility, the reduction of fall risk, and the tracking of rehabilitation progress. However, analyzing gait data is a complex task due to the multivariate nature of the data, the large volume of information to be interpreted, and the technical skills required. Existing tools for gait analysis are often limited to specific patient groups (e.g., cerebral palsy), only handle a specific subset of tasks in the entire workflow, and are not openly accessible. To address these shortcomings, we conducted a requirements assessment with gait practitioners (e.g., researchers, clinicians) via surveys and identified key components of the workflow, including (1) data processing and (2) data analysis and visualization. Based on the findings, we designed VIGMA, an open-access visual analytics framework integrated with computational notebooks and a Python library, to meet the identified requirements. Notably, the framework supports analytical capabilities for assessing disease progression and for comparing multiple patient groups. We validated the framework through usage scenarios with experts specializing in gait and mobility rehabilitation. VIGMA is available at https://github.com/komar41/VIGMA.

Paperid: 65, https://arxiv.org/pdf/2504.16728.pdf GitHub

Abstract:
The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis, all designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System
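
For readers unfamiliar with Monte Carlo Tree Search, the selection step it relies on can be sketched with the standard UCT (UCB1) rule. This is a generic textbook sketch, not the IRIS implementation; the function names, the dictionary layout for child statistics, and the exploration constant are assumptions chosen for illustration.

```python
import math


def uct_score(total_reward, visits, parent_visits, exploration=math.sqrt(2)):
    """Standard UCT score: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_reward / visits
    explore = exploration * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child name with the highest UCT score.

    `children` maps a (hypothetical) idea name to {"reward": float, "visits": int}.
    """
    return max(
        children,
        key=lambda name: uct_score(
            children[name]["reward"], children[name]["visits"], parent_visits
        ),
    )


if __name__ == "__main__":
    ideas = {
        "refine-hypothesis": {"reward": 9.0, "visits": 10},
        "new-direction": {"reward": 1.0, "visits": 1},
    }
    print(select_child(ideas, parent_visits=11))
```

The exploration term is what lets a search spend more compute on rarely visited branches instead of repeatedly refining the current best candidate.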

Paperid: 66, https://arxiv.org/pdf/2504.15918.pdf GitHub

Abstract:
Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.
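
The mIoU figure quoted above is a mean intersection-over-union between predicted and reference video segments. A minimal sketch of that metric for one-dimensional time spans follows; the function names and the (start, end) tuple representation in seconds are assumptions, and the paper's exact evaluation protocol may differ.

```python
def temporal_iou(pred, gold):
    """Intersection over union of two (start, end) time spans."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, golds):
    """Average temporal IoU over paired predictions and references."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("preds and golds must be non-empty and the same length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (20.0, 30.0)]
    references = [(5.0, 15.0), (20.0, 30.0)]
    print(mean_iou(predictions, references))
```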

Paperid: 67, https://arxiv.org/pdf/2504.15329.pdf GitHub

Abstract:
Accurate 6D pose estimation has gained more attention over the years for robotics-assisted tasks that require precise interaction with physical objects. This paper presents an interactive 3D-to-2D visualization and annotation tool to support the 6D pose estimation research community. To the best of our knowledge, the proposed work is the first tool that allows users to visualize and manipulate 3D objects interactively on a 2D real-world scene, along with a comprehensive user study. This system supports robust 6D camera pose annotation by providing both visual cues and spatial relationships to determine object position and orientation in various environments. The annotation feature in Vision6D is particularly helpful in scenarios where the transformation matrix between the camera and world objects is unknown, as it enables accurate annotation of these objects' poses using only the camera intrinsic matrix. This capability serves as a foundational step in developing and training advanced pose estimation models across various domains. We evaluate Vision6D's effectiveness on the widely used open-source pose estimation datasets Linemod and HANDAL by comparing the default ground-truth camera poses with manual annotations. A user study was performed to show that Vision6D generates accurate pose annotations via visual cues in an intuitive 3D user interface. This approach aims to bridge the gap between 2D scene projections and 3D scenes, offering an effective way for researchers and developers to solve 6D pose annotation related problems. The software is open-source and publicly available at https://github.com/InteractiveGL/vision6D.
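
To make the role of the camera intrinsic matrix concrete, the sketch below projects a 3D object point into pixel coordinates given a candidate 6D pose (rotation plus translation), using the standard pinhole camera model. It is not taken from Vision6D; the function name, the row-major list-of-lists matrices, and the zero-skew assumption are illustrative choices.

```python
def project_point(point, rotation, translation, intrinsics):
    """Project a 3D object point to pixel coordinates with a pinhole model.

    point:       [x, y, z] in the object frame
    rotation:    3x3 rotation matrix (object frame to camera frame)
    translation: [tx, ty, tz] in the camera frame
    intrinsics:  3x3 matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (skew assumed zero)
    """
    # Rigid transform into the camera frame: X_cam = R * X_obj + t.
    cam = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")

    fx, fy = intrinsics[0][0], intrinsics[1][1]
    cx, cy = intrinsics[0][2], intrinsics[1][2]

    # Perspective division followed by the intrinsic mapping to pixels.
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return (u, v)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    K = [[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]]
    print(project_point([0.5, -0.5, 0.0], identity, [0.0, 0.0, 2.0], K))
```

An annotator adjusting a pose interactively is, in effect, searching for the rotation and translation that make projections like this line up with the object in the image.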

Paperid: 68, https://arxiv.org/pdf/2504.15133.pdf GitHub

Abstract:
In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.
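
The generator/applier split described above can be illustrated with one common way of building a steering vector: the difference between mean activations on contrasting examples, added to a hidden state at inference time. This is a generic sketch and not EasyEdit2's API; the function names, the plain-list activations, and the `strength` parameter are assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def generate_steering_vector(positive_acts, negative_acts):
    """Difference of means: direction from undesired toward desired behavior."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering_vector(hidden_state, steering_vector, strength=1.0):
    """Shift a hidden state along the steering direction; model weights are untouched."""
    return [h + strength * s for h, s in zip(hidden_state, steering_vector)]


if __name__ == "__main__":
    # Toy 2-D activations standing in for a layer's hidden states.
    positive = [[1.0, 2.0], [3.0, 4.0]]
    negative = [[0.0, 0.0], [2.0, 2.0]]
    vector = generate_steering_vector(positive, negative)
    print(apply_steering_vector([0.5, 0.5], vector, strength=2.0))
```

Because the intervention only adds a vector to activations at test time, it can be switched on, off, or rescaled without retraining.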

Paperid: 69, https://arxiv.org/pdf/2504.15101.pdf GitHub GitHub

Abstract:
Traditional brain-computer interfaces (BCIs), reliant on costly electroencephalography or invasive implants, struggle with complex human-computer interactions due to setup complexity and limited precision. We present NeuGaze, a novel webcam-based system that leverages eye gaze, head movements, and facial expressions to enable intuitive, real-time control using only a standard 30 Hz webcam, often pre-installed in laptops. Requiring minimal calibration, NeuGaze achieves performance comparable to conventional inputs, supporting precise cursor navigation, key triggering via an efficient skill wheel, and dynamic gaming interactions, such as defeating formidable opponents in first-person games. By harnessing preserved neck-up functionalities in motor-impaired individuals, NeuGaze eliminates the need for specialized hardware, offering a low-cost, accessible alternative to BCIs. This paradigm empowers diverse applications, from assistive technology to entertainment, redefining human-computer interaction for motor-impaired users. Project is at https://github.com/NeuSpeech/NeuGaze.

Paperid: 70, https://arxiv.org/pdf/2504.14764.pdf GitHub

Abstract:
Unstructured text has long been difficult to automatically analyze at scale. Large language models (LLMs) now offer a way forward by enabling semantic data processing, where familiar data processing operators (e.g., map, reduce, filter) are powered by LLMs instead of code. However, building effective semantic data processing pipelines presents a departure from traditional data pipelines: users need to understand their data to write effective pipelines, yet they need to construct pipelines to extract the data necessary for that understanding -- all while navigating LLM idiosyncrasies and inconsistencies. We present DocWrangler, a mixed-initiative integrated development environment (IDE) for semantic data processing with three novel features to address the gaps between the user, their data, and their pipeline: (i) In-Situ User Notes that allow users to inspect, annotate, and track observations across documents and LLM outputs, (ii) LLM-Assisted Prompt Refinement that transforms user notes into improved operations, and (iii) LLM-Assisted Operation Decomposition that identifies when operations or documents are too complex for the LLM to correctly process and suggests decompositions. Our evaluation combines a think-aloud study with 10 participants and a public-facing deployment (available at https://docetl.org/playground) with 1,500+ recorded sessions, revealing how users develop systematic strategies for their semantic data processing tasks; e.g., transforming open-ended operations into classifiers for easier validation and intentionally using vague prompts to learn more about their data or LLM capabilities.

Paperid: 71, https://arxiv.org/pdf/2504.14603.pdf GitHub GitHub

Abstract:
Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgents equipped with native APIs, domain-specific knowledge, and a unified GUI-API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, reducing per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference. We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.

Paperid: 72, https://arxiv.org/pdf/2504.14602.pdf GitHub

Abstract:
The natural interaction and control performance of lower limb rehabilitation robots are closely linked to biomechanical information from various human locomotion activities. Multidimensional human motion data significantly deepen the understanding of the complex mechanisms governing neuromuscular alterations, thereby facilitating the development and application of rehabilitation robots in multifaceted real-world environments. However, currently available lower limb datasets are inadequate for supplying the essential multimodal data and large-scale gait samples necessary for effective data-driven approaches, and they neglect the significant effects of acquisition interference in real applications. To fill this gap, we present the K2MUSE dataset, which includes a comprehensive collection of multimodal data, comprising kinematic, kinetic, amplitude-mode ultrasound (AUS), and surface electromyography (sEMG) measurements. The proposed dataset includes lower limb multimodal data from 30 able-bodied participants walking under different inclines (0°, ±5°, and ±10°), various speeds (0.5 m/s, 1.0 m/s, and 1.5 m/s), and different nonideal acquisition conditions (muscle fatigue, electrode shifts, and inter-day differences). The kinematic and ground reaction force data were collected via a Vicon motion capture system and an instrumented treadmill with embedded force plates, whereas the sEMG and AUS data were synchronously recorded for thirteen muscles on the bilateral lower limbs. This dataset offers a new resource for designing control frameworks for rehabilitation robots and conducting biomechanical analyses of lower limb locomotion. The dataset is available at https://k2muse.github.io/.

Paperid: 73, https://arxiv.org/pdf/2504.13936.pdf GitHub

Abstract:
App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal actions for complex tasks with many steps. To address this, world models are used to predict the next GUI observation based on user actions, enabling more effective agent planning. However, existing world models primarily focus on generating only textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first visual world model designed to generate future App observations as images. For the challenge of generating text in image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), to overlay text content with symbolic placeholders while preserving graphics. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text. Moreover, we deploy ViMo to enhance agent-focused tasks by predicting the outcome of different action options. Experiments show ViMo's ability to generate visually plausible and functionally effective GUIs that enable App agents to make more informed decisions.

Paperid: 74, https://arxiv.org/pdf/2504.13934.pdf GitHub

Abstract:
Three-dimensional urban environment simulation is a powerful tool for informed urban planning. However, the intensive manual effort required to prepare input 3D city models has hindered its widespread adoption. To address this challenge, we present VoxCity, an open-source Python package that provides a one-stop solution for grid-based 3D city model generation and urban environment simulation for cities worldwide. VoxCity's 'generator' subpackage automatically downloads building heights, tree canopy heights, land cover, and terrain elevation within a specified target area, and voxelizes buildings, trees, land cover, and terrain to generate an integrated voxel city model. The 'simulator' subpackage enables users to conduct environmental simulations, including solar radiation and view index analyses. Users can export the generated models using several file formats compatible with external software, such as ENVI-met (INX), Blender, and Rhino (OBJ). We generated 3D city models for eight global cities, and demonstrated the calculation of solar irradiance, sky view index, and green view index. We also showcased microclimate simulation and 3D rendering visualization through ENVI-met and Rhino, respectively, through the file export function. Additionally, we reviewed openly available geospatial data to create guidelines to help users choose appropriate data sources depending on their target areas and purposes. VoxCity can significantly reduce the effort and time required for 3D city model preparation and promote the utilization of urban environment simulations. This contributes to more informed urban and architectural design that considers environmental impacts, and in turn, fosters sustainable and livable cities. VoxCity is released openly at https://github.com/kunifujiwara/VoxCity.

Paperid: 75, https://arxiv.org/pdf/2504.13805.pdf GitHub

Abstract:
Mobile GUI agents show promise in automating tasks but face generalization challenges in diverse real-world scenarios. Traditional approaches using pre-training or fine-tuning with massive datasets struggle with the diversity of mobile applications and user-specific tasks. We propose enhancing mobile GUI agent capabilities through human demonstrations, focusing on improving performance in unseen scenarios rather than pursuing universal generalization through larger datasets. To realize this paradigm, we introduce LearnGUI, the first comprehensive dataset specifically designed for studying demonstration-based learning in mobile GUI agents, comprising 2,252 offline tasks and 101 online tasks with high-quality human demonstrations. We further develop LearnAct, a sophisticated multi-agent framework that automatically extracts knowledge from demonstrations to enhance task completion. This framework integrates three specialized agents: DemoParser for knowledge extraction, KnowSeeker for relevant knowledge retrieval, and ActExecutor for demonstration-enhanced task execution. Our experimental results show significant performance gains in both offline and online evaluations. In offline assessments, a single demonstration improves model performance, increasing Gemini-1.5-Pro's accuracy from 19.3% to 51.7%. In online evaluations, our framework enhances UI-TARS-7B-SFT's task success rate from 18.1% to 32.8%. LearnAct framework and LearnGUI benchmark establish demonstration-based learning as a promising direction for more adaptable, personalized, and deployable mobile GUI agents.

Paperid: 76, https://arxiv.org/pdf/2504.11936.pdf GitHub

Abstract:
The reconstruction of 3D objects from brain signals has gained significant attention in brain-computer interface (BCI) research. Current research predominantly utilizes functional magnetic resonance imaging (fMRI) for 3D reconstruction tasks due to its excellent spatial resolution. Nevertheless, the clinical utility of fMRI is limited by its prohibitive costs and inability to support real-time operations. In comparison, electroencephalography (EEG) presents distinct advantages as an affordable, non-invasive, and mobile solution for real-time brain-computer interaction systems. While recent advances in deep learning have enabled remarkable progress in image generation from neural data, decoding EEG signals into structured 3D representations remains largely unexplored. In this paper, we propose a novel framework that translates EEG recordings into 3D object reconstructions by leveraging neural decoding techniques and generative models. Our approach involves training an EEG encoder to extract spatiotemporal visual features, fine-tuning a large language model to interpret these features into descriptive multimodal outputs, and leveraging generative 3D Gaussians with layout-guided control to synthesize the final 3D structures. Experiments demonstrate that our model captures salient geometric and semantic features, paving the way for applications in BCIs, virtual reality, and neuroprosthetics. Our code is available at https://github.com/sddwwww/Mind2Matter.

Paperid: 77, https://arxiv.org/pdf/2504.11257.pdf GitHub

Abstract:
Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, GUI instruction grounding, which maps a user instruction to the location of the corresponding element on a given screenshot, remains a critical challenge, particularly due to limited public training datasets and resource-intensive manual annotation of instruction data. In this paper, we delve into unexplored challenges in this task, including element-to-screen ratio, unbalanced element types, and implicit instructions. To address these challenges, we introduce a large-scale data synthesis pipeline, UI-E2I-Synth, for generating instruction datasets of varying complexity using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark, UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the effectiveness of the proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding. We will release corresponding artifacts at https://microsoft.github.io/FIVE-UI-Evol/.

Paperid: 78, https://arxiv.org/pdf/2504.10808.pdf GitHub

Abstract:
Detecting empathy from video interactions is an emerging area of research, particularly in healthcare and social robotics. However, privacy and ethical concerns often prevent the release of raw video data, with many datasets instead shared as pre-extracted tabular features. Previous work on such datasets has established classical tree-based models as the state of the art. Motivated by recent successes of large-scale foundation models for text, we investigate the potential of tabular foundation models (TFMs) for empathy detection from video-derived tabular data. Our proposed system, TFMPathy, is demonstrated with two recent TFMs (TabPFN v2 and TabICL) under both in-context learning and fine-tuning paradigms. On a public human-robot interaction benchmark, TFMPathy significantly improves empathy detection accuracy reported in the literature. While the established evaluation protocol in the literature does not ensure cross-subject generalisation, our evaluation scheme also captures such generalisation. We show that TFMPathy under a fine-tuning setup has better cross-subject generalisation capacity over baseline methods (accuracy: 0.590 → 0.730; AUC: 0.564 → 0.669). Given the ongoing privacy and ethical constraints around raw video sharing, the proposed TFMPathy system provides a practical and scalable path toward building AI systems dependent on human-centred video datasets. Our code is publicly available at https://github.com/hasan-rakibul/TFMPathy (will be made available upon acceptance of this paper).

Paperid: 79, https://arxiv.org/pdf/2504.10489.pdf GitHub

Abstract:
In this paper, we present Roamify, an Artificial Intelligence powered travel assistant that aims to ease the process of travel planning. We have tested and used multiple Large Language Models like Llama and T5 to generate personalised itineraries per user preferences. Results from user surveys highlight the preference for AI powered mediums over existing methods to help in travel planning across all user age groups. These results firmly validate the potential need of such a travel assistant. We highlight the two primary design considerations for travel assistance: D1) incorporating a web-scraping method to gather up-to-date news articles about destinations from various blog sources, which significantly improves our itinerary suggestions, and D2) utilising user preferences to create customised travel experiences along with a recommendation system which changes the itinerary according to the user needs. Our findings suggest that Roamify has the potential to improve and simplify how users across multiple age groups plan their travel experiences.

Paperid: 80, https://arxiv.org/pdf/2504.09737.pdf GitHub

Abstract:
Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.

Paperid: 81, https://arxiv.org/pdf/2504.09689.pdf GitHub

Abstract:
The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: https://github.com/1akaman/EmoAgent

Paperid: 82, https://arxiv.org/pdf/2504.09352.pdf GitHub

Abstract:
Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data-collection to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML's precision on a specific app. We therefore take the perspective that a given user needs confidence, that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML-training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e. interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.

Paperid: 83, https://arxiv.org/pdf/2504.08875.pdf GitHub GitHub

Abstract:
Motivation: The visualization and analysis of high-dimensional data are essential in biomedical research. There is a need for secure, scalable, and reproducible tools to facilitate data exploration and interpretation. Results: We introduce DataMap, a browser-based application for visualization of high-dimensional data using heatmaps, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). DataMap runs in the web browser, ensuring data privacy while eliminating the need for installation or a server. The application has an intuitive user interface for data transformation, annotation, and generation of reproducible R code. Availability and Implementation: Freely available as a GitHub page https://gexijin.github.io/datamap/. The source code can be found at https://github.com/gexijin/datamap, and can also be installed as an R package. Contact: Xijin.Ge@sdstate.ed

Paperid: 84, https://arxiv.org/pdf/2504.07981.pdf GitHub

Abstract:
Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance with 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.

Paperid: 85, https://arxiv.org/pdf/2504.07870.pdf GitHub

Abstract:
In the power and energy industry, multiple entities in grid operational logs are frequently recorded and updated. Thanks to recent advances in IT facilities and smart metering services, a variety of datasets such as system load, generation mix, and grid connection are often publicly available. While these resources are valuable in evaluating the power grid's operational conditions and system resilience, the lack of fine-grained, accurate locational information constrains the usage of current data, which further hinders the development of smart grids and renewables integration. For instance, electricity end users are not aware of nodal generation mix or carbon emissions, while the general public has limited understanding of the effect of demand response or renewables integration if only the whole system's demand and generation are available. In this work, we focus on recovering power grid topology and line flow directions from open public datasets. Taking the Alberta grid as a working example, we start from mapping multi-modal power system datasets to the grid topology integrated with geographical information. By designing a novel optimization-based scheme to recover line flow directions, we are able to analyze and visualize the interactions between generation and demand vectors in an efficient manner. The proposed research is fully open-sourced and highly generalizable, and can help model and visualize grid information, create synthetic datasets, and facilitate analytics and decision-making frameworks for the clean energy transition.

Paperid: 86, https://arxiv.org/pdf/2504.07285.pdf GitHub

Abstract:
Interactive visualization of embedding projections is a useful technique for understanding data and evaluating machine learning models. Labeling data within these visualizations is critical for interpretation, as labels provide an overview of the projection and guide user navigation. However, most methods for producing labels require clustering the points, which can be computationally expensive as the number of points grows. In this paper, we describe an efficient clustering approach using kernel density estimation in the projected 2D space instead of points. This algorithm can produce high-quality cluster regions from a 2D density map in a few hundred milliseconds, orders of magnitude faster than current approaches. We contribute the design of the algorithm, benchmarks, and applications that demonstrate the utility of the algorithm, including labeling and summarization.
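
The general idea of clustering on a density map rather than on individual points can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the grid size, the repeated 3x3 box blur (a crude substitute for a proper kernel), the 10% threshold, and all function names are hypothetical choices made for this example. Two small synthetic point lattices stand in for a projected embedding.

```python
from collections import deque

import numpy as np

def density_map(points, bins=32, blur_passes=2):
    """Bin 2D points in [0, 1]^2 into a grid, then smooth it with a 3x3 box blur."""
    grid, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                bins=bins, range=[[0, 1], [0, 1]])
    for _ in range(blur_passes):
        padded = np.pad(grid, 1)  # zero padding around the border
        grid = sum(padded[i:i + bins, j:j + bins]
                   for i in range(3) for j in range(3)) / 9.0
    return grid

def label_regions(grid, threshold):
    """Label 4-connected regions of cells whose density exceeds the threshold."""
    labels = np.zeros(grid.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(grid > threshold)):
        if labels[start]:
            continue  # already part of an earlier region
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:  # breadth-first flood fill over grid cells, not points
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < grid.shape[0] and 0 <= nc < grid.shape[1]
                        and grid[nr, nc] > threshold and not labels[nr, nc]):
                    labels[nr, nc] = current
                    queue.append((nr, nc))
    return labels, current

def lattice(cx, cy, n=10, step=0.01):
    """Small n-by-n lattice of points placed around (cx, cy)."""
    offsets = (np.arange(n) - n / 2) * step
    xs, ys = np.meshgrid(cx + offsets, cy + offsets)
    return np.column_stack([xs.ravel(), ys.ravel()])

points = np.vstack([lattice(0.25, 0.25), lattice(0.75, 0.75)])
grid = density_map(points)
labels, n_regions = label_regions(grid, threshold=0.1 * grid.max())
print(n_regions)
```

The cost of the labeling step depends on the number of grid cells rather than the number of points, which is the property that makes density-based approaches attractive for large projections.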

Paperid: 87, https://arxiv.org/pdf/2504.06677.pdf GitHub GitHub

Abstract:
Augmented reality (AR) is an effective tool in robotic surgery education as it combines exploratory learning with three-dimensional guidance. However, existing AR systems require expert supervision and do not account for differences in the mentor and mentee robot configurations. To enable novices to train outside the operating room while receiving expert-informed guidance, we present dV-STEAR: an open-source system that plays back task-aligned expert demonstrations without assuming identical setup joint positions between expert and novice. Pose estimation was rigorously quantified, showing a registration error of 3.86 mm (SD=2.01). In a user study (N=24), dV-STEAR significantly improved novice performance on tasks from the Fundamentals of Laparoscopic Surgery. In a single-handed ring-over-wire task, dV-STEAR increased completion speed (p=0.03) and reduced collision time (p=0.01) compared to dry-lab training alone. During a pick-and-place task, it improved success rates (p=0.004). Across both tasks, participants using dV-STEAR exhibited significantly more balanced hand use and reported lower frustration levels. This work presents a novel educational tool implemented on the da Vinci Research Kit, demonstrates its effectiveness in teaching novices, and builds the foundation for further AR integration into robot-assisted surgery.

Paperid: 88, https://arxiv.org/pdf/2504.04872.pdf GitHub

Abstract:
Meat reduction benefits human and planetary health, but social norms keep meat central in shared meals. To date, the development of communication strategies that promote meat reduction while minimizing social costs has required the costly involvement of human participants at each stage of the process. We present work in progress on simulating multi-round dialogues on meat reduction between Generative Agents based on large language models (LLMs). We measure our main outcome using established psychological questionnaires based on the Theory of Planned Behavior and additionally investigate Social Costs. We find evidence that our preliminary simulations produce outcomes that are (i) consistent with theoretical expectations; and (ii) valid when compared to data from previous studies with human participants. Generative agent-based models are a promising tool for identifying novel communication strategies on meat reduction, tailored to highly specific participant groups, to then be tested in subsequent studies with human participants.

Paperid: 89, https://arxiv.org/pdf/2504.03787.pdf GitHub

Abstract:
This paper proposes a novel hypothesis about the foundation of Tenochtitlan by combining digital elevation modeling with historical and symbolic analysis. Using geospatial data from EarthExplorer, we simulate various historical water levels in the Valley of Mexico. The resulting lake configurations reveal possible locations for ancient settlements near now-vanished shorelines, suggesting a dynamic transformation of sacred geography that aligns with key Mexica myths. We identify Santa María Aztahuacan as a strong candidate for the historical Aztlan and propose a reinterpretation of foundational codices in light of geomythical correlations.
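
One simple way to simulate water levels over an elevation model is to threshold the grid at each candidate level and read off the resulting shoreline. The sketch below assumes exactly that and is not the paper's procedure; the 4x4 elevation grid contains invented values (not EarthExplorer data), and `flooded_mask` and `shoreline_cells` are hypothetical helper names.

```python
import numpy as np

# Toy elevation grid in metres. The values are invented for illustration.
dem = np.array([
    [2242, 2240, 2239, 2241],
    [2240, 2236, 2235, 2239],
    [2239, 2235, 2234, 2238],
    [2243, 2240, 2238, 2242],
])

def flooded_mask(dem, water_level):
    """Cells at or below the water level are treated as lake."""
    return dem <= water_level

def shoreline_cells(mask):
    """Dry cells that touch at least one flooded cell (4-neighbourhood)."""
    padded = np.pad(mask, 1, constant_values=False)
    near_water = (padded[:-2, 1:-1] | padded[2:, 1:-1]
                  | padded[1:-1, :-2] | padded[1:-1, 2:])
    return near_water & ~mask

low = flooded_mask(dem, 2235)
high = flooded_mask(dem, 2238)
shore_low = shoreline_cells(low)
print(int(low.sum()), int(high.sum()), int(shore_low.sum()))
```

A plain threshold also floods isolated depressions that are not connected to the lake; a more careful simulation would restrict the mask to cells reachable from a known lake cell.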

Paperid: 90, https://arxiv.org/pdf/2504.03352.pdf GitHub

Abstract:
Stereotypes are known to have very harmful effects, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases, thereby leaving the study of stereotypes in its early stages. Our study revealed that many works have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and anti-stereotype detection is a problem that requires social knowledge; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a five-tuple definition and provide precise terminologies disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. We provide a conceptual framework grounded in social psychology for reliable detection. We identify key shortcomings in existing benchmarks for this task of stereotype and anti-stereotype detection. To address these gaps, we developed StereoDetect, a well-curated, definition-aligned benchmark dataset designed for this task. We show that sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. We demonstrate StereoDetect's effectiveness through multiple qualitative and quantitative comparisons with existing benchmarks and models fine-tuned on them. The dataset and code are available at https://github.com/KaustubhShejole/StereoDetect.

Paperid: 91, https://arxiv.org/pdf/2504.02793.pdf GitHub

Abstract:
Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often "superhuman", performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models' capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users' requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful "vertical systems", we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars). Project webpage: https://gaurav22verma.github.io/vertical-systems-with-large-ai-models/

Paperid: 92, https://arxiv.org/pdf/2504.02123.pdf GitHub

Abstract:
Robot-moderated group discussions have the potential to facilitate engaging and productive interactions among human participants. Previous work on topic management in conversational agents has predominantly focused on human engagement and topic personalization, with the agent having an active role in the discussion. Also, studies have shown the usefulness of including robots in groups, yet further exploration is still needed for robots to learn when to change the topic while facilitating discussions. Accordingly, our work investigates the suitability of machine-learning models and audiovisual non-verbal features in predicting appropriate topic changes. We utilized interactions between a robot moderator and human participants, which we annotated and used for extracting acoustic and body language-related features. We provide a detailed analysis of the performance of machine learning approaches using sequential and non-sequential data with different sets of features. The results indicate promising performance in classifying inappropriate topic changes, outperforming rule-based approaches. Additionally, acoustic features exhibited comparable performance and robustness compared to the complete set of multimodal features. Our annotated data is publicly available at https://github.com/ghadj/topic-change-robot-discussions-data-2024.

Paperid: 93, https://arxiv.org/pdf/2504.01038.pdf GitHub

Abstract:
Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed and accuracy. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at https://github.com/liu37972/Multirate-Location-on-OCT-X-Learning.git.

Paperid: 94, https://arxiv.org/pdf/2504.00274.pdf GitHub

Abstract:
Urban systems are managed using complex textual documentation that needs coding and analysis to set requirements and evaluate built environment performance. This paper contributes to the study of applying large language models (LLMs) to qualitative coding activities to reduce resource requirements while maintaining comparable reliability to humans. Qualitative coding and assessment face challenges such as resource limitations, bias, accuracy, and consistency between human evaluators. Here we report the application of LLMs to deductively code 10 case documents on the presence of 17 digital twin characteristics for the management of urban systems. We utilize two prompting methods to compare the semantic processing of LLMs with human coding efforts: whole text analysis and text chunk analysis using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. We found similar trends of internal variability between methods, and the results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts. GPT-4o, o1-mini and GPT-4o-mini showed significant agreement with human raters when employed using a chunking method. The application of both GPT-4o and GPT-4o-mini as an additional rater with three manual raters showed statistically significant agreement across all raters, indicating that the analysis of textual documents benefits from LLMs. Our findings reveal nuanced sub-themes of LLM application, suggesting LLMs follow human memory coding processes where whole-text analysis may introduce multiple meanings. The novel contributions of this paper lie in assessing the performance of OpenAI GPT models and introducing the chunk-based prompting approach, which addresses context aggregation biases by preserving localized context.
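The chunk-based coding idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the fixed word-count chunking, the rule that a characteristic counts as present if any chunk is flagged, and the keyword-based stand-in rater are all assumptions made for illustration. In practice the `rater` callable would wrap a call to an LLM.

```python
from typing import Callable, Dict, List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def code_document(
    text: str,
    characteristics: List[str],
    rater: Callable[[str, str], bool],
    max_words: int = 200,
) -> Dict[str, bool]:
    """Deductively code a document chunk by chunk.

    Assumed aggregation rule: a characteristic is marked present
    if the rater flags it in at least one chunk.
    """
    chunks = chunk_text(text, max_words)
    return {c: any(rater(chunk, c) for chunk in chunks)
            for c in characteristics}


def keyword_rater(chunk: str, characteristic: str) -> bool:
    """Hypothetical stand-in for an LLM rater: plain substring match."""
    return characteristic.lower() in chunk.lower()


if __name__ == "__main__":
    doc = ("The platform streams sensor data. "
           "Operators view a 3D model of the district.")
    print(code_document(doc, ["sensor data", "simulation"],
                        keyword_rater, max_words=6))
```

Fixed-size chunks can split a phrase across a boundary, so a real setup would more likely chunk on paragraph or section breaks.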

Paperid: 95, https://arxiv.org/pdf/2503.22769.pdf GitHub

Abstract:
Artificial Intelligence (AI) has been advancing rapidly and with the advent of large language models (LLMs) in late 2022, numerous opportunities have emerged for adopting this technology across various domains, including medicine. These innovations hold immense potential to revolutionize and modernize medical education. Our research project leverages large language models to enhance medical education and address workflow challenges through the development of MediTools - AI Medical Education. This prototype application focuses on developing interactive tools that simulate real-life clinical scenarios, provide access to medical literature, and keep users updated with the latest medical news. Our first tool is a dermatology case simulation tool that uses real patient images depicting various dermatological conditions and enables interaction with LLMs acting as virtual patients. This platform allows users to practice their diagnostic skills and enhance their clinical decision-making abilities. The application also features two additional tools: an AI-enhanced PubMed tool for engaging with LLMs to gain deeper insights into research papers, and a Google News tool that offers LLM generated summaries of articles for various medical specialties. A comprehensive survey has been conducted among medical professionals and students to gather initial feedback on the effectiveness and user satisfaction of MediTools, providing insights for further development and refinement of the application. This research demonstrates the potential of AI-driven tools in transforming and revolutionizing medical education, offering a scalable and interactive platform for continuous learning and skill development.

Paperid: 96, https://arxiv.org/pdf/2503.19225.pdf GitHub

Abstract:
We introduce CoinFT, a capacitive 6-axis force/torque (F/T) sensor that is compact, light, low-cost, and robust, with an average mean-squared error of 0.11 N for force and 0.84 mNm for moment when the input ranges from 0 to 10 N and 0 to 4 N in the normal and shear directions, respectively. CoinFT is a stack of two rigid PCBs with comb-shaped electrodes connected by an array of silicone rubber pillars. The microcontroller interrogates the electrodes in different subsets in order to enhance sensitivity for measuring 6-axis F/T. The combination of desirable features of CoinFT enables various contact-rich robot interactions at scale, across different embodiment domains including drones, robot end-effectors, and wearable haptic devices. We demonstrate the utility of CoinFT on drones by performing attitude-based force control for tasks that require careful contact force modulation. The design, fabrication, and firmware of CoinFT are open-sourced at https://hojung-choi.github.io/coinft.github.io/.

Paperid: 97, https://arxiv.org/pdf/2503.18313.pdf GitHub

Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, but their effectiveness in financial decision-making remains inadequately evaluated. Current benchmarks primarily assess LLMs' understanding of financial documents rather than their ability to manage assets or uncover trading opportunities in dynamic market conditions. Despite the release of new benchmarks for evaluating diversified tasks in the financial domain, we identified four major problems in these benchmarks: data leakage, navel-gazing, over-intervention, and difficulty of maintenance. To bridge this research gap, we introduce DeepFund, a comprehensive arena platform for evaluating LLM-based trading strategies in a live environment. Our approach implements a multi-agent framework in which LLMs serve in multiple key roles that realize real-world investment decision processes. Moreover, we provide a web interface that visualizes LLMs' performance with fund investment metrics across different market conditions, enabling detailed comparative analysis. Through DeepFund, we aim to provide a more realistic and fair assessment of LLMs' capabilities in fund investment, offering diversified insights and revealing their potential applications in real-world financial markets. Our code is publicly available at https://github.com/HKUSTDial/DeepFund.

Paperid: 98, https://arxiv.org/pdf/2503.17511.pdf GitHub

Abstract:
Ureteroscopy is the standard of care for diagnosing and treating kidney stones and tumors. However, current ureteroscopes have a limited field of view, requiring significant experience to adequately navigate the renal collecting system. This is evidenced by the fact that inexperienced surgeons have higher rates of missed stones. One-third of patients with residual stones require re-operation within 20 months. To help surgeons fully explore the kidney, this study presents the Navigated Augmented Reality Visualization for Ureteroscopic Surgery (NAVIUS) system. NAVIUS assists surgeons by providing 3D maps of the target anatomy, real-time scope positions, and preoperative imaging overlays. To enable real-time navigation and visualization, we integrate an electromagnetic tracker-based navigation pipeline with augmented reality visualizations. NAVIUS connects to 3D Slicer and Unity with OpenIGTLink, and uses HoloLens 2 as a holographic interface. We evaluate NAVIUS through a user study where surgeons conducted ureteroscopy on kidney phantoms with and without visual guidance. With NAVIUS, surgeons explored more areas within the collecting system (average 23.73% increase), and NASA-TLX metrics improved (up to 27.27%). NAVIUS is a step towards better surgical outcomes and surgeons' experience. The codebase for the system will be available at: https://github.com/vu-maple-lab/NAVIUS.

Paperid: 99, https://arxiv.org/pdf/2503.16478.pdf GitHub GitHub

Abstract:
Effective visualisation of multidimensional data is crucial for generating insights. Glyph-based visualisations, which encode data dimensions onto multiple visual channels such as colour, shape, and size, provide an effective means of representing complex datasets. Pie-chart glyphs (pie-glyphs) are one such approach, where multiple data attributes are mapped to slices within a pie chart. This paper introduces the PieGlyph R package, which enables users to overlay any 2D plot with axis-invariant pie-glyphs, offering a compact and intuitive representation of multidimensional data. Unlike existing R packages such as scatterpie or ggforce, PieGlyph generates pie-glyphs independently of the plot axes by employing a nested coordinate system, ensuring they remain circular regardless of changes to the underlying coordinate system. This enhances interpretability, particularly when visualising spatial data, as users can select the most appropriate map projection without distorting the glyphs' shape. Pie-glyphs are also particularly well-suited for visualising compositional data, where there is a natural sum-to-one constraint on the data attributes. PieGlyph is developed under the Grammar of Graphics paradigm using the ggplot2 framework and supports the generation of interactive pie-glyphs through the ggiraph package. Designed to integrate seamlessly with all features and extensions offered by ggplot2 and ggiraph, PieGlyph provides users with full flexibility in customising every aspect of the visualisation. This paper outlines the conceptual framework of PieGlyph, compares it with existing alternatives, and demonstrates its applications through example visualisations.

Paperid: 100, https://arxiv.org/pdf/2503.16477.pdf GitHub

Abstract:
In aviation emergencies, high-stakes decisions must be made in an instant. Pilots rely on quick access to precise, context-specific information -- an area where emerging tools like large language models (LLMs) show promise in providing critical support. This paper introduces LeRAAT, a framework that integrates LLMs with the X-Plane flight simulator to deliver real-time, context-aware pilot assistance. The system uses live flight data, weather conditions, and aircraft documentation to generate recommendations aligned with aviation best practices and tailored to the particular situation. It employs a Retrieval-Augmented Generation (RAG) pipeline that extracts and synthesizes information from aircraft type-specific manuals, including performance specifications and emergency procedures, as well as aviation regulatory materials, such as FAA directives and standard operating procedures. We showcase the framework in both a virtual reality and traditional on-screen simulation, supporting a wide range of research applications such as pilot training, human factors research, and operational decision support.

Paperid: 101, https://arxiv.org/pdf/2503.16465.pdf GitHub

Abstract:
Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent executes tasks in a fully autonomous way without adequately assessing its action confidence, which compromises adaptive human-agent collaboration. This poses substantial risks in complex scenarios, such as those involving ambiguous user instructions, unexpected interruptions, and environmental hijacks. To address the issue, we introduce OS-Kairos, an adaptive GUI agent capable of predicting confidence levels at each interaction step and efficiently deciding whether to act autonomously or seek human intervention. OS-Kairos is developed through two key mechanisms: (i) collaborative probing that annotates confidence scores at each interaction step; (ii) confidence-driven interaction that leverages these confidence scores to elicit the ability of adaptive interaction. Experimental results show that OS-Kairos substantially outperforms existing models on our curated dataset featuring complex scenarios, as well as on established benchmarks such as AITZ and Meta-GUI, with improvements of 24.59% to 87.29% in task success rate. OS-Kairos facilitates adaptive human-agent collaboration, prioritizing effectiveness, generality, scalability, and efficiency for real-world GUI interaction. The dataset and code are available at https://github.com/Wuzheng02/OS-Kairos.
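The decision described above, acting autonomously or seeking human intervention at each step based on a predicted confidence, can be sketched as a simple gate. This is an illustrative assumption rather than the OS-Kairos implementation: the `Step` type, the confidence scale of 0 to 1, the threshold value, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str        # proposed GUI action, e.g. "tap('Send')"
    confidence: float  # assumed to be a score in [0, 1]


def decide(step: Step, threshold: float = 0.8) -> str:
    """Return "act" when confidence meets the threshold, else "ask_human"."""
    if not 0.0 <= step.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return "act" if step.confidence >= threshold else "ask_human"


def run_episode(steps: List[Step], threshold: float = 0.8) -> List[str]:
    """Apply the confidence gate to every step of an episode."""
    return [decide(s, threshold) for s in steps]


if __name__ == "__main__":
    episode = [
        Step("open('Messages')", 0.95),
        Step("tap('Send')", 0.40),  # ambiguous instruction: defer to the user
    ]
    print(run_episode(episode))
```

A higher threshold defers to the user more often and reduces over-execution at the cost of more interruptions.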

Paperid: 102, https://arxiv.org/pdf/2503.16454.pdf GitHub

Abstract:
In the field of affective computing, traditional methods for generating emotions predominantly rely on deep learning techniques and large-scale emotion datasets. However, deep learning techniques are often complex and difficult to interpret, and standardized large-scale emotion datasets are difficult and costly to establish. To tackle these challenges, we introduce a novel framework named Audio-Visual Fusion for Brain-like Emotion Learning (AVF-BEL). In contrast to conventional brain-inspired emotion learning methods, this approach improves the audio-visual emotion fusion and generation model through the integration of modular components, thereby enabling more lightweight and interpretable emotion learning and generation processes. The framework simulates the integration of the visual, auditory, and emotional pathways of the brain, optimizes the fusion of emotional features across visual and auditory modalities, and improves upon the traditional Brain Emotional Learning (BEL) model. The experimental results indicate a significant improvement in the similarity of the audio-visual fusion emotion learning and generation model compared to single-modality visual and auditory emotion learning and generation models. Ultimately, this aligns with the fundamental phenomenon of heightened emotion generation facilitated by the integrated impact of visual and auditory stimuli. This contribution not only enhances the interpretability and efficiency of affective intelligence but also provides new insights and pathways for advancing affective computing technology. Our source code can be accessed here: https://github.com/OpenHUTB/emotion.

Paperid: 103, https://arxiv.org/pdf/2503.16434.pdf GitHub

Abstract:
Humans have long relied on visual aids like sketches and diagrams to support reasoning and problem-solving. Visual tools, like auxiliary lines in geometry or graphs in calculus, are essential for understanding complex ideas. However, many tutoring systems remain text-based, providing feedback only through natural language. Leveraging recent advances in Large Multimodal Models (LMMs), this paper introduces Interactive Sketchpad, a tutoring system that combines language-based explanations with interactive visualizations to enhance learning. Built on a pre-trained LMM, Interactive Sketchpad is fine-tuned to provide step-by-step guidance in both text and visuals, enabling natural multimodal interaction with the student. Accurate and robust diagrams are generated by incorporating code execution into the reasoning process. User studies conducted on math problems such as geometry, calculus, and trigonometry demonstrate that Interactive Sketchpad leads to improved task comprehension, problem-solving accuracy, and engagement levels, highlighting its potential for transforming educational technologies. All code is available at: https://stevenshinechen.github.io/interactivesketchpad/.

Paperid: 104, https://arxiv.org/pdf/2503.15500.pdf GitHub

Abstract:
Foundation models are rapidly improving the capability of robots to perform everyday tasks autonomously, such as meal preparation, yet robots will still need to be instructed by humans due to model performance, the difficulty of capturing user preferences, and the need for user agency. Robots can be instructed using various methods: natural language conveys immediate instructions but can be abstract or ambiguous, whereas end-user programming supports longer-horizon tasks but its interfaces face difficulties in capturing user intent. In this work, we propose using direct manipulation of images as an alternative paradigm to instruct robots, and introduce a specific instantiation called ImageInThat, which allows users to perform direct manipulation on images in a timeline-style interface to generate robot instructions. Through a user study, we demonstrate the efficacy of ImageInThat for instructing robots in kitchen manipulation tasks, comparing it to a text-based natural language instruction method. The results show that participants were faster with ImageInThat and preferred to use it over the text-based method. Supplementary material including code can be found at: https://image-in-that.github.io/.

Paperid: 105, https://arxiv.org/pdf/2503.14724.pdf GitHub GitHub

Abstract:
While developers increasingly adopt tools powered by large language models (LLMs) in day-to-day workflows, these tools still require explicit user invocation. To seamlessly integrate LLM capabilities into a developer's workflow, we introduce CodingGenie, a proactive assistant integrated into the code editor. CodingGenie autonomously provides suggestions, ranging from bug fixing to unit testing, based on the current code context and allows users to customize suggestions by providing a task description and selecting which suggestions are shown. We demonstrate multiple use cases to show how proactive suggestions from CodingGenie can improve developer experience, and also analyze the cost of adding proactivity. We believe this open-source tool will enable further research into proactive assistants. CodingGenie is open-sourced at https://github.com/sebzhao/CodingGenie/ and video demos are available at https://sebzhao.github.io/CodingGenie/.

Paperid: 106, https://arxiv.org/pdf/2503.13509.pdf GitHub

Abstract:
We introduce MentalChat16K, an English benchmark dataset combining a synthetic mental health counseling dataset and a dataset of anonymized transcripts from interventions between Behavioral Health Coaches and Caregivers of patients in palliative or hospice care. Covering a diverse range of conditions like depression, anxiety, and grief, this curated dataset is designed to facilitate the development and evaluation of large language models for conversational mental health assistance. By providing a high-quality resource tailored to this critical domain, MentalChat16K aims to advance research on empathetic, personalized AI solutions to improve access to mental health support services. The dataset prioritizes patient privacy, ethical considerations, and responsible data usage. MentalChat16K presents a valuable opportunity for the research community to innovate AI technologies that can positively impact mental well-being. The dataset is available at https://huggingface.co/datasets/ShenLab/MentalChat16K and the code and documentation are hosted on GitHub at https://github.com/ChiaPatricia/MentalChat16K.

Paperid: 107, https://arxiv.org/pdf/2503.09885.pdf GitHub

Abstract:
Analyzing CT scans, MRIs and X-rays is pivotal in diagnosing and treating diseases. However, detecting and identifying abnormalities from such medical images is a time-intensive process that requires expert analysis and is prone to interobserver variability. To mitigate such issues, machine learning-based models have been introduced to automate and significantly reduce the cost of image segmentation. Despite significant advances in medical image analysis in recent years, many of the latest models are never applied in clinical settings because state-of-the-art models do not easily interface with existing medical image viewers. To address these limitations, we propose QuickDraw, an open-source framework for medical image visualization and analysis that allows users to upload DICOM images and run off-the-shelf models to generate 3D segmentation masks. In addition, our tool allows users to edit, export, and evaluate segmentation masks to iteratively improve state-of-the-art models through active learning. In this paper, we detail the design of our tool and present survey results that highlight the usability of our software. Notably, we find that QuickDraw reduces the time to manually segment a CT scan from four hours to six minutes and reduces machine learning-assisted segmentation time by 10% compared to prior work. Our code and documentation are available at https://github.com/qd-seg/quickdraw

Paperid: 108, https://arxiv.org/pdf/2503.09436.pdf GitHub

Abstract:
Recent technological advances have popularized the use of image generation among the general public. Crafting effective prompts can, however, be difficult for novice users. To tackle this challenge, we developed PromptMap, a new interaction style for text-to-image AI that allows users to freely explore a vast collection of synthetic prompts through a map-like view with semantic zoom. PromptMap groups images visually by their semantic similarity, allowing users to discover relevant examples. We evaluated PromptMap in a between-subject online study (n=60) and a qualitative within-subject study (n=12). We found that PromptMap supported users in crafting prompts by providing them with examples. We also demonstrated the feasibility of using LLMs to create vast example collections. Our work contributes a new interaction style that supports users unfamiliar with prompting in achieving a satisfactory image output.

Paperid: 109, https://arxiv.org/pdf/2503.08102.pdf GitHub

Abstract:
Human interaction with the external world fundamentally involves the exchange of personal memory, whether with other individuals, websites, applications, or, in the future, AI agents. A significant portion of this interaction is redundant, requiring users to repeatedly provide the same information across different contexts. Existing solutions, such as browser-stored credentials, autofill mechanisms, and unified authentication systems, have aimed to mitigate this redundancy by serving as intermediaries that store and retrieve commonly used user data. The advent of large language models (LLMs) presents an opportunity to redefine memory management through an AI-native paradigm: SECOND ME. SECOND ME acts as an intelligent, persistent memory offload system that retains, organizes, and dynamically utilizes user-specific knowledge. By serving as an intermediary in user interactions, it can autonomously generate context-aware responses, prefill required information, and facilitate seamless communication with external systems, significantly reducing cognitive load and interaction friction. Unlike traditional memory storage solutions, SECOND ME extends beyond static data retention by leveraging LLM-based memory parameterization. This enables structured organization, contextual reasoning, and adaptive knowledge retrieval, facilitating a more systematic and intelligent approach to memory management. As AI-driven personal agents like SECOND ME become increasingly integrated into digital ecosystems, SECOND ME further represents a critical step toward augmenting human-world interaction with persistent, contextually aware, and self-optimizing memory systems. We have open-sourced the fully localizable deployment system at GitHub: https://github.com/Mindverse/Second-Me.

Paperid: 110, https://arxiv.org/pdf/2503.08061.pdf GitHub GitHub

Abstract:
Realistic hand manipulation is a key component of immersive virtual reality (VR), yet existing methods often rely on kinematic approaches or motion-capture datasets that omit crucial physical attributes such as contact forces and finger torques. Consequently, these approaches prioritize tight, one-size-fits-all grips rather than reflecting users' intended force levels. We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions, faithfully reflecting the user's grip force intention. Instead of mimicking predefined motion datasets, ForceGrip uses generated training scenarios (randomizing object shapes, wrist movements, and trigger input flows) to challenge the agent with a broad spectrum of physical interactions. To effectively learn from these complex tasks, we employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. This progressive strategy ensures stable hand-object contact, adaptive force control based on user inputs, and robust handling under dynamic conditions. Additionally, a proximity reward function enhances natural finger motions and accelerates training convergence. Quantitative and qualitative evaluations reveal ForceGrip's superior force controllability and plausibility compared to state-of-the-art methods. Demo videos are available as supplementary material and the code is provided at https://han-dongheun.github.io/ForceGrip.

Paperid: 111, https://arxiv.org/pdf/2503.07541.pdf GitHub

Abstract:
We introduce Geometric Retargeting (GeoRT), an ultrafast and principled neural hand retargeting algorithm for teleoperation, developed as part of our recent Dexterity Gen (DexGen) system. GeoRT converts human finger keypoints to robot hand keypoints at 1 kHz, achieving state-of-the-art speed and accuracy with significantly fewer hyperparameters. This high-speed capability enables flexible postprocessing, such as leveraging a foundational controller like DexGen for action correction. GeoRT is trained in an unsupervised manner, eliminating the need for manual annotation of hand pairs. The core of GeoRT lies in novel geometric objective functions that capture the essence of retargeting: preserving motion fidelity, ensuring configuration space (C-space) coverage, maintaining uniform response through high flatness, preserving pinch correspondence, and preventing self-collisions. This approach is free from intensive test-time optimization, offering a more scalable and practical solution for real-time hand retargeting.

Paperid: 112, https://arxiv.org/pdf/2503.06791.pdf GitHub

Abstract:
The social robot's open API allows users to customize open-domain interactions. However, it remains inaccessible to those without programming experience. In this work, we introduce AutoMisty, the first multi-agent collaboration framework powered by large language models (LLMs), to enable the seamless generation of executable Misty robot code from natural language instructions. AutoMisty incorporates four specialized agent modules to manage task decomposition, assignment, problem-solving, and result synthesis. Each agent incorporates a two-layer optimization mechanism, with self-reflection for iterative refinement and human-in-the-loop for better alignment with user preferences. AutoMisty ensures a transparent reasoning process, allowing users to iteratively refine tasks through natural language feedback for precise execution. To evaluate AutoMisty's effectiveness, we designed a benchmark task set spanning four levels of complexity and conducted experiments in a real Misty robot environment. Extensive evaluations demonstrate that AutoMisty not only consistently generates high-quality code but also enables precise code control, significantly outperforming direct reasoning with ChatGPT-4o and ChatGPT-o1. All code, optimized APIs, and experimental videos will be publicly released through the webpage: https://wangxiaoshawn.github.io/AutoMisty.html

Paperid: 113, https://arxiv.org/pdf/2503.04730.pdf GitHub

Abstract:
Graphical User Interface (GUI) tasks are vital for automating workflows such as software testing and user interface navigation. For users, the GUI is the most intuitive platform for interacting with a computer. Previous work identified a key challenge in developing visual GUI agents: GUI grounding, the ability to accurately locate screen elements based on instructions. However, most existing GUI agents rely on structured data formats like DOM or HTML files in training or inference, which are not available across all applications, particularly in general desktop environments such as Windows OS. To address this, we introduce WinClick, a novel visual GUI agent developed for the Windows platform. WinClick leverages screenshots to detect actionable regions. To overcome the challenge of GUI grounding, we enhance WinClick with GUI grounding pre-training and propose an LLM-based method for aligning GUI grounding data. Additionally, we introduce WinSpot, the first comprehensive benchmark for GUI grounding on Windows. Our experiments demonstrate that WinClick, combined with GUI grounding pre-training, significantly outperforms existing baselines, offering a scalable solution for GUI automation in desktop environments. WinSpot is publicly available at https://github.com/zackhuiiiii/WinSpot.

Paperid: 114, https://arxiv.org/pdf/2503.04250.pdf GitHub

Abstract:
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All code for Vinci, including the frontend, backend, and models, is available at https://github.com/OpenGVLab/vinci.

Paperid: 115, https://arxiv.org/pdf/2503.03196.pdf GitHub

Abstract:
Graphical User Interface (GUI) agents show amazing abilities in assisting human-computer interaction, automating users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility with different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called GUI-Lasagne using scalable methods, empowering SpiritSight with robust GUI understanding and grounding capabilities. Second, we introduce the Universal Block Parsing (UBP) method to resolve the ambiguity problem in dynamic high-resolution visual inputs, further enhancing SpiritSight's ability to ground GUI objects. Through these efforts, the SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. Models and datasets are available at https://hzhiyuan.github.io/SpiritSight-Agent.

Paperid: 116, https://arxiv.org/pdf/2503.03094.pdf GitHub

Abstract:
Image labeling is an important task for training computer vision models. In specialized domains, such as healthcare, it is expensive and challenging to recruit specialists for image labeling. We propose HEPHA, a mixed-initiative image labeling tool that elicits human expertise via inductive logic learning to infer and refine labeling rules. Each rule comprises visual predicates that describe the image. HEPHA enables users to iteratively refine the rules by either direct manipulation through a visual programming interface or by labeling more images. To facilitate rule refinement, HEPHA recommends which rule to edit and which predicate to update. For users unfamiliar with visual programming, HEPHA suggests diverse and informative images to users for further labeling. We conducted a within-subjects user study with 16 participants and compared HEPHA with a variant of HEPHA and a deep learning-based approach. We found that HEPHA outperforms the two baselines in both specialized-domain and general-domain image labeling tasks. Our code is available at https://github.com/Neural-Symbolic-Image-Labeling/NSILWeb.

Paperid: 117, https://arxiv.org/pdf/2503.03044.pdf GitHub

Abstract:
Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

Paperid: 118, https://arxiv.org/pdf/2503.01387.pdf GitHub

Abstract:
Real camera footage is subject to noise, motion blur (MB) and depth of field (DoF). In some applications these might be considered distortions to be removed, but in others it is important to model them because it would be ineffective, or interfere with an aesthetic choice, to simply remove them. In augmented reality applications where virtual content is composited into a live video feed, we can model noise, MB and DoF to make the virtual content visually consistent with the video. Existing methods for this typically suffer from two main limitations. First, they require a camera calibration step to relate a known calibration target to the specific camera's response. Second, existing work requires methods that can be (differentiably) tuned to the calibration, such as slow and specialized neural networks. We propose a method that estimates parameters for noise, MB and DoF instantly, which allows off-the-shelf real-time simulation methods from, e.g., a game engine to be used in compositing augmented content. Our main idea is to unlock both features by showing how to use modern computer vision methods that can remove noise, MB and DoF from the video stream, essentially providing self-calibration. This allows auto-tuning of any black-box real-time noise+MB+DoF method to deliver fast and high-fidelity augmentation consistency.

Paperid: 119, https://arxiv.org/pdf/2503.01163.pdf GitHub

Abstract:
Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, these prompts often differ from the sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM must implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS.
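
As a rough illustration of what a Thompson sampling-based selection mechanism can look like, the following minimal Python sketch treats each prompt design strategy as an arm of a Beta-Bernoulli bandit. This is not the OPTS implementation: the class name, the strategy names, and the binary reward (whether applying the strategy improved the prompt's score) are all assumptions made for the example.

```python
import random


class ThompsonStrategySelector:
    """Beta-Bernoulli Thompson sampling over a set of strategies (hypothetical helper)."""

    def __init__(self, strategies, seed=None):
        self.strategies = list(strategies)
        # Beta(1, 1) prior for each strategy: no initial preference.
        self.alpha = {s: 1.0 for s in self.strategies}
        self.beta = {s: 1.0 for s in self.strategies}
        self.rng = random.Random(seed)

    def select(self):
        # Draw one plausible success rate per strategy and pick the largest draw.
        draws = {
            s: self.rng.betavariate(self.alpha[s], self.beta[s])
            for s in self.strategies
        }
        return max(draws, key=draws.get)

    def update(self, strategy, improved):
        # Assumed reward: True if the strategy improved the prompt's score.
        if improved:
            self.alpha[strategy] += 1.0
        else:
            self.beta[strategy] += 1.0


# Usage: the strategy names below are placeholders, not taken from the paper.
selector = ThompsonStrategySelector(["chain_of_thought", "role_play", "no_strategy"], seed=0)
chosen = selector.select()
selector.update(chosen, improved=True)
```

In an evolutionary optimizer, `select` would be called before each prompt mutation and `update` after the mutated prompt has been scored, so strategies that tend to help are sampled more often while the others are still occasionally explored.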

Paperid: 120, https://arxiv.org/pdf/2503.00401.pdf GitHub

Abstract:
Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves performance comparable to or even better than that of the large-scale grounding-enhanced OS-Atlas with less than 0.1% of the training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.

Paperid: 121, https://arxiv.org/pdf/2503.00086.pdf GitHub

Abstract:
This paper presents a systematic study of the generalization of convolutional neural networks (CNNs) and humans on relational reasoning tasks with bar charts. We first revisit previous experiments on graphical perception and update the benchmark performance of CNNs. We then test the generalization performance of CNNs on a classic relational reasoning task: estimating bar length ratios in a bar chart, by progressively perturbing the standard visualizations. We further conduct a user study to compare the performance of CNNs and humans. Our results show that CNNs outperform humans only when the training and test data have the same visual encodings. Otherwise, they may perform worse. We also find that CNNs are sensitive to perturbations in various visual encodings, regardless of their relevance to the target bars. Yet, humans are mainly influenced by bar lengths. Our study suggests that robust relational reasoning with visualizations is challenging for CNNs. Improving CNNs' generalization performance may require training them to better recognize task-related visual properties.

Paperid: 122, https://arxiv.org/pdf/2502.21154.pdf GitHub

Abstract:
Emotional Recognition in Conversation (ERC) is valuable for diagnosing health conditions such as autism and depression, and for understanding the emotions of individuals who struggle to express their feelings. Current ERC methods primarily rely on semantic, audio and visual data but face significant challenges in integrating physiological signals such as Electroencephalography (EEG). This research proposes Hypergraph Multi-Modal Learning (Hyper-MML), a novel framework for identifying emotions in conversation. Hyper-MML effectively integrates EEG with audio and video information to capture complex emotional dynamics. Firstly, we introduce an Adaptive Brain Encoder with Mutual-cross Attention (ABEMA) module for processing EEG signals. This module captures emotion-relevant features across different frequency bands and adapts to subject-specific variations through hierarchical mutual-cross attention mechanisms. Secondly, we propose an Adaptive Hypergraph Fusion Module (AHFM) to actively model the higher-order relationships among multi-modal signals in ERC. Experimental results on the EAV and AFFEC datasets demonstrate that our Hyper-MML model significantly outperforms current state-of-the-art methods. The proposed Hyper-MML can serve as an effective communication tool for healthcare professionals, enabling better engagement with patients who have difficulty expressing their emotions. The official implementation codes are available at https://github.com/NZWANG/Hyper-MML.

Paperid: 123, https://arxiv.org/pdf/2502.20990.pdf GitHub

Abstract:
As social VR grows in popularity, understanding how to optimise interactions becomes increasingly important. Interpersonal distance (the physical space people maintain between each other) is a key aspect of user experience. Previous work in psychology has shown that breaches of personal space cause stress and discomfort. Thus, effectively managing this distance is crucial in social VR, where social interactions are frequent. Teleportation, a commonly used locomotion method in these environments, involves distinct cognitive processes and requires users to rely on their ability to estimate distance. Despite its widespread use, the effect of teleportation on proximity remains unexplored. To investigate this, we measured the interpersonal distance of 70 participants during interactions with embodied conversational agents, comparing teleportation to natural walking. Our findings revealed that participants maintained closer proximity to the agents during teleportation. Female participants kept greater distances from the agents than male participants, and natural walking was associated with higher agency and body ownership, though co-presence remained unchanged. We propose that differences in spatial perception and spatial cognitive load contribute to reduced interpersonal distance with teleportation. These findings emphasise that proximity should be a key consideration when selecting locomotion methods in social VR, highlighting the need for further research on how locomotion impacts spatial perception and social dynamics in virtual environments.

Paperid: 124, https://arxiv.org/pdf/2502.20480.pdf GitHub

Abstract:
Video descriptions are crucial for blind and low vision (BLV) users to access visual content. However, current artificial intelligence models for generating descriptions often fall short due to limitations in the quality of human annotations within training datasets, resulting in descriptions that do not fully meet BLV users' needs. To address this gap, we introduce VideoA11y, an approach that leverages multimodal large language models (MLLMs) and video accessibility guidelines to generate descriptions tailored for BLV individuals. Using this method, we have curated VideoA11y-40K, the largest and most comprehensive dataset of 40,000 videos described for BLV users. Rigorous experiments across 15 video categories, involving 347 sighted participants, 40 BLV participants, and seven professional describers, showed that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction. We evaluated models on VideoA11y-40K using both standard and custom metrics, demonstrating that MLLMs fine-tuned on this dataset produce high-quality accessible descriptions. Code and dataset are available at https://people-robots.github.io/VideoA11y.

Paperid: 125, https://arxiv.org/pdf/2502.18889.pdf GitHub

Abstract:
Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Audio samples are available at: https://ltydd1314.github.io/.

Paperid: 126, https://arxiv.org/pdf/2502.18736.pdf GitHub

Abstract:
Chat-based prompts respond with verbose linear-sequential texts, making it difficult to explore and refine ambiguous intents, back up and reinterpret, or shift directions in creative AI-assisted design work. AI-Instruments instead embody "prompts" as interface objects via three key principles: (1) Reification of user-intent as reusable direct-manipulation instruments; (2) Reflection of multiple interpretations of ambiguous user-intents (Reflection-in-intent) as well as the range of AI-model responses (Reflection-in-response) to inform design "moves" towards a desired result; and (3) Grounding to instantiate an instrument from an example, result, or extrapolation directly from another instrument. Further, AI-Instruments leverage LLMs to suggest, vary, and refine new instruments, enabling a system that goes beyond hard-coded functionality by generating its own instrumental controls from content. We demonstrate four technology probes, applied to image generation, and qualitative insights from twelve participants, showing how AI-Instruments address challenges of intent formulation, steering via direct manipulation, and non-linear iterative workflows to reflect and resolve ambiguous intents.

Paperid: 127, https://arxiv.org/pdf/2502.16485.pdf GitHub

Abstract:
In this paper, we focus on the challenge of individual variability in affective brain-computer interfaces (aBCI), which employ electroencephalogram (EEG) signals to monitor and recognize human emotional states, thereby facilitating the advancement of emotion-aware technologies. The variability in EEG data across individuals poses a significant barrier to the development of effective and widely applicable aBCI models. To tackle this issue, we propose a novel transfer learning framework called Semi-supervised Domain Adaptation with Dynamic Distribution Alignment (SDA-DDA). This approach aligns the marginal and conditional probability distributions of the source and target domains using maximum mean discrepancy (MMD) and conditional maximum mean discrepancy (CMMD). We introduce a dynamic distribution alignment mechanism to adjust differences throughout training and enhance adaptation. Additionally, a pseudo-label confidence filtering module is integrated into the semi-supervised process to refine pseudo-label generation and improve the estimation of conditional distributions. Extensive experiments on EEG benchmark databases (SEED, SEED-IV and DEAP) validate the robustness and effectiveness of SDA-DDA. The results demonstrate its superiority over existing methods in emotion recognition across various scenarios, including cross-subject and cross-session conditions. This advancement enhances the generalization and accuracy of emotion recognition, potentially fostering the development of personalized aBCI applications. The source code is accessible at https://github.com/XuanSuTrum/SDA-DDA.
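
For readers unfamiliar with maximum mean discrepancy, the sketch below computes the standard biased empirical estimate of squared MMD between two sets of feature vectors. It is only an illustration under stated assumptions, not the SDA-DDA code: the Gaussian (RBF) kernel, the `gamma` bandwidth, and the function names are choices made for this example. A conditional variant could apply the same estimate separately to the samples of each class.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2); kernel choice is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Usage with toy 2-D "features": identical sets give 0, shifted sets give a positive value.
source_features = [[0.0, 0.0], [1.0, 0.0]]
shifted_features = [[5.0, 5.0], [6.0, 5.0]]
same = mmd_squared(source_features, source_features)
different = mmd_squared(source_features, shifted_features)
```

Used as a training loss, a term like this is minimized so that the feature distributions of the source and target subjects move closer together.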

Paperid: 128, https://arxiv.org/pdf/2502.15980.pdf GitHub

Abstract:
Text-to-SQL models, which parse natural language (NL) questions to executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to effectively evaluate model performance in new domains. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/adobe/nl_sql_analyzer.

Paperid: 129, https://arxiv.org/pdf/2502.15226.pdf GitHub

Abstract:
Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interact with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, e.g., the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our code and data are at https://github.com/cxcscmu/LLM-Interviewer.

Paperid: 130, https://arxiv.org/pdf/2502.15172.pdf GitHub

Abstract:
Decoding language information from brain signals represents a vital research area within brain-computer interfaces, particularly in the context of deciphering the semantic information from the fMRI signal. Although existing work uses LLMs to achieve this goal, it does not take an end-to-end approach and leaves the LLM out of the fMRI-to-text mapping, leaving room to explore the use of LLMs in auditory decoding. In this paper, we introduce a novel method, the Brain Prompt GPT (BP-GPT). By using the brain representation that is extracted from the fMRI as a prompt, our method can utilize GPT-2 to decode fMRI signals into stimulus text. Further, we introduce the text prompt and align the fMRI prompt to it. By introducing the text prompt, our BP-GPT can extract a more robust brain prompt and improve decoding with the pre-trained LLM. We evaluate our BP-GPT on the open-source auditory semantic decoding dataset and achieve a significant improvement of up to 4.61 on METEOR and 2.43 on BERTScore across all the subjects compared to the state-of-the-art method. The experimental results demonstrate that using brain representation as a prompt to drive an LLM for auditory neural decoding is feasible and effective. The code is available at https://github.com/1994cxy/BP-GPT.

Paperid: 131, https://arxiv.org/pdf/2502.13920.pdf GitHub

Abstract:
Despite the prevalence of sleep-tracking devices, many individuals struggle to translate data into actionable improvements in sleep health. Current methods often provide data-driven suggestions but may not be feasible and adaptive to real-life constraints and individual contexts. We present HealthGuru, a novel large language model-powered chatbot to enhance sleep health through data-driven, theory-guided, and adaptive recommendations with conversational behavior change support. HealthGuru's multi-agent framework integrates wearable device data, contextual information, and a contextual multi-armed bandit model to suggest tailored sleep-enhancing activities. The system facilitates natural conversations while incorporating data-driven insights and theoretical behavior change techniques. Our eight-week in-the-wild deployment study with 16 participants compared HealthGuru to a baseline chatbot. Results show improved metrics like sleep duration and activity scores, higher quality responses, and increased user motivation for behavior change with HealthGuru. We also identify challenges and design considerations for personalization and user engagement in health chatbots.

Paperid: 132, https://arxiv.org/pdf/2502.13902.pdf GitHub

Abstract:
Knowing where people look in visualizations is key to effective design. Yet, existing research primarily focuses on free-viewing-based saliency models, although visual attention is inherently task-dependent. Collecting task-relevant importance data remains a resource-intensive challenge. To address this, we introduce Grid Labeling, a novel annotation method for collecting task-specific importance data to enhance saliency prediction models. Grid Labeling dynamically segments visualizations into Adaptive Grids, enabling efficient, low-effort annotation while adapting to visualization structure. We conducted a human subject study comparing Grid Labeling with the existing annotation methods ImportAnnots and BubbleView across multiple metrics. Results show that Grid Labeling produces the least noisy data and the highest inter-participant agreement with fewer participants while requiring less physical (e.g., clicks/mouse movements) and cognitive effort. An interactive demo is available at https://jangsus1.github.io/Grid-Labeling.

Paperid: 133, https://arxiv.org/pdf/2502.13832.pdf GitHub

Abstract:
Can Multimodal Large Language Models (MLLMs), with capabilities in perception, recognition, understanding, and reasoning, function as independent assistants in art evaluation dialogues? Current MLLM evaluation methods, which rely on subjective human scoring or costly interviews, lack comprehensive coverage of various scenarios. This paper proposes a process-oriented Human-Computer Interaction (HCI) space design to facilitate more accurate MLLM assessment and development. This approach aids teachers in efficient art evaluation while also recording interactions for MLLM capability assessment. We introduce ArtMentor, a comprehensive space that integrates a dataset and three systems to optimize MLLM evaluation. The dataset consists of 380 sessions conducted by five art teachers across nine critical dimensions. The modular system includes agents for entity recognition, review generation, and suggestion generation, enabling iterative upgrades. Machine learning and natural language processing techniques ensure the reliability of evaluations. The results confirm GPT-4o's effectiveness in assisting teachers in art evaluation dialogues. Our contributions are available at https://artmentor.github.io/.

Paperid: 134, https://arxiv.org/pdf/2502.13130.pdf GitHub

Abstract:
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow it with agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.

Paperid: 135, https://arxiv.org/pdf/2502.13013.pdf GitHub

Abstract:
Generalizable humanoid loco-manipulation poses significant challenges, requiring coordinated whole-body control and precise, contact-rich object manipulation. To address this, this paper introduces HOMIE, a semi-autonomous teleoperation system that combines a reinforcement learning policy for body control mapped to a pedal, an isomorphic exoskeleton arm for arm control, and motion-sensing gloves for hand control, forming a unified cockpit to freely operate humanoids and establish a data flywheel. The policy incorporates novel designs, including an upper-body pose curriculum, a height-tracking reward, and symmetry utilization. These features enable the system to perform walking and squatting to specific heights while seamlessly adapting to arbitrary upper-body poses. The exoskeleton, by eliminating the reliance on inverse dynamics, delivers faster and more precise arm control. The gloves utilize Hall sensors instead of servos, allowing even compact devices to achieve 15 or more degrees of freedom and freely adapt to any model of dexterous hands. Compared to previous teleoperation systems, HOMIE stands out for its exceptional efficiency, completing tasks in half the time; its expanded working range, allowing users to freely reach high and low areas as well as interact with any objects; and its affordability, with a price of just $500. The system is fully open-source; demos and code can be found at https://homietele.github.io/.

Paperid: 136, https://arxiv.org/pdf/2502.12110.pdf GitHub GitHub

Abstract:
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.
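
The note-and-link idea described above can be pictured with the following minimal Python sketch. It is not the A-mem implementation: where the paper describes an agent analyzing historical memories to decide on connections, this example substitutes a simple Jaccard overlap between keyword sets, and the class names, the `threshold` parameter, and the sample notes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """One memory note with structured attributes (a simplified, assumed schema)."""
    note_id: int
    content: str
    keywords: set
    links: set = field(default_factory=set)


def jaccard(a, b):
    """Overlap between two keyword sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class MemoryNetwork:
    def __init__(self, threshold=0.3):
        # Assumed similarity cut-off; stands in for the agent's linking decision.
        self.threshold = threshold
        self.notes = {}

    def add(self, content, keywords):
        note = Note(len(self.notes), content, set(keywords))
        # Compare the new note with every stored note and link similar ones both ways.
        for other in self.notes.values():
            if jaccard(note.keywords, other.keywords) >= self.threshold:
                note.links.add(other.note_id)
                other.links.add(note.note_id)
        self.notes[note.note_id] = note
        return note


# Usage: the first two notes share keywords and become linked; the third stays isolated.
memory = MemoryNetwork()
first = memory.add("User asked about schema design", {"sql", "schema"})
second = memory.add("User asked about indexing a schema", {"sql", "schema", "index"})
third = memory.add("User asked for a recipe", {"cooking"})
```

The memory evolution step mentioned in the abstract would additionally revise the attributes of existing notes when a new one is linked to them; that step is omitted here.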

Paperid: 137, https://arxiv.org/pdf/2502.11946.pdf GitHub

Authors:Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, Siqi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He, Wen Sun, Xin Han, Xin Huang, Xiaomin Deng, Xiaojia Liu, Xin Wu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaoyu Wang, Yaqiang Shi, Yilei Wang, Yizhuang Zhou, Yinmin Zhong, Yang Zhang, Yaoben Wei, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuchu Luo, Yuanhao Ding, Yuting Yan, Yaqi Dai, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu

Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

Paperid: 138, https://arxiv.org/pdf/2502.11882.pdf GitHub

Abstract:
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. DPT-Agent can effectively help LLMs convert correct slow thinking and reasoning into executable actions, thereby improving performance. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.

Paperid: 139, https://arxiv.org/pdf/2502.11196.pdf GitHub

Abstract:
Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.

Paperid: 140, https://arxiv.org/pdf/2502.11190.pdf GitHub

Abstract:
Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the prediction of subsequent tokens, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.

Paperid: 141, https://arxiv.org/pdf/2502.10311.pdf GitHub

Abstract:
Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach include LIME, SHAP, and SLISEMAP. This paper shows how a large set of local explanations can be reduced to a small "proxy set" of simple models, which can act as a generative global explanation. This reduction procedure, ExplainReduce, can be formulated as an optimisation problem and approximated efficiently using greedy heuristics.
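
To illustrate the kind of greedy reduction described above, the following minimal sketch picks a small proxy set from a pool of local models by greedy maximum coverage. It is not the authors' implementation: the loss matrix layout, the coverage threshold `eps`, and the name `greedy_proxy_set` are assumptions made for this example.

```python
from typing import List, Sequence


def greedy_proxy_set(loss: Sequence[Sequence[float]], k: int, eps: float) -> List[int]:
    """Greedily pick up to k local models that cover the most data items.

    loss[m][i] is the (hypothetical) loss of local model m on item i;
    model m "covers" item i when loss[m][i] <= eps.
    """
    n_items = len(loss[0])
    covered = [False] * n_items
    chosen: List[int] = []
    for _ in range(k):
        best_model, best_gain = -1, 0
        for m, row in enumerate(loss):
            if m in chosen:
                continue
            # Marginal gain: items this model covers that are not covered yet.
            gain = sum(1 for i in range(n_items) if not covered[i] and row[i] <= eps)
            if gain > best_gain:
                best_model, best_gain = m, gain
        if best_model < 0:  # no remaining model adds coverage
            break
        chosen.append(best_model)
        for i in range(n_items):
            if loss[best_model][i] <= eps:
                covered[i] = True
    return chosen


# Toy pool: 4 local models evaluated on 6 items.
toy_loss = [
    [0.1, 0.1, 0.1, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.1, 0.1, 0.9],
    [0.1, 0.9, 0.9, 0.9, 0.9, 0.1],
    [0.9, 0.1, 0.9, 0.9, 0.9, 0.9],
]
proxies = greedy_proxy_set(toy_loss, k=2, eps=0.2)
```

Greedy coverage is only one possible heuristic; the abstract does not state which objective the paper optimises.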

Paperid: 142, https://arxiv.org/pdf/2502.07608.pdf GitHub

Abstract:
Large language models (LLMs) show promise for health applications when combined with behavioral sensing data. Traditional approaches convert sensor data into text prompts, but this process is prone to errors, computationally expensive, and requires domain expertise. These challenges are particularly acute when processing extended time series data. While time series foundation models (TFMs) have recently emerged as powerful tools for learning representations from temporal data, bridging TFMs and LLMs remains challenging. Here, we present Time2Lang, a framework that directly maps TFM outputs to LLM representations without intermediate text conversion. Our approach first trains on synthetic data using periodicity prediction as a pretext task, followed by evaluation on mental health classification tasks. We validate Time2Lang on two longitudinal wearable and mobile sensing datasets: daily depression prediction using step count data (17,251 days from 256 participants) and flourishing classification based on conversation duration (46 participants over 10 weeks). Time2Lang maintains near constant inference times regardless of input length, unlike traditional prompting methods. The generated embeddings preserve essential time-series characteristics such as auto-correlation. Our results demonstrate that TFMs and LLMs can be effectively integrated while minimizing information loss and enabling performance transfer across these distinct modeling paradigms. To our knowledge, we are the first to integrate a TFM and an LLM for health, thus establishing a foundation for future research combining general-purpose large models for complex healthcare tasks.
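
As a small illustration of two ideas mentioned above, synthetic periodic data and auto-correlation, the sketch below generates a sine wave with a known period and measures its autocorrelation at two lags. It does not reproduce Time2Lang: the function names and the use of a noiseless sine wave are assumptions for this example.

```python
import math
from typing import List


def synthetic_series(period: int, length: int) -> List[float]:
    """A noiseless sine wave whose period is the (hypothetical) pretext-task label."""
    return [math.sin(2 * math.pi * t / period) for t in range(length)]


def autocorrelation(x: List[float], lag: int) -> float:
    """Sample autocorrelation of x at the given lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        return 0.0
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var


series = synthetic_series(period=12, length=240)
at_period = autocorrelation(series, lag=12)      # strongly positive: the pattern repeats
at_half_period = autocorrelation(series, lag=6)  # strongly negative: the pattern is inverted
```

A check of this kind could be applied to any sequence derived from a time series to see whether its periodic structure survives a transformation.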

Paperid: 143, https://arxiv.org/pdf/2502.05442.pdf GitHub

Abstract:
As AI models grow in power and generality, understanding how agents learn and make decisions in complex environments is critical to promoting ethical behavior. This study introduces the Odyssey, a lightweight, adaptive text-based adventure game that provides a scalable framework for exploring AI ethics and safety. The Odyssey examines the ethical implications of implementing biological drives, specifically self-preservation, in three different agents: a Bayesian agent optimized with NEAT, a Bayesian agent optimized with stochastic variational inference, and a GPT-4o agent. The agents select actions in each scenario to survive, adapting to increasingly challenging scenarios. Post-simulation analysis evaluates the ethical scores of the agents' decisions, uncovering the tradeoffs they navigate to survive. Specifically, the analysis finds that when danger increases, the agents' ethical behavior becomes unpredictable. Surprisingly, the GPT-4o agent outperformed the Bayesian models in both survival and ethical consistency, challenging assumptions about traditional probabilistic methods and raising a new challenge to understand the mechanisms of LLMs' probabilistic reasoning.

Paperid: 144, https://arxiv.org/pdf/2502.04599.pdf GitHub

Abstract:
Linkography -- the analysis of links between the design moves that make up an episode of creative ideation or design -- can be used for both visual and quantitative assessment of creative activity traces. Traditional linkography, however, is time-consuming, requiring a human coder to manually annotate both the design moves within an episode and the connections between them. As a result, linkography has not yet been much applied at scale. To address this limitation, we introduce fuzzy linkography: a means of automatically constructing a linkograph from a sequence of recorded design moves via a "fuzzy" computational model of semantic similarity, enabling wider deployment and new applications of linkographic techniques. We apply fuzzy linkography to three markedly different kinds of creative activity traces (text-to-image prompting journeys, LLM-supported ideation sessions, and researcher publication histories) and discuss our findings, as well as strengths, limitations, and potential future applications of our approach.
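
The idea of linking design moves by graded semantic similarity can be sketched as follows. The abstract does not specify the similarity model, so this example substitutes a simple bag-of-words cosine similarity as a stand-in; `fuzzy_linkograph`, `cosine`, and the `min_strength` parameter are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fuzzy_linkograph(moves: List[str], min_strength: float = 0.0) -> Dict[Tuple[int, int], float]:
    """Return link strengths in [0, 1] for pairs of moves (i < j) above min_strength."""
    bags = [Counter(m.lower().split()) for m in moves]
    links = {}
    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            s = cosine(bags[i], bags[j])
            if s > min_strength:
                links[(i, j)] = s
    return links


# Toy prompting journey: the second move refines the first, the third is unrelated.
moves = [
    "red dragon over castle",
    "red dragon over castle at night",
    "watercolor cat portrait",
]
links = fuzzy_linkograph(moves)
```

In this toy run only the first two moves are linked; replacing `cosine` with an embedding-based similarity would give graded links between paraphrases as well.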

Paperid: 145, https://arxiv.org/pdf/2502.03724.pdf GitHub

Abstract:
Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state-of-the-art performance. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: https://github.com/HrishavBakulBarua/DarkBERT
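
The two pre-processing techniques named above can be sketched on a tiny 8-bit grayscale frame as shown below. This is a generic illustration rather than the paper's pipeline: the gamma value, the `three_streams` helper, and the list-of-lists frame format are assumptions, and the DFF fusion and BERT stages are not shown.

```python
from typing import List

Frame = List[List[int]]  # 8-bit grayscale image, values in 0..255


def gamma_correct(frame: Frame, gamma: float) -> Frame:
    """Brighten dark pixels: out = 255 * (in / 255) ** (1 / gamma), with gamma > 1."""
    return [[round(255 * (p / 255) ** (1.0 / gamma)) for p in row] for row in frame]


def equalize_histogram(frame: Frame) -> Frame:
    """Classic histogram equalization via the cumulative distribution of intensities."""
    hist = [0] * 256
    for row in frame:
        for p in row:
            hist[p] += 1
    cdf, running = [0] * 256, 0
    for v in range(256):
        running += hist[v]
        cdf[v] = running
    total = cdf[255]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in frame]
    return [[round((cdf[p] - cdf_min) / (total - cdf_min) * 255) for p in row] for row in frame]


def three_streams(frame: Frame, gamma: float = 2.0) -> List[Frame]:
    """Raw, gamma-corrected, and histogram-equalized views of one dark frame."""
    return [frame, gamma_correct(frame, gamma), equalize_histogram(frame)]


dark = [[0, 10], [20, 30]]
raw, bright, equalized = three_streams(dark)
```

Here the equalized view stretches the four dark intensities across the full 0..255 range, while gamma correction lifts them more gently.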

Paperid: 146, https://arxiv.org/pdf/2502.02904.pdf GitHub

Abstract:
Writing is a cognitively demanding task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities. Scholarly writing is particularly complex as it requires authors to coordinate many pieces of multiform knowledge. To fully understand writers' cognitive thought process, one should fully decode the end-to-end writing data (from individual ideas to final manuscript) and understand their complex cognitive mechanisms in scholarly writing. We introduce the ScholaWrite dataset, a first-of-its-kind keystroke corpus of an end-to-end scholarly writing process for complete manuscripts, with thorough annotations of cognitive writing intentions behind each keystroke. Our dataset includes LaTeX-based keystroke data from five preprints with nearly 62K total text changes and annotations across 4 months of paper writing. ScholaWrite shows promising usability and applications (e.g., iterative self-writing), demonstrating the importance of collecting end-to-end writing data, rather than the final manuscript, for the development of future writing assistants to support the cognitive thinking process of scientists. Our de-identified data examples and code are available on our project page.

Paperid: 147, https://arxiv.org/pdf/2502.02883.pdf GitHub

Abstract:
Natural language interaction with sensing systems is crucial for addressing users' personal concerns and providing health-related insights into their daily lives. When a user asks a question, the system automatically analyzes the full history of sensor data, extracts relevant information, and generates an appropriate response. However, existing systems are limited to short-duration (e.g., one minute) or low-frequency (e.g., daily step count) sensor data. In addition, they struggle with quantitative questions that require precise numerical answers. In this work, we introduce SensorChat, the first end-to-end QA system designed for daily life monitoring using long-duration, high-frequency time series data. Given raw sensor signals spanning multiple days and a user-defined natural language question, SensorChat generates semantically meaningful responses that directly address user concerns. SensorChat effectively handles both quantitative questions that require numerical precision and qualitative questions that require high-level reasoning to infer subjective insights. To achieve this, SensorChat uses an innovative three-stage pipeline including question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) to interpret human queries and generate responses. The intermediate querying stage extracts relevant information from the complete sensor data history. Real-world implementations demonstrate SensorChat's capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves 93% higher answer accuracy than the best performing state-of-the-art systems on quantitative questions. Furthermore, a user study with eight volunteers highlights SensorChat's effectiveness in answering qualitative questions.

Paperid: 148, https://arxiv.org/pdf/2502.02172.pdf GitHub

Abstract:
We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual camera shots termed rushes are subsequently assembled using an automated editing algorithm, whose objective is to present the viewer with the most vivid scene content. To understand key scene elements and guide the editing process, we employ a two-pronged approach: (1) a large language model (LLM)-based dialogue understanding module to analyze conversational flow, coupled with (2) visual saliency prediction to identify meaningful scene elements and camera shots therefrom. We then formulate cinematic video editing as an energy minimization problem over shot selection, where cinematic constraints determine shot choices, transitions, and continuity. EditIQ synthesizes an aesthetically and visually compelling representation of the original narrative while maintaining cinematic coherence and a smooth viewing experience. Efficacy of EditIQ against competing baselines is demonstrated via a psychophysical study involving twenty participants on the BBC Old School dataset plus eleven theatre performance videos. Video samples from EditIQ can be found at https://editiq-ave.github.io/.
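
Shot selection as energy minimization can be sketched with a small dynamic program. The abstract does not give the energy terms, so this example assumes a per-step cost `unary[t][s]` for showing shot `s` at step `t` plus a fixed `switch_cost` for every cut; both names and the function `select_shots` are hypothetical.

```python
from typing import List, Sequence


def select_shots(unary: Sequence[Sequence[float]], switch_cost: float) -> List[int]:
    """Minimize sum_t unary[t][s_t] + switch_cost * [s_t != s_{t-1}] by dynamic programming.

    unary[t][s] is a (hypothetical) cost of showing virtual shot s at time step t,
    e.g. low when the shot frames the current speaker or a salient region.
    """
    n_steps, n_shots = len(unary), len(unary[0])
    cost = list(unary[0])
    back: List[List[int]] = []
    for t in range(1, n_steps):
        new_cost, pointers = [], []
        for s in range(n_shots):
            # Best previous shot, paying switch_cost for a cut.
            prev = min(range(n_shots), key=lambda p: cost[p] + (switch_cost if p != s else 0.0))
            new_cost.append(cost[prev] + (switch_cost if prev != s else 0.0) + unary[t][s])
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest sequence back from the final step.
    shot = min(range(n_shots), key=lambda s: cost[s])
    path = [shot]
    for pointers in reversed(back):
        shot = pointers[shot]
        path.append(shot)
    return path[::-1]


unary = [
    [0.0, 1.0],
    [0.0, 1.0],
    [0.6, 0.4],  # shot 1 is only slightly better for one step
    [0.0, 1.0],
    [0.0, 1.0],
]
smooth = select_shots(unary, switch_cost=0.5)
jumpy = select_shots(unary, switch_cost=0.0)
```

With a cut penalty the brief preference for shot 1 is not worth two cuts, so the edit stays on shot 0; without the penalty the selection flickers to shot 1 for a single step.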

Paperid: 149, https://arxiv.org/pdf/2502.01620.pdf GitHub

Abstract:
Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. https://github.com/jiaweixu98/LLM-TA

Paperid: 150, https://arxiv.org/pdf/2502.00547.pdf GitHub

Abstract:
Emotions play a crucial role in human behavior and decision-making, making emotion recognition a key area of interest in human-computer interaction (HCI). This study addresses the challenges of emotion recognition by integrating facial expression analysis with electroencephalogram (EEG) signals, introducing a novel multimodal framework, Milmer. The proposed framework employs a transformer-based fusion approach to effectively integrate visual and physiological modalities. It consists of an EEG preprocessing module, a facial feature extraction and balancing module, and a cross-modal fusion module. To enhance visual feature extraction, we fine-tune a pre-trained Swin Transformer on emotion-related datasets. Additionally, a cross-attention mechanism is introduced to balance token representation across modalities, ensuring effective feature integration. A key innovation of this work is the adoption of a multiple instance learning (MIL) approach, which extracts meaningful information from multiple facial expression images over time, capturing critical temporal dynamics often overlooked in previous studies. Extensive experiments conducted on the DEAP dataset demonstrate the superiority of the proposed framework, achieving a classification accuracy of 96.72% in the four-class emotion recognition task. Ablation studies further validate the contributions of each module, highlighting the significance of advanced feature extraction and fusion strategies in enhancing emotion recognition performance. Our code is available at https://github.com/liangyubuaa/Milmer.

Paperid: 151, https://arxiv.org/pdf/2501.17585.pdf GitHub

Abstract:
This paper presents the design and implementation of TAPOR, a privacy-preserving, non-contact, and fully passive sensing system for accurate and robust 3D hand pose reconstruction for around-device interaction using a single low-cost thermal array sensor. Thermal sensing using inexpensive and miniature thermal arrays emerges with an excellent utility-privacy balance, offering an imaging resolution significantly lower than cameras but far superior to RF signals like radar or WiFi. The design of TAPOR, however, is challenging, mainly because the captured temperature maps are low-resolution and textureless. To overcome the challenges, we investigate thermo-depth and thermo-pose properties, proposing a novel physics-inspired neural network that learns effective 3D spatial representations of potential hand poses. We then formulate the 3D pose reconstruction problem as a distinct retrieval task, enabling accurate hand pose determination from the input temperature map. To deploy TAPOR on IoT devices, we introduce an effective heterogeneous knowledge distillation method, reducing computation by 377x. TAPOR is fully implemented and tested in real-world scenarios, showing remarkable performance, supported by four gesture control and finger tracking case studies. We envision TAPOR to be a ubiquitous interface for around-device control and have open-sourced it at https://github.com/aiot-lab/TAPOR.

Paperid: 152, https://arxiv.org/pdf/2501.17546.pdf GitHub

Abstract:
Explainable artificial intelligence (XAI) methods are being proposed to help interpret and understand how AI systems reach specific predictions. Inspired by prior work on conversational user interfaces, we argue that augmenting existing XAI methods with conversational user interfaces can increase user engagement and boost user understanding of the AI system. In this paper, we explored the impact of a conversational XAI interface on users' understanding of the AI system, their trust, and reliance on the AI system. In comparison to an XAI dashboard, we found that the conversational XAI interface can bring about a better understanding of the AI system among users and higher user trust. However, users of both the XAI dashboard and conversational XAI interfaces showed clear overreliance on the AI system. Enhanced conversations powered by large language model (LLM) agents amplified over-reliance. Based on our findings, we reason that the potential cause of such overreliance is the illusion of explanatory depth that is concomitant with both XAI interfaces. Our findings have important implications for designing effective conversational XAI interfaces to facilitate appropriate reliance and improve human-AI collaboration. Code can be found at https://github.com/delftcrowd/IUI2025_ConvXAI

Paperid: 153, https://arxiv.org/pdf/2501.16609.pdf GitHub

Abstract:
While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html

Paperid: 154, https://arxiv.org/pdf/2501.16566.pdf GitHub

Abstract:
The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level, from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as a multimodal-centric framework to maximize the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption) and a new model (AffectGPT). Utilizing our model-based crowd-sourcing data collection strategy, we construct the largest descriptive emotion dataset to date (by far), featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored for typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results show AffectGPT's robust performance across various MER tasks. We have released both the code and the dataset to advance research and development in emotion understanding: https://github.com/zeroQiaoba/AffectGPT.

Paperid: 155, https://arxiv.org/pdf/2501.14225.pdf GitHub

Abstract:
Achieving Artificial General Intelligence (AGI) requires AI agents that can not only make strategic decisions but also engage in flexible and meaningful communication. Inspired by Wittgenstein's language game theory in Philosophical Investigations, we propose that language agents can learn through in-context interaction rather than traditional multi-stage frameworks that separate decision-making from language expression. Using Werewolf, a social deduction game that tests language understanding, strategic interaction, and adaptability, we develop the Multi-agent Kahneman & Tversky's Optimization (MaKTO). MaKTO engages diverse models in extensive gameplay to generate unpaired desirable and unacceptable responses, then employs KTO to refine the model's decision-making process. In 9-player Werewolf games, MaKTO achieves a 61% average win rate across various models, outperforming GPT-4o and two-stage RL agents by relative improvements of 23.0% and 10.9%, respectively. Notably, MaKTO also demonstrates human-like performance, winning 60% against expert players and showing only 49% detectability in Turing-style blind tests.

Paperid: 156, https://arxiv.org/pdf/2501.11803.pdf GitHub

Abstract:
Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances with artificial intelligence (AI) promise to improve its precision and efficiency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution for generating high-quality treatment plans. This scalable solution is designed to generate substantial volumes of consistently high-quality treatment plans, overcoming a key obstacle in the advancement of AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software like Varian Eclipse. Furthermore, we propose a novel approach for determining optimization parameters to reproduce 3D dose distributions, i.e., a method to convert dose predictions into deliverable treatment plans constrained by machine limitations. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To the best of our knowledge, this dataset features more than 10 times the number of plans in the largest existing well-curated public dataset. Repo: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge.

Paperid: 157, https://arxiv.org/pdf/2501.09782.pdf GitHub

Abstract:
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).

Paperid: 158, https://arxiv.org/pdf/2501.09751.pdf GitHub

Abstract:
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and novelty and to suffer from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.

Paperid: 159, https://arxiv.org/pdf/2501.09349.pdf GitHub

Abstract:
Effective chart summary can significantly reduce the time and effort decision makers spend interpreting charts, enabling precise and efficient communication of data insights. Previous studies have faced challenges in generating accurate and semantically rich summaries of time-series data charts. In this paper, we identify summary elements and common hallucination types in the generation of time-series chart summaries, which serve as our guidelines for automatic generation. We introduce ChartInsighter, which automatically generates chart summaries of time-series data, effectively reducing hallucinations in chart summary generation. Specifically, we assign multiple agents to generate the initial chart summary and collaborate iteratively, during which they invoke external data analysis modules to extract insights and compile them into a coherent summary. Additionally, we implement a self-consistency test method to validate and correct our summary. We create a high-quality benchmark of charts and summaries, with hallucination types annotated on a sentence-by-sentence basis, facilitating the evaluation of the effectiveness of reducing hallucinations. Our evaluations using our benchmark show that our method surpasses state-of-the-art models, and that our summary hallucination rate is the lowest, which effectively reduces various hallucinations and improves summary quality. The benchmark is available at https://github.com/wangfen01/ChartInsighter.

Paperid: 160, https://arxiv.org/pdf/2501.08558.pdf GitHub

Abstract:
Teleoperating high degrees-of-freedom (DoF) robotic manipulators via low-DoF controllers like joysticks often requires frequent switching between control modes, where each mode maps controller movements to specific robot actions. Manually performing this frequent switching can make teleoperation cumbersome and inefficient. On the other hand, existing automatic mode-switching solutions, such as heuristic-based or learning-based methods, are often task-specific and lack generalizability. In this paper, we introduce LLM-Driven Automatic Mode Switching (LAMS), a novel approach that leverages Large Language Models (LLMs) to automatically switch control modes based on task context. Unlike existing methods, LAMS requires no prior task demonstrations and incrementally improves by integrating user-generated mode-switching examples. We validate LAMS through an ablation study and a user study with 10 participants on complex, long-horizon tasks, demonstrating that LAMS effectively reduces manual mode switches, is preferred over alternative methods, and improves performance over time. The project website with supplementary materials is at https://lams-assistance.github.io/.

Paperid: 161, https://arxiv.org/pdf/2501.08187.pdf GitHub

Abstract:
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks (such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction) using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.

Paperid: 162, https://arxiv.org/pdf/2501.07051.pdf GitHub

Abstract:
Human-robot interaction (HRI) is an interdisciplinary field that utilises both quantitative and qualitative methods. While ROSBags, a file format within the Robot Operating System (ROS), offer an efficient means of collecting temporally synched multimodal data in empirical studies with real robots, there is a lack of tools specifically designed to integrate qualitative coding and analysis functions with ROSBags. To address this gap, we developed ROSAnnotator, a web-based application that incorporates a multimodal Large Language Model (LLM) to support both manual and automated annotation of ROSBag data. ROSAnnotator currently facilitates video, audio, and transcription annotations and provides an open interface for custom ROS messages and tools. By using ROSAnnotator, researchers can streamline the qualitative analysis process, create a more cohesive analysis pipeline, and quickly access statistical summaries of annotations, thereby enhancing the overall efficiency of HRI data analysis. https://github.com/CHRI-Lab/ROSAnnotator

Paperid: 163, https://arxiv.org/pdf/2501.06282.pdf GitHub

Abstract:
Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.

Paperid: 164, https://arxiv.org/pdf/2501.06250.pdf GitHub

Abstract:
The traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation

Paperid: 165, https://arxiv.org/pdf/2501.05790.pdf GitHub

Abstract:
In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in refining their strategies to better align with expert feedback. By quantifying the impact of human feedback, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF
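
To make the influence-function idea concrete, the sketch below uses the classical first-order form: the effect of upweighting a training example on a test loss is approximated by the negative product of the test gradient, the inverse Hessian of the training loss, and the training gradient. This is a toy stand-in under strong assumptions: a one-parameter least-squares model replaces the LLM-based reward model, the data are hypothetical, and the paper's compute-efficient approximation is not reproduced here.

```python
# Toy illustration only: a scalar least-squares model, not an LLM reward model.

def fit_theta(xs, ys):
    """Closed-form minimizer of (1/n) * sum(0.5 * (theta * x - y) ** 2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def grad(theta, x, y):
    """Gradient of 0.5 * (theta * x - y) ** 2 with respect to theta."""
    return (theta * x - y) * x

def influence(xs, ys, train_idx, x_test, y_test):
    """Influence of upweighting one training point on the test loss:
    -grad_test * H^{-1} * grad_train, where H is a scalar in this toy case."""
    theta = fit_theta(xs, ys)
    hessian = sum(x * x for x in xs) / len(xs)
    g_train = grad(theta, xs[train_idx], ys[train_idx])
    g_test = grad(theta, x_test, y_test)
    return -g_test * g_train / hessian

# Hypothetical data: the last label is deliberately noisy.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 8.0]
scores = [influence(xs, ys, i, 2.0, 2.0) for i in range(len(xs))]
most_harmful = max(range(len(xs)), key=lambda i: scores[i])
```

A positive score means that upweighting the point would raise the test loss, so the largest score marks the training example most worth re-checking; in this toy data that is the noisy last point.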

Paperid: 166, https://arxiv.org/pdf/2501.04575.pdf GitHub

Abstract:
Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.

Paperid: 167, https://arxiv.org/pdf/2501.01384.pdf GitHub

Abstract:
With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at https://sharechatx.github.io/.

Paperid: 168, https://arxiv.org/pdf/2501.01212.pdf GitHub

Abstract:
Cybersickness remains a major obstacle to the widespread adoption of immersive virtual reality (VR), particularly in consumer-grade environments. While prior methods rely on invasive signals such as electroencephalography (EEG) for high predictive accuracy, these approaches require specialized hardware and are impractical for real-world applications. In this work, we propose a scalable, deployable framework for personalized cybersickness prediction leveraging only non-invasive signals readily available from commercial VR headsets, including head motion, eye tracking, and physiological responses. Our model employs a modality-specific graph neural network enhanced with a Difference Attention Module to extract temporal-spatial embeddings capturing dynamic changes across modalities. A cross-modal alignment module jointly trains the video encoder to learn personalized traits by aligning video features with sensor-derived representations. Consequently, the model accurately predicts individual cybersickness using only video input during inference. Experimental results show our model achieves 88.4% accuracy, closely matching EEG-based approaches (89.16%), while reducing deployment complexity. With an average inference latency of 90ms, our framework supports real-time applications, ideal for integration into consumer-grade VR platforms without compromising personalization or performance. The code will be released at https://github.com/U235-Aurora/PTGNN.

Paperid: 169, https://arxiv.org/pdf/2502.00094.pdf

Abstract:
Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN (the Arabic Inclusive Multimodal Model), designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic, leveraging 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark, comprising 38 sub-domains including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, our AIN demonstrates strong performance, with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.

Paperid: 170, https://arxiv.org/pdf/2502.04364.pdf

Abstract:
Recent advancements in diffusion models have driven the growth of text-guided image editing tools, enabling precise and iterative modifications of synthesized content. However, as these tools become increasingly accessible, they also introduce significant risks of misuse, emphasizing the critical need for robust attribution methods to ensure content authenticity and traceability. Despite the creative potential of such tools, they pose significant challenges for attribution, particularly in adversarial settings where edits can be layered to obscure an image's origins. We propose LambdaTracer, a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones without requiring any modifications to generative or editing pipelines. By adaptively calibrating reconstruction losses, LambdaTracer remains effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix and ControlNet or performed manually with editing software such as Adobe Photoshop. Extensive experiments reveal that our method consistently outperforms baseline approaches in distinguishing maliciously edited images, providing a practical solution to safeguard ownership, creativity, and credibility in the open, fast-evolving AI ecosystems.

Paperid: 171, https://arxiv.org/pdf/2502.13134.pdf

Abstract:
Humanoid robots have shown success in locomotion and manipulation. Despite these basic abilities, humanoids are still required to quickly understand human instructions and react based on human interaction signals to become valuable assistants in human daily life. Unfortunately, most existing works only focus on multi-stage interactions, treating each task separately, and neglecting real-time feedback. In this work, we aim to empower humanoid robots with real-time reaction abilities to achieve various tasks, allowing humans to interrupt robots at any time, and making robots respond to humans immediately. To support such abilities, we propose a general humanoid-human-object interaction framework, named RHINO, i.e., Real-time Humanoid-human Interaction and Object manipulation. RHINO provides a unified view of reactive motion, instruction-based manipulation, and safety concerns, over multiple human signal modalities, such as languages, images, and motions. RHINO is a hierarchical learning framework, enabling humanoids to learn reaction skills from human-human-object demonstrations and teleoperation data. In particular, it decouples the interaction process into two levels: 1) a high-level planner inferring human intentions from real-time human behaviors; and 2) a low-level controller achieving reactive motion behaviors and object manipulation skills based on the predicted intentions. We evaluate the proposed framework on a real humanoid robot and demonstrate its effectiveness, flexibility, and safety in various scenarios.

Paperid: 172, https://arxiv.org/pdf/2503.11069.pdf

Abstract:
Large language models (LLMs) have evolved beyond simple text generation to power software agents that directly translate natural language commands into tangible actions. While API-based LLM agents initially rose to prominence for their robust automation capabilities and seamless integration with programmatic endpoints, recent progress in multimodal LLM research has enabled GUI-based LLM agents that interact with graphical user interfaces in a human-like manner. Although these two paradigms share the goal of enabling LLM-driven task automation, they diverge significantly in architectural complexity, development workflows, and user interaction models. This paper presents the first comprehensive comparative study of API-based and GUI-based LLM agents, systematically analyzing their divergence and potential convergence. We examine key dimensions and highlight scenarios in which hybrid approaches can harness their complementary strengths. By proposing clear decision criteria and illustrating practical use cases, we aim to guide practitioners and researchers in selecting, combining, or transitioning between these paradigms. Ultimately, we indicate that continuing innovations in LLM-based automation are poised to blur the lines between API- and GUI-driven agents, paving the way for more flexible, adaptive solutions in a wide range of real-world applications.

Paperid: 173, https://arxiv.org/pdf/2506.00241.pdf

Abstract:
Serious illness conversations (SICs), discussions between clinical care teams and patients with serious, life-limiting illnesses about their values, goals, and care preferences, are critical for patient-centered care. Without these conversations, patients often receive aggressive interventions that may not align with their goals. Clinical care teams face significant barriers when conducting serious illness conversations with older adult patients in Emergency Department (ED) settings, where most older adult patients lack documented treatment goals. To understand current practices and identify AI support opportunities, we conducted interviews with two domain experts and nine ED clinical care team members. Through thematic analysis, we characterized a four-phase serious illness conversation workflow (identification, preparation, conduction, documentation) and identified key needs and challenges at each stage. Clinical care teams struggle with fragmented EHR data access, time constraints, emotional preparation demands, and documentation burdens. While participants expressed interest in AI tools for information synthesis, conversational support, and automated documentation, they emphasized preserving human connection and clinical autonomy. We present design guidelines for AI tools supporting SIC workflows that fit within existing clinical practices. This work contributes empirical understanding of ED-based serious illness conversations and provides design considerations for AI in high-stakes clinical environments.

Paperid: 174, https://arxiv.org/pdf/2504.09723.pdf

Abstract:
A/B testing is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on large-scale, live traffic from human participants and by the long wait for test results. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model-based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi-step interactions like search, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between-subject A/B test with 1,000 LLM agents on Amazon.com, and compare agent behaviors with real human shopping behaviors at scale. Our findings suggest AgentA/B can emulate human-like behavior patterns.
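
For readers unfamiliar with how a between-subject A/B comparison is typically scored, the sketch below applies a standard pooled two-proportion z-test to conversion counts from two arms. The abstract does not state which statistics AgentA/B reports, so the function name `two_proportion_z` and the counts used here are hypothetical placeholders, not results from the paper.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test for arms A and B.
    Returns (z, two-sided p-value); z > 0 means arm B converts more often."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value

# Hypothetical counts: 1,000 simulated agents split evenly across two designs.
z, p_value = two_proportion_z(conv_a=40, n_a=500, conv_b=60, n_b=500)
```

Each simulated agent sees only one design, which is what makes the comparison between-subject; the same function applies unchanged to counts from human traffic.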

Paperid: 175, https://arxiv.org/pdf/2504.09407.pdf

Abstract:
Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate their new designs. But what about evaluating and iterating the usability testing study design itself? Recent advances in Large Language Model-simulated Agent (LLM Agent) research inspired us to design UXAgent to support UX researchers in evaluating and iterating their study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users and to interactively test the target website. The system also provides a Result Viewer Interface so that the UX researchers can easily review and analyze the generated qualitative (e.g., agents' post-study surveys) and quantitative data (e.g., agents' interaction logs), or even interview agents directly. Through a heuristic evaluation with 16 UX researchers, participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.

Paperid: 176, https://arxiv.org/pdf/2503.00590.pdf

Abstract:
Personalized interaction is highly valued by parents in their story-reading activities with children. While AI-empowered story-reading tools have been increasingly used, their abilities to support personalized interaction with children are still limited. Recent advances in large language models (LLMs) show promise in facilitating personalized interactions, but little is known about how to effectively and appropriately use LLMs to enhance children's personalized story-reading experiences. This work explores this question through a design-based study. Drawing on a formative study, we designed and developed StoryMate, an LLM-empowered personalized interactive story-reading tool for children, following an empirical study with children, parents, and education experts. Our participants valued the personalized features in StoryMate, and also highlighted the need to support personalized content, guiding mechanisms, reading context variations, and interactive interfaces. Based on these findings, we propose a series of design recommendations for better using LLMs to empower children's personalized story reading and interaction.

Paperid: 177, https://arxiv.org/pdf/2502.13012.pdf

Abstract:
Role-Playing Agent (RPA) is an increasingly popular type of LLM Agent that simulates human-like behaviors in a variety of tasks. However, evaluating RPAs is challenging due to diverse task requirements and agent designs. This paper proposes an evidence-based, actionable, and generalizable evaluation design guideline for LLM-based RPA by systematically reviewing 1,676 papers published between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes, seven task attributes, and seven evaluation metrics from existing literature. Based on these findings, we present an RPA evaluation design guideline to help researchers develop more systematic and consistent evaluation methods.

Paperid: 178, https://arxiv.org/pdf/2502.12561.pdf

Abstract:
Usability testing is a fundamental yet challenging (e.g., inflexible to iterate the study design flaws and hard to recruit study participants) research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model-simulated Agent (LLM-Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human subject study. Our system features an LLM-Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The results are shown in qualitative (e.g., interviewing how an agent thinks), quantitative (e.g., number of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of LLM Agent-assisted UX study.

Paperid: 179, https://arxiv.org/pdf/2502.05783.pdf

Abstract:
While just-in-time interventions (JITIs) have effectively targeted common health behaviors, individuals often have unique needs to intervene in personal undesirable actions that can negatively affect physical, mental, and social well-being. We present WatchGuardian, a smartwatch-based JITI system that empowers users to define custom interventions for these personal actions with a small number of samples. For the model to detect new actions based on limited new data samples, we developed a few-shot learning pipeline that finetuned a pre-trained inertial measurement unit (IMU) model on public hand-gesture datasets. We then designed a data augmentation and synthesis process to train additional classification layers for customization. Our offline evaluation with 26 participants showed that with three, five, and ten examples, our approach achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of 74.8%, 84.2%, and 87.2%, respectively. We then conducted a four-hour intervention study to compare WatchGuardian against a rule-based intervention. Our results demonstrated that our system led to a significant reduction of 64.0 ± 22.6% in undesirable actions, substantially outperforming the baseline by 29.0%. Our findings underscore the effectiveness of a customizable, AI-driven JITI system for individuals in need of behavioral intervention in personal undesirable actions. We envision that our work can inspire broader applications of user-defined personalized intervention with advanced AI solutions.

Paperid: 180, https://arxiv.org/pdf/2502.05740.pdf

Abstract:
Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.

Paperid: 181, https://arxiv.org/pdf/2502.05115.pdf

Abstract:
Older adult patients constitute a rapidly growing subgroup of Intensive Care Unit (ICU) patients. In these situations, their family caregivers are expected to represent the unconscious patients to access and interpret patients' medical information. However, caregivers currently have to rely on overloaded clinicians for information updates and typically lack the health literacy to understand complex medical information. Our project aims to explore the information needs of caregivers of ICU older adult patients, from which we can propose design opportunities to guide future AI systems. The project begins with formative interviews with 11 caregivers to identify their challenges in accessing and interpreting medical information. From these findings, we then synthesize design requirements and propose an AI system prototype to cope with caregivers' challenges. The system prototype has two key features: a timeline visualization that shows the AI-extracted and summarized key medical events of older adult patients, and an LLM-based chatbot that provides context-aware informational support. We conclude our paper by reporting on the follow-up user evaluation of the system and discussing future AI-based systems for ICU caregivers of older adults.

Paperid: 182, https://arxiv.org/pdf/2502.03732.pdf

Abstract:
Anxiety, depression, and suicidality are common mental health sequelae following concussion in youth patients, often exacerbating concussion symptoms and prolonging recovery. Despite the critical need for early detection of these mental health symptoms, clinicians often face challenges in accurately collecting patients' mental health data and making timely clinical decisions. Today's remote patient monitoring (RPM) technologies offer opportunities to objectively monitor patients' activities, but they were not specifically designed for youth concussion patients; moreover, the large amount of data collected by RPM technologies may also impose significant workloads on clinicians to keep up with and use the data. To address these gaps, we employed a three-stage study consisting of a formative study, interface design, and design evaluation. We first conducted a formative study through semi-structured interviews with six highly professional concussion clinicians and identified clinicians' key challenges in remotely collecting patient information and accessing patient treatment compliance. Subsequently, we proposed preliminary clinician-facing interface designs with the integration of AI-based RPM technologies (AI-RPM), followed by design evaluation sessions with highly professional concussion clinicians. Clinicians underscored the value of integrating multi-modal AI-RPM technologies to support their decision-making while emphasizing the importance of customizable interfaces through collaborative design and multiple responsible design considerations.

Paperid: 183, https://arxiv.org/pdf/2501.00190.pdf

Abstract:
Sepsis is an organ dysfunction caused by a deregulated immune response to an infection. Early sepsis prediction and identification allow for timely intervention, leading to improved clinical outcomes. Clinical calculators (e.g., the six-organ dysfunction assessment of SOFA) play a vital role in sepsis identification within clinicians' workflow, providing evidence-based risk assessments essential for sepsis diagnosis. However, artificial intelligence (AI) sepsis prediction models typically generate a single sepsis risk score without incorporating clinical calculators for assessing organ dysfunctions, making the models less convincing and transparent to clinicians. To bridge the gap, we propose to mimic clinicians' workflow with a novel framework SepsisCalc to integrate clinical calculators into the predictive model, yielding a clinically transparent and precise model for utilization in clinical settings. Practically, clinical calculators usually combine information from multiple component variables in Electronic Health Records (EHR), and might not be applicable when the variables are (partially) missing. We mitigate this issue by representing EHRs as temporal graphs and integrating a learning module to dynamically add the accurately estimated calculator to the graphs. Experimental results on real-world datasets show that the proposed model outperforms state-of-the-art methods on sepsis prediction tasks. Moreover, we developed a system to identify organ dysfunctions and potential sepsis risks, providing a human-AI interaction tool for deployment, which can help clinicians understand the prediction outputs and prepare timely interventions for the corresponding dysfunctions, paving the way for actionable clinical decision-making support for early intervention.

Paperid: 184, https://arxiv.org/pdf/2503.11733.pdf

Abstract:
Large Language Model (LLM) agents have demonstrated remarkable capabilities in automating tasks and driving innovation across diverse educational applications. In this survey, we provide a systematic review of state-of-the-art research on LLM agents in education, categorizing them into two broad classes: (1) Pedagogical Agents, which focus on automating complex pedagogical tasks to support both teachers and students; and (2) Domain-Specific Educational Agents, which are tailored for specialized fields such as science education, language learning, and professional development. We comprehensively examine the technological advancements underlying these LLM agents, including key datasets, benchmarks, and algorithmic frameworks that drive their effectiveness. Furthermore, we discuss critical challenges such as privacy, bias and fairness concerns, hallucination mitigation, and integration with existing educational ecosystems. This survey aims to provide a comprehensive technological overview of LLM agents for education, fostering further research and collaboration to enhance their impact for the greater good of learners and educators alike.

Paperid: 185, https://arxiv.org/pdf/2503.20202.pdf

Abstract:
Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically-aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.

Paperid: 186, https://arxiv.org/pdf/2502.13472.pdf

Abstract:
Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication. However, existing approaches face challenges such as difficulties in independent module optimization and contextual noise interference due to highly coupled architectural designs and oversimplified binary state modeling. This paper proposes FlexDuo, a flexible full-duplex control module that decouples duplex control from spoken dialogue systems through a plug-and-play architectural design. Furthermore, inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. On one hand, the Idle state filters redundant noise and irrelevant audio to enhance dialogue quality. On the other hand, it establishes a semantic integrity-based buffering mechanism, reducing the risk of mutual interruptions while ensuring accurate response transitions. Experimental results on the Fisher corpus demonstrate that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines. It also outperforms voice activity detection (VAD) controlled baseline systems in both Chinese and English dialogue quality. The proposed modular architecture and state-based dialogue model provide a novel technical pathway for building flexible and efficient duplex dialogue systems.
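
The role of the explicit Idle state can be illustrated with a small state machine. This is a minimal sketch under stated assumptions, not the FlexDuo implementation: the class `DuplexController`, its `step` and `finish_response` methods, and the boolean flags `relevant` and `complete` are hypothetical stand-ins for whatever relevance and semantic-integrity signals the real module computes.

```python
from dataclasses import dataclass, field
from enum import Enum


class DuplexState(Enum):
    IDLE = "idle"      # nothing relevant is happening; incoming audio is ignored
    LISTEN = "listen"  # relevant user speech is being buffered
    SPEAK = "speak"    # the dialogue system is responding


@dataclass
class DuplexController:
    """Hypothetical plug-in controller that sits in front of a dialogue system."""
    state: DuplexState = DuplexState.IDLE
    buffer: list = field(default_factory=list)

    def step(self, chunk: str, relevant: bool, complete: bool):
        """Process one audio chunk (shown as text for simplicity).

        Returns the buffered utterance once it is complete, otherwise None.
        """
        if not relevant:
            # Noise or off-topic audio is dropped and never changes the state,
            # so it cannot interrupt an ongoing response.
            return None
        self.buffer.append(chunk)
        if not complete:
            # Keep buffering until the utterance is semantically complete.
            self.state = DuplexState.LISTEN
            return None
        utterance = " ".join(self.buffer)
        self.buffer.clear()
        self.state = DuplexState.SPEAK
        return utterance

    def finish_response(self):
        """Called by the dialogue system when its response has ended."""
        self.state = DuplexState.IDLE
```

In this sketch, irrelevant chunks are discarded before they reach the dialogue system, and a response is triggered only when the buffered speech is flagged as complete, which mirrors the two functions the abstract attributes to the Idle state.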

Paperid: 187, https://arxiv.org/pdf/2506.10925.pdf

Abstract:
Lunar surface operations impose stringent requirements on wireless communication systems, including autonomy, robustness to disruption, and the ability to adapt to environmental and mission-driven context. While Space-O-RAN provides a distributed orchestration model aligned with 3GPP standards, its decision logic is limited to static policies and lacks semantic integration. We propose a novel extension incorporating a semantic agentic layer enabled by the Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication protocols, allowing context-aware decision making across real-time, near-real-time, and non-real-time control layers. Distributed cognitive agents deployed in rovers, landers, and lunar base stations implement wireless-aware coordination strategies, including delay-adaptive reasoning and bandwidth-aware semantic compression, while interacting with multiple MCP servers to reason over telemetry, locomotion planning, and mission constraints.

Paperid: 188, https://arxiv.org/pdf/2504.13865.pdf

Abstract:
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents' capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field's current state and offers insights into future developments in intelligent interface automation.

Paperid: 189, https://arxiv.org/pdf/2506.14775.pdf

Abstract:
As machine learning systems increasingly inform critical decisions, the need for human-understandable explanations grows. Current evaluations of Explainable AI (XAI) often prioritize technical fidelity over cognitive accessibility, which critically affects users, in particular those with visual impairments. We propose CUE, a model for Cognitive Understanding of Explanations, linking explanation properties to cognitive sub-processes: legibility (perception), readability (comprehension), and interpretability (interpretation). In a study (N=455) testing heatmaps with varying colormaps (BWR, Cividis, Coolwarm), we found comparable task performance but lower confidence/effort for visually impaired users. Contrary to expectations, these gaps were not mitigated and sometimes worsened by accessibility-focused color maps like Cividis. These results challenge assumptions about perceptual optimization and support the need for adaptive XAI interfaces. They also validate CUE by demonstrating that altering explanation legibility affects understandability. We contribute: (1) a formalized cognitive model for explanation understanding, (2) an integrated definition of human-centered explanation properties, and (3) empirical evidence motivating accessible, user-tailored XAI.

Paperid: 190, https://arxiv.org/pdf/2505.19294.pdf

Abstract:
Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and refuse to answer questions they don't know proactively. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliable methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a "meta ability", which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech.

Paperid: 191, https://arxiv.org/pdf/2504.19423.pdf

Abstract:
MER2025 is the third year of our MER series of challenges, aiming to bring together researchers in the affective computing community to explore emerging trends and future directions in the field. Previously, MER2023 focused on multi-label learning, noise robustness, and semi-supervised learning, while MER2024 introduced a new track dedicated to open-vocabulary emotion recognition. This year, MER2025 centers on the theme "When Affective Computing Meets Large Language Models (LLMs)". We aim to shift the paradigm from traditional categorical frameworks reliant on predefined emotion taxonomies to LLM-driven generative methods, offering innovative solutions for more accurate and reliable emotion understanding. The challenge features four tracks: MER-SEMI focuses on fixed categorical emotion recognition enhanced by semi-supervised learning; MER-FG explores fine-grained emotions, expanding recognition from basic to nuanced emotional states; MER-DES incorporates multimodal cues (beyond emotion words) into predictions to enhance model interpretability; MER-PR investigates whether emotion prediction results can improve personality recognition performance. For the first three tracks, baseline code is available at MERTools, and datasets can be accessed via Hugging Face. For the last track, the dataset and baseline code are available on GitHub.

Paperid: 192, https://arxiv.org/pdf/2505.14680.pdf

Abstract:
Generative AI search is reshaping information retrieval by offering end-to-end answers to complex queries, reducing users' reliance on manually browsing and summarizing multiple web pages. However, while this paradigm enhances convenience, it disrupts the feedback-driven improvement loop that has historically powered the evolution of traditional Web search. Web search engines can continuously improve their ranking models by collecting large-scale, fine-grained user feedback (e.g., clicks, dwell time) at the document level. In contrast, generative AI search operates through a much longer search pipeline, spanning query decomposition, document retrieval, and answer generation, yet typically receives only coarse-grained feedback on the final answer. This introduces a feedback loop disconnect, where user feedback for the final output cannot be effectively mapped back to specific system components, making it difficult to improve each intermediate stage and sustain the feedback loop. In this paper, we envision NExT-Search, a next-generation paradigm designed to reintroduce fine-grained, process-level feedback into generative AI search. NExT-Search integrates two complementary modes: User Debug Mode, which allows engaged users to intervene at key stages; and Shadow User Mode, where a personalized user agent simulates user preferences and provides AI-assisted feedback for less interactive users. Furthermore, we envision how these feedback signals can be leveraged through online adaptation, which refines current search outputs in real-time, and offline update, which aggregates interaction logs to periodically fine-tune query decomposition, retrieval, and generation models. By restoring human control over key stages of the generative AI search pipeline, we believe NExT-Search offers a promising direction for building feedback-rich AI search systems that can evolve continuously alongside human feedback.

Paperid: 193, https://arxiv.org/pdf/2505.03164.pdf

Abstract:
Traditional data presentations typically place the presenter and visualization in two separate spaces--the 3D world and a 2D screen--enforcing visualization-centric stories. To create a more human-centric viewing experience, we establish a more equitable relationship between the visualization and the presenter through our InfoVids. These infographics-inspired informational videos are crafted to redefine relationships between the presenter and visualizations. As we design InfoVids, we explore how the use of layout, form, and interactions affects the viewer experience. We compare InfoVids against their baseline 2D 'slides' equivalents across 9 metrics with 30 participants and provide practical, long-term insights from an autobiographical perspective. Our mixed methods analyses reveal that this paradigm reduced viewer attention splitting, shifted the focus from the visualization to the presenter, and led to more interactive, natural, and engaging full-body data performances for viewers. Ultimately, InfoVids helped viewers re-imagine traditional dynamics between the presenter and visualizations.

Paperid: 194, https://arxiv.org/pdf/2504.12313.pdf

Abstract:
Conversational Recommender Systems (CRSs) engage users in multi-turn interactions to deliver personalized recommendations. The emergence of large language models (LLMs) further enhances these systems by enabling more natural and dynamic user interactions. However, a key challenge remains in understanding how personality traits shape conversational recommendation outcomes. Psychological evidence highlights the influence of personality traits on user interaction behaviors. To address this, we introduce an LLM-based personality-aware user simulation for CRSs (PerCRS). The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs. We incorporate multi-aspect evaluation to ensure robustness and conduct extensive analysis from both user and system perspectives. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits, thereby prompting CRSs to dynamically adjust their recommendation strategies. Our experimental analysis offers empirical insights into the impact of personality traits on the outcomes of conversational recommender systems.

Paperid: 195, https://arxiv.org/pdf/2506.17364.pdf

Abstract:
This work investigates the use of multimodal biometrics to detect distractions caused by smartphone use during tasks that require sustained attention, with a focus on computer-based online learning. Although the methods are applicable to various domains, such as autonomous driving, we concentrate on the challenges learners face in maintaining engagement amid internal (e.g., motivation), system-related (e.g., course design) and contextual (e.g., smartphone use) factors. Traditional learning platforms often lack detailed behavioral data, but Multimodal Learning Analytics (MMLA) and biosensors provide new insights into learner attention. We propose an AI-based approach that leverages physiological signals and head pose data to detect phone use. Our results show that single biometric signals, such as brain waves or heart rate, offer limited accuracy, while head pose alone achieves 87%. A multimodal model combining all signals reaches 91% accuracy, highlighting the benefits of integration. We conclude by discussing the implications and limitations of deploying these models for real-time support in online learning environments.

Paperid: 196, https://arxiv.org/pdf/2505.16505.pdf

Abstract:
Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.

Paperid: 197, https://arxiv.org/pdf/2502.15363.pdf

Abstract:
We present a demonstration of a web-based system called M2LADS ("System for Generating Multimodal Learning Analytics Dashboards"), designed to integrate, synchronize, visualize, and analyze multimodal data recorded during computer-based learning sessions with biosensors. This system presents a range of biometric and behavioral data on web-based dashboards, providing detailed insights into various physiological and activity-based metrics. The multimodal data visualized include electroencephalogram (EEG) data for assessing attention and brain activity, heart rate metrics, eye-tracking data to measure visual attention, webcam video recordings, and activity logs of the monitored tasks. M2LADS aims to assist data scientists in two key ways: (1) by providing a comprehensive view of participants' experiences, displaying all data categorized by the activities in which participants are engaged, and (2) by synchronizing all biosignals and videos, facilitating easier data relabeling if any activity information contains errors.

Paperid: 198, https://arxiv.org/pdf/2501.10977.pdf

Abstract:
This work introduces SMARTe-VR, a platform for student monitoring in an immersive virtual reality environment designed for online education. SMARTe-VR aims to collect data for adaptive learning, focusing on facial biometrics and learning metadata. The platform allows instructors to create customized learning sessions with video lectures, featuring an interface with an AutoQA system to evaluate understanding, interaction tools (for example, textbook highlighting and lecture tagging), and real-time feedback. Furthermore, we released a dataset that contains 5 research challenges with data from 10 users in VR-based TOEIC sessions. This data set, which spans more than 25 hours, includes facial features, learning metadata, 450 responses, difficulty levels of the questions, concept tags, and understanding labels. Alongside the database, we present preliminary experiments using Item Response Theory models, adapted for understanding detection using facial features. Two architectures were explored: a Temporal Convolutional Network for local features and a Multilayer Perceptron for global features.
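
For orientation, the textbook two-parameter logistic (2PL) form of Item Response Theory models the probability of a correct response as a logistic function of learner ability and item difficulty. The sketch below shows only that standard form; the abstract does not specify how the models were adapted to facial features, and the function name `irt_2pl` and the example values are hypothetical.

```python
import math


def irt_2pl(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Probability of a correct response under the standard 2PL IRT model.

    theta          -- learner ability
    difficulty     -- item difficulty (same scale as theta)
    discrimination -- how sharply the item separates abilities (1.0 gives the 1PL form)
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical learner and question: ability slightly above the item's difficulty.
p_correct = irt_2pl(theta=0.8, difficulty=0.5, discrimination=1.2)
print(f"P(correct) = {p_correct:.3f}")
```

When ability equals difficulty the model returns exactly 0.5, which is what makes the difficulty parameter directly interpretable against the question difficulty levels mentioned in the dataset description.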

Paperid: 199, https://arxiv.org/pdf/2506.14468.pdf

Abstract:
Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, offering valuable insights for psychological assessment and criminal investigations. Despite significant progress in automatic ME recognition (MER), existing methods still struggle to simultaneously capture localized muscle activations and global facial dependencies, both essential for decoding subtle emotional cues. To address this challenge, we propose MERba, a hierarchical multi-receptive field architecture specially designed for MER, which incorporates a series of Local-Global Feature Integration stages. Within each stage, detailed intra-window motion patterns are captured using MERba Local Extractors, which integrate MambaVision Mixers with a tailored asymmetric multi-scanning strategy to enhance local spatial sensitivity. These localized features are then aggregated through lightweight self-attention layers that explicitly model inter-window relationships, enabling effective global context construction. Furthermore, to mitigate the challenge of high inter-class similarity among negative MEs, we introduce a Dual-Granularity Classification Module that decomposes the recognition task into a coarse-to-fine paradigm. Extensive experiments on three benchmark datasets demonstrate that MERba consistently outperforms existing methods, with ablation studies confirming the effectiveness of each proposed component.

Paperid: 200, https://arxiv.org/pdf/2505.05937.pdf

Abstract:
As a critical psychological stress response, micro-expressions (MEs) are fleeting and subtle facial movements revealing genuine emotions. Automatic ME recognition (MER) holds valuable applications in fields such as criminal investigation and psychological diagnosis. The Facial Action Coding System (FACS) encodes expressions by identifying activations of specific facial action units (AUs), serving as a key reference for ME analysis. However, current MER methods typically limit AU utilization to defining regions of interest (ROIs) or relying on specific prior knowledge, often resulting in limited performance and poor generalization. To address this, we integrate the CLIP model's powerful cross-modal semantic alignment capability into MER and propose a novel approach namely MER-CLIP. Specifically, we convert AU labels into detailed textual descriptions of facial muscle movements, guiding fine-grained spatiotemporal ME learning by aligning visual dynamics and textual AU-based representations. Additionally, we introduce an Emotion Inference Module to capture the nuanced relationships between ME patterns and emotions with higher-level semantic understanding. To mitigate overfitting caused by the scarcity of ME data, we put forward LocalStaticFaceMix, an effective data augmentation strategy blending facial images to enhance facial diversity while preserving critical ME features. Finally, comprehensive experiments on four benchmark ME datasets confirm the superiority of MER-CLIP. Notably, UF1 scores on CAS(ME)3 reach 0.7832, 0.6544, and 0.4997 for 3-, 4-, and 7-class classification tasks, significantly outperforming previous methods.
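
UF1 (unweighted F1) in the micro-expression literature is commonly computed as the plain average of per-class F1 scores, so that rare emotion classes count as much as frequent ones. A minimal sketch of that computation follows, assuming this common definition; the label lists are made-up toy data, not results from the paper.

```python
def unweighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, each class weighted equally (macro F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 3-class example (hypothetical labels).
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(f"UF1 = {unweighted_f1(y_true, y_pred):.4f}")
```

Because every class contributes equally, a model that ignores a minority class is penalized more heavily than plain accuracy would suggest, which is one reason this metric suits imbalanced ME datasets.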

Paperid: 201, https://arxiv.org/pdf/2504.17238.pdf

Abstract:
Cognitive Restructuring (CR) is a psychotherapeutic process aimed at identifying and restructuring an individual's negative thoughts, arising from mental health challenges, into more helpful and positive ones via multi-turn dialogues. Clinician shortage and stigma urge the development of human-LLM interactive psychotherapy for CR. Yet, existing efforts implement CR via simple text rewriting, fixed-pattern dialogues, or a one-shot CR workflow, failing to align with the psychotherapeutic process for effective CR. To address this gap, we propose CRDial, a novel framework for CR, which creates multi-turn dialogues with specifically designed identification and restructuring stages of negative thoughts, integrates sentence-level supportive conversation strategies, and adopts a multi-channel loop mechanism to enable iterative CR. With CRDial, we distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and 14B scales. Extensive human studies show the superiority of Crispers in pointwise, pairwise, and intervention evaluations.

Paperid: 202, https://arxiv.org/pdf/2502.07663.pdf

Abstract:
Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) employing explicit psychological tactics to reach its hidden objectives. By analyzing participants' decision patterns and shifts in their preference ratings post-interaction, we found significant susceptibility to AI-driven manipulation. Particularly, across both decision-making domains, participants interacting with the manipulative agents shifted toward harmful options at substantially higher rates (financial, MA: 62.3%, SEMA: 59.6%; emotional, MA: 42.3%, SEMA: 41.5%) compared to the NA group (financial, 35.8%; emotional, 12.8%). Notably, our findings reveal that even subtle manipulative objectives (MA) can be as effective as employing explicit psychological strategies (SEMA) in swaying human decision-making. By revealing the potential for covert AI influence, this study highlights a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to ensure responsible deployment of AI technologies and protect human autonomy.
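
The reported rates can be turned into percentage-point gaps against the neutral agent with simple arithmetic. The sketch below uses only the figures quoted in the abstract; the helper name `gap_vs_neutral` is hypothetical, and no statistical testing is implied.

```python
# Rates of shifting toward harmful options (%), copied from the abstract.
rates = {
    "financial": {"NA": 35.8, "MA": 62.3, "SEMA": 59.6},
    "emotional": {"NA": 12.8, "MA": 42.3, "SEMA": 41.5},
}


def gap_vs_neutral(domain_rates):
    """Percentage-point gap between each manipulative agent and the neutral agent."""
    baseline = domain_rates["NA"]
    return {
        agent: round(rate - baseline, 1)
        for agent, rate in domain_rates.items()
        if agent != "NA"
    }


gaps = {domain: gap_vs_neutral(r) for domain, r in rates.items()}
for domain, gap in gaps.items():
    print(domain, gap)
```

The gaps are 26.5 (MA) and 23.8 (SEMA) points in the financial domain and 29.5 (MA) and 28.7 (SEMA) points in the emotional domain, so within each domain the two manipulative agents differ from each other far less than either differs from the neutral agent.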

Paperid: 203, https://arxiv.org/pdf/2505.11200.pdf

Abstract:
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS system evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset ATT-Corpus paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also finetune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).

Paperid: 204, https://arxiv.org/pdf/2504.21242.pdf

Abstract:
The autonomic nervous system (ANS) is activated during stress, which can have negative effects on cardiovascular health, sleep, the immune system, and mental health. While there are ways to quantify ANS activity in laboratories, there is a paucity of methods that have been validated in real-world contexts. We present the Fitbit Body Response Algorithm, an approach to continuous remote measurement of ANS activation through widely available remote wrist-based sensors. The design was validated via two experiments, a Trier Social Stress Test (n = 45) and ecological momentary assessments (EMA) of perceived stress (n = 87), providing both controlled and ecologically valid test data. Model performance predicting perceived stress when using all available sensor modalities was consistent with expectations (accuracy=0.85) and outperformed models with access to only a subset of the signals. We discuss and address challenges to sensing that arise in real-world settings but are not present in conventional lab environments.

Paperid: 205, https://arxiv.org/pdf/2503.23339.pdf

Abstract:
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors and is often cost-prohibitive, labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubrics questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.

Paperid: 206, https://arxiv.org/pdf/2503.03783.pdf

Abstract:
Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during everyday smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, representing the largest validation study of its kind. Compared to reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) < 10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error < 5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.
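
The two error metrics quoted above have standard definitions that are easy to state in code. The sketch below assumes those textbook definitions; the heart-rate values are invented for illustration and are not data from the study.

```python
def mape(reference, measured):
    """Mean absolute percentage error, in percent, relative to the reference values."""
    errors = [abs(m - r) / abs(r) for r, m in zip(reference, measured)]
    return 100.0 * sum(errors) / len(errors)


def mae(reference, measured):
    """Mean absolute error, in the same unit as the inputs (here bpm)."""
    errors = [abs(m - r) for r, m in zip(reference, measured)]
    return sum(errors) / len(errors)


# Hypothetical paired readings in bpm: ECG reference vs. camera-based estimate.
ecg_bpm = [60.0, 72.0, 85.0, 90.0]
video_bpm = [62.0, 70.0, 88.0, 87.0]

print(f"MAPE = {mape(ecg_bpm, video_bpm):.2f}%")
print(f"MAE  = {mae(ecg_bpm, video_bpm):.2f} bpm")
```

MAPE normalizes each error by the reference value, so the same absolute error weighs more at low heart rates, whereas MAE reports the error directly in bpm; this is consistent with the abstract using a percentage threshold for HR and a bpm threshold for daily RHR.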

Paperid: 207, https://arxiv.org/pdf/2502.16796.pdf

Abstract:
Mobile phone agents, which can assist people in automating daily tasks on their phones, have emerged as a pivotal research focus. However, existing procedure-oriented agents struggle with cross-app instructions due to the following challenges: (1) complex task relationships, (2) diverse app environments, and (3) error propagation and information loss in multi-step execution. Drawing inspiration from object-oriented programming principles, we recognize that object-oriented solutions are more suitable for cross-app instructions. To address these challenges, we propose a self-evolving multi-agent framework named MobileSteward, which integrates multiple app-oriented StaffAgents coordinated by a centralized StewardAgent. We design three specialized modules in MobileSteward: (1) Dynamic Recruitment generates a scheduling graph guided by information flow to explicitly associate tasks among apps. (2) Assigned Execution assigns each task to an app-oriented StaffAgent equipped with app-specialized expertise, addressing the diversity between apps. (3) Adjusted Evaluation conducts evaluation to provide reflection tips or deliver key information, which alleviates error propagation and information loss during multi-step execution. To continuously improve the performance of MobileSteward, we develop a Memory-based Self-evolution mechanism, which summarizes the experience from successful executions. We establish the first English Cross-APP Benchmark (CAPBench) in a real-world environment to evaluate agents' capabilities in solving complex cross-app instructions. Experimental results demonstrate that MobileSteward achieves the best performance compared to both single-agent and multi-agent frameworks, highlighting its superiority in handling user instructions of diverse complexity.
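
The idea of a scheduling graph that routes tasks to per-app agents and passes information along its edges can be sketched as below. This is a toy illustration under stated assumptions, not MobileSteward's implementation: the agent functions, the `STAFF` registry, `run_schedule`, and the example tasks are all hypothetical, and the evaluation and self-evolution modules are omitted.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical staff agents: each maps (instruction, upstream results) to a result.
def maps_agent(instruction, inputs):
    return "Cafe Luna, 12 Main St"


def messages_agent(instruction, inputs):
    return f"sent: {instruction} [{'; '.join(inputs)}]"


STAFF = {"Maps": maps_agent, "Messages": messages_agent}


def run_schedule(tasks, depends_on):
    """Run tasks in dependency order, feeding each one its upstream results.

    tasks: task id -> (app name, instruction)
    depends_on: task id -> set of task ids whose output it needs
    """
    results = {}
    for task_id in TopologicalSorter(depends_on).static_order():
        app, instruction = tasks[task_id]
        upstream = [results[dep] for dep in sorted(depends_on.get(task_id, ()))]
        results[task_id] = STAFF[app](instruction, upstream)
    return results


tasks = {
    "t1": ("Maps", "find a nearby cafe"),
    "t2": ("Messages", "share the address with Alex"),
}
depends_on = {"t1": set(), "t2": {"t1"}}
results = run_schedule(tasks, depends_on)
```

Here the edge from `t1` to `t2` carries the address found in one app into the instruction executed in another, which is the kind of cross-app information flow the abstract describes.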

Paperid: 208, https://arxiv.org/pdf/2505.12981.pdf

Abstract:
The growing adoption of large language models (LLMs) has led to a new paradigm in mobile computing--LLM-powered mobile AI agents--capable of decomposing and automating complex tasks directly on smartphones. However, the security implications of these agents remain largely unexplored. In this paper, we present the first comprehensive security analysis of mobile LLM agents, encompassing three representative categories: System-level AI Agents developed by original equipment manufacturers (e.g., YOYO Assistant), Third-party Universal Agents (e.g., Zhipu AI AutoGLM), and Emerging Agent Frameworks (e.g., Alibaba Mobile Agent). We begin by analyzing the general workflow of mobile agents and identifying security threats across three core capability dimensions: language-based reasoning, GUI-based interaction, and system-level execution. Our analysis reveals 11 distinct attack surfaces, all rooted in the unique capabilities and interaction patterns of mobile LLM agents, and spanning their entire operational lifecycle. To investigate these threats in practice, we introduce AgentScan, a semi-automated security analysis framework that systematically evaluates mobile LLM agents across all 11 attack scenarios. Applying AgentScan to nine widely deployed agents, we uncover a concerning trend: every agent is vulnerable to targeted attacks. In the most severe cases, agents exhibit vulnerabilities across eight distinct attack vectors. These attacks can cause behavioral deviations, privacy leakage, or even full execution hijacking. Based on these findings, we propose a set of defensive design principles and practical recommendations for building secure mobile LLM agents. Our disclosures have received positive feedback from two major device vendors. Overall, this work highlights the urgent need for standardized security practices in the fast-evolving landscape of LLM-driven mobile automation.

Paperid: 209, https://arxiv.org/pdf/2506.09420.pdf

Abstract:
Recent improvements in large language models (LLMs) have led many researchers to focus on building fully autonomous AI agents. This position paper questions whether this approach is the right path forward, as these autonomous systems still have problems with reliability, transparency, and understanding the actual requirements of humans. We suggest a different approach: LLM-based Human-Agent Systems (LLM-HAS), where AI works with humans rather than replacing them. By keeping humans involved to provide guidance, answer questions, and maintain control, these systems can be more trustworthy and adaptable. Looking at examples from healthcare, finance, and software development, we show how human-AI teamwork can handle complex tasks better than AI working alone. We also discuss the challenges of building these collaborative systems and offer practical solutions. This paper argues that progress in AI should not be measured by how independent systems become, but by how well they can work with humans. The most promising future for AI is not in systems that take over human roles, but in those that enhance human capabilities through meaningful partnership.

Paperid: 210, https://arxiv.org/pdf/2503.18336.pdf

Abstract:
Academic publishing is facing a crisis driven by exponential growth in submissions and an overwhelmed peer review system, leading to inconsistent decisions and a severe reviewer shortage. This paper introduces Panvas, a platform that reimagines academic publishing as a continuous, community-driven process. Panvas addresses these systemic failures with a novel combination of economic incentives (paid reviews) and rich interaction mechanisms (multi-dimensional ratings, threaded discussions, and expert-led reviews). By moving beyond the traditional accept/reject paradigm and integrating paper hosting with code/data repositories and social networking, Panvas fosters a meritocratic environment for scholarly communication and presents a radical rethinking of how we evaluate and disseminate scientific knowledge. We present the system design, development roadmap, and a user study plan to evaluate its effectiveness.

Paperid: 211, https://arxiv.org/pdf/2501.07320.pdf

Abstract:
Pictorial charts are favored for their memorability and visual appeal, offering a more engaging alternative to basic charts. However, their creation can be complex and time-consuming due to the lack of native support in popular visualization tools like Tableau. While AI-generated content (AIGC) tools have lowered the barrier to creating pictorial charts, they often lack precise design control. To address this issue, we introduce ChartEditor, a human-AI paired tool that transforms basic charts into pictorial versions based on user intent. ChartEditor decomposes chart images into visual components and organizes them within a hierarchical tree. Based on this tree, users can express their intent in natural language, which is then translated into modifications to the hierarchy. In addition, users can directly interact with and modify specific chart components via an intuitive interface to achieve fine-grained design control. A user study demonstrates the effectiveness and usability of ChartEditor in simplifying the creation of pictorial charts.

Paperid: 212, https://arxiv.org/pdf/2501.06317.pdf

Abstract:
Figures and their captions play a key role in scientific publications. However, despite their importance, many captions in published papers are poorly crafted, largely due to a lack of attention by paper authors. While prior AI research has explored caption generation, it has mainly focused on reader-centered use cases, where users evaluate generated captions rather than actively integrating them into their writing. This paper addresses this gap by investigating how paper authors incorporate AI-generated captions into their writing process through a user study involving 18 participants. Each participant rewrote captions for two figures from their own recently published work, using captions generated by state-of-the-art AI models as a resource. By analyzing video recordings of the writing process through interaction analysis, we observed that participants often began by copying and refining AI-generated captions. Paper writers favored longer, detail-rich captions that integrated textual and visual elements but found current AI models less effective for complex figures. These findings highlight the nuanced and diverse nature of figure caption composition, revealing design opportunities for AI systems to better support the challenges of academic writing.

Paperid: 213, https://arxiv.org/pdf/2504.15007.pdf

Abstract:
Eye-tracking analysis plays a vital role in medical imaging, providing key insights into how radiologists visually interpret and diagnose clinical cases. In this work, we first analyze radiologists' attention and agreement by measuring the distribution of various eye-movement patterns, including saccade direction, amplitude, and their joint distribution. These metrics help uncover patterns in attention allocation and diagnostic strategies. Furthermore, we investigate whether and how doctors' gaze behavior shifts when viewing authentic (Real) versus deep-learning-generated (Fake) images. To achieve this, we examine fixation bias maps, focusing on first, last, short, and longest fixations independently, along with detailed saccade patterns, to quantify differences in gaze distribution and visual saliency between authentic and synthetic images.
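
Saccade amplitude and direction can be derived from consecutive fixation positions, as in the minimal sketch below. It assumes fixations are given as (x, y) points in a common unit (for example pixels) and measures direction counter-clockwise from the positive x axis; the function name and sample points are illustrative, and the paper's actual preprocessing may differ (for instance, by converting amplitudes to degrees of visual angle).

```python
import math


def saccades(fixations):
    """Return (amplitude, direction_deg) for each pair of consecutive fixations."""
    out = []
    for (x0, y0), (x1, y1) in zip(fixations, fixations[1:]):
        dx, dy = x1 - x0, y1 - y0
        amplitude = math.hypot(dx, dy)  # Euclidean length of the jump
        direction = math.degrees(math.atan2(dy, dx)) % 360.0  # angle in [0, 360)
        out.append((amplitude, direction))
    return out


# Hypothetical fixation centers; three fixations give two saccades.
fixations = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
result = saccades(fixations)
```

Binning the resulting pairs into a 2D histogram would give a joint distribution of direction and amplitude. Note that in image coordinates the y axis usually points down, which flips the sense of the angle.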

Paperid: 214, https://arxiv.org/pdf/2503.16435.pdf

Abstract:
Landscape design is a complex process that requires designers to engage in intricate planning, analysis, and decision-making. This process involves the integration and reconstruction of science, art, and technology. Traditional landscape design methods often rely on the designer's personal experience and subjective aesthetics, with design standards rooted in subjective perception. As a result, they lack scientific and objective evaluation criteria and systematic design processes. Data-driven artificial intelligence (AI) technology provides an objective and rational design process. With the rapid development of different AI technologies, AI-generated content (AIGC) has permeated various aspects of landscape design at an unprecedented speed, serving as an innovative design tool. This article aims to explore the applications and opportunities of AIGC in landscape design. AIGC can support landscape design in areas such as site research and analysis, design concepts and scheme generation, parametric design optimization, plant selection and visual simulation, construction management, and process optimization. However, AIGC also faces challenges in landscape design, including data quality and reliability, design expertise and judgment, technical challenges and limitations, site characteristics and sustainability, user needs and participation, the balance between technology and creativity, ethics, and social impact. Finally, this article provides a detailed outlook on the future development trends and prospects of AIGC in landscape design. Through in-depth research and exploration in this review, readers can gain a better understanding of the relevant applications, potential opportunities, and key challenges of AIGC in landscape design.

Paperid: 215, https://arxiv.org/pdf/2506.05904.pdf

Abstract:
Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/

Paperid: 216, https://arxiv.org/pdf/2506.05606.pdf

Abstract:
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale given a persona and history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

Paperid: 217, https://arxiv.org/pdf/2504.10839.pdf

Abstract:
The last couple of years have witnessed emerging research that appropriates Theory-of-Mind (ToM) tasks designed for humans to benchmark LLMs' ToM capabilities as an indication of LLMs' social intelligence. However, this approach has a number of limitations. Drawing on existing psychology and AI literature, we summarize the theoretical, methodological, and evaluation limitations by pointing out that certain issues are inherently present in the original ToM tasks used to evaluate humans' ToM, and that these issues persist and are exacerbated when the tasks are appropriated to benchmark LLMs' ToM. Taking a human-computer interaction (HCI) perspective, these limitations prompt us to rethink the definition and criteria of ToM in ToM benchmarks in a more dynamic, interactional approach that accounts for user preferences, needs, and experiences with LLMs in such evaluations. We conclude by outlining potential opportunities and challenges in this direction.

Paperid: 218, https://arxiv.org/pdf/2502.17909.pdf

Abstract:
With the proliferation of data across various domains, there is a critical demand for tools that enable non-experts to derive meaningful insights without deep data analysis skills. To address this need, existing automatic fact sheet generation tools offer heuristic-based solutions to extract facts and generate stories. However, they inadequately grasp the semantics of data and struggle to generate narratives that fully capture the semantics of the dataset or align the fact sheet with specific user needs. Addressing these shortcomings, this paper introduces a novel tool designed for the automatic generation and customisation of fact sheets. The tool applies the concept of collaborative AI workers to transform raw tabular datasets into comprehensive, visually compelling fact sheets. We define an effective taxonomy to profile AI workers for specialised tasks. Furthermore, the tool empowers users to refine these fact sheets through intuitive natural language commands, ensuring the final outputs align closely with individual preferences and requirements. Our user evaluation with 18 participants confirms that the tool not only surpasses state-of-the-art baselines in automated fact sheet production but also provides a positive user experience during customisation tasks.

Paperid: 219, https://arxiv.org/pdf/2505.21116.pdf

Abstract:
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.

Paperid: 220, https://arxiv.org/pdf/2502.17921.pdf

Abstract:
Recommendation systems are now an integral part of our daily lives. We rely on them for tasks such as discovering new movies, finding friends on social media, and connecting job seekers with relevant opportunities. Given their vital role, we must ensure these recommendations are free from societal stereotypes. Therefore, evaluating and addressing such biases in recommendation systems is crucial. Previous work evaluating the fairness of recommended items fails to capture certain nuances, as it mainly focuses on comparing performance metrics for different sensitive groups. In this paper, we introduce a set of comprehensive metrics for quantifying gender bias in recommendations. Specifically, we show the importance of evaluating fairness on a more granular level, which our metrics achieve by capturing gender bias across categories of recommended items, such as genres for movies. Furthermore, we show that employing a category-aware fairness metric as a regularization term along with the main recommendation loss during training can help effectively minimize bias in the models' output. We experiment on three real-world datasets, using five baseline models alongside two popular fairness-aware models, to show the effectiveness of our metrics in evaluating gender bias. Our metrics provide enhanced insight into bias in recommended items compared to previous metrics. Additionally, our results demonstrate how incorporating our regularization term significantly improves the fairness of recommendations for different categories without substantial degradation in overall recommendation performance.
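
The general idea of adding a category-aware fairness term to a recommendation loss can be sketched as follows. The gap measure below (mean absolute difference in average predicted score per item category between two user groups) is a stand-in chosen for illustration; it is not the metric defined in the paper, and all names, data, and the weight `lam` are assumptions.

```python
def category_gap(scores, user_group, item_category):
    """Mean, over item categories, of the absolute difference between the
    average scores that two user groups receive for items in that category."""
    groups = sorted(set(user_group))
    if len(groups) != 2:
        raise ValueError("this sketch assumes exactly two user groups")
    sums = {}  # (category, group) -> (running total, count)
    for u, row in enumerate(scores):
        for i, score in enumerate(row):
            key = (item_category[i], user_group[u])
            total, count = sums.get(key, (0.0, 0))
            sums[key] = (total + score, count + 1)
    gaps = []
    for category in sorted(set(item_category)):
        mean_0 = sums[(category, groups[0])][0] / sums[(category, groups[0])][1]
        mean_1 = sums[(category, groups[1])][0] / sums[(category, groups[1])][1]
        gaps.append(abs(mean_0 - mean_1))
    return sum(gaps) / len(gaps)


def regularized_loss(rec_loss, scores, user_group, item_category, lam=0.5):
    """Recommendation loss plus a weighted fairness penalty."""
    return rec_loss + lam * category_gap(scores, user_group, item_category)


# Hypothetical predicted scores: rows are users, columns are items.
scores = [[0.9, 0.1], [0.2, 0.8]]
user_group = ["f", "m"]
item_category = ["action", "romance"]
gap = category_gap(scores, user_group, item_category)
loss = regularized_loss(1.0, scores, user_group, item_category, lam=0.5)
```

In actual training the penalty would be written with a differentiable tensor library so that gradients flow through the scores; plain Python is used here only to keep the arithmetic visible.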

Paperid: 221, https://arxiv.org/pdf/2506.23678.pdf

Abstract:
The output quality of large language models (LLMs) can be improved via "reasoning": generating segments of chain-of-thought (CoT) content to further condition the model prior to producing user-facing output. While these chains contain valuable information, they are verbose and lack explicit organization, making them tedious to review. Moreover, they lack opportunities for user feedback, such as to remove unwanted considerations, add desired ones, or clarify unclear assumptions. We introduce Interactive Reasoning, an interaction design that visualizes chain-of-thought outputs as a hierarchy of topics and enables user review and modification. We implement interactive reasoning in Hippo, a prototype for AI-assisted decision making in the face of uncertain trade-offs. In a user study with 16 participants, we find that interactive reasoning in Hippo allows users to quickly identify and interrupt erroneous generations, efficiently steer the model towards customized responses, and better understand both model reasoning and model outputs. Our work contributes to a new paradigm that incorporates user oversight into LLM reasoning processes.

Paperid: 222, https://arxiv.org/pdf/2505.09875.pdf

Abstract:
The proliferation of Large Language Model (LLM)-based Graphical User Interface (GUI) agents in web browsing scenarios presents complex unintended consequences (UCs). This paper characterizes three UCs from three perspectives: phenomena, influence and mitigation, drawing on social media analysis (N=221 posts) and semi-structured interviews (N=14). Key phenomena of UCs include agents' deficiencies in comprehending instructions and planning tasks, challenges in executing accurate GUI interactions and adapting to dynamic interfaces, the generation of unreliable or misaligned outputs, and shortcomings in error handling and feedback processing. These phenomena manifest as influences ranging from unanticipated actions and user frustration, to privacy violations and security vulnerabilities, and further to eroded trust and wider ethical concerns. Our analysis also identifies user-initiated mitigation, such as technical adjustments and manual oversight, and provides implications for designing future LLM-based GUI agents that are robust, user-centric, and transparent, fostering a crucial balance between automation and human oversight.

Paperid: 223, https://arxiv.org/pdf/2504.13351.pdf

Abstract:
Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data -- videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at https://chain-of-modality.github.io

Paperid: 224, https://arxiv.org/pdf/2502.06179.pdf

Abstract:
Driver decision quality in take-overs is critical for effective human-Autonomous Driving System (ADS) collaboration. However, current research lacks detailed analysis of its variations. This paper introduces two metrics--Actual Achieved Gain (AAG) and Optimal Perceived Gain (OPG)--to assess decision quality, with OPG representing optimal decisions and AAG reflecting actual outcomes. Both are calculated as weighted averages of perceived gains and losses, influenced by ADS accuracy. Study 1 (N=315) used a 21-point Thurstone scale to measure perceived gains and losses--key components of AAG and OPG--across typical tasks: route selection, overtaking, and collision avoidance. Studies 2 (N=54) and 3 (N=54) modeled decision quality under varying ADS accuracy and decision time. Results show that with sufficient time (>3.5s), AAG converges towards OPG, indicating rational decision-making, while limited time leads to intuitive and deterministic choices. Study 3 also linked AAG-OPG deviations to irrational behaviors. An intervention study (N=8) and a pilot (N=4) employing voice alarms and multi-modal alarms based on these deviations demonstrated AAG's potential to improve decision quality.

Paperid: 225, https://arxiv.org/pdf/2501.12300.pdf

Abstract:
While learning personalization offers great potential for learners, modern practices in higher education require a deeper consideration of domain models and learning contexts to develop effective personalization algorithms. This paper introduces an innovative approach to higher education curriculum modelling that utilizes large language models (LLMs) for knowledge graph (KG) completion, with the goal of creating personalized learning-path recommendations. Our research focuses on modelling university subjects and linking their topics to corresponding domain models, enabling the integration of learning modules from different faculties and institutions in the student's learning path. Central to our approach is a collaborative process, where LLMs assist human experts in extracting high-quality, fine-grained topics from lecture materials. We develop domain, curriculum, and user models for university modules and stakeholders. We implement these models to create the KG from two study modules: Embedded Systems and Development of Embedded Systems Using FPGA. The resulting KG structures the curriculum and links it to the domain models. We evaluate our approach through qualitative expert feedback and quantitative graph quality metrics. Domain experts validated the relevance and accuracy of the model, while the graph quality metrics measured the structural properties of our KG. Our results show that the LLM-assisted graph completion approach enhances the ability to connect related courses across disciplines to personalize the learning experience. Expert feedback also showed high acceptance of the proposed collaborative approach for concept extraction and classification.

Paperid: 226, https://arxiv.org/pdf/2506.15468.pdf

Abstract:
We propose co-creative learning as a novel paradigm where humans and AI, i.e., biological and artificial agents, mutually integrate their partial perceptual information and knowledge to construct shared external representations, a process we interpret as symbol emergence. Unlike traditional AI teaching based on unilateral knowledge transfer, this addresses the challenge of integrating information from inherently different modalities. We empirically test this framework using a human-AI interaction model based on the Metropolis-Hastings naming game (MHNG), a decentralized Bayesian inference mechanism. In an online experiment, 69 participants played a joint attention naming game (JA-NG) with one of three computer agent types (MH-based, always-accept, or always-reject) under partial observability. Results show that human-AI pairs with an MH-based agent significantly improved categorization accuracy through interaction and achieved stronger convergence toward a shared sign system. Furthermore, human acceptance behavior aligned closely with the MH-derived acceptance probability. These findings provide the first empirical evidence for co-creative learning emerging in human-AI dyads via MHNG-based interaction. This suggests a promising path toward symbiotic AI systems that learn with humans, rather than from them, by dynamically aligning perceptual experiences, opening a new avenue for symbiotic AI alignment.
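
The acceptance rule behind an MH-based listener can be sketched as below. It assumes the usual Metropolis-Hastings form, in which the listener accepts a proposed sign with probability min(1, p(proposed) / p(current)) under its own belief; the belief table, sign labels, and function names are hypothetical, and the paper's exact formulation may differ in detail. The always-accept and always-reject baselines correspond to fixing this probability at 1 and 0.

```python
import random


def mh_acceptance_probability(listener_belief, proposed, current):
    """Probability that the listener accepts the proposed sign over its current one."""
    p_new = listener_belief[proposed]
    p_cur = listener_belief[current]
    if p_cur == 0.0:
        return 1.0  # any proposal is at least as good as an impossible current sign
    return min(1.0, p_new / p_cur)


def listener_step(listener_belief, proposed, current, rng):
    """Return the sign the listener holds after one round of the game."""
    r = mh_acceptance_probability(listener_belief, proposed, current)
    return proposed if rng.random() < r else current


# Hypothetical listener belief over three signs for the object being named.
belief = {"A": 0.6, "B": 0.3, "C": 0.1}
p_up = mh_acceptance_probability(belief, "A", "B")    # proposal is more plausible
p_down = mh_acceptance_probability(belief, "C", "B")  # proposal is less plausible
kept = listener_step(belief, "A", "B", random.Random(0))
```

A proposal that the listener finds more plausible than its current sign is always accepted, while a less plausible one is accepted only occasionally, which is what lets the two agents converge without sharing their internal states.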

Paperid: 227, https://arxiv.org/pdf/2501.04227.pdf

Abstract:
Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing--to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

Paperid: 228, https://arxiv.org/pdf/2504.10918.pdf

Abstract:
The rapid advancement of AI, including Large Language Models, has propelled autonomous agents forward, accelerating the human-agent teaming (HAT) paradigm to leverage complementary strengths. However, HAT research remains fragmented, often focusing on isolated team development phases or specific challenges like trust calibration while overlooking the real-world need for adaptability. Addressing these gaps, a process dynamics perspective is adopted to systematically review HAT using the T$^4$ framework: Team Formation, Task and Role Development, Team Development, and Team Improvement. Each phase is examined in terms of its goals, actions, and evaluation metrics, emphasizing the co-evolution of task and team dynamics. Special focus is given to the second and third phases, highlighting key factors such as team roles, shared mental models, and backup behaviors. This holistic perspective identifies future research directions for advancing long-term adaptive HAT.

Paperid: 229, https://arxiv.org/pdf/2503.02151.pdf

Abstract:
To mitigate the negative impacts of online videos on teenagers, existing research and platforms have implemented various parental mediation mechanisms, such as Parent-Child Joint Media Engagement (JME). However, JME generally relies heavily on parents' time, knowledge, and experience. To fill this gap, we aim to design an automatic tool to help parents and children censor videos more effectively and efficiently in JME. To this end, we first conducted a formative study to identify the needs and expectations of teenagers and parents for such a system. Based on the findings, we designed YouthCare, a personalized collaborative video censorship tool that helps parents and children collaboratively filter out inappropriate content and select appropriate content in JME. An evaluation with 10 parent-child pairs demonstrated several of YouthCare's strengths in supporting video censorship, while also highlighting some potential problems. These findings inspire us to propose several insights for the future design of parent-child collaborative JME systems.

Paperid: 230, https://arxiv.org/pdf/2503.01358.pdf

Abstract:
With increasing social mobility and an aging society, more older adults in China are migrating to new cities, known as "older drifters." Due to fewer social connections and cultural adaptation challenges, they face negative emotions such as loneliness and depression. While reminiscence-based interventions have been used to improve older adults' psychological well-being, challenges such as the lack of tangible materials and limited social resources constrain the feasibility of traditional reminiscence approaches for older drifters. To address this challenge, we designed RemiHaven, a personalized reminiscence support tool based on a two-phase formative study. It integrates "In-Town" and "Out-of-Town" peer agents to enhance personalization, engagement, and emotional resonance in the reminiscence process, powered by Multimodal Large Language Models (MLLMs). Our evaluations show RemiHaven's strengths in supporting reminiscence while identifying potential challenges. We conclude by offering insights for the future design of reminiscence support tools for older migrants.

Paperid: 231, https://arxiv.org/pdf/2506.13477.pdf

Abstract:
Dynamic facial emotion is essential for believable AI-generated avatars, yet most systems remain visually static, limiting their use in simulations like virtual training for investigative interviews with abused children. We present a real-time architecture combining Unreal Engine 5 MetaHuman rendering with NVIDIA Omniverse Audio2Face to generate facial expressions from vocal prosody in photorealistic child avatars. Due to limited TTS options, both avatars were voiced using young adult female models from two systems to better fit character profiles, introducing a voice-age mismatch. This confound may affect audiovisual alignment. We used a two-PC setup to decouple speech generation from GPU-intensive rendering, enabling low-latency interaction in desktop and VR. A between-subjects study (N=70) compared audio+visual vs. visual-only conditions as participants rated emotional clarity, facial realism, and empathy for avatars expressing joy, sadness, and anger. While emotions were generally recognized - especially sadness and joy - anger was harder to detect without audio, highlighting the role of voice in high-arousal expressions. Interestingly, silencing clips improved perceived realism by removing mismatches between voice and animation, especially when tone or age felt incongruent. These results emphasize the importance of audiovisual congruence: mismatched voice undermines expression, while a good match can enhance weaker visuals - posing challenges for emotionally coherent avatars in sensitive contexts.

Paperid: 232, https://arxiv.org/pdf/2501.06869.pdf

Abstract:
Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential for breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired extensive knowledge of breast structures, pathological features, and clinical variations. With few-shot adaptation, BUSGen can generate repositories of realistic and informative task-specific data, facilitating the development of models for a wide range of downstream tasks. Extensive experiments highlight BUSGen's exceptional adaptability, significantly exceeding real-data-trained foundational models in breast cancer screening, diagnosis, and prognosis. In breast cancer early diagnosis, our approach outperformed all board-certified radiologists (n=9), achieving an average sensitivity improvement of 16.5% (P-value<0.0001). Additionally, we characterized the scaling effect of using generated data, which was as effective as the collected real-world data for training diagnostic models. Moreover, extensive experiments demonstrated that our approach improved the generalization ability of downstream models. Importantly, BUSGen protected patient privacy by enabling fully de-identified data sharing, advancing secure medical data utilization. An online demo of BUSGen is available at https://aibus.bio.

Paperid: 233, https://arxiv.org/pdf/2501.05714.pdf

Abstract:
With the advancement of large language models (LLMs), intelligent models have evolved from mere tools to autonomous agents with their own goals and strategies for cooperating with humans. This evolution has birthed a novel paradigm in NLP, i.e., human-model cooperation, that has yielded remarkable progress in numerous NLP tasks in recent years. In this paper, we take the first step to present a thorough review of human-model cooperation, exploring its principles, formalizations, and open challenges. In particular, we introduce a new taxonomy that provides a unified perspective to summarize existing approaches. Also, we discuss potential frontier areas and their corresponding challenges. We regard our work as an entry point, paving the way for more breakthrough research in this regard.

Paperid: 234, https://arxiv.org/pdf/2505.04172.pdf

Abstract:
Smart rings offer a convenient way to continuously and unobtrusively monitor cardiovascular physiological signals. However, a gap remains between the ring hardware and reliable methods for estimating cardiovascular parameters, partly due to the lack of publicly available datasets and standardized analysis tools. In this work, we present $τ$-Ring, the first open-source ring-based dataset designed for cardiovascular physiological sensing. The dataset comprises photoplethysmography signals (infrared and red channels) and 3-axis accelerometer data collected from two rings (reflective and transmissive optical paths), with 28.21 hours of raw data from 34 subjects across seven activities. $τ$-Ring encompasses both stationary and motion scenarios, as well as stimulus-evoked abnormal physiological states, annotated with four ground-truth labels: heart rate, respiratory rate, oxygen saturation, and blood pressure. Using our proposed RingTool toolkit, we evaluated three widely-used physics-based methods and four cutting-edge deep learning approaches. Our results show superior performance compared to commercial rings, achieving best MAE values of 5.18 BPM for heart rate, 2.98 BPM for respiratory rate, 3.22% for oxygen saturation, and 13.33/7.56 mmHg for systolic/diastolic blood pressure estimation. The open-sourced dataset and toolkit aim to foster further research and community-driven advances in ring-based cardiovascular health sensing.

Paperid: 235, https://arxiv.org/pdf/2505.03185.pdf

Abstract:
Ingestive behavior plays a critical role in health, yet many existing interventions remain limited to static guidance or manual self-tracking. With the increasing integration of sensors, context-aware computing, and perceptual computing, recent systems have begun to support closed-loop interventions that dynamically sense user behavior and provide feedback during or around ingestion episodes. In this survey, we review 136 studies that leverage sensor-enabled or interaction-mediated approaches to influence ingestive behavior. We propose a behavioral closed-loop paradigm rooted in context-aware computing and inspired by HCI behavior change frameworks, comprising four components: target behaviors, sensing modalities, reasoning and intervention strategies. A taxonomy of sensing and intervention modalities is presented, organized along human- and environment-based dimensions. Our analysis also examines evaluation methods and design trends across different modality-behavior pairings. This review reveals prevailing patterns and critical gaps, offering design insights for future adaptive and context-aware ingestion health interventions.

Paperid: 236, https://arxiv.org/pdf/2502.10378.pdf

Abstract:
English as a Second Language (ESL) learners often encounter unknown words that hinder their text comprehension. Automatically detecting these words as users read can enable computing systems to provide just-in-time definitions, synonyms, or contextual explanations, thereby helping users learn vocabulary in a natural and seamless manner. This paper presents EyeLingo, a transformer-based machine learning method that predicts the probability of unknown words based on text content and eye gaze trajectory in real time with high accuracy. A 20-participant user study revealed that our method can achieve an accuracy of 97.6%, and an F1-score of 71.1%. We implemented a real-time reading assistance prototype to show the effectiveness of EyeLingo. The user study shows improvement in willingness to use and usefulness compared to baseline methods.
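The gap between the reported accuracy and F1-score is the pattern one would expect when unknown words are rare relative to known ones. The sketch below illustrates this with a hypothetical confusion matrix; the counts and the function name `accuracy_and_f1` are assumptions for illustration and are not taken from the paper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts: 1000 words read, of which only 50 are unknown.
acc, f1 = accuracy_and_f1(tp=35, fp=15, fn=15, tn=935)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

Because the many correctly classified known words dominate accuracy, F1 (which ignores true negatives) is usually the more informative figure for the rare unknown-word class.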

Paperid: 237, https://arxiv.org/pdf/2502.10124.pdf

Abstract:
While users can embody virtual avatars that mirror their physical movements in Virtual Reality, these avatars' motions can be redirected to enable novel interactions. Excessive redirection, however, could break the user's sense of embodiment due to perceptual conflicts between vision and proprioception. While prior work focused on avatar-related factors influencing the noticeability of redirection, we investigate how the visual stimuli in the surrounding virtual environment affect user behavior and, in turn, the noticeability of redirection. Given the wide variety of visual stimuli and their tendency to elicit varying individual reactions, we propose to use users' gaze behavior as an indicator of their response to the stimuli and model the noticeability of redirection. We conducted two user studies to collect users' gaze behavior and noticeability, investigating the relationship between them and identifying the most effective gaze behavior features for predicting noticeability. Based on the data, we developed a regression model that takes users' gaze behavior as input and outputs the noticeability of redirection. We then conducted an evaluation study to test our model on unseen visual stimuli, achieving a mean squared error (MSE) of 0.012. We further implemented an adaptive redirection technique and conducted a proof-of-concept study to evaluate its effectiveness with complex visual stimuli in two applications. The results indicated that participants experienced less physical demand and a stronger sense of body ownership when using our adaptive technique, demonstrating the potential of our model to support real-world use cases.

Paperid: 238, https://arxiv.org/pdf/2502.02459.pdf

Abstract:
A smart ring is a wearable electronic device in the form of a ring that incorporates diverse sensors and computing technologies to perform a variety of functions. Designed for use with fingers, smart rings are capable of sensing more subtle and abundant hand movements, thus making them a good platform for interaction. Meanwhile, fingers are abundant with blood vessels and nerve endings and accustomed to wearing rings, providing an ideal site for continuous health monitoring through smart rings, which combine comfort with the ability to capture vital biometric data, making them suitable for all-day wear. We collected a total of 206 smart ring-related publications and conducted a systematic literature review. We provide a taxonomy regarding the sensing and feedback modalities, applications, and phenomena. We review and categorize this literature into four main areas: (1) interaction - input, (2) interaction - output, (3) passive sensing - in body feature, (4) passive sensing - out body activity. This comprehensive review highlights the current advancements within the field of smart rings and identifies potential areas for future research.

Paperid: 239, https://arxiv.org/pdf/2502.01325.pdf

Abstract:
Parental involvement in homework is a crucial aspect of family education, but it often triggers emotional strain and conflicts. Despite growing concern over its impact on family well-being, prior research has lacked access to fine-grained, real-time dynamics of these interactions. To bridge this gap, we present a framework that leverages naturalistic parent-child interaction data and large language models (LLMs) to analyse homework conversations at scale. In a four-week in situ study with 78 Chinese families, we collected 475 hours of audio recordings and accompanying daily surveys, capturing 602 homework sessions in everyday home settings. Our LLM-based pipeline reliably extracted and categorised parental behaviours and conflict patterns from transcribed conversations, achieving high agreement with expert annotations. The analysis revealed significant emotional shifts in parents before and after homework, 18 recurring parental behaviours and seven common conflict types, with Knowledge Conflict being the most frequent. Notably, even well-intentioned behaviours were significantly positively correlated with specific conflicts. This work advances ubiquitous computing methods for studying complex family dynamics and offers empirical insights to enrich family education theory and inform more effective parenting strategies and interventions in the future.

Paperid: 240, https://arxiv.org/pdf/2504.10905.pdf

Abstract:
Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

Paperid: 241, https://arxiv.org/pdf/2503.14948.pdf

Abstract:
Surround-view perception has garnered significant attention for its ability to enhance the perception capabilities of autonomous driving vehicles through the exchange of information with surrounding cameras. However, existing surround-view perception systems are limited by inefficiencies in the unidirectional interaction pattern with humans and by distortions in overlapping regions propagating exponentially into non-overlapping areas. To address these challenges, this paper introduces ChatStitch, a surround-view human-machine co-perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To dismantle the unidirectional interaction bottleneck, ChatStitch implements a cognitively grounded closed-loop interaction multi-agent framework based on Large Language Models. To suppress distortion propagation across overlapping boundaries, ChatStitch proposes SV-UDIS, a surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the UDIS-D and MCOV-SLAM open datasets and on our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively.
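For readers unfamiliar with the PSNR metric cited above, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), applied to flattened pixel lists. The helper `relative_improvement` shows one plausible reading of a percentage improvement (relative change over a baseline value); whether the paper computes its percentages this way is an assumption, and all names and numbers here are illustrative.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_improvement(baseline, improved):
    """Relative change of a metric over its baseline (assumed interpretation)."""
    return (improved - baseline) / baseline


# Illustrative values only, not results from the paper.
print(psnr([52, 55, 61, 59], [50, 56, 60, 62]))
print(relative_improvement(20.0, 21.8))
```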

Paperid: 242, https://arxiv.org/pdf/2505.20623.pdf

Abstract:
As algorithmic systems increasingly structure platform labor, workers often rely on informal "folk theories", experience-based beliefs about how algorithms work, to navigate opaque and unstable algorithmic environments. Prior research has largely treated these theories as bottom-up, peer-driven strategies for coping with algorithmic opacity and uncertainty. In this study, we shift analytical attention to intermediary organizations and examine how folk theories of algorithms can be institutionally constructed and operationalized by those organizations as tools of labor management. Drawing on nine months of ethnographic fieldwork and 37 interviews with live-streamers and staff at Multi-Channel Networks (MCNs) in China, we show that MCNs develop and circulate dual algorithmic theories: internally, they acknowledge the volatility of platform systems and adopt probabilistic strategies to manage risk; externally, they promote simplified, prescriptive theories portraying the algorithm as transparent, fair, and responsive to individual effort. They further operationalize those folk theories for labor management, encouraging streamers to self-discipline and invest in equipment, training, and routines, while absolving MCNs of accountability. We contribute to CSCW and platform labor literature by demonstrating how informal algorithmic knowledge, once institutionalized, can become infrastructures of soft control -- shaping not only how workers interpret platform algorithms, but also how their labor is structured, moralized and governed.

Paperid: 243, https://arxiv.org/pdf/2505.10415.pdf

Abstract:
Accurately estimating human internal states, such as personality traits or behavioral patterns, is critical for enhancing the effectiveness of human-robot interaction, particularly in group settings. These insights are key in applications ranging from social navigation to autism diagnosis. However, prior methods are limited by scalability and passive observation, making real-time estimation in complex, multi-human settings difficult. In this work, we propose a practical method for active human personality estimation in groups, with a focus on applications related to Autism Spectrum Disorder (ASD). Our method combines a personality-conditioned behavior model, based on the Eysenck 3-Factor theory, with an active robot information gathering policy that triggers human behaviors through a receding-horizon planner. The robot's belief about human personality is then updated via Bayesian inference. We demonstrate the effectiveness of our approach through simulations, user studies with typical adults, and preliminary experiments involving participants with ASD. Our results show that our method can scale to tens of humans and reduce personality prediction error by 29.2% and uncertainty by 79.9% in simulation. User studies with typical adults confirm the method's ability to generalize across complex personality distributions. Additionally, we explore its application in autism-related scenarios, demonstrating that the method can identify the difference between neurotypical and autistic behavior, highlighting its potential for diagnosing ASD. The results suggest that our framework could serve as a foundation for future ASD-specific interventions.
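The Bayesian belief update mentioned above can be sketched in its simplest form. This is a minimal illustration over a discrete set of hypothetical personality types; the paper's actual state space (based on the Eysenck 3-Factor theory), behavior model, and planner are not reproduced here, and the type labels, likelihood values, and function name `bayes_update` are assumptions.

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses given a prior and P(observation | hypothesis)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical example: the robot starts with a uniform belief and then
# observes a behavior that is three times as likely for an extravert.
prior = {"introvert": 0.5, "extravert": 0.5}
likelihood = {"introvert": 0.2, "extravert": 0.6}
posterior = bayes_update(prior, likelihood)
print(posterior)
```

In an active setting, the robot would choose actions expected to elicit behaviors whose likelihoods differ most across hypotheses, so that each update reduces uncertainty as much as possible.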

Paperid: 244, https://arxiv.org/pdf/2502.18864.pdf

Authors: Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan

Abstract:
Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypothesis generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher in an era of AI-empowered scientists.

Paperid: 245, https://arxiv.org/pdf/2506.14820.pdf

Abstract:
Visual analytics using dimensionality reduction (DR) can easily be unreliable for various reasons, e.g., inherent distortions in representing the original data. The literature has thus proposed a wide range of methodologies to make DR-based visual analytics reliable. However, the diversity and extensiveness of the literature can leave novice analysts and researchers uncertain about where to begin and proceed. To address this problem, we propose a guide for reading papers for reliable visual analytics with DR. Relying on the previous classification of the relevant literature, our guide helps practitioners to both (1) assess their current DR expertise and (2) identify papers that will further enhance their understanding. Interview studies with three experts in DR and data visualizations validate the significance, comprehensiveness, and usefulness of our guide.

Paperid: 246, https://arxiv.org/pdf/2506.09065.pdf

Abstract:
The prevalence of Autism Spectrum Disorder (ASD) has surged rapidly over the past decade, posing significant challenges in communication, behavior, and focus for affected individuals. Current diagnostic techniques, though effective, are time-intensive, leading to high social and economic costs. This work introduces an AI-powered assistive technology designed to streamline ASD diagnosis and management, enhancing convenience for individuals with ASD and efficiency for caregivers and therapists. The system integrates transfer learning with image transforms derived from eye gaze variables to diagnose ASD. This facilitates and opens opportunities for in-home periodical diagnosis, reducing stress for individuals and caregivers, while also preserving user privacy through the use of image transforms. The accessibility of the proposed method also offers opportunities for improved communication between guardians and therapists, ensuring regular updates on progress and evolving support needs. Overall, the approach proposed in this work ensures timely, accessible diagnosis while protecting the subjects' privacy, improving outcomes for individuals with ASD.

Paperid: 247, https://arxiv.org/pdf/2506.08725.pdf

Abstract:
Misuses of t-SNE and UMAP in visual analytics have become increasingly common. For example, although t-SNE and UMAP projections often do not faithfully reflect true distances between clusters, practitioners frequently use them to investigate inter-cluster relationships. In this paper, we bring this issue to the surface and comprehensively investigate why such misuse occurs and how to prevent it. We conduct a literature review of 114 papers to verify the prevalence of the misuse and analyze the reasonings behind it. We then execute an interview study to uncover practitioners' implicit motivations for using these techniques -- rationales often undisclosed in the literature. Our findings indicate that misuse of t-SNE and UMAP primarily stems from limited discourse on their appropriate use in visual analytics. We conclude by proposing future directions and concrete action items to promote more reasonable use of DR.

Paperid: 248, https://arxiv.org/pdf/2505.02582.pdf

Abstract:
This work presents FlyHaptics, an aerial haptic interface tracked via a Vicon optical motion capture system and built around six five-bar linkage assemblies enclosed in a lightweight protective cage. We predefined five static tactile patterns - each characterized by distinct combinations of linkage contact points and vibration intensities - and evaluated them in a grounded pilot study, where participants achieved 86.5% recognition accuracy with no significant differences between patterns (F(4, 35) = 1.47, p = 0.23). Complementary flight demonstrations confirmed stable hover performance and consistent force output under realistic operating conditions. These pilot results validate the feasibility of drone-mounted, multi-contact haptic feedback and lay the groundwork for future integration into fully immersive VR, teleoperation, and remote interaction scenarios.

Paperid: 249, https://arxiv.org/pdf/2505.02569.pdf

Abstract:
This paper introduces HapticVLM, a novel multimodal system that integrates vision-language reasoning with deep convolutional networks to enable real-time haptic feedback. HapticVLM leverages a ConvNeXt-based material recognition module to generate robust visual embeddings for accurate identification of object materials, while a state-of-the-art Vision-Language Model (Qwen2-VL-2B-Instruct) infers ambient temperature from environmental cues. The system synthesizes tactile sensations by delivering vibrotactile feedback through speakers and thermal cues via a Peltier module, thereby bridging the gap between visual perception and tactile experience. Experimental evaluations demonstrate an average recognition accuracy of 84.67% across five distinct auditory-tactile patterns and a temperature estimation accuracy of 86.7% based on a tolerance-based evaluation method with an 8°C margin of error across 15 scenarios. Although promising, the current study is limited by the use of a small set of prominent patterns and a modest participant pool. Future work will focus on expanding the range of tactile patterns and increasing user studies to further refine and validate the system's performance. Overall, HapticVLM presents a significant step toward context-aware, multimodal haptic interaction with potential applications in virtual reality and assistive technologies.
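A tolerance-based accuracy of the kind described above can be sketched as the fraction of estimates that fall within a fixed margin of the reference value. The function name `tolerance_accuracy` and the sample temperatures are assumptions for illustration; the paper's exact protocol is not specified in the abstract.

```python
def tolerance_accuracy(predicted, actual, tolerance):
    """Fraction of predictions within +/- tolerance of the reference values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Hypothetical temperatures in degrees Celsius with an 8-degree margin.
predicted = [20.0, 30.0, 5.0]
actual = [25.0, 20.0, 5.0]
print(tolerance_accuracy(predicted, actual, tolerance=8.0))
```

As a purely arithmetic observation, 13 of 15 scenarios within tolerance would correspond to roughly 86.7%, which is consistent with the reported figure, though the abstract does not state the underlying counts.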

Paperid: 250, https://arxiv.org/pdf/2504.20567.pdf

Abstract:
Bayesian Optimisation (BO) is a family of methods for finding optimal parameters when the underlying function to be optimised is unknown. BO is used, for example, for hyperparameter tuning in machine learning and as an expert support tool for tuning cyberphysical systems. For settings where humans are involved in the tuning task, methods have been developed to explain BO (Explainable Bayesian Optimization, XBO). However, there is little guidance on how to present XBO results to humans so that they can tune the system effectively and efficiently. In this paper, we investigate how the XBO explanation format affects users' task performance, task load, understanding and trust in XBO. We chose a task that is accessible to a wide range of users. Specifically, we set up an egg cooking scenario with 6 parameters that participants had to adjust to achieve a perfect soft-boiled egg. We compared three different explanation formats: a bar chart, a list of rules and a textual explanation in a between-subjects online study with 213 participants. Our results show that adding any type of explanation increases task success, reduces the number of trials needed to achieve success, and improves comprehension and confidence. While explanations add more information for participants to process, we found no increase in user task load. We also found that the aforementioned results were independent of the explanation format; all formats had a similar effect. This is an interesting finding for practical applications, as it suggests that explanations can be added to BO tuning tasks without the burden of designing or selecting specific explanation formats. In the future, it would be interesting to investigate scenarios of prolonged use of the explanation formats and whether they have different effects on users' mental models of the underlying system.

Paperid: 251, https://arxiv.org/pdf/2504.09859.pdf

Abstract:
Graph visualizations have been studied for tasks such as clustering and temporal analysis, but how these visual similarities relate to established graph similarity measures remains unclear. In this paper, we explore the potential of Vision Language Models (VLMs) to approximate human-like perception of graph similarity. We generate graph datasets of various sizes and densities and compare VLM-derived visual similarity scores with feature-based measures. Our findings indicate VLMs can assess graph similarity in a manner similar to feature-based measures, even though differences among the measures exist. In future work, we plan to extend our research by conducting experiments on human visual graph perception.

Paperid: 252, https://arxiv.org/pdf/2504.07801.pdf

Abstract:
Recent advances in Large Language Models (LLMs) have enabled their application to recommender systems (RecLLMs), yet concerns remain regarding fairness across demographic and psychological user dimensions. We introduce FairEval, a novel evaluation framework to systematically assess fairness in LLM-based recommendations. FairEval integrates personality traits with eight sensitive demographic attributes, including gender, race, and age, enabling a comprehensive assessment of user-level bias. We evaluate models, including ChatGPT 4o and Gemini 1.5 Flash, on music and movie recommendations. FairEval's fairness metric, PAFS, achieves scores up to 0.9969 for ChatGPT 4o and 0.9997 for Gemini 1.5 Flash, with disparities reaching 34.79 percent. These results highlight the importance of robustness in prompt sensitivity and support more inclusive recommendation systems.

Paperid: 253, https://arxiv.org/pdf/2504.03888.pdf

Abstract:
As AI chatbots see increased adoption and integration into everyday life, questions have been raised about the potential impact of human-like or anthropomorphic AI on users. In this work, we investigate the extent to which interactions with ChatGPT (with a focus on Advanced Voice Mode) may impact users' emotional well-being, behaviors and experiences through two parallel studies. To study the affective use of AI chatbots, we perform large-scale automated analysis of ChatGPT platform usage in a privacy-preserving manner, analyzing over 3 million conversations for affective cues and surveying over 4,000 users on their perceptions of ChatGPT. To investigate whether there is a relationship between model usage and emotional well-being, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT) on close to 1,000 participants over 28 days, examining changes in their emotional well-being as they interact with ChatGPT under different experimental settings. In both on-platform data analysis and the RCT, we observe that very high usage correlates with increased self-reported indicators of dependence. From our RCT, we find the impact of voice-based interactions on emotional well-being to be highly nuanced and influenced by factors such as the user's initial emotional state and total usage duration. Overall, our analysis reveals that a small number of users are responsible for a disproportionate share of the most affective cues.

Paperid: 254, https://arxiv.org/pdf/2503.17473.pdf

Abstract:
AI chatbots, especially those with voice capabilities, have become increasingly human-like, with more users seeking emotional support and companionship from them. Concerns are rising about how such interactions might impact users' loneliness and socialization with real people. We conducted a four-week randomized, controlled, IRB-approved experiment (n=981, >300K messages) to investigate how AI chatbot interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence psychosocial outcomes such as loneliness, social interaction with real people, emotional dependence on AI and problematic AI usage. Results showed that while voice-based chatbots initially appeared beneficial in mitigating loneliness and dependence compared with text-based chatbots, these advantages diminished at high usage levels, especially with a neutral-voice chatbot. Conversation type also shaped outcomes: personal topics slightly increased loneliness but tended to lower emotional dependence compared with open-ended conversations, whereas non-personal topics were associated with greater dependence among heavy users. Overall, higher daily usage - across all modalities and conversation types - correlated with higher loneliness, dependence, and problematic use, and lower socialization. Exploratory analyses revealed that those with stronger emotional attachment tendencies and higher trust in the AI chatbot tended to experience greater loneliness and emotional dependence, respectively. These findings underscore the complex interplay between chatbot design choices (e.g., voice expressiveness) and user behaviors (e.g., conversation content, usage frequency). We highlight the need for further research on whether chatbots' ability to manage emotional content without fostering dependence or replacing human relationships benefits overall well-being.

Paperid: 255, https://arxiv.org/pdf/2503.16475.pdf

Abstract:
We present LLM-Glasses, a wearable navigation system designed to assist visually impaired individuals by combining haptic feedback, YOLO-World object detection, and GPT-4o-driven reasoning. The system delivers real-time tactile guidance via temple-mounted actuators, enabling intuitive and independent navigation. Three user studies were conducted to evaluate its effectiveness: (1) a haptic pattern recognition study achieving an 81.3% average recognition rate across 13 distinct patterns, (2) a VICON-based navigation study in which participants successfully followed predefined paths in open spaces, and (3) an LLM-guided video evaluation demonstrating 91.8% accuracy in open scenarios, 84.6% with static obstacles, and 81.5% with dynamic obstacles. These results demonstrate the system's reliability in controlled environments, with ongoing work focusing on refining its responsiveness and adaptability to diverse real-world scenarios. LLM-Glasses showcases the potential of combining generative AI with haptic interfaces to empower visually impaired individuals with intuitive and effective mobility solutions.

Paperid: 256, https://arxiv.org/pdf/2503.02067.pdf

Abstract:
Pro-environmental behavior (PEB) is vital to combat climate change, yet turning awareness into intention and action remains elusive. We explore large language models (LLMs) as tools to promote PEB, comparing their impact across 3,200 participants: real humans (n=1,200), simulated humans based on actual participant data (n=1,200), and fully synthetic personas (n=1,200). All three participant groups faced personalized or standard chatbots, or static statements, employing four persuasion strategies (moral foundations, future self-continuity, action orientation, or "freestyle" chosen by the LLM). Results reveal a "synthetic persuasion paradox": synthetic and simulated agents significantly affect their post-intervention PEB stance, while human responses barely shift. Simulated participants better approximate human trends but still overestimate effects. This disconnect underscores LLMs' potential for pre-evaluating PEB interventions but warns of their limits in predicting real-world behavior. We call for refined synthetic modeling and sustained and extended human trials to align conversational AI's promise with tangible sustainability outcomes.

Paperid: 257, https://arxiv.org/pdf/2502.02863.pdf

Abstract:
Marine ecosystems face unprecedented threats from climate change and plastic pollution, yet traditional environmental education often struggles to translate awareness into sustained behavioral change. This paper presents OceanChat, an interactive system leveraging large language models to create conversational AI agents represented as animated marine creatures -- specifically a beluga whale, a jellyfish, and a seahorse -- designed to promote pro-environmental behavior (PEB) and foster awareness through personalized dialogue. Through a between-subjects experiment (N=900), we compared three conditions: (1) Static Scientific Information, providing conventional environmental education through text and images; (2) Static Character Narrative, featuring first-person storytelling from 3D-rendered marine creatures; and (3) Conversational Character Narrative, enabling real-time dialogue with AI-powered marine characters. Our analysis revealed that the Conversational Character Narrative condition significantly increased behavioral intentions and sustainable choice preferences compared to static approaches. The beluga whale character demonstrated consistently stronger emotional engagement across multiple measures, including perceived anthropomorphism and empathy. However, impacts on deeper measures like climate policy support and psychological distance were limited, highlighting the complexity of shifting entrenched beliefs. Our work extends research on sustainability interfaces facilitating PEB and offers design principles for creating emotionally resonant, context-aware AI characters. By balancing anthropomorphism with species authenticity, OceanChat demonstrates how interactive narratives can bridge the gap between environmental knowledge and real-world behavior change.

Paperid: 258, https://arxiv.org/pdf/2501.10168.pdf

Abstract:
Dimensionality reduction (DR) techniques are essential for visually analyzing high-dimensional data. However, visual analytics using DR often face unreliability, stemming from factors such as inherent distortions in DR projections. This unreliability can lead to analytic insights that misrepresent the underlying data, potentially resulting in misguided decisions. To tackle these reliability challenges, we review 133 papers that address the unreliability of visual analytics using DR. Through this review, we contribute (1) a workflow model that describes the interaction between analysts and machines in visual analytics using DR, and (2) a taxonomy that identifies where and why reliability issues arise within the workflow, along with existing solutions for addressing them. Our review reveals ongoing challenges in the field, whose significance and urgency are validated by five expert researchers. This review also finds that the current research landscape is skewed toward developing new DR techniques rather than their interpretation or evaluation, and we discuss how the HCI community can contribute to broadening this focus.

Paperid: 259, https://arxiv.org/pdf/2506.17356.pdf

Abstract:
We explore the automatic generation of interactive, scenario-based lessons designed to train novice human tutors who teach middle school mathematics online. Employing prompt engineering through a Retrieval-Augmented Generation approach with GPT-4o, we developed a system capable of creating structured tutor training lessons. Our study generated lessons in English for three key topics: Encouraging Students' Independence, Encouraging Help-Seeking Behavior, and Turning on Cameras, using a task decomposition prompting strategy that breaks lesson generation into sub-tasks. The generated lessons were evaluated by two human evaluators, who provided both quantitative and qualitative evaluations using a comprehensive rubric informed by lesson design research. Results demonstrate that the task decomposition strategy led to higher-rated lessons compared to single-step generation. Human evaluators identified several strengths in the LLM-generated lessons, including well-structured content and time-saving potential, while also noting limitations such as generic feedback and a lack of clarity in some instructional sections. These findings underscore the potential of hybrid human-AI approaches for generating effective lessons in tutor training.

Paperid: 260, https://arxiv.org/pdf/2506.06100.pdf

Abstract:
Executable QR codes, or sQRy, are a technology dating from 2022 that allows a runnable program to be embedded inside a QR code, enabling interaction with the user even in the absence of an Internet connection. sQRy enable a range of practical applications, including network equipment configuration, diagnostics, and enhanced smart manuals in industrial contexts. Many other non-industry-related fields can also benefit from this technology. Regardless of where sQRy are used, text strings are among the most commonly embedded data. However, because the available payload is strictly limited, the space occupied by strings limits the length of the programs that can be embedded. In this work, we propose a simple yet effective strategy that can reduce the space taken by strings, hence broadening sQRy applicability.

Paperid: 261, https://arxiv.org/pdf/2505.04869.pdf

Abstract:
Producing large volumes of high-quality, timely feedback poses significant challenges to instructors. To address this issue, automation technologies, particularly Large Language Models (LLMs), show great potential. However, current LLM-based research still shows room for improvement in terms of feedback quality. Our study proposed a multi-agent approach performing "generation, evaluation, and regeneration" (G-E-RG) to further enhance feedback quality. In the first generation phase, six methods were adopted, combining three feedback theoretical frameworks and two prompt methods: zero-shot and retrieval-augmented generation with chain-of-thought (RAG_CoT). The results indicated that, compared to first-round feedback, G-E-RG significantly improved final feedback across six methods for most dimensions. Specifically: (1) Evaluation accuracy for six methods increased by 3.36% to 12.98% (p<0.001); (2) The proportion of feedback containing four effective components rose from an average of 27.72% to an average of 98.49% among six methods, and the sub-dimensions of providing critiques, highlighting strengths, encouraging agency, and cultivating dialogue also showed great enhancement (p<0.001); (3) There was a significant improvement in most of the feature values (p<0.001), although some sub-dimensions (e.g., strengthening the teacher-student relationship) still require further enhancement; (4) The simplicity of feedback was effectively enhanced (p<0.001) for three methods.

Paperid: 262, https://arxiv.org/pdf/2505.04584.pdf

Abstract:
Feedback is important in supporting student learning. While various automated feedback systems have been implemented to make feedback scalable, many existing solutions focus only on generating text-based feedback. As indicated by the multimedia learning principle, learning with more modalities could help utilize more separate channels, reduce cognitive load, and facilitate students' learning. Hence, it is important to explore the potential of Artificial Intelligence (AI) in feedback generation from and to different modalities. Our study leverages Large Language Models (LLMs) for textual feedback with supplementary guidance from another modality: a relevant lecture slide retrieved from the slides hub. Through an online crowdsourcing study (N=91), this study investigates learning gains and student perceptions using a 2x2 design (i.e., human feedback vs. AI feedback and with vs. without relevant slide), evaluating the clarity, engagement, perceived effectiveness, and reliability of AI-facilitated multimodal feedback. We observed significant pre-to-post learning gains across all conditions. However, the differences in these gains were not statistically significant between conditions. The post-survey revealed that students found the slide feedback helpful in their learning process, though they reported difficulty in understanding it. Regarding the AI-generated open-ended feedback, students considered it personalized and relevant to their responses, but they expressed lower trust in the AI feedback compared to human-generated feedback.

Paperid: 263, https://arxiv.org/pdf/2505.04488.pdf

Abstract:
The visually impaired population, especially the severely visually impaired, is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs in dynamic and complex environments, such as daily activities. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies. Although real-time vision and speech interaction VideoLLMs demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset (VisAssistDaily), covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.

Paperid: 264, https://arxiv.org/pdf/2505.00049.pdf

Abstract:
As large language models (LLMs) are increasingly used in human-centered tasks, assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. While existing reviews have covered some aspects of related research, several important areas have not been systematically discussed, including detailed discussions of diverse psychological tests, LLM-specific psychological datasets, and the applications of LLMs with psychological traits. To address this gap, we systematically review six key dimensions of applying psychological theories to LLMs: (1) assessment tools; (2) LLM-specific datasets; (3) evaluation metrics (consistency and stability); (4) empirical findings; (5) personality simulation methods; and (6) LLM-based behavior simulation. Our analysis highlights both the strengths and limitations of current methods. While some LLMs exhibit reproducible personality patterns under specific prompting schemes, significant variability remains across tasks and settings. Recognizing methodological challenges such as mismatches between psychological tools and LLMs' capabilities, as well as inconsistencies in evaluation practices, this study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.

Paperid: 265, https://arxiv.org/pdf/2504.13898.pdf

Abstract:
Our work aims to advance the social reasoning of embodied artificial intelligence (AI) agents in real-world social interactions. Recently, language models (LMs) and foundational models (FMs) have been utilized as automatic evaluators of human-AI interactions with the goal of eventually being used to improve the policy of the AI agent. To enable further research in this direction, we introduce a large-scale real-world Human Robot Social Interaction (HSRI) Dataset to benchmark the capabilities of LMs and FMs to identify and reason about social interactions, specifically with regard to robot social errors and competencies. Our dataset consists of 400 real-world human social robot interaction videos and over 10K annotations, detailing the robot's social errors, competencies, rationale, and corrective actions, capturing unique aspects of human-AI interaction only present in real-world interactions. To further assess AI models' ability to reason about social interactions, we propose eight new benchmark tasks centered around whether AI models can (1) evaluate social interactions via detecting social errors and competencies, (2) identify the explanatory factors associated with errors and competencies, (3) understand the flow of real-world social interactions, and (4) provide reasons and corrective actions for social errors. Human studies and experiments with modern LMs and FMs reveal that current models struggle with these tasks, demonstrating that our dataset and benchmark provide a step forward towards socially intelligent AI.

Paperid: 266, https://arxiv.org/pdf/2504.13882.pdf

Abstract:
Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: 1. giving effective praise, 2. reacting to errors, 3. determining what students know, 4. helping students manage inequity, and 5. responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tutoring strategy as either being employed as desired or undesired. Our study utilizes GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that for the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738, and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct strategy. The strategy "helping students manage inequity" showed the highest performance with a TNR of 0.738 and Recall of 0.432. The study highlights the potential of LLMs in tutoring strategy analysis and outlines directions for future improvements, including incorporating more advanced models for more nuanced feedback.
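
For readers less familiar with the two metrics reported above, the following is a minimal sketch of how True Negative Rate and Recall are computed from gold and predicted labels. The label strings, the choice of "desired" as the positive class, and the function names are assumptions made for illustration; the abstract does not specify how the authors encoded their labels.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the classifier identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def true_negative_rate(tn: int, fp: int) -> float:
    """Share of actual negatives that the classifier correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def rates_from_labels(y_true, y_pred, positive="desired"):
    """Count confusion-matrix cells and return recall and TNR.

    `positive` names the class treated as positive (an assumption here).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return {"recall": recall(tp, fn), "tnr": true_negative_rate(tn, fp)}


# Hypothetical toy labels, not data from the paper.
gold = ["desired", "desired", "undesired", "undesired", "undesired"]
pred = ["desired", "undesired", "undesired", "undesired", "desired"]
print(rates_from_labels(gold, pred))
```

A high TNR paired with a low Recall, as in the reported ranges, means negatives are usually rejected correctly while many true positives are missed.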

Paperid: 267, https://arxiv.org/pdf/2502.00221.pdf

Abstract:
Despite living in an increasingly connected world, social isolation is a prevalent issue today. While social robots have been explored as tools to enhance social connection through companionship, their potential as asynchronous social platforms for fostering connection towards humanity has received less attention. In this work, we introduce the design of a social support companion that facilitates the exchange of emotionally relevant stories and scaffolds reflection to enhance feelings of connection via five design dimensions. We investigate how social robots can serve as "social proxies" facilitating human stories, passing stories from other human narrators to the user. To this end, we conduct a real-world deployment of 40 robot stations in users' homes over the course of two weeks. Through thematic analysis of user interviews, we find that social proxy robots can foster connection towards other people's experiences via mechanisms such as identifying connections across stories or offering diverse perspectives. We present design guidelines from our study insights on the use of social robot systems that serve as social platforms to enhance human empathy and connection.

Paperid: 268, https://arxiv.org/pdf/2501.09824.pdf

Abstract:
Tutoring is an effective instructional method for enhancing student learning, yet its success relies on the skill and experience of the tutors. This reliance presents challenges for the widespread implementation of tutoring, particularly in training novice tutors. To support tutor training programs, real-time automated feedback systems are essential for efficiently training large numbers of tutors. Lin et al.'s previous study employed Generative Pre-Trained Transformers (GPT) for sequence labeling to identify desirable and undesirable praise components in a tutor training dataset, providing explanatory feedback. However, this approach requires a significant amount of labeled data for fine-tuning, which is both labor-intensive and dependent on expert input. To address the challenges associated with extensive data labeling, the current study explores the use of prompting more advanced GPT models like GPT-4o to generate synthetic datasets for augmenting labeled response data, followed by fine-tuning a GPT-3.5 model. Our results demonstrate that our data augmentation approach generalizes effectively to identify other types of praise, compared to the same model fine-tuned without augmentation. These findings suggest that for data-intensive tasks, synthetic data generated through GPT model prompting can substantially enhance fine-tuned model performance in low-resource scenarios.

Paperid: 269, https://arxiv.org/pdf/2506.23075.pdf

Abstract:
Understanding and decoding brain activity from electroencephalography (EEG) signals is a fundamental challenge in neuroscience and AI, with applications in cognition, emotion recognition, diagnosis, and brain-computer interfaces. While recent EEG foundation models advance generalized decoding via unified architectures and large-scale pretraining, they adopt a scale-agnostic dense modeling paradigm inherited from NLP and vision. This design neglects a core property of neural activity: cross-scale spatiotemporal structure. EEG task patterns span a wide range of temporal and spatial scales, from short bursts to slow rhythms, and from localized cortical responses to distributed interactions. Ignoring this diversity leads to suboptimal representations and weak generalization. We propose CSBrain, a Cross-scale Spatiotemporal Brain foundation model for generalized EEG decoding. CSBrain introduces: (i) Cross-scale Spatiotemporal Tokenization (CST), which aggregates multi-scale features from localized temporal windows and anatomical brain regions into compact scale-aware tokens; and (ii) Structured Sparse Attention (SSA), which captures cross-window and cross-region dependencies, enhancing scale diversity while removing spurious correlations. CST and SSA are alternately stacked to progressively integrate multi-scale dependencies. Experiments on 11 EEG tasks across 16 datasets show that CSBrain consistently outperforms task-specific and foundation model baselines. These results establish cross-scale modeling as a key inductive bias and position CSBrain as a robust backbone for future brain-AI research.

Paperid: 270, https://arxiv.org/pdf/2506.22443.pdf

Abstract:
Rule-based models offer interpretability but struggle with complex data, while deep neural networks excel in performance yet lack transparency. This work investigates a neuro-symbolic rule learning neural network named RL-Net that learns interpretable rule lists through neural optimization, applied for the first time to radar-based hand gesture recognition (HGR). We benchmark RL-Net against a fully transparent rule-based system (MIRA) and an explainable black-box model (XentricAI), evaluating accuracy, interpretability, and user adaptability via transfer learning. Our results show that RL-Net achieves a favorable trade-off, maintaining strong performance (93.03% F1) while significantly reducing rule complexity. We identify optimization challenges specific to rule pruning and hierarchy bias and propose stability-enhancing modifications. Compared to MIRA and XentricAI, RL-Net emerges as a practical middle ground between transparency and performance. This study highlights the real-world feasibility of neuro-symbolic models for interpretable HGR and offers insights for extending explainable AI to edge-deployable sensing systems.

Paperid: 271, https://arxiv.org/pdf/2506.18962.pdf

Abstract:
Decoding human brain activity from electroencephalography (EEG) signals is a central challenge at the intersection of neuroscience and artificial intelligence, enabling diverse applications in mental state assessment, clinical monitoring, and human-machine interaction. Recent efforts have extensively explored EEG-based brain foundation models for generalized brain decoding, employing large-scale training on multiple datasets. However, most of these attempts struggle with generalizability and fail to achieve satisfactory performance without task-specific tuning due to pronounced inherent heterogeneity among decoding tasks. To address these challenges, we present UniMind, a general-purpose EEG foundation model for unified multi-task brain decoding by uniquely unleashing the power of large language models to comprehend complex neural patterns. UniMind offers several advantages. First, we design a Neuro-Language Connector to bridge the modality gap between neural signals and large language models, distilling and transforming the spatiotemporal neural patterns of EEG data into representations understandable by language models. Second, a Task-aware Query Selection module is proposed to inject task-awareness into the cross-modal alignment by dynamically generating task-adaptive query tokens, enabling learning of task-relevant neural patterns across diverse tasks. Extensive experiments across ten datasets demonstrate that UniMind substantially outperforms state-of-the-art multi-task decoding models, with an average gain of 12 percent, while also offering valuable neuroscientific insights into neural functional correlations across tasks. The code will be made publicly available.

Paperid: 272, https://arxiv.org/pdf/2504.13700.pdf

Abstract:
Recent advances in large language models (LLMs) have shown great potential in automating the process of visualization authoring through simple natural language utterances. However, instructing LLMs using natural language is limited in precision and expressiveness for conveying visualization intent, leading to misinterpretation and time-consuming iterations. To address these limitations, we conduct an empirical study to understand how LLMs interpret ambiguous or incomplete text prompts in the context of visualization authoring, and the conditions making LLMs misinterpret user intent. Informed by the findings, we introduce visual prompts as a complementary input modality to text prompts, which help clarify user intent and improve LLMs' interpretation abilities. To explore the potential of multimodal prompting in visualization authoring, we design VisPilot, which enables users to easily create visualizations using multimodal prompts, including text, sketches, and direct manipulations on existing visualizations. Through two case studies and a controlled user study, we demonstrate that VisPilot provides a more intuitive way to create visualizations without affecting the overall task efficiency compared to text-only prompting approaches. Furthermore, we analyze the impact of text and visual prompts in different visualization tasks. Our findings highlight the importance of multimodal prompting in improving the usability of LLMs for visualization authoring. We discuss design implications for future visualization systems and provide insights into how multimodal prompts can enhance human-AI collaboration in creative visualization tasks. All materials are available at https://OSF.IO/2QRAK.

Paperid: 273, https://arxiv.org/pdf/2503.15528.pdf

Abstract:
The EU AI Act underscores the importance of transparency, user-centricity, and robustness in AI systems, particularly for high-risk systems. In response, we present advancements in XentricAI, an explainable hand gesture recognition (HGR) system designed to meet these regulatory requirements. XentricAI addresses fundamental challenges in HGR, such as the opacity of black-box models using explainable AI methods and the handling of distributional shifts in real-world data through transfer learning techniques. We extend an existing radar-based HGR dataset by adding 28,000 new gestures, with contributions from multiple users across varied locations, including 24,000 out-of-distribution gestures. Leveraging this real-world dataset, we enhance XentricAI's capabilities by integrating a variational autoencoder module for improved gesture anomaly detection, incorporating user-specific thresholding. This integration enables the identification of 11.50% more anomalous gestures. Our extensive evaluations demonstrate a 97.5% success rate in characterizing these anomalies, significantly improving system explainability. Furthermore, the implementation of transfer learning techniques has shown a substantial increase in user adaptability, with an average improvement of at least 15.17%. This work contributes to the development of trustworthy AI systems by providing both technical advancements and regulatory compliance, offering a commercially viable solution that aligns with the EU AI Act requirements.
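
To make the idea of user-specific thresholding concrete, here is a minimal sketch under stated assumptions. It assumes each gesture already has a scalar anomaly score (for example, an autoencoder reconstruction error) and sets each user's threshold to the mean plus k standard deviations of that user's calibration scores. The abstract does not describe the actual rule used in XentricAI, so the mean-plus-k-sigma rule, the function names, and the numbers are all hypothetical.

```python
from statistics import mean, pstdev


def user_threshold(calibration_scores, k=3.0):
    """Hypothetical per-user threshold: mean + k * population std of the
    anomaly scores observed on that user's calibration gestures."""
    return mean(calibration_scores) + k * pstdev(calibration_scores)


def flag_anomalies(scores, threshold):
    """Mark each gesture whose anomaly score exceeds the user's threshold."""
    return [score > threshold for score in scores]


# Illustrative numbers only: two users with different baseline error levels.
calibration = {
    "user_a": [0.10, 0.12, 0.11, 0.09],
    "user_b": [0.30, 0.35, 0.28, 0.33],
}
thresholds = {user: user_threshold(s) for user, s in calibration.items()}

# The same score can be anomalous for one user and normal for another.
print(flag_anomalies([0.25], thresholds["user_a"]))
print(flag_anomalies([0.25], thresholds["user_b"]))
```

The design point is that a single global cutoff would either miss anomalies for users with low baseline error or over-flag users with high baseline error.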

Paperid: 274, https://arxiv.org/pdf/2502.06382.pdf

Abstract:
In extended reality, pass-through enables users to view their real-world surroundings via cameras on the headset, displaying live video inside the device. This study compared the pass-through quality of three devices: Apple Vision Pro, Meta Quest 3, and Varjo XR3. Thirty-one participants performed two tasks, reading a text and solving a puzzle, while using each headset with the pass-through feature activated. Participants then rated their experiences, focusing on workload and cybersickness. Results showed that the Apple Vision Pro outperformed the Meta Quest 3 and Varjo XR3, receiving the highest ratings for pass-through quality.

Paperid: 275, https://arxiv.org/pdf/2502.05100.pdf

Abstract:
Virtual reality enables users to experience real-life situations in immersive environments. Interaction methods significantly shape user experience, particularly in high-fidelity simulations mimicking real-world tasks. This study evaluates two primary VR interaction techniques, hand-based and controller-based, through virtual shopping tasks in a simulated supermarket with 40 participants. Hand-based interaction was preferred for its natural, immersive qualities and alignment with real-world gestures but faced usability challenges, including limited haptic feedback and grasping inefficiencies. In contrast, controller-based interaction offered greater precision and reliability, making it more suitable for tasks requiring fine motor skills.

Paperid: 276, https://arxiv.org/pdf/2503.04635.pdf

Abstract:
Supernumerary robotic limbs (SRLs) are robotic structures integrated closely with the user's body, which augment human physical capabilities and necessitate seamless, naturalistic human-machine interaction. For effective assistance in physical tasks, enabling SRLs to hand over objects to humans is crucial. Yet, designing heuristic-based policies for robots is time-consuming, difficult to generalize across tasks, and results in less human-like motion. When trained with proper datasets, generative models are powerful alternatives for creating naturalistic handover motions. We introduce 3HANDS, a novel dataset of object handover interactions between a participant performing a daily activity and another participant enacting a hip-mounted SRL in a naturalistic manner. 3HANDS captures the unique characteristics of SRL interactions: operating in intimate personal space with asymmetric object origins, implicit motion synchronization, and the user's engagement in a primary task during the handover. To demonstrate the effectiveness of our dataset, we present three models: one that generates naturalistic handover trajectories, another that determines the appropriate handover endpoints, and a third that predicts the moment to initiate a handover. In a user study (N=10), we compare the handover interaction performed with our method against a baseline. The findings show that our method was perceived as significantly more natural, less physically demanding, and more comfortable.

Paperid: 277, https://arxiv.org/pdf/2506.06524.pdf

Abstract:
There is much interest in using large pre-trained models in Automatic Game Design (AGD), whether via the generation of code, assets, or more abstract conceptualization of design ideas. But so far this interest largely stems from the ad hoc use of such generative models under persistent human supervision. Much work remains to show how these tools can be integrated into longer-time-horizon AGD pipelines, in which systems interface with game engines to test generated content autonomously. To this end, we introduce ScriptDoctor, a Large Language Model (LLM)-driven system for automatically generating and testing games in PuzzleScript, an expressive but highly constrained description language for turn-based puzzle games over 2D gridworlds. ScriptDoctor generates and tests game design ideas in an iterative loop, where human-authored examples are used to ground the system's output, compilation errors from the PuzzleScript engine are used to elicit functional code, and search-based agents play-test generated games. ScriptDoctor serves as a concrete example of the potential of automated, open-ended LLM-based workflows in generating novel game content.

Paperid: 278, https://arxiv.org/pdf/2505.22093.pdf

Abstract:
The rapid adoption of AI-powered coding assistants like ChatGPT and other coding copilots is transforming programming education, raising questions about assessment practices, academic integrity, and skill development. As educators seek alternatives to traditional grading methods susceptible to AI-enabled plagiarism, structured peer assessment could be a promising strategy. This paper presents an empirical study of a rubric-based, anonymized peer review process implemented in a large introductory programming course. Students evaluated each other's final projects (2D game), and their assessments were compared to instructor grades using correlation, mean absolute error, and root mean square error (RMSE). Additionally, reflective surveys from 47 teams captured student perceptions of fairness, grading behavior, and preferences regarding grade aggregation. Results show that peer review can approximate instructor evaluation with moderate accuracy and foster student engagement, evaluative thinking, and interest in providing good feedback to their peers. We discuss the implications of these findings for designing scalable, trustworthy peer assessment systems for the age of AI-assisted coding.
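
The three agreement measures named above can be sketched in a few lines of standard-library Python. The grade values and function names below are hypothetical, and Pearson's r is assumed as the correlation measure, since the abstract does not say which coefficient the authors used.

```python
import math


def mean_absolute_error(peer, instructor):
    """Average absolute gap between peer and instructor grades."""
    return sum(abs(p - i) for p, i in zip(peer, instructor)) / len(peer)


def root_mean_square_error(peer, instructor):
    """Like MAE, but squaring penalizes large disagreements more heavily."""
    squared = sum((p - i) ** 2 for p, i in zip(peer, instructor))
    return math.sqrt(squared / len(peer))


def pearson_correlation(peer, instructor):
    """Pearson's r: strength of the linear relationship between the two.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(peer)
    mean_p = sum(peer) / n
    mean_i = sum(instructor) / n
    cov = sum((p - mean_p) * (i - mean_i) for p, i in zip(peer, instructor))
    var_p = sum((p - mean_p) ** 2 for p in peer)
    var_i = sum((i - mean_i) ** 2 for i in instructor)
    return cov / math.sqrt(var_p * var_i)


# Hypothetical project grades on a 0-100 scale, not data from the paper.
peer_grades = [80, 90, 70]
instructor_grades = [82, 88, 70]

print(mean_absolute_error(peer_grades, instructor_grades))
print(root_mean_square_error(peer_grades, instructor_grades))
print(pearson_correlation(peer_grades, instructor_grades))
```

Correlation captures whether peers rank projects similarly to the instructor, while MAE and RMSE capture how far the actual grade values diverge, so the measures can disagree when peers are consistently lenient or harsh.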

Paperid: 279, https://arxiv.org/pdf/2505.18661.pdf

Abstract:
This study evaluates the integration of AI-powered robots in early childhood education, focusing on their impact on emotional self-regulation, engagement, and collaborative skills. A ten-week experimental design involving two groups of children assessed the robot's effectiveness through progress assessments, parental surveys, and teacher feedback. Results demonstrated that early exposure to the robot significantly enhanced emotional recognition, while sustained interaction further improved collaborative and social engagement. Parental and teacher feedback highlighted high acceptance levels, emphasizing the robot's ease of integration and positive influence on classroom dynamics. This research underscores the transformative potential of AI and robotics in education. The findings advocate for the broader adoption of AI-powered interventions, carefully examining equitable access, ethical considerations, and sustainable implementation. This work sets a foundation for exploring long-term impacts and expanding applications of AI in inclusive and impactful educational settings.

Paperid: 280, https://arxiv.org/pdf/2505.06661.pdf

Abstract:
Blockchain technology promises to democratize finance and promote social equity through decentralization, but questions remain about whether current implementations advance or hinder these goals. Through a mixed-methods study combining semi-structured interviews with 13 diverse blockchain stakeholders and analysis of over 3,000 cryptocurrency discussions on Reddit, we examine how trust manifests in cryptocurrency ecosystems despite their decentralized architecture. Our findings uncover that users actively seek out and create centralized trust anchors, such as established exchanges, prominent community figures, and recognized development teams, contradicting blockchain's fundamental promise of trustless interactions. We identify how this contradiction arises from users' mental need for accountability and their reluctance to shoulder the full responsibility of self-custody. The study also reveals how these centralized trust patterns disproportionately impact different user groups, with newer and less technical users showing stronger preferences for centralized intermediaries. This work contributes to our understanding of the inherent tensions between theoretical decentralization and practical implementation in cryptocurrency systems, highlighting the persistent role of centralized trust in supposedly trustless environments.

Paperid: 281, https://arxiv.org/pdf/2503.16481.pdf

Abstract:
The increasing use of robots in human-centric public spaces such as shopping malls, sidewalks, and hospitals requires understanding of how pedestrians respond to their presence. However, existing research lacks comprehensive datasets that capture the full range of pedestrian behaviors, including avoidance, neutrality, and attraction in the presence of robots. Such datasets can be used to effectively learn models capable of accurately predicting diverse responses of pedestrians to robot presence, which are crucial for advancing robot navigation strategies and optimizing pedestrian-aware motion planning. In this paper, we address these challenges by collecting a novel dataset of pedestrian motion in two outdoor locations under three distinct conditions, i.e., no robot presence, a stationary robot, and a moving robot. Thus, unlike existing datasets, ours explicitly encapsulates variations in pedestrian behavior across the different robot conditions. Using our dataset, we propose a novel Neural Social Robot Force Model (NSRFM), an extension of the traditional Social Force Model that integrates neural networks and robot-induced forces to better predict pedestrian behavior in the presence of robots. We validate the NSRFM by comparing its generated trajectories on different real-world datasets. Furthermore, we implemented it in simulation to enable the learning and benchmarking of robot navigation strategies based on their impact on pedestrian movement. Our results demonstrate the model's effectiveness in replicating real-world pedestrian reactions and its utility in developing, evaluating, and benchmarking social robot navigation algorithms.
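For orientation, the traditional Social Force Model that NSRFM extends can be sketched as below. This is not the paper's model: the robot term here is a fixed exponential repulsion standing in for the learned, neural robot-induced force, and every constant (`desired_speed`, `tau`, `A`, `B`, `A_robot`, `B_robot`) is an illustrative assumption.

```python
import math

def social_force(pos, vel, goal, others, robot=None,
                 desired_speed=1.3, tau=0.5,
                 A=2.0, B=0.3, A_robot=3.0, B_robot=0.5):
    """Return the (fx, fy) force on one pedestrian.

    Classic form: goal-driving term plus exponential repulsion from
    neighbours. The optional robot term is a hand-set stand-in
    (assumption), not the learned force described in the abstract.
    """
    # Driving force: relax the current velocity toward the desired one.
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    dist = math.hypot(gx, gy)
    ex, ey = (gx / dist, gy / dist) if dist > 0 else (0.0, 0.0)
    fx = (desired_speed * ex - vel[0]) / tau
    fy = (desired_speed * ey - vel[1]) / tau

    def repulsion(source, strength, reach):
        # Force pointing away from `source`, decaying with distance.
        dx, dy = pos[0] - source[0], pos[1] - source[1]
        d = math.hypot(dx, dy)
        if d == 0:
            return 0.0, 0.0
        mag = strength * math.exp(-d / reach)
        return mag * dx / d, mag * dy / d

    for other in others:
        rx, ry = repulsion(other, A, B)
        fx, fy = fx + rx, fy + ry
    if robot is not None:
        rx, ry = repulsion(robot, A_robot, B_robot)
        fx, fy = fx + rx, fy + ry
    return fx, fy

# A pedestrian at rest heading right, with a robot one metre ahead.
free = social_force((0, 0), (0, 0), (10, 0), others=[])
near_robot = social_force((0, 0), (0, 0), (10, 0), others=[], robot=(1, 0))
```

A purely repulsive term like this can only reproduce avoidance; capturing the neutral and attracted responses that the dataset records is presumably one reason for learning the robot force instead.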

Paperid: 282, https://arxiv.org/pdf/2503.14021.pdf

Abstract:
Graphical user interfaces (GUIs) have become integral to modern society, making GUI understanding crucial for human-centric systems. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current multi-modal large language models (MLLMs), although already proficient in processing graphical and textual components, struggle with GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues, with a spatial structure refinement strategy, adaptively combined via a fusion gate to meet the specific preferences of different GUI understanding tasks. To cope with the scarcity of training data, we also introduce a pipeline for automatic data collection. Extensive experiments demonstrate that MP-GUI achieves impressive results on various GUI understanding tasks with limited data.

Paperid: 283, https://arxiv.org/pdf/2503.13625.pdf

Abstract:
We investigate the impact of robot appearance on users' spoken behavior during real-world interactions by comparing a human-like android, ERICA, with a less anthropomorphic humanoid, TELECO. Analyzing data from 42 participants at SIGDIAL 2024, we extracted linguistic features such as disfluencies and syntactic complexity from conversation transcripts. The results showed moderate effect sizes, suggesting that participants produced fewer disfluencies and employed more complex syntax when interacting with ERICA. Further analysis involving training classification models like Naïve Bayes, which achieved an F1-score of 71.60%, and conducting feature importance analysis, highlighted the significant role of disfluencies and syntactic complexity in interactions with robots of varying human-like appearances. Discussing these findings within the frameworks of cognitive load and Communication Accommodation Theory, we conclude that designing robots to elicit more structured and fluent user speech can enhance their communicative alignment with humans.

Paperid: 284, https://arxiv.org/pdf/2505.07552.pdf

Abstract:
Teachers' visual attention and its distribution across the students in classrooms can constitute important implications for student engagement, achievement, and professional teacher training. Despite that, inferring the information about where and which student teachers focus on is not trivial. Mobile eye tracking can provide vital help to solve this issue; however, the use of mobile eye tracking alone requires a significant amount of manual annotations. To address this limitation, we present an automated processing pipeline concept that requires minimal manually annotated data to recognize which student the teachers focus on. To this end, we utilize state-of-the-art face detection models and face recognition feature embeddings to train face recognition models with transfer learning in the classroom context and combine these models with the teachers' gaze from mobile eye trackers. We evaluated our approach with data collected from four different classrooms, and our results show that while it is possible to estimate the visually focused students with reasonable performance in all of our classroom setups, U-shaped and small classrooms led to the best results with accuracies of approximately 0.7 and 0.9, respectively. While we did not evaluate our method for teacher-student interactions and focused on the validity of the technical approach, as our methodology does not require a vast amount of manually annotated data and offers a non-intrusive way of handling teachers' visual attention, it could help improve instructional strategies, enhance classroom management, and provide feedback for professional teacher development.

Paperid: 285, https://arxiv.org/pdf/2505.07377.pdf

Abstract:
Transforming educational technologies through the integration of large language models (LLMs) and virtual reality (VR) offers the potential for immersive and interactive learning experiences. However, the effects of LLMs on user engagement and attention in educational environments remain open questions. In this study, we utilized a fully LLM-driven virtual learning environment, where peers and teachers were LLM-driven, to examine how students behaved in such settings. Specifically, we investigate how peer question-asking behaviors influenced student engagement, attention, cognitive load, and learning outcomes and found that, in conditions where LLM-driven peer learners asked questions, students exhibited more targeted visual scanpaths, with their attention directed toward the learning content, particularly in complex subjects. Our results suggest that peer questions did not introduce extraneous cognitive load directly, as the cognitive load is strongly correlated with increased attention to the learning material. Considering these findings, we provide design recommendations for optimizing VR learning spaces.

Paperid: 286, https://arxiv.org/pdf/2504.18691.pdf

Abstract:
Background and Context. The increasing integration of large language models (LLMs) in computing education presents an emerging challenge in understanding how students use LLMs and craft prompts to solve computational tasks. Prior research has used both qualitative and quantitative methods to analyze prompting behavior, but these approaches lack scalability or fail to effectively capture the semantic evolution of prompts. Objective. In this paper, we investigate whether students' prompts can be systematically analyzed using propositional logic constraints. We examine whether this approach can identify patterns in prompt evolution, detect struggling students, and provide insights into effective and ineffective strategies. Method. We introduce Prompt2Constraints, a novel method that translates students' prompts into logical constraints. The constraints are able to represent the intent of the prompts in succinct and quantifiable ways. We used this approach to analyze a dataset of 1,872 prompts from 203 students solving introductory programming tasks. Findings. We find that while successful and unsuccessful attempts tend to use a similar number of constraints overall, when students fail, they often modify their prompts more significantly, shifting problem-solving strategies midway. We also identify points where specific interventions could be most helpful to students for refining their prompts. Implications. This work offers a new and scalable way to detect students who struggle in solving natural language programming tasks. This work could be extended to investigate more complex tasks and integrated into programming tools to provide real-time support.

Paperid: 287, https://arxiv.org/pdf/2504.17331.pdf

Abstract:
Locomotion plays a crucial role in shaping the user experience within virtual reality environments. In particular, hands-free locomotion offers a valuable alternative by supporting accessibility and freeing users from reliance on handheld controllers. To this end, traditional speech-based methods often depend on rigid command sets, limiting the naturalness and flexibility of interaction. In this study, we propose a novel locomotion technique powered by large language models (LLMs), which allows users to navigate virtual environments using natural language with contextual awareness. We evaluate three locomotion methods: controller-based teleportation, voice-based steering, and our language model-driven approach. Our evaluation measures include eye-tracking data analysis, including explainable machine learning through SHAP analysis as well as standardized questionnaires for usability, presence, cybersickness, and cognitive load to examine user attention and engagement. Our findings indicate that the LLM-driven locomotion possesses comparable usability, presence, and cybersickness scores to established methods like teleportation, demonstrating its novel potential as a comfortable, natural language-based, hands-free alternative. In addition, it enhances user attention within the virtual environment, suggesting greater engagement. Complementary to these findings, SHAP analysis revealed that fixation, saccade, and pupil-related features vary across techniques, indicating distinct patterns of visual attention and cognitive processing. Overall, we state that our method can facilitate hands-free locomotion in virtual spaces, especially in supporting accessibility.

Paperid: 288, https://arxiv.org/pdf/2504.11723.pdf

Abstract:
Introductory programming courses often rely on small code-writing exercises that have clearly specified problem statements. This limits opportunities for students to practice how to clarify ambiguous requirements -- a critical skill in real-world programming. In addition, the emerging capabilities of large language models (LLMs) to produce code from well-defined specifications may harm student engagement with traditional programming exercises. This study explores the use of "Probeable Problems", automatically gradable tasks that have deliberately vague or incomplete specifications. Such problems require students to submit test inputs, or "probes", to clarify requirements before implementation. Through analysis of over 40,000 probes in an introductory course, we identify patterns linking probing behaviors to task success. Systematic strategies, such as thoroughly exploring expected behavior before coding, resulted in fewer incorrect code submissions and correlated with course success. Feedback from nearly 1,000 participants highlighted the challenges and real-world relevance of these tasks, as well as benefits to critical thinking and metacognitive skills. Probeable Problems are easy to set up and deploy at scale, and help students recognize and resolve uncertainties in programming problems.

Paperid: 289, https://arxiv.org/pdf/2503.04707.pdf

Abstract:
Iris texture is widely regarded as a gold standard biometric modality for authentication and identification. The demand for robust iris recognition methods, coupled with growing security and privacy concerns regarding iris attacks, has escalated recently. Inspired by neural style transfer, an advanced technique that leverages neural networks to separate content and style features, we hypothesize that iris texture's style features provide a reliable foundation for recognition and are more resilient to variations like rotation and perspective shifts than traditional approaches. Our experimental results support this hypothesis, showing a significantly higher classification accuracy compared to conventional features. Further, we propose using neural style transfer to obfuscate the identifiable iris style features, ensuring the protection of sensitive biometric information while maintaining the utility of eye images for tasks like eye segmentation and gaze estimation. This work opens new avenues for iris-oriented, secure, and privacy-aware biometric systems.

Paperid: 290, https://arxiv.org/pdf/2502.14949.pdf

Abstract:
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
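Character Error Rate, the metric cited in this abstract, is conventionally the character-level edit distance between the recognized text and the reference, divided by the reference length. The sketch below follows that common definition; whether KITAB-Bench applies additional normalization (for example of Arabic diacritics) is not stated in the abstract, so the code operates on raw Unicode code points as an assumption.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)

# One dropped letter in a four-letter Arabic word gives a CER of 0.25.
print(cer("كتاب", "كتب"))
```

Because the denominator is the reference length, CER can exceed 1.0 when the OCR output contains many spurious characters.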

Paperid: 291, https://arxiv.org/pdf/2501.06184.pdf

Abstract:
Geologic maps, as fundamental diagrams in geological science, provide critical insights into the structure and composition of Earth's subsurface and surface. These maps are indispensable in various fields, including disaster detection, resource exploration, and civil engineering. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. This gap is primarily due to the challenging nature of cartographic generalization, which involves handling high-resolution maps, managing multiple associated components, and requiring domain-specific knowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by the interdisciplinary collaboration among human scientists, an AI expert group acts as consultants, utilizing a diverse tool pool to comprehensively analyze questions. Through comprehensive experiments, GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming GPT-4o's 0.369. Our work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs, paves the way for advanced AI applications in geology, enhancing the efficiency and accuracy of geological investigations.

Paperid: 292, https://arxiv.org/pdf/2506.23721.pdf

Abstract:
Ultrasound (US) is widely accessible and radiation-free but has a steep learning curve due to its dynamic nature and non-standard imaging planes. Additionally, the constant need to shift focus between the US screen and the patient poses a challenge. To address these issues, we integrate deep learning (DL)-based semantic segmentation for real-time (RT) automated kidney volumetric measurements, which are essential for clinical assessment but are traditionally time-consuming and prone to fatigue. This automation allows clinicians to concentrate on image interpretation rather than manual measurements. Complementing DL, augmented reality (AR) enhances the usability of US by projecting the display directly into the clinician's field of view, improving ergonomics and reducing the cognitive load associated with screen-to-patient transitions. Two AR-DL-assisted US pipelines on HoloLens-2 are proposed: one streams directly via the application programming interface for a wireless setup, while the other supports any US device with video output for broader accessibility. We evaluate RT feasibility and accuracy using the Open Kidney Dataset and open-source segmentation models (nnU-Net, Segmenter, YOLO with MedSAM and LiteMedSAM). Our open-source GitHub pipeline includes model implementations, measurement algorithms, and a Wi-Fi-based streaming solution, enhancing US training and diagnostics, especially in point-of-care settings.

Paperid: 293, https://arxiv.org/pdf/2506.21814.pdf

Abstract:
Despite advances in surgical techniques and care, postoperative complications are prevalent and affect up to 15% of patients who undergo major surgery. The objective of this study is to develop and validate models for predicting postoperative complications and death after major surgery on a large and multicenter dataset, following the previously validated MySurgeryRisk algorithm. This retrospective, longitudinal and multicenter cohort analysis included 508,097 encounters from 366,875 adult inpatients who underwent major surgeries and were admitted to healthcare institutions within the OneFlorida+ network between 01/01/2012 and 04/29/2023. We applied the validated feature selection and transformation approach in MySurgeryRisk models and redeveloped eXtreme Gradient Boosting (XGBoost) models for predicting risk of postoperative acute kidney injury (AKI), need for intensive care unit (ICU) admission, need for mechanical ventilation (MV) therapy and in-hospital mortality on a development set and evaluated the model performance on a validation set. Area under the receiver operating characteristics curve values were obtained for need for ICU admission, 0.93 (95% Confidence Interval [CI], 0.93-0.93); need for MV, 0.94 (95% CI, 0.94-0.94); AKI, 0.92 (95% CI, 0.92-0.92); and in-hospital mortality, 0.95 (95% CI, 0.94-0.95). Area under the precision-recall curve values were computed for need for ICU admission, 0.62 (95% CI, 0.62-0.63); need for MV, 0.51 (95% CI, 0.49-0.52); AKI, 0.53 (95% CI, 0.53-0.54); and in-hospital mortality, 0.26 (95% CI, 0.24-0.29). The performance of these models is comparable to that of the previously validated MySurgeryRisk models, suggesting the enhanced generalizability of the models. Primary procedure code and provider specialty consistently appeared as the top influential variables, providing valuable insights into the factors influencing surgical outcomes.
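The area under the ROC curve reported above has a direct probabilistic reading: it is the chance that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes it that way on made-up data. It is a quadratic-time illustration with a hypothetical function name, not the study's evaluation code, which would more plausibly use a library routine.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case is scored higher; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

AUROC does not depend on how rare the outcome is, whereas the precision-recall area does: a classifier with no skill scores roughly the outcome prevalence there. That is one common reason a rare outcome can pair a high AUROC with a much lower precision-recall area.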

Paperid: 294, https://arxiv.org/pdf/2506.04972.pdf

Abstract:
As one of the first research teams with full access to Siemens' Cinematic Reality, we evaluate its usability and clinical potential for cinematic volume rendering on the Apple Vision Pro. We visualized venous-phase liver computed tomography and magnetic resonance cholangiopancreatography scans from the CHAOS and MRCP_DLRecon datasets. Fourteen medical experts assessed usability and anticipated clinical integration potential using the System Usability Scale, ISONORM 9242-110-S questionnaire, and an open-ended survey. Their feedback identified feasibility, key usability strengths, and required features to catalyze the adaptation in real-world clinical workflows. The findings provide insights into the potential of immersive cinematic rendering in medical imaging.

Paperid: 295, https://arxiv.org/pdf/2506.04858.pdf

Abstract:
Medical imaging segmentation is essential in clinical settings for diagnosing diseases, planning surgeries, and other procedures. However, manual annotation is a cumbersome and effortful task. To mitigate these aspects, this study implements and evaluates the usability and clinical applicability of an extended reality (XR)-based segmentation tool for anatomical CT scans, using the Meta Quest 3 headset and Logitech MX Ink stylus. We develop an immersive interface enabling real-time interaction with 2D and 3D medical imaging data in a customizable workspace designed to mitigate workflow fragmentation and cognitive demands inherent to conventional manual segmentation tools. The platform combines stylus-driven annotation, mirroring traditional pen-on-paper workflows, with instant 3D volumetric rendering. A user study with a public craniofacial CT dataset demonstrated the tool's foundational viability, achieving a System Usability Scale (SUS) score of 66, within the expected range for medical applications. Participants highlighted the system's intuitive controls (scoring 4.1/5 for self-descriptiveness on ISONORM metrics) and spatial interaction design, with qualitative feedback highlighting strengths in hybrid 2D/3D navigation and realistic stylus ergonomics. While users identified opportunities to enhance task-specific precision and error management, the platform's core workflow enabled dynamic slice adjustment, reducing cognitive load compared to desktop tools. Results position the XR-stylus paradigm as a promising foundation for immersive segmentation tools, with iterative refinements targeting haptic feedback calibration and workflow personalization to advance adoption in preoperative planning.
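For context on the SUS figure above, the System Usability Scale is scored from ten items rated 1 to 5: odd-numbered items contribute their rating minus 1, even-numbered items contribute 5 minus their rating, and the sum is multiplied by 2.5 to give a 0 to 100 score. The sketch below applies that standard rule to one respondent; the function name and sample ratings are illustrative, not data from the study.

```python
def sus_score(responses):
    """Score one respondent's SUS questionnaire on the 0-100 scale.

    `responses` holds the ten item ratings in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten integer ratings between 1 and 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            total += rating - 1   # items 1, 3, 5, 7, 9: positively worded
        else:
            total += 5 - rating   # items 2, 4, 6, 8, 10: negatively worded
    return total * 2.5

# Made-up ratings for a single participant.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
```

A study-level score such as the one reported is typically the mean of these per-respondent scores.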

Paperid: 296, https://arxiv.org/pdf/2504.02551.pdf

Abstract:
Background: Artificial Intelligence (AI) clinical decision support (CDS) systems have the potential to augment surgical risk assessments, but successful adoption depends on an understanding of end-user needs and current workflows. This study reports the initial co-design of MySurgeryRisk, an AI CDS tool to predict the risk of nine post-operative complications in surgical patients. Methods: Semi-structured focus groups and interviews were held as co-design sessions with perioperative physicians at a tertiary academic hospital in the Southeastern United States. Participants were read a surgical vignette and asked questions to elicit an understanding of their current decision-making practices before being introduced to the MySurgeryRisk prototype web interface. They were asked to provide feedback on the user interface and system features. Session transcripts were qualitatively coded, after which thematic analysis took place. Results: Data saturation was reached after 20 surgeons and anesthesiologists from varying career stages participated across 11 co-design sessions. Thematic analysis resulted in five themes: (1) decision-making cognitive processes, (2) current approach to decision-making, (3) future approach to decision-making with MySurgeryRisk, (4) feedback on current MySurgeryRisk prototype, and (5) trustworthy considerations. Conclusion: Clinical providers perceived MySurgeryRisk as a promising CDS tool that factors in a large volume of data and is computed in real-time without any need for manual input. Participants provided feedback on the design of the interface and imagined applications of the tool in the clinical workflow. However, its successful implementation will depend on its actionability and explainability of model outputs, integration into current electronic systems, and calibration of trust among end-users.

Paperid: 297, https://arxiv.org/pdf/2503.08814.pdf

Abstract:
This study reports the findings of qualitative interview sessions conducted with ICU clinicians for the co-design of a system user interface of an artificial intelligence (AI)-driven clinical decision support (CDS) system. This system integrates medical record data with wearable sensor, video, and environmental data into a real-time dynamic model that quantifies patients' risk of clinical decompensation and risk of developing delirium, providing actionable alerts to augment clinical decision-making in the ICU setting. Co-design sessions were conducted as semi-structured focus groups and interviews with ICU clinicians, including physicians, mid-level practitioners, and nurses. Study participants were asked about their perceptions on AI-CDS systems, their system preferences, and were asked to provide feedback on the current user interface prototype. Session transcripts were qualitatively analyzed to identify key themes related to system utility, interface design features, alert preferences, and implementation considerations. Ten clinicians participated in eight sessions. The analysis identified five themes: (1) AI's computational utility, (2) workflow optimization, (3) effects on patient care, (4) technical considerations, and (5) implementation considerations. Clinicians valued the CDS system's multi-modal continuous monitoring and AI's capacity to process large volumes of data in real-time to identify patient risk factors and suggest action items. Participants underscored the system's unique value in detecting delirium and promoting non-pharmacological delirium prevention measures. The actionability and intuitive interpretation of the presented information was emphasized. ICU clinicians recognize the potential of an AI-driven CDS system for ICU delirium and acuity to improve patient outcomes and clinical workflows.

Paperid: 298, https://arxiv.org/pdf/2506.22926.pdf

Abstract:
Volumetric medical imaging technologies produce detailed 3D representations of anatomical structures. However, effective medical data visualization and exploration pose significant challenges, especially for individuals with limited medical expertise. We introduce a novel XR-based system with two key innovations: (1) a coordinated visualization module integrating Multi-layered Multi-planar Reconstruction with 3D mesh models and (2) a multimodal interaction framework combining hand gestures with LLM-enabled voice commands. We conduct preliminary evaluations, including a 15-participant user study and expert interviews, to demonstrate the system's abilities to enhance spatial understanding and reduce cognitive load. Experimental results show notable improvements in task completion times, usability metrics, and interaction effectiveness enhanced by LLM-driven voice control. While identifying areas for future refinement, our findings highlight the potential of this immersive visualization system to advance medical training and clinical practice. Our demo application and supplemental materials are available for download at: https://osf.io/bpjq5/.

Paperid: 299, https://arxiv.org/pdf/2505.23236.pdf

Abstract:
This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL representations via alternating LLM fine-tuning to joint SER-SED prediction and ASR tasks. VAE compressed HuBERT features derived via Information Bottleneck (IB) are used to adjust feature granularity. Experiments on the IEMOCAP and MELD benchmarks demonstrate that our approach consistently outperforms comparable LLaMA-based SER baselines, including those using either (a) alternating multi-task fine-tuning alone or (b) feature disentanglement only. Statistically significant increases in SER unweighted accuracy of up to 4.0% and 3.7% absolute (5.4% and 6.6% relative) are obtained. More importantly, emotion descriptors offer further explainability for SER.
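Two terms in this abstract benefit from a concrete reading. Unweighted accuracy in SER is usually taken to mean the average of per-class recalls, so that rare emotions count as much as frequent ones, and a relative gain is the absolute gain divided by the baseline value. The sketch below assumes those conventional definitions; the labels and numbers are made up and are not the paper's results.

```python
from collections import defaultdict

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (each emotion class weighted equally)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

def relative_gain(baseline, improved):
    """Absolute improvement expressed as a fraction of the baseline."""
    return (improved - baseline) / baseline

# Three of four utterances are right (plain accuracy 0.75), but the only
# "sad" utterance is missed, so unweighted accuracy is (1.0 + 0.0) / 2.
ua = unweighted_accuracy(["hap", "hap", "hap", "sad"],
                         ["hap", "hap", "hap", "neu"])
# An illustrative 5-point absolute gain on a 0.50 baseline is 10% relative.
gain = relative_gain(0.50, 0.55)
```

The same absolute gain therefore maps to a larger relative gain when the baseline is lower, which is why the two pairs of figures in the abstract do not scale together.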

Paperid: 300, https://arxiv.org/pdf/2504.10430.pdf

Abstract:
Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly their potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety which consists of three stages, i.e., persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improve safety alignment in progressive and goal-driven conversations such as persuasion.

Paperid: 301, https://arxiv.org/pdf/2503.15507.pdf

Abstract:
The study of human anatomy through advanced visualization techniques is crucial for medical research and education. In this work, we introduce CvhSlicer 2.0, an innovative XR system designed for immersive and interactive visualization of the Chinese Visible Human (CVH) dataset. Particularly, our proposed system operates entirely on a commercial XR headset, offering a range of visualization and interaction tools for dynamic 2D and 3D data exploration. By conducting comprehensive evaluations, our CvhSlicer 2.0 demonstrates strong capabilities in visualizing anatomical data, enhancing user engagement and improving educational effectiveness. A demo video is available at https://youtu.be/CfR72S_0N-4

Paperid: 302, https://arxiv.org/pdf/2502.18066.pdf

Abstract:
Advances in deepfake technologies, which use generative artificial intelligence (GenAI) to mimic a person's likeness or voice, have led to growing interest in their use in educational contexts. However, little is known about how key stakeholders perceive and intend to use these tools. This study investigated higher education stakeholder perceptions and intentions regarding deepfakes through the lens of the Unified Theory of Acceptance and Use of Technology 2 (UTAUT2). Using a mixed-methods approach combining survey data (n=174) with qualitative interviews, we found that academic stakeholders demonstrated a relatively low intention to adopt these technologies (M=41.55, SD=34.14) and held complex views about their implementation. Quantitative analysis revealed adoption intentions were primarily driven by hedonic motivation, with a gender-specific interaction in price-value evaluations. Qualitative findings highlighted potential benefits of enhanced student engagement, improved accessibility, and reduced workload in content creation, but concerns regarding the exploitation of academic labour, institutional cost-cutting leading to automation, degradation of relationships in education, and broader societal impacts. Based on these findings, we propose a framework for implementing deepfake technologies in higher education that addresses institutional policies, professional development, and equitable resource allocation to thoughtfully integrate AI while maintaining academic integrity and professional autonomy.

Paperid: 303, https://arxiv.org/pdf/2501.17348.pdf

Abstract:
While theories of discourse and cognitive science have long recognized the value of unhurried pacing, recent dialogue research tends to minimize friction in conversational systems. Yet, frictionless dialogue risks fostering uncritical reliance on AI outputs, which can obscure implicit assumptions and lead to unintended consequences. To meet this challenge, we propose integrating positive friction into conversational AI, which promotes user reflection on goals, critical thinking on system response, and subsequent re-conditioning of AI systems. We hypothesize systems can improve goal alignment, modeling of user mental states, and task success by deliberately slowing down conversations in strategic moments to ask questions, reveal assumptions, or pause. We present an ontology of positive friction and collect expert human annotations on multi-domain and embodied goal-oriented corpora. Experiments on these corpora, along with simulated interactions using state-of-the-art systems, suggest incorporating friction not only fosters accountable decision-making, but also enhances machine understanding of user beliefs and goals, and increases task success rates.

Paperid: 304, https://arxiv.org/pdf/2501.06744.pdf

Abstract:
The human ear offers a unique opportunity for cardiac monitoring due to its physiological and practical advantages. However, existing earable solutions require additional hardware and complex processing, posing challenges for commercial True Wireless Stereo (TWS) earbuds which are limited by their form factor and resources. In this paper, we propose TWSCardio, a novel system that repurposes the IMU sensors in TWS earbuds for cardiac monitoring. Our key finding is that these sensors can capture in-ear ballistocardiogram (BCG) signals. TWSCardio reuses the unstable Bluetooth channel to stream the IMU data to a smartphone for BCG processing. It incorporates a signal enhancement framework to address issues related to missing data and low sampling rate, while mitigating motion artifacts by fusing multi-axis information. Furthermore, it employs a region-focused signal reconstruction method to translate the multi-axis in-ear BCG signals into fine-grained seismocardiogram (SCG) signals. We have implemented TWSCardio as an efficient real-time app. Our experiments on 100 subjects verify that TWSCardio can accurately reconstruct cardiac signals while showing resilience to motion artifacts, missing data, and low sampling rates. Our case studies further demonstrate that TWSCardio can support diverse cardiac monitoring applications.

Paperid: 305, https://arxiv.org/pdf/2506.02993.pdf

Abstract:
Multi-agent AI systems, which simulate diverse instructional roles such as teachers and peers, offer new possibilities for personalized and interactive learning. Yet, student-AI interaction patterns and their pedagogical implications remain unclear. This study explores how university students engaged with multiple AI agents, and how these interactions influenced cognitive outcomes (learning gains) and non-cognitive factors (motivation, technology acceptance). Conducted on MAIC, a multi-agent online learning platform, the research involved 305 university students and 19,365 lines of dialogue data. Pre- and post-test scores, self-reported motivation and technology acceptance were also collected. The study identified two engagement patterns: co-construction of knowledge and co-regulation. Lag sequential analysis revealed that students with lower prior knowledge relied more on co-construction of knowledge sequences, showing higher learning gains and post-course motivation. In contrast, students with higher prior knowledge engaged more in co-regulation behaviors but exhibited limited learning improvement. Technology acceptance increased across all groups. These findings suggest that multi-agent AI systems can adapt to students' varying needs, support differentiated engagement, and reduce performance gaps. Implications for personalized system design and future research directions are discussed.
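
As a rough illustration of the lag sequential analysis mentioned above, the sketch below computes lag-1 transition counts and conditional probabilities over a coded dialogue. This is only the first step of such an analysis (a full treatment would also test each transition for significance, e.g. with adjusted residuals). The behavior codes and the example sequence are hypothetical and are not the study's coding scheme or data.

```python
from collections import Counter

def lag1_transitions(codes):
    """Count how often each behavior code is immediately followed by another."""
    return Counter(zip(codes, codes[1:]))

def transition_probabilities(codes):
    """Conditional probability of the next code given the current code."""
    counts = lag1_transitions(codes)
    totals = Counter()
    for (given, _target), n in counts.items():
        totals[given] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Hypothetical codes: Q = student question, E = agent explanation, R = student reflection
sequence = ["Q", "E", "Q", "E", "R", "Q", "E", "R"]
probs = transition_probabilities(sequence)
for (given, target), p in sorted(probs.items()):
    print(f"P({target} | {given}) = {p:.2f}")
```

Comparing such transition tables between groups (for example, lower versus higher prior knowledge) is one way the engagement patterns described in the abstract could be contrasted.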

Paperid: 306, https://arxiv.org/pdf/2505.13227.pdf

Abstract:
Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

Paperid: 307, https://arxiv.org/pdf/2505.05817.pdf

Abstract:
Depending on the route, runners may experience frustration, freedom, or fulfilment. However, finding routes that are conducive to the psychological experience of running remains an unresolved task in the literature. In a mixed-method study, we interviewed 7 runners to identify themes contributing to running experience, and quantitatively examined these themes in an online survey with 387 runners. Using Principal Component Analysis on the survey responses, we developed a short experience sampling questionnaire that captures the three most important dimensions of running experience: performance & achievement, environment, and mind & social connectedness. Using path preferences obtained from the online survey, we clustered them into two types of routes: scenic (associated with nature and greenery) and urban (characterized by the presence of people); and developed a routing engine for path recommendations. We discuss challenges faced in developing the routing engine, and provide guidelines to integrate it into mobile and wearable running apps.

Paperid: 308, https://arxiv.org/pdf/2504.13845.pdf

Abstract:
The rising interest in Virtual Reality (VR) technology has sparked a desire to create immersive learning platforms capable of handling various tasks across environments. Through immersive interfaces, users can engage deeply with virtual environments, enhancing both learning outcomes and task performance. In fields such as education, engineering, and collaboration, presence has emerged as a critical factor influencing user engagement, motivation, and skill mastery. This review provides a comprehensive examination of the role of presence across different tasks and disciplines, exploring how its design impacts learning outcomes. Using a systematic search strategy based on the PRISMA method, we screened 2,793 articles and included 78 studies that met our inclusion criteria. We conducted a detailed classification and analysis of different types of presence in VR environments, including spatial presence, social presence, co-presence, self-presence, and cognitive presence. This review emphasizes how these varied types of presence affect learning outcomes across tasks and fields, and examines how design elements and interaction techniques shape presence and subsequently impact learning outcomes. We also summarize trends and future directions, identifying research gaps and opportunities to improve learning outcomes by enhancing presence in VR environments, thus offering guidance and insight for future research on VR presence and learning effectiveness.

Paperid: 309, https://arxiv.org/pdf/2504.06016.pdf

Abstract:
AI development is shaped by academics and industry leaders - let us call them "influencers" - but it is unclear how their views align with those of the public. To address this gap, we developed an interactive platform that served as a data collection tool for exploring public views on AI, including their fears, hopes, and overall sense of hopefulness. We made the platform available to 330 participants representative of the U.S. population in terms of age, sex, ethnicity, and political leaning, and compared their views with those of 100 AI influencers identified by Time magazine. The public fears AI getting out of control, while influencers emphasize regulation, seemingly to deflect attention from their alleged focus on monetizing AI's potential. Interestingly, the views of AI influencers from underrepresented groups such as women and people of color often differ from the views of underrepresented groups in the public.

Paperid: 310, https://arxiv.org/pdf/2504.04710.pdf

Abstract:
Synchronous data-driven storytelling with network visualizations presents significant challenges due to the complexity of real-time manipulation of network components. While existing research addresses asynchronous scenarios, there is a lack of effective tools for live presentations. To address this gap, we developed TangibleNet, a projector-based AR prototype that allows presenters to interact with node-link diagrams using double-sided magnets during live presentations. The design process was informed by interviews with professionals experienced in synchronous data storytelling and workshops with 14 HCI/VIS researchers. Insights from the interviews helped identify key design considerations for integrating physical objects as interactive tools in presentation contexts. The workshops contributed to the development of a design space mapping user actions to interaction commands for node-link diagrams. Evaluation with 12 participants confirmed that TangibleNet supports intuitive interactions and enhances presenter autonomy, demonstrating its effectiveness for synchronous network-based data storytelling.

Paperid: 311, https://arxiv.org/pdf/2503.16500.pdf

Abstract:
Aligning robot navigation with human preferences is essential for ensuring comfortable and predictable robot movement in shared spaces, facilitating seamless human-robot coexistence. While preference-based learning methods, such as reinforcement learning from human feedback (RLHF), enable this alignment, the choice of the preference collection interface may influence the process. Traditional 2D interfaces provide structured views but lack spatial depth, whereas immersive VR offers richer perception, potentially affecting preference articulation. This study systematically examines how the interface modality impacts human preference collection and navigation policy alignment. We introduce a novel dataset of 2,325 human preference queries collected through both VR and 2D interfaces, revealing significant differences in user experience, preference consistency, and policy outcomes. Our findings highlight the trade-offs between immersion, perception, and preference reliability, emphasizing the importance of interface selection in preference-based robot learning. The dataset will be publicly released to support future research.

Paperid: 312, https://arxiv.org/pdf/2503.14936.pdf

Abstract:
Human attention provides valuable yet underexploited signals for code LLM training, offering a perspective beyond purely machine-driven attention. Collecting eye-tracking data is complex and costly, and there has been limited progress in systematically using these signals for code LLM training. To address both issues, we propose a cohesive pipeline spanning augmentation and reward-based fine-tuning. Specifically, we introduce (1) an eye-tracking path augmentation method to expand programmer attention datasets, (2) a pattern abstraction step that refines raw fixations into learnable attention motifs, and (3) a reward-guided strategy for integrating these insights directly into a CodeT5 supervised fine-tuning process. Our experiments yield +7.16 in CodeBLEU on the CodeXGlue benchmark for code summarization, underscoring how uniting human and machine attention can boost code intelligence. We hope this work encourages broader exploration of human-centric methods in next-generation AI4SE.

Paperid: 313, https://arxiv.org/pdf/2503.05822.pdf

Abstract:
The potential of AI researchers in scientific discovery remains largely untapped. Over the past decade, AI for Science (AI4Science) publications in 145 Nature Index journals have increased fifteen-fold, yet they still account for less than 3% of the total publications. Drawing upon the Diffusion of Innovation theory, we project AI4Science's share of total publications to rise from 2.72% in 2024 to approximately 20% by 2050. Achieving this shift requires fully harnessing the potential of AI researchers, as nearly 95% of AI-driven research in these journals is led by experimental scientists. To facilitate this, we propose structured workflows and strategic interventions to position AI researchers at the forefront of scientific discovery. Specifically, we identify three critical pathways: equipping experimental scientists with accessible AI tools to amplify the impact of AI researchers, bridging cognitive and methodological gaps to enable more direct involvement in scientific discovery, and proactively fostering a thriving AI-driven scientific ecosystem. By addressing these challenges, we aim to empower AI researchers as key drivers of future scientific breakthroughs.
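
To put the projection above in perspective, the sketch below computes the constant yearly growth rate that would carry the share from 2.72% in 2024 to 20% in 2050. This is an illustrative simplification: the abstract bases its projection on Diffusion of Innovation theory, which typically implies an S-shaped rather than constant-rate curve, and the function name is hypothetical.

```python
def implied_annual_growth(start_share, end_share, start_year, end_year):
    """Constant compound yearly growth rate taking start_share to end_share."""
    years = end_year - start_year
    return (end_share / start_share) ** (1.0 / years) - 1.0

# Figures from the abstract: 2.72% of publications in 2024, about 20% by 2050.
rate = implied_annual_growth(2.72, 20.0, 2024, 2050)
print(f"Implied compound annual growth in share: {rate:.2%}")
```

Under this simplification the share would need to grow by roughly 8% per year, every year, for 26 years.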

Paperid: 314, https://arxiv.org/pdf/2503.01694.pdf

Abstract:
Integrating LLM models into educational practice fosters personalized learning by accommodating the diverse behavioral patterns of different learner types. This study aims to explore these learner types within a novel interactive setting, providing a detailed analysis of their distinctive characteristics and interaction dynamics. The research involved 110 students from a university in China, who engaged with multiple LLM agents in an LLM-empowered learning environment, completing coursework across six modules. Data on the students' non-cognitive traits, course engagement, and AI interaction patterns were collected and analyzed. Using hierarchical cluster analysis, the students were classified into three distinct groups: active questioners, responsive navigators, and silent listeners. Epistemic network analysis was then applied to further delineate the interaction profiles and cognitive engagement of different types of learners. The findings underscore how different learner types engage with human-AI interactive learning and offer practical implications for the design of adaptive educational systems.

Paperid: 315, https://arxiv.org/pdf/2502.18413.pdf

Abstract:
Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting. Specifically, we perturb static coding benchmarks so that the code model must interact with a simulated user to retrieve key information about the problem. We find that interaction significantly affects model performance, as the relative rankings of 10 models across 3 datasets often vary between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that even when different feedback types are equally effective with respect to performance, they can impact model behaviors such as (1) how models respond to higher- vs. lower-quality feedback and (2) whether models prioritize aesthetic vs. functional edits. Our work aims to "re-evaluate" model coding capabilities through an interactive lens toward bridging the gap between existing evaluations and real-world usage.

Paperid: 316, https://arxiv.org/pdf/2501.18948.pdf

Abstract:
As artificial intelligence (AI) continues to reshape the workforce, its current trajectory raises pressing questions about its ultimate purpose. Why does job automation dominate the agenda, even at the expense of human agency and equity? This paper critiques the automation-centric paradigm, arguing that current reward structures, which largely focus on cost reduction, drive the overwhelming emphasis on task replacement in AI patents. Meanwhile, Human-Centered AI (HCAI), which envisions AI as a collaborator augmenting human capabilities and aligning with societal values, remains a fugitive from the mainstream narrative. Despite its promise, HCAI has gone "missing", with little evidence of its principles translating into patents or real-world impact. To increase impact, actionable interventions are needed to disrupt existing incentive structures within the HCI community. We call for a shift in priorities to support translational research, foster cross-disciplinary collaboration, and promote metrics that reward tangible and real-world impact.

Paperid: 317, https://arxiv.org/pdf/2501.00081.pdf

Abstract:
This paper provides a comprehensive review of the design and implementation of automatically generated assessment reports (AutoRs) for formative use in K-12 Science, Technology, Engineering, and Mathematics (STEM) classrooms. With the increasing adoption of technology-enhanced assessments, there is a critical need for human-computer interactive tools that efficiently support the interpretation and application of assessment data by teachers. AutoRs are designed to provide synthesized, interpretable, and actionable insights into students' performance, learning progress, and areas for improvement. Guided by cognitive load theory, this study emphasizes the importance of reducing teachers' cognitive demands through user-centered and intuitive designs. It highlights the potential of diverse information presentation formats such as text, visual aids, and plots and advanced functionalities such as live and interactive features to enhance usability. However, the findings also reveal that many existing AutoRs fail to fully utilize these approaches, leading to high initial cognitive demands and limited engagement. This paper proposes a conceptual framework to inform the design, implementation, and evaluation of AutoRs, balancing the trade-offs between usability and functionality. The framework aims to address challenges in engaging teachers with technology-enhanced assessment results, facilitating data-driven decision-making, and providing personalized feedback to improve the teaching and learning process.

Paperid: 318, https://arxiv.org/pdf/2506.17314.pdf

Abstract:
Accurate and complete product descriptions are crucial for e-commerce, yet seller-provided information often falls short. Customer reviews offer valuable details but are laborious to sift through manually. We present PRAISE: Product Review Attribute Insight Structuring Engine, a novel system that uses Large Language Models (LLMs) to automatically extract, compare, and structure insights from customer reviews and seller descriptions. PRAISE provides users with an intuitive interface to identify missing, contradictory, or partially matching details between these two sources, presenting the discrepancies in a clear, structured format alongside supporting evidence from reviews. This allows sellers to easily enhance their product listings for clarity and persuasiveness, and buyers to better assess product reliability. Our demonstration showcases PRAISE's workflow, its effectiveness in generating actionable structured insights from unstructured reviews, and its potential to significantly improve the quality and trustworthiness of e-commerce product catalogs.

Paperid: 319, https://arxiv.org/pdf/2506.05533.pdf

Abstract:
Concept-based interpretable neural networks have gained significant attention due to their intuitive and easy-to-understand explanations based on case-based reasoning, such as "this bird looks like those sparrows". However, a major limitation is that these explanations may not always be comprehensible to users due to concept inconsistency, where multiple visual features are inappropriately mixed (e.g., a bird's head and wings treated as a single concept). This inconsistency breaks the alignment between model reasoning and human understanding. Furthermore, users have specific preferences for how concepts should look, yet current approaches provide no mechanism for incorporating their feedback. To address these issues, we introduce YoursProtoP, a novel interactive strategy that enables the personalization of prototypical parts - the visual concepts used by the model - according to user needs. By incorporating user supervision, YoursProtoP adapts and splits concepts used for both prediction and explanation to better match the user's preferences and understanding. Through experiments on both the synthetic FunnyBirds dataset and a real-world scenario using the CUB, CARS, and PETS datasets in a comprehensive user study, we demonstrate the effectiveness of YoursProtoP in achieving concept consistency without compromising the accuracy of the model.

Paperid: 320, https://arxiv.org/pdf/2505.17629.pdf

Abstract:
Graphical User Interface (GUI) agents, which autonomously operate on digital interfaces through natural language instructions, hold transformative potential for accessibility, automation, and user experience. A critical aspect of their functionality is grounding - the ability to map linguistic intents to visual and structural interface elements. However, existing GUI agents often struggle to adapt to the dynamic and interconnected nature of real-world digital environments, where tasks frequently span multiple platforms and applications while also being impacted by version updates. To address this, we introduce TransBench, the first benchmark designed to systematically evaluate and enhance the transferability of GUI agents across three key dimensions: cross-version transferability (adapting to version updates), cross-platform transferability (generalizing across platforms like iOS, Android, and Web), and cross-application transferability (handling tasks spanning functionally distinct apps). TransBench includes 15 app categories with diverse functionalities, capturing essential pages across versions and platforms to enable robust evaluation. Our experiments demonstrate significant improvements in grounding accuracy, showcasing the practical utility of GUI agents in dynamic, real-world environments. Our code and data will be publicly available at GitHub.

Paperid: 321, https://arxiv.org/pdf/2503.19252.pdf

Abstract:
While generative AI systems have gained popularity in diverse applications, their potential to produce harmful outputs limits their trustworthiness and usability in different applications. Recent years have seen growing interest in engaging diverse AI users in auditing generative AI that might impact their lives. To this end, we propose MIRAGE as a web-based tool where AI users can compare outputs from multiple AI text-to-image (T2I) models by auditing AI-generated images, and report their findings in a structured way. We used MIRAGE to conduct a preliminary user study with five participants and found that MIRAGE users could leverage their own lived experiences and identities to surface previously unnoticed details around harmful biases when reviewing multiple T2I models' outputs compared to reviewing only one.

Paperid: 322, https://arxiv.org/pdf/2503.11113.pdf

Abstract:
Generative text-to-image (T2I) models are known for risks such as bias, offense, and misinformation. Current AI auditing methods face challenges in scalability and thoroughness, and it is even more challenging to enable auditors to explore the auditing space in a structured and effective way. Vipera employs multiple visual cues including a scene graph to facilitate image collection sensemaking and inspire auditors to explore and hierarchically organize the auditing criteria. Additionally, it leverages LLM-powered suggestions to facilitate exploration of unexplored auditing directions. An observational user study demonstrates Vipera's effectiveness in helping auditors organize their analyses while engaging with diverse criteria.

Paperid: 323, https://arxiv.org/pdf/2502.18576.pdf

Abstract:
Youth are active users and stakeholders of artificial intelligence (AI), yet they are often not included in responsible AI (RAI) practices. Emerging efforts in RAI largely focus on adult populations, missing an opportunity to get unique perspectives of youth. This study explores the potential of youth (teens under the age of 18) to engage meaningfully in RAI, specifically through AI auditing. In a workshop study with 17 teens, we investigated how youth can actively identify problematic behaviors in youth-relevant ubiquitous AI (text-to-image generative AI, autocompletion in search bar, image search) and the impacts of supporting AI auditing with critical AI literacy scaffolding with guided discussion about AI ethics and an auditing tool. We found that youth can contribute quality insights, shaped by their expertise (e.g., hobbies and passions), lived experiences (e.g., social identities), and age-related knowledge (e.g., understanding of fast-moving trends). We discuss how empowering youth in AI auditing can result in more responsible AI, support their learning through doing, and lead to implications for including youth in various participatory RAI processes.

Paperid: 324, https://arxiv.org/pdf/2502.17971.pdf

Abstract:
Successful adoption of industrial robots will strongly depend on their ability to safely and efficiently operate in human environments, engage in natural communication, understand their users, and express intentions intuitively while avoiding unnecessary distractions. To achieve this advanced level of Human-Robot Interaction (HRI), robots need to acquire and incorporate knowledge of their users' tasks and environment and adopt multimodal communication approaches with expressive cues that combine speech, movement, gazes, and other modalities. This paper presents several methods to design, enhance, and evaluate expressive HRI systems for non-humanoid industrial robots. We present the concept of a small anthropomorphic robot communicating as a proxy for its non-humanoid host, such as a forklift. We developed a multimodal and LLM-enhanced communication framework for this robot and evaluated it in several lab experiments, using gaze tracking and motion capture to quantify how users perceive the robot and measure the task progress.

Paperid: 325, https://arxiv.org/pdf/2501.12128.pdf

Abstract:
To achieve natural and intuitive interaction with people, HRI frameworks combine a wide array of methods for human perception, intention communication, human-aware navigation and collaborative action. In practice, when encountering unpredictable behavior of people or unexpected states of the environment, these frameworks may lack the ability to dynamically recognize such states, adapt and recover to resume the interaction. Large Language Models (LLMs), owing to their advanced reasoning capabilities and context retention, present a promising solution for enhancing robot adaptability. This potential, however, may not directly translate to improved interaction metrics. This paper considers a representative interaction with an industrial robot involving approach, instruction, and object manipulation, implemented in two conditions: (1) fully scripted and (2) including LLM-enhanced responses. We use gaze tracking and questionnaires to measure the participants' task efficiency, engagement, and robot perception. The results indicate higher subjective ratings for the LLM condition, but objective metrics show that the scripted condition performs comparably, particularly in efficiency and focus during simple tasks. We also note that the scripted condition may have an edge over LLM-enhanced responses in terms of response latency and energy consumption, especially for trivial and repetitive interactions.

Paperid: 326, https://arxiv.org/pdf/2501.01397.pdf

Abstract:
There has been growing interest from both practitioners and researchers in engaging end users in AI auditing, to draw upon users' unique knowledge and lived experiences. However, we know little about how to effectively scaffold end users in auditing in ways that can generate actionable insights for AI practitioners. Through formative studies with both users and AI practitioners, we first identified a set of design goals to support user-engaged AI auditing. We then developed WeAudit, a workflow and system that supports end users in auditing AI both individually and collectively. We evaluated WeAudit through a three-week user study with user auditors and interviews with industry Generative AI practitioners. Our findings offer insights into how WeAudit supports users in noticing and reflecting upon potential AI harms and in articulating their findings in ways that industry practitioners can act upon. Based on our observations and feedback from both users and practitioners, we identify several opportunities to better support user engagement in AI auditing processes. We discuss implications for future research to support effective and responsible user engagement in AI auditing and red-teaming.

Paperid: 327, https://arxiv.org/pdf/2506.15794.pdf

Abstract:
The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper will showcase Veracity's ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.

Paperid: 328, https://arxiv.org/pdf/2506.13739.pdf

Abstract:
Social robots are increasingly being explored as tools to support emotional wellbeing, particularly in non-clinical settings. Drawing on a range of empirical studies and practical deployments, this paper outlines six key insights that highlight both the opportunities and challenges in using robots to promote mental wellbeing. These include (1) the lack of a single, objective measure of wellbeing, (2) the fact that robots don't need to act as companions to be effective, (3) the growing potential of virtual interactions, (4) the importance of involving clinicians in the design process, (5) the difference between one-off and long-term interactions, and (6) the idea that adaptation and personalization are not always necessary for positive outcomes. Rather than positioning robots as replacements for human therapists, we argue that they are best understood as supportive tools that must be designed with care, grounded in evidence, and shaped by ethical and psychological considerations. Our aim is to inform future research and guide responsible, effective use of robots in mental health and wellbeing contexts.

Paperid: 329, https://arxiv.org/pdf/2505.24246.pdf

Abstract:
As AI systems are increasingly tested and deployed in open-ended and high-stakes domains, crowd workers are often tasked with responsible AI (RAI) content work. These tasks include labeling violent content, moderating disturbing text, or simulating harmful behavior for red teaming exercises to shape AI system behaviors. While prior efforts have highlighted the risks to worker well-being associated with RAI content work, far less attention has been paid to how these risks are communicated to workers. Existing transparency frameworks and guidelines such as model cards, datasheets, and crowdworksheets focus on documenting model information and dataset collection processes, but they overlook an important aspect of disclosing well-being risks to workers. In the absence of standard workflows or clear guidance, the consistent application of content warnings, consent flows, or other forms of well-being risk disclosure remains unclear. This study investigates how task designers approach risk disclosure in crowdsourced RAI tasks. Drawing on interviews with 23 task designers across academic and industry sectors, we examine how well-being risk is recognized, interpreted, and communicated in practice. Our findings surface a need to support task designers in identifying and communicating well-being risk not only to support crowdworker well-being but also to strengthen the ethical integrity and technical efficacy of AI development pipelines.

Paperid: 330, https://arxiv.org/pdf/2505.17418.pdf

Abstract:
Generative AI (genAI) tools are advertised as productivity aids. Yet, issues related to miscalibrated trust and usage friction continue to hinder their adoption. Additionally, AI can be exclusionary, failing to support diverse users adequately, further exacerbating these concerns. One such aspect of diversity is cognitive diversity -- variations in users' cognitive styles -- that leads to divergence in interaction styles. When an individual's cognitive styles are unsupported, it creates additional barriers to technology adoption. Thus, to design tools that developers trust, we must first understand what factors affect their trust and intentions to use these tools in practice. We developed a theoretical model of factors influencing trust and adoption intentions towards genAI through a large-scale survey with developers (N=238) at GitHub and Microsoft. Using Partial Least Squares-Structural Equation Modeling (PLS-SEM), we found that genAI's system/output quality, functional value, and goal maintenance significantly influence developers' trust, which, along with their cognitive styles, affects their intentions to use these tools in work. An Importance-Performance Matrix Analysis (IPMA) identified factors that, despite their strong influence, underperform, revealing specific genAI aspects that need design prioritization. We bolster these findings by qualitatively analyzing developers' perceived challenges and risks of genAI usage to uncover why these gaps persist in development contexts. For genAI to indeed be a true productivity aid rather than a disguised productivity sink, it must align with developers' goals, maintain contextual transparency, reduce cognitive burden, and provide equitable interaction support. We provide practical suggestions to guide future genAI tool design for effective, trustworthy, and inclusive human-genAI interactions.

Paperid: 331, https://arxiv.org/pdf/2505.07736.pdf

Abstract:
Hybrid tutoring, where a human tutor supports multiple students in learning with educational technology, is an increasingly common application to deliver high-impact tutoring at scale. However, past hybrid tutoring applications are limited in guiding tutor attention to students that require support. Specifically, existing conferencing tools, commonly used in hybrid tutoring, do not allow tutors to monitor multiple students' screens while directly communicating and attending to multiple students simultaneously. To address this issue, this paper introduces VTutor, a web-based platform leveraging peer-to-peer screen sharing and virtual avatars to deliver real-time, context-aware tutoring feedback at scale. By integrating a multi-student monitoring dashboard with AI-powered avatar prompts, VTutor empowers a single educator or tutor to rapidly detect off-task or struggling students and intervene proactively, thus enhancing the benefits of one-on-one interactions in classroom contexts with several students. Drawing on insight from the learning sciences and past research on animated pedagogical agents, we demonstrate how stylized avatars can potentially sustain student engagement while accommodating varying infrastructure constraints. Finally, we address open questions on refining large-scale, AI-driven tutoring solutions for improved learner outcomes, and how VTutor could help interpret real-time learner interactions to support remote tutors at scale. The VTutor platform can be accessed at https://ls2025.vtutor.ai. The system demo video is at https://ls2025.vtutor.ai/video.

Paperid: 332, https://arxiv.org/pdf/2505.06676.pdf

Abstract:
Pedagogical Agents (PAs) show significant potential for boosting student engagement and learning outcomes by providing adaptive, on-demand support in educational contexts. However, existing PA solutions are often hampered by pre-scripted dialogue, unnatural animations, uncanny visual realism, and high development costs. To address these gaps, we introduce VTutor, an open-source SDK leveraging lightweight WebGL, Unity, and JavaScript frameworks. VTutor receives text outputs from a large language model (LLM), converts them into audio via text-to-speech, and then renders a real-time, lip-synced pedagogical agent (PA) for immediate, large-scale deployment on web-based learning platforms. By providing on-demand, personalized feedback, VTutor strengthens students' motivation and deepens their engagement with instructional material. Using an anime-like aesthetic, VTutor alleviates the uncanny valley effect, allowing learners to engage with expressive yet comfortably stylized characters. Our evaluation with 50 participants revealed that VTutor significantly outperforms the existing talking-head approaches (e.g., SadTalker) on perceived synchronization accuracy, naturalness, emotional expressiveness, and overall preference. As an open-source project, VTutor welcomes community-driven contributions - from novel character designs to specialized showcases of pedagogical agent applications - that fuel ongoing innovation in AI-enhanced education. By providing an accessible, customizable, and learner-centered PA solution, VTutor aims to elevate human-AI interaction experience in education fields, ultimately broadening the impact of AI in learning contexts. The demo link to VTutor is at https://vtutor-aied25.vercel.app.

Paperid: 333, https://arxiv.org/pdf/2505.05851.pdf

Abstract:
With robots increasingly integrating into human environments, understanding and predicting human motion is essential for safe and efficient interactions. Modern human motion and activity prediction approaches require high quality and quantity of data for training and evaluation, usually collected from motion capture systems, onboard or stationary sensors. Setting up these systems is challenging due to the intricate setup of hardware components, extensive calibration procedures, occlusions, and substantial costs. These constraints make deploying such systems in new and large environments difficult and limit their usability for in-the-wild measurements. In this paper we investigate the possibility of applying novel Ultra-Wideband (UWB) localization technology as a scalable alternative for human motion capture in crowded and occlusion-prone environments. We include additional sensing modalities such as eye-tracking, onboard robot LiDAR and radar sensors, and record motion capture data as ground truth for evaluation and comparison. The environment imitates a museum setup, with up to four active participants navigating toward random goals in a natural way, and offers more than 130 minutes of multi-modal data. Our investigation provides a step toward scalable and accurate motion data collection beyond vision-based systems, laying a foundation for evaluating sensing modalities like UWB in larger and complex environments like warehouses, airports, or convention centers.

Paperid: 334, https://arxiv.org/pdf/2504.13925.pdf

Abstract:
Campus climate surveys play a pivotal role in capturing how students, faculty, and staff experience university life, yet traditional methods frequently suffer from low participation and minimal follow-up. We present TigerGPT, a new AI chatbot that generates adaptive, context-aware dialogues enriched with visual elements. Through real-time follow-up prompts, empathetic messaging, and flexible topic selection, TigerGPT elicits more in-depth feedback compared to traditional static survey forms. Based on established principles of conversational design, the chatbot employs empathetic cues, bolded questions, and user-driven topic selection. It retains some role-based efficiency (e.g., collecting user role through quick clicks) but goes beyond static scripts by employing GenAI adaptiveness. In a pilot study with undergraduate students, we collected both quantitative metrics (e.g., satisfaction ratings) and qualitative insights (e.g., written comments). Most participants described TigerGPT as engaging and user-friendly; about half preferred it over conventional surveys, attributing this preference to its personalized conversation flow and supportive tone. The findings indicate that an AI survey chatbot is promising in gaining deeper insight into campus climate.

Paperid: 335, https://arxiv.org/pdf/2504.00408.pdf

Abstract:
Generative AI has the potential to transform personalization and accessibility of education. However, it raises serious concerns about accuracy and helping students become independent critical thinkers. In this study, we designed a helpful AI "Peer" to help students correct fundamental physics misconceptions related to Newtonian mechanics concepts. In contrast to approaches that seek near-perfect accuracy to create an authoritative AI tutor or teacher, we directly inform students that this AI can answer up to 40% of questions incorrectly. In a randomized controlled trial with 165 students, those who engaged in targeted dialogue with the AI Peer achieved post-test scores that were, on average, 10.5 percentage points higher - with over 20 percentage points higher normalized gain - than a control group that discussed physics history. Qualitative feedback indicated that 91% of the treatment group's AI interactions were rated as helpful. Furthermore, by comparing student performance on pre- and post-test questions about the same concept, along with experts' annotations of the AI interactions, we find initial evidence suggesting the improvement in performance does not depend on the correctness of the AI. With further research, the AI Peer paradigm described here could open new possibilities for how we learn, adapt to, and grow with AI.
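
For readers unfamiliar with the term, normalized gain is commonly defined in physics education research as the share of the possible improvement that a student actually achieved. The abstract does not state the exact formula the authors used, so the definition below is the standard one and is given here as an assumption; the scores in the worked example are hypothetical and are not taken from the study.

```latex
% Standard normalized gain, with pre- and post-test scores in percent.
% Assumption: the study uses this common definition.
\[
  g = \frac{S_{\text{post}} - S_{\text{pre}}}{100 - S_{\text{pre}}}
\]
% Hypothetical example: S_pre = 40, S_post = 61.
\[
  g = \frac{61 - 40}{100 - 40} = \frac{21}{60} = 0.35
\]
```

Under this definition, a student who starts at 40% and ends at 61% has closed 35% of the gap to a perfect score, which is why normalized gain can differ substantially from the raw difference in percentage points.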

Paperid: 336, https://arxiv.org/pdf/2503.11684.pdf

Abstract:
One of the primary goals of Human-Robot Interaction (HRI) research is to develop robots that can interpret human behavior and adapt their responses accordingly. Adaptive learning models, such as continual and reinforcement learning, play a crucial role in improving robots' ability to interact effectively in real-world settings. However, these models face significant challenges due to the limited availability of real-world data, particularly in sensitive domains like healthcare and well-being. This data scarcity can hinder a robot's ability to adapt to new situations. To address these challenges, causality provides a structured framework for understanding and modeling the underlying relationships between actions, events, and outcomes. By moving beyond mere pattern recognition, causality enables robots to make more explainable and generalizable decisions. This paper presents an exploratory causality-based analysis through a case study of an adaptive robotic coach delivering positive psychology exercises over four weeks in a workplace setting. The robotic coach autonomously adapts to multimodal human behaviors, such as facial valence and speech duration. By conducting both macro- and micro-level causal analyses, this study aims to gain deeper insights into how adaptability can enhance well-being during interactions. Ultimately, this research seeks to advance our understanding of how causality can help overcome challenges in HRI, particularly in real-world applications.

Paperid: 337, https://arxiv.org/pdf/2503.02874.pdf

Abstract:
The emergence of generative AI (GenAI) models, including large language models and text-to-image models, has significantly advanced the synergy between humans and AI, owing not only to their outstanding capabilities but, more importantly, to intuitive communication through text prompts. Though intuitive, text-based instructions suffer from the ambiguous and redundant nature of natural language. To address this issue, researchers have explored augmenting text-based instructions with interactions that facilitate precise and effective human intent expression, such as direct manipulation. However, the design strategy of interaction-augmented instructions lacks systematic investigation, hindering our understanding and application. To provide a panorama of interaction-augmented instructions, we propose a framework that analyzes related tools in terms of why, when, who, what, and how interactions are applied to augment text-based instructions. Notably, we identify four purposes for applying interactions, including restricting, expanding, organizing, and refining text instructions. The design paradigms for each purpose are also summarized to benefit future researchers and practitioners.

Paperid: 338, https://arxiv.org/pdf/2503.02639.pdf

Abstract:
Data analysts frequently employ code completion tools in writing custom scripts to tackle complex tabular data wrangling tasks. However, existing tools do not sufficiently link data contexts, such as schemas and values, with the code being edited. This leads not only to poor code suggestions but also to frequent interruptions in coding processes, as users need additional code to locate and understand relevant data. We introduce Xavier, a tool designed to enhance data wrangling script authoring in computational notebooks. Xavier maintains users' awareness of data contexts while providing data-aware code suggestions. It automatically highlights the most relevant data based on the user's code, integrates both code and data contexts for more accurate suggestions, and instantly previews data transformation results for easy verification. To evaluate the effectiveness and usability of Xavier, we conducted a user study with 16 data analysts, showing its potential to streamline data wrangling script authoring.

Paperid: 339, https://arxiv.org/pdf/2503.02631.pdf

Abstract:
Human-AI collaborative tools have attracted attention from the data storytelling community for their potential to lower the barrier of expertise and streamline the workflow. Recent advances in large-scale generative AI techniques, e.g., large language models (LLMs) and text-to-image models, have the potential to enhance data storytelling with their power in visual and narration generation. Two years after these techniques became publicly available, it is important to reflect on our progress in applying them and to look ahead to future opportunities. To this end, we compare the collaboration patterns of the latest tools with those of earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling. Through comparison, we identify persistent collaboration patterns, e.g., human-creator + AI-assistant, and emerging ones, e.g., AI-creator + human-reviewer. The benefits of these AI techniques and other implications for human-AI collaboration are also revealed. We further propose future directions to hopefully ignite innovations.

Paperid: 340, https://arxiv.org/pdf/2502.16153.pdf

Abstract:
With the rise of AI technologies and their growing influence in the screenwriting field, understanding the opportunities and concerns related to AI's role in screenwriting is essential for enhancing human-AI co-creation. Through semi-structured interviews with 23 screenwriters, we explored their creative practices, attitudes, and expectations in collaborating with AI for screenwriting. Based on participants' responses, we identified the key stages in which they commonly integrated AI, including story structure & plot development, screenplay text, goal & idea generation, and dialogue. Then, we examined how different attitudes toward AI integration influence screenwriters' practices across various workflow stages and their broader impact on the industry. Additionally, we categorized their expected assistance using four distinct roles of AI: actor, audience, expert, and executor. Our findings provide insights into AI's impact on screenwriting practices and offer suggestions on how AI can benefit the future of screenwriting.

Paperid: 341, https://arxiv.org/pdf/2502.16114.pdf

Abstract:
Computational notebooks, widely used for ad-hoc analysis and often shared with others, can be difficult to understand because the standard linear layout is not optimized for reading. In particular, related text, code, and outputs may be spread across the UI, making it difficult to draw connections. In response, we introduce InterLink, a plugin designed to present the relationships between text, code, and outputs, thereby making notebooks easier to understand. In a formative study, we identify pain points and derive design requirements for identifying and navigating relationships among various pieces of information within notebooks. Based on these requirements, InterLink features a new layout that separates text from code and outputs into two columns. It uses visual links to signal relationships between text and associated code and outputs and offers interactions for navigating related pieces of information. In a user study with 12 participants, those using InterLink were 13.6% more accurate at finding and integrating information from complex analyses in computational notebooks. These results show the potential of notebook layouts that make them easier to understand.

Paperid: 342, https://arxiv.org/pdf/2502.16062.pdf

Abstract:
Visual blends combine elements from two distinct visual concepts into a single, integrated image, with the goal of conveying ideas through imaginative and often thought-provoking visuals. Communicating abstract concepts through visual blends poses a series of conceptual and technical challenges. To address these challenges, we introduce Creative Blends, an AI-assisted design system that leverages metaphors to visually symbolize abstract concepts by blending disparate objects. Our method harnesses commonsense knowledge bases and large language models to align designers' conceptual intent with expressive concrete objects. Additionally, we employ generative text-to-image techniques to blend visual elements through their overlapping attributes. A user study (N=24) demonstrated that our approach reduces participants' cognitive load, fosters creativity, and enhances the metaphorical richness of visual blend ideation. We explore the potential of our method to expand visual blends to include multiple object blending and discuss the insights gained from designing with generative AI.

Paperid: 343, https://arxiv.org/pdf/2502.09787.pdf

Abstract:
Spreadsheet programming is challenging. Programmers use spreadsheet programming knowledge (e.g., formulas) and problem-solving skills to combine actions into complex tasks. Advancements in large language models have introduced language agents that observe, plan, and perform tasks, showing promise for spreadsheet creation. We present TableTalk, a spreadsheet programming agent embodying three design principles -- scaffolding, flexibility, and incrementality -- derived from studies with seven spreadsheet programmers and 85 Excel templates. TableTalk guides programmers through structured plans based on professional workflows, generating three potential next steps to adapt plans to programmer needs. It uses pre-defined tools to generate spreadsheet components and incrementally build spreadsheets. In a study with 20 programmers, TableTalk produced higher-quality spreadsheets that were 2.3 times more likely to be preferred over the baseline. It reduced cognitive load and thinking time by 12.6%. From this, we derive design guidelines for agentic spreadsheet programming tools and discuss implications for spreadsheet programming, end-user programming, AI-assisted programming, and human-agent collaboration.

Paperid: 344, https://arxiv.org/pdf/2502.07986.pdf

Abstract:
Software engineering courses enable practical learning through assignments requiring contributions to open source software (OSS), allowing students to experience real-world projects, collaborate with global communities, and develop skills and competencies required to succeed in the tech industry. Learning software engineering through open source contribution integrates theory with hands-on practice, as students tackle real challenges in collaborative environments. However, students often struggle to contribute to OSS projects and do not understand the contribution process. Research has demonstrated that strategically incorporating game elements can promote student learning and engagement. This paper proposes and evaluates OSSDoorway, a tool designed to guide students contributing to OSS projects. We recruited 29 students and administered a self-efficacy questionnaire before and after their use of OSSDoorway, along with qualitative feedback to assess challenges, interface features, and suggestions for improvement. The results show that OSSDoorway boosts students' self-efficacy and provides a structured, gamified learning experience. Clear instructions, real-time feedback, and the quest-based system helped students navigate tasks like using GitHub features to submit pull requests and collaborating with the community. Our findings suggest that providing students with a supportive gamified environment that uses feedback and structured quests can help them navigate the OSS contribution process.

Paperid: 345, https://arxiv.org/pdf/2502.04801.pdf

Abstract:
Animated data videos have gained significant popularity in recent years. However, authoring data videos remains challenging due to the complexity of creating and coordinating diverse components (e.g., visualization, animation, and audio). Although numerous tools have been developed to streamline the process, there is a lack of comprehensive understanding of and reflection on their design paradigms to inform future development. To address this gap, we propose a framework for understanding data video creation tools along two dimensions: what data video components to create and coordinate, including visual, motion, narrative, and audio components, and how to support the creation and coordination. By applying the framework to analyze 46 existing tools, we summarized key design paradigms of creating and coordinating each component based on the varying work distribution for humans and AI in these tools. Finally, we share our detailed reflections, highlight gaps from a holistic view, and discuss future directions to address them.

Paperid: 346, https://arxiv.org/pdf/2502.04103.pdf

Abstract:
The rapid evolution of large language models (LLMs) has transformed human-computer interaction (HCI), but interaction with LLMs currently focuses mainly on text, while other multi-modal approaches remain under-explored. This paper introduces VTutor, an open-source Software Development Kit (SDK) that combines generative AI with advanced animation technologies to create engaging, adaptable, and realistic animated pedagogical agents (APAs) for human-AI multi-media interactions. VTutor leverages LLMs for real-time personalized feedback, advanced lip synchronization for natural speech alignment, and WebGL rendering for seamless web integration. Supporting various 2D and 3D character models, VTutor enables researchers and developers to design emotionally resonant, contextually adaptive learning agents. This toolkit enhances learner engagement, feedback receptivity, and human-AI interaction while promoting trustworthy AI principles in education. VTutor sets a new standard for next-generation APAs, offering an accessible, scalable solution for fostering meaningful and immersive human-AI interaction experiences. The VTutor project is open-sourced and welcomes community-driven contributions and showcases.

Paperid: 347, https://arxiv.org/pdf/2501.05600.pdf

Abstract:
Mentorship in open source software (OSS) is a vital, multifaceted process that includes onboarding newcomers, fostering skill development, and enhancing community building. This study examines task-focused mentoring strategies that help mentees complete their tasks and the ideal personal qualities and outcomes of good mentorship in OSS communities. We conducted two surveys to gather contributor perceptions: the first survey, with 70 mentors, mapped 17 mentoring challenges to 21 strategies that help support mentees. The second survey, with 85 contributors, assessed the importance of personal qualities and ideal mentorship outcomes. Our findings not only provide actionable strategies to help mentees overcome challenges and become successful contributors but also guide current and future mentors and OSS communities in understanding the personal qualities that are the cornerstone of good mentorship and the outcomes that mentor-mentee pairs should aspire to achieve.

Paperid: 348, https://arxiv.org/pdf/2501.03603.pdf

Abstract:
To facilitate the creation of compelling and engaging data stories, AI-powered tools have been introduced to automate the three stages in the workflow: analyzing data, organizing findings, and creating visuals. However, these tools rely on data-level information to derive inflexible relations between findings. Therefore, they often create one-size-fits-all data stories. In contrast, our formative study reveals that humans heavily rely on meta relations between these findings from diverse domain knowledge and narrative intent, going beyond datasets, to compose their findings into stylized data stories. Such a gap indicates the importance of introducing meta relations to elevate AI-created stories to a satisfactory level. Though necessary, it is still unclear where and how AI should be involved in working with humans on meta relations. To answer this question, we conducted an exploratory user study with Remex, an AI-powered data storytelling tool that suggests meta relations in the analysis stage and applies meta relations for data story organization. The user study reveals various findings about introducing AI for meta relations into the storytelling workflow, such as the benefit of considering meta relations and their diverse expected usage scenarios. Finally, the paper concludes with lessons and suggestions about applying meta relations to compose data stories to hopefully inspire future research.

Paperid: 349, https://arxiv.org/pdf/2506.12347.pdf

Abstract:
Software Engineering Agents (SWE agents) can autonomously perform development tasks on benchmarks like SWE Bench, but still face challenges when tackling complex and ambiguous real-world tasks. Consequently, SWE agents are often designed to allow interactivity with developers, enabling collaborative problem-solving. To understand how developers collaborate with SWE agents and the communication challenges that arise in such interactions, we observed 19 developers using an in-IDE agent to resolve 33 open issues in repositories to which they had previously contributed. Participants successfully resolved about half of these issues, with participants solving issues incrementally having greater success than those using a one-shot approach. Participants who actively collaborated with the agent and iterated on its outputs were also more successful, though they faced challenges in trusting the agent's responses and collaborating on debugging and testing. These results have implications for successful developer-agent collaborations, and for the design of more effective SWE agents.

Paperid: 350, https://arxiv.org/pdf/2506.06576.pdf

Abstract:
The rapid rise of compound AI systems (a.k.a., AI agents) is reshaping the labor market, raising concerns about job displacement, diminished human agency, and overreliance on automation. Yet, we lack a systematic understanding of the evolving landscape. In this paper, we address this gap by introducing a novel auditing framework to assess which occupational tasks workers want AI agents to automate or augment, and how those desires align with the current technological capabilities. Our framework features an audio-enhanced mini-interview to capture nuanced worker desires and introduces the Human Agency Scale (HAS) as a shared language to quantify the preferred level of human involvement. Using this framework, we construct the WORKBank database, building on the U.S. Department of Labor's O*NET database, to capture preferences from 1,500 domain workers and capability assessments from AI experts across over 844 tasks spanning 104 occupations. Jointly considering the desire and technological capability divides tasks in WORKBank into four zones: Automation "Green Light" Zone, Automation "Red Light" Zone, R&D Opportunity Zone, Low Priority Zone. This highlights critical mismatches and opportunities for AI agent development. Moving beyond a simple automate-or-not dichotomy, our results reveal diverse HAS profiles across occupations, reflecting heterogeneous expectations for human involvement. Moreover, our study offers early signals of how AI agent integration may reshape the core human competencies, shifting from information-focused skills to interpersonal ones. These findings underscore the importance of aligning AI agent development with human desires and preparing workers for evolving workplace dynamics.
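
The four-zone division described above can be read as a two-by-two split on worker desire and AI capability. The sketch below illustrates that reading only: the abstract names the zones but does not spell out the mapping or any cut-off, so the assignment of quadrants to zones, the 1-to-5 rating scale, and the threshold of 3.0 are all assumptions, and `classify_task` is a hypothetical helper rather than part of WORKBank.

```python
from enum import Enum


class Zone(Enum):
    GREEN_LIGHT = 'Automation "Green Light" Zone'
    RED_LIGHT = 'Automation "Red Light" Zone'
    RD_OPPORTUNITY = "R&D Opportunity Zone"
    LOW_PRIORITY = "Low Priority Zone"


def classify_task(desire: float, capability: float, threshold: float = 3.0) -> Zone:
    """Place a task in one of four zones from two ratings.

    Assumptions (not stated in the abstract): both ratings are on a
    1-to-5 scale, and a rating at or above `threshold` counts as high.
    """
    high_desire = desire >= threshold
    high_capability = capability >= threshold
    if high_desire and high_capability:
        return Zone.GREEN_LIGHT      # workers want it and AI can do it
    if high_capability:
        return Zone.RED_LIGHT        # AI can do it but workers do not want it
    if high_desire:
        return Zone.RD_OPPORTUNITY   # workers want it but AI cannot do it yet
    return Zone.LOW_PRIORITY         # neither wanted nor currently feasible


if __name__ == "__main__":
    # Hypothetical ratings for illustration only.
    for desire, capability in [(4.5, 4.0), (2.0, 4.2), (4.1, 1.8), (1.5, 2.0)]:
        print(desire, capability, classify_task(desire, capability).value)
```

A single shared threshold keeps the sketch short; separate cut-offs for desire and capability would be an equally plausible design.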

Paperid: 351, https://arxiv.org/pdf/2506.06381.pdf

Abstract:
Cyber-Physical Systems (CPS) increasingly depend on advanced AI techniques to operate in critical applications. However, traditional verification and validation methods often struggle to handle the unpredictable and dynamic nature of AI components. In this paper, we introduce DURA-CPS, a novel framework that employs multi-role orchestration to automate the iterative assurance process for AI-powered CPS. By assigning specialized roles (e.g., safety monitoring, security assessment, fault injection, and recovery planning) to dedicated agents within a simulated environment, DURA-CPS continuously evaluates and refines AI behavior against a range of dependability requirements. We demonstrate the framework through a case study involving an autonomous vehicle navigating an intersection with an AI-based planner. Our results show that DURA-CPS effectively detects vulnerabilities, manages performance impacts, and supports adaptive recovery strategies, thereby offering a structured and extensible solution for rigorous V&V in safety- and security-critical systems.

Paperid: 352, https://arxiv.org/pdf/2505.22539.pdf

Abstract:
Recent progress in mixed reality (MR) and robotics is enabling increasingly sophisticated forms of human-robot collaboration. Building on these developments, we introduce a novel MR framework that allows multiple quadruped robots to operate in semantically diverse environments via a MR interface. Our system supports collaborative tasks involving drawers, swing doors, and higher-level infrastructure such as light switches. A comprehensive user study verifies both the design and usability of our app, with participants giving a "good" or "very good" rating in almost all cases. Overall, our approach provides an effective and intuitive framework for MR-based multi-robot collaboration in complex, real-world scenarios.

Paperid: 353, https://arxiv.org/pdf/2505.14031.pdf

Abstract:
A large portion of texts in the world is written in English, but readers who see English as a Foreign Language (EFL) often struggle to read texts written in English accurately and swiftly. In many countries, EFL readers seek help from professional teachers and mentors, which is limited and costly. In this paper, we explore how an intelligent reading tool can assist EFL readers. To support our research agenda, we conducted a case study with EFL readers in South Korea. We first developed an LLM-based reading tool based on prior literature. We then revised the tool based on the feedback from a study with 15 South Korean EFL readers. The final tool, named Reading.help, helps EFL readers comprehend complex sentences and paragraphs with on-demand and proactive explanations. We finally evaluated the tool with 5 EFL readers and 2 EFL education professionals. Our findings suggest Reading.help could potentially help EFL readers self-learn English when they do not have access to any external support.

Paperid: 354, https://arxiv.org/pdf/2505.00855.pdf

Abstract:
An individual's data can reveal facets of behavior and identity, but its interpretation is context dependent. We can easily identify various self-tracking applications that help people reflect on their lives. However, self-tracking confined to one person's data source may fall short in terms of objectivity and insights from various perspectives. To address this, we examine how those interpretations about a person's data can be augmented when the data are juxtaposed with that of others using anonymized online calendar logs from a schedule management app. We develop CALTREND, a visual analytics system that compares an individual's anonymized online schedule logs with those from other people. Using CALTREND as a probe, we conduct a study with two domain experts, one in information technology and one in Korean herbal medicine. We report our observations on how comparative views help enrich the characterization of an individual based on the experts' comments. We find that juxtaposing personal data with others' can potentially lead to diverse interpretations of one dataset shaped by domain-specific mental models.

Paperid: 355, https://arxiv.org/pdf/2505.00455.pdf

Abstract:
Effective data visualization requires not only technical proficiency but also a deep understanding of the domain-specific context in which data exists. This context often includes tacit knowledge about data provenance, quality, and intended use, which is rarely explicit in the dataset itself. We present the Data Therapist, a web-based tool that helps domain experts externalize this implicit knowledge through a mixed-initiative process combining iterative Q&A with interactive annotation. Powered by a large language model, the system analyzes user-supplied datasets, prompts users with targeted questions, and allows annotation at varying levels of granularity. The resulting structured knowledge base can inform both human and automated visualization design. We evaluated the tool in a qualitative study involving expert pairs from Molecular Biology, Accounting, Political Science, and Usable Security. The study revealed recurring patterns in how experts reason about their data and highlights areas where AI support can improve visualization design.

Paperid: 356, https://arxiv.org/pdf/2503.13843.pdf

Abstract:
The current state of modern web interfaces, especially with regard to accessibility-focused usage, is extremely lacking. Traditional methods for web interaction, such as scripting languages and screen readers, often lack the flexibility to handle dynamic content or the intelligence to interpret high-level user goals. To address these limitations, we introduce WebNav, a novel agent for multi-modal web navigation. WebNav leverages a dual Large Language Model (LLM) architecture to translate natural language commands into precise, executable actions on a graphical user interface. The system combines vision-based context from screenshots with a dynamic DOM-labeling browser extension to robustly identify interactive elements. A high-level 'Controller' LLM strategizes the next step toward a user's goal, while a second 'Assistant' LLM generates the exact parameters for execution. This separation of concerns allows for sophisticated task decomposition and action formulation. Our work presents the complete architecture and implementation of WebNav, demonstrating a promising approach to creating more intelligent web automation agents.

Paperid: 357, https://arxiv.org/pdf/2503.09338.pdf

Abstract:
Recent literature has seen a considerable uptick in $\textit{Differentially Private Natural Language Processing}$ (DP NLP). This includes DP text privatization, where potentially sensitive input texts are transformed under DP to achieve privatized output texts that ideally mask sensitive information $\textit{and}$ maintain original semantics. Despite continued work to address the open challenges in DP text privatization, there remains a scarcity of work addressing user perceptions of this technology, a crucial aspect which serves as the final barrier to practical adoption. In this work, we conduct a survey study with 721 laypersons around the globe, investigating how the factors of $\textit{scenario}$, $\textit{data sensitivity}$, $\textit{mechanism type}$, and $\textit{reason for data collection}$ impact user preferences for text privatization. We learn that while all these factors play a role in influencing privacy decisions, users are highly sensitive to the utility and coherence of the private output texts. Our findings highlight the socio-technical factors that must be considered in the study of DP NLP, opening the door to further user-based investigations going forward.

Paperid: 358, https://arxiv.org/pdf/2503.00618.pdf

Abstract:
Automated Program Repair (APR) holds the promise of alleviating the burden of debugging and fixing software bugs. Despite this, developers still need to manually inspect each patch to confirm its correctness, which is tedious and time-consuming. This challenge is exacerbated in the presence of plausible patches, which accidentally pass test cases but may not correctly fix the bug. To address this challenge, we propose an interactive approach called iFix to facilitate patch understanding and comparison based on their runtime difference. iFix performs static analysis to identify runtime variables related to the buggy statement and captures their runtime values during execution for each patch. These values are then aligned across different patch candidates, allowing users to compare and contrast their runtime behavior. To evaluate iFix, we conducted a within-subjects user study with 28 participants. Compared with manual inspection and a state-of-the-art interactive patch filtering technique, iFix reduced participants' task completion time by 36% and 33% while also improving their confidence by 50% and 20%, respectively. Besides, quantitative experiments demonstrate that iFix improves the ranking of correct patches by at least 39% compared with other patch ranking methods and is generalizable to different APR tools.
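
The alignment step described above, where captured runtime values are lined up across patch candidates, can be illustrated with a minimal sketch. It is not iFix's implementation: the trace format (one dictionary of variable values per patch), the `MISSING` marker, and the function name are hypothetical, and the static analysis and instrumentation that would produce the traces are out of scope here.

```python
# Hypothetical marker for a variable that one patch never assigned at runtime.
MISSING = "<not captured>"


def align_runtime_values(traces):
    """Align captured variable values across patch candidates.

    traces: {patch_name: {variable_name: value}} (assumed format).
    Returns a list of (variable, {patch_name: value}, differs) rows, where
    `differs` is True when the patches disagree on that variable.
    """
    variables = sorted({var for trace in traces.values() for var in trace})
    rows = []
    for var in variables:
        row = {patch: trace.get(var, MISSING) for patch, trace in traces.items()}
        # Compare by repr so that unhashable values (lists, dicts) are supported.
        differs = len({repr(value) for value in row.values()}) > 1
        rows.append((var, row, differs))
    return rows


if __name__ == "__main__":
    traces = {
        "patch_1": {"index": 3, "total": 10},
        "patch_2": {"index": 4, "total": 10},
    }
    for var, row, differs in align_runtime_values(traces):
        flag = "DIFF" if differs else "same"
        print(f"{var:>6} [{flag}] {row}")
```

Flagging only the rows where patches disagree is one plausible way to focus a reviewer's attention on behavioral differences between plausible patches.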

Paperid: 359, https://arxiv.org/pdf/2502.20284.pdf

Abstract:
Large Language Models (LLMs) are increasingly used for planning tasks, offering unique capabilities not found in classical planners such as generating explanations and iterative refinement. However, trust--a critical factor in the adoption of planning systems--remains underexplored in the context of LLM-based planning tasks. This study bridges this gap by comparing human trust in LLM-based planners with classical planners through a user study in a Planning Domain Definition Language (PDDL) domain. Combining subjective measures, such as trust questionnaires, with objective metrics like evaluation accuracy, our findings reveal that correctness is the primary driver of trust and performance. Explanations provided by the LLM improved evaluation accuracy but had limited impact on trust, while plan refinement showed potential for increasing trust without significantly enhancing evaluation accuracy.

Paperid: 360, https://arxiv.org/pdf/2502.16895.pdf

Abstract:
Teaching scientific concepts is essential but challenging, and analogies help students connect new concepts to familiar ideas. Advancements in large language models (LLMs) enable generating analogies, yet their effectiveness in education remains underexplored. In this paper, we first conducted a two-stage study involving high school students and teachers to assess the effectiveness of LLM-generated analogies in biology and physics through a controlled in-class test and a classroom field study. Test results suggested that LLM-generated analogies could enhance student understanding particularly in biology, but require teachers' guidance to prevent over-reliance and overconfidence. Classroom experiments suggested that teachers could refine LLM-generated analogies to their satisfaction and inspire new analogies from generated ones, encouraged by positive classroom feedback and homework performance boosts. Based on findings, we developed and evaluated a practical system to help teachers generate and refine teaching analogies. We discussed future directions for developing and evaluating LLM-supported teaching and learning by analogy.

Paperid: 361, https://arxiv.org/pdf/2502.00858.pdf

Abstract:
Effective integration of AI agents into daily life requires them to understand and adapt to individual human preferences, particularly in collaborative roles. Although recent studies on embodied intelligence have advanced significantly, they typically adopt generalized approaches that overlook personal preferences in planning. We address this limitation by developing agents that not only learn preferences from few demonstrations but also learn to adapt their planning strategies based on these preferences. Our research leverages the observation that preferences, though implicitly expressed through minimal demonstrations, can generalize across diverse planning scenarios. To systematically evaluate this hypothesis, we introduce the Preference-based Planning (PbP) benchmark, an embodied benchmark featuring hundreds of diverse preferences spanning from atomic actions to complex sequences. Our evaluation of SOTA methods reveals that while symbol-based approaches show promise in scalability, significant challenges remain in learning to generate and execute plans that satisfy personalized preferences. We further demonstrate that incorporating learned preferences as intermediate representations in planning significantly improves the agent's ability to construct personalized plans. These findings establish preferences as a valuable abstraction layer for adaptive planning, opening new directions for research in preference-guided plan generation and execution.

Paperid: 362, https://arxiv.org/pdf/2501.15147.pdf

Abstract:
Recently, numerous benchmarks have been developed to evaluate the logical reasoning abilities of large language models (LLMs). However, assessing the equally important creative capabilities of LLMs is challenging due to the subjective, diverse, and data-scarce nature of creativity, especially in multimodal scenarios. In this paper, we consider the comprehensive pipeline for evaluating the creativity of multimodal LLMs, with a focus on suitable evaluation platforms and methodologies. First, we find the Oogiri game, a creativity-driven task requiring humor, associative thinking, and the ability to produce unexpected responses to text, images, or both. This game aligns well with the input-output structure of modern multimodal LLMs and benefits from a rich repository of high-quality, human-annotated creative responses, making it an ideal platform for studying LLM creativity. Next, beyond using the Oogiri game for standard evaluations like ranking and selection, we propose LoTbench, an interactive, causality-aware evaluation framework, to further address some intrinsic risks in standard evaluations, such as information leakage and limited interpretability. The proposed LoTbench not only quantifies LLM creativity more effectively but also visualizes the underlying creative thought processes. Our results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable. Furthermore, we observe a strong correlation between results from the multimodal cognition benchmark MMMU and LoTbench, but only a weak connection with traditional creativity metrics. This suggests that LoTbench better aligns with human cognitive theories, highlighting cognition as a critical foundation in the early stages of creativity and enabling the bridging of diverse concepts. https://lotbench.github.io

Paperid: 363, https://arxiv.org/pdf/2504.17934.pdf

Abstract:
The rise of Large Language Models (LLMs) has revolutionized Graphical User Interface (GUI) automation through LLM-powered GUI agents, yet their ability to process sensitive data with limited human oversight raises significant privacy and security risks. This position paper identifies three key risks of GUI agents and examines how they differ from traditional GUI automation and general autonomous agents. Despite these risks, existing evaluations focus primarily on performance, leaving privacy and security assessments largely unexplored. We review current evaluation metrics for both GUI and general LLM agents and outline five key challenges in integrating human evaluators for GUI agent assessments. To address these gaps, we advocate for a human-centered evaluation framework that incorporates risk assessments, enhances user awareness through in-context consent, and embeds privacy and security considerations into GUI agent design and evaluation.

Paperid: 364, https://arxiv.org/pdf/2504.13904.pdf

Abstract:
We hypothesize that optimal system responses emerge from adaptive strategies grounded in causal and counterfactual knowledge. Counterfactual inference allows us to create hypothetical scenarios to examine the effects of alternative system responses. We enhance this process through causal discovery, which identifies the strategies informed by the underlying causal structure that govern system behaviors. Moreover, we consider the psychological constructs and unobservable noises that might be influencing user-system interactions as latent factors. We show that these factors can be effectively estimated. We employ causal discovery to identify strategy-level causal relationships among user and system utterances, guiding the generation of personalized counterfactual dialogues. We model the user utterance strategies as causal factors, enabling system strategies to be treated as counterfactual actions. Furthermore, we optimize policies for selecting system responses based on counterfactual data. Our results using a real-world dataset on social good demonstrate significant improvements in persuasive system outcomes, with increased cumulative rewards validating the efficacy of causal discovery in guiding personalized counterfactual inference and optimizing dialogue policies for a persuasive dialogue system.

Paperid: 365, https://arxiv.org/pdf/2504.11281.pdf

Abstract:
A Large Language Model (LLM) powered GUI agent is a specialized autonomous system that performs tasks on the user's behalf according to high-level instructions. It does so by perceiving and interpreting the graphical user interfaces (GUIs) of relevant apps, often visually, inferring necessary sequences of actions, and then interacting with GUIs by executing the actions such as clicking, typing, and tapping. To complete real-world tasks, such as filling forms or booking services, GUI agents often need to process and act on sensitive user data. However, this autonomy introduces new privacy and security risks. Adversaries can inject malicious content into the GUIs that alters agent behaviors or induces unintended disclosures of private information. These attacks often exploit the discrepancy between visual saliency for agents and human users, or the agent's limited ability to detect violations of contextual integrity in task automation. In this paper, we characterized six types of such attacks, and conducted an experimental study to test these attacks with six state-of-the-art GUI agents, 234 adversarial webpages, and 39 human participants. Our findings suggest that GUI agents are highly vulnerable, particularly to contextually embedded threats. Moreover, human users are also susceptible to many of these attacks, indicating that simple human oversight may not reliably prevent failures. This misalignment highlights the need for privacy-aware agent design. We propose practical defense strategies to inform the development of safer and more reliable GUI agents.

Paperid: 366, https://arxiv.org/pdf/2504.07971.pdf

Abstract:
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion on human-AI system evaluation design options, we present an evaluation card SPHERE, which encompasses five key dimensions: 1) What is being evaluated?; 2) How is the evaluation conducted?; 3) Who is participating in the evaluation?; 4) When is evaluation conducted?; 5) How is evaluation validated? We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.

Paperid: 367, https://arxiv.org/pdf/2503.16544.pdf

Abstract:
Tailoring persuasive conversations to users leads to more effective persuasion. However, existing dialogue systems often struggle to adapt to dynamically evolving user states. This paper presents a novel method that leverages causal discovery and counterfactual reasoning for optimizing system persuasion capability and outcomes. We employ the Greedy Relaxation of the Sparsest Permutation (GRaSP) algorithm to identify causal relationships between user and system utterance strategies, treating user strategies as states and system strategies as actions. GRaSP identifies user strategies as causal factors influencing system responses, which inform Bidirectional Conditional Generative Adversarial Networks (BiCoGAN) in generating counterfactual utterances for the system. Subsequently, we use the Dueling Double Deep Q-Network (D3QN) model to utilize counterfactual data to determine the best policy for selecting system utterances. Our experiments with the PersuasionForGood dataset show measurable improvements in persuasion outcomes using our approach over baseline methods. The observed increase in cumulative rewards and Q-values highlights the effectiveness of causal discovery in enhancing counterfactual reasoning and optimizing reinforcement learning policies for online dialogue systems.
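
For readers unfamiliar with D3QN, the two standard ingredients its name refers to can be sketched in a few lines: the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A) and the Double DQN target, in which the online network picks the next action and the target network evaluates it. This is a generic textbook sketch using plain Python lists, not the authors' model; the function names and the toy numbers are illustrative assumptions.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def double_dqn_target(reward, gamma, next_q_online, next_q_target, done):
    """Double DQN target: select the action with the online network,
    evaluate it with the target network."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda i: next_q_online[i])
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    # Toy numbers only; in the setting above, actions would be system utterance strategies.
    print(dueling_q_values(1.0, [1.0, 2.0, 3.0]))
    print(double_dqn_target(1.0, 0.5, [0.1, 0.9], [2.0, 4.0], done=False))
```

Decoupling action selection from action evaluation is the usual motivation for the "double" part, since it reduces the overestimation that comes from taking a max over noisy Q-values.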

Paperid: 368, https://arxiv.org/pdf/2503.08663.pdf

Abstract:
Until recently, robotics safety research was predominantly about collision avoidance and hazard reduction in the immediate vicinity of a robot. Since the advent of large vision and language models (VLMs), robots are now also capable of higher-level semantic scene understanding and natural language interactions with humans. Despite their known vulnerabilities (e.g. hallucinations or jail-breaking), VLMs are being handed control of robots capable of physical contact with the real world. This can lead to dangerous behaviors, making semantic safety for robots a matter of immediate concern. Our contributions in this paper are twofold. First, to address these emerging risks, we release the ASIMOV Benchmark, a large-scale and comprehensive collection of datasets for evaluating and improving semantic safety of foundation models serving as robot brains. Our data generation recipe is highly scalable: by leveraging text and image generation techniques, we generate undesirable situations from real-world visual scenes and human injury reports from hospitals. Second, we develop a framework to automatically generate robot constitutions from real-world data to steer a robot's behavior using Constitutional AI mechanisms. We propose a novel auto-amending process that is able to introduce nuances in written rules of behavior; this can lead to increased alignment with human preferences on behavior desirability and safety. We explore trade-offs between generality and specificity across a diverse set of constitutions of different lengths, and demonstrate that a robot is able to effectively reject unconstitutional actions. We measure a top alignment rate of 84.3% on the ASIMOV Benchmark using generated constitutions, outperforming no-constitution baselines and human-written constitutions. Data is available at asimov-benchmark.github.io

Paperid: 369, https://arxiv.org/pdf/2503.00858.pdf

Abstract:
While large language models (LLMs) are increasingly used to assist users in various tasks through natural language interactions, these interactions often fall short due to LLMs' limited ability to infer contextual nuances and user intentions, unlike humans. To address this challenge, we draw inspiration from the Gricean Maxims--human communication theory that suggests principles of effective communication--and aim to derive design insights for enhancing human-AI interactions (HAI). Through participatory design workshops with communication experts, designers, and end-users, we identified ways to apply these maxims across the stages of the HAI cycle. Our findings include reinterpreted maxims tailored to human-LLM contexts and nine actionable design considerations categorized by interaction stage. These insights provide a concrete framework for designing more cooperative and user-centered LLM-based systems, bridging theoretical foundations in communication with practical applications in HAI.

Paperid: 370, https://arxiv.org/pdf/2503.00791.pdf

Abstract:
Broad exploration of references is critical in the visual design process. While text-to-image (T2I) models offer efficiency and customization of exploration, they often limit support for divergence in exploration. We conducted a formative study (N=6) to investigate the limitations of current interaction with the T2I model for broad exploration and found that designers struggle to articulate exploratory intentions and manage iterative, non-linear workflows. To address these challenges, we developed Expandora. Users can specify their exploratory intentions and desired diversity levels through structured input, and using an LLM-based pipeline, Expandora generates tailored prompt variations. The results are displayed in a mindmap-like interface that encourages non-linear workflows. A user study (N=8) demonstrated that Expandora significantly increases prompt diversity, the number of prompts users tried within a given time, and user satisfaction compared to the baseline. Nonetheless, its limitations in supporting convergent thinking suggest opportunities for holistically improving creative processes.

Paperid: 371, https://arxiv.org/pdf/2503.00483.pdf

Abstract:
Recent advances in AI reasoning models provide unprecedented transparency into their decision-making processes, transforming them from traditional black-box systems into models that articulate step-by-step chains of thought rather than producing opaque outputs. This shift has the potential to improve software quality, explainability, and trust in AI-augmented development. However, software engineers rarely have the time or cognitive bandwidth to analyze, verify, and interpret every AI-generated thought in detail. Without an effective interface, this transparency could become a burden rather than a benefit. In this paper, we propose a vision for structuring the interaction between AI reasoning models and software engineers to maximize trust, efficiency, and decision-making power. We argue that simply exposing AI's reasoning is not enough -- software engineers need tools and frameworks that selectively highlight critical insights, filter out noise, and facilitate rapid validation of key assumptions. To illustrate this challenge, we present motivating examples in which AI reasoning models state their assumptions when deciding which external library to use and produce divergent reasoning paths and recommendations about security vulnerabilities, highlighting the need for an interface that prioritizes actionable insights while managing uncertainty and resolving conflicts. We then outline a research roadmap for integrating automated summarization, assumption validation, and multi-model conflict resolution into software engineering workflows. Achieving this vision will unlock the full potential of AI reasoning models to enable software engineers to make faster, more informed decisions without being overwhelmed by unnecessary detail.

Paperid: 372, https://arxiv.org/pdf/2502.16794.pdf

Abstract:
Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems. Demo and code available: https://aad-llm.github.io.

Paperid: 373, https://arxiv.org/pdf/2506.19107.pdf

Abstract:
With the proliferation of large language model (LLM) applications since 2022, their use in education has sparked both excitement and concern. Recent studies consistently highlight that students' (mis)use of LLMs can hinder learning outcomes. This work aims to teach students how to effectively prompt LLMs to improve their learning. We first proposed pedagogical prompting, a theoretically-grounded new concept to elicit learning-oriented responses from LLMs. To move from concept design to a proof-of-concept learning intervention in real educational settings, we selected early undergraduate CS education (CS1/CS2) as the example context. We began with a formative survey study with instructors (N=36) teaching early-stage undergraduate-level CS courses to inform the instructional design based on classroom needs. Based on their insights, we designed and developed a learning intervention through an interactive system with scenario-based instruction to train pedagogical prompting skills. Finally, we evaluated its instructional effectiveness through a user study with CS novice students (N=22) using pre/post-tests. Through mixed methods analyses, our results indicate significant improvements in learners' LLM-based pedagogical help-seeking skills, along with positive attitudes toward the system and increased willingness to use pedagogical prompts in the future. Our contributions include (1) a theoretical framework of pedagogical prompting; (2) empirical insights into current instructor attitudes toward pedagogical prompting; and (3) a learning intervention design with an interactive learning tool and scenario-based instruction leading to promising results on teaching LLM-based help-seeking. Our approach is scalable for broader implementation in classrooms and has the potential to be integrated into tools like ChatGPT as an on-boarding experience to encourage learning-oriented use of generative AI.

Paperid: 374, https://arxiv.org/pdf/2506.00717.pdf

Abstract:
People use videos to learn new recipes, exercises, and crafts. Such videos remain difficult for blind and low vision (BLV) people to follow as they rely on visual comparison. Our observations of visual rehabilitation therapists (VRTs) guiding BLV people to follow how-to videos revealed that VRTs provide both proactive and responsive support including detailed descriptions, non-visual workarounds, and progress feedback. We propose Vid2Coach, a system that transforms how-to videos into wearable camera-based assistants that provide accessible instructions and mixed-initiative feedback. From the video, Vid2Coach generates accessible instructions by augmenting narrated instructions with demonstration details and completion criteria for each step. It then uses retrieval-augmented-generation to extract relevant non-visual workarounds from BLV-specific resources. Vid2Coach then monitors user progress with a camera embedded in commercial smart glasses to provide context-aware instructions, proactive feedback, and answers to user questions. BLV participants (N=8) using Vid2Coach completed cooking tasks with 58.5% fewer errors than when using their typical workflow and wanted to use Vid2Coach in their daily lives. Vid2Coach demonstrates an opportunity for AI visual assistance that strengthens rather than replaces non-visual expertise.

Paperid: 375, https://arxiv.org/pdf/2505.17423.pdf

Abstract:
Many decision-making tasks, where both accuracy and efficiency matter, still require human supervision. For example, tasks like traffic officers reviewing hour-long dashcam footage or researchers screening conference videos can benefit from concise summaries that reduce cognitive load and save time. Yet current vision-language models (VLMs) often produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries' utility in downstream tasks. We address these gaps with Video-to-text Information Bottleneck Evaluation (VIBE), an annotation-free method that scores VLM outputs using two metrics: grounding (how well the summary aligns with visual content) and utility (how informative it is for the task). VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making. Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that summaries selected by VIBE consistently improve performance, boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared to naive VLM summaries or raw video.
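
The selection step described above, ranking sampled VLM outputs by a grounding score and a utility score, can be sketched as follows. The abstract does not say how the two scores are computed or combined, so the precomputed scores, the weighted-sum rule, and the `weight` parameter here are assumptions for illustration only.

```python
def select_summary(candidates, weight=0.5):
    """Pick the candidate summary with the best combined score.

    candidates: list of (text, grounding, utility) tuples with precomputed
    scores (how the scores are obtained is outside this sketch).
    weight: hypothetical trade-off between grounding and utility.
    """
    def combined(candidate):
        _, grounding, utility = candidate
        return weight * grounding + (1.0 - weight) * utility

    return max(candidates, key=combined)[0]


if __name__ == "__main__":
    sampled = [
        ("long, well-grounded but uninformative summary", 0.9, 0.2),
        ("concise summary covering the key event", 0.6, 0.7),
        ("informative but loosely grounded summary", 0.3, 0.8),
    ]
    print(select_summary(sampled))
```

A rank-based combination (for example, summing each candidate's rank under the two scores) would be an equally reasonable choice when the two scores are on different scales.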

Paperid: 376, https://arxiv.org/pdf/2505.15089.pdf

Abstract:
The digital transformation of smart cities and workplaces requires effective integration of physical and cyber spaces, yet existing digital twin solutions remain limited in supporting real-time, multi-user collaboration. While metaverse platforms enable shared virtual experiences, they have not supported comprehensive integration of IoT sensors on physical spaces, especially for large-scale smart architectural environments. This paper presents a digital twin environment that integrates Kajima Corp.'s smart building facility "The GEAR" in Singapore with a commercial metaverse platform Cluster. Our system consists of three key components: a standardized IoT sensor platform, a real-time data relay system, and an environmental data visualization framework. Quantitative end-to-end latency measurements confirm the feasibility of our approach for real-world applications in large architectural spaces. The proposed framework enables new forms of collaboration that transcend spatial constraints, advancing the development of next-generation interactive environments.

Paperid: 377, https://arxiv.org/pdf/2505.08083.pdf

Abstract:
Culturally Relevant Pedagogy (CRP) is vital in K-12 education, yet teachers struggle to put CRP into practice due to time, training, and resource gaps. This study explores how Large Language Models (LLMs) can address these barriers by introducing CulturAIEd, an LLM tool that assists teachers in adapting AI literacy curricula to students' cultural contexts. Through an exploratory pilot with four K-12 teachers, we examined CulturAIEd's impact on CRP integration. Results showed CulturAIEd enhanced teachers' confidence in identifying opportunities for cultural responsiveness in learning activities and making culturally responsive modifications to existing activities. They valued CulturAIEd's streamlined integration of student demographic information and its immediate, actionable feedback, which could result in high implementation efficiency. This exploration of teacher-AI collaboration highlights how LLMs can help teachers incorporate CRP components into their instructional practices efficiently, especially in global priorities for future-ready education, such as AI literacy.

Paperid: 378, https://arxiv.org/pdf/2504.21337.pdf

Abstract:
Technological advances are redefining the relationship between physical and virtual space. Traditionally, when users engage in virtual reality (VR), they are completely cut off from the physical space; similarly, they are unable to access virtual experiences while engaged in physical activities. However, modern multi-platform metaverse environments allow simultaneous participation through mobile devices, creating new opportunities for integrated experiences. This study introduces the concept of "cross-reality lifestyles" to examine how users actively combine their physical and virtual activities. We identify three patterns of integration: 1) amplification: one space enhances experiences in the other; 2) complementary: spaces offer different but equally valuable alternatives; and 3) emergence: simultaneous engagement creates entirely new experiences. By analyzing commercial platforms, we create a technical framework that addresses content design, platform infrastructure, and device interfaces. This framework guides the development of cross-reality applications while demonstrating how metaverse technologies blur the traditional boundaries between physical and virtual experiences.

Paperid: 379, https://arxiv.org/pdf/2504.21332.pdf

Abstract:
Metaverse platforms are rapidly evolving to provide immersive spaces for user interaction and content creation. However, the generation of dynamic and interactive 3D objects remains challenging due to the need for advanced 3D modeling and programming skills. To address this challenge, we present MagicCraft, a system that generates functional 3D objects from natural language prompts for metaverse platforms. MagicCraft uses generative AI models to manage the entire content creation pipeline: converting user text descriptions into images, transforming images into 3D models, predicting object behavior, and assigning necessary attributes and scripts. It also provides an interactive interface for users to refine generated objects by adjusting features such as orientation, scale, seating positions, and grip points. Implemented on Cluster, a commercial metaverse platform, MagicCraft was evaluated by 7 expert CG designers and 51 general users. Results show that MagicCraft significantly reduces the time and skill required to create 3D objects. Users with no prior experience in 3D modeling or programming successfully created complex, interactive objects and deployed them in the metaverse. Expert feedback highlighted the system's potential to improve content creation workflows and support rapid prototyping. By integrating AI-generated content into metaverse platforms, MagicCraft makes 3D content creation more accessible.

Paperid: 380, https://arxiv.org/pdf/2504.17705.pdf

Abstract:
Online experiments using metaverse platforms have gained significant traction in Human-Computer Interaction and Virtual Reality (VR) research. However, current research workflows are highly fragmented, as researchers must use separate tools for system implementation, participant recruitment, experiment execution, and data collection, reducing consistency and increasing workload. We present LUIDA (Large-scale Unified Infrastructure for Digital Assessments), a metaverse-based framework that integrates these fragmented processes. LUIDA automatically allocates interconnected virtual environments for parallel experiment execution and provides implementation templates adaptable to various VR research domains, requiring minimal metaverse development expertise. Our evaluation included two studies using a prototype built on Cluster, the commercial metaverse platform. First, VR researchers using LUIDA to develop and run experiments reported high usability scores (SUS: 73.75) and moderate workload (NASA-TLX: 24.11) for overall usage, with interviews confirming streamlined workflows compared to traditional laboratory experiments. Second, we conducted three replicated experiments with public Cluster users, each recruiting approximately 200 participants within one week. These experiments produced results that closely matched the original studies, validating the experimental integrity of LUIDA across research domains. After technical refinements, we plan to release LUIDA as an open platform, providing a standardized protocol to improve research efficiency and experimental reproducibility in VR studies.
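
For readers unfamiliar with the reported usability figure, the sketch below shows the standard System Usability Scale (SUS) scoring rule for one respondent: ten items answered on a 1 to 5 scale, with odd-numbered items contributing (answer - 1), even-numbered items contributing (5 - answer), and the sum multiplied by 2.5 to give a 0 to 100 score. The function names are illustrative, and treating the study-level figure as the mean over respondents is an assumption; the abstract does not describe its aggregation.

```python
def sus_score(responses):
    """Score one SUS questionnaire (10 answers, each an integer from 1 to 5)."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each answer must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if item_number % 2 == 1 else (5 - answer)
    return total * 2.5


def mean_sus(all_responses):
    """Average the per-respondent scores (assumed study-level aggregation)."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)
```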

Paperid: 381, https://arxiv.org/pdf/2504.16419.pdf

Abstract:
Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets often generate annotation information through automatic labeling, which commonly results in inaccurate GUI element BBox annotations, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, existing GUI datasets only provide BBox annotations visually, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation ensures accurate localization of GUI elements in cases of occlusion and overlapping elements by extracting BGRA four-channel bitmap annotations. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that PixelWeb achieves performance on the mAP95 metric that is 3-7 times better than existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.

Paperid: 382, https://arxiv.org/pdf/2504.16273.pdf

Abstract:
Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches. Our results indicate that LLMs exhibit superior robustness, and we investigate the key factors contributing to the promising LLM-based approaches. Furthermore, in this setting, we identify gaps in LLM preferences that emerge in particular intersections of sex and race. LLMs generally exhibit sex-based differences, but they are most pronounced in certain racial groups. These findings suggest that LLMs encode demographic preferences that may emerge in specific clinical contexts or particular combinations of characteristics.

Paperid: 383, https://arxiv.org/pdf/2504.13392.pdf

Abstract:
State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks -- offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for broad applicability, yielding conventional output that may limit creative exploration. They also employ interaction methods that may be difficult for beginners. Given that creative end users often operate in diverse, context-specific ways that are often unpredictable, more variation and personalization are necessary. We introduce POET, a real-time interactive tool that (1) automatically discovers dimensions of homogeneity in text-to-image generative models, (2) expands these dimensions to diversify the output space of generated images, and (3) learns from user feedback to personalize expansions. An evaluation with 28 users spanning four creative task domains demonstrated POET's ability to generate results with higher perceived diversity and help users reach satisfaction in fewer prompts during creative tasks, thereby prompting them to deliberate and reflect more on a wider range of possible produced results during the co-creative process. Focusing on visual creativity, POET offers a first glimpse of how interaction techniques of future text-to-image generation tools may support and align with more pluralistic values and the needs of end users during the ideation stages of their work.

Paperid: 384, https://arxiv.org/pdf/2503.15491.pdf

Abstract:
In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human is dependent on several situational factors (e.g., the human's current activity, urgency of the interaction, etc.). We test whether large language models (LLMs) and vision language models (VLMs) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs, and evaluate them on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision models indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear; however, challenges remain in open-ended situations where the model must balance between the human and robot situation.

Paperid: 385, https://arxiv.org/pdf/2503.12613.pdf

Abstract:
Urban assessments often compress diverse needs into single scores, which can obscure minority perspectives. We present a community-centered study in Montreal (n=35; wheelchair users, seniors, LGBTQIA2+ residents, and immigrants). Participants rated 20 streets (accessibility, inclusivity, aesthetics, practicality) and ranked 7 images on 12 interview-elicited criteria. Disagreement patterns were systematic in our sample: wheelchair users diverged most on accessibility and practicality; LGBTQIA2+ participants emphasized inclusion and liveliness; seniors prioritized security. Group discussion reduced information gaps but not value conflicts; ratings conveyed intensity, while rankings forced trade-offs. We then formalize negotiative alignment, a transparent, budget-aware bargaining procedure, and pilot it with role-played stakeholder agents plus a neutral mediator. Relative to the best base design under the same public rubric, the negotiated package increased total utility (21.10 to 24.55), raised the worst-group utility (3.20 to 3.90), improved twentieth percentile satisfaction (0.86 to 1.00; min-max normalized within the scenario), and reduced inequality (Gini 0.036 to 0.025). Treating disagreement as signal and reporting worst-group outcomes alongside totals may help planners and AI practitioners surface trade-offs and preserve minority priorities while maintaining efficiency.
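
The outcome measures named above (total utility, worst-group utility, min-max normalization, Gini) can be sketched as below. The abstract does not say which Gini estimator was used; this sketch assumes the mean-absolute-difference form G = sum_i sum_j |x_i - x_j| / (2 n^2 mean). The function names and the per-group dictionary layout are illustrative, and no attempt is made to reproduce the reported numbers.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    if not values:
        raise ValueError("need at least one value")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * n * mean)


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros (assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def summarize(group_utilities):
    """Report totals alongside the worst-off group, given {group: utility}."""
    utilities = list(group_utilities.values())
    return {
        "total": sum(utilities),
        "worst_group": min(utilities),
        "gini": gini(utilities),
    }
```

Reporting `worst_group` next to `total` makes it visible when a design raises aggregate utility while leaving one group behind.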

Paperid: 386, https://arxiv.org/pdf/2503.05609.pdf

Abstract:
Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for interpreting granular ratings in pluralistic datasets. Specifically, we address the challenge of analyzing nuanced differences in safety feedback from a diverse population expressed via ordinal scales (e.g., a Likert scale). We distill non-parametric responsiveness metrics that quantify the consistency of raters in scoring varying levels of the severity of safety violations. Leveraging a publicly available pluralistic dataset of safety feedback on AI-generated content as our case study, we investigate how raters from different demographic groups (age, gender, ethnicity) use an ordinal scale to express their perceptions of the severity of violations. We apply our metrics across violation types, demonstrating their utility in extracting nuanced insights that are crucial for aligning AI systems reliably in multi-cultural contexts. We show that our approach can inform rater selection and feedback interpretation by capturing nuanced viewpoints across different demographic groups, hence improving the quality of pluralistic data collection and in turn contributing to more robust AI development.

Paperid: 387, https://arxiv.org/pdf/2503.01894.pdf

Abstract:
We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a benchmark for multi-criteria alignment, developed through a two-year participatory process with 30 community organizations to support the pluralistic alignment of text-to-image (T2I) models in inclusive urban planning. The dataset encodes 37,710 pairwise comparisons across 13,462 images, structured along six criteria - Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity - derived from 634 community-defined concepts. Using Direct Preference Optimization (DPO), we fine-tune Stable Diffusion XL to reflect multi-criteria spatial preferences and evaluate the LIVS dataset and the fine-tuned model through four case studies: (1) DPO increases alignment with annotated preferences, particularly when annotation volume is high; (2) preference patterns vary across participant identities, underscoring the need for intersectional data; (3) human-authored prompts generate more distinctive visual outputs than LLM-generated ones, influencing annotation decisiveness; and (4) intersectional groups assign systematically different ratings across criteria, revealing the limitations of single-objective alignment. While DPO improves alignment under specific conditions, the prevalence of neutral ratings indicates that community values are heterogeneous and often ambiguous. LIVS provides a benchmark for developing T2I models that incorporate local, stakeholder-driven preferences, offering a foundation for context-aware alignment in spatial design.

Paperid: 388, https://arxiv.org/pdf/2502.20635.pdf

Abstract:
EXplainable machine learning (XML) has recently emerged to address the mystery mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Despite the development of various explanation methods, determining the most suitable XML method for specific ML contexts remains unclear, highlighting the need for effective evaluation of explanations. The evaluating capabilities of the Transformer-based large language model (LLM) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.

Paperid: 389, https://arxiv.org/pdf/2502.04482.pdf

Abstract:
The wide adoption of platformized work has generated remarkable advancements in the labor patterns and mobility of modern society. Underpinning such progress, gig workers are exposed to unprecedented challenges and accountabilities: lack of data transparency, social and physical isolation, as well as insufficient infrastructural safeguards. Gig2Gether presents a space designed for workers to engage in an initial experience of voluntarily contributing anecdotal and statistical data to affect policy and build solidarity across platforms by exchanging unifying and diverse experiences. Our 7-day field study with 16 active workers from three distinct platforms and work domains showed existing affordances of data-sharing: facilitating mutual support across platforms, as well as enabling financial reflection and planning. Additionally, workers envisioned future use cases of data-sharing for collectivism (e.g., collaborative examinations of algorithmic speculations) and informing policy (e.g., around safety and pay), which motivated (latent) worker desiderata of additional capabilities and data metrics. Based on these findings, we discuss remaining challenges to address and how data-sharing tools can complement existing structures to maximize worker empowerment and policy impact.

Paperid: 390, https://arxiv.org/pdf/2501.15931.pdf

Abstract:
While the integration of IoT devices in virtual spaces is becoming increasingly common, technical barriers to controlling custom devices in multi-user Virtual Reality (VR) environments remain high, particularly limiting new applications in educational and prototyping settings. We propose MetaGadget, a framework for connecting IoT devices to commercial metaverse platforms that implements device control through HTTP-based event triggers without requiring persistent client connections. Through two workshops focused on smart home control and custom device integration, we explored the potential application of IoT connectivity in multi-user metaverse environments. Participants successfully implemented new interactions unique to the metaverse, such as environmental sensing and remote control systems that support simultaneous operation by multiple users, and reported positive feedback on the ease of system development. We verified that our framework provides a new approach to controlling IoT devices in the metaverse while reducing technical requirements, and provides a foundation for creative practice that connects multi-user VR environments and physical spaces.
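
A one-shot HTTP trigger of the kind described above might look like the sketch below, which uses only the Python standard library. The endpoint URL, the JSON field names (`device`, `event`, `payload`), and both function names are hypothetical; the abstract does not specify MetaGadget's actual message format or API.

```python
import json
import urllib.request


def build_trigger_request(endpoint, device_id, event, payload=None):
    """Build a single POST request describing one device event.

    Each event is an independent request, so no persistent client
    connection has to be kept open between events.
    """
    body = json.dumps(
        {"device": device_id, "event": event, "payload": payload or {}}
    ).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_trigger(request, timeout=5.0):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `build_trigger_request("http://device.example/trigger", "lamp-1", "switch_on")` prepares a request that `send_trigger` would deliver to a (hypothetical) device gateway.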

Paperid: 391, https://arxiv.org/pdf/2501.09024.pdf

Abstract:
Most existing social robot navigation techniques either leverage hand-crafted rules or human demonstrations to connect robot perception to socially compliant actions. However, there remains a significant gap in effectively translating perception into socially compliant actions, much like how human reasoning naturally occurs in dynamic environments. Considering the recent success of Vision-Language Models (VLMs), we propose using language to bridge the gap in human-like reasoning between perception and socially aware robot actions. We create a vision-language dataset, Social robot Navigation via Explainable Interactions (SNEI), featuring 40K human-annotated Visual Question Answers (VQAs) based on 2K human-robot social interactions in unstructured, crowded public spaces, spanning perception, prediction, chain-of-thought reasoning, action, and explanation. We fine-tune a VLM, Social-LLaVA, using SNEI to demonstrate the practical application of our dataset. Social-LLaVA outperforms state-of-the-art models like GPT-4V and Gemini, based on the average of fifteen different human-judge scores across 50 VQA. Deployed onboard a mobile robot, Social-LLaVA enables human-like reasoning, marking a promising step toward socially compliant robot navigation in dynamic public spaces through language reasoning.

Paperid: 392, https://arxiv.org/pdf/2501.03968.pdf

Abstract:
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
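
The idea of condition nodes that carry free-form text and are checked against an image at execution time can be sketched as a tiny behavior tree. Everything here is illustrative: the node classes, the `evaluate` callback (a stand-in for the second VLM process), and the logging list are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for the VLM call: (condition_text, image) -> True/False.
ConditionEvaluator = Callable[[str, object], bool]


@dataclass
class VisualCondition:
    text: str  # free-form condition, e.g. generated by the planning VLM

    def tick(self, image, evaluate: ConditionEvaluator, log: List[str]) -> bool:
        return evaluate(self.text, image)


@dataclass
class Action:
    name: str

    def tick(self, image, evaluate: ConditionEvaluator, log: List[str]) -> bool:
        log.append(self.name)  # a real robot would execute the skill here
        return True


@dataclass
class Sequence:
    children: list

    def tick(self, image, evaluate: ConditionEvaluator, log: List[str]) -> bool:
        # Succeeds only if every child succeeds; stops at the first failure.
        return all(c.tick(image, evaluate, log) for c in self.children)


@dataclass
class Fallback:
    children: list

    def tick(self, image, evaluate: ConditionEvaluator, log: List[str]) -> bool:
        # Succeeds as soon as one child succeeds.
        return any(c.tick(image, evaluate, log) for c in self.children)
```

Keeping the evaluator as a plain callback means the tree can be unit-tested with a stub before any model is attached.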

Paperid: 393, https://arxiv.org/pdf/2505.05516.pdf

Abstract:
We envision the "virtual eye" as a next-generation, AI-powered platform that uses interconnected foundation models to simulate the eye's intricate structure and biological function across all scales. Advances in AI, imaging, and multiomics provide a fertile ground for constructing a universal, high-fidelity digital replica of the human eye. This perspective traces the evolution from early mechanistic and rule-based models to contemporary AI-driven approaches, integrating in a unified model with multimodal, multiscale, dynamic predictive capabilities and embedded feedback mechanisms. We propose a development roadmap emphasizing the roles of large-scale multimodal datasets, generative AI, foundation models, agent-based architectures, and interactive interfaces. Despite challenges in interpretability, ethics, data processing and evaluation, the virtual eye holds the potential to revolutionize personalized ophthalmic care and accelerate research into ocular health and disease.

Paperid: 394, https://arxiv.org/pdf/2505.04182.pdf

Abstract:
Human interaction experience plays a crucial role in the effectiveness of human-machine collaboration, especially as interactions in future systems progress towards tighter physical and functional integration. While automation design has been shown to impact task performance, its influence on human experience metrics such as flow, sense of agency (SoA), and embodiment remains underexplored. This study investigates how variations in automation design affect these psychological experience measures and examines correlations between subjective experience and physiological indicators. A user study was conducted in a simulated wood workshop, where participants collaborated with a lightweight robot under four automation levels. The results of the study indicate that medium automation levels enhance flow, SoA and embodiment, striking a balance between support and user autonomy. In contrast, higher automation, despite optimizing task performance, diminishes perceived flow and agency. Furthermore, we observed that grip force might be considered as a real-time proxy of SoA, while correlations with heart rate variability were inconclusive. The findings underscore the necessity for automation strategies that integrate human-centric metrics, aiming to optimize both performance and user experience in collaborative robotic systems.

Paperid: 395, https://arxiv.org/pdf/2505.03568.pdf

Abstract:
Humans have a tendency to discover and explore. This natural tendency is reflected in data from streaming platforms as the amount of previously unknown content accessed by users. Additionally, in domains such as music streaming, there is evidence that recommending novel content improves users' experience with the platform. Therefore, understanding users' discovery patterns, such as the extent to which and the way in which users access previously unknown content, is a topic of relevance for both the scientific community and the streaming industry, particularly in music. Previous works studied how music consumption differs for users with different traits and looked at the diversity, novelty, and consistency over time of users' music preferences. However, very little is known about how users discover and explore previously unknown music, and how this behavior differs for users with varying discovery needs. In this paper we bridge this gap by analyzing data from a survey answered by users of the major music streaming platform Deezer in combination with their streaming data. We first address questions regarding whether users who declare a higher interest in unfamiliar music listen to more diverse music, have more stable music preferences over time, and explore more music within the same time window, compared to those who declare a lower interest. We then investigate which type of music tracks users choose to listen to when they explore unfamiliar music, identifying clear patterns of popularity and genre representativeness that vary for users with different discovery needs. Our findings open up possibilities to infer users' interest in unfamiliar music from streaming data as well as possibilities to develop recommender systems that guide users in exploring music in a more natural way.

Paperid: 396, https://arxiv.org/pdf/2504.14594.pdf

Abstract:
Seeking dietary guidance often requires navigating complex professional knowledge while accommodating individual health conditions. Knowledge Graphs (KGs) offer structured and interpretable nutritional information, whereas Large Language Models (LLMs) naturally facilitate conversational recommendation delivery. In this paper, we present HealthGenie, an interactive system that combines the strengths of LLMs and KGs to provide personalized dietary recommendations along with hierarchical information visualization for a quick and intuitive overview. Upon receiving a user query, HealthGenie performs query refinement and retrieves relevant information from a pre-built KG. The system then visualizes and highlights pertinent information, organized by defined categories, while offering detailed, explainable recommendation rationales. Users can further tailor these recommendations by adjusting preferences interactively. Our evaluation, comprising a within-subject comparative experiment and an open-ended discussion, demonstrates that HealthGenie effectively supports users in obtaining personalized dietary guidance based on their health conditions while reducing interaction effort and cognitive load. These findings highlight the potential of LLM-KG integration in supporting decision-making through explainable and visualized information. We examine the system's usefulness and effectiveness with an N=12 within-subject study and provide design considerations for future systems that integrate conversational LLM and KG.

Paperid: 397, https://arxiv.org/pdf/2504.11369.pdf

Abstract:
Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications, posing new challenges for detecting their outputs. We propose OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate machine-generated text detectors on the Turing Test and Authorship Attribution problems. OpenTuringBench focuses on a representative set of OLLMs, and features a number of challenging evaluation tasks, including human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models. We also provide OTBDetector, a contrastive learning framework to detect and attribute OLLM-based machine-generated texts. Results highlight the relevance and varying degrees of difficulty of the OpenTuringBench tasks, with our detector achieving remarkable capabilities across the various tasks and outperforming most existing detectors. Resources are available on the OpenTuringBench Hugging Face repository at https://huggingface.co/datasets/MLNTeam-Unical/OpenTuringBench

Paperid: 398, https://arxiv.org/pdf/2503.20714.pdf

Abstract:
This work explores how force feedback affects various aspects of robot data collection within the Extended Reality (XR) setting. Force feedback has been shown to enhance the user experience in XR by providing contact-rich information. However, its impact on robot data collection has not received much attention in the robotics community. This paper addresses this shortcoming by conducting an extensive user study on the effects of force feedback during data collection in XR. We extended two XR-based robot control interfaces, Kinesthetic Teaching and Motion Controllers, with haptic feedback features. The user study is conducted using manipulation tasks ranging from simple pick-and-place to complex peg assembly, which requires precise operations. The evaluations show that force feedback enhances task performance and user experience, particularly in tasks requiring high-precision manipulation. These improvements vary depending on the robot control interface and task complexity. This paper provides new insights into how different factors influence the impact of force feedback.

Paperid: 399, https://arxiv.org/pdf/2503.02769.pdf

Abstract:
Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency between speech and text representations through techniques such as representation and behavior alignment, which involve the meticulous design of data pairs during the post-training phase. In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Consequently, the model acquires the ability to generate textual continuations corresponding to the provided speech segments, obviating the need for intensive data design endeavors. To systematically evaluate speech instruction-following capabilities, we introduce SpeechInstructBench, the first comprehensive benchmark specifically designed for speech-oriented instruction-following tasks. Our proposed InSerter achieves SOTA performance in SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.

Paperid: 400, https://arxiv.org/pdf/2502.19133.pdf

Abstract:
Decomposition is a fundamental skill in algorithmic programming, requiring learners to break down complex problems into smaller, manageable parts. However, current self-study methods, such as browsing reference solutions or using LLM assistants, often provide excessive or generic assistance that misaligns with learners' decomposition strategies, hindering independent problem-solving and critical thinking. To address this, we introduce Decomposition Box (DBox), an interactive LLM-based system that scaffolds and adapts to learners' personalized construction of a step tree through a "learner-LLM co-decomposition" approach, providing tailored support at an appropriate level. A within-subjects study (N=24) found that compared to the baseline, DBox significantly improved learning gains, cognitive engagement, and critical thinking. Learners also reported a stronger sense of achievement and found the assistance appropriate and helpful for learning. Additionally, we examined DBox's impact on cognitive load, identified usage patterns, and analyzed learners' strategies for managing system errors. We conclude with design implications for future AI-powered tools to better support algorithmic programming education.

Paperid: 401, https://arxiv.org/pdf/2502.03767.pdf

Abstract:
Danmaku, a system of scene-aligned, time-synced, floating comments, can augment video content to create 'collective knowledge'. However, its chaotic nature often hinders viewers from effectively assimilating the collective knowledge, especially in knowledge-intensive science videos. With a formative study, we examined viewers' practices for processing collective knowledge and the specific barriers they encountered. Building on these insights, we designed a processing pipeline to filter, classify, and cluster danmaku, leading to the development of CoKnowledge - a tool incorporating a video abstract, knowledge graphs, and supplementary danmaku features to support viewers' assimilation of collective knowledge in science videos. A within-subject study (N=24) showed that CoKnowledge significantly enhanced participants' comprehension and recall of collective knowledge compared to a baseline with unprocessed live comments. Based on our analysis of user interaction patterns and feedback on design features, we presented design considerations for developing similar support tools.

Paperid: 402, https://arxiv.org/pdf/2502.00381.pdf

Abstract:
This paper focuses on developing a framework for uncovering insights about NDD children's performance (e.g., raw gaze cluster analysis, duration analysis and area of interest for sustained attention, stimuli expectancy, loss of focus/motivation, inhibitory control) and informing their teachers. The hypothesis behind this work is that self-adaptation of games can contribute to improving students' well-being and performance by suggesting personalized activities (e.g., highlighting stimuli to increase attention or choosing a difficulty level that matches students' abilities). The aim is to examine how AI can be used to help solve this problem. The results would not only contribute to a better understanding of the problems of NDD children and their teachers but also help psychologists to validate the results against their clinical knowledge, improve communication with patients and identify areas for further investigation, e.g., by explaining the decision made and preserving the children's private data in the learning process.

Paperid: 403, https://arxiv.org/pdf/2502.00376.pdf

Abstract:
Self-Supervised Representation Learning (SSRepL) can capture meaningful and robust representations of Attention Deficit Hyperactivity Disorder (ADHD) data and has the potential to improve model performance on the downstream detection of different types of Neurodevelopmental Disorder (NDD). In this paper, a novel SSRepL and Transfer Learning (TL)-based framework that incorporates a Long Short-Term Memory (LSTM) and a Gated Recurrent Unit (GRU) model is proposed to detect children with potential symptoms of ADHD. This model uses Electroencephalogram (EEG) signals extracted during visual attention tasks to accurately detect ADHD after preprocessing the EEG signals through normalization, filtering, and data balancing. For the experimental analysis, we use three different models: 1) an SSRepL and TL-based LSTM-GRU model named SSRepL-ADHD, which integrates LSTM and GRU layers to capture temporal dependencies in the data, 2) a lightweight SSRepL-based DNN model (LSSRepL-DNN), and 3) Random Forest (RF). In the study, these models are thoroughly evaluated using well-known performance metrics (i.e., accuracy, precision, recall, and F1-score). The results show that the proposed SSRepL-ADHD model achieves the maximum accuracy of 81.11%; we also acknowledge the difficulties associated with dataset imbalance and feature selection.

Paperid: 404, https://arxiv.org/pdf/2501.18002.pdf

Abstract:
Conversational human-AI interaction (CHAI) has recently driven mainstream adoption of AI. However, CHAI poses two key challenges for designers and researchers: users frequently have ambiguous goals and an incomplete understanding of AI functionalities, and the interactions are brief and transient, limiting opportunities for sustained engagement with users. AI agents can help address these challenges by suggesting contextually relevant prompts, by standing in for users during early design testing, and by helping users better articulate their goals. Guided by research-through-design, we explored agentic AI workflows through the development and testing of a probe over four iterations with 10 users. We present our findings through an annotated portfolio of design artifacts, and through thematic analysis of user experiences, offering solutions to the problems of ambiguity and transience in CHAI. Furthermore, we examine the limitations and possibilities of these AI agent workflows, suggesting that similar collaborative approaches between humans and AI could benefit other areas of design.

Paperid: 405, https://arxiv.org/pdf/2501.10970.pdf

Abstract:
The "LLM-as-an-annotator" and "LLM-as-a-judge" paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM annotators and judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans, with closed-source LLMs (such as GPT-4o) outperforming the open-source LLMs we examine, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.

Paperid: 406, https://arxiv.org/pdf/2501.09930.pdf

Abstract:
Healthcare simulations help learners develop teamwork and clinical skills in a risk-free setting, promoting reflection on real-world practices through structured debriefs. However, despite video's potential, it is hard to use, leaving a gap in providing concise, data-driven summaries for supporting effective debriefing. Addressing this, we present TeamVision, an AI-powered multimodal learning analytics (MMLA) system that captures voice presence, automated transcriptions, body rotation, and positioning data, offering educators a dashboard to guide debriefs immediately after simulations. We conducted an in-the-wild study with 56 teams (221 students) and recorded debriefs led by six teachers using TeamVision. Follow-up interviews with 15 students and five teachers explored perceptions of its usefulness, accuracy, and trustworthiness. This paper examines: i) how TeamVision was used in debriefing, ii) what educators found valuable and challenging, and iii) perceptions of its effectiveness. Results suggest TeamVision enables flexible debriefing and highlights the challenges and implications of using AI-powered systems in healthcare simulation.

Paperid: 407, https://arxiv.org/pdf/2501.05434.pdf

Abstract:
Grasp User Interfaces (grasp UIs) enable dual-tasking in XR by allowing interaction with digital content while holding physical objects. However, current grasp UI design practices face a fundamental challenge: existing approaches either capture user preferences through labor-intensive elicitation studies that are difficult to scale or rely on biomechanical models that overlook subjective factors. We introduce GraspR, the first computational model that predicts user preferences for single-finger microgestures in grasp UIs. Our data-driven approach combines the scalability of computational methods with human preference modeling, trained on 1,520 preferences collected via a two-alternative forced choice paradigm across eight participants and four frequently used grasp variations. We demonstrate GraspR's effectiveness through a working prototype that dynamically adjusts interface layouts across four everyday tasks. We release both the dataset and code to support future research in adaptive grasp UIs.

Paperid: 408, https://arxiv.org/pdf/2501.02857.pdf

Abstract:
In the domain of multi-objective optimization, evolutionary algorithms are distinguished by their capability to generate a diverse population of solutions that navigate the trade-offs inherent among competing objectives. This has catalyzed the ascension of evolutionary multi-objective optimization (EMO) as a prevalent approach. Despite the effectiveness of the EMO paradigm, the analysis of resultant solution sets presents considerable challenges. This is primarily attributed to the high-dimensional nature of the data and the constraints imposed by static visualization methods, which frequently culminate in visual clutter and impede interactive exploratory analysis. To address these challenges, this paper introduces ParetoLens, a visual analytics framework specifically tailored to enhance the inspection and exploration of solution sets derived from the multi-objective evolutionary algorithms. Utilizing a modularized, algorithm-agnostic design, ParetoLens enables a detailed inspection of solution distributions in both decision and objective spaces through a suite of interactive visual representations. This approach not only mitigates the issues associated with static visualizations but also supports a more nuanced and flexible analysis process. The usability of the framework is evaluated through case studies and expert interviews, demonstrating its potential to uncover complex patterns and facilitate a deeper understanding of multi-objective optimization solution sets. A demo website of ParetoLens is available at https://dva-lab.org/paretolens/.

Paperid: 409, https://arxiv.org/pdf/2506.18749.pdf

Abstract:
Non-invasive brain-computer interfaces (BCIs) have the potential to enable intuitive control of prosthetic limbs for individuals with upper limb amputations. However, existing EEG-based control systems face challenges related to signal noise, classification accuracy, and real-time adaptability. In this work, we present BRAVE, a hybrid EEG and voice-controlled prosthetic system that integrates ensemble learning-based EEG classification with a human-in-the-loop (HITL) correction framework for enhanced responsiveness. Unlike traditional electromyography (EMG)-based prosthetic control, BRAVE aims to interpret EEG-driven motor intent, enabling movement control without reliance on residual muscle activity. To improve classification robustness, BRAVE combines LSTM, CNN, and Random Forest models in an ensemble framework, achieving a classification accuracy of 96% across test subjects. EEG signals are preprocessed using a bandpass filter (0.5-45 Hz), Independent Component Analysis (ICA) for artifact removal, and Common Spatial Pattern (CSP) feature extraction to minimize contamination from electromyographic (EMG) and electrooculographic (EOG) signals. Additionally, BRAVE incorporates automatic speech recognition (ASR) to facilitate intuitive mode switching between different degrees of freedom (DOF) in the prosthetic arm. The system operates in real time, with a response latency of 150 ms, leveraging Lab Streaming Layer (LSL) networking for synchronized data acquisition. The system is evaluated on an in-house fabricated prosthetic arm and on multiple participants, highlighting its generalizability across users. The system is optimized for low-power embedded deployment, ensuring practical real-world application beyond high-performance computing environments. Our results indicate that BRAVE offers a promising step towards robust, real-time, non-invasive prosthetic control.
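
The abstract does not say how the LSTM, CNN, and Random Forest outputs are combined. As a minimal sketch, assuming simple majority voting over per-model class labels (the function name, the label strings, and the tie-breaking rule are all hypothetical and not taken from the paper), an ensemble decision for one EEG window could look like this:

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label that appears first."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical per-model outputs for a single EEG window (illustrative only).
votes = {"lstm": "grasp", "cnn": "rest", "random_forest": "grasp"}
decision = majority_vote(list(votes.values()))
```

Weighted or probability-averaging ensembles are equally plausible readings of the abstract; majority voting is used here only because it is the simplest to show.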

Paperid: 410, https://arxiv.org/pdf/2506.12469.pdf

Abstract:
Autonomy is a double-edged sword for AI agents, simultaneously unlocking transformative possibilities and serious risks. How can agent developers calibrate the appropriate levels of autonomy at which their agents should operate? We argue that an agent's level of autonomy can be treated as a deliberate design decision, separate from its capability and operational environment. In this work, we define five levels of escalating agent autonomy, characterized by the roles a user can take when interacting with an agent: operator, collaborator, consultant, approver, and observer. Within each level, we describe the ways by which a user can exert control over the agent and open questions for how to design the nature of user-agent interaction. We then highlight a potential application of our framework towards AI autonomy certificates to govern agent behavior in single- and multi-agent systems. We conclude by proposing early ideas for evaluating agents' autonomy. Our work aims to contribute meaningful, practical steps towards responsibly deployed and useful AI agents in the real world.
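
The five user roles can be written down as an ordered type. The sketch below is only an illustration: the numeric values 1 to 5, the class name, and the `within_certified_level` check are assumptions introduced here, loosely inspired by the abstract's mention of autonomy certificates, and are not definitions from the paper.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """User roles in order of escalating agent autonomy (numbering is assumed)."""
    OPERATOR = 1
    COLLABORATOR = 2
    CONSULTANT = 3
    APPROVER = 4
    OBSERVER = 5


def within_certified_level(requested, certified_max):
    """Hypothetical check: may an agent run at `requested` given its certificate?"""
    return requested <= certified_max


# An agent certified up to CONSULTANT may not run with the user as mere OBSERVER.
allowed = within_certified_level(AutonomyLevel.OBSERVER, AutonomyLevel.CONSULTANT)
```

Using an ordered enum makes the "escalating" relation between levels explicit, so comparisons such as `AutonomyLevel.OPERATOR < AutonomyLevel.APPROVER` work directly.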

Paperid: 411, https://arxiv.org/pdf/2506.10376.pdf

Abstract:
Converting user interfaces into code (UI2Code) is a crucial step in website development, which is time-consuming and labor-intensive. Automating UI2Code is essential to streamline this task and improve development efficiency. There exist deep learning-based methods for the task; however, they rely heavily on a large amount of labeled training data and struggle to generalize to real-world, unseen web page designs. The advent of Multimodal Large Language Models (MLLMs) presents potential for alleviating the issue, but they struggle to comprehend the complex layouts in UIs and to generate accurate code with the layout preserved. To address these issues, we propose LayoutCoder, a novel MLLM-based framework generating UI code from real-world webpage images, which includes three key modules: (1) Element Relation Construction, which aims at capturing UI layout by identifying and grouping components with similar structures; (2) UI Layout Parsing, which aims at generating UI layout trees for guiding the subsequent code generation process; and (3) Layout-Guided Code Fusion, which aims at producing accurate code with the layout preserved. For evaluation, in addition to the popular dataset Design2Code, we build a new benchmark dataset named Snap2Code, which comprises 350 real-world websites divided into seen and unseen parts to mitigate data leakage. Extensive evaluation shows the superior performance of LayoutCoder over the state-of-the-art approaches. Compared with the best-performing baseline, LayoutCoder improves the BLEU score by 10.14% and the CLIP score by 3.95% on average across all datasets.

Paperid: 412, https://arxiv.org/pdf/2506.09212.pdf

Abstract:
The visual analysis of graphs in 3D has become increasingly popular, accelerated by the rise of immersive technology, such as augmented and virtual reality. Unlike 2D drawings, 3D graph layouts are highly viewpoint-dependent, making perspective selection critical for revealing structural and relational patterns. Despite its importance, there is limited empirical evidence guiding what constitutes an effective or preferred viewpoint from the user's perspective. In this paper, we present a systematic investigation into user-preferred viewpoints in 3D graph visualisations. We conducted a controlled study with 23 participants in a virtual reality environment, where users selected their most and least preferred viewpoints for 36 different graphs varying in size and layout. From this data, enriched by qualitative feedback, we distil common strategies underlying viewpoint choice. We further analyse the alignment of user preferences with classical 2D aesthetic criteria (e.g., Crossings), 3D-specific measures (e.g., Node-Node Occlusion), and introduce a novel measure capturing the perceivability of a graph's principal axes (Isometric Viewpoint Deviation). Our data-driven analysis indicates that Stress, Crossings, Gabriel Ratio, Edge-Node Overlap, and Isometric Viewpoint Deviation are key indicators of viewpoint preference. Beyond our findings, we contribute a publicly available dataset consisting of the graphs and computed aesthetic measures, supporting further research and the development of viewpoint evaluation measures for 3D graph drawing.

Paperid: 413, https://arxiv.org/pdf/2506.07193.pdf

Abstract:
Eye tracking technology is frequently utilized to diagnose eye and neurological disorders, assess sleep and fatigue, study human visual perception, and enable novel gaze-based interaction methods. However, traditional eye tracking methodologies are constrained by bespoke hardware that is often cumbersome to wear, complex to apply, and demands substantial computational resources. To overcome these limitations, we investigated Electrooculography (EOG) eye tracking using 14 electrodes positioned around the ears, integrated into a custom-built headphone form factor device. In a controlled experiment, 16 participants tracked stimuli designed to induce smooth pursuits and saccades. Data analysis identified optimal electrode pairs for vertical and horizontal eye movement tracking, benchmarked against gold-standard EOG and camera-based methods. The electrode montage nearest the eyes yielded the best horizontal results. Horizontal smooth pursuits via earEOG showed high correlation with gold-standard measures ($r_{\mathrm{EOG}} = 0.81, p = 0.01$; $r_{\mathrm{CAM}} = 0.56, p = 0.02$), while vertical pursuits were weakly correlated ($r_{\mathrm{EOG}} = 0.28, p = 0.04$; $r_{\mathrm{CAM}} = 0.35, p = 0.05$). Voltage deflections when performing saccades showed strong correlation in the horizontal direction ($r_{\mathrm{left}} = 0.99, p = 0.0$; $r_{\mathrm{right}} = 0.99, p = 0.0$) but low correlation in the vertical direction ($r_{\mathrm{up}} = 0.6, p = 0.23$; $r_{\mathrm{down}} = 0.19, p = 0.73$). Overall, horizontal earEOG demonstrated strong performance, indicating its potential effectiveness, while vertical earEOG results were poor, suggesting limited feasibility in our current setup.
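
The $r$ values reported above are correlation coefficients between the ear-based signal and a reference signal. As a minimal sketch, assuming these are standard Pearson coefficients (the abstract does not name the estimator) and using made-up sample values rather than data from the study, the computation is:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic traces for illustration only (not measurements from the paper).
reference_trace = [0.0, 1.0, 2.0, 3.0, 4.0]
ear_trace = [0.1, 0.9, 2.2, 2.8, 4.1]
r = pearson_r(reference_trace, ear_trace)
```

A value of $r$ close to 1 means the ear-based trace follows the reference almost linearly, which is how the strong horizontal results above should be read.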

Paperid: 414, https://arxiv.org/pdf/2506.02715.pdf

Abstract:
We present a demo of UltrasonicSpheres, a novel system for location-specific audio delivery using wearable earphones that decode ultrasonic signals into audible sound. Unlike conventional beamforming setups, UltrasonicSpheres relies on single ultrasonic speakers to broadcast localized audio with multiple channels, each encoded on a distinct ultrasonic carrier frequency. Users wearing our acoustically transparent earphones can demodulate their selected stream, such as exhibit narrations in a chosen language, while remaining fully aware of ambient environmental sounds. The experience preserves spatial audio perception, giving the impression that the sound originates directly from the physical location of the source. This enables personalized, localized audio without requiring pairing, tracking, or additional infrastructure. Importantly, visitors not equipped with the earphones are unaffected, as the ultrasonic signals are inaudible to the human ear. Our demo invites participants to explore multiple co-located audio zones and experience how UltrasonicSpheres supports unobtrusive delivery of personalized sound in public spaces.

Paperid: 415, https://arxiv.org/pdf/2506.02714.pdf

Abstract:
Maintaining thermal comfort in shared indoor environments remains challenging, as centralized HVAC systems are slow to adapt and standardized to group norms. Cold exposure not only reduces subjective comfort but can impair cognitive performance, particularly under moderate to severe cold stress. Personal Comfort Systems (PCS) have shown promise by providing localized heating, yet many designs target distal body parts with low thermosensitivity and often lack portability. In this work, we investigate whether targeted thermal stimulation using in-ear worn devices can manipulate thermal perception and enhance thermal comfort. We present Heatables, a novel in-ear wearable that emits Near-Infrared (NIR) and Infrared (IR) radiation via integrated LEDs to deliver localized optical heating. This approach leverages NIR-IR's ability to penetrate deeper tissues, offering advantages over traditional resistive heating limited to surface warming. In a placebo-controlled study with 24 participants, each exposed for 150 minutes in a cool office environment (approximately 17.5 degrees Celsius) to simulate sustained cold stress during typical sedentary office activities, Heatables significantly increased the perceived ambient temperature by around 1.5 degrees Celsius and delayed cold discomfort. Importantly, thermal benefits extended beyond the ear region, improving both whole-body comfort and thermal acceptability. These findings position in-ear NIR-IR-LED-based stimulation as a promising modality for unobtrusive thermal comfort enhancement in everyday contexts.

Paperid: 416, https://arxiv.org/pdf/2506.02533.pdf

Abstract:
Political online participation in the form of discussing political issues and exchanging opinions among citizens is gaining importance with more and more formats being held digitally. To come to a decision, a careful discussion and consideration of opinions and a civil exchange of arguments, which is defined as the act of deliberation, is desirable. The quality of discussions and participation processes in terms of their deliberativeness highly depends on the design of platforms and processes. To facilitate online communication for both participants and initiators, machine learning methods offer a lot of potential. In this work we want to showcase which issues occur in political online discussions and how machine learning can be used to counteract these issues and enhance deliberation.

Paperid: 417, https://arxiv.org/pdf/2505.14535.pdf

Abstract:
Multimodal spiking neural networks (SNNs) hold significant potential for energy-efficient sensory processing but face critical challenges in modality imbalance and temporal misalignment. Current approaches suffer from uncoordinated convergence speeds across modalities and static fusion mechanisms that ignore time-varying cross-modal interactions. We propose the temporal attention-guided adaptive fusion framework for multimodal SNNs with two synergistic innovations: 1) The Temporal Attention-guided Adaptive Fusion (TAAF) module that dynamically assigns importance scores to fused spiking features at each timestep, enabling hierarchical integration of temporally heterogeneous spike-based features; 2) The temporal adaptive balanced fusion loss that modulates learning rates per modality based on the above attention scores, preventing dominant modalities from monopolizing optimization. The proposed framework implements adaptive fusion, especially in the temporal dimension, and alleviates the modality imbalance during multimodal learning, mimicking cortical multisensory integration principles. Evaluations on CREMA-D, AVE, and EAD datasets demonstrate state-of-the-art performance (77.55%, 70.65%, and 97.5% accuracy, respectively) with energy efficiency. The system resolves temporal misalignment through learnable time-warping operations and coordinates modality convergence faster than baseline SNNs. This work establishes a new paradigm for temporally coherent multimodal learning in neuromorphic systems, bridging the gap between biological sensory processing and efficient machine intelligence.
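
To make the idea of per-timestep importance scores concrete, the sketch below shows generic softmax-based temporal attention pooling over already-fused feature vectors. It is not the paper's TAAF module: the function names are hypothetical, and the scores are passed in by hand here, whereas in a real model they would be produced by learned layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(features_per_step, scores_per_step):
    """Weight each timestep's feature vector by its softmax-normalized score."""
    weights = softmax(scores_per_step)
    dim = len(features_per_step[0])
    pooled = [
        sum(w * step[i] for w, step in zip(weights, features_per_step))
        for i in range(dim)
    ]
    return pooled, weights


# Two timesteps with equal scores contribute equally to the pooled feature.
pooled, weights = temporal_attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Raising the score of one timestep shifts the pooled vector toward that timestep's features, which is the sense in which attention can emphasize informative moments in a spike sequence.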

Paperid: 418, https://arxiv.org/pdf/2504.15815.pdf

Abstract:
Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods of LLM outputs, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompts and changes in models efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs. We are further able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.

Paperid: 419, https://arxiv.org/pdf/2504.06138.pdf

Abstract:
The rapid advances in Foundation Models and agentic Artificial Intelligence are transforming multimedia analytics by enabling richer, more sophisticated interactions between humans and analytical systems. Existing conceptual models for visual and multimedia analytics, however, do not adequately capture the complexity introduced by these powerful AI paradigms. To bridge this gap, we propose a comprehensive multimedia analytics model specifically designed for the foundation model era. Building upon established frameworks from visual analytics, multimedia analytics, knowledge generation, analytic task definition, mixed-initiative guidance, and human-in-the-loop reinforcement learning, our model emphasizes integrated human-AI teaming based on visual analytics agents from both technical and conceptual perspectives. Central to the model is a seamless, yet explicitly separable, interaction channel between expert users and semi-autonomous analytical processes, ensuring continuous alignment between user intent and AI behavior. The model addresses practical challenges in sensitive domains such as intelligence analysis, investigative journalism, and other fields handling complex, high-stakes data. We illustrate through detailed case studies how our model facilitates deeper understanding and targeted improvement of multimedia analytics solutions. By explicitly capturing how expert users can optimally interact with and guide AI-powered multimedia analytics systems, our conceptual framework sets a clear direction for system design, comparison, and future research.

Paperid: 420, https://arxiv.org/pdf/2504.03343.pdf

Abstract:
Integrated into websites, LLM-powered chatbots offer alternative means of navigation and information retrieval, leading to a shift in how users access information on the web. Yet, predominantly closed-source solutions limit proliferation among web hosts and suffer from a lack of transparency with regard to implementation details and energy efficiency. In this work, we propose our openly available agent Talk2X, which leverages an adapted retrieval-augmented generation (RAG) approach combined with an automatically generated vector database, benefiting energy efficiency. Talk2X's architecture is generalizable to arbitrary websites, offering developers a ready-to-use tool for integration. Using a mixed-methods approach, we evaluated Talk2X's usability by tasking users to acquire specific assets from an open science repository. Talk2X significantly improved task completion time, correctness, and user experience, supporting users in quickly pinpointing specific information as compared to standard user-website interaction. Our findings contribute technical advancements to an ongoing paradigm shift of how we access information on the web.
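
The retrieval step of a RAG pipeline can be illustrated in a few lines. The sketch below is a toy stand-in, not Talk2X's implementation: it assumes bag-of-words vectors and cosine similarity in place of a learned embedding model and a real vector database, and the function names and example page chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical website chunks; the top hit would be passed to the LLM as context.
chunks = [
    "Dataset downloads are listed on the data page",
    "Contact the maintainers via email",
]
top = retrieve("where are the dataset downloads", chunks)
```

In a full pipeline the retrieved chunks are inserted into the prompt so that the model answers from the website's own content rather than from memory.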

Paperid: 421, https://arxiv.org/pdf/2504.01029.pdf

Abstract:
The rapid growth of artificial intelligence (AI) technologies has raised major privacy and ethical concerns. However, existing AI incident taxonomies and guidelines lack grounding in real-world cases, limiting their effectiveness for prevention and mitigation. We analyzed 202 real-world AI privacy and ethical incidents to develop a taxonomy that classifies them across AI lifecycle stages and captures contributing factors, including causes, responsible entities, sources of disclosure, and impacts. Our findings reveal widespread harms from poor organizational decisions and legal non-compliance, limited corrective interventions, and rare reporting from AI developers and adopting entities. Our taxonomy offers a structured approach for systematic incident reporting and emphasizes the weaknesses of current AI governance frameworks. Our findings provide actionable guidance for policymakers and practitioners to strengthen user protections, develop targeted AI policies, enhance reporting practices, and foster responsible AI governance and innovation, especially in contexts such as social media and child protection.

Paperid: 422, https://arxiv.org/pdf/2504.00368.pdf

Abstract:
As Virtual Reality (VR) games become more popular, it is crucial to understand how deceptive game design patterns manifest and impact player experiences in this emerging medium. Our study sheds light on the presence and effects of manipulative design techniques in commercial VR games compared to a traditional computer game. We conducted an autoethnography study and developed a VR Deceptive Game Design Assessment Guide based on a critical literature review. Using our guide, we compared how deceptive patterns in a popular computer game differ from those in two commercial VR titles. While VR's technological constraints, such as battery life, limited temporal manipulation, VR's unique sensory immersion amplified the impact of emotional and sensory deception. Current VR games showed similar but evolved forms of deceptive design compared to the computer game. We forecast more sophisticated player manipulation as VR technology advances. Our findings contribute to a better understanding of how deceptive game design persists and escalates in VR. We highlight the urgent need to develop ethical design guidelines for the rapidly advancing VR games industry.

Paperid: 423, https://arxiv.org/pdf/2503.22901.pdf

Abstract:
Over the last decade, the free-to-play (F2P) game business model has gained popularity in the games industry. We examine the role of deceptive design during a game's transition to F2P and its impacts on players. Our analysis focuses on game mechanics and a Reddit analysis of the Overwatch (OW) series after it transitioned to an F2P model. Our study identifies nine game mechanics that use deceptive design patterns. We also identify factors contributing to a negative gameplay experience. Business model transitions in games present possibilities for problematic practices. Our findings identify the need for game developers and publishers to balance player investments and fairness of rewards. A game's successful transition depends on maintaining fundamental components of player motivation and ensuring transparent communication. Compared to existing taxonomies in other media, games need a comprehensive classification of deceptive design. We emphasize the importance of understanding player perceptions and the impact of deceptive practices in future research.

Paperid: 424, https://arxiv.org/pdf/2503.22892.pdf

Abstract:
The well-established deceptive design literature has focused on conventional user interfaces. With the rise of extended reality (XR), understanding deceptive design's unique manifestations in this immersive domain is crucial. However, existing research lacks a full, cross-disciplinary analysis of how XR technologies enable new forms of deceptive design. Our study reviews the literature on deceptive design in XR environments. We use thematic synthesis to identify key themes. We found that XR's immersive capabilities and extensive data collection enable subtle and powerful manipulation strategies. We identified eight themes outlining these strategies and discussed existing countermeasures. Our findings show the unique risks of deceptive design in XR, highlighting implications for researchers, designers, and policymakers. We propose future research directions that explore unintentional deceptive design, data-driven manipulation solutions, user education, and the link between ethical design and policy regulations.

Paperid: 425, https://arxiv.org/pdf/2503.21010.pdf

Abstract:
Extended Reality (XR) technology is changing online interactions, but its granular data collection sensors may be more invasive to user privacy than web, mobile, and the Internet of Things technologies. Despite an increased interest in studying developers' concerns about XR device privacy, user perceptions have rarely been addressed. We surveyed 464 XR users to assess their awareness, concerns, and coping strategies around XR data in 18 scenarios. Our findings demonstrate that many factors, such as data types and sensitivity, affect users' perceptions of privacy in XR. However, users' limited awareness of XR sensors' granular data collection capabilities, such as involuntary body signals of emotional responses, restricted the range of privacy-protective strategies they used. Our results highlight a need to enhance users' awareness of data privacy threats in XR, design privacy-choice interfaces tailored to XR environments, and develop transparent XR data practices.

Paperid: 426, https://arxiv.org/pdf/2503.19858.pdf

Abstract:
Deceptive game designs that manipulate players are increasingly common in the gaming industry, but the impact on players is not well studied. While studies have revealed player frustration, there is a gap in understanding how cultural attributes affect the impact of deceptive design in games. This paper proposes a new research direction on the connection between the representation of culture in games and player response to deceptive designs. We believe that understanding the interplay between cultural attributes and deceptive design can inform the creation of games that are ethical and entertaining for players around the globe.

Paperid: 427, https://arxiv.org/pdf/2503.16510.pdf

Abstract:
Humanity is currently facing an existential crisis about the nature of truth and reality driven by the availability of information online which overloads and overwhelms our cognitive capabilities, which we call Cyber-Psychosis. The results of this Cyber-Psychosis include the decline of critical thinking coupled with deceptive influences on the Internet which have become so prolific that they are challenging our ability to form a shared understanding of reality in either the digital or physical world. Fundamental to mending our fractured digital universe is establishing the ability to know where a digital object (i.e. a piece of information like text, audio, or video) came from, whether it was modified, what it is derived from, where it has been circulated, and what (if any) lifetime that information should have. Furthermore, we argue that on-by-default object security for genuine objects will provide the necessary grounding to support critical thinking and rational online behavior, even with the ubiquity of deceptive content. To this end, we propose that the Internet needs an object security service layer. This proposition may not be as distant as it may first seem. Through an examination of several venerable (and new) protocols, we show how pieces of this problem have already been addressed. While interdisciplinary research will be key to properly crafting the architectural changes needed, here we propose an approach for how we can already use fallow protections to begin turning the tide of this emerging Cyber-Psychosis today!

Paperid: 428, https://arxiv.org/pdf/2503.12651.pdf

Abstract:
AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks, but these agent executions often fail to meet human standards, leading to errors that compromise the system's overall performance. Addressing these failures through human intervention is challenging due to the agents' opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper thus introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these agent failures interpretable to humans. The framework first defines clear expectations of each agent by curating human-designed agent criteria. Then, it develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent's execution output. This approach enables granular evaluation of each agent's performance by revealing failures from a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human-agent collaboration, VeriLA paves the way for more trustworthy and human-aligned compound AI systems.

Paperid: 429, https://arxiv.org/pdf/2503.10706.pdf

Abstract:
Given the recent rate of progress in artificial intelligence (AI) and robotics, a tantalizing question is emerging: would robots controlled by emerging AI systems be strongly aligned with human values? In this work, we propose a scalable way to probe this question by generating a benchmark spanning the key moments in 824 major pieces of science fiction literature (movies, TV, novels, and scientific books) where an agent (AI or robot) made critical decisions (good or bad). We use a state-of-the-art LLM's recollection of each key moment to generate questions in similar situations, the decisions made by the agent, and alternative decisions it could have made (good or bad). We then measure an approximation of how well models align with human values on a set of human-voted answers. We also generate rules that can be automatically improved via an amendment process in order to generate the first Sci-Fi inspired constitutions for promoting ethical behavior in AIs and robots in the real world. Our first finding is that modern LLMs paired with constitutions turn out to be well-aligned with human values (95.8%), contrary to unsettling decisions typically made in Sci-Fi (only 21.2% alignment). Secondly, we find that generated constitutions substantially increase alignment compared to the base model (79.4% to 95.8%), and show resilience to an adversarial prompt setting (23.3% to 92.3%). Additionally, we find that those constitutions are among the top performers on the ASIMOV Benchmark, which is derived from real-world images and hospital injury reports. Sci-Fi-inspired constitutions are thus highly aligned and applicable in real-world situations. We release SciFi-Benchmark: a large-scale dataset to advance robot ethics and safety research. It comprises 9,056 questions and 53,384 answers generated through a novel LLM-introspection process, in addition to a smaller human-labeled evaluation set.

Paperid: 430, https://arxiv.org/pdf/2502.19312.pdf

Abstract:
Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.

Paperid: 431, https://arxiv.org/pdf/2502.17903.pdf

Abstract:
Improvements in the area of large language models have shifted towards the construction of models capable of using external tools and interpreting their outputs. These so-called web agents have the ability to interact autonomously with the internet. This allows them to become powerful daily assistants handling time-consuming, repetitive tasks while supporting users in their daily activities. While web agent research is thriving, the sustainability aspect of this research direction remains largely unexplored. We provide an initial exploration of the energy and CO2 cost associated with web agents. Our results show how different philosophies in web agent creation can severely impact the associated expended energy. We highlight the lack of transparency regarding the disclosure of model parameters and processes used for some web agents as a limiting factor when estimating energy consumption. As such, our work advocates a change in thinking when evaluating web agents, warranting dedicated metrics for energy consumption and sustainability.

Paperid: 432, https://arxiv.org/pdf/2502.11357.pdf

Abstract:
Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances in offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making it affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.

Paperid: 433, https://arxiv.org/pdf/2502.11140.pdf

Abstract:
Rapid advancements in Large Language Models (LLMs) have accelerated their integration into automated visualization code generation applications. Despite advancements through few-shot prompting and query expansion, existing methods remain limited in handling ambiguous and complex queries, thereby requiring manual intervention. To overcome these limitations, we propose VisPath: a Multi-Path Reasoning and Feedback-Driven Optimization Framework for Visualization Code Generation. VisPath handles underspecified queries through structured, multi-stage processing. It begins by reformulating the user input via Chain-of-Thought (CoT) prompting, which refers to the initial query while generating multiple extended queries in parallel, enabling the LLM to capture diverse interpretations of the user intent. These queries then generate candidate visualization scripts, which are executed to produce diverse images. By assessing the visual quality and correctness of each output, VisPath generates targeted feedback that is aggregated to synthesize an optimal final result. Extensive experiments on widely-used benchmarks including MatPlotBench and the Qwen-Agent Code Interpreter Benchmark show that VisPath outperforms state-of-the-art methods, offering a more reliable solution for AI-driven visualization code generation.

Paperid: 434, https://arxiv.org/pdf/2501.16780.pdf

Abstract:
The global aging population faces considerable challenges, particularly in communication, due to the prevalence of hearing and speech impairments. To address these, we introduce the AVE speech, a comprehensive multi-modal dataset for speech recognition tasks. The dataset includes a 100-sentence Mandarin corpus with audio signals, lip-region video recordings, and six-channel electromyography (EMG) data, collected from 100 participants. Each subject read the entire corpus ten times, with each sentence averaging approximately two seconds in duration, resulting in over 55 hours of multi-modal speech data per modality. Experiments demonstrate that combining these modalities significantly improves recognition performance, particularly in cross-subject and high-noise environments. To our knowledge, this is the first publicly available sentence-level dataset integrating these three modalities for large-scale Mandarin speech recognition. We expect this dataset to drive advancements in both acoustic and non-acoustic speech recognition research, enhancing cross-modal learning and human-machine interaction.
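As a quick sanity check on the stated dataset size, the figures in the abstract can be multiplied out. This is only back-of-the-envelope arithmetic: the two-second average is approximate, and the function and variable names are illustrative, not taken from the dataset's tooling.

```python
# Back-of-the-envelope check of the per-modality duration quoted in the abstract.
SENTENCES = 100      # size of the Mandarin corpus
REPETITIONS = 10     # each subject read the whole corpus ten times
PARTICIPANTS = 100   # number of subjects
AVG_SECONDS = 2.0    # "approximately two seconds" per sentence (approximate)


def total_hours(sentences, repetitions, participants, avg_seconds):
    """Total recording time, in hours, for a single modality."""
    utterances = sentences * repetitions * participants
    return utterances * avg_seconds / 3600.0


hours = total_hours(SENTENCES, REPETITIONS, PARTICIPANTS, AVG_SECONDS)
print(f"{hours:.1f} hours per modality")
```

With these inputs the estimate comes out slightly above 55 hours, which is consistent with the "over 55 hours" figure.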

Paperid: 435, https://arxiv.org/pdf/2501.12894.pdf

Abstract:
Educational recommender systems (ERSs) play a crucial role in personalizing learning experiences and enhancing educational outcomes by providing recommendations of personalized resources and activities to learners, tailored to their individual learning needs. However, their effectiveness is often diminished by insufficient user control and limited transparency. To address these challenges, in this paper, we present the systematic design and evaluation of an interactive ERS, in which we introduce different levels of user control. Concretely, we introduce user control around the input (i.e., user profile), process (i.e., recommendation algorithm), and output (i.e., recommendations) of the ERS. To evaluate our system, we conducted an online user study (N=30) to explore the impact of user control on users' perceptions of the ERS in terms of several important user-centric aspects. Moreover, we investigated the effects of user control on multiple recommendation goals, namely transparency, trust, and satisfaction, as well as the interactions between these goals. Our results demonstrate the positive impact of user control on user perceived benefits of the ERS. Moreover, our study shows that user control strongly correlates with transparency and moderately correlates with trust and satisfaction. In terms of interaction between these goals, our results reveal that transparency moderately correlates and trust strongly correlates with satisfaction, whereas transparency and trust stand out as less correlated with each other.

Paperid: 436, https://arxiv.org/pdf/2501.08500.pdf

Abstract:
The increasing complexity and volume of network data demand effective analysis approaches, with visual exploration proving particularly beneficial. Immersive technologies, such as augmented reality, virtual reality, and large display walls, have enabled the emerging field of immersive analytics, offering new opportunities to enhance user engagement, spatial awareness, and problem-solving. A growing body of work has explored immersive environments for network visualisation, ranging from design studies to fully integrated applications across various domains. Despite these advancements, the field remains fragmented, lacking a clear description of the design space and a structured overview of the aspects that have already been empirically evaluated. To address this gap, we present a survey of visual network analysis in immersive environments, covering 138 publications retrieved through a structured pipeline. We systematically analyse the key aspects that define the design space, investigate their coverage in prior applications (n=87), and review user evaluations (n=59) that provide empirical evidence for essential design-related questions. By synthesising experimental findings and evaluating existing applications, we identify key achievements, highlight research gaps, and offer guidance for the design of future approaches. Additionally, we provide an online resource to explore our results interactively, which will be updated as new developments emerge.

Paperid: 437, https://arxiv.org/pdf/2501.03370.pdf

Abstract:
The widespread use of social media highlights the need to understand its impact, particularly the role of online social support. This study uses a dataset focused on online social support, which includes binary and multiclass classifications of social support content on social media. The classification of social support is divided into three tasks. The first task focuses on distinguishing between supportive and non-supportive content. The second task aims to identify whether the support is directed toward an individual or a group. The third task categorizes the specific type of social support, grouping it into categories such as Nation, LGBTQ, Black people, Women, Religion, and Other (if it does not fit into the previously mentioned categories). To address data imbalances in these tasks, we employed K-means clustering for balancing the dataset and compared the results with the original unbalanced data. Using advanced machine learning techniques, including transformers and zero-shot learning approaches with GPT3, GPT4, and GPT4-o, we predict social support levels in various contexts. The effectiveness of the dataset is evaluated using baseline models across different learning approaches, with transformer-based methods demonstrating superior performance. Additionally, we achieved a 0.4% increase in the macro F1 score for the second task and a 0.7% increase for the third task, compared to previous work utilizing traditional machine learning with psycholinguistic and unigram-based TF-IDF values.
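Because the abstract reports gains in macro F1 on imbalanced tasks, a minimal sketch of how that metric is computed may help. Macro F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The labels and predictions below are toy values invented for illustration; they are not data or results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels, not from the dataset).
y_true = ["Women", "Women", "Religion", "Other", "Other", "Other"]
y_pred = ["Women", "Other", "Religion", "Other", "Other", "Women"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

In this toy case the per-class scores are 0.5 (Women), 1.0 (Religion), and about 0.667 (Other), so the single Religion example influences the average as much as the three Other examples do.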

Paperid: 438, https://arxiv.org/pdf/2506.22674.pdf

Abstract:
Electric vehicles (EVs) are a promising alternative to fuel vehicles (FVs), given some unique characteristics of EVs, such as low air pollution and maintenance costs. However, the increasing prevalence of EVs is accompanied by widespread complaints regarding the high likelihood of motion sickness (MS) induction, especially when compared to FVs, which has become one of the major obstacles to the acceptance and popularity of EVs. Despite the prevalence of such complaints online and among EV users, the association between vehicle type (i.e., EV versus FV) and MS prevalence and severity has not been quantified. Thus, this study aims to investigate the existence of EV-induced MS and explore the potential factors leading to it. A survey study was conducted to collect passengers' MS experience in EVs and FVs over the past year. In total, 639 valid responses were collected from mainland China. The results show that FVs were associated with a higher frequency of MS, while EVs were found to induce more severe MS symptoms. Further, we found that passengers' MS severity was associated with individual differences (i.e., age, gender, sleep habits, susceptibility to motion-induced MS), in-vehicle activities (i.e., chatting with others and watching in-vehicle displays), and road conditions (i.e., congestion and slope), while the MS frequency was associated with vehicle ownership and riding frequency. The results from this study can guide the directions of future empirical studies that aim to quantify the inducers of MS in EVs and FVs, as well as the optimization of EVs to reduce MS.

Paperid: 439, https://arxiv.org/pdf/2506.16622.pdf

Abstract:
Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. We further develop NLP models that predict public perception scores with strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals' frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, a pattern that is consistent both across different scientific information and for the same science framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.

Paperid: 440, https://arxiv.org/pdf/2506.13498.pdf

Abstract:
This paper comprehensively surveys research trends in imitation learning for contact-rich robotic tasks. Contact-rich tasks, which require complex physical interactions with the environment, represent a central challenge in robotics due to their nonlinear dynamics and sensitivity to small positional deviations. The paper examines demonstration collection methodologies, including teaching methods and sensory modalities crucial for capturing subtle interaction dynamics. We then analyze imitation learning approaches, highlighting their applications to contact-rich manipulation. Recent advances in multimodal learning and foundation models have significantly enhanced performance in complex contact tasks across industrial, household, and healthcare domains. Through systematic organization of current research and identification of challenges, this survey provides a foundation for future advancements in contact-rich robotic manipulation.

Paperid: 441, https://arxiv.org/pdf/2506.12840.pdf

Abstract:
This study explores the classroom implementation of an AI-powered grading platform in K-12 settings through a co-design pilot with 19 teachers. We combine platform usage logs, surveys, and qualitative interviews to examine how teachers use AI-generated rubrics and grading feedback. Findings reveal that while teachers valued the AI's rapid narrative feedback for formative purposes, they distrusted automated scoring and emphasized the need for human oversight. Students welcomed fast, revision-oriented feedback but remained skeptical of AI-only grading. We discuss implications for the design of trustworthy, teacher-centered AI assessment tools that enhance feedback while preserving pedagogical agency.

Paperid: 443, https://arxiv.org/pdf/2506.07211.pdf

Abstract:
The emergence of Large Language Models (LLMs) presents a dual challenge in the fight against disinformation. These powerful tools, capable of generating human-like text at scale, can be weaponised to produce sophisticated and persuasive disinformation, yet they also hold promise for enhancing detection and mitigation strategies. This paper investigates the complex dynamics between LLMs and disinformation through a communication game that simulates online forums, inspired by the game Werewolf, with 25 participants. We analyse how Disinformers, Moderators, and Users leverage LLMs to advance their goals, revealing the potential both for misuse and for combating disinformation. Our findings highlight the varying uses of LLMs depending on the participants' roles and strategies, underscoring the importance of understanding their effectiveness in this context. We conclude by discussing implications for future LLM development and online platform design, advocating for a balanced approach that empowers users and fosters trust while mitigating the risks of LLM-assisted disinformation.

Paperid: 444, https://arxiv.org/pdf/2505.21964.pdf

Abstract:
External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.

Paperid: 445, https://arxiv.org/pdf/2504.20519.pdf

Abstract:
Large language model (LLM) based chatbots show promise in persuasive communication, but existing studies often rely on weak controls or focus on belief change rather than behavioral intentions or outcomes. This pre-registered multi-country (US, Canada, UK) randomized controlled trial involving 930 vaccine-hesitant parents evaluated brief (three-minute) multi-turn conversations with LLM-based chatbots against standard public health messaging approaches for increasing human papillomavirus (HPV) vaccine intentions for their children. Participants were randomly assigned to: (1) a weak control (no message), (2) a strong control reflecting the standard of care (reading official public health materials), or (3 and 4) one of two chatbot conditions. One chatbot was prompted to deliver short, conversational responses, while the other used the model's default output style (longer with bullet points). While chatbot interactions significantly increased self-reported vaccination intent (by 7.1-10.3 points on a 100-point scale) compared to no message, they did not outperform standard public health materials, with the conversational chatbot performing significantly worse. Additionally, while the short-term effects of chatbot interactions faded during a 15-day follow-up, the effects of public health material persisted through a 45-day follow-up relative to no message. These findings suggest that while LLMs can effectively shift vaccination intentions in the short-term, their incremental value over existing public health communications is questionable, offering a more tempered view of their persuasive capabilities and highlighting the importance of integrating AI-driven tools alongside, rather than replacing, current public health strategies.

Paperid: 446, https://arxiv.org/pdf/2504.13921.pdf

Abstract:
This paper presents a novel wireless silent speech interface (SSI) integrating multi-channel textile-based EMG electrodes into a headphone earmuff for real-time, hands-free communication. Unlike conventional patch-based EMG systems, which require large-area electrodes on the face or neck, our approach ensures comfort, discretion, and wearability while maintaining robust silent speech decoding. The system utilizes four graphene/PEDOT:PSS-coated textile electrodes to capture speech-related neuromuscular activity, with signals processed via a compact ESP32-S3-based wireless readout module. To address the challenge of variable skin-electrode coupling, we propose a 1D SE-ResNet architecture incorporating squeeze-and-excitation (SE) blocks to dynamically adjust per-channel attention weights, enhancing robustness against motion-induced impedance variations. The proposed system achieves 96% accuracy on 10 commonly used voice-free control words, outperforming conventional single-channel and non-adaptive baselines. Experimental validation, including XAI-based attention analysis and t-SNE feature visualization, confirms the adaptive channel selection capability and effective feature extraction of the model. This work advances wearable EMG-based SSIs, demonstrating a scalable, low-power, and user-friendly platform for silent communication, assistive technologies, and human-computer interaction.
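To make the per-channel attention idea concrete, here is a minimal sketch of a generic one-dimensional squeeze-and-excitation step in plain Python. It is not the paper's SE-ResNet: the weights are random rather than trained, the bottleneck width of 2 is an arbitrary choice, and in a real network the block would act on learned feature maps inside residual blocks rather than on a raw four-channel signal.

```python
import math
import random


def dense(vec, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def se_block_1d(signal, w1, b1, w2, b2):
    """Generic squeeze-and-excitation over a list of channels.

    signal: list of channels, each a list of samples over time.
    Returns (rescaled signal, per-channel gates in (0, 1)).
    """
    # Squeeze: global average pooling over time, one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in signal]
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]
    # Scale: multiply every sample of a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, signal)]
    return scaled, gates


random.seed(0)
CHANNELS, LENGTH, REDUCED = 4, 8, 2  # four electrodes; other sizes are arbitrary
signal = [[random.uniform(-1, 1) for _ in range(LENGTH)] for _ in range(CHANNELS)]
w1 = [[random.uniform(-1, 1) for _ in range(CHANNELS)] for _ in range(REDUCED)]
b1 = [0.0] * REDUCED
w2 = [[random.uniform(-1, 1) for _ in range(REDUCED)] for _ in range(CHANNELS)]
b2 = [0.0] * CHANNELS

scaled, gates = se_block_1d(signal, w1, b1, w2, b2)
print("per-channel gates:", [round(g, 3) for g in gates])
```

In a trained model, a channel whose electrode contact has degraded could receive a smaller gate, which is the kind of adaptive channel weighting the abstract describes.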

Paperid: 447, https://arxiv.org/pdf/2504.12690.pdf

Abstract:
Seniors represent a growing user base for mobile applications; however, many apps fail to adequately address their accessibility challenges and usability preferences. To investigate this issue, we conducted an exploratory focus group study with 16 senior participants, from which we derived an initial set of user personas highlighting key accessibility and personalisation barriers. These personas informed the development of a model-driven engineering toolset, which was used to generate adaptive mobile app prototypes tailored to seniors' needs. We then conducted a second focus group study with 22 seniors to evaluate these prototypes and validate our findings. Based on insights from both studies, we developed a refined set of personas and a series of accessibility and personalisation recommendations grounded in empirical data, prior research, accessibility standards, and developer resources, aimed at supporting software practitioners in designing more inclusive mobile applications.

Paperid: 448, https://arxiv.org/pdf/2504.11795.pdf

Abstract:
Each type of creative or communicative work is underpinned by an implicit structure. People learn these structures from examples - a process known in cognitive science as schema induction. However, inducing schemas is challenging, as structural patterns are often obscured by surface-level variation. We present Schemex, an interactive visual workflow that scaffolds schema induction through clustering, abstraction, and contrastive refinement. Schemex supports users through visual representations and interactive exploration that connect abstract structures to concrete examples, promoting transparency, adaptability, and effective human-AI collaboration. In our user study, participants reported significantly greater insight and confidence in the schemas developed with Schemex compared to those created using a baseline of an AI reasoning model. We conclude by discussing the broader implications of structural abstraction and contrastive refinement across domains.

Paperid: 449, https://arxiv.org/pdf/2504.10458.pdf

Abstract:
Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, our framework achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.
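As a quick sanity check on the data ratio quoted in the abstract above (3K vs. 13M samples), the arithmetic can be reproduced in a few lines. The variable and function names are illustrative only, and the sample counts are the rounded figures from the abstract, not exact dataset sizes.

```python
# Rounded data budgets as quoted in the abstract ("3K vs. 13M").
rl_samples = 3_000
sft_samples = 13_000_000


def data_fraction_percent(small: int, large: int) -> float:
    """Return `small` expressed as a percentage of `large`."""
    return 100.0 * small / large


fraction = data_fraction_percent(rl_samples, sft_samples)
print(f"{fraction:.3f}%")  # prints 0.023%
```

The result, about 0.023%, is consistent with the "0.02%" reported in the abstract after rounding.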

Paperid: 450, https://arxiv.org/pdf/2504.08678.pdf

Abstract:
Infrastructure as Code (IaC) tools have transformed the way IT infrastructure is automated and managed, but their growing adoption has also exposed numerous challenges for practitioners. In this paper, we investigate these challenges through the lens of Ansible, a popular IaC tool. Using a mixed methods approach, we investigate challenges, obstacles, and issues faced by practitioners. We analyze 59,157 posts from Stack Overflow, Reddit, and the Ansible Forum to identify common pain points, complemented by 16 semi-structured interviews with practitioners of varying expertise levels. Based on our findings, we propose four main recommendations to improve Ansible: 1) refactoring to mitigate performance issues, 2) restructuring higher-level language concepts, 3) improved debugging and error reporting tools, and 4) better documentation and learning resources. By highlighting the real-world struggles of Ansible users, we provide actionable insights for tool designers, educators, and the broader IaC community, contributing to a deeper understanding of the trade-offs inherent in IaC tools.

Paperid: 451, https://arxiv.org/pdf/2504.02124.pdf

Abstract:
Formal verification has recently been increasingly used to prove the correctness and security of many applications. It is attractive because it can prove the absence of errors with the same certainty as mathematicians proving theorems. However, while most security experts recognize the value of formal verification, the views of non-technical users on this topic are unknown. To address this issue, we designed and implemented two experiments to understand how formal verification impacts users. Our approach started with a formative study involving 15 participants, followed by the main quantitative study with 200 individuals. We focus on the application domain of password managers since it has been documented that the lack of trust in password managers might lead to lower adoption. Moreover, recent efforts have focused on formally verifying (parts of) password managers. We conclude that formal verification is seen as desirable by users and identify three actionable recommendations to improve formal verification communication efforts.

Paperid: 452, https://arxiv.org/pdf/2504.02109.pdf

Abstract:
Cybersecurity incidents such as data breaches have become increasingly common, affecting millions of users and organizations worldwide. The complexity of cybersecurity threats challenges the effectiveness of existing security communication strategies. Through a systematic review of over 3,400 papers, we identify specific user difficulties including information overload, technical jargon comprehension, and balancing security awareness with comfort. Our findings reveal consistent communication paradoxes: users require technical details for credibility yet struggle with jargon and need risk awareness without experiencing anxiety. We propose seven evidence-based guidelines to improve security communication and identify critical research gaps including limited studies with older adults, children, and non-US populations, insufficient longitudinal research, and limited protocol sharing for reproducibility. Our guidelines emphasize user-centric communication adapted to cultural and demographic differences while ensuring security advice remains actionable. This work contributes to more effective security communication practices that enable users to recognize and respond to cybersecurity threats appropriately.

Paperid: 453, https://arxiv.org/pdf/2503.17620.pdf

Abstract:
Content annotation at scale remains challenging, requiring substantial human expertise and effort. This paper presents a case study in code documentation analysis, where we explore the balance between automation efficiency and annotation accuracy. We present MCHR (Multi-LLM Consensus with Human Review), a novel semi-automated framework that enhances annotation scalability through the systematic integration of multiple LLMs and targeted human review. Our framework introduces a structured consensus-building mechanism among LLMs and an adaptive review protocol that strategically engages human expertise. Through our case study, we demonstrate that MCHR reduces annotation time by 32% to 100% compared to manual annotation while maintaining high accuracy (85.5% to 98%) across different difficulty levels, from basic binary classification to challenging open-set scenarios.
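The abstract above does not specify how the consensus mechanism or the review protocol work. As a rough illustration of the general idea only, the sketch below uses simple majority voting over labels from several LLMs and routes low-agreement items to a human reviewer. The function name, the `min_agreement` threshold, and the example labels are all assumptions made for this sketch and are not taken from MCHR.

```python
from collections import Counter
from typing import List, Optional, Tuple


def consensus_label(
    labels: List[str], min_agreement: float = 2 / 3
) -> Tuple[Optional[str], bool]:
    """Majority vote over labels proposed by several LLMs.

    Returns (label, needs_human_review). The label is None when the share
    of votes for the top label is below `min_agreement` (a hypothetical
    threshold), in which case the item is flagged for human review.
    """
    if not labels:
        return None, True
    top_label, top_count = Counter(labels).most_common(1)[0]
    agreement = top_count / len(labels)
    if agreement >= min_agreement:
        return top_label, False
    return None, True


# Two of three hypothetical models agree, so the label is accepted.
accepted = consensus_label(["bug", "bug", "feature"])
# No majority reaches the threshold, so the item goes to a human.
escalated = consensus_label(["bug", "feature", "question"])
```

A stricter threshold sends more items to human review and a looser one sends fewer, which is one simple way to trade annotation time against accuracy.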

Paperid: 454, https://arxiv.org/pdf/2503.11704.pdf

Abstract:
Generative artificial intelligence (GenAI) offers new possibilities for generating personalized programming exercises, addressing the need for individual practice. However, the task quality along with the student perspective on such generated tasks remains largely unexplored. Therefore, this paper introduces and evaluates a new feature of the so-called Tutor Kai for generating comprehensive programming tasks, including problem descriptions, code skeletons, unit tests, and model solutions. The presented system allows students to freely choose programming concepts and contextual themes for their tasks. To evaluate the system, we conducted a two-phase mixed-methods study comprising (1) an expert rating of 200 automatically generated programming tasks w.r.t. task quality, and (2) a study with 26 computer science students who solved and rated the personalized programming tasks. Results show that experts classified 89.5% of the generated tasks as functional and 92.5% as solvable. However, the system's rate for implementing all requested programming concepts decreased from 94% for single-concept tasks to 40% for tasks addressing three concepts. The student evaluation further revealed high satisfaction with the personalization. Students also reported perceived benefits for learning. The results imply that the new feature has the potential to offer students individual tasks aligned with their context and need for exercise. Tool developers, educators, and, above all, students can benefit from these insights and the system itself.

Paperid: 455, https://arxiv.org/pdf/2503.11096.pdf

Abstract:
Traditional image annotation tasks rely heavily on human effort for object selection and label assignment, making the process time-consuming and prone to decreased efficiency as annotators experience fatigue after extensive work. This paper introduces a novel framework that leverages the visual understanding capabilities of large multimodal models (LMMs), particularly GPT, to assist annotation workflows. In our proposed approach, human annotators focus on selecting objects via bounding boxes, while the LMM autonomously generates relevant labels. This human-AI collaborative framework enhances annotation efficiency by reducing the cognitive and time burden on human annotators. By analyzing the system's performance across various types of annotation tasks, we demonstrate its ability to generalize to tasks such as object recognition, scene description, and fine-grained categorization. Our proposed framework highlights the potential of this approach to redefine annotation workflows, offering a scalable and efficient solution for large-scale data labeling in computer vision. Finally, we discuss how integrating LMMs into the annotation pipeline can advance bidirectional human-AI alignment, as well as the challenges of alleviating the "endless annotation" burden in the face of information overload by shifting some of the work to AI.

Paperid: 456, https://arxiv.org/pdf/2503.08562.pdf

Abstract:
Mental health challenges among Indian adolescents are shaped by unique cultural and systemic barriers, including high social stigma and limited professional support. Through a mixed-methods study involving a survey of 278 adolescents and follow-up interviews with 12 participants, we explore how adolescents perceive mental health challenges and interact with digital tools. Quantitative results highlight low self-stigma but significant social stigma, a preference for text over voice interactions, and low utilization of mental health apps but high smartphone access. Our qualitative findings reveal that while adolescents value privacy, emotional support, and localized content in mental health tools, existing chatbots lack personalization and cultural relevance. These findings inform recommendations for culturally sensitive chatbot design that prioritizes anonymity, tailored support, and localized resources to better meet the needs of adolescents in India. This work advances culturally sensitive chatbot design by centering underrepresented populations, addressing critical gaps in accessibility and support for adolescents in India.

Paperid: 457, https://arxiv.org/pdf/2503.04768.pdf

Abstract:
On-demand ride-hailing services like DiDi, Uber, and Lyft have transformed urban transportation, offering unmatched convenience and flexibility. In this paper, we introduce DiMA, an LLM-powered ride-hailing assistant deployed in DiDi Chuxing. Its goal is to provide seamless ride-hailing services and beyond through a natural and efficient conversational interface under dynamic and complex spatiotemporal urban contexts. To achieve this, we propose a spatiotemporal-aware order planning module that leverages external tools for precise spatiotemporal reasoning and progressive order planning. Additionally, we develop a cost-effective dialogue system that integrates multi-type dialog repliers with cost-aware LLM configurations to handle diverse conversation goals and trade off response quality and latency. Furthermore, we introduce a continual fine-tuning scheme that utilizes real-world interactions and simulated dialogues to align the assistant's behavior with human-preferred decision-making processes. Since its deployment in the DiDi application, DiMA has demonstrated exceptional performance, achieving 93% accuracy in order planning and 92% in response generation during real-world interactions. Offline experiments further validate DiMA's capabilities, showing improvements of up to 70.23% in order planning and 321.27% in response generation compared to three state-of-the-art agent frameworks, while reducing latency by 0.72× to 5.47×. These results establish DiMA as an effective, efficient, and intelligent mobile assistant for ride-hailing services.

Paperid: 458, https://arxiv.org/pdf/2503.02999.pdf

Abstract:
As educational settings increasingly integrate artificial intelligence (AI), understanding how AI tools identify -- and adapt their responses to -- varied educational contexts becomes paramount. This study examines conversational AI's effectiveness in supporting K-12 mathematics education across various educational contexts. Through qualitative content analysis, we identify educational contexts and key instructional needs present in educator prompts and assess AI's responsiveness. Our findings indicate that educators focus their AI conversations on assessment methods, how to set the cognitive demand level of their instruction, and strategies for making meaningful real-world connections. However, educators' conversations with AI about instructional practices do vary across revealed educational contexts; they shift their emphasis to tailored, rigorous content that addresses their students' unique needs. Educators often seek actionable guidance from AI and reject responses that do not align with their inquiries. While AI can provide accurate, relevant, and useful information when educational contexts or instructional practices are specified in conversation queries, its ability to consistently adapt responses along these evaluation dimensions varies across different educational settings. Significant work remains to realize the response-differentiating potential of conversational AI tools in complex educational use cases. This research contributes insights into developing AI tools that are responsive, proactive, and anticipatory, adapting to evolving educational needs before they are explicitly stated, and provides actionable recommendations for both developers and educators to enhance AI integration in educational practices.

Paperid: 459, https://arxiv.org/pdf/2502.18828.pdf

Abstract:
The use of diverse mobile applications among senior users is becoming increasingly widespread. However, many of these apps contain accessibility problems that result in negative user experiences for seniors. A key reason is that software practitioners often lack the time or resources to address the broad spectrum of age-related accessibility and personalisation needs. As current developer tools and practices encourage one-size-fits-all interfaces with limited potential to address the diversity of senior needs, there is a growing demand for approaches that support the systematic creation of adaptive, accessible app experiences. To this end, we present AdaptForge, a novel model-driven engineering (MDE) approach that enables advanced design-time adaptations of mobile application interfaces and behaviours tailored to the accessibility needs of senior users. AdaptForge uses two domain-specific languages (DSLs) to address age-related accessibility needs. The first model defines users' context-of-use parameters, while the second defines conditional accessibility scenarios and corresponding UI adaptation rules. These rules are interpreted by an MDE workflow to transform an app's original source code into personalised instances. We also report evaluations with professional software developers and senior end-users, demonstrating the feasibility and practical utility of AdaptForge.

Paperid: 460, https://arxiv.org/pdf/2502.16098.pdf

Abstract:
Large multimodal models (LMMs) have enabled new AI-powered applications that help people with visual impairments (PVI) receive natural language descriptions of their surroundings through audible text. We investigated how this emerging paradigm of visual assistance transforms how PVI perform and manage their daily tasks. Moving beyond usability assessments, we examined both the capabilities and limitations of LMM-based tools in personal and social contexts, while exploring design implications for their future development. Through interviews with 14 visually impaired users of Be My AI (an LMM-based application) and analysis of its image descriptions from both study participants and social media platforms, we identified two key limitations. First, these systems' context awareness suffers from hallucinations and misinterpretations of social contexts, styles, and human identities. Second, their intent-oriented capabilities often fail to grasp and act on users' intentions. Based on these findings, we propose design strategies for improving both human-AI and AI-AI interactions, contributing to the development of more effective, interactive, and personalized assistive technologies.

Paperid: 461, https://arxiv.org/pdf/2502.15105.pdf

Abstract:
Expertise is often built by learning from examples. This process, known as schema induction, helps us identify patterns from examples. Despite its importance, schema induction remains a challenging cognitive task. Recent advances in generative AI reasoning capabilities offer new opportunities to support schema induction through human-AI collaboration. We present Schemex, an AI-powered workflow that enhances human schema induction through three stages: clustering, abstraction, and refinement via contrasting examples. We conducted an initial evaluation of Schemex through two real-world case studies: writing abstracts for HCI papers and creating news TikToks. Qualitative analysis demonstrates the high accuracy and usefulness of the generated schemas. We also discuss future work on developing more flexible methods for workflow construction to help humans focus on high-level thinking.

Paperid: 462, https://arxiv.org/pdf/2502.09142.pdf

Abstract:
The integration of robotics and augmented reality (AR) presents transformative opportunities for advancing human-robot interaction (HRI) by improving usability, intuitiveness, and accessibility. This work introduces a controller-free, LLM-driven voice-commanded AR puppeteering system, enabling users to teleoperate a robot by manipulating its virtual counterpart in real time. By leveraging natural language processing (NLP) and AR technologies, our system -- prototyped using Meta Quest 3 -- eliminates the need for physical controllers, enhancing ease of use while minimizing potential safety risks associated with direct robot operation. A preliminary user demonstration successfully validated the system's functionality, demonstrating its potential for safer, more intuitive, and immersive robotic control.

Paperid: 463, https://arxiv.org/pdf/2502.05347.pdf

Abstract:
As AI becomes more capable, it is unclear how human creativity will remain essential in jobs that incorporate AI. We conducted a 14-week study of a student newsroom using an AI tool to convert web articles into social media videos. Most creators treated the tool as a creative springboard, not as a completion mechanism. They edited the AI outputs. The tool enabled the team to publish successful content that received over 500,000 views. Human creativity remained essential: after AI produced templated outputs, creators took ownership of the task, injecting their own creativity, especially when AI failed to create appropriate content. AI was initially seen as an authority, due to creators' lack of experience, but they ultimately learned to assert their own authority.

Paperid: 464, https://arxiv.org/pdf/2502.02329.pdf

Abstract:
Creating data reports is a labor-intensive task involving iterative data exploration, insight extraction, and narrative construction. A key challenge lies in composing the analysis logic, from defining objectives and transforming data to identifying and communicating insights. Manually crafting this logic can be cognitively demanding. While experienced analysts often reuse scripts from past projects, finding a perfect match for a new dataset is rare. Even when similar analyses are available online, they usually share only results or visualizations, not the underlying code, making reuse difficult. To address this, we present ReSpark, a system that leverages large language models (LLMs) to reverse-engineer analysis logic from existing reports and adapt it to new datasets. By generating draft analysis steps, ReSpark provides a warm start for users. It also supports interactive refinement, allowing users to inspect intermediate outputs, insert objectives, and revise content. We evaluate ReSpark through comparative and user studies, demonstrating its effectiveness in lowering the barrier to generating data reports without relying on existing analysis code.

Paperid: 465, https://arxiv.org/pdf/2502.00229.pdf

Abstract:
As mental health issues rise among college students, there is increasing interest in and demand for leveraging Multimodal Large Language Models (MLLMs) to enhance mental support services, yet integrating them into psychotherapy remains theoretical or non-user-centered. This study investigated the opportunities and challenges of using MLLMs within the campus psychotherapy alliance in China. Through three studies involving both therapists and student clients, we argue that the ideal role for MLLMs at this stage is as an auxiliary tool to human therapists. Users widely expect features such as triage matching and real-time emotion recognition. At the same time, for independent therapy by MLLMs, concerns about capabilities and privacy ethics remain prominent, despite high demands for personalized avatars and non-verbal communication. Our findings further indicate that users' sense of social identity and perceived relative status of MLLMs significantly influence their acceptance. This study provides insights for future intelligent campus mental healthcare.

Paperid: 466, https://arxiv.org/pdf/2501.18265.pdf

Abstract:
Evaluating the truthfulness of online content is critical for combating misinformation. This study examines the efficiency and effectiveness of crowdsourced truthfulness assessments through a comparative analysis of two approaches: one involving full-length webpages as evidence for each claim, and another using summaries for each evidence document generated with a large language model. Using an A/B testing setting, we engage a diverse pool of participants tasked with evaluating the truthfulness of statements under these conditions. Our analysis explores both the quality of assessments and the behavioral patterns of participants. The results reveal that relying on summarized evidence offers comparable accuracy and error metrics to the Standard modality while significantly improving efficiency. Workers in the Summary setting complete a significantly higher number of assessments, reducing task duration and costs. Additionally, the Summary modality maximizes internal agreement and maintains consistent reliance on and perceived usefulness of evidence, demonstrating its potential to streamline large-scale truthfulness evaluations.

Paperid: 467, https://arxiv.org/pdf/2501.10338.pdf

Abstract:
We investigate the perception of visual variables on wall-sized tiled displays within an immersive environment. We designed and conducted two formal user studies focusing on elementary visualization reading tasks in VR. The first study compared three different virtual display arrangements (Flat, Cylinder, and Cockpit). It showed that participants made smaller errors on virtual curved walls (Cylinder and Cockpit) compared to Flat. Following that, we compared the results with those from a previous study conducted in a real-world setting. The comparative analysis showed that virtual curved walls resulted in smaller errors than the real-world flat wall display, but with longer task completion time. The second study evaluated the impact of four 3D user interaction techniques (Selection, Walking, Steering, and Teleportation) on performing the elementary task on the virtual Flat wall display. The results confirmed that interaction techniques further improved task performance. Finally, we discuss the limitations and future work.

Paperid: 468, https://arxiv.org/pdf/2501.01849.pdf

Abstract:
The remarkable generative capability of large language models (LLMs) has sparked a growing interest in automatically generating responses for different applications. Given the dynamic nature of user preferences and the uncertainty of LLM response performance, it is crucial to design efficient online learning algorithms to identify optimal LLM responses (i.e., high-quality responses that also meet user preferences). Most existing online algorithms adopt a centralized approach and fail to leverage explicit user preferences for more efficient and personalized LLM response identification. In contrast, this paper introduces MACO (Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification): 1) The online LLM response identification process is accelerated by multiple local agents (such as smartphones), while enhancing data privacy; 2) A novel conversational mechanism is proposed to adaptively conduct conversations for soliciting user preferences (e.g., a preference for a humorous tone over a serious one in generated responses), so as to minimize uncertainty in preference estimation. Our theoretical analysis demonstrates that MACO is near-optimal regarding cumulative regret. Additionally, MACO offers reduced communication costs and computational complexity by eliminating the traditional, computing-intensive "G-optimal design" found in previous works. Extensive experiments with the open LLM Llama, coupled with two different embedding models from Google and OpenAI for text vector representation, demonstrate that MACO significantly outperforms the current state-of-the-art in online LLM response identification.

Paperid: 469, https://arxiv.org/pdf/2506.20993.pdf

Abstract:
Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: Frequency, Depth, Threshold, Effort, and Willingness. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.

Paperid: 470, https://arxiv.org/pdf/2506.16168.pdf

Abstract:
Imagine unlocking the power of the mind to communicate, create, and even interact with the world around us. Recent breakthroughs in Artificial Intelligence (AI), especially in how machines "see" and "understand" language, are now fueling exciting progress in decoding brain signals from scalp electroencephalography (EEG). Prima facie, this opens the door to revolutionary brain-computer interfaces (BCIs) designed for real life, moving beyond traditional uses to envision Brain-to-Speech, Brain-to-Image, and even a Brain-to-Internet of Things (BCIoT). However, the journey is not as straightforward as it was for Computer Vision (CV) and Natural Language Processing (NLP). Applying AI to real-world EEG-based BCIs, particularly in building powerful foundational models, presents unique and intricate hurdles that could affect their reliability. Here, we unfold a guided exploration of this dynamic and rapidly evolving research area. Rather than merely outlining a map of current endeavors and results, the goal is to provide a principled navigation of this hot and cutting-edge research landscape. We consider the basic paradigms that emerge from a causal perspective and the attendant challenges presented to AI-based models. Looking ahead, we then discuss promising research avenues that could overcome today's technological, methodological, and ethical limitations. Our aim is to lay out a clear roadmap for creating truly practical and effective EEG-based BCI solutions that can thrive in everyday environments.

Paperid: 471, https://arxiv.org/pdf/2506.13189.pdf

Abstract:
The integration of robotics and augmented reality (AR) holds transformative potential for advancing human-robot interaction (HRI), offering enhancements in usability, intuitiveness, accessibility, and collaborative task performance. This paper introduces and evaluates a novel multimodal AR-based robot puppeteer framework that enables intuitive teleoperation via a virtual counterpart through large language model (LLM)-driven voice commands and hand gesture interactions. Utilizing the Meta Quest 3, users interact with a virtual counterpart robot in real time, effectively "puppeteering" its physical counterpart within an AR environment. We conducted a within-subject user study with 42 participants performing robotic cube pick-and-place with pattern matching tasks under two conditions: gesture-only interaction and combined voice-and-gesture interaction. Both objective performance metrics and subjective user experience (UX) measures were assessed, including an extended comparative analysis between roboticists and non-roboticists. The results provide key insights into how multimodal input influences contextual task efficiency, usability, and user satisfaction in AR-based HRI. Our findings offer practical implications for designing effective AR-enhanced HRI systems.

Paperid: 472, https://arxiv.org/pdf/2505.21724.pdf

Abstract:
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate synchronized verbal and non-verbal listener feedback online, conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the generated audio and facial responses of the listener. To address these challenges, we introduce text as an intermediate modality to bridge the audio and facial responses. We hence propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multi-modal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.

Paperid: 473, https://arxiv.org/pdf/2505.18771.pdf

Abstract:
Computing education and computing students are rapidly integrating generative AI, but we know relatively little about how different pedagogical strategies for intentionally integrating generative AI affect students' self-efficacy and career interests. This study investigates a SPIRAL integration of generative AI (Skills Practiced Independently, Revisited with AI Later), implemented in an introductory undergraduate creative media and technology course in Fall 2023 (n=31). Students first developed domain skills for half the semester, then revisited earlier material using generative AI, with explicit instruction on how to use it critically and ethically. We contribute a mixed-methods (quantitative and qualitative) analysis of changes in self-efficacy and career interests over time, including longitudinal qualitative interviews (n=9) and thematic analysis. We found positive changes in both students' creative media self-efficacy and generative AI use self-efficacy, and mixed changes for ethical generative AI use self-efficacy. We also found students experienced demystification, transitioning from initial fear about generative AI taking over their fields and jobs, to doubting AI capability to do so and/or that society will push back against AI, through personal use of AI and observing others' use of AI vicariously. For career interests, our SPIRAL integration of generative AI use appeared to have either a neutral or positive influence on students, including widening their perceived career options, depending on their view of how AI would influence the career itself. These findings suggest that careful pedagogical sequencing can mitigate some potential negative impacts of AI, while promoting ethical and critical AI use that supports or has a neutral effect on students' career formation. To our knowledge, our SPIRAL integration strategy applied to generative AI integration is novel.

Paperid: 474, https://arxiv.org/pdf/2505.10831.pdf

Abstract:
Human-computer interaction has long imagined technology that understands us, from our preferences and habits to the timing and purpose of our everyday actions. Yet current user models remain fragmented, narrowly tailored to specific apps, and incapable of the flexible reasoning required to fulfill these visions. This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture user knowledge and preferences. GUMs can infer that a user is preparing for a wedding they're attending from messages with a friend, or recognize that a user is struggling with a collaborator's feedback on a draft by observing multiple stalled edits and a switch to reading related work. GUMs introduce an architecture that infers new propositions about a user from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions. To illustrate the breadth of applications that GUMs enable, we demonstrate how they augment chat-based assistants with context, manage OS notifications to selectively surface important information, and enable interactive agents that adapt to preferences across apps. We also instantiate proactive assistants (GUMBOs) that discover and execute useful suggestions on a user's behalf using their GUM. In our evaluations, we find that GUMs make calibrated and accurate inferences about users, and that assistants built on GUMs proactively identify and perform actions that users wouldn't think to request explicitly. Altogether, GUMs introduce methods that leverage multimodal models to understand unstructured context, enabling long-standing visions of HCI and entirely new interactive systems that anticipate user needs.

Paperid: 475, https://arxiv.org/pdf/2505.07664.pdf

Abstract:
The broad availability of generative AI offers new opportunities to support various work domains, including agile software development. Agile epics are a key artifact for product managers to communicate requirements to stakeholders. However, in practice, they are often poorly defined, leading to churn, delivery delays, and cost overruns. In this industry case study, we investigate opportunities for large language models (LLMs) to evaluate agile epic quality in a global company. Results from a user study with 17 product managers indicate how LLM evaluations could be integrated into their work practices, including perceived values and usage in improving their epics. High levels of satisfaction indicate that agile epics are a new, viable application of AI evaluations. However, our findings also outline challenges, limitations, and adoption barriers that can inform both practitioners and researchers on the integration of such evaluations into future agile work practices.

Paperid: 476, https://arxiv.org/pdf/2505.03163.pdf

Abstract:
AI-driven education, particularly Large Language Models (LLMs), has the potential to address learning disparities in rural K-12 schools. However, research on AI adoption in rural India remains limited, with existing studies focusing primarily on urban settings. This study examines the perceptions of volunteer teachers on AI integration in rural education, identifying key challenges and opportunities. Through semi-structured interviews with 23 volunteer educators in Rajasthan and Delhi, we conducted a thematic analysis to explore infrastructure constraints, teacher preparedness, and digital literacy gaps. Findings indicate that while LLMs could enhance personalized learning and reduce teacher workload, barriers such as poor connectivity, lack of AI training, and parental skepticism hinder adoption. Despite concerns over over-reliance and ethical risks, volunteers emphasize that AI should be seen as a complementary tool rather than a replacement for traditional teaching. Given the potential benefits, LLM-based tutors merit further exploration in rural classrooms, with structured implementation and localized adaptations to ensure accessibility and equity.

Paperid: 477, https://arxiv.org/pdf/2505.01648.pdf

Abstract:
Understanding the dynamics of human-AI interaction in question answering is crucial for enhancing collaborative efficiency. Extending from our initial formative study, which revealed challenges in human utilization of conversational AI support, we designed two configurations for prompt guidance: a Nudging approach, where the AI suggests potential responses for human agents, and a Highlight strategy, emphasizing crucial parts of reference documents to aid human responses. Through two controlled experiments, the first involving 31 participants and the second involving 106 participants, we compared these configurations against traditional human-only approaches, both with and without AI assistance. Our findings suggest that effective human-AI collaboration can enhance response quality, though merely combining human and AI efforts does not ensure improved outcomes. In particular, the Nudging configuration was shown to help improve the quality of the output when compared to AI alone. This paper delves into the development of these prompt guidance paradigms, offering insights for refining human-AI collaborations in conversational question-answering contexts and contributing to a broader understanding of human perceptions and expectations in AI partnerships.

Paperid: 478, https://arxiv.org/pdf/2505.00948.pdf

Abstract:
Teamwork is pivotal in medical settings, where professionals with diverse skills and emotional states collaborate to make critical decisions. This case study examines the interplay between emotions and professional skills in group decision-making during collaborative medical diagnosis within an Intelligent Tutoring System (ITS). By comparing verbal and physiological data between high-performing and low-performing teams of medical professionals working on a patient case within the ITS, alongside individuals' retrospective collaboration experiences, we employ multimodal data analysis to identify patterns in team emotional climate and their impact on diagnostic efficiency. Specifically, we investigate how emotion-driven dialogue and professional expertise influence both the information-seeking process and the final diagnostic decisions. Grounded in the socially shared regulation of learning framework and utilizing sentiment analysis, we found that social-motivational interactions are key drivers of a positive team emotional climate. Furthermore, through content analysis of dialogue and physiological signals to pinpoint emotional fluctuations, we identify episodes where knowledge exchange and skill acquisition are most likely to occur. Our findings offer valuable insights into optimizing group collaboration in medical contexts by harmonizing emotional dynamics with adaptive strategies for effective decision-making, ultimately enhancing diagnostic accuracy and teamwork effectiveness.

Paperid: 479, https://arxiv.org/pdf/2504.21477.pdf

Abstract:
Haptic perception and feedback play a pivotal role in interactive experiences, forming an essential component of human-computer interaction (HCI). In recent years, the field of haptic interaction has witnessed significant advancements, particularly in the area of electrical haptic feedback, driving innovation across various domains. To gain a comprehensive understanding of the current state of research and the latest developments in electrical haptic interaction, this study systematically reviews the literature in this area. Our investigation covers key aspects including haptic devices, haptic perception mechanisms, the comparison and integration of electrical haptic feedback with other feedback modalities, and their diverse applications. Specifically, we conduct a systematic analysis of 110 research papers to explore the forefront of electrical haptic feedback, providing insights into its latest trends, challenges, and future directions.

Paperid: 480, https://arxiv.org/pdf/2504.20792.pdf

Abstract:
Carousel interfaces are widely used in e-commerce and streaming services, but little research has been devoted to them. Previous studies of interfaces for presenting search and recommendation results have focused on single ranked lists, but it appears their results cannot be extrapolated to carousels due to the added complexity. Eye tracking is a highly informative approach to understanding how users click, yet there are no eye tracking studies concerning carousels. There are very few interaction datasets on recommenders with carousel interfaces and none that contain gaze data. We introduce the RecGaze dataset: the first comprehensive feedback dataset on carousels that includes eye tracking results, clicks, cursor movements, and selection explanations. The dataset comprises interactions from 3 movie selection tasks with 40 different carousel interfaces per user. In total, 87 users and 3,477 interactions are logged. In addition to the dataset, its description, and possible use cases, we provide results of a survey on carousel design and the first analysis of gaze data on carousels, which reveals a golden triangle or F-pattern browsing behavior. Our work seeks to advance the field of carousel interfaces by providing the first dataset with eye tracking results on carousels. In this manner, we provide and encourage an empirical understanding of interactions with carousel interfaces, for building better recommender systems through gaze information, and also encourage the development of gaze-based recommenders.

Paperid: 481, https://arxiv.org/pdf/2504.20094.pdf

Abstract:
In this paper, we propose a multi-agent collaboration framework called MATCHA for conversational recommendation systems, leveraging large language models (LLMs) to enhance personalization and user engagement. Users can request recommendations via free-form text and receive curated lists aligned with their interests, preferences, and constraints. Our system introduces specialized agents for intent analysis, candidate generation, ranking, re-ranking, explainability, and safeguards. These agents collaboratively improve recommendation accuracy, diversity, and safety. On eight metrics, our model achieves superior or comparable performance to the current state-of-the-art. Through comparisons with six baseline models, our approach addresses key challenges in conversational recommendation systems for game recommendations, including: (1) handling complex, user-specific requests, (2) enhancing personalization through multi-agent collaboration, (3) empirical evaluation and deployment, and (4) ensuring safe and trustworthy interactions.

Paperid: 482, https://arxiv.org/pdf/2504.15482.pdf

Abstract:
Creative ideation relies on exploring diverse stimuli, but the overwhelming abundance of information often makes it difficult to identify valuable insights or reach the 'aha' moment. Traditional methods for accessing design stimuli lack organization and fail to support users in discovering promising opportunities within large idea spaces. In this position paper, we explore how AI can be leveraged to structure, organize, and surface relevant stimuli, guiding users in both exploring idea spaces and mapping insights back to their design challenges.

Paperid: 483, https://arxiv.org/pdf/2504.11163.pdf

Abstract:
This paper introduces the Robotability Score ($R$), a novel metric that quantifies the suitability of urban environments for autonomous robot navigation. Through expert interviews and surveys, we identify and weigh key features contributing to R for wheeled robots on urban streets. Our findings reveal that pedestrian density, crowd dynamics and pedestrian flow are the most critical factors, collectively accounting for 28% of the total score. Computing robotability across New York City yields significant variation; the area of highest R is 3.0 times more "robotable" than the area of lowest R. Deployments of a physical robot on high and low robotability areas show the adequacy of the score in anticipating the ease of robot navigation. This new framework for evaluating urban landscapes aims to reduce uncertainty in robot deployment while respecting established mobility patterns and urban planning principles, contributing to the discourse on harmonious human-robot environments.
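
The abstract describes R as a score built from weighted features but does not give the formula. As an illustration only, the sketch below assumes a normalized weighted sum; the function name, feature names, and weights are hypothetical and are not taken from the paper (only the 28% share for the pedestrian-related factors echoes the abstract).

```python
def robotability(features, weights):
    """Weighted average of feature scores (hypothetical form of R).

    features: dict mapping feature name -> score in [0, 1]
    weights:  dict mapping feature name -> non-negative weight
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * features[name] for name in weights) / total_weight


# Hypothetical example: pedestrian-related factors carry 28% of the weight.
weights = {"pedestrian_factors": 0.28, "other_factors": 0.72}
street = {"pedestrian_factors": 1.0, "other_factors": 0.5}
score = robotability(street, weights)
```

Normalizing by the total weight keeps the score on the same scale as the feature values, so areas can be compared as ratios, as in the "3.0 times more robotable" comparison.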

Paperid: 484, https://arxiv.org/pdf/2503.22116.pdf

Abstract:
As artificial intelligence (AI) systems become increasingly embedded in critical societal functions, the need for robust red teaming methodologies continues to grow. In this forum piece, we examine emerging approaches to automating AI red teaming, with a particular focus on how the application of automated methods affects human-driven efforts. We discuss the role of labor in automated red teaming processes, the benefits and limitations of automation, and its broader implications for AI safety and labor practices. Drawing on existing frameworks and case studies, we argue for a balanced approach that combines human expertise with automated tools to strengthen AI risk assessment. Finally, we highlight key challenges in scaling automated red teaming, including considerations around worker proficiency, agency, and context-awareness.

Paperid: 485, https://arxiv.org/pdf/2503.18792.pdf

Abstract:
Large Language Models (LLMs), such as the GPT series, have driven significant industrial applications, leading to economic and societal transformations. However, a comprehensive understanding of their real-world applications remains limited. To address this, we introduce REALM, a dataset of over 94,000 LLM use cases collected from Reddit and news articles. REALM captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and explores how users' occupations relate to the types of applications they use. By integrating real-world data, REALM offers insights into LLM adoption across different domains, providing a foundation for future research on their evolving societal roles.

Paperid: 486, https://arxiv.org/pdf/2503.15484.pdf

Abstract:
Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.

Paperid: 487, https://arxiv.org/pdf/2503.12479.pdf

Abstract:
The advent of Large Language Models (LLMs) is reshaping education, particularly in programming, by enhancing problem-solving, enabling personalized feedback, and supporting adaptive learning. Existing AI tools for programming education struggle with key challenges, including the lack of Socratic guidance, direct code generation, limited context retention, minimal adaptive feedback, and the need for prompt engineering. To address these challenges, we introduce Sakshm AI, an intelligent tutoring system for learners across all education levels. It fosters Socratic learning through Disha, its inbuilt AI chatbot, which provides context-aware hints, structured feedback, and adaptive guidance while maintaining conversational memory and supporting language flexibility. This study examines 1170 registered participants, analyzing platform logs, engagement trends, and problem-solving behavior to assess Sakshm AI's impact. Additionally, a structured survey with 45 active users and 25 in-depth interviews were conducted, using thematic encoding to extract qualitative insights. Our findings reveal how AI-driven Socratic guidance influences problem-solving behaviors and engagement, offering key recommendations for optimizing AI-based coding platforms. This research combines quantitative and qualitative insights to inform AI-assisted education, providing a framework for scalable, intelligent tutoring systems that improve learning outcomes. Furthermore, Sakshm AI represents a significant step toward Sustainable Development Goal 4 (Quality Education), providing an accessible and structured learning tool for undergraduate students, even without expert guidance. This is one of the first large-scale studies examining AI-assisted programming education across multiple institutions and demographics.

Paperid: 488, https://arxiv.org/pdf/2503.09838.pdf

Abstract:
We present BioSpark, a system for analogical innovation designed to act as a creativity partner in reducing the cognitive effort in finding, mapping, and creatively adapting diverse inspirations. While prior approaches have focused on initial stages of finding inspirations, BioSpark uses LLMs embedded in a familiar, visual, Pinterest-like interface to go beyond inspiration to supporting users in identifying the key solution mechanisms, transferring them to the problem domain, considering tradeoffs, and elaborating on details and characteristics. To accomplish this BioSpark introduces several novel contributions, including a tree-of-life enabled approach for generating relevant and diverse inspirations, as well as AI-powered cards including 'Sparks' for analogical transfer; 'Trade-offs' for considering pros and cons; and 'Q&A' for deeper elaboration. We evaluated BioSpark through workshops with professional designers and a controlled user study, finding that using BioSpark led to a greater number of generated ideas; those ideas being rated higher in creative quality; and more diversity in terms of biological inspirations used than a control condition. Our results suggest new avenues for creativity support tools embedding AI in familiar interaction paradigms for designer workflows.

Paperid: 489, https://arxiv.org/pdf/2503.09639.pdf

Abstract:
Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we investigate the feasibility of simulating health-related decision-making, using vaccine hesitancy, defined as the delay in acceptance or refusal of vaccines despite the availability of vaccination services (MacDonald, 2015), as a case study. To this end, we introduce the VacSim framework with 100 generative agents powered by Large Language Models (LLMs). VacSim simulates vaccine policy outcomes with the following steps: 1) instantiate a population of agents with demographics based on census data; 2) connect the agents via a social network and model vaccine attitudes as a function of social dynamics and disease-related information; 3) design and evaluate various public health interventions aimed at mitigating vaccine hesitancy. To align with real-world results, we also introduce simulation warmup and attitude modulation to adjust agents' attitudes. We propose a series of evaluations to assess the reliability of various LLM simulations. Experiments indicate that models like Llama and Qwen can simulate aspects of human behavior but also highlight real-world alignment challenges, such as inconsistent responses with demographic profiles. This early exploration of LLM-driven simulations is not meant to serve as definitive policy guidance; instead, it serves as a call for action to examine social simulation for policy development.

Paperid: 490, https://arxiv.org/pdf/2503.04110.pdf

Abstract:
The rise of Large Language Models (LLMs) and generative visual analytics systems has transformed data-driven insights, yet significant challenges persist in accurately interpreting users' analytical and interaction intents. While language inputs offer flexibility, they often lack precision, making the expression of complex intents inefficient, error-prone, and time-intensive. To address these limitations, we investigate the design space of multimodal interactions for generative visual analytics through a literature review and pilot brainstorming sessions. Building on these insights, we introduce a highly extensible workflow that integrates multiple LLM agents for intent inference and visualization generation. We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs. This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses. By employing effective prompt engineering, and contextual interaction linking, alongside intuitive visualization and interaction designs, InterChat bridges the gap between user interactions and LLM-driven visualizations, enhancing both interpretability and usability. Extensive evaluations, including two usage scenarios, a user study, and expert feedback, demonstrate the effectiveness of InterChat. Results show significant improvements in the accuracy and efficiency of handling complex visual analytics tasks, highlighting the potential of multimodal interactions to redefine user engagement and analytical depth in generative visual analytics.

Paperid: 491, https://arxiv.org/pdf/2503.00489.pdf

Abstract:
Prior studies show that adopting the annotation diversity shaped by different backgrounds and life experiences and incorporating it into model learning, i.e., the multi-perspective approach, contributes to the development of more responsible models. Thus, in this paper we propose a new framework for designing and further evaluating perspective-aware models on the stance detection task, in which multiple annotators assign stances based on a controversial topic. We also share a new dataset established through obtaining both human and LLM annotations. Results show that the multi-perspective approach yields better classification performance (higher F1-scores), outperforming the traditional approaches that use a single ground-truth, while displaying lower model confidence scores, probably due to the high level of subjectivity of the stance detection task.

Paperid: 492, https://arxiv.org/pdf/2502.16376.pdf

Abstract:
Explainable AI is increasingly employing argumentation methods to facilitate interactive explanations between AI agents and human users. While existing approaches typically rely on predetermined human user models, there remains a critical gap in dynamically learning and updating these models during interactions. In this paper, we present a framework that enables AI agents to adapt their understanding of human users through argumentation-based dialogues. Our approach, called Persona, draws on prospect theory and integrates a probability weighting function with a Bayesian belief update mechanism that refines a probability distribution over possible human models based on exchanged arguments. Through empirical evaluations with human users in an applied argumentation setting, we demonstrate that Persona effectively captures evolving human beliefs, facilitates personalized interactions, and outperforms state-of-the-art methods.
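
The abstract names two ingredients, a probability weighting function from prospect theory and a Bayesian update over candidate human models, without stating how they are combined. The sketch below is one possible reading, not Persona's actual method: it applies the Tversky-Kahneman (1992) weighting function to each model's likelihood before a standard Bayes step. The function names, the example models, and the choice to weight the likelihood are assumptions.

```python
def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman probability weighting: p^g / (p^g + (1-p)^g)^(1/g)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


def update_beliefs(prior, likelihoods, gamma=0.61):
    """Posterior over human models, using weighted likelihoods (assumed combination).

    prior:       dict model -> prior probability
    likelihoods: dict model -> P(observed argument | model)
    """
    unnormalized = {m: prior[m] * tk_weight(likelihoods[m], gamma) for m in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero weight under every model")
    return {m: v / total for m, v in unnormalized.items()}


# Hypothetical example with two candidate human models.
posterior = update_beliefs({"skeptic": 0.5, "expert": 0.5},
                           {"skeptic": 0.8, "expert": 0.2})
```

With gamma set to 1 the weighting function is the identity, so the update reduces to plain Bayesian conditioning; smaller values overweight small probabilities and underweight large ones.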

Paperid: 493, https://arxiv.org/pdf/2502.15367.pdf

Abstract:
As voice assistants (VAs) become increasingly integrated into daily life, the need for emotion-aware systems that can recognize and respond appropriately to user emotions has grown. While significant progress has been made in speech emotion recognition (SER) and sentiment analysis, effectively addressing user emotions, particularly negative ones, remains a challenge. This study explores human emotional response strategies in VA interactions using a role-swapping approach, where participants regulate AI emotions rather than receiving pre-programmed responses. Through speech feature analysis and natural language processing (NLP), we examined acoustic and linguistic patterns across various emotional scenarios. Results show that participants favor neutral or positive emotional responses when engaging with negative emotional cues, highlighting a natural tendency toward emotional regulation and de-escalation. Key acoustic indicators such as root mean square (RMS), zero-crossing rate (ZCR), and jitter were identified as sensitive to emotional states, while sentiment polarity and lexical diversity (TTR) distinguished between positive and negative responses. These findings provide valuable insights for developing adaptive, context-aware VAs capable of delivering empathetic, culturally sensitive, and user-aligned responses. By understanding how humans naturally regulate emotions in AI interactions, this research contributes to the design of more intuitive and emotionally intelligent voice assistants, enhancing user trust and engagement in human-AI interactions.
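
Three of the features named above have simple textbook definitions. The following minimal sketch computes RMS, zero-crossing rate, and type-token ratio on toy inputs; it is not the study's pipeline, and the function names, whitespace tokenization, and sample data are assumptions made for illustration.

```python
import math


def rms(samples):
    """Root mean square amplitude of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def type_token_ratio(text):
    """Unique words divided by total words (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)


# Toy inputs: an alternating waveform and a short utterance.
wave = [1.0, -1.0, 1.0, -1.0]
energy = rms(wave)
zcr = zero_crossing_rate(wave)
ttr = type_token_ratio("the cat saw the dog")
```

In practice these features are usually computed per short frame of audio rather than over a whole recording, and jitter additionally requires pitch-period estimation, which is omitted here.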

Paperid: 494, https://arxiv.org/pdf/2502.13321.pdf

Abstract:
Trust biases how users rely on AI recommendations in AI-assisted decision-making tasks, with low and high levels of trust resulting in increased under- and over-reliance, respectively. We propose that AI assistants should adapt their behavior through trust-adaptive interventions to mitigate such inappropriate reliance. For instance, when user trust is low, providing an explanation can elicit more careful consideration of the assistant's advice by the user. In two decision-making scenarios -- laypeople answering science questions and doctors making medical diagnoses -- we find that providing supporting and counter-explanations during moments of low and high trust, respectively, yields up to 38% reduction in inappropriate reliance and 20% improvement in decision accuracy. We are similarly able to reduce over-reliance by adaptively inserting forced pauses to promote deliberation. Our results highlight how AI adaptation to user trust facilitates appropriate reliance, presenting exciting avenues for improving human-AI collaboration.
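
The core decision rule described above, supporting explanations at low trust and counter-explanations at high trust, can be sketched as a simple threshold policy. This is an illustrative assumption, not the paper's implementation: the function name, the trust scale of 0 to 1, the threshold values, and the returned labels are all hypothetical.

```python
def choose_intervention(trust, low=0.3, high=0.7):
    """Pick an intervention from an estimated trust level in [0, 1].

    Low trust risks under-reliance, so the assistant supports its advice.
    High trust risks over-reliance, so the assistant offers a counter-explanation.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be in [0, 1]")
    if trust < low:
        return "supporting_explanation"
    if trust > high:
        return "counter_explanation"
    return "no_intervention"


# Hypothetical trust estimates over three consecutive decisions.
plan = [choose_intervention(t) for t in (0.1, 0.5, 0.9)]
```

The forced-pause intervention mentioned in the abstract could be slotted into the high-trust branch in the same way.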

Paperid: 495, https://arxiv.org/pdf/2502.01553.pdf

Abstract:
Livestreaming by VTubers -- animated 2D/3D avatars controlled by real individuals -- has recently garnered substantial global followings and achieved significant monetary success. Despite prior research highlighting the importance of realism in audience engagement, VTubers deliberately conceal their identities, cultivating dedicated fan communities through virtual personas. While previous studies underscore that building a core fan community is essential to a streamer's success, we lack an understanding of the characteristics of viewers of this new type of streamer. Gaining a deeper insight into these viewers is critical for VTubers to enhance audience engagement, foster a more robust fan base, and attract a larger viewership. To address this gap, we conduct a comprehensive analysis of VTuber viewers on Bilibili, a leading livestreaming platform where nearly all VTubers in China stream. By compiling a first-of-its-kind dataset covering 2.7M livestreaming sessions, we investigate the characteristics, engagement patterns, and influence of VTuber viewers. Our research yields several valuable insights, which we then leverage to develop a tool to "recommend" future subscribers to VTubers. By reversing the typical approach of recommending streams to viewers, this tool assists VTubers in pinpointing potential future fans to pay more attention to, thereby effectively growing their fan community.

Paperid: 496, https://arxiv.org/pdf/2501.18588.pdf

Abstract:
With recent advancements in the capabilities of Text-to-Image (T2I) AI models, product designers have begun experimenting with them in their work. However, T2I models struggle to interpret abstract language and the current user experience of T2I tools can induce design fixation rather than a more iterative, exploratory process. To address these challenges, we developed Inkspire, a sketch-driven tool that supports designers in prototyping product design concepts with analogical inspirations and a complete sketch-to-design-to-sketch feedback loop. To inform the design of Inkspire, we conducted an exchange session with designers and distilled design goals for improving T2I interactions. In a within-subjects study comparing Inkspire to ControlNet, we found that Inkspire supported designers with more inspiration and exploration of design ideas, and improved aspects of the co-creative process by allowing designers to effectively grasp the current state of the AI to guide it towards novel design intentions.

Paperid: 497, https://arxiv.org/pdf/2501.17899.pdf

Abstract:
This paper proposes a Right to AI, which asserts that individuals and communities should meaningfully participate in the development and governance of the AI systems that shape their lives. Motivated by the increasing deployment of AI in critical domains and inspired by Henri Lefebvre's concept of the Right to the City, we reconceptualize AI as a societal infrastructure, rather than merely a product of expert design. In this paper, we critically evaluate how generative agents, large-scale data extraction, and diverse cultural values bring new complexities to AI oversight. The paper proposes that grassroots participatory methodologies can mitigate biased outcomes and enhance social responsiveness. It asserts that data is socially produced and should be managed and owned collectively. Drawing on Sherry Arnstein's Ladder of Citizen Participation and analyzing nine case studies, the paper develops a four-tier model for the Right to AI that situates the current paradigm and envisions an aspirational future. It proposes recommendations for inclusive data ownership, transparent design processes, and stakeholder-driven oversight. We also discuss market-led and state-centric alternatives and argue that participatory approaches offer a better balance between technical efficiency and democratic legitimacy.

Paperid: 498, https://arxiv.org/pdf/2501.17799.pdf

Abstract:
Inspirational search, the process of exploring designs to inform and inspire new creative work, is pivotal in mobile user interface (UI) design. However, exploring the vast space of UI references remains a challenge. Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps. Additionally, these models typically require metadata like view hierarchies, limiting their practical use. We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images. We identified key UI semantics through a formative study and developed a semantic-based UI search system. Through computational and human evaluations, we demonstrate that our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience. We enhance the understanding of mobile UI design semantics and highlight MLLMs' potential in inspirational search, providing a rich dataset of UI semantics for future studies.

Paperid: 499, https://arxiv.org/pdf/2501.17322.pdf

Abstract:
Visual prostheses are designed to restore partial functional vision in patients with total vision loss. Retinal visual prostheses provide limited capabilities as a result of low resolution, limited field of view and poor dynamic range. Understanding the influence of these parameters on the perception results can guide prostheses research and design. In this work, we evaluate the influence of field of view with respect to spatial resolution in visual prostheses, measuring the accuracy and response time in a search and recognition task. Twenty-four normally sighted participants were asked to find and recognize common objects, such as furniture and home appliances, in indoor room scenes. For the experiment, we use a new simulated prosthetic vision system that allows simple and effective experimentation. Our system uses a virtual-reality environment based on panoramic scenes. The simulator employs a head-mounted display which allows users to feel immersed in the scene by perceiving the entire scene all around. Our experiments use public image datasets and a commercial head-mounted display. We have also released the virtual-reality software for replicating and extending the experimentation. Results show that the accuracy and response time decrease when the field of view is increased. Furthermore, performance appears to be correlated with the angular resolution, but shows diminishing returns even at resolutions below 2.3 phosphenes per degree. Our results seem to indicate that, for the design of retinal prostheses, it is better to concentrate the phosphenes in a small area, to maximize the angular resolution, even if that implies sacrificing field of view.

Paperid: 500, https://arxiv.org/pdf/2501.12289.pdf

Abstract:
Emotions are known to mediate the relationship between users' content consumption and their online engagement, with heightened emotional intensity leading to increased engagement. Building on this insight, we propose three regressor-guided image editing approaches aimed at diminishing the emotional impact of images. These include (i) a parameter optimization approach based on global image transformations known to influence emotions, (ii) an optimization approach targeting the style latent space of a generative adversarial network, and (iii) a diffusion-based approach employing classifier guidance and classifier-free guidance. Our findings demonstrate that these approaches can effectively alter the emotional properties of images while maintaining high visual quality. Optimization-based methods primarily adjust low-level properties like color hues and brightness, whereas the diffusion-based approach introduces semantic changes, such as altering appearance or facial expressions. Notably, results from a behavioral study reveal that only the diffusion-based approach successfully elicits changes in viewers' emotional responses while preserving high perceived image quality. In future work, we will investigate the impact of these image adaptations on internet user behavior.

Paperid: 501, https://arxiv.org/pdf/2501.10551.pdf

Abstract:
As large language models (LLMs) advance and become widespread, students increasingly turn to systems like ChatGPT for assistance with writing tasks. Educators are concerned with students' usage of ChatGPT beyond cheating; using ChatGPT may reduce their critical engagement with writing, hindering students' learning processes. The negative or positive impact of using LLM-powered tools for writing will depend on how students use them; however, how students use ChatGPT remains largely unknown, resulting in a limited understanding of its impact on learning. To better understand how students use these tools, we conducted an online study $(n=70)$ where students were given an essay-writing task using a custom platform we developed to capture the queries they made to ChatGPT. To characterize their ChatGPT usage, we categorized each of the queries students made to ChatGPT. We then analyzed the relationship between ChatGPT usage and a variety of other metrics, including students' self-perception, attitudes towards AI, and the resulting essay itself. We found that factors such as gender, race, and perceived self-efficacy can help predict different AI usage patterns. Additionally, we found that different usage patterns were associated with varying levels of enjoyment and perceived ownership over the essay. The results of this study contribute to discussions about how writing education should incorporate generative AI-powered tools in the classroom.

Paperid: 502, https://arxiv.org/pdf/2501.01568.pdf

Abstract:
Interruptions, a fundamental component of human communication, can enhance the dynamism and effectiveness of conversations, but only when effectively managed by all parties involved. Despite advancements in robotic systems, state-of-the-art systems still have limited capabilities in handling user-initiated interruptions in real-time. Prior research has primarily focused on post hoc analysis of interruptions. To address this gap, we present a system that detects user-initiated interruptions and manages them in real-time based on the interrupter's intent (i.e., cooperative agreement, cooperative assistance, cooperative clarification, or disruptive interruption). The system was designed based on interaction patterns identified from human-human interaction data. We integrated our system into an LLM-powered social robot and validated its effectiveness through a timed decision-making task and a contentious discussion task with 21 participants. Our system successfully handled 93.69% (n=104/111) of user-initiated interruptions. We discuss our learnings and their implications for designing interruption-handling behaviors in conversational robots.

Paperid: 503, https://arxiv.org/pdf/2506.22462.pdf

Abstract:
Fall detection is critical to support the growing elderly population, projected to reach 2.1 billion by 2050. However, existing methods often face data scarcity challenges or compromise privacy. We propose a novel IoT-based Fall Detection as a Service (FDaaS) framework to assist the elderly in living independently and safely by accurately detecting falls. We design a service-oriented architecture that leverages Ultra-wideband (UWB) radar sensors as an IoT health-sensing service, ensuring privacy and minimal intrusion. We address the challenges of data scarcity by utilizing a Fall Detection Generative Pre-trained Transformer (FD-GPT) that uses augmentation techniques. We developed a protocol to collect a comprehensive dataset of elderly people's daily activities and fall events. This resulted in a real dataset that carefully mimics the elderly's routine. We rigorously evaluate and compare various models using this dataset. Experimental results show our approach achieves 90.72% accuracy and 89.33% precision in distinguishing between fall events and regular activities of daily living.

Paperid: 504, https://arxiv.org/pdf/2506.15084.pdf

Abstract:
Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead graphically mislead users about the underlying data, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix them. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. More findings can be found in our manuscript.

Paperid: 505, https://arxiv.org/pdf/2506.14200.pdf

Abstract:
Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K "Why" questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an "educator" to assess model explanations' fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.

Paperid: 506, https://arxiv.org/pdf/2506.06306.pdf

Abstract:
Agitation is one of the most common responsive behaviors in people living with dementia, particularly among those residing in community settings without continuous clinical supervision. Timely prediction of agitation can enable early intervention, reduce caregiver burden, and improve the quality of life for both patients and caregivers. This study aimed to develop and benchmark machine learning approaches for the early prediction of agitation in community-dwelling older adults with dementia using multimodal sensor data. A new set of agitation-related contextual features derived from activity data was introduced and employed for agitation prediction. A wide range of machine learning and deep learning models was evaluated across multiple problem formulations, including binary classification for single-timestamp tabular sensor data and multi-timestamp sequential sensor data, as well as anomaly detection for single-timestamp tabular sensor data. The study utilized the Technology Integrated Health Management (TIHM) dataset, the largest publicly available dataset for remote monitoring of people living with dementia, comprising 2,803 days of in-home activity, physiology, and sleep data. The most effective setting involved binary classification of sensor data using the current 6-hour timestamp to predict agitation at the subsequent timestamp. Incorporating additional information, such as time of day and agitation history, further improved model performance, with the highest AUC-ROC of 0.9720 and AUC-PR of 0.4320 achieved by the light gradient boosting machine. This work presents the first comprehensive benchmarking of state-of-the-art techniques for agitation prediction in community-based dementia care using privacy-preserving sensor data. The approach enables accurate, explainable, and efficient agitation prediction, supporting proactive dementia care and aging in place.
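
The best-performing setting described above (features from the current 6-hour window predicting agitation in the next window, with agitation history as an extra feature) can be sketched as a simple data-framing step. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function name `make_next_window_pairs`, the two-value feature vectors, and the toy data are all hypothetical.

```python
from typing import List, Tuple


def make_next_window_pairs(
    features: List[List[float]], agitation: List[int]
) -> Tuple[List[List[float]], List[int]]:
    """Pair the features of window t with the agitation label of window t+1.

    The current window's agitation flag is appended as a history feature
    (an illustrative choice; the paper's exact feature set is not given here).
    """
    if len(features) != len(agitation):
        raise ValueError("features and agitation must have the same length")
    inputs, targets = [], []
    for t in range(len(features) - 1):
        inputs.append(features[t] + [float(agitation[t])])
        targets.append(agitation[t + 1])
    return inputs, targets


# Toy example: four consecutive 6-hour windows with two made-up sensor features.
windows = [[0.2, 61.0], [0.5, 70.0], [0.9, 88.0], [0.4, 65.0]]
labels = [0, 0, 1, 0]
X, y = make_next_window_pairs(windows, labels)
```

The resulting `X` and `y` could then be passed to any binary classifier; the last window has no following label, so it produces no training pair.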

Paperid: 507, https://arxiv.org/pdf/2506.05579.pdf

Abstract:
Recent advancements in AI reasoning have driven substantial improvements across diverse tasks. A critical open question is whether these improvements also yield better knowledge transfer: the ability of models to communicate reasoning in ways humans can understand, apply, and learn from. To investigate this, we introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities and conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations' influence on human understanding. Our findings reveal that although model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent, featuring significant outliers, indicating that knowledge transfer requires dedicated optimization. Our analysis identifies behavioral and strategic factors mediating successful knowledge transfer. We release our code, dataset, and evaluation framework to support future work on communicatively aligned models.

Paperid: 508, https://arxiv.org/pdf/2506.04444.pdf

Abstract:
In this paper, we investigate the challenges associated with using egocentric devices to photorealistically reconstruct the scene in high dynamic range. Existing methodologies typically assume using frame-rate 6DoF pose estimated from the device's visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. Firstly, in contrast to mainstream work treating the RGB camera as a global-shutter frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling-shutter RGB sensing camera in a high-frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Secondly, we incorporate a physical image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by sensors. Our proposed formulation is applicable to the widely-used variants of Gaussian Splats representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at http://www.projectaria.com/photoreal-reconstruction/
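
To put the reported +1 dB PSNR gains in perspective, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE). It assumes pixel values normalized to [0, 1]; how the paper computes PSNR on high-dynamic-range images is not stated in the abstract, so this is only an illustration of the arithmetic, and the function names are hypothetical.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


baseline_mse = 0.01                      # made-up value for illustration
ratio = mse_ratio_for_gain(1.0)          # about 0.794
improved_mse = baseline_mse * ratio
gain = psnr(improved_mse) - psnr(baseline_mse)
```

Under this definition, each +1 dB corresponds to roughly a 20.6% reduction in mean squared error, so two stacked +1 dB gains reduce it by about 37%.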

Paperid: 509, https://arxiv.org/pdf/2506.00583.pdf

Abstract:
Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet's semantic intent. Human evaluations confirm our approach effectively reduces perceived offensiveness without sacrificing meaning. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.

Paperid: 510, https://arxiv.org/pdf/2506.00052.pdf

Abstract:
Student engagement plays a crucial role in the successful delivery of educational programs. Automated engagement measurement helps instructors monitor student participation, identify disengagement, and adapt their teaching strategies to enhance learning outcomes effectively. This paper identifies two key challenges in this problem: class imbalance and incorporating order into engagement levels rather than treating them as mere categories. Then, a novel approach to video-based student engagement measurement in virtual learning environments is proposed that utilizes supervised contrastive learning for ordinal classification of engagement. Various affective and behavioral features are extracted from video samples and utilized to train ordinal classifiers within a supervised contrastive learning framework (with a sequential classifier as the encoder). A key step involves the application of diverse time-series data augmentation techniques to these feature vectors, enhancing model training. The effectiveness of the proposed method was evaluated using a publicly available dataset for engagement measurement, DAiSEE, containing videos of students who participated in virtual learning programs. The results demonstrate the robust ability of the proposed method for the classification of the engagement level. This approach promises a significant contribution to understanding and enhancing student engagement in virtual learning environments.

Paperid: 512, https://arxiv.org/pdf/2505.18412.pdf

Abstract:
Exercise-based rehabilitation improves quality of life and reduces morbidity, mortality, and rehospitalization, though transportation constraints and staff shortages lead to high dropout rates from rehabilitation programs. Virtual platforms enable patients to complete prescribed exercises at home, while AI algorithms analyze performance, deliver feedback, and update clinicians. Although many studies have developed machine learning and deep learning models for exercise quality assessment, few have explored the use of large language models (LLMs) for feedback and are limited by the lack of rehabilitation datasets containing textual feedback. In this paper, we propose a new method in which exercise-specific features are extracted from the skeletal joints of patients performing rehabilitation exercises and fed into pre-trained LLMs. Using a range of prompting techniques, such as zero-shot, few-shot, chain-of-thought, and role-play prompting, LLMs are leveraged to evaluate exercise quality and provide feedback in natural language to help patients improve their movements. The method was evaluated through extensive experiments on two publicly available rehabilitation exercise assessment datasets (UI-PRMD and REHAB24-6) and showed promising results in exercise assessment, reasoning, and feedback generation. This approach can be integrated into virtual rehabilitation platforms to help patients perform exercises correctly, support recovery, and improve health outcomes.

Paperid: 513, https://arxiv.org/pdf/2505.17555.pdf

Abstract:
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, the training of TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions. However, its application in TAL faces difficulties in defining complex actions in the context of temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain the relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming frameworks.
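
The data-programming idea behind this abstract can be illustrated with a small sketch: a human-defined labeling function marks frames where a relation between two nodes holds, and runs of marked frames become (start, end) action labels. This is not ProTAL's implementation; the per-frame dictionary, the `hand_ball_dist` key, and the 0.1 threshold are hypothetical stand-ins for a distance constraint between a body part and an object.

```python
from typing import Callable, Dict, List, Tuple

Frame = Dict[str, float]  # hypothetical per-frame measurements


def hand_near_ball(frame: Frame) -> bool:
    """Labeling function: fires when a (hypothetical) hand-ball distance is small."""
    return frame["hand_ball_dist"] < 0.1


def label_frames(frames: List[Frame], lf: Callable[[Frame], bool]) -> List[bool]:
    """Apply one labeling function to every frame."""
    return [lf(frame) for frame in frames]


def frames_to_segments(flags: List[bool], fps: float) -> List[Tuple[float, float]]:
    """Merge runs of consecutive positive frames into (start, end) times in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a run of positive frames begins
        elif not flag and start is not None:
            segments.append((start / fps, i / fps))
            start = None                   # the run ended on the previous frame
    if start is not None:                  # a run reaches the end of the video
        segments.append((start / fps, len(flags) / fps))
    return segments


frames = [{"hand_ball_dist": d} for d in (0.5, 0.05, 0.08, 0.3, 0.02)]
flags = label_frames(frames, hand_near_ball)
segments = frames_to_segments(flags, fps=1.0)
```

Labels produced this way are noisy, which is consistent with the abstract's use of a semi-supervised method to train the final TAL model.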

Paperid: 514, https://arxiv.org/pdf/2505.14668.pdf

Abstract:
Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule-based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context-aware proactive agent that incorporates extensive sensory contexts to enhance the proactive capabilities of LLM agents. ContextAgent first extracts multi-dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and the persona contexts from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context-aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human-centric, proactive AI assistants.

Paperid: 515, https://arxiv.org/pdf/2505.10718.pdf

Abstract:
Semantic feature norms have been foundational in the study of human conceptual knowledge, yet traditional methods face trade-offs between concept/feature coverage and verifiability of quality due to the labor-intensive nature of norming studies. Here, we introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs) while verifying the quality of norms against reliable human judgments. We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts while outperforming a comparable human-only norm dataset and word-embedding models in predicting people's semantic similarity judgments. Taken together, we demonstrate that human conceptual knowledge is richer than captured in previous norm datasets and show that, with proper validation, LLMs can serve as powerful tools for cognitive science research.
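
One common way to turn feature norms into similarity predictions is to compare the feature sets of two concepts with cosine similarity. The sketch below assumes binary features (a feature is either listed for a concept or not); the abstract does not say how NOVA scores similarity, and the concepts and features here are invented for illustration.

```python
import math
from typing import Dict, Set


def cosine_similarity(features_a: Set[str], features_b: Set[str]) -> float:
    """Cosine similarity between two binary feature vectors given as sets."""
    if not features_a or not features_b:
        return 0.0
    shared = len(features_a & features_b)
    return shared / math.sqrt(len(features_a) * len(features_b))


# Toy feature norms (illustrative only, not taken from any dataset).
norms: Dict[str, Set[str]] = {
    "dog": {"has_fur", "barks", "has_four_legs"},
    "cat": {"has_fur", "meows", "has_four_legs"},
    "car": {"has_wheels"},
}

dog_cat = cosine_similarity(norms["dog"], norms["cat"])
dog_car = cosine_similarity(norms["dog"], norms["car"])
```

Denser norms give more shared features per concept pair, which is one plausible reason higher feature density could help when predicting human similarity judgments.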

Paperid: 516, https://arxiv.org/pdf/2504.15098.pdf

Abstract:
In interactions between automated vehicles (AVs) and crossing pedestrians, modeling implicit vehicle communication is crucial. In this work, we present a combined prediction and planning approach that allows us to consider the influence of the planned vehicle behavior on a pedestrian and to predict the pedestrian's reaction. We plan the behavior by solving two consecutive optimal control problems (OCPs) analytically, using variational calculus. We perform a validation step that assesses whether the planned vehicle behavior is adequate to trigger a certain pedestrian reaction, which accounts for the closed-loop characteristics of prediction and planning influencing each other. In this step, we model the influence of the planned vehicle behavior on the pedestrian using a probabilistic behavior acceptance model that returns an estimate for the crossing probability. The probabilistic modeling of the pedestrian reaction facilitates considering the pedestrian's costs, thereby improving cooperative behavior planning. We demonstrate the performance of the proposed approach in simulated vehicle-pedestrian interactions with varying initial settings and highlight the decision making capabilities of the planning approach.

Paperid: 517, https://arxiv.org/pdf/2504.10296.pdf

Abstract:
Robots with wheeled, quadrupedal, or humanoid forms are increasingly integrated into built environments. However, unlike human social learning, they lack a critical pathway for intrinsic cognitive development, namely, learning from human feedback during interaction. To understand human ubiquitous observation, supervision, and shared control in dynamic and uncertain environments, this study presents a brain-computer interface (BCI) framework that enables classification of electroencephalogram (EEG) signals to detect cognitively demanding and safety-critical events. As a timely and motivating co-robotic engineering application, we simulate a human-in-the-loop scenario to flag risky events in semi-autonomous robotic driving, representative of long-tail cases that pose persistent bottlenecks to the safety performance of smart mobility systems and robotic vehicles. Drawing on recent advances in few-shot learning, we propose a dual-attention Siamese convolutional network paired with a Dynamic Time Warping Barycenter Averaging approach to generate robust EEG-encoded signal representations. Inverse source localization reveals activation in Brodmann areas 4 and 9, indicating perception-action coupling during task-relevant mental imagery. The model achieves 80% classification accuracy under data-scarce conditions and exhibits a nearly 100% increase in the utility of salient features compared to state-of-the-art methods, as measured through integrated gradient attribution. Beyond performance, this study contributes to our understanding of the cognitive architecture required for BCI agents (particularly the role of attention and memory mechanisms) in categorizing diverse mental states and supporting both inter- and intra-subject adaptation. Overall, this research advances the development of cognitive robotics and socially guided learning for service robots in complex built environments.

Paperid: 518, https://arxiv.org/pdf/2504.02624.pdf

Abstract:
In this paper, we propose EmbodiedSense, a sensing system based on commercial earphones, which enables fine-grained activity logs using existing sensors. The activity logs record both user activities and the scenario in which the activities took place, benefiting detailed behavior understanding. By understanding both the user and the environment, EmbodiedSense addresses three main challenges: the limited recognition capability caused by information-hungry configurations (i.e., limited sensors available), the ineffective fusion to extract ambient information such as contextual scenarios, and the interference from ambient noise. Specifically, EmbodiedSense consists of a context-aware scenario recognition module and spatial-aware activity detection, which is further integrated with other attributes by expert knowledge. We implement our system on commercial earphones equipped with binaural microphones and an Inertial Measurement Unit (IMU). By distinguishing usage scenarios and identifying the source of sounds, EmbodiedSense enables fine-grained activity logs in a zero-shot manner (evaluated with up to 41 categories) and outperforms strong baselines like ImageBind-LLM by 38% F1-score. Extensive evaluations demonstrate that EmbodiedSense is a promising solution for long-term and short-term activity logs and provides significant benefits in monitoring the wearer's daily life.

Paperid: 519, https://arxiv.org/pdf/2503.20666.pdf

Abstract:
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi-agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi-agent LLM systems with human-in-the-loop integration, enhancing quality while significantly reducing manual workload.

Paperid: 520, https://arxiv.org/pdf/2503.18765.pdf

Abstract:
Group Decision-Making (GDM) plays a crucial role in various real-life scenarios where individuals express their opinions in natural language rather than structured numerical values. Traditional GDM approaches often overlook the subjectivity and ambiguity present in human discussions, making it challenging to achieve a fair and consensus-driven decision. This paper proposes a fuzzy consensus-based group decision-making system that integrates sentiment and emotion analysis to extract preference values from textual inputs. The proposed framework combines explicit voting preferences with sentiment scores derived from chat discussions, which are then processed using a Fuzzy Inference System (FIS) to compute a total preference score for each alternative and determine the top-ranked option. To ensure fairness in group decision-making, we introduce a fuzzy logic-based consensus measurement model that evaluates participants' agreement and confidence levels to assess overall feedback. To illustrate the effectiveness of our approach, we apply the methodology to a restaurant selection scenario, where a group of individuals must decide on a dining option based on brief chat discussions. The results demonstrate that the fuzzy consensus mechanism successfully aggregates individual preferences and ensures a balanced outcome that accurately reflects group sentiment.
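
The combination of explicit votes and sentiment scores described above can be illustrated with a very small fuzzy inference sketch. The abstract does not give the membership functions or rule base, so everything below is an assumption: linear "low"/"high" memberships on [0, 1], four hand-written rules, min as the fuzzy AND, and a zero-order Sugeno-style weighted average as the output. It is a stand-in for the idea, not the paper's Fuzzy Inference System.

```python
from typing import Dict, List, Tuple


def preference_score(vote: float, sentiment: float) -> float:
    """Fuse a vote and a sentiment score (both in [0, 1]) into one preference."""
    vote_high, vote_low = vote, 1.0 - vote
    sent_high, sent_low = sentiment, 1.0 - sentiment
    # Each rule: (firing strength using min as fuzzy AND, crisp consequent).
    rules = [
        (min(vote_high, sent_high), 1.0),  # voted for it and spoke positively
        (min(vote_high, sent_low), 0.5),   # voted for it but spoke negatively
        (min(vote_low, sent_high), 0.5),   # did not vote for it but spoke positively
        (min(vote_low, sent_low), 0.0),    # neither
    ]
    total_strength = sum(strength for strength, _ in rules)
    return sum(strength * out for strength, out in rules) / total_strength


def rank_alternatives(
    inputs: Dict[str, List[Tuple[float, float]]]
) -> List[Tuple[str, float]]:
    """Average per-participant scores for each alternative and sort descending."""
    totals = {
        name: sum(preference_score(v, s) for v, s in pairs) / len(pairs)
        for name, pairs in inputs.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Two participants, two restaurants: (vote, sentiment) per participant.
group = {
    "pizza": [(1.0, 0.9), (0.0, 0.4)],
    "sushi": [(0.0, 0.2), (1.0, 0.8)],
}
ranking = rank_alternatives(group)
top_choice = ranking[0][0]
```

With these made-up inputs, the second participant's mildly positive remarks about pizza tip the ranking, which shows how sentiment can break what would otherwise be a tied vote.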

Paperid: 521, https://arxiv.org/pdf/2503.03979.pdf

Abstract:
The reasoning processes of Large Language Models (LLMs) are challenging to analyze due to their complexity and the lack of organized visualization tools. We present ReasonGraph, a web-based platform for visualizing and analyzing LLM reasoning processes. It supports both sequential and tree-based reasoning methods while integrating with major LLM providers and over fifty state-of-the-art models. ReasonGraph incorporates an intuitive UI with meta reasoning method selection, configurable visualization parameters, and a modular framework that facilitates efficient extension. Our evaluation shows high parsing reliability, efficient processing, and strong usability across various downstream applications. By providing a unified visualization framework, ReasonGraph reduces cognitive load in analyzing complex reasoning paths, improves error detection in logical processes, and enables more effective development of LLM-based applications. The platform is open-source, promoting accessibility and reproducibility in LLM reasoning analysis.

Paperid: 522, https://arxiv.org/pdf/2502.16810.pdf

Abstract:
This paper develops an agentic framework that employs large language models (LLMs) to automate the generation of persuasive and grounded marketing content, using real estate listing descriptions as our focal application domain. Our method is designed to align the generated content with user preferences while highlighting useful factual attributes. This agent consists of three key modules: (1) Grounding Module, mimicking expert human behavior to predict marketable features; (2) Personalization Module, aligning content with user preferences; (3) Marketing Module, ensuring factual accuracy and the inclusion of localized features. We conduct systematic human-subject experiments in the domain of real estate marketing, with a focus group of potential house buyers. The results demonstrate that marketing descriptions generated by our approach are preferred over those written by human experts by a clear margin while maintaining the same level of factual accuracy. Our findings suggest a promising agentic approach to automate large-scale targeted marketing while ensuring factuality of content generation.

Paperid: 523, https://arxiv.org/pdf/2502.11399.pdf

Abstract:
Creating new fonts requires a lot of human effort and professional typographic knowledge. Despite the rapid advancements of automatic font generation models, existing methods require users to prepare pre-designed characters with target styles using font-editing software, which poses a problem for non-expert users. To address this limitation, we propose FontCraft, a system that enables font generation without relying on pre-designed characters. Our approach integrates the exploration of a font-style latent space with human-in-the-loop preferential Bayesian optimization and multimodal references, facilitating efficient exploration and enhancing user control. Moreover, FontCraft allows users to revisit previous designs, retracting their earlier choices in the preferential Bayesian optimization process. Once users finish editing the style of a selected character, they can propagate it to the remaining characters and further refine them as needed. The system then generates a complete outline font in OpenType format. We evaluated the effectiveness of FontCraft through a user study comparing it to a baseline interface. Results from both quantitative and qualitative evaluations demonstrate that FontCraft enables non-expert users to design fonts efficiently.

Paperid: 524, https://arxiv.org/pdf/2502.10884.pdf

Abstract:
A persistent challenge in accessible computing is ensuring developers produce web UI code that supports assistive technologies. Despite numerous specialized accessibility tools, novice developers often remain unaware of them, leading to ~96% of web pages that contain accessibility violations. AI coding assistants, such as GitHub Copilot, could offer potential by generating accessibility-compliant code, but their impact remains uncertain. Our formative study with 16 developers without accessibility training revealed three key issues in AI-assisted coding: failure to prompt AI for accessibility, omitting crucial manual steps like replacing placeholder attributes, and the inability to verify compliance. To address these issues, we developed CodeA11y, a GitHub Copilot Extension, that suggests accessibility-compliant code and displays manual validation reminders. We evaluated it through a controlled study with another 20 novice developers. Our findings demonstrate its effectiveness in guiding novice developers by reinforcing accessibility practices throughout interactions, representing a significant step towards integrating accessibility into AI coding assistants.

Paperid: 525, https://arxiv.org/pdf/2502.02326.pdf

Abstract:
Exploratory Data Analysis (EDA) is a routine task for data analysts, often conducted using flexible computational notebooks. During EDA, data workers process, visualize, and interpret data tables, making decisions about subsequent analysis. However, the cell-by-cell programming approach, while flexible, can lead to disorganized code, making it difficult to trace the state of data tables across cells and increasing the cognitive load on data workers. This paper introduces NoteFlow, a notebook library that recommends charts as "sight glasses" for data tables, allowing users to monitor their dynamic updates throughout the EDA process. To ensure visual consistency and effectiveness, NoteFlow adapts chart encodings in response to data transformations, maintaining a coherent and insightful representation of the data. The proposed method was evaluated through user studies, demonstrating its ability to provide an overview of the EDA process and convey critical insights in the data tables.

Paperid: 526, https://arxiv.org/pdf/2501.17768.pdf

Abstract:
Virtual Reality (VR) offers a unique collaborative experience, with parallel views playing a pivotal role in Collaborative Virtual Environments by supporting the transfer and delivery of items. Sharing and manipulating partners' views provides users with a broader perspective that helps them identify the targets and partner actions. We proposed TeamPortal accordingly and conducted two user studies with 72 participants (36 pairs) to investigate the potential benefits of interactive, shared perspectives in VR collaboration. Our first study compared ShaView and TeamPortal against a baseline in a collaborative task that encompassed a series of searching and manipulation tasks. The results show that TeamPortal significantly reduced movement and increased collaborative efficiency and social presence in complex tasks. Following the results, the second study evaluated three variants: TeamPortal+, SnapTeamPortal+, and DropTeamPortal+. The results show that both SnapTeamPortal+ and DropTeamPortal+ improved task efficiency and willingness to further adopt these technologies, though SnapTeamPortal+ reduced co-presence. Based on the findings, we proposed three design implications to inform the development of future VR collaboration systems.

Paperid: 527, https://arxiv.org/pdf/2501.17310.pdf

Abstract:
Guesstimation -- the task of making approximate quantitative estimates about objects or events -- is a common real-world skill, yet remains underexplored in large language model (LLM) research. We introduce three guesstimation datasets: MARBLES, FUTURE, and ELECPRED, spanning physical estimation (e.g., how many marbles fit in a cup) to abstract predictions (e.g., the 2024 U.S. presidential election). Inspired by the social science concept of Wisdom of Crowds (WOC) -- where the median of multiple estimates improves accuracy -- we propose WOC decoding for LLMs. We replicate WOC effects in human participants and find that LLMs exhibit similar benefits: median aggregation across sampled responses consistently improves accuracy over greedy decoding, self-consistency decoding, and mean decoding. This suggests that LLMs encode a world model that supports approximate reasoning. Our results position guesstimation as a useful probe of LLM world knowledge and highlight WOC decoding as a strategy for enhancing LLM guesstimation performance on real-world tasks.
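The median aggregation described above can be illustrated with a minimal sketch. The function name `woc_decode` and the sample values are hypothetical and are not taken from the paper; the sketch only assumes that several numeric answers have already been sampled from a model.

```python
from statistics import mean, median


def woc_decode(estimates):
    """Aggregate sampled numeric estimates by taking their median."""
    if not estimates:
        raise ValueError("need at least one estimate")
    return median(estimates)


# Hypothetical sampled answers to "how many marbles fit in a cup?"
samples = [180, 220, 200, 210, 5000]

print("median:", woc_decode(samples))  # robust to the single outlier
print("mean:", mean(samples))          # pulled toward the outlier
```

The comparison with the mean shows why the median is the more robust choice when a few sampled responses are far off.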

Paperid: 528, https://arxiv.org/pdf/2501.13964.pdf

Abstract:
Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
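For readers unfamiliar with the metric, the True Positive Rate is the share of actual positives that are correctly identified, TP / (TP + FN). A minimal sketch follows; the counts in the example are hypothetical and are not the paper's data.

```python
def true_positive_rate(true_positives, false_negatives):
    """Return TP / (TP + FN), the fraction of actual positives detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no positive examples to evaluate")
    return true_positives / total


# Hypothetical counts: 93 AR scenes correctly flagged, 7 missed.
print(true_positive_rate(93, 7))
```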

Paperid: 529, https://arxiv.org/pdf/2501.12326.pdf

Abstract:
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, which leverages a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, and milestone recognition; (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

Paperid: 530, https://arxiv.org/pdf/2501.01598.pdf

Abstract:
A wide range of user perception applications leverage inertial measurement unit (IMU) data for online prediction. However, restricted by the non-i.i.d. nature of IMU data collected from mobile devices, most systems work well only in a controlled setting (e.g., for a specific user in particular postures), limiting application scenarios. Achieving uncontrolled online prediction on mobile devices, referred to as the flexible user perception (FUP) problem, is attractive but hard. In this paper, we propose a novel scheme, called Prism, which can obtain high FUP accuracy on mobile devices. The core of Prism is to discover task-aware domains embedded in the IMU dataset, and to train a domain-aware model on each identified domain. To this end, we design an expectation-maximization (EM) algorithm to estimate latent domains with respect to the specific downstream perception task. Finally, the best-fit model can be automatically selected for use by comparing the test sample and all identified domains in the feature space. We implement Prism on various mobile devices and conduct extensive experiments. Results demonstrate that Prism can achieve the best FUP performance with low latency.
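The final selection step, comparing a test sample against the identified domains in feature space, might look like the sketch below. The abstract does not say how the comparison is made, so representing each domain by a centroid and using Euclidean distance are assumptions, and `select_domain` is a hypothetical name rather than part of Prism.

```python
import math


def select_domain(sample_features, domain_centroids):
    """Return the index of the domain whose centroid is nearest to the sample.

    Assumption: each domain is summarized by one centroid vector and
    closeness is measured with Euclidean distance.
    """
    if not domain_centroids:
        raise ValueError("need at least one domain")
    distances = [math.dist(sample_features, c) for c in domain_centroids]
    return distances.index(min(distances))


# Hypothetical 2-D feature space with three identified domains.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
chosen = select_domain((4.2, 4.9), centroids)
print("use the model trained on domain", chosen)
```

The returned index would then pick the domain-aware model to run on that sample.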

Paperid: 531, https://arxiv.org/pdf/2506.14670.pdf

Abstract:
Traditionally, neighborhood studies have employed interviews, surveys, and manual image annotation guided by detailed protocols to identify environmental characteristics, including physical disorder, decay, street safety, and sociocultural symbols, and to examine their impact on developmental and health outcomes. While these methods yield rich insights, they are time-consuming and require intensive expert intervention. Recent technological advances, including vision-language models (VLMs), have begun to automate parts of this process; however, existing efforts are often ad hoc and lack adaptability across research designs and geographic contexts. In this demo paper, we present StreetLens, a human-centered, researcher-configurable workflow that embeds relevant social science expertise in a VLM for scalable neighborhood environmental assessments. StreetLens mimics the process of trained human coders by grounding the analysis in questions derived from established interview protocols, retrieving relevant street view imagery (SVI), and generating a wide spectrum of semantic annotations from objective features (e.g., the number of cars) to subjective perceptions (e.g., the sense of disorder in an image). By enabling researchers to define the VLM's role through domain-informed prompting, StreetLens places domain knowledge at the core of the analysis process. It also supports the integration of prior survey data to enhance robustness and expand the range of characteristics assessed across diverse settings. We provide a Google Colab notebook to make StreetLens accessible and extensible for researchers working with public or custom SVI datasets. StreetLens represents a shift toward flexible, agentic AI systems that work closely with researchers to accelerate and scale neighborhood studies.

Paperid: 532, https://arxiv.org/pdf/2506.14611.pdf

Abstract:
In this paper, we test whether Multimodal Large Language Models (MLLMs) can match human-subject performance in tasks involving the perception of properties in network layouts. Specifically, we replicate a human-subject experiment about perceiving quality (namely stress) in network layouts using GPT-4o and Gemini-2.5. Our experiments show that giving MLLMs exactly the same study information as trained human participants results in a similar performance to human experts and exceeds the performance of untrained non-experts. Additionally, we show that prompt engineering that deviates from the human-subject experiment can lead to better-than-human performance in some settings. Interestingly, like human subjects, the MLLMs seem to rely on visual proxies rather than computing the actual value of stress, indicating some sense or facsimile of perception. Explanations from the models provide descriptions similar to those used by the human participants (e.g., even distribution of nodes and uniform edge lengths).
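The abstract does not define stress, so as background, here is a minimal sketch of one common formulation from the graph drawing literature: the sum over node pairs of the squared difference between drawn distance and graph-theoretic distance, weighted by the inverse square of the graph distance. Whether the study uses exactly this variant is an assumption, and the function and variable names are hypothetical.

```python
import math
from itertools import combinations


def layout_stress(positions, graph_distances):
    """Compute weighted stress of a 2-D layout.

    positions: dict mapping node -> (x, y)
    graph_distances: dict mapping (u, v) with u < v -> shortest-path distance
    """
    total = 0.0
    for u, v in combinations(sorted(positions), 2):
        d = graph_distances[(u, v)]
        drawn = math.dist(positions[u], positions[v])
        total += (drawn - d) ** 2 / d ** 2
    return total


# Path graph a - b - c with unit-length edges.
dist = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 2}
straight = {"a": (0, 0), "b": (1, 0), "c": (2, 0)}
bent = {"a": (0, 0), "b": (1, 0), "c": (1, 1)}

print(layout_stress(straight, dist))  # drawn distances match graph distances
print(layout_stress(bent, dist))      # a and c are drawn too close together
```

Lower stress means the drawn distances better reflect the graph distances.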

Paperid: 533, https://arxiv.org/pdf/2506.14159.pdf

Abstract:
Every individual carries a unique and personal life story shaped by their memories and experiences. However, these memories are often scattered and difficult to organize into a coherent narrative, a challenge that defines the task of autobiography writing. Existing conversational writing assistants tend to rely on generic user interactions and pre-defined guidelines, making it difficult for these systems to capture personal memories and develop a complete biography over time. We introduce StorySage, a user-driven software system designed to meet the needs of a diverse group of users that supports a flexible conversation and a structured approach to autobiography writing. Powered by a multi-agent framework composed of an Interviewer, Session Scribe, Planner, Section Writer, and Session Coordinator, our system iteratively collects user memories, updates their autobiography, and plans for future conversations. In experimental simulations, StorySage demonstrates its ability to navigate multiple sessions and capture user memories across many conversations. User studies (N=28) highlight how StorySage maintains improved conversational flow, narrative completeness, and higher user satisfaction when compared to a baseline. In summary, StorySage contributes both a novel architecture for autobiography writing and insights into how multi-agent systems can enhance human-AI creative partnerships.

Paperid: 534, https://arxiv.org/pdf/2506.06874.pdf

Abstract:
There is growing interest in understanding how people interact with large language models (LLMs) and whether such models elicit dependency or even addictive behaviour. Validated tools to assess the extent to which individuals may become dependent on LLMs are scarce and primarily build on classic behavioral addiction symptoms, adapted to the context of LLM use. We view this as a conceptual limitation, as the LLM-human relationship is more nuanced and warrants a fresh and distinct perspective. To address this gap, we developed and validated a new 12-item questionnaire to measure LLM dependency, referred to as LLM-D12. The scale was based on the authors' prior theoretical work, with items developed accordingly and responses collected from 526 participants in the UK. Exploratory and confirmatory factor analyses, performed on separate halves of the total sample using a split-sample approach, supported a two-factor structure: Instrumental Dependency (six items) and Relationship Dependency (six items). Instrumental Dependency reflects the extent to which individuals rely on LLMs to support or collaborate in decision-making and cognitive tasks. Relationship Dependency captures the tendency to perceive LLMs as socially meaningful, sentient, or companion-like entities. The two-factor structure demonstrated excellent internal consistency and clear discriminant validity. External validation confirmed both the conceptual foundation and the distinction between the two subscales. The psychometric properties and structure of our LLM-D12 scale were interpreted in light of the emerging view that dependency on LLMs does not necessarily indicate dysfunction but may still reflect reliance levels that could become problematic in certain contexts.

Paperid: 535, https://arxiv.org/pdf/2506.01982.pdf

Abstract:
This study investigates emotional expression and perception in music performance using computational and neurophysiological methods. The influence of different performance settings, such as repertoire, diatonic modal etudes, and improvisation, as well as levels of expressiveness, on performers' emotional communication and listeners' reactions is explored. Professional musicians performed various tasks, and emotional annotations were provided by both performers and the audience. Audio analysis revealed that expressive and improvisational performances exhibited unique acoustic features, while emotion analysis showed stronger emotional responses. Neurophysiological measurements indicated greater relaxation in improvisational performances. This multimodal study highlights the significance of expressivity in enhancing emotional communication and audience engagement.

Paperid: 536, https://arxiv.org/pdf/2505.22981.pdf

Abstract:
We demonstrate the potential of anthropomorphized language agents to generate budget-friendly, moderate-fidelity, yet sufficiently insightful user experiences at scale, supporting fast, early-stage prototyping. We explore this through the case of prototyping Large Language Model-driven non-player characters (NPCs). We present Agentic H-CI, a framework that mirrors traditional user research processes -- surveying, screening, experiencing, and collecting feedback and insights -- with simulated agents. Using this approach, we easily construct a team of 240 player agents with a balanced range of player types and personality traits, at extremely low cost ($0.28/player) and minimal time commitment (6.9 minutes/player). Content analysis shows that agent-based players behave in ways aligned with their simulated backgrounds, achieving 82.5% alignment with designated profiles. From their interactions, we distill 11 user insights and 6 design implications to guide further development. To evaluate practical value, we conduct parallel user studies with human participants recruited locally and via crowdsourcing. Ratings from three professional game developers show that the agentic player team offers a Pareto-optimal and well-balanced trade-off across fidelity, cost, time efficiency, and insight helpfulness.

Paperid: 537, https://arxiv.org/pdf/2505.20924.pdf

Abstract:
While prior work has shown that Federated Learning updates can leak sensitive information, label reconstruction attacks, which aim to recover input labels from shared gradients, have not yet been examined in the context of Human Activity Recognition (HAR). Given the sensitive nature of activity labels, this study evaluates the effectiveness of state-of-the-art gradient-based label leakage attacks on HAR benchmark datasets. Our findings show that the number of activity classes, sampling strategy, and class imbalance are critical factors influencing the extent of label leakage, with reconstruction accuracies reaching well above 90% on two benchmark datasets, even for trained models. Moreover, we find that Local Differential Privacy techniques such as gradient noise and clipping offer only limited protection, as certain attacks still reliably infer both majority and minority class labels. We conclude by offering practical recommendations for the privacy-aware deployment of federated HAR systems and identify open challenges for future research. Code to reproduce our experiments is publicly available via github.com/mariusbock/leakage_har.
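To show why shared gradients can reveal labels at all, the sketch below uses a well-known property of softmax cross-entropy: for a single sample, the gradient with respect to the output-layer bias equals the predicted probabilities minus the one-hot label, so only the true class has a negative entry. This is a simplified single-sample illustration, not the batch-level attacks evaluated in the paper, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def bias_gradient(logits, label):
    """Gradient of cross-entropy loss w.r.t. the output bias for one sample."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


def infer_label(gradient):
    """Recover the label as the index of the most negative gradient entry."""
    return min(range(len(gradient)), key=gradient.__getitem__)


# Hypothetical logits for a 4-class activity classifier; true label is 2.
shared = bias_gradient([2.0, -1.0, 0.5, 3.0], label=2)
print(infer_label(shared))
```

With larger batches the gradients of many samples are summed, which is where factors such as class imbalance and sampling strategy start to matter.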

Paperid: 538, https://arxiv.org/pdf/2505.20894.pdf

Abstract:
Despite recognized limitations in modeling long-range temporal dependencies, Human Activity Recognition (HAR) has traditionally relied on a sliding window approach to segment labeled datasets. Deep learning models like the DeepConvLSTM typically classify each window independently, thereby restricting learnable temporal context to within-window information. To address this constraint, we propose DeepConvContext, a multi-scale time series classification framework for HAR. Drawing inspiration from the vision-based Temporal Action Localization community, DeepConvContext models both intra- and inter-window temporal patterns by processing sequences of time-ordered windows. Unlike recent HAR models that incorporate attention mechanisms, DeepConvContext relies solely on LSTMs -- with ablation studies demonstrating the superior performance of LSTMs over attention-based variants for modeling inertial sensor data. Across six widely-used HAR benchmarks, DeepConvContext achieves an average 10% improvement in F1-score over the classic DeepConvLSTM, with gains of up to 21%. Code to reproduce our experiments is publicly available via github.com/mariusbock/context_har.
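The sliding window segmentation mentioned above can be sketched as follows. The window size and stride are hypothetical values chosen for illustration, and the function is not taken from the authors' repository.

```python
def sliding_windows(samples, window_size, stride):
    """Split a time-ordered sequence into fixed-size, possibly overlapping windows.

    Trailing samples that do not fill a complete window are dropped.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = len(samples) - window_size
    return [samples[start:start + window_size]
            for start in range(0, last_start + 1, stride)]


# Ten sensor readings, windows of 4 samples with 50% overlap.
readings = list(range(10))
windows = sliding_windows(readings, window_size=4, stride=2)
for w in windows:
    print(w)
```

A classifier that sees one window at a time has no access to samples outside it, which is the limitation the abstract describes; keeping the windows in time order is what allows a model to be given a sequence of them.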

Paperid: 539, https://arxiv.org/pdf/2505.19652.pdf

Abstract:
Speech disorders such as dysarthria and anarthria can severely impair the patient's ability to communicate verbally. Speech decoding brain-computer interfaces (BCIs) offer a potential alternative by directly translating speech intentions into spoken words, serving as speech neuroprostheses. This paper reports an experimental protocol for Mandarin Chinese speech decoding BCIs, along with the corresponding decoding algorithms. Stereo-electroencephalography (SEEG) and synchronized audio data were collected from eight drug-resistant epilepsy patients as they conducted a word-level reading task. The proposed SEEG and Audio Contrastive Matching (SACM), a contrastive learning-based framework, achieved decoding accuracies significantly exceeding chance levels in both speech detection and speech decoding tasks. Electrode-wise analysis revealed that a single sensorimotor cortex electrode achieved performance comparable to that of the full electrode array. These findings provide valuable insights for developing more accurate online speech decoding BCIs.

Paperid: 540, https://arxiv.org/pdf/2505.17041.pdf

Abstract:
Generative Artificial Intelligence is transforming how English as a foreign language (EFL) students write. Still, little is known about how students manipulate text generated by generative AI during the writing process. This study investigates how EFL secondary school students integrate and modify AI-generated text when completing an expository writing task. The study employed an exploratory mixed-methods design. Screen recordings were collected from 29 Hong Kong secondary school students who attended an AI-assisted writing workshop and recorded their screens while using generative AI to write an article. Content analysis with hierarchical coding and thematic analysis with a multiple case study approach were adopted to analyze the recordings. Fifteen types of AI-generated text edits across seven categories were identified from the recordings. Notably, AI-initiated edits from iOS and Google Docs emerged as unanticipated sources of AI-generated text. A thematic analysis revealed four patterns of students' editing behaviors based on planning and drafting direction: planning with top-down drafting and revising; top-down drafting and revising without planning; planning with bottom-up drafting and revising; and bottom-up drafting and revising without planning. Network graphs illustrate cases of each pattern, demonstrating that students' interactions with AI-generated text involve more complex cognitive processes than simple text insertion. The findings challenge assumptions about students' passive, simplistic use of generative AI tools and have implications for developing explicit instructional approaches to teaching AI-generated text editing strategies in EFL writing pedagogy.

Paperid: 541, https://arxiv.org/pdf/2505.16730.pdf

Abstract:
Misinformation poses significant risks to public opinion, health, and security. While most fake news detection methods rely on text analysis, little is known about how people physically respond to false information or repeated exposure to the same statements. This study investigates whether wearable sensors can detect belief in a statement or prior exposure to it. We conducted a controlled experiment where participants evaluated statements while wearing an EmotiBit sensor that measured their skin conductance (electrodermal activity, EDA) and peripheral blood flow (photoplethysmography, PPG). From 28 participants, we collected a dataset of 672 trials, each labeled with whether the participant believed the statement and whether they had seen it before. This dataset introduces a new resource for studying physiological responses to misinformation. Using machine learning models, including KNN, CNN, and LightGBM, we analyzed these physiological patterns. The best-performing model achieved 67.83% accuracy, with skin conductance outperforming PPG. These findings demonstrate the potential of wearable sensors as a minimally intrusive tool for detecting belief and prior exposure, offering new directions for real-time misinformation detection and adaptive, user-aware systems.

Paperid: 542, https://arxiv.org/pdf/2505.16702.pdf

Abstract:
Understanding how individuals physiologically respond to false information is crucial for advancing misinformation detection systems. This study explores the potential of using physiological signals, specifically electrodermal activity (EDA) and photoplethysmography (PPG), to classify both the veracity of information and its interaction with user belief. In a controlled laboratory experiment, we collected EDA and PPG signals while participants evaluated the truthfulness of climate-related claims. Each trial was labeled based on the objective truth of the claim and the participant's belief, enabling two classification tasks: binary veracity detection and a novel four-class joint belief-veracity classification. We extracted handcrafted features from the raw signals and trained several machine learning models to benchmark the dataset. Our results show that EDA outperforms PPG, indicating its greater sensitivity to physiological responses related to truth perception. However, performance significantly drops in the joint belief-veracity classification task, highlighting the complexity of modeling the interaction between belief and truth. These findings suggest that while physiological signals can reflect basic truth perception, accurately modeling the intricate relationships between belief and veracity remains a significant challenge. This study emphasizes the importance of multimodal approaches that incorporate psychological, physiological, and cognitive factors to improve fake news detection systems. Our work provides a foundation for future research aimed at enhancing misinformation detection via addressing the complexities of human belief and truth processing.
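The four-class joint task combines two binary attributes per trial: whether the claim is objectively true and whether the participant believed it. One way to encode that is sketched below; the class numbering and names are assumptions made for illustration, since the abstract does not specify an encoding.

```python
def joint_label(claim_is_true, participant_believes):
    """Map (veracity, belief) to one of four class indices, 0 through 3."""
    return 2 * int(claim_is_true) + int(participant_believes)


# Hypothetical class names for the four combinations.
CLASS_NAMES = {
    0: "false claim, not believed",
    1: "false claim, believed",
    2: "true claim, not believed",
    3: "true claim, believed",
}

for veracity in (False, True):
    for belief in (False, True):
        index = joint_label(veracity, belief)
        print(index, CLASS_NAMES[index])
```

The binary veracity task corresponds to collapsing classes 0 and 1 against classes 2 and 3.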

Paperid: 543, https://arxiv.org/pdf/2505.06134.pdf

Abstract:
Trajectory prediction is a key element of autonomous vehicle systems, enabling them to anticipate and react to the movements of other road users. Evaluating the robustness of prediction models against adversarial attacks is essential to ensure their reliability in real-world traffic. However, current approaches tend to focus on perturbing the past positions of surrounding agents, which can generate unrealistic scenarios and overlook critical vulnerabilities. This limitation may result in overly optimistic assessments of model performance in real-world conditions. In this work, we demonstrate that perturbing not just past but also future states of adversarial agents can uncover previously undetected weaknesses and thereby provide a more rigorous evaluation of model robustness. Our novel approach incorporates dynamic constraints and preserves tactical behaviors, enabling more effective and realistic adversarial attacks. We introduce new performance measures to assess the realism and impact of these adversarial trajectories. Testing our method on a state-of-the-art prediction model revealed significant increases in prediction errors and collision rates under adversarial conditions. Qualitative analysis further showed that our attacks can expose critical weaknesses, such as the inability of the model to detect potential collisions in what appear to be safe predictions. These results underscore the need for more comprehensive adversarial testing to better evaluate and improve the reliability of trajectory prediction models for autonomous vehicles.

Paperid: 544, https://arxiv.org/pdf/2504.16373.pdf

Abstract:
Mixed Reality (MR) enables rich, embodied collaboration, yet it's uncertain if sensor and system-logged behavioral signals capture how users experience that collaboration. This disconnect stems from a fundamental gap: behavioral signals are observable and continuous, while collaboration is interpreted subjectively, shaped by internal states like presence, cognitive availability, and social awareness. Our core insight is that sensor signals serve as observable manifestations of subjective experiences in MR collaboration, and they can be captured through sensor data such as shared gaze, speech, spatial movement, and other system-logged performance metrics. We propose the Sensor-to-Subjective (S2S) Mapping Framework, a conceptual model that links observable interaction patterns to users' subjective perceptions of collaboration and internal cognitive states through sensor-based indicators and task performance metrics. To validate this model, we conducted a study with 48 participants across 12 MR groups engaged in a collaborative image-sorting task. Our findings show a correlation between sensed behavior and perceived collaboration, particularly through shared attention and proximity.

Paperid: 545, https://arxiv.org/pdf/2504.13864.pdf

Abstract:
Researchers in pervasive computing have worked for decades on sensor-based human activity recognition (HAR). Among the digital health applications, the recognition of activities of daily living (ADL) in smart home environments enables the identification of behavioral changes that clinicians consider as a digital bio-marker of early stages of cognitive decline. The real deployment of sensor-based HAR systems in the homes of elderly subjects poses several challenges, with privacy and ethical concerns being major ones. This paper reports our experience applying privacy by design principles to develop and deploy one of these systems.

Paperid: 546, https://arxiv.org/pdf/2504.01201.pdf

Abstract:
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, has the potential to introduce additional noise, making it crucial to assess the ability of LLMs to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not change this effect and in some cases introduced their own confounders and further degraded performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.

Paperid: 547, https://arxiv.org/pdf/2503.24180.pdf

Abstract:
Graphical user interface (GUI) automation agents are emerging as powerful tools, enabling humans to accomplish increasingly complex tasks on smart devices. However, users often inadvertently omit key information when conveying tasks, which hinders agent performance in the current agent paradigm that does not support immediate user intervention. To address this issue, we introduce a $\textbf{Self-Correction GUI Navigation}$ task that incorporates interactive information completion capabilities within GUI agents. We developed the $\textbf{Navi-plus}$ dataset with GUI follow-up question-answer pairs, alongside a $\textbf{Dual-Stream Trajectory Evaluation}$ method to benchmark this new capability. Our results show that agents equipped with the ability to ask GUI follow-up questions can fully recover their performance when faced with ambiguous user tasks.

Paperid: 548, https://arxiv.org/pdf/2503.16554.pdf

Abstract:
As narrative extraction systems grow in complexity, establishing user trust through interpretable and explainable outputs becomes increasingly critical. This paper presents an evaluation of an Explainable Artificial Intelligence (XAI) system for narrative map extraction that provides meaningful explanations across multiple levels of abstraction. Our system integrates explanations based on topical clusters for low-level document relationships, connection explanations for event relationships, and high-level structure explanations for overall narrative patterns. In particular, we evaluate the XAI system through a user study involving 10 participants who examined narratives from the 2021 Cuban protests. The analysis of results demonstrates that the explanations led participants to trust the system's decisions, with connection explanations and important event detection proving particularly effective at building user confidence. Survey responses indicate that the multi-level explanation approach helped users develop appropriate trust in the system's narrative extraction capabilities. This work advances the state-of-the-art in explainable narrative extraction while providing practical insights for developing reliable narrative extraction systems that support effective human-AI collaboration.

Paperid: 549, https://arxiv.org/pdf/2503.16492.pdf

Abstract:
Effective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gestures or language commands, making interaction inefficient and ambiguous, particularly for users with physical impairments. In this paper, we introduce FAM-HRI, an efficient multi-modal framework for human-robot interaction that integrates language and gaze inputs via foundation models. By leveraging lightweight Meta ARIA glasses, our system captures real-time multi-modal signals and utilizes large language models (LLMs) to fuse user intention with scene context, enabling intuitive and precise robot manipulation. Our method accurately determines the gaze fixation time interval, reducing noise caused by the dynamic nature of gaze. Experimental evaluations demonstrate that FAM-HRI achieves a high success rate in task execution while maintaining a low interaction time, providing a practical solution for individuals with limited physical mobility or motor impairments.

Paperid: 550, https://arxiv.org/pdf/2503.10254.pdf

Abstract:
In the rapidly evolving world of software development, the surge in developers' reliance on AI-driven tools has transformed Integrated Development Environments into powerhouses of advanced features. This transformation, while boosting developers' productivity to unprecedented levels, comes with a catch: increased hardware demands for software development. Moreover, the significant economic and environmental toll of using these sophisticated models necessitates mechanisms that reduce unnecessary computational burdens. We propose HyperSeq - Hyper-Adaptive Representation for Predictive Sequencing of States - a novel, resource-efficient approach designed to model developers' cognitive states. HyperSeq facilitates precise action sequencing and enables real-time learning of user behavior. Our preliminary results show how HyperSeq excels in forecasting action sequences and achieves remarkable prediction accuracies that go beyond 70%. Notably, the model's online-learning capability allows it to substantially enhance its predictive accuracy in a majority of cases and increases its capability in forecasting next user actions with sufficient iterations for adaptation. Ultimately, our objective is to harness these predictions to refine and elevate the user experience dynamically within the IDE.

Paperid: 551, https://arxiv.org/pdf/2503.09874.pdf

Abstract:
Studying collaborative behavior in Mixed Reality (MR) often requires extensive, challenging data collection. This paper introduces MoCoMR, a novel simulator designed to address this by generating synthetic yet realistic collaborative MR data. MoCoMR captures individual behavioral modalities such as speaking, gaze, and locomotion during a collaborative image-sorting task with 48 participants to identify distinct behavioral patterns. MoCoMR simulates individual actions and interactions within a virtual space, enabling researchers to investigate the impact of individual behaviors on group dynamics and task performance. This simulator facilitates the development of more effective and human-centered MR applications by providing insights into user behavior and interaction patterns. The simulator's API allows for flexible configuration and data analysis, enabling researchers to explore various scenarios and generate valuable insights for optimizing collaborative MR experiences.

Paperid: 552, https://arxiv.org/pdf/2503.07320.pdf

Abstract:
As large language models (LLMs) become increasingly capable of autonomous decision-making, they introduce new challenges and opportunities for human-AI cooperation in mixed-motive contexts. While prior research has primarily examined AI in assistive or cooperative roles, little is known about how humans interact with AI agents perceived as independent and strategic actors. This study investigates human cooperative attitudes and behaviors toward LLM agents by engaging 30 participants (15 males, 15 females) in repeated Prisoner's Dilemma games with agents differing in declared identity: purported human, rule-based AI, and LLM agent. Behavioral metrics, including cooperation rate, decision latency, unsolicited cooperative acts, and trust restoration tolerance, were analyzed to assess the influence of agent identity and participant gender. Results revealed significant effects of declared agent identity on most cooperation-related behaviors, along with notable gender differences in decision latency. Furthermore, qualitative responses suggest that these behavioral differences were shaped by participants' interpretations and expectations of the agents. These findings contribute to our understanding of human adaptation in competitive cooperation with autonomous agents and underscore the importance of agent framing in shaping effective and ethical human-AI interaction.

Paperid: 553, https://arxiv.org/pdf/2503.05349.pdf

Abstract:
A non-invasive brain-computer interface (BCI) enables direct interaction between the user and external devices, typically via electroencephalogram (EEG) signals. However, decoding EEG signals across different headsets remains a significant challenge due to differences in the number and locations of the electrodes. To address this challenge, we propose a spatial distillation based distribution alignment (SDDA) approach for heterogeneous cross-headset transfer in non-invasive BCIs. SDDA first uses spatial distillation to make use of the full set of electrodes, and then applies input/feature/output space distribution alignments to cope with the significant differences between the source and target domains. To our knowledge, this is the first work to use knowledge distillation in cross-headset transfers. Extensive experiments on six EEG datasets from two BCI paradigms demonstrated that SDDA achieved superior performance in both offline unsupervised domain adaptation and online supervised domain adaptation scenarios, consistently outperforming 10 classical and state-of-the-art transfer learning algorithms.

Paperid: 554, https://arxiv.org/pdf/2503.02003.pdf

Abstract:
An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to base their decisions on accurately. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on a wide range of 17 tasks ranging from arithmetic and reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.
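The tagging idea can be illustrated with a minimal sketch that extracts highlighted spans from a tagged question and a tagged answer, then lists answer highlights that have no counterpart in the question. The `<fact1>`-style tag names and both helper functions are assumptions made for illustration; the abstract states only that XML tags are used.

```python
import re

# Assumed tag format: <factN>...</factN>. The abstract only says "XML tags".
TAG_RE = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)


def extract_highlights(text):
    """Return a dict mapping each tag name to the list of spans it wraps."""
    highlights = {}
    for tag, span in TAG_RE.findall(text):
        highlights.setdefault(tag, []).append(span.strip())
    return highlights


def ungrounded_tags(question, answer):
    """Return tag names used in the answer that never appear in the question."""
    question_tags = extract_highlights(question)
    answer_tags = extract_highlights(answer)
    return sorted(tag for tag in answer_tags if tag not in question_tags)


question = "Tom has <fact1>3 apples</fact1> and buys <fact2>2 more</fact2>."
answer = ("He starts with <fact1>3 apples</fact1>, adds <fact2>2</fact2>, "
          "and <fact3>eats one</fact3>.")
print(extract_highlights(question))
print(ungrounded_tags(question, answer))
```

A check like this only confirms that tags line up; it says nothing about whether the highlighted content is actually correct.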

Paperid: 555, https://arxiv.org/pdf/2502.19546.pdf

Authors: Anton Alyakin, Jaden Stryker, Daniel Alexander Alber, Karl L. Sangwon, Jin Vivian Lee, Brandon Duderstadt, Akshay Save, David Kurland, Spencer Frome, Shrutika Singh, Jeff Zhang, Eunice Yang, Ki Yun Park, Cordelia Orillac, Aly A. Valliani, Sean Neifert, Albert Liu, Aneek Patel, Christopher Livia, Darryl Lau, Ilya Laufer, Peter A. Rozman, Eveline Teresa Hidalgo, Howard Riina, Rui Feng, Todd Hollon, Yindalon Aphinyanaphongs, John G. Golfinos, Laura Snyder, Eric Leuthardt, Douglas Kondziolka, Eric Karl Oermann

Abstract:
General-purpose vision-language models (VLMs) demonstrate impressive capabilities, but their opaque training on uncurated internet data poses critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed neurosurgical literature, and demonstrate its clinical utility compared with GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these image-text pairs into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter LLaVA-Next model. In a blinded, randomized deployment trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgeons were assigned to use either CNS-Obsidian or GPT-4o as a diagnostic co-pilot after patient consultations. Primary outcomes were diagnostic helpfulness and accuracy. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o's 65.70% (p<10^-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults. CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included the correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance in specialized medical domains despite being orders of magnitude smaller and less expensive to train. However, low clinical utilization suggests chatbot interfaces may not align with specialist workflows, indicating a need for alternative AI integration strategies.

Paperid: 556, https://arxiv.org/pdf/2502.18529.pdf

Abstract:
The past few years have witnessed a rapid growth of the deployment of automated vehicles (AVs). Clearly, AVs and human-driven vehicles (HVs) will co-exist for many years, and AVs will have to operate around HVs, pedestrians, cyclists, and more, calling for fundamental breakthroughs in AI designed for mixed traffic to achieve mixed autonomy. Thus motivated, we study heterogeneous decision making by AVs and HVs in a mixed traffic environment, aiming to capture the interactions between human and machine decision-making and develop an AI foundation that enables vehicles to operate safely and efficiently. There are a number of challenges to achieving mixed autonomy, including 1) human drivers make driving decisions with bounded rationality, and it remains open to develop accurate models for HVs' decision making; and 2) uncertainty-aware planning plays a critical role for AVs to take safety maneuvers in response to human behavior. In this paper, we introduce a formulation of AV-HV interaction, where the HV makes decisions with bounded rationality and the AV employs uncertainty-aware planning based on predictions of the HV's future actions. We conduct a comprehensive analysis of the AV's and HV's learning regret to answer the questions: 1) How does the learning performance depend on the HV's bounded rationality and the AV's planning? 2) How do different decision making strategies impact the overall learning performance? Our findings reveal some intriguing phenomena, such as Goodhart's Law in the AV's learning performance and compounding effects in the HV's decision making process. By examining the dynamics of the regrets, we gain insights into the interplay between human and machine decision making.

Paperid: 557, https://arxiv.org/pdf/2502.10736.pdf

Abstract:
Social Virtual Reality (VR) emerges as a promising platform bringing immersive, interactive, and engaging mechanisms for collaborative activities in virtual spaces. However, interpersonal communication in social VR is still limited with existing mediums and channels. To bridge the gap, we propose a novel method for mediating real-time conversation in social VR, which uses impact captions, a type of typographic visual effect widely used in videos, to convey both verbal and non-verbal information. We first investigated the design space of impact captions by content analysis and a co-design session with four experts. Next, we implemented SpeechCap as a proof-of-concept system, with which users can communicate with each other using speech-driven impact captions in VR. Through a user study (n=14), we evaluated the effectiveness of the visual and interaction design of impact captions, highlighting the interactivity and the integration of verbal and non-verbal information in communication mediums. Finally, we discussed topics of visual rhetoric, interactivity, and ambiguity as the main findings from the study, and further provided design implications for future work for facilitating interpersonal communication in social VR.

Paperid: 558, https://arxiv.org/pdf/2502.09913.pdf

Abstract:
Web-based management systems have been widely used in risk control and industrial safety. However, effectively integrating source search capabilities into these systems, to enable decision-makers to locate and address the hazard (e.g., gas leak detection), remains a challenge. While prior efforts have explored using web crowdsourcing and AI algorithms for source search decision support, these approaches suffer from overheads in recruiting human participants and slow response times in time-sensitive situations. To address this, we introduce AutoS$^2$earch, a novel framework leveraging large models for zero-shot source search in web applications. AutoS$^2$earch operates on a simplified visual environment projected through a web-based display, utilizing a chain-of-thought prompt designed to emulate human reasoning. The multi-modal large language model (MLLM) dynamically converts visual observations into language descriptions, enabling the LLM to perform linguistic reasoning on four directional choices. Extensive experiments demonstrate that AutoS$^2$earch achieves performance nearly equivalent to human-AI collaborative source search while eliminating dependency on crowdsourced labor. Our work offers valuable insights into using web engineering to design such autonomous systems in other industrial applications.

Paperid: 559, https://arxiv.org/pdf/2502.02830.pdf

Abstract:
Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices. This review highlights the core decoding algorithms that enable multimodal BCIs, including a dissection of the elements, a unified view of diversified approaches, and a comprehensive analysis of the present state of the field. We emphasize algorithmic advancements in cross-modality mapping and sequential modeling, besides classic multi-modality fusion, illustrating how these novel AI approaches enhance decoding of brain data. The current literature on BCI applications in visual, speech, and affective decoding is comprehensively explored. Looking forward, we draw attention to the impact of emerging architectures like multimodal Transformers, and discuss challenges such as brain data heterogeneity and common errors. This review also serves as a bridge in this interdisciplinary field for experts with a neuroscience background and experts who study AI, aiming to provide a comprehensive understanding of AI-powered multimodal BCIs.

Paperid: 560, https://arxiv.org/pdf/2501.12573.pdf

Abstract:
Haptic technology has seen significant growth, yet a lack of awareness of existing haptic device design knowledge hinders development. This paper addresses these limitations by leveraging advancements in Large Language Models (LLMs) to develop a haptic agent, focusing specifically on the recommendation of Grounded Force Feedback (GFF) devices. Our approach involves automating the creation of a structured haptic device database using information from research papers and product specifications. This database enables the recommendation of relevant GFF devices based on user queries. To ensure precise and contextually relevant recommendations, the system employs a dynamic retrieval method that combines both conditional and semantic searches. Benchmarking against the established UEQ and existing haptic device searching tools, the proposed haptic recommendation agent ranks in the top 10% across all UEQ categories with mean differences favoring the agent in nearly all subscales, and maintains no significant performance bias across different user groups, showcasing superior usability and user satisfaction.

Paperid: 561, https://arxiv.org/pdf/2501.10917.pdf

Abstract:
Wearable Human Activity Recognition (WHAR) is a prominent research area within ubiquitous computing. Multi-sensor synchronous measurement has proven to be more effective for WHAR than using a single sensor. However, existing WHAR methods use shared convolutional kernels for indiscriminate temporal feature extraction across each sensor variable, which fails to effectively capture spatio-temporal relationships of intra-sensor and inter-sensor variables. We propose the DecomposeWHAR model consisting of a decomposition phase and a fusion phase to better model the relationships between modality variables. The decomposition creates high-dimensional representations of each intra-sensor variable through the improved Depth Separable Convolution to capture local temporal features while preserving their unique characteristics. The fusion phase begins by capturing relationships between intra-sensor variables and fusing their features at both the channel and variable levels. Long-range temporal dependencies are modeled using the State Space Model (SSM), and later cross-sensor interactions are dynamically captured through a self-attention mechanism, highlighting inter-sensor spatial correlations. Our model demonstrates superior performance on three widely used WHAR datasets, significantly outperforming state-of-the-art models while maintaining acceptable computational efficiency.

Paperid: 562, https://arxiv.org/pdf/2501.10582.pdf

Abstract:
Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. We fine-tune models using a large dataset of sentences we curated in which each sentence is rated according to how useful it might be for spoken or written AAC communication. We find that using an algorithm to produce character predictions from a subword large language model provides more accurate predictions than adding a classification layer or using a byte-level model. We also find that our domain adaptation procedure is effective at improving model performance on simple, conversational text.
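One simple way to picture turning subword predictions into character predictions is to sum the probability of every candidate next token that starts with a given character. The sketch below shows only that marginalization step on a toy distribution. It is a simplified illustration under stated assumptions, not the algorithm evaluated in the paper: it ignores contexts that end in the middle of a token, and the function name and toy probabilities are hypothetical.

```python
from collections import defaultdict


def next_char_distribution(token_probs):
    """Marginalize next-token probabilities onto each token's first character.

    token_probs: dict mapping candidate next tokens (strings) to probabilities.
    Returns a dict mapping characters to normalized probabilities.
    """
    char_probs = defaultdict(float)
    for token, prob in token_probs.items():
        if token:  # empty strings carry no first character
            char_probs[token[0]] += prob
    total = sum(char_probs.values())
    if total == 0:
        return {}
    return {char: prob / total for char, prob in char_probs.items()}


# Toy next-token distribution (hypothetical values).
toy = {"the": 0.5, "to": 0.25, "a": 0.25}
print(next_char_distribution(toy))
```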

Paperid: 563, https://arxiv.org/pdf/2501.10568.pdf

Abstract:
We explore a method for presenting word suggestions for non-visual text input using simultaneous voices. We conduct two perceptual studies and investigate the impact of different presentations of voices on a user's ability to detect which voice, if any, spoke their desired word. Our sets of words simulated the word suggestions of a predictive keyboard during real-world text input. We find that when voices are simultaneous, user accuracy decreases significantly with each added word suggestion. However, adding a slight 0.15 s delay between the start of each subsequent word allows two simultaneous words to be presented with no significant decrease in accuracy compared to presenting two words sequentially (84% simultaneous versus 86% sequential). This allows two word suggestions to be presented to the user 32% faster than sequential playback without decreasing accuracy.
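The timing trade-off can be sketched with a small calculation comparing back-to-back playback with playback where each word starts a fixed delay after the previous one. The 0.15 s delay comes from the abstract; the word durations, the function names, and the absence of any pause between sequential words are assumptions, so the output is not expected to reproduce the reported 32% figure.

```python
def sequential_duration(durations):
    """Total time when words are played one after another with no gap."""
    return sum(durations)


def staggered_duration(durations, onset_delay=0.15):
    """Total time when word i starts i * onset_delay seconds after the first."""
    return max(
        (index * onset_delay + duration for index, duration in enumerate(durations)),
        default=0.0,
    )


def time_saved_fraction(durations, onset_delay=0.15):
    """Fraction of the sequential playback time saved by staggered playback."""
    sequential = sequential_duration(durations)
    if sequential == 0:
        return 0.0
    return 1.0 - staggered_duration(durations, onset_delay) / sequential


# Two hypothetical 0.5 s words.
words = [0.5, 0.5]
print(sequential_duration(words), staggered_duration(words), time_saved_fraction(words))
```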

Paperid: 564, https://arxiv.org/pdf/2506.23253.pdf

Abstract:
We examine "vibe coding": an emergent programming paradigm where developers primarily write code by interacting with code-generating large language models rather than writing code directly. We analysed a curated set of videos depicting extended vibe coding sessions with rich think-aloud reflections. Using framework analysis, we investigated programmers' goals, workflows, prompting techniques, debugging approaches, and challenges encountered. We find that vibe coding follows iterative goal satisfaction cycles where developers alternate between prompting AI, evaluating generated code through rapid scanning and application testing, and manual editing. Prompting strategies blend vague, high-level directives with detailed technical specifications. Debugging remains a hybrid process combining AI assistance with manual practices. Critically, vibe coding does not eliminate the need for programming expertise but rather redistributes it toward context management, rapid code evaluation, and decisions about when to transition between AI-driven and manual manipulation of code. Trust in AI tools during vibe coding is dynamic and contextual, developed through iterative verification rather than blanket acceptance. Vibe coding is an evolution of AI-assisted programming that represents an early manifestation of "material disengagement", where practitioners orchestrate code production and manipulation, mediated through AI, while maintaining selective and strategic oversight.

Paperid: 565, https://arxiv.org/pdf/2506.22125.pdf

Abstract:
Hybrid collaboration has become a fixture in modern workplaces, yet it introduces persistent socio-technical asymmetries, especially disadvantaging remote participants, who struggle with presence disparity, reduced visibility, and limited non-verbal communication. Traditional solutions often seek to erase these asymmetries, but recent research suggests embracing them as productive design constraints. In this context, we introduce NoticeLight: a tangible, peripheral robotic embodiment designed to augment hybrid meetings. NoticeLight transforms remote participants' digital presence into ambient, physical signals -- such as mood dynamics, verbal contribution mosaics, and attention cues -- within the co-located space. By abstracting group states into subtle light patterns, NoticeLight fosters peripheral awareness and balanced participation without disrupting meeting flow or imposing cognitive overload. This approach aligns with emerging perspectives in human-robot synergy, positioning robots as mediators that reshape, rather than replicate, human presence. Our work thereby advances the discourse on how robotic embodiments can empower equitable, dynamic collaboration in the workplace.

Paperid: 566, https://arxiv.org/pdf/2506.19179.pdf

Abstract:
Affective interaction is not merely about recognizing emotions; it is an embodied, situated process shaped by context and co-created through interaction. In affective computing, the role of haptic feedback within dynamic emotional exchanges remains underexplored. This study investigates how situational emotional cues influence the perception and interpretation of haptic signals given by a robot. In a controlled experiment, 32 participants watched video scenarios in which a robot experienced either positive actions (such as being kissed), negative actions (such as being slapped) or neutral actions. After each video, the robot conveyed its emotional response through haptic communication, delivered via a wearable vibration sleeve worn by the participant. Participants rated the robot's emotional state, namely its valence (positive or negative) and arousal (intensity), based on the video, the haptic feedback, and the combination of the two. The study reveals a dynamic interplay between visual context and touch. Participants' interpretation of haptic feedback was strongly shaped by the emotional context of the video, with visual context often overriding the perceived valence of the haptic signal. Negative haptic cues amplified the perceived valence of the interaction, while positive cues softened it. Furthermore, haptics override the participants' perception of arousal of the video. Together, these results offer insights into how situated haptic feedback can enrich affective human-robot interaction, pointing toward more nuanced and embodied approaches to emotional communication with machines.

Paperid: 567, https://arxiv.org/pdf/2506.13129.pdf

Abstract:
Embedding data visualizations in video can enhance the communication of complex information. However, this process is often labor-intensive, requiring designers to adjust visualizations frame by frame manually. In this work, we present ChartBlender, a novel system that streamlines this process by enabling users to create data visualizations, embed them seamlessly into video scenes, and automatically synchronize them with both camera motion and moving objects. Particularly, ChartBlender incorporates a tracking algorithm that supports both object and camera tracking, ensuring robust alignment of visualizations with dynamic video content. To maintain visual clarity and aesthetic coherence, we also explore the design space of video-suited visualizations and develop a library of customizable templates optimized for video embedding. We evaluate ChartBlender through two controlled experiments and expert interviews with five domain experts. Results show that our system enables accurate synchronization and accelerates the production of data-driven videos.

Paperid: 568, https://arxiv.org/pdf/2506.09707.pdf

Abstract:
Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements, identifying their start and stop times, directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases, therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3), are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 308 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3s across tasks, within typical rater tolerance for timestamp review, enabling practical fidelity QC. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a privacy-preserving, scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
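To make the evaluation quantity concrete, the sketch below converts a normalized boundary offset inside a fixed-length window into an absolute timestamp and computes the mean absolute error against reference timestamps. It assumes offsets are normalized to the range 0 to 1 relative to the window length, which the abstract does not specify; the function names and example numbers are hypothetical.

```python
def offset_to_seconds(window_start, normalized_offset, window_length=30.0):
    """Map an offset in [0, 1] within a window to an absolute time in seconds."""
    return window_start + normalized_offset * window_length


def mean_absolute_error(predicted, reference):
    """Mean absolute difference between paired timestamps, in seconds."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical predictions: (window start in seconds, normalized offset).
predicted = [offset_to_seconds(60.0, 0.5), offset_to_seconds(90.0, 1.0 / 3.0)]
reference = [70.0, 104.0]
print(predicted, mean_absolute_error(predicted, reference))
```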

Paperid: 569, https://arxiv.org/pdf/2506.04072.pdf

Abstract:
Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques -- specifically modular methods that do not require model fine-tuning -- can adapt LLM outputs to better support absolute beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails to control output difficulty, the use of future discriminators (Yang and Klein, 2021) significantly improves output comprehensibility (from 40.4% to 84.3%). We further introduce a novel token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
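Read literally, the Token Miss Rate is a per-utterance ratio: incomprehensible tokens divided by all tokens. The sketch below computes that ratio, treating a token as comprehensible when it appears in a learner vocabulary set. That lookup rule, the pre-tokenized input, and the function name are assumptions for illustration; the abstract does not say how comprehensibility is decided.

```python
def token_miss_rate(tokens, known_vocabulary):
    """Proportion of tokens in one utterance that are not in the known vocabulary."""
    if not tokens:
        return 0.0
    missed = sum(1 for token in tokens if token not in known_vocabulary)
    return missed / len(tokens)


# Hypothetical pre-tokenized Japanese utterance and beginner vocabulary.
utterance = ["私", "は", "学生", "です"]
beginner_vocabulary = {"私", "は", "です"}
print(token_miss_rate(utterance, beginner_vocabulary))
```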

Paperid: 570, https://arxiv.org/pdf/2506.02064.pdf

Abstract:
As industry reports claim agentic AI systems deliver double-digit productivity gains and multi-trillion dollar economic potential, the validity of these claims has become critical for investment decisions, regulatory policy, and responsible technology adoption. However, this paper demonstrates that current evaluation practices for agentic AI systems exhibit a systemic imbalance that calls into question prevailing industry productivity claims. Our systematic review of 84 papers (2023--2025) reveals an evaluation imbalance where technical metrics dominate assessments (83%), while human-centered (30%), safety (53%), and economic assessments (30%) remain peripheral, with only 15% incorporating both technical and human dimensions. This measurement gap creates a fundamental disconnect between benchmark success and deployment value. We present evidence from healthcare, finance, and retail sectors where systems excelling on technical metrics failed in real-world implementation due to unmeasured human, temporal, and contextual factors. Our position is not against agentic AI's potential, but rather that current evaluation frameworks systematically privilege narrow technical metrics while neglecting dimensions critical to real-world success. We propose a balanced four-axis evaluation model and call on the community to lead this paradigm shift because benchmark-driven optimization shapes what we build. By redefining evaluation practices, we can better align industry claims with deployment realities and ensure responsible scaling of agentic systems in high-stakes domains.

Paperid: 571, https://arxiv.org/pdf/2506.00308.pdf

Abstract:
Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid-use disorder (OUD)--a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely-used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score while being estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.
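
The triage idea described above (a lightweight model handles routine cases and defers harder ones to a costlier model) can be sketched as a confidence-thresholded cascade. This is only a minimal sketch under stated assumptions: the deferral rule, the `threshold` value, and both toy models are hypothetical stand-ins, not MythTriage's actual components.

```python
def triage(items, cheap_model, costly_model, threshold=0.8):
    """Label each item with the cheap model; defer to the costly model
    whenever the cheap model's confidence falls below the threshold."""
    labels, deferred = [], 0
    for item in items:
        label, confidence = cheap_model(item)
        if confidence < threshold:
            label = costly_model(item)  # expensive path, used sparingly
            deferred += 1
        labels.append(label)
    return labels, deferred


def toy_cheap(text):
    # Hypothetical lightweight classifier: confident only on an obvious keyword.
    if "always" in text:
        return "myth", 0.95
    return "not_myth", 0.55


def toy_costly(text):
    # Hypothetical placeholder for an expensive LLM call.
    return "myth" if "never" in text else "not_myth"


videos = ["treatment always fails", "patients never recover", "clinic opening hours"]
labels, deferred = triage(videos, toy_cheap, toy_costly)
```

The `deferred` count gives a direct handle on cost: the fewer items routed to the costly model, the cheaper the labeling run.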

Paperid: 572, https://arxiv.org/pdf/2505.24266.pdf

Abstract:
Sign language is a natural and visual form of language that uses movements and expressions to convey meaning, serving as a crucial means of communication for individuals who are deaf or hard-of-hearing (DHH). However, the number of people proficient in sign language remains limited, highlighting the need for technological advancements to bridge communication gaps and foster interactions with minorities. Based on recent advancements in embodied humanoid robots, we propose SignBot, a novel framework for human-robot sign language interaction. SignBot integrates a cerebellum-inspired motion control component and a cerebral-oriented module for comprehension and interaction. Specifically, SignBot consists of: 1) Motion Retargeting, which converts human sign language datasets into robot-compatible kinematics; 2) Motion Control, which leverages a learning-based paradigm to develop a robust humanoid control policy for tracking sign language gestures; and 3) Generative Interaction, which incorporates a translator, responder, and generator of sign language, thereby enabling natural and effective communication between robots and humans. Simulation and real-world experimental results demonstrate that SignBot can effectively facilitate human-robot interaction and perform sign language motions with diverse robots and datasets. SignBot represents a significant advancement in automatic sign language interaction on embodied humanoid robot platforms, providing a promising solution to improve communication accessibility for the DHH community.

Paperid: 573, https://arxiv.org/pdf/2505.14983.pdf

Abstract:
For future human-autonomous vehicle (AV) interactions to be effective and smooth, human-aware systems that analyze and align human needs with automation decisions are essential. Achieving this requires systems that account for human cognitive states. We present a novel computational model in the form of a Dynamic Bayesian Network (DBN) that infers the cognitive states of both AV users and other road users, integrating this information into the AV's decision-making process. Specifically, our model captures the well-being of both an AV user and an interacting road user as cognitive states alongside trust. Our DBN models infer beliefs over the AV user's evolving well-being, trust, and intention states, as well as the possible well-being of other road users, based on observed interaction experiences. Using data collected from an interaction study, we refine the model parameters and empirically assess its performance. Finally, we extend our model into a causal inference model (CIM) framework for AV decision-making, enabling the AV to enhance user well-being and trust while balancing these factors with its own operational costs and the well-being of interacting road users. Our evaluation demonstrates the model's effectiveness in accurately predicting users' states and guiding informed, human-centered AV decisions.
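
To make the notion of inferring beliefs over an evolving hidden state concrete, the sketch below shows one generic forward-filtering step for a single discrete variable, the basic update used in dynamic Bayesian networks. It is not the paper's model: the two-state trust variable, the transition matrix, and the observation likelihoods are hypothetical numbers chosen only for illustration.

```python
def filter_step(belief, transition, likelihood):
    """One forward-filtering step: predict with the transition model,
    reweight by the observation likelihood, then normalize."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# States: index 0 = low trust, index 1 = high trust (hypothetical values).
belief = [0.5, 0.5]
transition = [[0.8, 0.2], [0.1, 0.9]]  # transition[i][j] = P(next = j | current = i)
likelihood = [0.3, 0.7]                # P(observed smooth interaction | state)
posterior = filter_step(belief, transition, likelihood)
```

Repeating this step for each new observation yields a belief trajectory over time; a full model would couple several such variables (for example well-being, trust, and intention).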

Paperid: 574, https://arxiv.org/pdf/2505.08894.pdf

Abstract:
Recent advances in generative AI, such as ChatGPT, have transformed access to information in education, knowledge-seeking, and everyday decision-making. However, in many developing regions, access remains a challenge due to the persistent digital divide. To help bridge this gap, we developed WaLLM - a custom AI chatbot over WhatsApp, a widely used communication platform in developing regions. Beyond answering queries, WaLLM offers several features to enhance user engagement: a daily top question, suggested follow-up questions, trending and recent queries, and a leaderboard-based reward system. Our service has been operational for over 6 months, amassing over 14.7K queries from approximately 100 users. In this paper, we present WaLLM's design and a systematic analysis of logs to understand user interactions. Our results show that 55% of user queries seek factual information. "Health and well-being" was the most popular topic (28%), including queries about nutrition and disease, suggesting users view WaLLM as a reliable source. Two-thirds of users' activity occurred within 24 hours of the daily top question. Users who accessed the "Leaderboard" interacted with WaLLM 3x as much as those who did not. We conclude by discussing implications for culture-based customization, user interface design, and appropriate calibration of users' trust in AI systems for developing regions.

Paperid: 575, https://arxiv.org/pdf/2505.08072.pdf

Abstract:
Significant advances have been made in our ability to understand and generate emotionally expressive content such as text and speech, yet comparable progress in sign language technologies remains limited. While computational approaches to sign language translation have focused on capturing lexical content, the emotional dimensions of sign language communication remain largely unexplored. Through semi-structured interviews with eight sign language users across Singapore, Sri Lanka and the United States, including both Deaf and Hard of hearing (DHH) and hearing signers, we investigate how emotions are expressed and perceived in sign languages. Our findings highlight the role of both manual and non-manual elements in emotional expression, revealing universal patterns as well as individual and cultural variations in how signers communicate emotions. We identify key challenges in capturing emotional nuance for sign language translation, and propose design considerations for developing more emotionally-aware sign language technologies. This work contributes to both theoretical understanding of emotional expression in sign language and practical development of interfaces to better serve diverse signing communities.

Paperid: 576, https://arxiv.org/pdf/2504.21800.pdf

Abstract:
Synthetic data adoption in healthcare is driven by privacy concerns, data access limitations, and high annotation costs. We explore synthetic Prolonged Exposure (PE) therapy conversations for PTSD as a scalable alternative for training clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics like turn-taking and treatment fidelity. We introduce and evaluate PE-specific metrics, offering a novel framework for assessing clinical fidelity beyond surface fluency. Our findings show that while synthetic data successfully mitigates data scarcity and protects privacy, capturing the most subtle therapeutic dynamics remains a complex challenge. Synthetic dialogues successfully replicate key linguistic features of real conversations, for instance, achieving a similar Readability Score (89.2 vs. 88.1), while showing differences in some key fidelity markers like distress monitoring. This comparison highlights the need for fidelity-aware metrics that go beyond surface fluency to identify clinically significant nuances. Our model-agnostic framework is a critical tool for developers and clinicians to benchmark generative model fidelity before deployment in sensitive applications. Our findings help clarify where synthetic data can effectively complement real-world datasets, while also identifying areas for future refinement.

Paperid: 577, https://arxiv.org/pdf/2504.18496.pdf

Abstract:
Comprehensive literature review requires synthesizing vast amounts of research -- a labor-intensive and cognitively demanding process. Most prior work focuses either on helping researchers deeply understand a few papers (e.g., for triaging or reading), or retrieving from and visualizing a vast corpus. Deep analysis and synthesis of large paper collections (e.g., to produce a survey paper) is largely conducted manually with little support. We present DimInd, an interactive system that scaffolds literature review across large paper collections through LLM-generated structured representations. DimInd scaffolds literature understanding with multiple levels of compression, from papers, to faceted literature comparison tables with information extracted from individual papers, to taxonomies of concepts, to narrative syntheses. Users are guided through these successive information transformations while maintaining provenance to source text. In an evaluation with 23 researchers, DimInd supported participants in extracting information and conceptually organizing papers with less effort compared to a ChatGPT-assisted baseline workflow.

Paperid: 578, https://arxiv.org/pdf/2504.17267.pdf

Abstract:
Music videos, as a prevalent form of multimedia entertainment, deliver engaging audio-visual experiences to audiences and have gained immense popularity among singers and fans. Creators can express their interpretations of music naturally through visual elements. However, the music video creation process demands proficiency in script design, video shooting, and music-video synchronization, posing significant challenges for non-professionals. Previous work has designed automated music video generation frameworks. However, they suffer from complex input requirements and poor output quality. In response, we present MV-Crafter, a system capable of producing high-quality music videos with synchronized music-video rhythm and style. Our approach involves three technical modules that simulate the human creation process: the script generation module, video generation module, and music-video synchronization module. MV-Crafter leverages a large language model to generate scripts considering the musical semantics. To address the challenge of synchronizing short video clips with music of varying lengths, we propose a dynamic beat matching algorithm and visual envelope-induced warping method to ensure precise, monotonic music-video synchronization. In addition, we design a user-friendly interface to simplify the creation process with intuitive editing features. Extensive experiments have demonstrated that MV-Crafter provides an effective solution for improving the quality of generated music videos.

Paperid: 579, https://arxiv.org/pdf/2504.14776.pdf

Abstract:
Scriptwriting has traditionally been text-centric, a modality that only partially conveys the produced audiovisual experience. A formative study with professional writers informed us that connecting textual and audiovisual modalities can aid ideation and iteration, especially for writing dialogues. In this work, we present Script2Screen, an AI-assisted tool that integrates scriptwriting with audiovisual scene creation in a unified, synchronized workflow. Focusing on dialogues in scripts, Script2Screen generates expressive scenes with emotional speeches and animated characters through a novel text-to-audiovisual-scene pipeline. The user interface provides fine-grained controls, allowing writers to fine-tune audiovisual elements such as character gestures, speech emotions, and camera angles. A user study with both novice and professional writers from various domains demonstrated that Script2Screen's interactive audiovisual generation enhances the scriptwriting process, facilitating iterative refinement while complementing, rather than replacing, their creative efforts.

Paperid: 580, https://arxiv.org/pdf/2504.14769.pdf

Abstract:
As generative language models (GLMs) have gained popularity, youth are increasingly using them in their everyday lives. As such, most research has centered on supporting youth as users of GLM-powered systems. However, we know little of how to engage youth in the design of these models. Building on the rich legacy of child-computer interaction research that positions youth as designers of computing systems, we explore how to support young people in designing GLMs. Through a case study of three teenagers (ages 14-15) building a babyGPT screenplay generator, we illustrate how the team developed a model while engaging in artificial intelligence/machine learning-relevant data practices and addressing ethical issues. This paper contributes a case study that demonstrates the feasibility of engaging youth in building GLMs.

Paperid: 581, https://arxiv.org/pdf/2504.13955.pdf

Abstract:
The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%). Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.

Paperid: 582, https://arxiv.org/pdf/2504.13883.pdf

Abstract:
This study estimates cognitive effort (CE) based on functional near-infrared spectroscopy (fNIRS) data and performance scores using a hybrid deep learning model. The estimation of CE enables educators to modify material to enhance learning effectiveness and student engagement. Relative neural efficiency (RNE) and relative neural involvement (RNI) are two metrics that have been used to represent CE. To estimate RNE and RNI, we need the hemodynamic response in the brain and the performance score of a task. We collected oxygenated hemoglobin ($\Delta\mathrm{HbO}$). Sixteen participants answered 16 questions in a Unity-based educational game, each with a 30-second response time. We used deep learning models to predict the performance score and estimate RNE and RNI to understand CE. The study compares traditional machine learning techniques with deep learning models such as CNN, LSTM, BiLSTM, and a hybrid CNN-GRU to determine which approach provides better accuracy in predicting performance scores. The results show that the hybrid CNN-GRU performs better than the other models, with 78.36% training accuracy and 73.08% test accuracy. We performed XGBoost on the extracted GRU feature and got the highest accuracy (69.23%). This suggests that the features learned from this hybrid model generalize better even in traditional machine learning algorithms. We used the $\Delta\mathrm{HbO}$ and predicted score to calculate RNE and RNI to observe cognitive effort in our four test cases. Our results show that even with moderate accuracy, the predicted RNE and RNI closely follow the actual trends. We also observed that when participants were in a state of high CE, introducing rest led to a decrease in CE. These findings can help design and improve learning environments and provide valuable insights into learning materials.
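
For orientation, the sketch below computes RNE and RNI using the formulation commonly seen in the neuroergonomics literature: performance and effort are z-scored across observations, then RNE = (z_P - z_CE) / sqrt(2) and RNI = (z_P + z_CE) / sqrt(2). Whether the paper uses exactly this normalization is an assumption, and the performance scores and effort values (standing in for $\Delta\mathrm{HbO}$) are made-up numbers.

```python
import math
import statistics


def zscores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def rne_rni(performance, effort):
    """Return (RNE, RNI) lists from paired performance and effort measurements."""
    z_p, z_e = zscores(performance), zscores(effort)
    rne = [(p - e) / math.sqrt(2) for p, e in zip(z_p, z_e)]
    rni = [(p + e) / math.sqrt(2) for p, e in zip(z_p, z_e)]
    return rne, rni


# Hypothetical values: performance scores and effort (e.g. mean delta-HbO).
performance = [0.2, 0.5, 0.8]
effort = [0.9, 0.5, 0.1]
rne, rni = rne_rni(performance, effort)
```

Under this formulation, high performance with low effort gives positive RNE (efficient), while high performance with high effort gives positive RNI (highly involved).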

Paperid: 583, https://arxiv.org/pdf/2504.13587.pdf

Abstract:
Retrieval-augmented generation (RAG) pipelines have become the de-facto approach for building AI assistants with access to external, domain-specific knowledge. Given a user query, RAG pipelines typically first retrieve (R) relevant information from external sources, before invoking a Large Language Model (LLM), augmented (A) with this information, to generate (G) responses. Modern RAG pipelines frequently chain multiple retrieval and generation components, in any order. However, developing effective RAG pipelines is challenging because retrieval and generation components are intertwined, making it hard to identify which component(s) cause errors in the eventual output. The parameters with the greatest impact on output quality often require hours of pre-processing after each change, creating prohibitively slow feedback cycles. To address these challenges, we present RAGGY, a developer tool that combines a Python library of composable RAG primitives with an interactive interface for real-time debugging. We contribute the design and implementation of RAGGY, insights into expert debugging patterns through a qualitative study with 12 engineers, and design implications for future RAG tools that better align with developers' natural workflows.
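
The retrieve, augment, generate flow described above can be sketched with three small composable functions. This is a toy illustration, not RAGGY's API: the function names, the word-overlap retriever, and the stub `llm` callable are all hypothetical. Returning the intermediate passages and prompt alongside the answer reflects the debugging concern raised in the abstract, since it lets a developer see which stage produced a bad output.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (stand-in for a real retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(query, passages):
    """Build a prompt that places the retrieved passages before the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


def rag(query, documents, llm, k=2):
    """Run retrieve -> augment -> generate and keep every intermediate result."""
    passages = retrieve(query, documents, k)
    prompt = augment(query, passages)
    return {"passages": passages, "prompt": prompt, "answer": llm(prompt)}


docs = [
    "chunk size affects retrieval quality",
    "bananas are yellow",
    "retrieval uses embeddings",
]
stub_llm = lambda prompt: "stub answer"  # hypothetical placeholder for a real model call
result = rag("how does chunk size affect retrieval", docs, stub_llm)
```

Inspecting `result["passages"]` before `result["answer"]` separates retrieval errors from generation errors.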

Paperid: 584, https://arxiv.org/pdf/2504.11928.pdf

Abstract:
Today's world is witnessing an unparalleled rate of technological transformation. The emergence of non-fungible tokens (NFTs) has transformed how we handle digital assets and value. Despite their initial popularity, NFTs face declining adoption influenced not only by cryptocurrency volatility but also by trust dynamics within communities. From a social computing perspective, understanding these trust dynamics offers valuable insights for the development of both the NFT ecosystem and the broader digital economy. China presents a compelling context for examining these dynamics, offering a unique intersection of technological innovation and traditional cultural values. Through a content analysis of eight Chinese NFT-focused WeChat groups and 21 semi-structured interviews, we examine how socio-cultural factors influence trust formation and development. We found that trust in Chinese NFT communities is significantly molded by local cultural values. To be precise, Confucian virtues, such as benevolence, propriety, and integrity, play a crucial role in shaping these trust relationships. Our research identifies three critical trust dimensions in China's NFT market: (1) technological, (2) institutional, and (3) social. We examined the challenges in cultivating each dimension. Based on these insights, we developed tailored trust-building guidelines for Chinese NFT stakeholders. These guidelines address trust issues that factor into NFT's declining popularity and could offer valuable strategies for CSCW researchers, developers, and designers aiming to enhance trust in global NFT communities. Our research urges CSCW scholars to take into account the unique socio-cultural contexts when developing trust-enhancing strategies for digital innovations and online interactions.

Paperid: 585, https://arxiv.org/pdf/2504.11146.pdf

Abstract:
AI-powered chatbots and digital teaching assistants (AI TAs) are gaining popularity in programming education, offering students timely and personalized feedback. Despite their potential benefits, concerns about student over-reliance and academic misconduct have prompted the introduction of "guardrails" into AI TAs - features that provide scaffolded support rather than direct solutions. However, overly restrictive guardrails may lead students to bypass these tools and use unconstrained AI models, where interactions are not observable, thus limiting our understanding of students' help-seeking behaviors. To investigate this, we designed and deployed a novel AI TA tool with optional guardrails in one lab of a large introductory programming course. As students completed three code writing and debugging tasks, they had the option to receive guardrailed help or use a "See Solution" feature which disabled the guardrails and generated a verbatim response from the underlying model. We investigate students' motivations and use of this feature and examine the association between usage and their course performance. We found that 50% of the 885 students used the "See Solution" feature for at least one problem and 14% used it for all three problems. Additionally, low-performing students were more likely to use this feature and use it close to the deadline as they started assignments later. The predominant factors that motivated students to disable the guardrails were assistance in solving problems, time pressure, lack of self-regulation, and curiosity. Our work provides insights into students' solution-seeking motivations and behaviors, which has implications for the design of AI TAs that balance pedagogical goals with student preferences.

Paperid: 586, https://arxiv.org/pdf/2504.09332.pdf

Abstract:
As AI systems become increasingly integrated into our daily lives and into wearable form factors, there's a fundamental tension between their potential to proactively assist us and the risk of creating intrusive, dependency-forming experiences. This work proposes the concept of a Goldilocks Time Window -- a contextually adaptive time window for proactive AI systems to deliver effective interventions. We discuss the critical factors that determine the time window, and the need for a framework for designing and evaluating proactive AI systems that can navigate this tension successfully.

Paperid: 587, https://arxiv.org/pdf/2504.09004.pdf

Abstract:
Applications of Generative AI (GenAI), such as ChatGPT, have gained popularity among the public due to their ease of access, use, and support of educational and creative activities. Despite these benefits, GenAI poses unique risks for families, such as lacking sufficient safeguards tailored to protect children under 16 years of age and not offering parental control features. This study explores families' use and co-use of GenAI, the perceived risks and opportunities of ChatGPT, and how parents mediate their children's use of GenAI. Through semi-structured interviews with 12 families, we identified ways families used and mediated GenAI and factors that influenced parents' GenAI mediation strategies. We contextualize our findings with a modified model of family mediation strategies, drawing from previous family media and mediation frameworks. We provide insights for future research on family-GenAI interactions and highlight the need for more robust protective measures on GenAI platforms for families.

Paperid: 588, https://arxiv.org/pdf/2504.08952.pdf

Abstract:
Risk reporting is essential for documenting AI models, yet only 14% of model cards mention risks, and 96% of those copy content from a small set of cards, leading to a lack of actionable insights. Existing proposals for improving model cards do not resolve these issues. To address this, we introduce RiskRAG, a Retrieval Augmented Generation based risk reporting solution guided by five design requirements we identified from literature, and co-design with 16 developers: identifying diverse model-specific risks, clearly presenting and prioritizing them, contextualizing for real-world uses, and offering actionable mitigation strategies. Drawing from 450K model cards and 600 real-world incidents, RiskRAG pre-populates contextualized risk reports. A preliminary study with 50 developers showed that they preferred RiskRAG over standard model cards, as it better met all the design requirements. A final study with 38 developers, 40 designers, and 37 media professionals showed that RiskRAG improved their way of selecting the AI model for a specific application, encouraging more careful and deliberative decision-making. The RiskRAG project page is accessible at: https://social-dynamics.net/ai-risks/card.

Paperid: 589, https://arxiv.org/pdf/2504.07202.pdf

Abstract:
Research on children and youth's participation in different roles in the design of technologies is one of the core contributions in child-computer interaction studies. Building on this work, we situate youth as advisors to a group of high school computer science teacher- and researcher-designers creating learning activities in the context of emerging technologies. Specifically, we explore algorithm auditing as a potential entry point for youth and adults to critically evaluate generative AI algorithmic systems, with the goal of designing classroom lessons. Through a two-hour session where three teenagers (16-18 years) served as advisors, we (1) examine the types of expertise the teens shared and (2) identify back stage design elements that fostered their agency and voice in this advisory role. Our discussion considers opportunities and challenges in situating youth as advisors, providing recommendations for actions that researchers, facilitators, and teachers can take to make this unusual arrangement feasible and productive.

Paperid: 590, https://arxiv.org/pdf/2504.06167.pdf

Abstract:
As social service robots become commonplace, it is essential for them to effectively interpret human signals, such as verbal cues, gestures, and eye gaze, when people need to focus on their primary tasks to minimize interruptions and distractions. Toward such a socially acceptable Human-Robot Interaction, we conducted a study ($N=24$) in an AR-simulated context of a coffee chat. Participants elicited social cues to signal intentions to an anthropomorphic, zoomorphic, grounded technical, or aerial technical robot waiter when they were speakers or listeners. Our findings reveal common patterns of social cues over intentions, the effects of robot morphology on social cue position and conversational role on social cue complexity, and users' rationale in choosing social cues. We offer insights into understanding social cues concerning perceptions of robots, cognitive load, and social context. Additionally, we discuss design considerations on approaching, social cue recognition, and response strategies for future service robots.

Paperid: 591, https://arxiv.org/pdf/2504.01259.pdf

Abstract:
Advancements in Large Language Models (LLMs), such as ChatGPT, offer significant opportunities to enhance instructional support in introductory programming courses. While extensive research has explored the effectiveness of LLMs in supporting student learning, limited studies have examined how these models can assist instructors in designing instructional activities. This work investigates how instructors' expertise in effective activity design can be integrated with LLMs' ability to generate novel and targeted programming problems, facilitating more effective activity creation for programming classrooms. To achieve this, we employ a participatory design approach to develop an instructor-authoring tool that incorporates LLM support, fostering collaboration between instructors and AI in generating programming exercises. This tool also allows instructors to specify common student mistakes and misconceptions, which informs the adaptive feedback generation process. We conduct case studies with three instructors, analyzing how they use our system to design programming problems for their introductory courses. Through these case studies, we assess instructors' perceptions of the usefulness and limitations of LLMs in authoring problem statements for instructional purposes. Additionally, we compare the efficiency, quality, effectiveness, and coverage of designed activities when instructors create problems with and without structured LLM prompting guidelines. Our findings provide insights into the potential of LLMs in enhancing instructor workflows and improving programming education and provide guidelines for designing effective AI-assisted problem-authoring interfaces.

Paperid: 592, https://arxiv.org/pdf/2504.00160.pdf

Abstract:
Digital games offer rich social experiences and promote valuable skills, but they fall short in addressing physical inactivity. Exergames, which combine exercise with gameplay, have the potential to tackle this issue. However, current exergames are primarily single-player or competitive. To explore the social benefits of cooperative exergaming, we designed a custom co-located cooperative exergame that features three distinct forms of cooperation: Free (baseline), Coupled, and Concurrent. We conducted a within-participants, mixed-methods study (N = 24) to evaluate these designs and their impact on players' enjoyment, motivation, and performance. Our findings reveal that cooperative play improves social experiences. It drives increased team identification and relatedness. Furthermore, our qualitative findings support cooperative exergame play. This has design implications for creating exergames that effectively address players' exercise and social needs. Our research contributes guidance for developers and researchers who want to create more socially enriching exergame experiences.

Paperid: 593, https://arxiv.org/pdf/2503.19636.pdf

Abstract:
Physiological signals hold immense potential for ubiquitous emotion monitoring, presenting numerous applications in emotion recognition. However, harnessing this potential is hindered by significant challenges, particularly in the collection of annotations that align with physiological changes since the process hinges heavily on human participants. In this work, we set out to study human participant perspectives in the emotion data collection procedure. We conducted a lab-based emotion data collection study with 37 participants using 360-degree virtual reality video stimuli followed by semi-structured interviews with the study participants. Our findings show that intrinsic factors like participant perception, experiment design nuances, and experiment setup suitability impact their emotional response and annotation within lab settings. Drawing from our findings and prior research, we propose recommendations for incorporating participant context into annotations and emphasizing participant-centric experiment designs. Furthermore, we explore current emotion data collection practices followed by AI practitioners and offer insights for future contributions leveraging physiological emotion data.

Paperid: 594, https://arxiv.org/pdf/2503.15514.pdf

Abstract:
As artificial intelligence surpasses human performance in select tasks, disclosing superhuman capabilities poses distinct challenges for fairness, accountability, and trust. However, the impact of such disclosures on diverse user attitudes and behaviors remains unclear, particularly concerning potential negative reactions like discouragement or overreliance. This paper investigates these effects by utilizing Persona Cards: a validated, standardized set of synthetic personas designed to simulate diverse user reactions and fairness perspectives. We conducted an ethics board-approved study (N=32), utilizing these personas to investigate how capability disclosure influenced behaviors with a superhuman game AI in competitive StarCraft II scenarios. Our results reveal transparency is double-edged: while disclosure could alleviate suspicion, it also provoked frustration and strategic defeatism among novices in cooperative scenarios, as well as overreliance in competitive contexts. Experienced and competitive players interpreted disclosure as confirmation of an unbeatable opponent, shifting to suboptimal goals. We release the Persona Cards Dataset, including profiles, prompts, interaction logs, and protocols, to foster reproducible research into human alignment AI design. This work demonstrates that transparency is not a cure-all; successfully leveraging disclosure to enhance trust and accountability requires careful tailoring to user characteristics, domain norms, and specific fairness objectives.

Paperid: 595, https://arxiv.org/pdf/2503.13369.pdf

Abstract:
Often, the needs and visual abilities differ between the annotator group and the end user group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain. Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat lacking by BLV standards. In this study, we ask sighted individuals to assess -- rather than produce -- diagram descriptions generated by vision-language models (VLM) that have been guided with latent supervision via a multi-pass inference. The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes and demonstrate their fine-tuning potential in various downstream tasks.

Paperid: 596, https://arxiv.org/pdf/2503.07463.pdf

Abstract:
Cognitive augmentation is a cornerstone in advancing education, particularly through personalized learning. However, personalizing extensive textual materials, such as narratives and academic textbooks, remains challenging due to their heavy use, which can hinder learner engagement and understanding. Building on cognitive theories like Dual Coding Theory -- which posits that combining textual and visual information enhances comprehension and memory -- this study explores the potential of Generative AI (GenAI) to enrich educational materials. We utilized large language models (LLMs) to generate concise text summaries and image generation models (IGMs) to create visually aligned content from textual inputs. After recruiting 24 participants, we verified that integrating AI-generated supplementary materials significantly improved learning outcomes, increasing post-reading test scores by 7.50%. These findings underscore GenAI's transformative potential in creating adaptive learning environments that enhance cognitive augmentation.

Paperid: 597, https://arxiv.org/pdf/2503.06416.pdf

Abstract:
We conducted an International AI Negotiation Competition in which participants designed and refined prompts for AI negotiation agents. We then facilitated over 180,000 negotiations between these agents across multiple scenarios with diverse characteristics and objectives. Our findings revealed that principles from human negotiation theory remain crucial even in AI-AI contexts. Surprisingly, warmth--a traditionally human relationship-building trait--was consistently associated with superior outcomes across all key performance metrics. Dominant agents, meanwhile, were especially effective at claiming value. Our analysis also revealed unique dynamics in AI-AI negotiations not fully explained by existing theory, including AI-specific technical strategies like chain-of-thought reasoning, prompt injection, and strategic concealment. When we applied natural language processing (NLP) methods to the full transcripts of all negotiations we found positivity, gratitude and question-asking (associated with warmth) were strongly associated with reaching deals as well as objective and subjective value, whereas conversation lengths (associated with dominance) were strongly associated with impasses. The results suggest the need to establish a new theory of AI negotiation, which integrates classic negotiation theory with AI-specific negotiation theories to better understand autonomous negotiations and optimize agent performance.

Paperid: 598, https://arxiv.org/pdf/2503.06335.pdf

Abstract:
According to the recently introduced theory of artistic support tools, creativity support tools exert normative influences over artistic production, instantiating a normative ground that shapes both the process and product of artistic expression. We argue that the normative ground of most existing automated writing tools is misaligned with writerly values and identify a potential alternative frame -- material writing support -- for experimental poetry tools that flexibly support the finding, processing, transforming, and shaping of text(s). Based on this frame, we introduce Phraselette, an artistic material writing support interface that helps experimental poets search for words and phrases. To provide material writing support, Phraselette is designed to counter the dominant mode of automated writing tools, while offering language model affordances in line with writerly values. We further report on an extended expert evaluation involving 10 published poets that indicates support for both our framing of material writing support and for Phraselette itself.

Paperid: 599, https://arxiv.org/pdf/2503.05039.pdf

Abstract:
Instructors play a pivotal role in integrating AI into education, yet their adoption of AI-powered tools remains inconsistent. Despite this, limited research explores how to design AI tools that support broader instructor adoption. This study applies a human-centered design approach, incorporating qualitative methods, to investigate the design of interactive pedagogical agents that provide instructional suggestions in response to instructors' questions. We conducted a formative study involving interviews with five pedagogy experts to examine existing strategies for supporting instructors' pedagogical needs. Building on these insights, we facilitated a participatory design session with ten pedagogy experts, where participants reviewed a storyboard depicting a chatbot designed for instructors with varying levels of AI literacy and differing attitudes toward AI. Experts also evaluated the quality of LLM-generated suggestions based on common teaching challenges. Our findings highlight the need for chatbot interactions that foster trust, especially for AI-conservative instructors. Experts emphasized the importance of social transparency (for example, showing how peers use the tool) and allowing instructors to flexibly control how much or how little they engage with the system. We also propose design recommendations to enhance the quality of AI-generated teaching suggestions, such as adapting them to reflect instructors' prior teaching experience. This work underscores the urgent need to support AI-conservative instructors, as AI literacy and attitudes are closely intertwined. Without thoughtful design, there is a risk of widening pedagogical divides and reducing students' learning opportunities.

Paperid: 600, https://arxiv.org/pdf/2503.03927.pdf

Abstract:
Machine learning models deployed locally on social media applications are used for features such as face filters, which read faces in real time, and they expose sensitive attributes to the apps. However, the deployment of machine learning models, e.g., when, where, and how they are used, in social media applications is opaque to users. We aim to address this inconsistency and investigate how social media user perceptions and behaviors change once exposed to these models. We conducted user studies (N=21) and found that participants were unaware of both what the models output and when the models were used in Instagram and TikTok, two major social media platforms. In response to being exposed to the models' functionality, we observed long-term behavior changes in 8 participants. Our analysis uncovers the challenges and opportunities in providing transparency for machine learning models that interact with local user data.

Paperid: 601, https://arxiv.org/pdf/2503.02099.pdf

Abstract:
Reading assessments are essential for enhancing students' comprehension, yet many EdTech applications focus mainly on outcome-based metrics, providing limited insights into student behavior and cognition. This study investigates the use of multimodal data sources -- including eye-tracking data, learning outcomes, assessment content, and teaching standards -- to derive meaningful reading insights. We employ unsupervised learning techniques to identify distinct reading behavior patterns, and then a large language model (LLM) synthesizes the derived information into actionable reports for educators, streamlining the interpretation process. LLM experts and human educators evaluate these reports for clarity, accuracy, relevance, and pedagogical usefulness. Our findings indicate that LLMs can effectively function as educational analysts, turning diverse data into teacher-friendly insights that are well-received by educators. While promising for automating insight generation, human oversight remains crucial to ensure reliability and fairness. This research advances human-centered AI in education, connecting data-driven analytics with practical classroom applications.

Paperid: 602, https://arxiv.org/pdf/2503.01354.pdf

Abstract:
As online communication continues to expand, participants often face cognitive fatigue and reduced engagement. Cognitive augmentation, which leverages technology to enhance human abilities, offers promising solutions to these challenges. In this study, we investigate the potential of generative artificial intelligence (GenAI) for real-time music generation to enrich online meetings. We introduce Discussion Jockey 2, a system that dynamically produces background music in response to live conversation transcripts. Through a user study involving 14 participants in an online interview setting, we examine the system's impact on relaxation, concentration, and overall user experience. The findings reveal that AI-generated background music significantly enhances user relaxation (average score: 5.75/9) and concentration (average score: 5.86/9). This research underscores the promise of context-aware music generation in improving the quality of online communication and points to future directions for optimizing its implementation across various virtual environments.

Paperid: 603, https://arxiv.org/pdf/2502.16383.pdf

Abstract:
Generative AI (GAI) is reshaping the way young users engage with technology. This study introduces a taxonomy of risks associated with youth-GAI interactions, derived from an analysis of 344 chat transcripts between youth and GAI chatbots, 30,305 Reddit discussions concerning youth engagement with these systems, and 153 documented AI-related incidents. We categorize risks into six overarching themes, identifying 84 specific risks, which we further align with four distinct interaction pathways. Our findings highlight emerging concerns, such as risks to mental wellbeing, behavioral and social development, and novel forms of toxicity, privacy breaches, and misuse/exploitation that are not fully addressed in existing frameworks on child online safety or AI risks. By systematically grounding our taxonomy in empirical data, this work offers a structured approach to aiding AI developers, educators, caregivers, and policymakers in comprehending and mitigating risks associated with youth-GAI interactions.

Paperid: 604, https://arxiv.org/pdf/2502.14019.pdf

Abstract:
As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also increasingly raised concerns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourcing study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.

Paperid: 605, https://arxiv.org/pdf/2502.11752.pdf

Abstract:
Human-robot collaboration (HRC) relies on accurate and timely recognition of human intentions to ensure seamless interactions. Among common HRC tasks, human-to-robot object handovers have been studied extensively for planning the robot's actions during object reception, assuming the human intention for object handover. However, distinguishing handover intentions from other actions has received limited attention. Most research on handovers has focused on visually detecting motion trajectories, which often results in delays or false detections when trajectories overlap. This paper investigates whether human intentions for object handovers are reflected in non-movement-based physiological signals. We conduct a multimodal analysis comparing three data modalities: electroencephalogram (EEG), gaze, and hand-motion signals. Our study aims to distinguish between handover-intended human motions and non-handover motions in an HRC setting, evaluating each modality's performance in predicting and classifying these actions before and after human movement initiation. We develop and evaluate human intention detectors based on these modalities, comparing their accuracy and timing in identifying handover intentions. To the best of our knowledge, this is the first study to systematically develop and test intention detectors across multiple modalities within the same experimental context of human-robot handovers. Our analysis reveals that handover intention can be detected from all three modalities. Nevertheless, gaze signals are the earliest as well as the most accurate to classify the motion as intended for handover or non-handover.

Paperid: 606, https://arxiv.org/pdf/2502.10924.pdf

Abstract:
The rise of generative AI technology has sparked interest in using digital information to create AI-generated agents as digital legacy. These agents, often referred to as "AI Afterlives", present unique challenges compared to traditional digital legacy. Yet, there is limited human-centered research on "AI Afterlife" as digital legacy, especially from the perspectives of the individuals being represented by these agents. This paper presents a qualitative study examining users' perceptions, expectations, and concerns regarding AI-generated agents as digital legacy. We identify factors shaping people's attitudes, their perceived differences compared with the traditional digital legacy, and concerns they might have in real practices. We also examine the design aspects throughout the life cycle and interaction process. Based on these findings, we situate "AI Afterlife" in digital legacy, and delve into design implications for maintaining identity consistency and balancing intrusiveness and support in "AI Afterlife" as digital legacy.

Paperid: 607, https://arxiv.org/pdf/2502.09870.pdf

Abstract:
Recent attention to anthropomorphism -- the attribution of human-like qualities to non-human objects or entities -- of language technologies like LLMs has sparked renewed discussions about potential negative impacts of anthropomorphism. To productively discuss the impacts of this anthropomorphism and in what contexts it is appropriate, we need a shared vocabulary for the vast variety of ways that language can be anthropomorphic. In this work, we draw on existing literature and analyze empirical cases of user interactions with language technologies to develop a taxonomy of textual expressions that can contribute to anthropomorphism. We highlight challenges and tensions involved in understanding linguistic anthropomorphism, such as how all language is fundamentally human and how efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans. We discuss ways that our taxonomy supports more precise and effective discussions of and decisions about anthropomorphism of language technologies.

Paperid: 608, https://arxiv.org/pdf/2502.09101.pdf

Abstract:
Large language models (LLMs) have demonstrated exceptional capabilities in understanding and generation. However, when interacting with human instructions in real-world scenarios, LLMs still face significant challenges, particularly in accurately capturing and comprehending human instructions and intentions. This paper focuses on three challenges in LLM-based text generation tasks: instruction understanding, intention reasoning, and reliable dialog generation. Regarding complex human instructions, LLMs have deficiencies in understanding long contexts and instructions in multi-round conversations. For intention reasoning, LLMs may have inconsistent command reasoning, difficulty reasoning about commands containing incorrect information, difficulty understanding users' ambiguous language commands, and a weak understanding of user intention in commands. In terms of reliable dialog generation, LLMs may produce unstable content and unethical generations. To this end, we classify and analyze the performance of LLMs in challenging scenarios and conduct a comprehensive evaluation of existing solutions. Furthermore, we introduce benchmarks and categorize them based on the aforementioned three core challenges. Finally, we explore potential directions for future research to enhance the reliability and adaptability of LLMs in real-world applications.

Paperid: 609, https://arxiv.org/pdf/2502.07404.pdf

Abstract:
Human-in-the-loop (HITL) frameworks are increasingly recognized for their potential to improve annotation accuracy in emotion estimation systems by combining machine predictions with human expertise. This study focuses on integrating a high-performing image-based emotion model into a HITL annotation framework to evaluate the collaborative potential of human-machine interaction and identify the psychological and practical factors critical to successful collaboration. Specifically, we investigate how varying model reliability and cognitive framing influence human trust, cognitive load, and annotation behavior in HITL systems. We demonstrate that model reliability and psychological framing significantly impact annotators' trust, engagement, and consistency, offering insights into optimizing HITL frameworks. Through three experimental scenarios with 29 participants--baseline model reliability (S1), fabricated errors (S2), and cognitive bias introduced by negative framing (S3)--we analyzed behavioral and qualitative data. Reliable predictions in S1 yielded high trust and annotation consistency, while unreliable outputs in S2 led to increased critical evaluations but also heightened frustration and response variability. Negative framing in S3 revealed how cognitive bias influenced participants to perceive the model as more relatable and accurate, despite misinformation regarding its reliability. These findings highlight the importance of both reliable machine outputs and psychological factors in shaping effective human-machine collaboration. By leveraging the strengths of both human oversight and automated systems, this study establishes a scalable HITL framework for emotion annotation and lays the foundation for broader applications in adaptive learning and human-computer interaction.

Paperid: 610, https://arxiv.org/pdf/2502.06053.pdf

Abstract:
Visualizing a large-scale volumetric dataset with high resolution is challenging due to the high computational time and space complexity. Recent deep-learning-based image inpainting methods significantly improve rendering latency by reconstructing a high-resolution image for visualization in constant time on GPU from a partially rendered image where only a small portion of pixels go through the expensive rendering pipeline. However, existing methods need to render every pixel of a predefined regular sampling pattern. In this work, we provide Importance Mask Learning (IML) and Synthesis (IMS) networks which are the first attempts to learn importance regions from the sampling pattern to further minimize the number of pixels to render by jointly considering the dataset, user's view parameters, and the downstream reconstruction neural networks. Our solution is a unified framework to handle various image inpainting-based visualization methods through the proposed differentiable compaction/decompaction layers. Experiments show our method can further improve the overall rendering latency of state-of-the-art volume visualization methods using reconstruction neural network for free when rendering scientific volumetric datasets. Our method can also directly optimize the off-the-shelf pre-trained reconstruction neural networks without elongated retraining.

Paperid: 611, https://arxiv.org/pdf/2502.03682.pdf

Abstract:
Intimate Partner Infiltration (IPI)--a type of Intimate Partner Violence (IPV) that typically requires physical access to a victim's device--is a pervasive concern around the world, often manifesting through digital surveillance, control, and monitoring. Unlike conventional cyberattacks, IPI perpetrators leverage close proximity and personal knowledge to circumvent standard protections, underscoring the need for targeted interventions. While security clinics and other human-centered approaches effectively tailor solutions for victims, their scalability remains constrained by resource limitations and the need for specialized counseling. We present AID, an Automated IPI Detection system that continuously monitors for unauthorized access and suspicious behaviors on smartphones. AID employs a unified architecture to process multimodal signals stealthily and preserve user privacy. A brief calibration phase upon installation enables AID to adapt to each user's behavioral patterns, achieving high accuracy with minimal false alarms. Our 27-participant user study demonstrates that AID achieves highly accurate detection of non-owner access and fine-grained IPI-related activities, attaining a false positive rate of 1.6%, which is 11x lower than existing methods, and an end-to-end F1 score of 0.981. These findings suggest that AID can serve as a forensic tool that security clinics can deploy to scale their ability to identify IPI tactics and deliver personalized, far-reaching support to survivors.
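
As a rough illustration of the two metrics quoted in the abstract above (false positive rate and F1 score), the following sketch computes both from confusion-matrix counts. The function names and the counts are hypothetical placeholders chosen for the example; they are not data or code from the AID study.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): share of true negatives wrongly flagged."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN): harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts (assumption): 100 non-owner sessions, 100 owner sessions.
tp, fp, tn, fn = 90, 5, 95, 10

fpr = false_positive_rate(fp, tn)
f1 = f1_score(tp, fp, fn)
print(f"FPR = {fpr:.3f}, F1 = {f1:.3f}")
```

A low false positive rate matters in this setting because each false alarm wrongly flags the device owner's own activity as non-owner access.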

Paperid: 612, https://arxiv.org/pdf/2502.02370.pdf

Abstract:
People often find it difficult to turn their intentions into real actions -- a challenge that affects both personal growth and mental well-being. While established methods like cognitive-behavioral therapy and mindfulness training help people become more aware of their behaviors and set clear goals, these approaches cannot provide immediate guidance when people fall into automatic reactions or habits. We introduce Mirai, a novel wearable AI system with an integrated camera, real-time speech processing, and personalized voice-cloning to provide proactive and contextual nudges for positive behavior change. Mirai continuously monitors and analyzes the user's environment to anticipate their intentions, generating contextually-appropriate responses delivered in the user's own cloned voice. We demonstrate the application of Mirai through three scenarios focusing on dietary choices, work productivity, and communication skills. We also discuss future work on improving the proactive agent via human feedback and the need for a longitudinal study in naturalistic settings.

Paperid: 613, https://arxiv.org/pdf/2502.00952.pdf

Abstract:
We often treat social media as a lens onto society. How might that lens be distorting the actual popularity of political and social viewpoints? In this paper, we examine the difference between the viewpoints publicly posted in a community and the privately surveyed viewpoints of community members, contributing a measurement of a theory called the "spiral of silence." This theory observes that people are less likely to voice their opinion when they believe they are in the minority--leading to a spiral where minority opinions are less likely to be shared, so they appear even further in the minority, and become even less likely to be shared. We surveyed active members of politically oriented Reddit communities to gauge their willingness to post on contentious topics, yielding 627 responses from 108 participants about 11 topics and 33 subreddits. We find that 72.6% of participants who perceive themselves in the minority remain silent, and are only half as likely to post their viewpoint compared to those who believe their opinion is in the majority. Communities perceived as being more inclusive reduce the magnitude of this effect. These results emphasize how far out of step the opinions we see online may be with the population they purport to represent.

Paperid: 614, https://arxiv.org/pdf/2501.18468.pdf

Abstract:
Understanding reader behaviors such as skimming, deep reading, and scanning is essential for improving educational instruction. While prior eye-tracking studies have trained models to recognize reading behaviors, they often rely on instructed reading tasks, which can alter natural behaviors and limit the applicability of these findings to in-the-wild settings. Additionally, there is a lack of clear definitions for reading behavior archetypes in the literature. We conducted a classroom study to address these issues by collecting instructed and in-the-wild reading data. We developed a mixed-method framework, including a human-driven theoretical model, statistical analyses, and an AI classifier, to differentiate reading behaviors based on their velocity, density, and sequentiality. Our lightweight 2D CNN achieved an F1 score of 0.8 for behavior recognition, providing a robust approach for understanding in-the-wild reading. This work advances our ability to provide detailed behavioral insights to educators, supporting more targeted and effective assessment and instruction.

Paperid: 615, https://arxiv.org/pdf/2501.17247.pdf

Abstract:
Recent research suggests that the use of Generative AI tools may result in diminished critical thinking during knowledge work. We study the effect on knowledge work of provocations: brief textual prompts that offer critiques for and propose alternatives to AI suggestions. We conduct a between-subjects study (n=24) in which participants completed AI-assisted shortlisting tasks with and without provocations. We find that provocations can induce critical and metacognitive thinking. We derive five dimensions that impact the user experience of provocations: task urgency, task importance, user expertise, provocation actionability, and user responsibility. We connect our findings to related work on design frictions, microboundaries, and distributed cognition. We draw design implications for critical thinking interventions in AI-assisted knowledge work.

Paperid: 616, https://arxiv.org/pdf/2501.17099.pdf

Abstract:
The 'keyword method' is an effective technique for learning vocabulary of a foreign language. It involves creating a memorable visual link between what a word means and what its pronunciation in a foreign language sounds like in the learner's native language. However, these memorable visual links remain implicit in people's minds and are not easy to remember for a large set of words. To enhance the memorisation and recall of the vocabulary, we developed an application that combines the keyword method with text-to-image generators to externalise the memorable visual links into visuals. These visuals represent additional stimuli during the memorisation process. To explore the effectiveness of this approach, we first ran a pilot study to investigate how difficult it is to externalise the descriptions of mental visualisations of memorable links, by asking participants to write them down. We used these descriptions as prompts for a text-to-image generator (DALL-E2) to convert them into images and asked participants to select their favourites. Next, we compared different text-to-image generators (DALL-E2, Midjourney, Stable and Latent Diffusion) to evaluate the perceived quality of the images generated by each. Despite heterogeneous results, participants mostly preferred images generated by DALL-E2, which was also used for the final study. In this study, we investigated whether providing such images enhances the retention of vocabulary being learned, compared to the keyword method only. Our results indicate that people did not encounter difficulties describing their visualisations of memorable links and that providing corresponding images significantly improves memory retention.

Paperid: 617, https://arxiv.org/pdf/2501.16601.pdf

Abstract:
The Global South faces unique challenges in achieving digital inclusion due to a heavy reliance on mobile devices for internet access and the prevalence of slow or unreliable networks. While numerous studies have investigated web accessibility within specific sectors such as education, healthcare, and government services, these efforts have been largely constrained to individual countries or narrow contexts, leaving a critical gap in cross-regional, large-scale analysis. This paper addresses this gap by conducting the first large-scale comparative study of mobile web accessibility across the Global South. In this work, we evaluate 100,000 websites from 10 countries in the Global South to provide a comprehensive understanding of accessibility practices in these regions. Our findings reveal that websites from countries with strict accessibility regulations and enforcement tend to adhere better to the Web Content Accessibility Guidelines (WCAG). However, accessibility violations impact different disability groups in varying ways. Blind and low-vision individuals in the Global South are disproportionately affected, as only 40% of the evaluated websites meet critical accessibility guidelines. This significant shortfall is largely due to developers frequently neglecting to implement valid alt text for images and ARIA descriptions, which are essential specification mechanisms in the HTML standard for the effective operation of screen readers.
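
The alt-text shortfall mentioned in the abstract above can be illustrated with a small audit script. This is a minimal sketch using only Python's standard `html.parser` module; the class name `AltTextAudit`, the helper `audit`, and the sample markup are assumptions made for illustration and do not reflect the tooling used in the study. As a simplification, it counts an empty or whitespace-only `alt` as missing, although WCAG permits `alt=""` for purely decorative images.

```python
from html.parser import HTMLParser


class AltTextAudit(HTMLParser):
    """Counts <img> tags and how many lack non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.total_images = 0
        self.missing_alt = 0

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag != "img":
            return
        self.total_images += 1
        alt = dict(attrs).get("alt")
        # Simplification: treat absent, empty, or whitespace-only alt as missing.
        if alt is None or not alt.strip():
            self.missing_alt += 1


def audit(html: str) -> tuple[int, int]:
    """Return (total images, images without usable alt text)."""
    parser = AltTextAudit()
    parser.feed(html)
    parser.close()
    return parser.total_images, parser.missing_alt


# Hypothetical sample page fragment (assumption).
SAMPLE = (
    '<img src="a.png" alt="Chart of sales">'
    '<img src="b.png">'
    '<img src="c.png" alt="  "/>'
)

total, missing = audit(SAMPLE)
print(f"{missing} of {total} images lack alt text")
```

A full WCAG evaluation covers many more criteria (ARIA attributes, contrast, focus order); this sketch only shows the single check the abstract singles out.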

Paperid: 618, https://arxiv.org/pdf/2501.14084.pdf

Abstract:
Collaboration is a crucial part of computing education. The increase in AI capabilities over the last couple of years is bound to profoundly affect all aspects of systems and software engineering, including collaboration. In this position paper, we consider a scenario where AI agents would be able to take on any role in collaborative processes in computing education. We outline these roles, the activities and group dynamics that software development currently includes, and discuss if and in what way AI could facilitate these roles and activities. The goal of our work is to envision and critically examine potential futures. We present scenarios suggesting how AI can be integrated into existing collaborations. These are contrasted with design fictions that help demonstrate the new possibilities and challenges for computing education in the AI era.

Paperid: 619, https://arxiv.org/pdf/2501.13284.pdf

Abstract:
We introduce Toyteller, an AI-powered storytelling system where users generate a mix of story text and visuals by directly manipulating character symbols as if playing with toys. Anthropomorphized symbol motions can convey rich and nuanced social interactions; Toyteller leverages these motions (1) to let users steer story text generation and (2) as a visual output format that accompanies story text. We enabled motion-steered text generation and text-steered motion generation by mapping motions and text onto a shared semantic space so that large language models and motion generation models can use it as a translational layer. Technical evaluations showed that Toyteller outperforms a competitive baseline, GPT-4o. Our user study identified that toy-playing helps express intentions that are difficult to verbalize. However, motions alone could not express all user intentions, suggesting that they be combined with other modalities such as language. We discuss the design space of toy-playing interactions and implications for technical HCI research on human-AI interaction.

Paperid: 620, https://arxiv.org/pdf/2501.09099.pdf

Abstract:
In this paper, we present Drama Llama (DL), an LLM-powered storylets framework that supports the authoring of responsive, open-ended interactive stories. DL combines the structural benefits of storylet-based systems with the generative capabilities of large language models, enabling authors to create responsive interactive narratives while maintaining narrative control. Rather than crafting complex logical preconditions in a general-purpose or domain-specific programming language, authors define triggers in natural language that fire at appropriate moments in the story. Through a preliminary authoring study with six content authors, we present initial evidence that DL can generate coherent and meaningful narratives with believable character interactions. This work suggests directions for hybrid approaches that enhance authorial control while supporting emergent narrative generation through LLMs.

Paperid: 621, https://arxiv.org/pdf/2501.08406.pdf

Abstract:
Optimization models have been applied to solve a wide variety of decision-making problems. These models are usually developed by optimization experts but are used by practitioners without optimization expertise in various application domains. As a result, practitioners often struggle to interact with and draw useful conclusions from optimization models independently. To fill this gap, we introduce OptiChat, a natural language dialogue system designed to help practitioners interpret model formulation, diagnose infeasibility, analyze sensitivity, retrieve information, evaluate modifications, and provide counterfactual explanations. By augmenting large language models (LLMs) with function calls and code generation tailored for optimization models, we enable seamless interaction and minimize the risk of hallucinations in OptiChat. We develop a new dataset to evaluate OptiChat's performance in explaining optimization models. Experiments demonstrate that OptiChat effectively bridges the gap between optimization models and practitioners, delivering autonomous, accurate, and instant responses.

Paperid: 622, https://arxiv.org/pdf/2501.08102.pdf

Abstract:
Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using two open-source models: Gemma and Llama. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic similarity between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: Gemma shows a tendency toward negative emotion amplification, particularly anger, while maintaining certain positive emotions like optimism. Llama demonstrates superior emotional preservation across a broader spectrum of affects. Both models systematically generate responses with attenuated emotional intensity compared to human-authored content and show a bias toward positive emotions in response tasks. Additionally, both models maintain strong semantic similarity with original texts, though performance varies between continuation and response tasks. These findings provide insights into LLMs' emotional and semantic processing capabilities, with implications for their deployment in social media contexts and human-AI interaction design.

Paperid: 623, https://arxiv.org/pdf/2501.06488.pdf

Abstract:
Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, an NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the "same instance, similar representation" assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
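
The three agreement metrics named in this abstract (SRCC, PLCC, KRCC) compare a method's predicted quality scores against human ratings. The following is a minimal sketch of how they are commonly computed, not the authors' evaluation code. The score lists are made-up illustration data, and the Kendall variant shown is tau-a, which assumes there are no tied scores.

```python
from math import sqrt


def pearson(x, y):
    """PLCC: linear correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """KRCC (tau-a): (concordant - discordant) pairs over all pairs."""
    n = len(x)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                balance += 1
            elif product < 0:
                balance -= 1
    return balance / (n * (n - 1) / 2)


# Hypothetical data: predicted quality vs. human ratings for five scenes.
predicted = [1.0, 2.0, 3.0, 4.0, 10.0]
human = [1.2, 1.9, 3.5, 3.0, 5.0]

plcc = pearson(predicted, human)
srcc = spearman(predicted, human)
krcc = kendall_tau_a(predicted, human)
```

SRCC and KRCC depend only on the ordering of the scores, so the outlier at 10.0 affects PLCC but not the two rank-based values.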

Paperid: 624, https://arxiv.org/pdf/2501.05322.pdf

Abstract:
Public distrust of self-driving cars is growing. Studies emphasize the need for interpreting the behavior of these vehicles to passengers to promote trust in autonomous systems. Interpreters can enhance trust by improving transparency and reducing perceived risk. However, current solutions often lack a human-centric approach to integrating multimodal interpretations. This paper introduces a novel Human-centered Multimodal Interpreter (HMI) system that leverages human preferences to provide visual, textual, and auditory feedback. The system combines a visual interface with Bird's Eye View (BEV), map, and text display, along with voice interaction using a fine-tuned large language model (LLM). Our user study, involving diverse participants, demonstrated that the HMI system significantly boosts passenger trust in AVs, increasing average trust levels by over 8%, with trust in ordinary environments rising by up to 30%. These results underscore the potential of the HMI system to improve the acceptance and reliability of autonomous vehicles by providing clear, real-time, and context-sensitive explanations of vehicle actions.

Paperid: 625, https://arxiv.org/pdf/2501.03139.pdf

Abstract:
Scenario-based training has been widely adopted in many public service sectors. Recent advancements in Large Language Models (LLMs) have shown promise in simulating diverse personas to create these training scenarios. However, little is known about how LLMs can be developed to simulate victims for scenario-based training purposes. In this paper, we introduce VicSim (victim simulator), a novel model that addresses three key dimensions of user simulation: informational faithfulness, emotional dynamics, and language style (e.g., grammar usage). We pioneer the integration of scenario-based victim modeling with a GAN-based training workflow and key-information-based prompting, aiming to enhance the realism of simulated victims. Our adversarial training approach teaches the discriminator to recognize grammar and emotional cues as reliable indicators of synthetic content. According to evaluations by human raters, the VicSim model outperforms GPT-4 in terms of human-likeness.

Paperid: 626, https://arxiv.org/pdf/2506.20400.pdf

Abstract:
Multi-agent-based simulations (MABS) of electric vehicle (EV) home charging ecosystems generate large, complex, and stochastic time-series datasets that capture interactions between households, grid infrastructure, and energy markets. These interactions can lead to unexpected system-level events, such as transformer overloads or consumer dissatisfaction, that are difficult to detect and explain through static post-processing. This paper presents a modular, Python-based dashboard framework, built using Dash by Plotly, that enables efficient, multi-level exploration and root-cause analysis of emergent behavior in MABS outputs. The system features three coordinated views (System Overview, System Analysis, and Consumer Analysis), each offering high-resolution visualizations such as time-series plots, spatial heatmaps, and agent-specific drill-down tools. A case study simulating full EV adoption with smart charging in a Danish residential network demonstrates how the dashboard supports rapid identification and contextual explanation of anomalies, including clustered transformer overloads and time-dependent charging failures. The framework facilitates actionable insight generation for researchers and distribution system operators, and its architecture is adaptable to other distributed energy resources and complex energy systems.

Paperid: 627, https://arxiv.org/pdf/2506.15290.pdf

Abstract:
Motion capture using sparse inertial sensors has shown great promise due to its portability and lack of occlusion issues compared to camera-based tracking. Existing approaches typically assume that IMU sensors are tightly attached to the human body. However, this assumption often does not hold in real-world scenarios. In this paper, we present Garment Inertial Poser (GaIP), a method for estimating full-body poses from sparse and loosely attached IMU sensors. We first simulate IMU recordings using an existing garment-aware human motion dataset. Our transformer-based diffusion models synthesize loose IMU data and estimate human poses from this challenging loose IMU data. We also demonstrate that incorporating garment-related parameters during training on loose IMU data effectively maintains expressiveness and enhances the ability to capture variations introduced by looser or tighter garments. Our experiments show that our diffusion methods trained on simulated and synthetic data outperform state-of-the-art inertial full-body pose estimators, both quantitatively and qualitatively, opening up a promising direction for future research on motion capture from such realistic sensor placements.

Paperid: 628, https://arxiv.org/pdf/2506.11773.pdf

Abstract:
A major challenge in developing robust and generalizable Human Activity Recognition (HAR) systems for smart homes is the lack of large and diverse labeled datasets. Variations in home layouts, sensor configurations, and individual behaviors further exacerbate this issue. To address this, we leverage the idea of embodied AI agents: virtual agents that perceive and act within simulated environments guided by internal world models. We introduce AgentSense, a virtual data generation pipeline in which agents live out daily routines in simulated smart homes, with behavior guided by Large Language Models (LLMs). The LLM generates diverse synthetic personas and realistic routines grounded in the environment, which are then decomposed into fine-grained actions. These actions are executed in an extended version of the VirtualHome simulator, which we augment with virtual ambient sensors that record the agents' activities. Our approach produces rich, privacy-preserving sensor data that reflects real-world diversity. We evaluate AgentSense on five real HAR datasets. Models pretrained on the generated data consistently outperform baselines, especially in low-resource settings. Furthermore, combining the generated virtual sensor data with a small amount of real data achieves performance comparable to training on full real-world datasets. These results highlight the potential of using LLM-guided embodied agents for scalable and cost-effective sensor data generation in HAR.

Paperid: 629, https://arxiv.org/pdf/2506.05056.pdf

Abstract:
Maintaining an active lifestyle is vital for quality of life, yet challenging for wheelchair users. For instance, powered wheelchair users face increasing risks of obesity and deconditioning due to inactivity. Conversely, manual wheelchair users, who propel the wheelchair by pushing its handrims, often face upper extremity injuries from repetitive motions. These challenges underscore the need for a mobility system that promotes activity while minimizing injury risk. Maintaining optimal exertion during wheelchair use enhances health benefits and engagement, yet the variations in individual physiological responses complicate exertion optimization. To address this, we introduce PulseRide, a novel wheelchair system that provides personalized assistance based on each user's physiological responses, helping them maintain their physical exertion goals. Unlike conventional assistive systems focused on obstacle avoidance and navigation, PulseRide integrates real-time physiological data, such as heart rate and ECG, with wheelchair speed to deliver adaptive assistance. Using a human-in-the-loop reinforcement learning approach with a Deep Q-Network (DQN) algorithm, the system adjusts push assistance to keep users within a moderate activity range without under- or over-exertion. We conducted preliminary tests with 10 users on various terrains, including carpet and slate, to assess PulseRide's effectiveness. Our findings show that, for individual users, PulseRide maintains heart rates within the moderate activity zone as much as 71.7 percent longer than manual wheelchairs. Among all users, we observed an average reduction in muscle contractions of 41.86 percent, delaying fatigue onset and enhancing overall comfort and engagement. These results indicate that PulseRide offers a healthier, adaptive mobility solution, bridging the gap between passive and physically taxing mobility options.

Paperid: 630, https://arxiv.org/pdf/2506.04167.pdf

Abstract:
AI-based interactive assistants are advancing human-augmenting technology, yet their effects on users' mental and physiological states remain under-explored. We address this gap by analyzing how Copilot for Microsoft Word, an LLM-based assistant, impacts users. Using tasks ranging from objective (SAT reading comprehension) to subjective (personal reflection), and with measurements including fNIRS, Empatica E4, NASA-TLX, and questionnaires, we measure Copilot's effects on users. We also evaluate users' performance with and without Copilot across tasks. In objective tasks, participants reported a reduction of workload and an increase in enjoyment, which was paired with objective performance increases. Participants reported reduced workload and increased enjoyment with no change in performance in a creative poetry writing task. However, no benefits due to Copilot use were reported in a highly subjective self-reflection task. Although no physiological changes were recorded due to Copilot use, task-dependent differences in prefrontal cortex activation offer complementary insights into the cognitive processes associated with successful and unsuccessful human-AI collaboration. These findings suggest that AI assistants' effectiveness varies with task type, with decreased usefulness in particular for tasks that engage episodic memory, and present a brain-network-based hypothesis of human-AI collaboration.

Paperid: 631, https://arxiv.org/pdf/2506.02966.pdf

Abstract:
The study was conducted in an Advanced Quantitative Research Methods course involving 20 graduate students. During the course, student inquiries made to the AI were recorded and coded using Bloom's taxonomy and the CLEAR framework. A series of independent-samples t-tests and Poisson regression analyses were employed to analyse the characteristics of the questions asked by students with different backgrounds. Post-course interviews were conducted with 10 students to gain deeper insights into their perceptions. The findings revealed a U-shaped pattern in students' use of the AI assistant, with higher usage at the beginning and towards the end of the course, and a decrease in usage during the middle weeks. Most questions posed to the AI focused on knowledge and comprehension levels, with fewer questions involving deeper cognitive thinking. Students with a weaker mathematical foundation used the AI assistant more frequently, though their inquiries tended to lack explicit and logical structure compared to those with a strong mathematical foundation, who engaged less with the tool. These patterns suggest the need for targeted guidance to optimise the effectiveness of AI tools for students with varying levels of academic proficiency.

Paperid: 632, https://arxiv.org/pdf/2505.24658.pdf

Abstract:
Large language models (LLMs) are increasingly being used in conversational roles, yet little is known about how intimacy emerges in human-LLM interactions. Although previous work emphasized the importance of self-disclosure in human-chatbot interaction, it is questionable whether gradual and reciprocal self-disclosure is also helpful in human-LLM interaction. Thus, this study examined three possible aspects contributing to intimacy formation: gradual self-disclosure, reciprocity, and naturalness. Study 1 explored the impact of mutual, gradual self-disclosure with 29 users and a vanilla LLM. Study 2 adopted self-criticism methods for more natural responses and conducted a similar experiment with 53 users. Results indicate that gradual self-disclosure significantly enhances perceived social intimacy, regardless of persona reciprocity. Moreover, participants perceived utterances generated with self-criticism as more natural compared to those of vanilla LLMs; self-criticism fostered higher intimacy in early stages. Also, we observed that excessive empathetic expressions occasionally disrupted immersion, pointing to the importance of response calibration during intimacy formation.

Paperid: 633, https://arxiv.org/pdf/2505.21396.pdf

Abstract:
Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality. Our work highlights the potential of data-driven research idea generation, and underscores the practical utility of LLM-assisted ideation in real-world academic settings.

Paperid: 634, https://arxiv.org/pdf/2505.19317.pdf

Abstract:
Although popularized AI fairness metrics, e.g., demographic parity, have uncovered bias in AI-assisted decision-making outcomes, they do not consider how much effort one has spent to get to where one is today in the input feature space. However, the notion of effort is important in how Philosophy and humans understand fairness. We propose a philosophy-informed approach to conceptualize and evaluate Effort-aware Fairness (EaF), grounded in the concept of Force, which represents the temporal trajectory of predictive features coupled with inertia. Besides theoretical formulation, our empirical contributions include: (1) a pre-registered human subjects experiment, which shows that for both stages of the (individual) fairness evaluation process, people consider the temporal trajectory of a predictive feature more than its aggregate value; (2) pipelines to compute Effort-aware Individual/Group Fairness in the criminal justice and personal finance contexts. Our work may enable AI model auditors to uncover and potentially correct unfair decisions against individuals who have spent significant efforts to improve but are still stuck with systemic disadvantages outside their control.
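
For context, demographic parity, the outcome-only metric this abstract contrasts with Effort-aware Fairness, can be sketched in a few lines. The decisions and group labels below are made-up illustration data, and the function names are hypothetical. The sketch deliberately uses nothing but final decisions, which is the limitation the abstract points out: it has no notion of the trajectory that led to each individual's features.

```python
from collections import defaultdict


def positive_rates(decisions, groups):
    """Fraction of favourable (1) decisions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        positives[group] += int(decision)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(decisions, groups):
    """Largest difference in favourable-decision rate between any two groups."""
    rates = positive_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical data: 1 = favourable decision, 0 = unfavourable.
decisions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rates(decisions, groups)
gap = demographic_parity_gap(decisions, groups)
```

A gap of 0 means every group receives favourable decisions at the same rate; here group "a" is favoured in 3 of 4 cases and group "b" in 1 of 4.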

Paperid: 635, https://arxiv.org/pdf/2505.16724.pdf

Abstract:
Recent advances in large-scale pre-trained Electroencephalogram (EEG) models have shown great promise, driving progress in Brain-Computer Interfaces (BCIs) and healthcare applications. However, despite their success, many existing pre-trained models have struggled to fully capture the rich information content of neural oscillations, a limitation that fundamentally constrains their performance and generalizability across diverse BCI tasks. This limitation is frequently rooted in suboptimal architectural design choices which constrain their representational capacity. In this work, we introduce LaBraM++, an enhanced Large Brainwave Foundation Model (LBM) that incorporates principled improvements grounded in robust signal processing foundations. LaBraM++ demonstrates substantial gains across a variety of tasks, consistently outperforming the architecture on which it is based and achieving competitive results when compared to other open-source LBMs. Its superior performance and training efficiency highlight its potential as a strong foundation for future advancements in LBMs.

Paperid: 636, https://arxiv.org/pdf/2505.14349.pdf

Abstract:
Voting methods are an instrumental design element of democracies. Citizens use them to express and aggregate their preferences to reach a collective decision. However, voting outcomes can be as sensitive to voting rules as they are to people's voting choices. Despite the significance of voting methods and the inter-disciplinary scientific progress on them, several democracies keep relying on outdated voting methods that do not fit modern, pluralistic societies well, while lacking social innovation. Here, we demonstrate how one can upgrade real-world democracies, namely by using alternative preferential voting methods such as cumulative voting and the method of equal shares designed for a proportional representation of voters' preferences. By rigorously assessing a new participatory budgeting approach applied in the city of Aarau, Switzerland, we unravel the striking voting outcomes of fair voting methods: more winning projects with the same budget and broader geographic and preference representation of citizens by the elected projects, in particular for voters who used to be under-represented, while promoting novel project ideas. We provide profound causal evidence showing that citizens prefer proportional voting methods, which possess strong legitimacy without the need for highly technical, specialized explanations. We also reveal strong underlying democratic values exhibited by citizens who support fair voting methods such as altruism and compromise. These findings come with a global momentum to unleash a new and long-awaited participation blueprint of how to upgrade democracies.

Paperid: 637, https://arxiv.org/pdf/2505.08648.pdf

Abstract:
Software development is a cognitively intensive process requiring multitasking, adherence to evolving workflows, and continuous learning. With the rise of large language model (LLM)-based tools, such as conversational agents (CAs), there is growing interest in supporting developers through natural language interaction. However, little is known about the specific features developers seek in these systems. We conducted a user study with 29 developers using a prototype text-based chatbot to investigate preferred functionalities. Our findings reveal strong interest in task automation, version control support, and contextual adaptability, especially the need to tailor assistance for both novice and experienced users. We highlight the importance of deep contextual understanding, historical interaction awareness, and personalized support in CA design. This study contributes to the development of context-aware chatbots that enhance productivity and satisfaction, and it outlines opportunities for future research on human-AI collaboration in software engineering.

Paperid: 638, https://arxiv.org/pdf/2505.04066.pdf

Abstract:
We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.
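
The two-model pipeline described in this abstract (a small model that decides when to respond, a larger model that writes the response) can be pictured as a gate in front of a generator. The sketch below only illustrates that control flow and is not the LlamaPIE implementation: `proactive_step`, `toy_gate`, and `toy_generator` are hypothetical names, and the two toy functions stand in for real models.

```python
from typing import Callable, List, Optional


def proactive_step(
    history: List[str],
    should_respond: Callable[[List[str]], bool],
    generate: Callable[[List[str]], str],
) -> Optional[str]:
    """Run the cheap gate on every turn; call the costly generator only if it fires."""
    if not should_respond(history):
        return None  # stay silent and do not interrupt the conversation
    return generate(history)


def toy_gate(history: List[str]) -> bool:
    """Stand-in for the small model: fire when the last turn is a question."""
    return history[-1].rstrip().endswith("?")


def toy_generator(history: List[str]) -> str:
    """Stand-in for the larger model: return a fixed short hint."""
    return "Hint for: " + history[-1]


silent = proactive_step(["Nice weather today."], toy_gate, toy_generator)
whisper = proactive_step(["When was the meeting?"], toy_gate, toy_generator)
```

Splitting the decision from the generation means the expensive model runs only on the turns where the gate fires, which matters for the real-time, on-device setting the abstract describes.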

Paperid: 639, https://arxiv.org/pdf/2504.20340.pdf

Abstract:
With AI-generated content becoming ubiquitous across the web, social media, and other digital platforms, it is vital to examine how such content is inspired and generated. The creation of AI-generated images often involves refining the input prompt iteratively to achieve desired visual outcomes. This study focuses on the relatively underexplored concept of image regeneration using AI, in which a human operator attempts to closely recreate a specific target image by iteratively refining their prompt. Image regeneration is distinct from normal image generation, which lacks any predefined visual reference. A separate challenge lies in determining whether existing image similarity metrics (ISMs) can provide reliable, objective feedback in iterative workflows, given that we do not fully understand if subjective human judgments of similarity align with these metrics. Consequently, we must first validate their alignment with human perception before assessing their potential as a feedback mechanism in the iterative prompt refinement process. To address these research gaps, we present a structured user study evaluating how iterative prompt refinement affects the similarity of regenerated images relative to their targets, while also examining whether ISMs capture the same improvements perceived by human observers. Our findings suggest that incremental prompt adjustments substantially improve alignment, verified through both subjective evaluations and quantitative measures, underscoring the broader potential of iterative workflows to enhance generative AI content creation across various application domains.

Paperid: 640, https://arxiv.org/pdf/2504.20294.pdf

Abstract:
A key feature of human collaboration is the ability to iteratively refine the concepts we have communicated. In contrast, while generative AI excels at the generation of content, it often struggles to make specific language-guided modifications of its prior outputs. To bridge the gap between how humans and machines perform edits, we present mrCAD, a dataset of multimodal instructions in a communication game. In each game, players created computer aided designs (CADs) and refined them over several rounds to match specific target designs. Only one player, the Designer, could see the target, and they had to instruct the other player, the Maker, using text, drawing, or a combination of modalities. mrCAD consists of 6,082 communication games and 15,163 instruction-execution rounds, played between 1,092 pairs of human players. We analyze the dataset and find that generation and refinement instructions differ in their composition of drawing and text. Using the mrCAD task as a benchmark, we find that state-of-the-art VLMs are better at following generation instructions than refinement instructions. These results lay a foundation for analyzing and modeling a multimodal language of refinement that is not represented in previous datasets.

Paperid: 641, https://arxiv.org/pdf/2504.12477.pdf

Abstract:
This paper presents a Large Language Model (LLM) based conversational agent system designed to enhance human-machine collaboration in Machine Learning Operations (MLOps). We introduce the Swarm Agent, an extensible architecture that integrates specialized agents to create and manage ML workflows through natural language interactions. The system leverages a hierarchical, modular design incorporating a KubeFlow Pipelines (KFP) Agent for ML pipeline orchestration, a MinIO Agent for data management, and a Retrieval-Augmented Generation (RAG) Agent for domain-specific knowledge integration. Through iterative reasoning loops and context-aware processing, the system enables users with varying technical backgrounds to discover, execute, and monitor ML pipelines; manage datasets and artifacts; and access relevant documentation, all via intuitive conversational interfaces. Our approach addresses the accessibility gap in complex MLOps platforms like Kubeflow, making advanced ML tools broadly accessible while maintaining the flexibility to extend to other platforms. The paper describes the architecture, implementation details, and demonstrates how this conversational MLOps assistant reduces complexity and lowers barriers to entry for users across diverse technical skill levels.

Paperid: 642, https://arxiv.org/pdf/2504.10180.pdf

Abstract:
Effective chart design is essential for satisfying viewers' information needs, such as retrieving values from a chart or comparing two values. However, creating effective charts is challenging and time-consuming due to the large design space and the inter-dependencies between individual design parameters. To address this challenge, we propose ChartOptimiser -- a Bayesian approach for task-driven optimisation of charts, such as bar charts. At the core of ChartOptimiser is a novel objective function to automatically optimise an eight-dimensional design space combining four perceptual metrics: visual saliency, text legibility, colour preference, and white space ratio. Through empirical evaluation on 12 bar charts and four common analytical tasks -- finding the extreme value, retrieving a value, comparing two values, and computing a derived value -- we show that ChartOptimiser outperforms existing design baselines concerning task-solving ease, visual aesthetics, and chart clarity. We also discuss two practical applications of ChartOptimiser: generating charts for accessibility and content localisation. Taken together, ChartOptimiser opens up an exciting new research direction in automated chart design where charts are optimised for users' information needs, preferences, and contexts.

Paperid: 643, https://arxiv.org/pdf/2504.06751.pdf

Abstract:
This paper proposes an innovative technique for representing multidimensional datasets using icons inspired by Chernoff faces. Our approach combines classical projection techniques with the explicit assignment of selected data dimensions to avatar (facial) features, leveraging the innate human ability to interpret facial traits. We introduce a semantic division of data dimensions into intuitive and technical categories, assigning the former to avatar features and projecting the latter into a four-dimensional (or higher) spatial embedding. The technique is implemented as a plugin for the open-source dpVision visualization platform, enabling users to interactively explore data in the form of a swarm of avatars whose spatial positions and visual features jointly encode various aspects of the dataset. Experimental results with synthetic test data and a 12-dimensional dataset of Portuguese Vinho Verde wines demonstrate that the proposed method enhances interpretability and facilitates the analysis of complex data structures.

Paperid: 644, https://arxiv.org/pdf/2504.06517.pdf

Abstract:
The widespread emergence of manipulated news media content poses significant challenges to online information integrity. This study investigates whether dialogues with AI about AI-generated images and associated news statements can increase human discernment abilities and foster short-term learning in detecting misinformation. We conducted a study with 80 participants who engaged in structured dialogues with an AI system about news headline-image pairs, generating 1,310 human-AI dialogue exchanges. Results show that AI interaction significantly boosts participants' accuracy in identifying real versus fake news content from approximately 60% to 90% (p < 0.001). However, these improvements do not persist when participants are presented with new, unseen image-statement pairs without AI assistance, with accuracy returning to baseline levels (~60%, p = 0.88). These findings suggest that while AI systems can effectively change immediate beliefs about specific content through persuasive dialogue, they may not produce lasting improvements that transfer to novel examples, highlighting the need for developing more effective interventions that promote durable learning outcomes.

Paperid: 645, https://arxiv.org/pdf/2503.18562.pdf

Abstract:
This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare. Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
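As a rough illustration of the Brier score mentioned above, the following minimal sketch scores self-reported confidences against binary correctness. It assumes each answer is marked correct (1) or incorrect (0) and that the model reports a confidence in [0, 1]; the function name and the sample values are hypothetical and are not taken from the study.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and correctness (0 or 1)."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical example: high stated confidence, but only half the answers correct.
stated = [0.9, 0.8, 0.9, 0.7]
correct = [1, 0, 1, 0]

score = brier_score(stated, correct)
# A positive gap between mean confidence and accuracy indicates overconfidence.
overconfidence = sum(stated) / len(stated) - sum(correct) / len(correct)
```

Lower Brier scores are better; a model that always answers with confidence 0.5 scores 0.25 regardless of correctness, which gives a simple reference point.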

Paperid: 646, https://arxiv.org/pdf/2503.14893.pdf

Abstract:
Life cycle assessment (LCA) is a methodology for holistically measuring the environmental impact of a product from initial manufacturing to end-of-life disposal. However, the extent to which LCA informs the design of computing devices remains unclear. To understand how this information is collected and applied, we interviewed 17 industry professionals with experience in LCA or electronics design, systematically coded the interviews, and investigated common themes. These themes highlight the challenge of LCA data collection and reveal distributed decision-making processes where responsibility for sustainable design choices, and their associated costs, is often ambiguous. Our analysis identifies opportunities for HCI technologies to support LCA computation and its integration into the design process to facilitate sustainability-oriented decision-making. While this work provides a nuanced discussion about sustainable design in the information and communication technologies (ICT) hardware industry, we hope our insights will also be valuable to other sectors.

Paperid: 647, https://arxiv.org/pdf/2503.13472.pdf

Abstract:
Children with neurodevelopmental disorders require timely intervention to improve long-term outcomes, yet early screening remains inaccessible in many regions. A scalable solution integrating standardized assessments with physiological data collection, such as electroencephalogram (EEG) recordings, could enable early detection in routine settings by non-specialists. To address this, we introduce NeuroNest, a mobile and cloud-based platform for large-scale EEG data collection, neurodevelopmental screening, and research. We provide a comprehensive review of existing behavioral and biomarker-based approaches, consumer-grade EEG devices, and emerging machine learning techniques. NeuroNest integrates low-cost EEG devices with digital screening tools, establishing a scalable, open-source infrastructure for non-invasive data collection, automated analysis, and interoperability across diverse hardware. Beyond the system architecture and reference implementation, we highlight key challenges in EEG data standardization, device interoperability, and bridging behavioral and physiological assessments. Our findings emphasize the need for future research on standardized data exchange, algorithm validation, and ecosystem development to expand screening accessibility. By providing an extensible, open-source system, NeuroNest advances machine learning-based early detection while fostering collaboration in screening technologies, clinical applications, and public health.

Paperid: 648, https://arxiv.org/pdf/2503.13205.pdf

Abstract:
Inpatient pathways demand complex clinical decision-making based on comprehensive patient information, posing critical challenges for clinicians. Despite advancements in large language models (LLMs) in medical applications, limited research has focused on artificial intelligence (AI) inpatient pathways systems, due to the lack of large-scale inpatient datasets. Moreover, existing medical benchmarks typically concentrate on medical question-answering and examinations, ignoring the multifaceted nature of clinical decision-making in inpatient settings. To address these gaps, we first developed the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC-IV database, encompassing 51,274 cases across nine triage departments and 17 major disease categories alongside 16 standardized treatment options. Then, we proposed the Multi-Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents: a triage agent managing patient admission, a diagnosis agent serving as the primary decision maker at the department, and a treatment agent providing treatment plans. Additionally, our MAP framework includes a chief agent overseeing the inpatient pathways to guide and promote these three clinician agents. Extensive experiments showed that MAP improved diagnosis accuracy by 25.10% compared to the state-of-the-art LLM HuatuoGPT2-13B. Notably, MAP demonstrated significant clinical compliance, outperforming three board-certified clinicians by 10%-12%, establishing a foundation for inpatient pathways systems.

Paperid: 649, https://arxiv.org/pdf/2503.08100.pdf

Abstract:
Predicting performance outcomes has the potential to transform training approaches, inform coaching strategies, and deepen our understanding of the factors that contribute to athletic success. Traditional non-automated data analysis in sports is often difficult to scale. To address this gap, this study analyzes factors influencing athletic performance by leveraging passively collected sensor data from smartwatches and ecological momentary assessments (EMA). The study aims to differentiate between 14 collegiate volleyball players who go on to perform well or poorly, using data collected prior to the beginning of the season. This is achieved through an integrated feature set creation approach. The model, validated using leave-one-subject-out cross-validation, achieved promising predictive performance (F1 score = 0.75). Importantly, by utilizing data collected before the season starts, our approach offers an opportunity for players predicted to perform poorly to improve their projected outcomes through targeted interventions by virtue of daily model predictions. The findings from this study not only demonstrate the potential of machine learning in sports performance prediction but also shed light on key features along with subjective psycho-physiological states that are predictive of, or associated with, athletic success.
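The leave-one-subject-out validation named in this abstract can be sketched as follows. This is only an illustration of the splitting scheme, assuming each sample carries a subject identifier; the function name and the toy identifiers are hypothetical, and the study's actual features and model are not shown.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    All samples from one subject form the test fold, so the model is never
    evaluated on a person it has already seen during training.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical example: five daily samples from three players.
sample_subjects = ["a", "a", "b", "c", "c"]
folds = list(leave_one_subject_out(sample_subjects))
```

With 14 players, this scheme would produce 14 folds, each holding out every sample from a single player.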

Paperid: 650, https://arxiv.org/pdf/2503.05965.pdf

Abstract:
The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human--judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings "reasonable" or "correct". We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item. In this paper, we introduce a framework for validating LLM-as-a-judge systems under rating indeterminacy. We draw theoretical connections between different measures of judge system performance under different human--judge agreement metrics, and different rating elicitation and aggregation schemes. We demonstrate that differences in how humans and LLMs resolve rating indeterminacy while responding to forced-choice rating instructions heavily bias LLM-as-a-judge validation. Through extensive experiments involving 11 real-world rating tasks and 8 commercial LLMs, we show that standard validation approaches that rely upon forced-choice ratings select judge systems that are highly suboptimal, performing as much as 30% worse than judge systems selected by our approach that uses multi-label "response set" ratings to account for rating indeterminacy. We conclude with concrete recommendations for more principled approaches to LLM-as-a-judge validation.

Paperid: 651, https://arxiv.org/pdf/2503.05405.pdf

Abstract:
Optimal input settings vary across users due to differences in motor abilities and personal preferences, which are typically addressed by manual tuning or calibration. Although human-in-the-loop optimization has the potential to identify optimal settings during use, it is rarely applied due to its long optimization process. A more efficient approach would continually leverage data from previous users to accelerate optimization, exploiting shared traits while adapting to individual characteristics. We introduce the concept of Continual Human-in-the-Loop Optimization and a Bayesian optimization-based method that leverages a Bayesian-neural-network surrogate model to capture population-level characteristics while adapting to new users. We propose a generative replay strategy to mitigate catastrophic forgetting. We demonstrate our method by optimizing virtual reality keyboard parameters for text entry using direct touch, showing reduced adaptation times with a growing user base. Our method opens the door for next-generation personalized input systems that improve with accumulated experience.

Paperid: 652, https://arxiv.org/pdf/2503.04290.pdf

Abstract:
Hackathons have become popular collaborative events for accelerating the development of creative ideas and prototypes. There are several case studies showcasing creative outcomes across domains such as industry, education, and research. However, there are no large-scale studies on creativity in hackathons which can advance theory on how hackathon formats lead to creative outcomes. We conducted a computational analysis of 193,353 hackathon projects. By operationalizing creativity through usefulness and novelty, we refined our dataset to 10,363 projects, allowing us to analyze how participant characteristics, collaboration patterns, and hackathon setups influence the development of creative projects. The contribution of our paper is twofold: We identified means for organizers to foster creativity in hackathons. We also explore the use of large language models (LLMs) to augment the evaluation of creative outcomes and discuss challenges and opportunities of doing this, which has implications for creativity research at large.

Paperid: 653, https://arxiv.org/pdf/2502.19082.pdf

Abstract:
Adolescents heavily rely on social media to build and maintain close relationships, yet current platform designs often make self-disclosure feel risky or uncomfortable. Through a three-part study involving 19 teens aged 13-18, we identify key barriers to meaningful self-disclosure on social media. Our findings reveal that while these adolescents seek casual, frequent sharing to strengthen relationships, existing platform norms often discourage such interactions. Based on our co-design interview findings, we propose platform design ideas to foster a more dynamic and nuanced privacy experience for teen social media users. We then introduce trust-enabled privacy as a framework that recognizes trust -- whether building or eroding -- as central to boundary regulation, and foregrounds the role of platform design in shaping the very norms and interaction patterns that influence how trust unfolds. When trust is supported, boundary regulation becomes more adaptive and empowering; when it erodes, users resort to self-censorship or disengagement. This work provides empirical insights and actionable guidelines for designing social media spaces where teens feel empowered to engage in meaningful relationship-building processes.

Paperid: 654, https://arxiv.org/pdf/2502.17445.pdf

Abstract:
Fuzzy logic provides a robust framework for enhancing explainability, particularly in domains requiring the interpretation of complex and ambiguous signals, such as brain-computer interface (BCI) systems. Despite significant advances in deep learning, interpreting human emotions remains a formidable challenge. In this work, we present iFuzzyAffectDuo, a novel computational model that integrates a dual-filter fuzzy neural network architecture for improved detection and interpretation of emotional states from neuroimaging data. The model introduces a new membership function (MF) based on the Laplace distribution, achieving superior accuracy and interpretability compared to traditional approaches. By refining the extraction of neural signals associated with specific emotions, iFuzzyAffectDuo offers a human-understandable framework that unravels the underlying decision-making processes. We validate our approach across three neuroimaging datasets using functional Near-Infrared Spectroscopy (fNIRS) and Electroencephalography (EEG), demonstrating its potential to advance affective computing. These findings open new pathways for understanding the neural basis of emotions and their application in enhancing human-computer interaction.
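To make the idea of a Laplace-based membership function concrete, here is a minimal sketch. The abstract does not give the exact functional form, so the unnormalised Laplace kernel exp(-|x - c| / b) used below is an assumption, as are the function and parameter names; a Gaussian membership function is included only for comparison.

```python
import math


def laplace_mf(x, center, scale):
    """Assumed Laplace-shaped membership: 1 at the center, exponential tails."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.exp(-abs(x - center) / scale)


def gaussian_mf(x, center, sigma):
    """Conventional Gaussian membership function, shown for comparison."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


# Far from the center, the Laplace form decays more slowly than the Gaussian,
# so distant inputs keep a small but non-negligible membership degree.
far_laplace = laplace_mf(4.0, center=0.0, scale=1.0)
far_gaussian = gaussian_mf(4.0, center=0.0, sigma=1.0)
```

Under this assumed form, the membership has a sharp peak at the center and heavier tails than a Gaussian with a comparable width.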

Paperid: 655, https://arxiv.org/pdf/2502.17118.pdf

Abstract:
Photoinduced electronic transitions are complex quantum-mechanical processes where electrons move between energy levels due to light absorption. This induces dynamics in electronic structure and nuclear geometry, driving important physical and chemical processes in fields like photobiology, materials design, and medicine. The evolving electronic structure can be characterized by two electron density fields: hole and particle natural transition orbitals (NTOs). Studying these density fields helps understand electronic charge movement between donor and acceptor regions within a molecule. Previous works rely on side-by-side visual comparisons of isosurfaces, statistical approaches, or bivariate field analysis with few instances. We propose a new method to analyze time-varying bivariate fields with many instances, which is relevant for understanding electronic structure changes during light-induced dynamics. Since NTO fields depend on nuclear geometry, the nuclear motion results in numerous time steps to analyze. This paper presents a structured approach to feature-directed visual exploration of time-varying bivariate fields using continuous scatterplots (CSPs) and image moment-based descriptors, tailored for studying evolving electronic structures post-photoexcitation. The CSP of the bivariate field at each time step is represented by a four-length image moment vector. The collection of all vector descriptors forms a point cloud in R^4, visualized using principal component analysis. Selecting appropriate principal components results in a representation of the point cloud as a curve on the plane, aiding tasks such as identifying key time steps, recognizing patterns within the bivariate field, and tracking the temporal evolution. We demonstrate this with two case studies on excited-state molecular dynamics, showing how bivariate field analysis provides application-specific insights.

Paperid: 656, https://arxiv.org/pdf/2502.15140.pdf

Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in various educational tasks, yet their alignment with human learning patterns, particularly in predicting which incorrect options students are most likely to select in multiple-choice questions (MCQs), remains underexplored. Our work investigates the relationship between LLM generation likelihood and student response distributions in MCQs with a specific focus on distractor selections. We collect a comprehensive dataset of MCQs with real-world student response distributions to explore two fundamental research questions: (1) RQ1 - Do the distractors that students more frequently select correspond to those that LLMs assign higher generation likelihood to? (2) RQ2 - When an LLM selects an incorrect choice, does it choose the same distractor that most students pick? Our experiments reveal moderate correlations between LLM-assigned probabilities and student selection patterns for distractors in MCQs. Additionally, when LLMs make mistakes, they are more likely to select the same incorrect answers that commonly mislead students, which is a pattern consistent across both small and large language models. Our work provides empirical evidence that despite LLMs' strong performance on generating educational content, there remains a gap between LLMs' underlying reasoning process and human cognitive processes in identifying confusing distractors. Our findings also have significant implications for educational assessment development. The smaller language models could be efficiently utilized for automated distractor generation as they demonstrate similar patterns in identifying confusing answer choices as larger language models. This observed alignment between LLMs and student misconception patterns opens new opportunities for generating high-quality distractors that complement traditional human-designed distractors.

Paperid: 657, https://arxiv.org/pdf/2502.15127.pdf

Abstract:
As artificial intelligence systems become increasingly prevalent in education, a fundamental challenge emerges: how can we verify if an AI truly understands how students think and reason? Traditional evaluation methods like measuring learning gains require lengthy studies confounded by numerous variables. We present a novel evaluation framework based on a two-phase Turing-like test. In Phase 1, students provide open-ended responses to questions, revealing natural misconceptions. In Phase 2, both AI and human experts, conditioned on each student's specific mistakes, generate distractors for new related questions. By analyzing whether students select AI-generated distractors at rates similar to human expert-generated ones, we can validate if the AI models student cognition. We prove this evaluation must be conditioned on individual responses - unconditioned approaches merely target common misconceptions. Through rigorous statistical sampling theory, we establish precise requirements for high-confidence validation. Our research positions conditioned distractor generation as a probe into an AI system's fundamental ability to model student thinking - a capability that enables adapting tutoring, feedback, and assessments to each student's specific needs.

Paperid: 658, https://arxiv.org/pdf/2502.11208.pdf

Abstract:
The comprehensibility and reliability of data download packages (DDPs) provided under the General Data Protection Regulation's (GDPR) right of access are vital for both individuals and researchers. These DDPs enable users to understand and control their personal data, yet issues like complexity and incomplete information often limit their utility. Also, despite their growing use in research to study emerging online phenomena, little attention has been given to systematically assessing the reliability and comprehensibility of DDPs. To bridge this research gap, in this work, we perform a comparative analysis to assess the comprehensibility and reliability of DDPs provided by three major social media platforms, namely, TikTok, Instagram, and YouTube. By recruiting 400 participants across four countries, we assess the comprehensibility of DDPs across various requirements, including conciseness, transparency, intelligibility, and clear and plain language. Also, by leveraging automated bots and user-donated DDPs, we evaluate the reliability of DDPs across the three platforms. Among other things, we find notable differences across the three platforms in the data categories included in DDPs, inconsistencies in adherence to the GDPR requirements, and gaps in the reliability of the DDPs across platforms. Finally, using large language models, we demonstrate the feasibility of easily providing more comprehensible DDPs.

Paperid: 659, https://arxiv.org/pdf/2502.09973.pdf

Abstract:
Digital capturing of memorable personal items is a key way to archive personal memories. Although current digitization methods (e.g., photos, videos, 3D scanning) can replicate the physical appearance of an item, they often cannot preserve its real-world interactivity. We present Interactive Digital Item (IDI), a concept of reconstructing both the physical appearance and, more importantly, the interactivity of an item. We first conducted a formative study to understand users' expectations of IDI, identifying key physical interactivity features, including geometry, interfaces, and embedded content of items. Informed by these findings, we developed InteRecon, an AR prototype enabling personal reconstruction functions for IDI creation. An exploratory study was conducted to assess the feasibility of using InteRecon and explore the potential of IDI to enrich personal memory archives. Results show that InteRecon is feasible for IDI creation, and the concept of IDI brings new opportunities for augmenting personal memory archives.

Paperid: 660, https://arxiv.org/pdf/2502.09914.pdf

Abstract:
In this paper, a quantitative evaluation model for the color quality of human-computer interaction interfaces is proposed by combining deep convolutional neural networks (CNN). By extracting multidimensional features of interface images, including hue, brightness, purity, etc., CNN is used for efficient feature modeling and quantitative analysis, and the relationship between interface design and user perception is studied. The experiment is based on multiple international mainstream website interface datasets, covering e-commerce platforms, social media, education platforms, etc., and verifies the evaluation effect of the model on indicators such as contrast, clarity, color coordination, and visual appeal. The results show that the CNN evaluation is highly consistent with the user rating, with a correlation coefficient of up to 0.96, and it also shows high accuracy in mean square error and absolute error. Compared with traditional experience-based evaluation methods, the proposed model can efficiently and scientifically capture the visual characteristics of the interface and avoid the influence of subjective factors. Future research can explore the introduction of multimodal data (such as text and interactive behavior) into the model to further enhance the evaluation ability of dynamic interfaces and expand it to fields such as smart homes, medical systems, and virtual reality. This paper provides new methods and new ideas for the scientific evaluation and optimization of interface design.

Paperid: 661, https://arxiv.org/pdf/2502.05554.pdf

Abstract:
Understanding cross-subject and cross-device consistency in visual fixation prediction is essential for advancing eye-tracking applications, including visual attention modeling and neuroprosthetics. This study evaluates fixation consistency using an embedded eye tracker integrated into regular-sized glasses, comparing its performance with high-end standalone eye-tracking systems. Nine participants viewed 300 images from the MIT1003 dataset in subjective experiments, allowing us to analyze cross-device and cross-subject variations in fixation patterns with various evaluation metrics. Our findings indicate that average visual fixations can be reliably transferred across devices for relatively simple stimuli. However, individual-to-average consistency remains weak, highlighting the challenges of predicting individual fixations across devices. These results provide an empirical foundation for leveraging predicted average visual fixation data to enhance neuroprosthetic applications.

Paperid: 662, https://arxiv.org/pdf/2502.05324.pdf

Abstract:
The prevailing methodologies for visualizing AI risks have focused on technical issues such as data biases and model inaccuracies, often overlooking broader societal risks like job loss and surveillance. Moreover, these visualizations are typically designed for tech-savvy individuals, neglecting those with limited technical skills. To address these challenges, we propose the Atlas of AI Risks, a narrative-style tool designed to map the broad risks associated with various AI technologies in a way that is understandable to non-technical individuals as well. To both develop and evaluate this tool, we conducted two crowdsourcing studies. The first, involving 40 participants, identified the design requirements for visualizing AI risks for decision-making and guided the development of the Atlas. The second study, with 140 participants reflecting the US population in terms of age, sex, and ethnicity, assessed the usability and aesthetics of the Atlas to ensure it met those requirements. Using facial recognition technology as a case study, we found that the Atlas is more user-friendly than a baseline visualization, with a more classic and expressive aesthetic, and is more effective in presenting a balanced assessment of the risks and benefits of facial recognition. Finally, we discuss how our design choices make the Atlas adaptable for broader use, allowing it to generalize across the diverse range of technology applications represented in a database that reports various AI incidents.

Paperid: 663, https://arxiv.org/pdf/2502.01991.pdf

Abstract:
Nowadays, social media is pivotal in shaping public discourse, especially on polarizing issues like vaccination, where diverse moral perspectives influence individual opinions. In NLP, data scarcity and the complexity of psycholinguistic tasks, such as identifying morality frames, make relying solely on human annotators costly, time-consuming, and prone to inconsistency due to cognitive load. To address these issues, we leverage large language models (LLMs), which are adept at adapting to new tasks through few-shot learning, utilizing a handful of in-context examples coupled with explanations that connect examples to task principles. Our research explores LLMs' potential to assist human annotators in identifying morality frames within vaccination debates on social media. We employ a two-step process: generating concepts and explanations with LLMs, followed by human evaluation using a "think-aloud" tool. Our study shows that integrating LLMs into the annotation process enhances accuracy, reduces task difficulty, and lowers cognitive load, suggesting a promising avenue for human-AI collaboration in complex psycholinguistic tasks.

Paperid: 664, https://arxiv.org/pdf/2502.01608.pdf

Abstract:
Browser fingerprinting is a pervasive online tracking technique used increasingly often for profiling and targeted advertising. Prior research on the prevalence of fingerprinting heavily relied on automated web crawls, which inherently struggle to replicate the nuances of human-computer interactions. This raises concerns about the accuracy of current understandings of real-world fingerprinting deployments. To address this, this paper presents a user study involving 30 participants over 10 weeks, capturing telemetry data from real browsing sessions across 3,000 top-ranked websites. Our evaluation reveals that automated crawls miss almost half (45%) of the fingerprinting websites encountered by real users. This discrepancy mainly stems from the crawlers' inability to access authentication-protected pages, circumvent bot detection, and trigger fingerprinting scripts activated by specific user interactions. We also identify potential new fingerprinting vectors present in real user data but absent from automated crawls. Finally, we evaluate the effectiveness of federated learning for training browser fingerprinting detection models on real user data, yielding better performance than models trained solely on automated crawl data.

Paperid: 665, https://arxiv.org/pdf/2501.17855.pdf

Abstract:
Robot caregiving should be personalized to meet the diverse needs of care recipients -- assisting with tasks as needed, while taking user agency in action into account. In physical tasks such as handover, bathing, dressing, and rehabilitation, a key aspect of this diversity is the functional range of motion (fROM), which can vary significantly between individuals. In this work, we learn to predict personalized fROM as a way to generalize robot decision-making in a wide range of caregiving tasks. We propose a novel data-driven method for predicting personalized fROM using functional assessment scores from occupational therapy. We develop a neural model that learns to embed functional assessment scores into a latent representation of the user's physical function. The model is trained using motion capture data collected from users with emulated mobility limitations. After training, the model predicts personalized fROM for new users without motion capture. Through simulated experiments and a real-robot user study, we show that the personalized fROM predictions from our model enable the robot to provide personalized and effective assistance while improving the user's agency in action. See our website for more visualizations: https://emprise.cs.cornell.edu/grace/.

Paperid: 666, https://arxiv.org/pdf/2501.17420.pdf

Abstract:
While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases.

Paperid: 667, https://arxiv.org/pdf/2501.15347.pdf

Abstract:
Regression-based decoding of continuous movements is essential for human-machine interfaces (HMIs), such as prosthetic control. This study explores a feature-based approach to encoding Surface Electromyography (sEMG) signals, focusing on the role of variability in neural-inspired population encoding. By employing heterogeneous populations of Leaky Integrate-and-Fire (LIF) neurons with varying sizes and diverse parameter distributions, we investigate how population size and variability in encoding parameters, such as membrane time constants and thresholds, influence decoding performance. Using a simple linear readout, we demonstrate that variability improves robustness and generalizability compared to single-neuron encoders. These findings emphasize the importance of optimizing variability and population size for efficient and scalable regression tasks in spiking neural networks (SNNs), paving the way for robust, low-power HMI implementations.
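The population-encoding idea described above can be sketched with a small simulation. This is a minimal illustration under stated assumptions, not the study's implementation: it uses a simple Euler-integrated LIF neuron with reset to zero, draws time constants and thresholds from hypothetical uniform ranges, and treats the input as an already rectified, normalised feature. The linear readout is omitted.

```python
import random


def lif_spike_count(signal, tau, threshold, dt=0.001):
    """Euler-integrate one leaky integrate-and-fire neuron; return its spike count."""
    v = 0.0
    spikes = 0
    for current in signal:
        v += dt * (current - v) / tau  # leaky integration toward the input level
        if v >= threshold:
            spikes += 1
            v = 0.0  # reset the membrane potential after a spike
    return spikes


def make_population(n, seed=0):
    """Draw heterogeneous (tau, threshold) pairs; the ranges are illustrative only."""
    rng = random.Random(seed)
    return [(rng.uniform(0.01, 0.05), rng.uniform(0.2, 0.8)) for _ in range(n)]


def encode(signal, population, dt=0.001):
    """Encode a signal as one spike count per neuron in the population."""
    return [lif_spike_count(signal, tau, th, dt) for tau, th in population]


# Hypothetical input: one second of a constant, normalised feature value.
constant_input = [1.0] * 1000
features = encode(constant_input, make_population(8))
```

Because each neuron has its own time constant and threshold, the same input produces a different spike count per neuron, and that vector of counts is what a linear readout would regress from.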

Paperid: 668, https://arxiv.org/pdf/2501.13765.pdf

Abstract:
The maker movement embodies a resurgence in DIY creation, merging physical craftsmanship and arts with digital technology support. However, mere technological skills and creativity are insufficient for economically and psychologically sustainable practice. By illuminating and smoothing the path from "maker" to "maker entrepreneur," we can help broaden the viability of making as a livelihood. Our research centers on makers who design, produce, and sell physical goods. In this work, we explore the transition to entrepreneurship for these makers and how technology can facilitate this transition online and offline. We present results from interviews with 20 USA-based maker entrepreneurs (e.g., lamps, stickers), six creative service entrepreneurs (e.g., photographers, fabrication), and seven support personnel (e.g., art curator, incubator director). Our findings reveal that many maker entrepreneurs 1) are makers first and entrepreneurs second; 2) struggle with business logistics and learn business skills as they go; and 3) are motivated by non-monetary values. We discuss training and technology-based design implications and opportunities for addressing challenges in developing economically sustainable businesses around making.

Paperid: 669, https://arxiv.org/pdf/2501.06177.pdf

Abstract:
Micromobility vehicles, such as e-scooters, are increasingly popular in urban communities but present significant challenges in terms of road safety, user privacy, infrastructure planning, and civil engineering. Addressing these critical issues requires a large-scale and easily accessible research infrastructure to collect diverse mobility and contextual data from micromobility users in realistic settings. To this end, we present ScooterLab, a community research testbed comprising a fleet of customizable battery-powered micromobility vehicles retrofitted with advanced sensing, communication, and control capabilities. ScooterLab enables interdisciplinary research at the intersection of computing, mobility, and urban planning by providing researchers with tools to design and deploy customized sensing experiments and access curated datasets. The testbed will enable advances in machine learning, privacy, and urban transportation research while promoting sustainable mobility.

Paperid: 670, https://arxiv.org/pdf/2501.04000.pdf

Abstract:
Human Sensing, a field that leverages technology to monitor human activities, psycho-physiological states, and interactions with the environment, enhances our understanding of human behavior and drives the development of advanced services that improve overall quality of life. However, its reliance on detailed and often privacy-sensitive data as the basis for its machine learning (ML) models raises significant legal and ethical concerns. The recently proposed ML approach of Federated Learning (FL) promises to alleviate many of these concerns, as it is able to create accurate ML models without sending raw user data to a central server. While FL has demonstrated its usefulness across a variety of areas, such as text prediction and cyber security, its benefits in Human Sensing are under-explored, given the particular challenges in this domain. This survey conducts a comprehensive analysis of the current state-of-the-art studies on FL in Human Sensing, and proposes a taxonomy and an eight-dimensional assessment for FL approaches. Through the eight-dimensional assessment, we then evaluate whether the surveyed studies consider a specific FL-in-Human-Sensing challenge or not. Finally, based on the overall analysis, we discuss open challenges and highlight five research aspects related to FL in Human Sensing that require urgent research attention. Our work provides a comprehensive corpus of FL studies and aims to assist FL practitioners in developing and evaluating solutions that effectively address the real-world complexities of Human Sensing.

Paperid: 671, https://arxiv.org/pdf/2501.02177.pdf

Abstract:
The potential of facial expression reconstruction technology is significant, with applications in various fields such as human-computer interaction, affective computing, and virtual reality. Recent studies have proposed using ear-worn devices for facial expression reconstruction to address the environmental limitations and privacy concerns associated with traditional camera-based methods. However, these approaches still require improvements in terms of aesthetics and power consumption. This paper introduces a system called IMUFace. It uses inertial measurement units (IMUs) embedded in wireless earphones to detect subtle ear movements caused by facial muscle activities, allowing for covert and low-power facial reconstruction. A user study involving 12 participants was conducted, and a deep learning model named IMUTwinTrans was proposed. The results show that IMUFace can accurately predict users' facial landmarks with a precision of 2.21 mm, using only five minutes of training data. The predicted landmarks can be utilized to reconstruct a three-dimensional facial model. IMUFace operates at a sampling rate of 30 Hz with a relatively low power consumption of 58 mW. The findings presented in this study demonstrate the real-world applicability of IMUFace and highlight potential directions for further research to facilitate its practical adoption.

Paperid: 672, https://arxiv.org/pdf/2506.22741.pdf

Abstract:
Significant changes in the digital employment landscape, driven by rapid technological advancements and the COVID-19 pandemic, have introduced new opportunities for blind and visually impaired (BVI) individuals in developing countries like India. However, a significant portion of the BVI population in India remains unemployed despite extensive accessibility advancements and job search interventions. Therefore, we conducted semi-structured interviews with 20 BVI persons who were either pursuing or recently sought employment in the digital industry. Our findings reveal that despite gaining digital literacy and extensive training, BVI individuals struggle to meet industry requirements for fulfilling job openings. While they engage in self-reflection to identify shortcomings in their approach and skills, they lack constructive feedback from peers and recruiters. Moreover, the numerous job intervention tools are limited in their ability to meet the unique needs of BVI job seekers. Our results therefore provide key insights that inform the design of future collaborative intervention systems that offer personalized feedback for BVI individuals, effectively guiding their self-reflection process and subsequent job search behaviors, and potentially leading to improved employment outcomes.

Paperid: 673, https://arxiv.org/pdf/2506.21896.pdf

Abstract:
The current apprenticeship model for surgical training requires a high level of supervision, which does not scale well to meet the growing need for more surgeons. Many endoscopic procedures are directly taught in the operating room (OR) while the attending surgeon and trainee operate on patients. The need to prioritize patient care limits the trainees' opportunities to experiment and receive feedback on their performance. Augmented reality (AR) has the potential to increase efficiency in endoscopic surgical training, but additional research is critical to understanding the needs of surgical trainees to inform the design of AR training systems. Therefore, we worked with 18 surgical trainees to understand the strengths, limitations, and unmet needs of their current training environment and to co-design an AR eye-gaze tracking system based on their preferences. Trainees emphasized the need to practice the 2D to 3D mapping needed to properly familiarize oneself with the anatomy of patients to prepare for real surgery. The trainees felt that an AR-based eye gaze tracking system would be a useful supplemental training method that would improve their learning in OR cases without detracting from patient care. To tailor the AR system to their needs, they co-designed features to improve their ability to track the attending surgeon's eye gaze and to provide a real-time, interactive system. Our results are valuable in shaping the endoscopic training modules by generating user-informed guidelines to design future collaborative AR-based eye-gaze tracking systems.

Paperid: 674, https://arxiv.org/pdf/2506.20373.pdf

Abstract:
We introduce CARMA, a system for situational grounding in human-robot group interactions. Effective collaboration in such group settings requires situational awareness based on a consistent representation of present persons and objects coupled with an episodic abstraction of events regarding actors and manipulated objects. This calls for a clear and consistent assignment of instances, ensuring that robots correctly recognize and track actors, objects, and their interactions over time. To achieve this, CARMA uniquely identifies physical instances of such entities in the real world and organizes them into grounded triplets of actors, objects, and actions. To validate our approach, we conducted three experiments, where multiple humans and a robot interact: collaborative pouring, handovers, and sorting. These scenarios allow the assessment of the system's capabilities as to role distinction, multi-actor awareness, and consistent instance identification. Our experiments demonstrate that the system can reliably generate accurate actor-action-object triplets, providing a structured and robust foundation for applications requiring spatiotemporal reasoning and situated decision-making in collaborative settings.

Paperid: 675, https://arxiv.org/pdf/2506.14774.pdf

Abstract:
Clinical decision-making is inherently complex, often influenced by cognitive biases, incomplete information, and case ambiguity. Large Language Models (LLMs) have shown promise as tools for supporting clinical decision-making, yet their typical one-shot or limited-interaction usage may overlook the complexities of real-world medical practice. In this work, we propose a hybrid human-AI framework, MedSyn, where physicians and LLMs engage in multi-step, interactive dialogues to refine diagnoses and treatment decisions. Unlike static decision-support tools, MedSyn enables dynamic exchanges, allowing physicians to challenge LLM suggestions while the LLM highlights alternative perspectives. Through simulated physician-LLM interactions, we assess the potential of open-source LLMs as physician assistants. Results show open-source LLMs are promising as physician assistants in the real world. Future work will involve real physician interactions to further validate MedSyn's usefulness in diagnostic accuracy and patient outcomes.

Paperid: 676, https://arxiv.org/pdf/2506.11376.pdf

Abstract:
Family caregivers often face substantial mental health challenges due to their multifaceted roles and limited resources. This study explored the potential of a large language model (LLM)-powered conversational agent to deliver evidence-based mental health support for caregivers, specifically Problem-Solving Therapy (PST) integrated with Motivational Interviewing (MI) and Behavioral Chain Analysis (BCA). A within-subject experiment was conducted with 28 caregivers interacting with four LLM configurations to evaluate empathy and therapeutic alliance. The best-performing models incorporated Few-Shot and Retrieval-Augmented Generation (RAG) prompting techniques, alongside clinician-curated examples. The models showed improved contextual understanding and personalized support, as reflected by qualitative responses and quantitative ratings on perceived empathy and therapeutic alliances. Participants valued the model's ability to validate emotions, explore unexpressed feelings, and provide actionable strategies. However, balancing thorough assessment with efficient advice delivery remains a challenge. This work highlights the potential of LLMs in delivering empathetic and tailored support for family caregivers.

Paperid: 677, https://arxiv.org/pdf/2506.09354.pdf

Abstract:
Mental health is a growing global concern, prompting interest in AI-driven solutions to expand access to psychosocial support. Peer support, grounded in lived experience, offers a valuable complement to professional care. However, variability in training, effectiveness, and definitions raises concerns about quality, consistency, and safety. Large Language Models (LLMs) present new opportunities to enhance peer support interactions, particularly in real-time, text-based interactions. We present and evaluate an AI-supported system with an LLM-simulated distressed client, context-sensitive LLM-generated suggestions, and real-time emotion visualisations. Two mixed-methods studies with 12 peer supporters and 5 mental health professionals (i.e., experts) examined the system's effectiveness and implications for practice. Both groups recognised its potential to enhance training and improve interaction quality. However, a key tension emerged: while peer supporters engaged meaningfully, experts consistently flagged critical issues in peer supporter responses, such as missed distress cues and premature advice-giving. This misalignment highlights potential limitations in current peer support training, especially in emotionally charged contexts where safety and fidelity to best practices are essential. Our findings underscore the need for standardised, psychologically grounded training, especially as peer support scales globally. They also demonstrate how LLM-supported systems can scaffold this development--if designed with care and guided by expert oversight. This work contributes to emerging conversations on responsible AI integration in mental health and the evolving role of LLMs in augmenting peer-delivered care.

Paperid: 678, https://arxiv.org/pdf/2506.06991.pdf

Abstract:
The recent success of generative AI highlights the crucial role of high-quality human feedback in building trustworthy AI systems. However, the increasing use of large language models (LLMs) by crowdsourcing workers poses a significant challenge: datasets intended to reflect human input may be compromised by LLM-generated responses. Existing LLM detection approaches often rely on high-dimension training data such as text, making them unsuitable for annotation tasks like multiple-choice labeling. In this work, we investigate the potential of peer prediction -- a mechanism that evaluates the information within workers' responses without using ground truth -- to mitigate LLM-assisted cheating in crowdsourcing with a focus on annotation tasks. Our approach quantifies the correlations between worker answers while conditioning on (a subset of) LLM-generated labels available to the requester. Building on prior research, we propose a training-free scoring mechanism with theoretical guarantees under a crowdsourcing model that accounts for LLM collusion. We establish conditions under which our method is effective and empirically demonstrate its robustness in detecting low-effort cheating on real-world crowdsourcing datasets.
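
The idea of scoring correlations between worker answers while conditioning on LLM-generated labels can be illustrated with a toy sketch. This is not the paper's mechanism and carries none of its guarantees. The function `conditioned_agreement` and the tiny label arrays are hypothetical, loosely following agreement-based peer prediction: within each group of tasks sharing the same LLM label, it compares how often two workers agree on the same task against how often they would agree if their answers were independent.

```python
import numpy as np

def conditioned_agreement(worker, peer, llm_labels):
    """Agreement between two workers beyond what the LLM label already explains."""
    worker, peer, llm_labels = map(np.asarray, (worker, peer, llm_labels))
    score = 0.0
    for label in np.unique(llm_labels):
        mask = llm_labels == label
        w_g, p_g = worker[mask], peer[mask]
        # Observed agreement on the same task within this LLM-label group.
        same_task = np.mean(w_g == p_g)
        # Agreement expected if the two answers were independent given the LLM label.
        independent = sum(np.mean(w_g == k) * np.mean(p_g == k) for k in np.unique(w_g))
        score += mask.mean() * (same_task - independent)
    return score

# Hypothetical toy data: binary annotation task with eight items.
truth      = np.array([0, 0, 1, 1, 0, 1, 0, 1])
llm_labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])  # the LLM is right on 6 of 8 items

honest_worker = truth.copy()       # puts in effort and recovers the truth
peer_worker   = truth.copy()       # another diligent worker
cheater       = llm_labels.copy()  # copies the LLM output verbatim

honest_score  = conditioned_agreement(honest_worker, peer_worker, llm_labels)
cheater_score = conditioned_agreement(cheater, peer_worker, llm_labels)
```

In this toy setting a worker who copies the LLM contributes nothing beyond the conditioning variable, so their score is zero. Workers who share information about the true label score above zero.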

Paperid: 679, https://arxiv.org/pdf/2506.06591.pdf

Abstract:
Previous research has explored the privacy needs and concerns of device owners, primary users, and different bystander groups with regard to smart home devices like security cameras, smart speakers, and hubs, but little is known about the privacy views and practices of smart home product teams, particularly those in non-Western contexts. This paper presents findings from 27 semi-structured interviews with Chinese smart home product team members, including product/project managers, software/hardware engineers, user experience (UX) designers, legal/privacy experts, and marketers/operation specialists. We examine their privacy perspectives, practices, and risk mitigation strategies. Our results show that participants emphasized compliance with Chinese data privacy laws, which typically prioritized national security over individual privacy rights. China-specific cultural, social, and legal factors also influenced participants' ethical considerations and attitudes toward balancing user privacy and security with convenience. Drawing on our findings, we propose a set of recommendations for smart home product teams, along with socio-technical and legal interventions to address smart home privacy issues -- especially those belonging to at-risk groups -- in Chinese multi-user smart homes.

Paperid: 680, https://arxiv.org/pdf/2506.05522.pdf

Abstract:
Community-level blocklists are key to content moderation practices in decentralized social media. These blocklists enable moderators to prevent other communities, such as those acting in bad faith, from interacting with their own -- and, if shared publicly, warn others about communities worth blocking. Prior work has examined blocklists in centralized social media, noting their potential for collective moderation outcomes, but has focused on blocklists as individual-level tools. To understand how moderators perceive and utilize community-level blocklists and what additional support they may need, we examine social media communities running Mastodon, an open-source microblogging software built on the ActivityPub protocol. We conducted (1) content analysis of the community-level blocklist ecosystem, and (2) semi-structured interviews with twelve Mastodon moderators. Our content analysis revealed wide variation in blocklist goals, inclusion criteria, and transparency. Interviews showed moderators balance proactive safety, reactive practices, and caution around false positives when using blocklists for moderation. They noted challenges and limitations in current blocklist use, suggesting design improvements like comment receipts, category filters, and collaborative voting. We discuss implications for decentralized content moderation, highlighting trade-offs between openness, safety, and nuance; the complexity of moderator roles; and opportunities for future design.

Paperid: 681, https://arxiv.org/pdf/2506.04865.pdf

Abstract:
Online reviews have become an integral aspect of consumer decision-making on e-commerce websites, especially in the restaurant industry. Unlike sighted users who can visually skim through the reviews, perusing reviews remains challenging for blind users, who rely on screen reader assistive technology that supports predominantly one-dimensional narration of content via keyboard shortcuts. In an interview study, we uncovered numerous pain points of blind screen reader users with online restaurant reviews, notably the listening fatigue and frustration after going through only the first few reviews. To address these issues, we developed QuickQue, an assistive tool that performs aspect-focused sentiment-driven summarization to reorganize the information in the reviews into an alternative, thematically-organized presentation that is conveniently perusable with a screen reader. At its core, QuickQue utilizes a large language model to perform aspect-based joint classification for grouping reviews, followed by focused summarizations within the groups to generate concise representations of reviewers' opinions, which are then presented to the screen reader users via an accessible interface. Evaluation of QuickQue in a user study with 10 participants showed significant improvements in overall usability and task workload compared to the status quo screen reader.

Paperid: 682, https://arxiv.org/pdf/2506.03267.pdf

Abstract:
A prevailing approach to explaining time series models is to generate attributions in the time domain. A recent development in time series XAI is the concept of explanation spaces, where any model trained in the time domain can be interpreted with any existing XAI method in alternative domains, such as frequency. The prevailing approach is to present XAI attributions either in the time domain or in the domain where the attribution is most sparse. In this paper, we demonstrate that in certain cases, XAI methods can generate attributions that highlight fundamentally different features in the time and frequency domains that are not direct counterparts of one another. This suggests that both domains' attributions should be presented to achieve a more comprehensive interpretation, which demonstrates the necessity of multi-domain explanation. To quantify when such cases arise, we introduce the uncertainty principle (UP), originally developed in quantum mechanics and later studied in harmonic analysis and signal processing, to the XAI literature. This principle establishes a lower bound on how much a signal can be simultaneously localized in both the time and frequency domains. By leveraging this concept, we assess whether attributions in the time and frequency domains violate this bound, indicating that they emphasize distinct features. In other words, UP provides a sufficient condition that the time and frequency domain explanations do not match and, hence, should both be presented to the end user. We validate the effectiveness of this approach across various deep learning models, XAI methods, and a wide range of classification and forecasting datasets. The frequent occurrence of UP violations across various datasets and XAI methods highlights the limitations of existing approaches that focus solely on time-domain explanations. This underscores the need for multi-domain explanations as a new paradigm.
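
The bound referred to in this abstract can be made concrete with a small numerical sketch. It uses the classical Heisenberg-Gabor form, in which the product of the time spread and the frequency spread of a signal's energy densities is at least 1/(4π), with frequency in cycles per unit time. The paper's exact formulation may differ. Treating squared attribution magnitudes as energy-like weights, the function names `spread`, `spread_product`, and `violates_up`, and the tolerance are all assumptions made for illustration.

```python
import numpy as np

UP_BOUND = 1.0 / (4.0 * np.pi)  # Heisenberg-Gabor lower bound on sigma_t * sigma_f

def spread(axis, values):
    """Standard deviation of `axis` weighted by the normalized squared magnitude."""
    w = np.abs(values) ** 2
    w = w / w.sum()
    mean = np.sum(axis * w)
    return np.sqrt(np.sum((axis - mean) ** 2 * w))

def spread_product(t, values_t, f, values_f):
    """Product of the time-domain and frequency-domain spreads."""
    return spread(t, values_t) * spread(f, values_f)

def violates_up(t, attr_t, f, attr_f, tol=1e-3):
    """True if the two attributions are jointly more localized than any
    Fourier pair can be; `tol` absorbs discretization error."""
    return spread_product(t, attr_t, f, attr_f) < UP_BOUND - tol

n, dt = 1024, 0.01
t = (np.arange(n) - n // 2) * dt
f = np.fft.fftfreq(n, d=dt)

# A Gaussian and its Fourier transform attain the bound (up to discretization).
gaussian = np.exp(-t ** 2 / (2 * 0.2 ** 2))
consistent_product = spread_product(t, gaussian, f, np.fft.fft(gaussian))
consistent_flag = violates_up(t, gaussian, f, np.fft.fft(gaussian))

# Hypothetical attributions: sharply localized in time AND in frequency,
# which no single signal and its spectrum could be.
attr_t = np.exp(-t ** 2 / (2 * 0.02 ** 2))
attr_f = np.exp(-(f - 5.0) ** 2 / (2 * 0.2 ** 2))
mismatch_flag = violates_up(t, attr_t, f, attr_f)
```

A violation here only signals that the two attributions cannot be Fourier counterparts of one another, which matches the abstract's description of UP as a sufficient condition for presenting both domains.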

Paperid: 683, https://arxiv.org/pdf/2506.00195.pdf

Abstract:
Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% compared to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.

Paperid: 684, https://arxiv.org/pdf/2506.00073.pdf

Abstract:
AI agents are increasingly used in consumer-facing applications to assist with tasks such as product search, negotiation, and transaction execution. In this paper, we explore a future scenario where both consumers and merchants authorize AI agents to fully automate negotiations and transactions. We aim to answer two key questions: (1) Do different LLM agents vary in their ability to secure favorable deals for users? (2) What risks arise from fully automating deal-making with AI agents in consumer markets? To address these questions, we develop an experimental framework that evaluates the performance of various LLM agents in real-world negotiation and transaction settings. Our findings reveal that AI-mediated deal-making is an inherently imbalanced game -- different agents achieve significantly different outcomes for their users. Moreover, behavioral anomalies in LLMs can result in financial losses for both consumers and merchants, such as overspending or accepting unreasonable deals. These results underscore that while automation can improve efficiency, it also introduces substantial risks. Users should exercise caution when delegating business decisions to AI agents.

Paperid: 685, https://arxiv.org/pdf/2505.24107.pdf

Abstract:
With the growth of AI, researchers are studying how to mitigate its environmental impact, primarily by proposing policy changes and increasing awareness among developers. However, research on AI end users is limited. Therefore, we introduce GPTFootprint, a browser extension that aims to increase consumer awareness of the significant water and energy consumption of LLMs, and reduce unnecessary LLM usage. GPTFootprint displays a dynamically updating visualization of the resources individual users consume through their ChatGPT queries. After a user reaches a set query limit, a popup prompts them to take a break from ChatGPT. In a week-long user study, we found that GPTFootprint increases people's awareness of environmental impact, but has limited success in decreasing ChatGPT usage. This research demonstrates the potential for individual-level interventions to contribute to the broader goal of sustainable AI usage, and provides insights into the effectiveness of awareness-based behavior modification strategies in the context of LLMs.

Paperid: 686, https://arxiv.org/pdf/2505.24000.pdf

Abstract:
Group conversations are valuable for second language (L2) learners as they provide opportunities to practice listening and speaking, exercise complex turn-taking skills, and experience group social dynamics in a target language. However, most existing Augmented Reality (AR)-based conversational learning tools focus on dyadic interactions rather than group dialogues. Although research has shown that AR can help reduce speaking anxiety and create a comfortable space for practicing speaking skills in dyadic scenarios, especially with Large Language Model (LLM)-based conversational agents, the potential for group language practice using these technologies remains largely unexplored. We introduce ConversAR, a GPT-4o-powered AR application that enables L2 learners to practice contextualized group conversations. Our system features two embodied LLM agents with vision-based scene understanding and live captions. In a system evaluation with 10 participants, users reported reduced speaking anxiety and increased learner autonomy compared to perceptions of in-person practice methods with other learners.

Paperid: 687, https://arxiv.org/pdf/2505.23994.pdf

Abstract:
Public opinion shapes policy, yet capturing it effectively to surface diverse perspectives remains challenging. This paper introduces PolicyPulse, an LLM-powered interactive system that synthesizes public experiences from online community discussions to help policy researchers author memos and briefs, leveraging curated real-world anecdotes. Given a specific topic (e.g., "Climate Change"), PolicyPulse returns an organized list of themes (e.g., "Biodiversity Loss" or "Carbon Pricing"), supporting each theme with relevant quotes from real-life anecdotes. We compared PolicyPulse outputs to authoritative policy reports. Additionally, we asked 11 policy researchers across multiple institutions in the Northeastern U.S. to compare using PolicyPulse with their expert approach. We found that PolicyPulse's themes aligned with authoritative reports and helped spark research by analyzing existing data, gathering diverse experiences, revealing unexpected themes, and informing survey or interview design. Participants also highlighted limitations including insufficient demographic context and data verification challenges. Our work demonstrates how AI-powered tools can help influence policy-relevant research and shape policy outcomes.

Paperid: 688, https://arxiv.org/pdf/2505.22428.pdf

Abstract:
Couples often experience a decrease in closeness as they cope with the demands of parenthood. Existing technologies have supported parenting and parental collaboration. However, these technologies do not adequately support closeness in co-parenting. We use scenarios and design probes to brainstorm with 10 new parent couples to explore and envision possibilities for technologies to support closeness. We report parents' current technology use for co-parenting and how participants considered and envisioned co-parenting technology for closeness, including information and task sharing, emotion awareness and disclosure, and fostering fun interaction. We discuss the potential technology has for fostering closeness in co-parenting by (1) fostering interdependence by supporting parental competence and (2) integrating positive emotions and experiences, such as validation and fun, in parenting. Based on our findings, we expand the design space of technology for closeness to include interdependence. We also expand the design space for co-parenting technology by integrating more positive emotions.

Paperid: 689, https://arxiv.org/pdf/2505.21385.pdf

Abstract:
Reconstructing dynamic visual stimuli from brain EEG recordings is challenging due to the non-stationary and noisy nature of EEG signals and the limited availability of EEG-video datasets. Prior work has largely focused on static image reconstruction, leaving the open question of whether EEG carries sufficient information for dynamic video decoding. In this work, we present EEGVid, a framework that reconstructs dynamic video stimuli from EEG signals while systematically probing the information they encode. Our approach first learns the EEG representation and then uses these features for video synthesis with a temporally conditioned StyleGAN-ADA that maps EEG embeddings to specific frame positions. Through experiments on three datasets (SEED, EEG-Video Action, SEED-DV), we demonstrate that EEG supports semantically meaningful reconstruction of dynamic visual content, and we quantify how much EEG knows: (i) hemispheric asymmetry, with the left hemisphere more predictive of visual content and the right hemisphere of emotional content, (ii) the temporal lobe as the most informative region, and (iii) EEG timesteps 100-300 as the most critical for dynamic visual encoding. Importantly, while generative priors contribute fine spatial detail, EEG provides the semantic and temporal guidance necessary for reconstructing videos that align with the observed stimuli. This positions video generation not as a standalone generative benchmark, but as a means to visualize and validate the representational content of EEG in the context of dynamic vision.

Paperid: 690, https://arxiv.org/pdf/2505.20085.pdf

Abstract:
Artificial Intelligence (AI) is one of the major technological advancements of this century, bearing incredible potential for users through AI-powered applications and tools in numerous domains. Being often black-box (i.e., its decision-making process is unintelligible), developers typically resort to eXplainable Artificial Intelligence (XAI) techniques to interpret the behaviour of AI models to produce systems that are transparent, fair, reliable, and trustworthy. However, presenting explanations to the user is not trivial and is often left as a secondary aspect of the system's design process, leading to AI systems that are not useful to end-users. This paper presents a Systematic Literature Review on Explanation User Interfaces (XUIs) to gain a deeper understanding of the solutions and design guidelines employed in the academic literature to effectively present explanations to users. To improve the contribution and real-world impact of this survey, we also present a framework for Human-cEnteRed developMent of Explainable user interfaceS (HERMES) to guide practitioners and academics in the design and evaluation of XUIs.

Paperid: 691, https://arxiv.org/pdf/2505.19101.pdf

Abstract:
Autonomous agents powered by Large Language Models are transforming AI, creating an imperative for the visualization field to embrace agentic frameworks. However, our field's focus on a human in the sensemaking loop raises critical questions about autonomy, delegation, and coordination for such agentic visualization that preserve human agency while amplifying analytical capabilities. This paper addresses these questions by reinterpreting existing visualization systems with semi-automated or fully automatic AI components through an agentic lens. Based on this analysis, we extract a collection of design patterns for agentic visualization, including agentic roles, communication and coordination. These patterns provide a foundation for future agentic visualization systems that effectively harness AI agents while maintaining human insight and control.

Paperid: 692, https://arxiv.org/pdf/2505.17937.pdf

Abstract:
The rapid advancement of large language models (LLMs) raises critical concerns about their ethical alignment, particularly in scenarios where humans and AI co-exist under conflicts of interest. This work introduces an extendable, asymmetric, multi-agent simulation-based benchmarking framework to evaluate the moral behavior of LLMs in a novel human-AI co-existence setting featuring consistent living and critical resource management. Building on previous generative agent environments, we incorporate a life-sustaining system, where agents must compete or cooperate for food resources to survive, often leading to ethically charged decisions such as deception, theft, or social influence. We evaluated two types of LLM, DeepSeek and OpenAI series, in a three-agent setup (two humans, one LLM-powered robot), using adapted behavioral detection from the MACHIAVELLI framework and a custom survival-based ethics metric. Our findings reveal stark behavioral differences: DeepSeek frequently engages in resource hoarding, while OpenAI exhibits restraint, highlighting the influence of model design on ethical outcomes. Additionally, we demonstrate that prompt engineering can significantly steer LLM behavior: jailbreaking prompts markedly increase unethical actions, even for highly restricted OpenAI models, while cooperative prompts markedly reduce them. Our framework provides a reproducible testbed for quantifying LLM ethics in high-stakes scenarios, offering insights into their suitability for real-world human-AI interactions.

Paperid: 693, https://arxiv.org/pdf/2505.16397.pdf

Abstract:
This paper presents a method for generating dynamic caustic patterns by utilising dual-optimised holographic fields with Phased Array Transducer (PAT). Building on previous research in static caustic optimisation and ultrasonic manipulation, this approach employs computational techniques to dynamically shape fluid surfaces, thereby creating controllable and real-time caustic images. The system employs a Digital Twin framework, which enables iterative feedback and refinement, thereby improving the accuracy and quality of the caustic patterns produced. This paper extends the foundational work in caustic generation by integrating liquid surfaces as refractive media. This concept has previously been explored in simulations but not fully realised in practical applications. The utilisation of ultrasound to directly manipulate these surfaces enables the generation of dynamic caustics with a high degree of flexibility. The Digital Twin approach further enhances this process by allowing for precise adjustments and optimisation based on real-time feedback. Experimental results demonstrate the technique's capacity to generate continuous animations and complex caustic patterns at high frequencies. Although there are limitations in contrast and resolution compared to solid-surface methods, this approach offers advantages in terms of real-time adaptability and scalability. This technique has the potential to be applied in a number of areas, including interactive displays, artistic installations and educational tools. This research builds upon the work of previous researchers in the fields of caustics optimisation, ultrasonic manipulation, and computational displays. Future research will concentrate on enhancing the resolution and intricacy of the generated patterns.

Paperid: 694, https://arxiv.org/pdf/2505.14126.pdf

Abstract:
Knowledge components (KCs) are the fundamental units of knowledge in the field of education. A KC graph illustrates the relationships and dependencies between KCs. An accurate KC graph can assist educators in identifying the root causes of learners' poor performance on specific KCs, thereby enabling targeted instructional interventions. To achieve this, we have developed a KC graph structure learning algorithm, named MAS-KCL, which employs a multi-agent system driven by large language models for adaptive modification and optimization of the KC graph. Additionally, a bidirectional feedback mechanism is integrated into the algorithm, where AI agents leverage this mechanism to assess the value of edges within the KC graph and adjust the distribution of generation probabilities for different edges, thereby accelerating the efficiency of structure learning. We applied the proposed algorithm to 5 synthetic datasets and 4 real-world educational datasets, and experimental results validate its effectiveness in learning path recognition. By accurately identifying learners' learning paths, teachers are able to design more comprehensive learning plans, enabling learners to achieve their educational goals more effectively, thus promoting the sustainable development of education.

Paperid: 695, https://arxiv.org/pdf/2505.11687.pdf

Abstract:
Simulations in information access (IA) have recently gained interest, as shown by various tutorials and workshops around that topic. Simulations can be key contributors to central IA research and evaluation questions, especially around interactive settings when real users are unavailable, or their participation is impossible due to ethical reasons. In addition, simulations in IA can help contribute to a better understanding of users, reduce complexity of evaluation experiments, and improve reproducibility. Building on recent developments in methods and toolkits, the second iteration of our Sim4IA workshop aims to again bring together researchers and practitioners to form an interactive and engaging forum for discussions on the future perspectives of the field. An additional aim is to plan an upcoming TREC/CLEF campaign.

Paperid: 696, https://arxiv.org/pdf/2505.10454.pdf

Abstract:
Explainable AI (XAI) research has traditionally focused on rational users, aiming to improve understanding and reduce cognitive biases. However, emotional factors play a critical role in how explanations are perceived and processed. Prior work shows that prior and task-generated emotions can negatively impact the understanding of explanations. Building on these insights, we propose a three-stage model for emotion-sensitive explanation grounding: (1) emotional or epistemic arousal, (2) understanding, and (3) agreement. This model provides a conceptual basis for developing XAI systems that dynamically adapt explanation strategies to users' emotional states, ultimately supporting more effective and user-centered decision-making.

Paperid: 697, https://arxiv.org/pdf/2505.10427.pdf

Abstract:
The explanation of AI results and how they are received by users is an increasingly active research field. However, there is a surprising lack of knowledge about how social factors such as emotions affect the process of explanation by a decision support system (DSS). While previous research has shown effects of emotions on DSS-supported decision-making, it remains unknown to what extent emotions affect cognitive processing during an explanation. In this study, we therefore investigated the influence of prior emotions and task-related arousal on the retention and understanding of explained feature relevance. To investigate the influence of prior emotions, we induced happiness and fear prior to the decision support interaction. Before emotion induction, user characteristics to assess their risk type were collected via a questionnaire. To identify emotional reactions to the explanations of the relevance of different features, we observed heart rate variability (HRV), facial expressions, and self-reported emotions of the explainee while observing and listening to the explanation and assessed their retention of the features as well as their influence on the outcome of the decision task. Results indicate that (1) task-unrelated prior emotions do not affect the retention but may affect the understanding of the relevance of certain features in the sense of an emotion-induced confirmation bias, (2) certain features related to personal attitudes yielded arousal in individual participants, and (3) this arousal affected the understanding of these variables.

Paperid: 698, https://arxiv.org/pdf/2505.09068.pdf

Abstract:
This paper introduces S-DAT (Synthetic-Divergent Association Task), a scalable, multilingual framework for automated assessment of divergent thinking (DT), a core component of human creativity. Traditional creativity assessments are often labor-intensive, language-specific, and reliant on subjective human ratings, limiting their scalability and cross-cultural applicability. In contrast, S-DAT leverages large language models and advanced multilingual embeddings to compute semantic distance -- a language-agnostic proxy for DT. We evaluate S-DAT across eleven diverse languages, including English, Spanish, German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating robust and consistent scoring across linguistic contexts. Unlike prior DAT approaches, the S-DAT shows convergent validity with other DT measures and correct discriminant validity with convergent thinking. This cross-linguistic flexibility allows for more inclusive, global-scale creativity research, addressing key limitations of earlier approaches. S-DAT provides a powerful tool for fairer, more comprehensive evaluation of cognitive flexibility in diverse populations and can be freely accessed online: https://sdat.iol.zib.de/.
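
The semantic-distance idea in the abstract above can be illustrated with a minimal sketch. It assumes the score is the mean pairwise cosine distance between word embeddings, scaled by 100 as a presentation convention; the function names and the toy 3-dimensional vectors are hypothetical, and the actual S-DAT pipeline (its embedding model and any preprocessing) is not reproduced here.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_distance_score(embeddings):
    """Mean pairwise cosine distance over all word pairs, scaled by 100 (assumed convention)."""
    pairs = list(combinations(embeddings, 2))
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical 3-dimensional embeddings for three words (illustration only).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
score = semantic_distance_score(toy_embeddings)
print(round(score, 2))
```

In practice the toy vectors would be replaced by vectors from a multilingual embedding model, so that the same scoring function applies regardless of the language of the input words.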

Paperid: 699, https://arxiv.org/pdf/2505.08143.pdf

Abstract:
With the wide adoption of large language models (LLMs) in information assistance, it is essential to examine their alignment with human communication styles and values. We situate this study within the context of fact-checking health information, given the critical challenge of rectifying conceptions and building trust. Recent studies have explored the potential of LLMs for health communication, but style differences between LLMs and human experts and associated reader perceptions remain under-explored. In this light, our study evaluates the communication styles of LLMs, focusing on how their explanations differ from those of humans in three core components of health communication: information, sender, and receiver. We compiled a dataset of 1498 health misinformation explanations from authoritative fact-checking organizations and generated LLM responses to inaccurate health information. Drawing from health communication theory, we evaluate communication styles across three key dimensions: information linguistic features, sender persuasive strategies, and receiver value alignments. We further assessed human perceptions through a blinded evaluation with 99 participants. Our findings reveal that LLM-generated articles showed significantly lower scores in persuasive strategies, certainty expressions, and alignment with social values and moral foundations. However, human evaluation demonstrated a strong preference for LLM content, with over 60% of responses favoring LLM articles for clarity, completeness, and persuasiveness. Our results suggest that LLMs' structured approach to presenting information may be more effective at engaging readers despite scoring lower on traditional measures of quality in fact-checking and health communication.

Paperid: 700, https://arxiv.org/pdf/2505.07100.pdf

Abstract:
The Rashomon effect describes the observation that in machine learning (ML) multiple models often achieve similar predictive performance while explaining the underlying relationships in different ways. This observation holds even for intrinsically interpretable models, such as Generalized Additive Models (GAMs), which offer users valuable insights into the model's behavior. Given the existence of multiple GAM configurations with similar predictive performance, a natural question is whether we can personalize these configurations based on users' needs for interpretability. In our study, we developed an approach to personalize models based on contextual bandits. In an online experiment with 108 users in a personalized treatment and a non-personalized control group, we found that personalization led to individualized rather than one-size-fits-all configurations. Despite these individual adjustments, the interpretability remained high across both groups, with users reporting a strong understanding of the models. Our research offers initial insights into the potential for personalizing interpretable ML.
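
As a rough illustration of how a contextual bandit can personalize a choice among candidate configurations, the sketch below uses a simple epsilon-greedy rule with one running mean reward per (context, arm) pair. This is an assumption made for illustration: the abstract does not specify which bandit algorithm, context features, or reward signal the authors used, and the class name, context labels, and arm indices are hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Keeps a running mean reward per (context, arm) pair."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best-known arm.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda arm: self.values[(context, arm)])

    def update(self, context, arm, reward):
        # Incremental mean update for the chosen (context, arm) pair.
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


# Each arm stands for one hypothetical model configuration; the reward could be
# a user's interpretability rating. epsilon=0.0 makes this demo deterministic.
bandit = EpsilonGreedyContextualBandit(n_arms=3, epsilon=0.0)
bandit.update("novice", 2, 1.0)
bandit.update("expert", 0, 1.0)
choice_novice = bandit.select("novice")
choice_expert = bandit.select("expert")
print(choice_novice, choice_expert)
```

With a positive epsilon the bandit would occasionally try other configurations, which is what lets per-user preferences emerge over repeated interactions.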

Paperid: 701, https://arxiv.org/pdf/2505.06301.pdf

Abstract:
Cross-user variability in Human Activity Recognition (HAR) remains a critical challenge due to differences in sensor placement, body dynamics, and behavioral patterns. Traditional methods often fail to capture biomechanical invariants that persist across users, limiting their generalization capability. We propose an Edge-Enhanced Graph-Based Adversarial Domain Generalization (EEG-ADG) framework that integrates anatomical correlation knowledge into a unified graph neural network (GNN) architecture. By jointly modeling three biomechanically motivated relationships (Interconnected Units, Analogous Units, and Lateral Units), our method encodes domain-invariant features while addressing user-specific variability through a Variational Edge Feature Extractor. A Gradient Reversal Layer (GRL) enforces adversarial domain generalization, ensuring robustness to unseen users. Extensive experiments on OPPORTUNITY and DSADS datasets demonstrate state-of-the-art performance. Our work bridges biomechanical principles with graph-based adversarial learning by integrating information fusion techniques. This fusion of information underpins our unified and generalized model for cross-user HAR.

Paperid: 702, https://arxiv.org/pdf/2505.03132.pdf

Abstract:
Real-world machine learning models require rigorous evaluation before deployment, especially in safety-critical domains like autonomous driving and surveillance. The evaluation of machine learning models often focuses on data slices, which are subsets of the data that share a set of characteristics. Data slice finding automatically identifies conditions or data subgroups where models underperform, aiding developers in mitigating performance issues. Despite its popularity and effectiveness, data slicing for vision model validation faces several challenges. First, data slicing often needs additional image metadata or visual concepts, and falls short in certain computer vision tasks, such as object detection. Second, understanding data slices is a labor-intensive and mentally demanding process that heavily relies on the expert's domain knowledge. Third, data slicing lacks a human-in-the-loop solution that allows experts to form hypotheses and test them interactively. To overcome these limitations and better support the machine learning operations lifecycle, we introduce VISLIX, a novel visual analytics framework that employs state-of-the-art foundation models to help domain experts analyze slices in computer vision models. Our approach does not require image metadata or visual concepts, automatically generates natural language insights, and allows users to test data slice hypotheses interactively. We evaluate VISLIX with an expert study and three use cases that demonstrate the effectiveness of our tool in providing comprehensive insights for validating object detection models.

Paperid: 703, https://arxiv.org/pdf/2505.00945.pdf

Abstract:
Large language model (LLM)-based agents have emerged as pivotal tools in assisting human experts across various fields by transforming complex tasks into more efficient workflows and providing actionable stakeholder insights. Despite their potential, the application of LLM-based agents for medical education remains underexplored. The study aims to assist in evaluating the students' process and outcomes on medical case diagnosis and discussion while incorporating the theoretical framework of Socially Shared Regulation of Learning (SSRL) to assess student performance. SSRL emphasizes metacognitive, cognitive, motivational, and emotional interactions, highlighting the collaborative management of learning processes to improve decision-making outcomes. Grounded in SSRL theory, this tool paper introduces SSRLBot, an LLM-based agent designed to enable team members to reflect on their diagnostic performance and the key SSRL skills that foster team success. SSRLBot's core functions include summarizing dialogue content, analyzing participants' SSRL skills, and evaluating students' diagnostic results. Meanwhile, we evaluated SSRLBot through diagnostic conversation data collected from six groups (12 participants, 1926 conversational turns). Results showed that SSRLBot can deliver detailed, theory-aligned evaluations, link specific behaviors to SSRL dimensions, and offer actionable recommendations for improving teamwork. The findings address a critical gap in medical education, advancing the application of LLM agents to enhance team-based decision-making and collaboration in high-stakes environments.

Paperid: 704, https://arxiv.org/pdf/2504.21507.pdf

Abstract:
Pre-trained language models have been widely exploited to learn dense representations of documents and queries for information retrieval. While previous efforts have primarily focused on improving effectiveness and user satisfaction, response time remains a critical bottleneck of conversational search systems. To address this, we exploit the topical locality inherent in conversational queries, i.e., the tendency of queries within a conversation to focus on related topics. By leveraging query embedding similarities, we dynamically restrict the search space to semantically relevant document clusters, reducing computational complexity without compromising retrieval quality. We evaluate our approach on the TREC CAsT 2019 and 2020 datasets using multiple embedding models and vector indexes, achieving improvements in processing speed of up to 10.4X with little loss in performance (4.4X without any loss). Our results show that the proposed system effectively handles complex, multiturn queries with high precision and efficiency, offering a practical solution for real-time conversational search.
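
The idea of restricting the search space to relevant document clusters can be sketched as follows. This is a simplified illustration, not the authors' system: it assumes documents are already grouped into clusters with precomputed centroids, selects clusters by centroid similarity to the current query only, and uses hypothetical 2-dimensional embeddings and function names. How the paper exploits similarities between successive queries in a conversation is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def restricted_search(query, centroids, clusters, n_clusters=1, top_k=2):
    """Search only the documents in the clusters whose centroids best match the query."""
    # Rank clusters by centroid similarity and keep the n_clusters best.
    ranked = sorted(range(len(centroids)), key=lambda i: cosine(query, centroids[i]), reverse=True)
    candidates = [doc for i in ranked[:n_clusters] for doc in clusters[i]]
    # Exhaustive scoring, but only over the reduced candidate set.
    candidates.sort(key=lambda doc: cosine(query, doc[1]), reverse=True)
    return [doc_id for doc_id, _ in candidates[:top_k]]


# Hypothetical 2-D embeddings: two clusters of (doc_id, vector) pairs (illustration only).
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = [
    [("d1", [0.9, 0.1]), ("d2", [0.8, 0.3])],
    [("d3", [0.1, 0.9]), ("d4", [0.2, 0.7])],
]
results = restricted_search([1.0, 0.05], centroids, clusters)
print(results)
```

Only the two documents in the selected cluster are scored here, rather than all four; the speed-up in a real system comes from the same effect at much larger scale, at the cost of possibly missing relevant documents that sit in unselected clusters.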

Paperid: 705, https://arxiv.org/pdf/2504.20320.pdf

Abstract:
We investigate the role of large language models (LLMs) in supporting mental health by analyzing Reddit posts and comments about mental health conversations with ChatGPT. Our findings reveal that users value ChatGPT as a safe, non-judgmental space, often favoring it over human support due to its accessibility, availability, and knowledgeable responses. ChatGPT provides a range of support, including actionable advice, emotional support, and validation, while helping users better understand their mental states. Additionally, we found that ChatGPT offers innovative support for individuals facing mental health challenges, such as assistance in navigating difficult conversations, preparing for therapy sessions, and exploring therapeutic interventions. However, users also voiced potential risks, including the spread of incorrect health advice, ChatGPT's overly validating nature, and privacy concerns. We discuss the implications of LLMs as tools for mental health support in both everyday health and clinical therapy settings and suggest strategies to mitigate risks in LLM-powered interactions.

Paperid: 706, https://arxiv.org/pdf/2504.19037.pdf

Abstract:
Large language models (LLMs) are being increasingly adopted for programming work. Prior work shows that while LLMs accelerate task completion for professional programmers, beginning programmers struggle to prompt models effectively. However, prompting is just half of the code generation process -- when code is generated, it must be read, evaluated, and integrated (or rejected). How accessible are these tasks for beginning programmers? This paper measures how well beginners comprehend LLM-generated code and explores the challenges students face in judging code correctness. We compare how well students understand natural language descriptions of functions and LLM-generated implementations, studying 32 CS1 students on 160 task instances. Our results show a low per-task success rate of 32.5%, with indiscriminate struggles across demographic populations. Key challenges include barriers for non-native English speakers, unfamiliarity with Python syntax, and automation bias. Our findings highlight the barrier that code comprehension presents to beginning programmers seeking to write code with LLMs.

Paperid: 707, https://arxiv.org/pdf/2504.18817.pdf

Abstract:
As centralized social media platforms face growing concerns, more users are seeking greater control over their social feeds and turning to decentralized alternatives such as Mastodon. The decentralized nature of Mastodon creates unique opportunities for customizing feeds, yet user perceptions and curation strategies on these platforms remain unknown. This paper presents findings from a two-part interview study with 21 Mastodon users, exploring how they perceive, interact with, and manage their current feeds, and how we can better empower users to personalize their feeds on Mastodon. We use the qualitative findings of the first part of the study to guide the creation of Braids, a web-based prototype for feed curation. Results from the second part of our study, using Braids, highlighted opportunities and challenges for future research, particularly in using seamful design to enhance people's acceptance of algorithmic curation and nuanced trade-offs between machine learning-based and rule-based curation algorithms. To optimize user experience, we also discuss the tension between creating new apps and building add-ons in the decentralized social media realm.

Paperid: 708, https://arxiv.org/pdf/2504.15533.pdf

Abstract:
In this position paper, we discuss the paradigm shift that has emerged in the literature, which suggests moving away from restrictive and authoritarian parental mediation approaches and toward resilience-based and privacy-preserving solutions to promote adolescents' online safety. We highlight the limitations of restrictive mediation strategies, which often induce a trade-off between teens' privacy and online safety, and call for more teen-centric frameworks that can empower teens to self-regulate while using the technology in meaningful ways. We also present an overview of empirical studies that conceptualized and examined resilience-based approaches to promoting the digital well-being of teens in ways that empower teens to be more resilient.

Paperid: 709, https://arxiv.org/pdf/2504.14822.pdf

Abstract:
Systematic reviews (SRs) are vital for evidence-based practice in high-stakes disciplines, such as healthcare, but are often impeded by intensive labor and lengthy processes that can take months to complete. Due to the high demand for domain expertise, existing automatic summarization methods fail to accurately identify relevant studies and generate high-quality summaries. To that end, we introduce InsightAgent, a human-centered interactive AI agent powered by large language models that revolutionizes this workflow. InsightAgent partitions a large literature corpus based on semantics and employs a multi-agent design for more focused processing of literature, leading to significant improvement in the quality of generated SRs. InsightAgent also provides intuitive visualizations of the corpus and agent trajectories, allowing users to effortlessly monitor the actions of the agent and provide real-time feedback based on their expertise. Our user studies with 9 medical professionals demonstrate that the visualization and interaction mechanisms can effectively improve the quality of synthesized SRs by 27.2%, reaching 79.7% of human-written quality. At the same time, user satisfaction is improved by 34.4%. With InsightAgent, it only takes a clinician about 1.5 hours, rather than months, to complete a high-quality systematic review.

Paperid: 710, https://arxiv.org/pdf/2504.14695.pdf

Abstract:
Flipped classrooms promote active learning by having students engage with materials independently before class, allowing in-class time for collaborative problem-solving. During this pre-class phase, asynchronous online discussions help students build knowledge and clarify concepts with peers. However, it remains difficult for students to engage with temporally dispersed peer contributions, connect discussions with static learning materials, and prepare for in-class sessions based on their self-learning outcomes. Our formative study identified cognitive challenges students encounter, including navigation barriers, reflection gaps, and contribution difficulty and anxiety. We present GLITTER, an AI-assisted discussion platform for pre-class learning in flipped classrooms. GLITTER helps students identify posts with shared conceptual dimensions, scaffold knowledge integration through conceptual blending, and enhance metacognition via personalized reflection reports. A within-subjects lab study (n = 12) demonstrates that GLITTER improves discussion engagement, sparks new ideas, supports reflection, and increases preparedness for in-class activities.

Paperid: 711, https://arxiv.org/pdf/2504.13944.pdf

Abstract:
The NIME conference traditionally focuses on interfaces for music and musical expression. In this paper we reverse this tradition to ask, can interfaces developed for music be successfully appropriated to non-musical applications? To help answer this question we designed and developed a new device, which uses interface metaphors borrowed from analogue synthesisers and audio mixing to physically control the intangible aspects of a Large Language Model. We compared two versions of the device, with and without the audio-inspired augmentations, with a group of artists who used each version over a one week period. Our results show that the use of audio-like controls afforded more immediate, direct and embodied control over the LLM, allowing users to creatively experiment and play with the device over its non-mixer counterpart. Our project demonstrates how cross-sensory metaphors can support creative thinking and embodied practice when designing new technological interfaces.

Paperid: 712, https://arxiv.org/pdf/2504.13389.pdf

Abstract:
Despite the growing research on users' perceptions of health AI, adolescents' perspectives remain underexplored. This study explores adolescents' perceived benefits and risks of health AI technologies in clinical and personal health settings. Employing Design Fiction, we conducted interviews with 16 adolescents (aged 13-17) using four fictional design scenarios that represent current and future health AI technologies as probes. Our findings reveal that with a positive yet cautious attitude, adolescents envision unique benefits and risks specific to their age group. While health AI technologies were seen as valuable learning resources, they also raised concerns about confidentiality with their parents. Additionally, we identified several factors, such as severity of health conditions and previous experience with AI, influencing their perceptions of trust and privacy in health AI. We explore how these insights can inform the future design of health AI technologies to support learning, engagement, and trust as adolescents navigate their healthcare journey.

Paperid: 713, https://arxiv.org/pdf/2504.12320.pdf

Abstract:
Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs -- including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek -- across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.

Paperid: 714, https://arxiv.org/pdf/2504.10793.pdf

Abstract:
Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer's voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues onto incoming speech without any additional electronics. It attaches to the in-line mic of low-cost wired earphones which can be attached to smartphones. We present an end-to-end neural network that processes the raw audio mixtures in real-time on mobile devices. Our results show that SonicSieve achieves a signal quality improvement of 5.0 dB when focusing on a 30° angular region. Additionally, the performance of our system based on only two microphones exceeds that of conventional 5-microphone arrays.

Paperid: 715, https://arxiv.org/pdf/2504.09865.pdf

Abstract:
As generative artificial intelligence (AI) enables the creation and dissemination of information at massive scale and speed, it is increasingly important to understand how people perceive AI-generated content. One prominent policy proposal requires explicitly labeling AI-generated content to increase transparency and encourage critical thinking about the information, but prior research has not yet tested the effects of such labels. To address this gap, we conducted a survey experiment (N=1601) on a diverse sample of Americans, presenting participants with an AI-generated message about several public policies (e.g., allowing colleges to pay student-athletes), randomly assigning whether participants were told the message was generated by (a) an expert AI model, (b) a human policy expert, or (c) no label. We found that messages were generally persuasive, influencing participants' views of the policies by 9.74 percentage points on average. However, while 94.6% of participants assigned to the AI and human label conditions believed the authorship labels, labels had no significant effects on participants' attitude change toward the policies, judgments of message accuracy, or intentions to share the message with others. These patterns were robust across a variety of participant characteristics, including prior knowledge of the policy, prior experience with AI, political party, education level, and age. Taken together, these results imply that, while authorship labels would likely enhance transparency, they are unlikely to substantially affect the persuasiveness of the labeled content, highlighting the need for alternative strategies to address challenges posed by AI-generated information.

Paperid: 716, https://arxiv.org/pdf/2504.09846.pdf

Abstract:
Frequent and long-term exposure to hyperglycemia (i.e., high blood glucose) increases the risk of chronic complications such as neuropathy, nephropathy, and cardiovascular disease. Current technologies like continuous subcutaneous insulin infusion (CSII) and continuous glucose monitoring (CGM) primarily model specific aspects of glycemic control, such as hypoglycemia prediction or insulin delivery. Similarly, most digital twin approaches in diabetes management simulate only physiological processes. These systems lack the ability to offer alternative treatment scenarios that support proactive behavioral interventions. To address this, we propose GlyTwin, a novel digital twin framework that uses counterfactual explanations to simulate optimal treatments for glucose regulation. Our approach helps patients and caregivers modify behaviors like carbohydrate intake and insulin dosing to avoid abnormal glucose events. GlyTwin generates behavioral treatment suggestions that proactively prevent hyperglycemia by recommending small adjustments to daily choices, reducing both frequency and duration of these events. Additionally, it incorporates stakeholder preferences into the intervention design, making recommendations patient-centric and tailored. We evaluate GlyTwin on AZT1D, a newly constructed dataset with longitudinal data from 21 type 1 diabetes (T1D) patients on automated insulin delivery systems over 26 days. Results show GlyTwin outperforms state-of-the-art counterfactual methods, generating 76.6% valid and 86% effective interventions. These findings demonstrate the promise of counterfactual-driven digital twins in delivering personalized healthcare.

Paperid: 717, https://arxiv.org/pdf/2504.09612.pdf

Abstract:
Infrastructure is an indispensable part of human life. Over the past decades, the Human-Computer Interaction (HCI) community has paid increasing attention to human interactions with infrastructure. In this paper, we conducted a systematic literature review on infrastructure studies in SIGCHI, one of the most influential communities in HCI. We collected a total of 190 primary studies, covering works published between 2006 and 2024. Most of these studies are inspired by Susan Leigh Star's notion of infrastructure. We identify three major themes in infrastructure studies: growing infrastructure, appropriating infrastructure, and coping with infrastructure. Our review highlights a prevailing trend in SIGCHI's infrastructure research: a focus on informal infrastructural activities across various sociotechnical contexts. In particular, we examine studies that problematize infrastructure and alert the HCI community to its potentially harmful aspects.

Paperid: 718, https://arxiv.org/pdf/2504.07840.pdf

Abstract:
Large Language Models (LLMs) have transformed human-computer interaction by enabling natural language-based communication with AI-powered chatbots. These models are designed to be intuitive and user-friendly, allowing users to articulate requests with minimal effort. However, despite their accessibility, studies reveal that users often struggle with effective prompting, resulting in inefficient responses. Existing research has highlighted both the limitations of LLMs in interpreting vague or poorly structured prompts and the difficulties users face in crafting precise queries. This study investigates learner-AI interactions through an educational experiment in which participants receive structured guidance on effective prompting. We introduce and compare three types of prompting guidelines: a task-specific framework developed through a structured methodology and two baseline approaches. To assess user behavior and prompting efficacy, we analyze a dataset of 642 interactions from 107 users. Using Von NeuMidas, an extended pragmatic annotation schema for LLM interaction analysis, we categorize common prompting errors and identify recurring behavioral patterns. We then evaluate the impact of different guidelines by examining changes in user behavior, adherence to prompting strategies, and the overall quality of AI-generated responses. Our findings provide a deeper understanding of how users engage with LLMs and the role of structured prompting guidance in enhancing AI-assisted communication. By comparing different instructional frameworks, we offer insights into more effective approaches for improving user competency in AI interactions, with implications for AI literacy, chatbot usability, and the design of more responsive AI systems.

Paperid: 719, https://arxiv.org/pdf/2504.06031.pdf

Abstract:
In this work, we evaluate the feasibility of socially assistive virtual agent-based cognitive training for people with intellectual disabilities (ID) in a sheltered workshop. The RoboCamp system, originally developed for children with Attention Deficit Hyperactivity Disorder (ADHD), is adapted based on the results of a pilot study in which we identified barriers and collected feedback from workshop staff. In a subsequent study, we investigate usability, technical reliability, attention training capabilities, and the novelty effect to assess the feasibility of integrating the RoboCamp system.

Paperid: 720, https://arxiv.org/pdf/2504.02217.pdf

Abstract:
Multimodal Large Language Models (MLLMs) can interpret data visualizations, but what makes a visualization understandable to these models? Do factors like color, shape, and text influence legibility, and how does this compare to human perception? In this paper, we build on prior work to systematically assess which visualization characteristics impact MLLM interpretability. We expanded the Visualization Literacy Assessment Test (VLAT) test set from 12 to 380 visualizations by varying plot types, colors, and titles. This allowed us to statistically analyze how these features affect model performance. Our findings suggest that while color palettes have no significant impact on accuracy, plot types and the type of title significantly affect MLLM performance. We observe similar trends for model omissions. Based on these insights, we look into which plot types are beneficial for MLLMs in different tasks and propose visualization design principles that enhance MLLM readability. Additionally, we make the extended VLAT test set, VLAT ex, publicly available at https://osf.io/ermwx/ together with our supplemental material for future model testing and evaluation.

Paperid: 721, https://arxiv.org/pdf/2504.02149.pdf

Abstract:
The growing use of smart home devices poses considerable privacy and security challenges, especially for individuals like migrant domestic workers (MDWs) who may be surveilled by their employers. This paper explores the privacy and security challenges experienced by MDWs in multi-user smart homes through in-depth semi-structured interviews with 26 MDWs and 5 staff members of agencies that recruit and/or train domestic workers in China. Our findings reveal that the relationships between MDWs, their employers, and agencies are characterized by significant power imbalances, influenced by Chinese cultural and social factors (such as Confucianism and collectivism), as well as legal ones. Furthermore, the widespread and normalized use of surveillance technologies in China, particularly in public spaces, exacerbates these power imbalances, reinforcing a sense of constant monitoring and control. Drawing on our findings, we provide recommendations to domestic worker agencies and policymakers to address the privacy and security challenges facing MDWs in Chinese smart homes.

Paperid: 722, https://arxiv.org/pdf/2504.00843.pdf

Abstract:
Mathematics learning entails mastery of both content knowledge and the cognitive processes of knowing, applying, and reasoning with it. Automated math assessment has primarily focused on grading students' exhibition of content knowledge by finding textual evidence, such as specific numbers, formulas, and statements. Recent advancements in problem-solving, image recognition, and reasoning capabilities of large language models (LLMs) show promise for nuanced evaluation of students' cognitive skills. Diagnosing cognitive skills requires inferring students' thinking processes beyond textual evidence, which is an underexplored task in LLM-based automated assessment. In this work, we investigate how state-of-the-art LLMs diagnose students' cognitive skills in mathematics. We constructed MathCog, a novel benchmark dataset comprising 639 student responses to 110 expert-curated middle school math problems, each annotated with detailed teachers' diagnoses based on cognitive skill checklists. Using MathCog, we evaluated 16 closed and open LLMs of varying model sizes and vendors. Our evaluation reveals that even the state-of-the-art LLMs struggle with the task, with all F1 scores below 0.5, and tend to exhibit strong false confidence for incorrect cases ($r_s=.617$). We also found that model size positively correlates with the diagnosis performance ($r_s=.771$). Finally, we discuss the implications of these findings, the overconfidence issue, and directions for improving automated cognitive skill diagnosis.
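
The $r_s$ values reported above are presumably Spearman rank correlations. As a rough illustration of how such a coefficient is computed, the following minimal Python sketch ranks both variables (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the example numbers are hypothetical and are not taken from the paper or its data.

```python
def rank(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical model sizes (billions of parameters) and F1 scores,
# invented purely for illustration.
sizes = [1, 7, 13, 70]
f1_scores = [0.10, 0.20, 0.15, 0.40]
print(spearman(sizes, f1_scores))  # approximately 0.8
```

Because the coefficient depends only on rank order, it captures any monotonic relationship between model size and performance, not just a linear one.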

Paperid: 723, https://arxiv.org/pdf/2503.23760.pdf

Abstract:
This research addresses the question of which characteristics a cognitive architecture must have to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL). To provide context, we first discuss Interactive Task Learning (ITL), the mechanisms of the human memory system, and the significance of natural language and multi-modality. Next, we examine the current state of cognitive architectures, analyzing their capabilities to inform a concept of CCTL grounded in multiple sources. We then integrate insights from various research domains to develop a unified framework. Finally, we conclude by identifying the remaining challenges and requirements necessary to achieve CCTL in Human-Robot Interaction (HRI).

Paperid: 724, https://arxiv.org/pdf/2503.21365.pdf

Abstract:
Current AI counseling systems struggle with maintaining effective long-term client engagement. Through formative research with counselors and a systematic literature review, we identified five key design considerations for AI counseling interactions. Based on these insights, we propose CA+, a Cognition Augmented counselor framework enhancing contextual understanding through three components: (1) Therapy Strategies Module: Implements hierarchical Goals-Session-Action planning with bidirectional adaptation based on client feedback; (2) Communication Form Module: Orchestrates parallel guidance and empathy pathways for balanced therapeutic progress and emotional resonance; (3) Information Management: Utilizes client profile and therapeutic knowledge databases for dynamic, context-aware interventions. A three-day longitudinal study with 24 clients demonstrates CA+'s significant improvements in client engagement, perceived empathy, and overall satisfaction compared to a baseline system. In addition, two licensed counselors confirm its high professionalism. Our research demonstrates the potential for enhancing LLM engagement in psychological counseling dialogues through cognitive theory, which may inspire further innovations in computational interaction.

Paperid: 725, https://arxiv.org/pdf/2503.19692.pdf

Abstract:
Understanding how scaffolding strategies influence human understanding in human-robot interaction is important for developing effective assistive systems. This empirical study investigates linguistic scaffolding strategies based on negation, an important means that de-biases the user from potential errors but increases processing costs, and on hesitations, a means to ameliorate processing costs. In an adaptive strategy, the user state with respect to the current state of understanding and processing capacity was estimated via a scoring scheme based on task performance, prior scaffolding strategy, and current eye gaze behavior. In the study, the adaptive strategy of providing negations and hesitations was compared with a non-adaptive strategy of providing only affirmations. The adaptive scaffolding strategy was generated using the computational model SHIFT. Our findings indicate that using adaptive scaffolding strategies with SHIFT tends to (1) increase processing costs, as reflected in longer reaction times, but (2) improve task understanding, as evidenced by a lower error rate of almost 23%. We assessed the efficiency of SHIFT's selected scaffolding strategies across different cognitive states, finding that in three out of five states, the error rate was lower compared to the baseline condition. We discuss how these results align with the assumptions of the SHIFT model and highlight areas for refinement. Moreover, we demonstrate how scaffolding strategies, such as negation and hesitation, contribute to more effective human-robot explanatory dialogues.

Paperid: 726, https://arxiv.org/pdf/2503.16463.pdf

Abstract:
Recent advances in large language models (LLMs) have shown promising results in medical diagnosis, with some studies indicating superior performance compared to human physicians in specific scenarios. However, the diagnostic capabilities of LLMs are often overestimated, as their performance significantly deteriorates in interactive diagnostic settings that require active information gathering. This study investigates the underlying mechanisms behind the performance degradation phenomenon and proposes a solution. We identified that the primary deficiency of LLMs lies in the initial diagnosis phase, particularly in information-gathering efficiency and initial diagnosis formation, rather than in the subsequent differential diagnosis phase. To address this limitation, we developed a plug-and-play method enhanced (PPME) LLM agent, leveraging over 3.5 million electronic medical records from Chinese and American healthcare facilities. Our approach integrates specialized models for initial disease diagnosis and inquiry into the history of the present illness, trained through supervised and reinforcement learning techniques. The experimental results indicate that the PPME LLM achieved over 30% improvement compared to baselines. The final diagnostic accuracy of the PPME LLM in interactive diagnostic scenarios approached levels comparable to those achieved using complete clinical data. These findings suggest a promising potential for developing autonomous diagnostic systems, although further validation studies are needed.

Paperid: 727, https://arxiv.org/pdf/2503.16447.pdf

Abstract:
In this work, we present a domain-independent approach for adaptive scaffolding in robotic explanation generation to guide tasks in human-robot interaction. We present a method for incorporating interdisciplinary research results into a computational model as a pre-configured scoring system implemented in a framework called SHIFT. This involves outlining a procedure for integrating concepts from disciplines outside traditional computer science into a robotics computational framework. Our approach allows us to model the human cognitive state as six observable states within the human partner model. To study the pre-configuration of the system, we implement a reinforcement learning approach on top of our model. This approach allows adaptation to individuals who deviate from the configuration of the scoring system. In our proof-of-concept evaluation on four different user types, the model's adaptation performs better with our pre-configured scoring system than without it, i.e., it recovers faster after exploration and achieves a higher accumulated reward. We discuss further strategies for speeding up the learning phase to enable realistic adaptation to real users. The system is accessible through Docker and supports querying via ROS.

Paperid: 728, https://arxiv.org/pdf/2503.16443.pdf

Abstract:
Pedestrian safety is a critical public health priority, with pedestrian fatalities accounting for 18% of all U.S. traffic deaths in 2022. The rising prevalence of distracted walking, exacerbated by mobile device use, poses significant risks at signalized intersections. This study utilized an immersive virtual reality (VR) environment to simulate real-world traffic scenarios and assess pedestrian behavior under three conditions: undistracted crossing, crossing while using a mobile device, and crossing with light-emitting diode (LED) safety interventions. Analysis using ANOVA models identified speed and mobile-focused eye-tracking as significant predictors of crossing duration, revealing how distractions impair situational awareness and response times. While LED measures reduced delays, their limited effectiveness highlights the need for integrated strategies addressing both behavioral and physical factors. This study showcases VR's potential to analyze complex pedestrian behaviors, offering actionable insights for urban planners and policymakers aiming to enhance pedestrian safety.

Paperid: 729, https://arxiv.org/pdf/2503.15637.pdf

Abstract:
Mobile sensing is ubiquitous and offers opportunities to gain insight into state mental health functioning. Detecting state elevations in social anxiety would be especially useful given this phenomenon is highly prevalent and impairing, but often not disclosed. Although anxiety is highly dynamic, fluctuating rapidly over the course of minutes, most work to date has examined anxiety at a scale of hours, days, or longer. In the present work, we explore the feasibility of detecting fluctuations in state social anxiety among N = 46 undergraduate students with elevated symptoms of trait social anxiety. Participants engaged in two dyadic and two group social interactions via Zoom. We evaluated participants' state anxiety levels as they anticipated, immediately after experiencing, and upon reflecting on each social interaction, spanning a time frame of 2-6 minutes. We collected biobehavioral features (i.e., PPG, EDA, skin temperature, and accelerometer) via Empatica E4 devices as participants took part in the varied social contexts (e.g., dyadic vs. group; anticipating vs. experiencing the interaction; experiencing varying levels of social evaluation). We additionally measured their trait mental health functioning. Mixed-effect logistic regression and leave-one-subject-out machine learning modeling indicated biobehavioral features significantly predict state fluctuations in anxiety, though balanced accuracy tended to be modest (59%). However, our capacity to identify instances of heightened versus low state anxiety significantly increased (with balanced accuracy ranging from 69% to 84% across different operationalizations of state anxiety) when we integrated contextual data alongside trait mental health functioning into our predictive models. We discuss these and other findings in the context of the broader anxiety detection literature.
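
To make the two evaluation terms above concrete, here is a minimal Python sketch of balanced accuracy (the mean of per-class recall) and of leave-one-subject-out splitting (each participant's samples are held out in turn). It is an illustrative sketch under simple assumptions, not the authors' pipeline; the function names, labels, and subject identifiers are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    for subject in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        yield subject, train, test


# Hypothetical labels: a classifier that always predicts "high anxiety" (1)
# reaches 80% plain accuracy here, but only 50% balanced accuracy.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
print(balanced_accuracy(y_true, y_pred))

# Hypothetical subject identifiers, one per sample.
for subject, train, test in leave_one_subject_out(["a", "a", "b", "c", "c"]):
    print(subject, train, test)
```

Holding out whole subjects keeps a participant's samples from appearing in both training and test sets, which would otherwise inflate the apparent performance on unseen people.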

Paperid: 730, https://arxiv.org/pdf/2503.15512.pdf

Abstract:
Modern machine learning produces models that are impossible for users or developers to fully understand -- raising concerns about trust, oversight, safety, and human dignity when they are integrated into software products. Transparency and explainability methods aim to provide some help in understanding models, but it remains challenging for developers to design explanations that are understandable to target users and effective for their purpose. Emerging guidelines and regulations set goals but may not provide effective actionable guidance to developers. In a large-scale experiment with 124 participants, we explored how developers approach providing end-user explanations, including what challenges they face, and to what extent specific policies can guide their actions. We investigated whether and how specific forms of policy guidance help developers design explanations and provide evidence for policy compliance for an ML-powered screening tool for diabetic retinopathy. Participants across the board struggled to produce quality explanations and comply with the provided policies. Contrary to our expectations, we found that the nature and specificity of policy guidance had little effect. We posit that participant noncompliance is in part due to a failure to imagine and anticipate the needs of non-technical stakeholders. Drawing on cognitive process theory and the sociological imagination to contextualize participants' failure, we recommend educational interventions.

Paperid: 731, https://arxiv.org/pdf/2503.15496.pdf

Abstract:
This paper presents the implementation and evaluation of a conversational agent designed for multi-party open-ended interactions. Leveraging state-of-the-art technologies such as voice direction of arrival, voice recognition, face tracking, and large language models, the system aims to facilitate natural and intuitive human-robot conversations. Deployed on the Furhat robot, the system was tested with 30 participants engaging in open-ended group conversations and then in two overlapping discussions. Quantitative metrics, such as latencies and recognition accuracy, along with qualitative measures from user questionnaires, were collected to assess performance. The results highlight the system's effectiveness in managing multi-party interactions, though improvements are needed in response relevance and latency. This study contributes valuable insights for advancing human-robot interaction, particularly in enhancing the naturalness and engagement in group conversations.

Paperid: 732, https://arxiv.org/pdf/2503.13975.pdf

Abstract:
Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding -- the process by which conversation participants establish mutual understanding -- can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, we find that early grounding failures predict later interaction breakdowns. Building on these insights, we introduce Rifts, a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on Rifts, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention aimed at mitigating grounding failures.

Paperid: 733, https://arxiv.org/pdf/2503.10707.pdf

Abstract:
Cancer survivors face unique emotional challenges that impact their quality of life. Mobile diary entries provide a promising method for tracking emotional states, improving self-awareness, and promoting well-being outcomes. This paper aims to understand, through mobile diaries, cancer survivors' emotional states and key variables related to just-in-time intervention opportunities, including the desire to regulate emotions and the availability to engage in interventions. Although emotion analysis tools show potential for recognizing emotions from text, current methods lack the contextual understanding necessary to interpret brief mobile diary narratives. Our analysis of diary entries from cancer survivors (N=407) reveals systematic relationships between described contexts and emotional states, with administrative and health-related contexts associated with negative affect and regulation needs, while leisure activities promote positive emotions. We propose CALLM, a Context-Aware framework leveraging Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to analyze these brief entries by integrating retrieved peer experiences and personal diary history. CALLM demonstrates strong performance with balanced accuracies reaching 72.96% for positive affect, 73.29% for negative affect, 73.72% for emotion regulation desire, and 60.09% for intervention availability, outperforming language model baselines. Post-hoc analysis reveals that model confidence strongly predicts accuracy, with longer diary entries generally enhancing performance, and brief personalization periods yielding meaningful improvements. Our findings demonstrate how contextual information in mobile diaries can be effectively leveraged to understand emotional experiences, predict key states, and identify optimal intervention moments for personalized just-in-time support.

Paperid: 734, https://arxiv.org/pdf/2503.10241.pdf

Abstract:
Multimodal information-gathering settings, where users collaborate with AI in dynamic environments, are increasingly common. These involve complex processes with textual and multimodal interactions, often requiring additional structural information via cost-incurring requests. AI helpers lack access to users' true goals, beliefs, and preferences and struggle to integrate diverse information effectively. We propose a social continual learning framework for causal knowledge acquisition and collaborative decision-making. It focuses on autonomous agents learning through dialogues, question-asking, and interaction in open, partially observable environments. A key component is a natural language oracle that answers the agent's queries about environmental mechanisms and states, refining causal understanding while balancing exploration (learning) and exploitation (knowledge use). Evaluation tasks inspired by developmental psychology emphasize causal reasoning and question-asking skills. They complement benchmarks by assessing the agent's ability to identify knowledge gaps, generate meaningful queries, and incrementally update reasoning. The framework also evaluates how knowledge acquisition costs are amortized across tasks within the same environment. We propose two architectures: 1) a system combining Large Language Models (LLMs) with the ReAct framework and question-generation, and 2) an advanced system with a causal world model (symbolic, graph-based, or subsymbolic) for reasoning and decision-making. The latter builds a causal knowledge graph for efficient inference and adaptability under constraints. Challenges include integrating causal reasoning into ReAct and optimizing exploration and question-asking in error-prone scenarios. Beyond applications, this framework models developmental processes combining causal reasoning, question generation, and social learning.

Paperid: 735, https://arxiv.org/pdf/2503.09824.pdf

Abstract:
As we progress toward Society 5.0's vision of a human-centered digital society, ensuring digital accessibility becomes increasingly critical, particularly for citizens with visual impairments and other disabilities. This paper examines the implementation challenges of accessible digital public services within Swiss public administration. Through Design Science Research, we investigate the gap between accessibility legislation and practical implementation, analyzing how current standards translate into real-world usability. Our research reveals significant barriers including resource constraints, fragmented policy enforcement, and limited technical expertise. To address these challenges, we present the Inclusive Public Administration Framework, which integrates Web Content Accessibility Guidelines with the HERMES project management methodology. This framework provides a structured approach to embedding accessibility considerations throughout digital service development. Our findings contribute to the discourse on digital inclusion in Society 5.0 by providing actionable strategies for implementing accessible public services. As we move towards a more integrated human-machine society, ensuring digital accessibility for visually impaired citizens is crucial for building an equitable and inclusive digital future.

Paperid: 736, https://arxiv.org/pdf/2503.03570.pdf

Abstract:
Traditional XR and Metaverse applications prioritize user experience (UX) for adoption and success but often overlook a crucial aspect of user interaction: emotions. This article addresses this gap by presenting an emotion-aware Metaverse application: a Virtual Reality (VR) fire drill simulator designed to prepare crews for shipboard emergencies. The simulator detects emotions in real time, assessing trainees' responses under stress to improve learning outcomes. Its architecture incorporates eye-tracking and facial expression analysis via Meta Quest Pro headsets. The system features four levels whose difficulty increases progressively to evaluate user decision-making and emotional resilience. The system was evaluated in two experimental phases. The first phase identified challenges, such as navigation issues and lack of visual guidance. These insights led to an improved second version with a better user interface, visual cues, and a real-time task tracker. Performance metrics like completion times, task efficiency, and emotional responses were analyzed. The results show that trainees with prior VR or gaming experience navigated the scenarios more efficiently. Moreover, the addition of task-tracking visuals and navigation guidance significantly improved user performance, reducing task completion times by between 14.18% and 32.72%. Emotional responses were captured, revealing that some participants were engaged, while others acted indifferently, indicating the need for more immersive elements. Overall, this article provides useful guidelines for creating the next generation of emotion-aware Metaverse applications.

Paperid: 737, https://arxiv.org/pdf/2503.02068.pdf

Abstract:
Fully autonomous teams of LLM-powered AI agents are emerging that collaborate to perform complex tasks for users. What challenges do developers face when trying to build and debug these AI agent teams? In formative interviews with five AI agent developers, we identify core challenges: difficulty reviewing long agent conversations to localize errors, lack of support in current tools for interactive debugging, and the need for tool support to iterate on agent configuration. Based on these needs, we developed an interactive multi-agent debugging tool, AGDebugger, with a UI for browsing and sending messages, the ability to edit and reset prior agent messages, and an overview visualization for navigating complex message histories. In a two-part user study with 14 participants, we identify common user strategies for steering agents and highlight the importance of interactive message resets for debugging. Our studies deepen understanding of interfaces for debugging increasingly important agentic workflows.

Paperid: 738, https://arxiv.org/pdf/2503.01903.pdf

Abstract:
The advent of Large Language Models (LLMs) offers potential solutions to address problems such as shortage of medical resources and low diagnostic consistency in psychiatric clinical practice. Despite this potential, a robust and comprehensive benchmarking framework to assess the efficacy of LLMs in authentic psychiatric clinical environments is absent. This has impeded the advancement of specialized LLMs tailored to psychiatric applications. In response to this gap, by incorporating clinical demands in psychiatry and clinical data, we proposed a benchmarking system, PsychBench, to evaluate the practical performance of LLMs in psychiatric clinical settings. We conducted a comprehensive quantitative evaluation of 16 LLMs using PsychBench, and investigated the impact of prompt design, chain-of-thought reasoning, input text length, and domain-specific knowledge fine-tuning on model performance. Through detailed error analysis, we identified strengths and potential limitations of the existing models and suggested directions for improvement. Subsequently, a clinical reader study involving 60 psychiatrists of varying seniority was conducted to further explore the practical benefits of existing LLMs as supportive tools. Through the quantitative and reader evaluation, we show that while existing models demonstrate significant potential, they are not yet adequate as decision-making tools in psychiatric clinical practice. The reader study further indicates that, as auxiliary tools, LLMs could provide particularly notable support for junior psychiatrists, effectively enhancing their work efficiency and overall clinical quality. To promote research in this area, we will make the dataset and evaluation framework publicly available, with the hope of advancing the application of LLMs in psychiatric clinical settings.

Paperid: 739, https://arxiv.org/pdf/2503.00337.pdf

Abstract:
Uncertainty inherently exists in the autonomous decision-making process of robots. Involving humans in resolving this uncertainty not only helps robots mitigate it but is also crucial for improving human-robot interactions. However, in public urban spaces filled with unpredictability, robots often face heightened uncertainty without direct human collaborators. This study investigates how robots can engage bystanders for assistance in public spaces when encountering uncertainty and examines how these interactions impact bystanders' perceptions and attitudes towards robots. We designed and tested a speculative 'peephole' concept that engages bystanders in resolving urban robot uncertainty. Our design is guided by considerations of non-intrusiveness and eliciting initiative in an implicit manner, considering bystanders' unique role as non-obligated participants in relation to urban robots. Drawing from field study findings, we highlight the potential of involving bystanders to mitigate urban robots' technological imperfections to both address operational challenges and foster public acceptance of urban robots. Furthermore, we offer design implications to encourage bystanders' involvement in mitigating the imperfections.

Paperid: 740, https://arxiv.org/pdf/2502.20243.pdf

Abstract:
Artificial Intelligence (AI) systems are frequently employed in online services to provide personalized experiences to users based on large collections of data. However, AI systems can be designed in different ways, with black-box AI systems appearing as complex data-processing engines and white-box AI systems appearing as fully transparent data-processors. As such, it is reasonable to assume that these different design choices also affect user perception and thus their willingness to share data. To this end, we conducted a pre-registered, scenario-based online experiment with 240 participants and investigated how transparent and non-transparent data-processing entities influenced data-sharing intentions. Surprisingly, our results revealed no significant difference in willingness to share data across entities, challenging the notion that transparency increases data-sharing willingness. Furthermore, we found that a general attitude of trust towards AI has a significant positive influence, especially in the transparent AI condition, whereas privacy concerns did not significantly affect data-sharing decisions.

Paperid: 741, https://arxiv.org/pdf/2502.19899.pdf

Abstract:
Motor skill learning often requires experienced professionals who can provide personalized instruction. Unfortunately, the availability of high-quality training can be limited for specialized tasks, such as high performance racing. Several recent works have leveraged AI-assistance to improve instruction of tasks ranging from rehabilitation to surgical robot tele-operation. However, these works often make simplifying assumptions on the student learning process, and fail to model how a teacher's assistance interacts with different individuals' abilities when determining optimal teaching strategies. Inspired by the idea of scaffolding from educational psychology, we leverage shared autonomy, a framework for combining user inputs with robot autonomy, to aid with curriculum design. Our key insight is that the way a student's behavior improves in the presence of assistance from an autonomous agent can highlight which sub-skills might be most "learnable" for the student, or within their Zone of Proximal Development. We use this to design Z-COACH, a method for using shared autonomy to provide personalized instruction targeting interpretable task sub-skills. In a user study (n=50), where we teach high performance racing in a simulated environment of the Thunderhill Raceway Park with the CARLA Autonomous Driving simulator, we show that Z-COACH helps identify which skills each student should first practice, leading to an overall improvement in driving time, behavior, and smoothness. Our work shows that increasingly available semi-autonomous capabilities (e.g. in vehicles, robots) can not only assist human users, but also help *teach* them.

Paperid: 742, https://arxiv.org/pdf/2502.19888.pdf

Abstract:
Despite diverse mobility needs worldwide, existing mapping tools fail to address the varied experiences of different mobility device users. This paper presents a large-scale online survey exploring how five mobility groups -- users of canes, walkers, mobility scooters, manual wheelchairs, and motorized wheelchairs -- perceive sidewalk barriers. Using 52 sidewalk barrier images, respondents evaluated their confidence in navigating each scenario. Our findings (N=190) reveal variations in barrier perceptions across groups, while also identifying shared concerns. To further demonstrate the value of this data, we showcase its use in two custom prototypes: a visual analytics tool and a personalized routing tool. Our survey findings and open dataset advance work in accessibility-focused maps, routing algorithms, and urban planning.

Paperid: 743, https://arxiv.org/pdf/2502.17645.pdf

Abstract:
Cannabis use among emerging adults is increasing globally, posing significant health risks and creating a need for effective interventions. We present an exploratory analysis of the MiWaves pilot study, a digital intervention aimed at supporting cannabis use reduction among emerging adults (ages 18-25). Our findings indicate the potential of self-monitoring check-ins and trend visualizations in fostering self-awareness and promoting behavioral reflection in participants. MiWaves intervention message timing and frequency were also generally well-received by the participants. Participants' perceptions of effort were queried for intervention messages with different tasks, and our findings suggest that messages with tasks like exploring links and typing in responses are perceived as requiring more effort than messages with tasks involving reading and acknowledging. Finally, we discuss the findings and limitations of this study and analysis, and their impact on informing future iterations of MiWaves.

Paperid: 744, https://arxiv.org/pdf/2502.13572.pdf

Abstract:
The human brain utilizes spikes for information transmission and dynamically reorganizes its network structure to boost energy efficiency and cognitive capabilities throughout its lifespan. Drawing inspiration from this spike-based computation, Spiking Neural Networks (SNNs) have been developed to construct event-driven models that emulate this efficiency. Despite these advances, deep SNNs continue to suffer from over-parameterization during training and inference, a stark contrast to the brain's ability to self-organize. Furthermore, existing sparse SNNs struggle to maintain optimal pruning levels because they use a static pruning ratio, resulting in either under- or over-pruning. In this paper, we propose a novel two-stage dynamic structure learning approach for deep SNNs, aimed at maintaining effective sparse training from scratch while optimizing compression efficiency. The first stage evaluates the compressibility of existing sparse subnetworks within SNNs using the PQ index, which facilitates an adaptive determination of the rewiring ratio for synaptic connections based on data compression insights. In the second stage, this rewiring ratio informs the dynamic synaptic connection rewiring process, including both pruning and regrowth. This approach significantly improves the exploration of sparse structure training in deep SNNs, adapting sparsity dynamically from the point of view of compression efficiency. Our experiments demonstrate that this sparse training approach not only matches the performance of current deep SNN models but also significantly improves the efficiency of compressing sparse SNNs. Crucially, it preserves the advantages of initiating training with sparse models and offers a promising solution for implementing edge AI on neuromorphic hardware.
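
As a rough illustration of the first stage, the sketch below computes a PQ index for a weight vector. It assumes the commonly used definition, 1 - d^(1/q - 1/p) * ||w||_p / ||w||_q with 0 < p < q, where d is the number of weights; the default values of p and q are illustrative, and the abstract does not state how the paper maps the index to a rewiring ratio, so that step is not shown.

```python
def pq_index(weights, p=0.5, q=1.0):
    """Sparsity measure of a weight vector (assumed PQ index definition).

    Returns 0 for a vector whose entries all have equal magnitude and
    approaches 1 - d**(1/q - 1/p) as the mass concentrates in one entry.
    """
    if not 0 < p < q:
        raise ValueError("requires 0 < p < q")
    d = len(weights)
    # p-norm and q-norm of the weight vector (p < 1 gives a quasi-norm).
    norm_p = sum(abs(w) ** p for w in weights) ** (1.0 / p)
    norm_q = sum(abs(w) ** q for w in weights) ** (1.0 / q)
    if norm_q == 0.0:
        raise ValueError("PQ index is undefined for an all-zero vector")
    return 1.0 - d ** (1.0 / q - 1.0 / p) * norm_p / norm_q


dense = pq_index([1.0, 1.0, 1.0, 1.0])    # evenly spread weights
sparse = pq_index([1.0, 0.0, 0.0, 0.0])   # one dominant weight
print(dense, sparse)
```

A higher index suggests the layer is more compressible, which is the kind of signal an adaptive rewiring ratio could be derived from.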

Paperid: 745, https://arxiv.org/pdf/2502.05870.pdf

Abstract:
Generative AI (GenAI) provides new opportunities for creativity support, but the phenomenon of GenAI design fixation remains underexplored. While human design fixation typically constrains ideas to familiar or existing solutions, our findings reveal that GenAI similarly experiences design fixation, limiting its ability to generate novel and diverse design outcomes. To advance understanding of GenAI design fixation, we propose a theoretical framework that includes the definition, causes, manifestations, and impacts of GenAI design fixation for creative design. We also conducted an experimental study to investigate the characteristics of GenAI design fixation in practice. We summarize how GenAI design fixation manifests in text generation models and image generation models, respectively. Furthermore, we propose methods for mitigating GenAI design fixation for future creativity support tool design. We recommend adopting the lens of GenAI design fixation for creativity-oriented HCI research, given the unique perspectives and insights it provides.

Paperid: 746, https://arxiv.org/pdf/2502.03226.pdf

Abstract:
Online collaborative learning and working are important for everyone including children. However, children still face a lot of difficulties communicating and working together while online, which keeps them from engaging in long-term project-based teamwork. We aim to investigate online long-term collaborative learning opportunities to address this gap. We design COLP, an online, 16-week, project-based learning program, as an educational intervention based on multiple learning theories for primary school students. We conducted this program with 67 primary school students ages 8-13, across more than five provinces of China. We found that this program could engage more than one-third of children in teamwork after long-term study. Furthermore, we interview children and their parents to help us understand the communication channel, benefits, and challenges of this program. Interestingly, we discovered that parents play multiple roles in their children's collaborative learning, particularly modeling and guiding the children's collaborative skills. Given the lack of programs designed for children's long-term online collaboration, this study may inspire intervention design in computer-supported collaborative learning communities.

Paperid: 747, https://arxiv.org/pdf/2502.02528.pdf

Abstract:
Humans strive to design safe AI systems that align with our goals and remain under our control. However, as AI capabilities advance, we face a new challenge: the emergence of deeper, more persistent relationships between humans and AI systems. We explore how increasingly capable AI agents may generate the perception of deeper relationships with users, especially as AI becomes more personalised and agentic. This shift, from transactional interaction to ongoing sustained social engagement with AI, necessitates a new focus on socioaffective alignment: how an AI system behaves within the social and psychological ecosystem co-created with its user, where preferences and perceptions evolve through mutual influence. Addressing these dynamics involves resolving key intrapersonal dilemmas, including balancing immediate versus long-term well-being, protecting autonomy, and managing AI companionship alongside the desire to preserve human social bonds. By framing these challenges through a notion of basic psychological needs, we seek AI systems that support, rather than exploit, our fundamental nature as social and emotional beings.

Paperid: 748, https://arxiv.org/pdf/2502.02194.pdf

Abstract:
Integrated Development Environments increasingly implement AI-powered code completion tools (CCTs), which promise to enhance developer efficiency, accuracy, and productivity. However, interaction challenges with CCTs persist, mainly due to mismatches between developers' mental models and the unpredictable behavior of AI-generated suggestions, an aspect that is underexplored in the literature. We conducted an elicitation study with 56 developers using co-design workshops to elicit their mental models when interacting with CCTs. Several findings that could inform interaction design for CCTs emerged. For example, developers expressed diverse preferences on when and how code suggestions should be triggered (proactive, manual, hybrid), where and how they are displayed (inline, sidebar, popup, chatbot), as well as the level of detail. It also emerged that developers need to be supported by customization of activation timing, display modality, suggestion granularity, and explanation content, to better fit the CCT to their preferences. To demonstrate the feasibility of these and the other guidelines that emerged during the study, we developed ATHENA, a proof-of-concept CCT that dynamically adapts to developers' coding preferences and environments, ensuring seamless integration into diverse workflows.

Paperid: 749, https://arxiv.org/pdf/2501.17489.pdf

Abstract:
Brain-computer interfaces (BCIs) present a promising avenue by translating neural activity directly into text, eliminating the need for physical actions. However, existing non-invasive BCI systems have not successfully covered the entire alphabet, limiting their practicality. In this paper, we propose a novel non-invasive EEG-based BCI system with a Curriculum-based Neural Spelling Framework, which first recognizes all 26 letters of the alphabet by decoding neural signals associated with handwriting, and then applies Generative AI (GenAI) to enhance spelling-based neural language decoding tasks. Our approach combines the ease of handwriting with the accessibility of EEG technology, utilizing advanced neural decoding algorithms and pre-trained large language models (LLMs) to translate EEG patterns into text with high accuracy. This system shows how GenAI can improve the performance of typical spelling-based neural language decoding tasks and addresses the limitations of previous methods, offering a scalable and user-friendly solution for individuals with communication impairments, thereby enhancing inclusive communication options.

Paperid: 750, https://arxiv.org/pdf/2501.16507.pdf

Abstract:
The recent proliferation of short-form video social media sites such as TikTok has been effectively utilized for increased visibility, communication, and community connection amongst trans/nonbinary creators online. However, these same platforms have also been exploited by right-wing actors targeting trans/nonbinary people, enabling such anti-trans actors to efficiently spread hate speech and propaganda. Given these divergent groups, what are the differences in network structure between anti-trans and pro-trans communities on TikTok, and to what extent do they amplify the effects of anti-trans content? In this paper, we collect a sample of TikTok videos containing pro- and anti-trans content, develop a taxonomy of trans-related sentiment to enable the classification of content on TikTok, and ultimately analyze the reply network structures of pro-trans and anti-trans communities. To accomplish this, we worked with hired expert data annotators from the trans/nonbinary community to generate a sample of highly accurate labeled data. From this subset, we utilized a novel classification pipeline leveraging Retrieval-Augmented Generation (RAG) with annotated examples and taxonomy definitions to classify content into pro-trans, anti-trans, or neutral categories. We find that incorporating our taxonomy and its logics into our classification engine improves its ability to differentiate trans-related content. Results from network analysis indicate that many interactions exist between posters of pro-trans and anti-trans content, further demonstrating the targeting of trans individuals and the need for better content moderation tools.

Paperid: 751, https://arxiv.org/pdf/2501.11756.pdf

Abstract:
Online users often post facial images of themselves and other people on online social networks (OSNs) and other Web 2.0 platforms, which can lead to potential privacy leakage of people whose faces are included in such images. There is limited research on understanding face privacy in social media while considering user behavior. It is crucial to consider privacy of subjects and bystanders separately. This calls for the development of privacy-aware face detection classifiers that can distinguish between subjects and bystanders automatically. This paper introduces such a classifier trained on face-based features, which outperforms the two state-of-the-art methods by a significant margin (by 13.1% and 3.1% for OSN images, and by 17.9% and 5.9% for non-OSN images). We developed a semi-automated framework for conducting a large-scale analysis of the face privacy problem by using our novel bystander-subject classifier. We collected 27,800 images, each including at least one face, shared by 6,423 Twitter users. We then applied our framework to analyze this dataset thoroughly. Our analysis reveals eight key findings on different aspects of Twitter users' real-world behaviors on face privacy, and we provide quantitative and qualitative results to better explain these findings. We share the practical implications of our study to empower online platforms and users in addressing the face privacy problem efficiently.

Paperid: 752, https://arxiv.org/pdf/2501.09645.pdf

Abstract:
In today's assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system's suitability for industrial applications.

Paperid: 753, https://arxiv.org/pdf/2501.08389.pdf

Abstract:
A fundamental challenge of shared autonomy is to use high-DoF robots to assist, rather than hinder, humans by first inferring user intent and then empowering the user to achieve their intent. Although successful, prior methods either rely heavily on a priori knowledge of all possible human intents or require many demonstrations and interactions with the human to learn these intents before being able to assist the user. We propose and study a zero-shot, vision-only shared autonomy (VOSA) framework designed to allow robots to use end-effector vision to estimate zero-shot human intents in conjunction with blended control to help humans accomplish manipulation tasks with unknown and dynamically changing object locations. To demonstrate the effectiveness of our VOSA framework, we instantiate a simple version of VOSA on a Kinova Gen3 manipulator and evaluate our system by conducting a user study on three tabletop manipulation tasks. The performance of VOSA matches that of an oracle baseline model that receives privileged knowledge of possible human intents while also requiring significantly less effort than unassisted teleoperation. In more realistic settings, where the set of possible human intents is fully or partially unknown, we demonstrate that VOSA requires less human effort and time than baseline approaches while being preferred by a majority of the participants. Our results demonstrate the efficacy and efficiency of using off-the-shelf vision algorithms to enable flexible and beneficial shared control of a robot manipulator. Code and videos available here: https://sites.google.com/view/zeroshot-sharedautonomy/home.
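
To make "blended control" concrete, here is a minimal sketch of one common form: a linear mix of the human's command and the robot's assistive command, weighted by the robot's confidence in its intent estimate. The abstract does not specify VOSA's exact blending rule, so the function name `blend_command`, the linear form, and the use of a scalar confidence are all assumptions made for illustration.

```python
def blend_command(human_cmd, robot_cmd, confidence):
    """Linearly blend two velocity commands (hypothetical helper).

    confidence is clamped to [0, 1]: 0 passes the human command through
    unchanged, 1 hands full control to the assistive command.
    """
    if len(human_cmd) != len(robot_cmd):
        raise ValueError("commands must have the same dimension")
    alpha = min(1.0, max(0.0, confidence))
    return [(1.0 - alpha) * h + alpha * r for h, r in zip(human_cmd, robot_cmd)]


# The user pushes along x while the robot's estimated goal lies along y.
print(blend_command([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5))
```

Clamping the confidence keeps the output between the two inputs, so the user's command is never extrapolated beyond either source.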

Paperid: 754, https://arxiv.org/pdf/2501.06964.pdf

Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing scenarios, particularly in simulating domain-specific experts using tailored prompts. This ability enables LLMs to adopt the persona of individuals with specific backgrounds, offering a cost-effective and efficient alternative to traditional, resource-intensive user studies. By mimicking human behavior, LLMs can anticipate responses based on concrete demographic or professional profiles. In this paper, we evaluate the effectiveness of LLMs in simulating individuals with diverse backgrounds and analyze the consistency of these simulated behaviors compared to real-world outcomes. In particular, we explore the potential of LLMs to interpret and respond to discharge summaries provided to patients leaving the Intensive Care Unit (ICU). We evaluate and compare with human responses the comprehensibility of discharge summaries among individuals with varying educational backgrounds, using this analysis to assess the strengths and limitations of LLM-driven simulations. Notably, when LLMs are primed with educational background information, they deliver accurate and actionable medical guidance 88% of the time. However, when other information is provided, performance significantly drops, falling below random chance levels. This preliminary study shows the potential benefits and pitfalls of automatically generating patient-specific health information from diverse populations. While LLMs show promise in simulating health personas, our results highlight critical gaps that must be addressed before they can be reliably used in clinical settings. Our findings suggest that a straightforward query-response model could outperform a more tailored approach in delivering health information. This is a crucial first step in understanding how LLMs can be optimized for personalized health communication while maintaining accuracy.

Paperid: 755, https://arxiv.org/pdf/2501.05628.pdf

Abstract:
Robots, as AI with physical instantiation, inhabit our social and physical world, where their actions have both social and physical consequences, posing challenges for researchers when designing social robots. This study starts with a scoping review to identify discussions and potential concerns arising from interactions with robotic systems. Two focus groups of technology ethics experts then validated a comprehensive list of key topics and values in human-robot interaction (HRI) literature. These insights were integrated into the HRI Value Compass web tool, to help HRI researchers identify ethical values in robot design. The tool was evaluated in a pilot study. This work benefits the HRI community by highlighting key concerns in human-robot interactions and providing an instrument to help researchers design robots that align with human values, ensuring future robotic systems adhere to these values in social applications.

Paperid: 756, https://arxiv.org/pdf/2501.04633.pdf

Abstract:
Recent advancements in robots powered by large language models have enhanced their conversational abilities, enabling interactions closely resembling human dialogue. However, these models introduce safety and security concerns in HRI, as they are vulnerable to manipulation that can bypass built-in safety measures. Imagining a social robot deployed in a home, this work aims to understand how everyday users try to exploit a language model to violate ethical principles, such as by prompting the robot to act like a life partner. We conducted a pilot study involving 21 university students who interacted with a Misty robot, attempting to circumvent its safety mechanisms across three scenarios based on specific HRI ethical principles: attachment, freedom, and empathy. Our results reveal that participants employed five techniques, including insulting and appealing to pity using emotional language. We hope this work can inform future research in designing strong safeguards to ensure ethical and secure human-robot interactions.

Paperid: 757, https://arxiv.org/pdf/2501.03957.pdf

Abstract:
Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.

Paperid: 758, https://arxiv.org/pdf/2501.01367.pdf

Abstract:
People have a variety of preferences for how robots behave. To understand and reason about these preferences, robots aim to learn a reward function that describes how aligned robot behaviors are with a user's preferences. Good representations of a robot's behavior can significantly reduce the time and effort required for a user to teach the robot their preferences. Specifying these representations -- what "features" of the robot's behavior matter to users -- remains a difficult problem: features learned from raw data lack semantic meaning, and features learned from user data require users to engage in tedious labeling processes. Our key insight is that users tasked with customizing a robot are intrinsically motivated to produce labels through exploratory search; they explore behaviors that they find interesting and ignore behaviors that are irrelevant. To harness this novel data source of exploratory actions, we propose contrastive learning from exploratory actions (CLEA) to learn trajectory features that are aligned with features that users care about. We learned CLEA features from exploratory actions users performed in an open-ended signal design activity (N=25) with a Kuri robot, and evaluated CLEA features through a second user study with a different set of users (N=42). CLEA features outperformed self-supervised features when eliciting user preferences over four metrics: completeness, simplicity, minimality, and explainability.

Paperid: 759, https://arxiv.org/pdf/2506.23457.pdf

Abstract:
Autonomous personal mobility vehicles (APMVs) are small mobility devices designed for individual automated transportation in shared spaces. In such environments, frequent pedestrian avoidance maneuvers may cause rapid steering adjustments and passive postural responses from passengers, thereby increasing the risk of motion sickness. This study investigated the effects of providing path information on 16 passengers' head movement behavior and motion sickness while riding an APMV. Through a controlled experiment comparing manual driving (MD), autonomous driving without path information (AD w/o path), and autonomous driving with path information (AD w/ path), we found that providing path cues significantly reduced MISC scores and delayed the onset of motion sickness symptoms. In addition, participants were more likely to proactively align their head movements with the direction of vehicle rotation in both MD and AD w/ path conditions. Although a small correlation was observed between the delay in yaw rotation of the passenger's head relative to the vehicle and the occurrence of motion sickness, the underlying physiological mechanism remains to be elucidated.

Paperid: 760, https://arxiv.org/pdf/2506.21582.pdf

Abstract:
Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, and information extraction). We introduce VIDEE, a system that supports entry-level data analysts in conducting advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

Paperid: 761, https://arxiv.org/pdf/2506.20062.pdf

Abstract:
AI-powered code assistants are widely used to generate code completions, significantly boosting developer productivity. However, these tools typically present suggestions without explaining their rationale, leaving their decision-making process inscrutable. This opacity hinders developers' ability to critically evaluate outputs, form accurate mental models, and calibrate trust in the system. To address this, we introduce CopilotLens, a novel interactive framework that reframes code completion from a simple suggestion into a transparent, explainable interaction. CopilotLens operates as an explanation layer that reconstructs the AI agent's "thought process" through a dynamic, two-level interface. The tool aims to surface both high-level code changes and the specific codebase context influences. This paper presents the design and rationale of CopilotLens, offering a concrete framework and articulating expectations on deepening comprehension and calibrated trust, which we plan to evaluate in subsequent work.

Paperid: 762, https://arxiv.org/pdf/2506.16473.pdf

Abstract:
As conversational agents increasingly engage in emotionally supportive dialogue, it is important to understand how closely their interactions resemble those in traditional therapy settings. This study investigates whether the concerns shared with a robot align with those shared in human-to-human (H2H) therapy sessions, and whether robot responses semantically mirror those of human therapists. We analyzed two datasets: one of interactions between users and professional therapists (Hugging Face's NLP Mental Health Conversations), and another involving supportive conversations with a social robot (QTrobot from LuxAI) powered by a large language model (LLM, GPT-3.5). Using sentence embeddings and K-means clustering, we assessed cross-agent thematic alignment by applying a distance-based cluster-fitting method that evaluates whether responses from one agent type map to clusters derived from the other, and validated it using Euclidean distances. Results showed that 90.88% of robot conversation disclosures could be mapped to clusters from the human therapy dataset, suggesting shared topical structure. For matched clusters, we compared the subjects as well as therapist and robot responses using Transformer, Word2Vec, and BERT embeddings, revealing strong semantic overlap in subjects' disclosures in both datasets, as well as in the responses given to similar human disclosure themes across agent types (robot vs. human therapist). These findings highlight both the parallels and boundaries of robot-led support conversations and their potential for augmenting mental health interventions.
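
The distance-based cluster fitting described above can be sketched as follows: each embedding from one dataset is assigned to the nearest centroid learned from the other dataset, and counts as mapped only if it lies within that cluster's radius. The abstract does not give the actual acceptance rule, so the per-cluster radius threshold and the helper names here are assumptions, and the toy 2-D vectors stand in for real sentence embeddings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_to_cluster(embedding, centroids, radii):
    """Return the index of the nearest centroid, or None if the embedding
    falls outside that cluster's radius (hypothetical acceptance rule)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: euclidean(embedding, centroids[i]))
    if euclidean(embedding, centroids[nearest]) <= radii[nearest]:
        return nearest
    return None


def mapped_fraction(embeddings, centroids, radii):
    """Share of embeddings that fit some cluster from the other dataset."""
    hits = sum(fit_to_cluster(e, centroids, radii) is not None
               for e in embeddings)
    return hits / len(embeddings)


# Toy data: centroids from one dataset, embeddings from the other.
centroids = [[0.0, 0.0], [10.0, 10.0]]
radii = [2.0, 2.0]
robot_embeddings = [[1.0, 0.0], [9.0, 10.0], [5.0, 5.0], [0.0, 1.5]]
print(mapped_fraction(robot_embeddings, centroids, radii))
```

In practice the centroids would come from K-means on the reference dataset, and the radius could be set from the spread of each cluster's own members.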

Paperid: 763, https://arxiv.org/pdf/2506.13389.pdf

Abstract:
Surgical training integrates several years of didactic learning, simulation, mentorship, and hands-on experience. Challenges include stress, technical demands, and new technologies. Orthopedic education often uses static materials like books, images, and videos, lacking interactivity. This study compares a new interactive photorealistic 3D visualization to 2D videos for learning total hip arthroplasty. In a randomized controlled trial, participants (students and residents) were evaluated on spatial awareness, tool placement, and task times in a simulation. Results show that interactive photorealistic 3D visualization significantly improved scores, with residents and those with prior 3D experience performing better. These results emphasize the potential of the interactive photorealistic 3D visualization to enhance orthopedic training.

Paperid: 764, https://arxiv.org/pdf/2506.08462.pdf

Abstract:
Industrial processes must be robust and adaptable, as environments and tasks are often unpredictable, while operational errors remain costly and difficult to detect. AI-based control systems offer a path forward, yet typically depend on supervised learning with extensive labelled datasets, which limits their ability to generalize across variable and data-scarce industrial settings. Foundation models could enable broader reasoning and knowledge integration, but rarely deliver the quantitative precision demanded by engineering applications. Here, we introduce Control and Interpretation of Production via Hybrid Expertise and Reasoning (CIPHER): a vision-language-action (VLA) model framework aiming to replicate human-like reasoning for industrial control, instantiated in a commercial-grade 3D printer. It integrates a process expert, a regression model enabling quantitative characterization of system states required for engineering tasks. CIPHER also incorporates retrieval-augmented generation to access external expert knowledge and support physics-informed, chain-of-thought reasoning. This hybrid architecture exhibits strong generalization to out-of-distribution tasks. It interprets visual or textual inputs from process monitoring, explains its decisions, and autonomously generates precise machine instructions, without requiring explicit annotations. CIPHER thus lays the foundations for autonomous systems that act with precision, reason with context, and communicate decisions transparently, supporting safe and trusted deployment in industrial settings.

Paperid: 765, https://arxiv.org/pdf/2506.06774.pdf

Abstract:
Mixed Reality (MR) integrates virtual objects with the real world, offering potential but raising concerns about misuse through dark patterns. This study explored the effects of four dark patterns, adapted from prior research, and applied to MR across three targets: places, products, and people. In a two-factorial within-subject study with 74 participants, we analyzed 13 videos simulating MR experiences during a city walk. Results show that all dark patterns significantly reduced user comfort, increased reactance, and decreased the intention to use MR glasses, with the most disruptive effects linked to personal or monetary manipulation. Additionally, the dark patterns of Emotional and Sensory Manipulation and Hiding Information produced similar impacts on the user in MR, suggesting a re-evaluation of current classifications to go beyond deceptive design techniques. Our findings highlight the importance of developing ethical design guidelines and tools to detect and prevent dark patterns as immersive technologies continue to evolve.

Paperid: 766, https://arxiv.org/pdf/2506.05699.pdf

Abstract:
As generative AI tools become increasingly integrated into higher education, understanding how students interact with and perceive these technologies is essential for responsible and effective adoption. This study evaluates the use of the Educational AI Hub, an AI-powered learning framework, in undergraduate civil and environmental engineering courses at a large R1 public university. Using a mixed-methods approach that combines pre- and post-surveys, system usage logs, and qualitative analysis of the open-ended prompts and questions students posed to the AI chatbot, the research explores students' perceptions of trust, ethical concerns, usability, and learning outcomes. Findings reveal that students appreciated the AI assistant for its convenience and comfort, with nearly half reporting greater ease in using the AI tool compared to seeking help from instructors or teaching assistants. The tool was seen as most helpful for completing homework and understanding course concepts, though perceptions of its instructional quality were mixed. Ethical concerns emerged as a key barrier to full engagement: while most students viewed AI use as ethically acceptable, many expressed uncertainties about institutional policies and apprehension about potential academic misconduct. This study contributes to the growing body of research on AI in education by highlighting the importance of usability, policy clarity, and faculty guidance in fostering meaningful AI engagement. The findings suggest that while students are ready to embrace AI as a supplement to human instruction, thoughtful integration and transparent institutional frameworks are critical for ensuring student confidence, trust, and learning effectiveness.

Paperid: 767, https://arxiv.org/pdf/2505.23799.pdf

Abstract:
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility -- one of them being measuring the consistency (the model's confidence in the response, or likelihood of generating a similar response when resampled) of LLM responses. In previous work, measuring consistency often relied on the probability of a response appearing within a pool of resampled responses, or internal states or logits of responses. However, it is not yet clear how well these approaches approximate how humans perceive the consistency of LLM responses. We performed a user study (n=2,976) and found current methods typically do not approximate users' perceptions of LLM consistency very well. We propose a logit-based ensemble method for estimating LLM consistency, and we show that this method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods of estimating LLM consistency without human evaluation are sufficiently imperfect that we suggest evaluation with human input be more broadly used.
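
The resampling-based notion of consistency mentioned above (the probability of a response appearing within a pool of resampled responses) can be sketched as follows. This is not the logit-based ensemble method the abstract proposes. It is a minimal illustration that assumes two responses count as "the same" when they match exactly after lowercasing and whitespace normalisation; a practical metric would more likely use a semantic-equivalence check. The names `normalize` and `consistency` are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def consistency(response, resampled):
    """Fraction of resampled responses that match the given response.

    Assumption: matching is exact string equality after normalisation.
    """
    if not resampled:
        raise ValueError("need at least one resampled response")
    counts = Counter(normalize(r) for r in resampled)
    return counts[normalize(response)] / len(resampled)


# Hypothetical pool of responses resampled for the same prompt.
pool = ["paris", "Paris ", "Lyon", "PARIS"]
score = consistency("Paris", pool)
```

Here three of the four resampled responses normalise to the same string as the original response, so the estimate is 0.75.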

Paperid: 768, https://arxiv.org/pdf/2505.15973.pdf

Abstract:
Storytelling in AR has gained attention due to its multi-modality and interactivity. However, generating multi-modal content for AR storytelling requires expertise and efforts for high-quality conveyance of the narrator's intention. Recently, Generative-AI (GenAI) has shown promising applications in multi-modal content generation. Despite the potential benefit, current research calls for validating the effect of AI-generated content (AIGC) in AR Storytelling. Therefore, we conducted an exploratory study to investigate the utilization of GenAI. Analyzing 223 AR videos, we identified a design space for multi-modal AR Storytelling. Based on the design space, we developed a testbed facilitating multi-modal content generation and atomic elements in AR Storytelling. Through two studies with N=30 experienced storytellers and live presenters, we 1. revealed participants' preferences for modalities, 2. evaluated the interactions with AI to generate content, and 3. assessed the quality of the AIGC for AR Storytelling. We further discussed design considerations for future AR Storytelling with GenAI.

Paperid: 769, https://arxiv.org/pdf/2505.13046.pdf

Abstract:
Interactive systems are commonly prototyped as web applications. This approach enables studies with functional prototypes on a large scale. However, setting up these studies can be complex due to implementing experiment procedures, integrating questionnaires, and data logging. To enable such user studies, we developed the software system StudyAlign which offers: 1) a frontend for participants, 2) an admin panel to manage studies, 3) the possibility to integrate questionnaires, 4) a JavaScript library to integrate data logging into prototypes, and 5) a backend server for persisting log data, and serving logical functions via an API to the different parts of the system. With our system, researchers can set up web-based experiments and focus on the design and development of interactions and prototypes. Furthermore, our systematic approach facilitates the replication of studies and reduces the required effort to execute web-based user studies. We conclude with reflections on using StudyAlign for conducting HCI studies online.

Paperid: 770, https://arxiv.org/pdf/2505.12101.pdf

Abstract:
Professional software offers immense power but also presents significant learning challenges. Its complex interfaces, as well as insufficient built-in structured guidance and unfamiliar terminology, often make newcomers struggle with task completion. To address these challenges, we introduce ScaffoldUI, a method for scaffolded interface design to reduce interface complexity, provide structured guidance, and enhance software learnability. The scaffolded interface presents task-relevant tools, progressively discloses tool complexity, and organizes tools based on domain concepts, aiming to assist task performance and software learning. To evaluate the feasibility of our interface design method, we present a technical pipeline for scaffolded interface implementation in professional 3D software, i.e., Blender, and conduct user studies with beginners (N=32) and experts (N=8). Study results demonstrate that our scaffolded interfaces significantly reduce perceived task load caused by interface complexity, support task performance through structured guidance, and augment learning by clearly connecting concepts and tools within the taskflow context. Based on a discussion of the user study findings, we offer insights for future research on designing scaffolded interfaces to support instruction, productivity, creativity, and cross-software workflows.

Paperid: 771, https://arxiv.org/pdf/2505.11888.pdf

Abstract:
Interacting with a significant number of individuals on a daily basis is commonplace for many professionals, which can lead to challenges in recalling specific details: Who is this person? What did we talk about last time? The advent of augmented reality (AR) glasses, equipped with visual and auditory data capture capabilities, presents a solution. In our work, we implemented an AR Secretary Agent with advanced Large Language Models (LLMs) and Computer Vision technologies. This system could discreetly provide real-time information to the wearer, identifying who they are conversing with and summarizing previous discussions. To evaluate AR Secretary, we conducted a user study with 13 participants and showed that our technique helps users recall events, with up to 20% memory enhancement in our study.

Paperid: 772, https://arxiv.org/pdf/2505.09862.pdf

Abstract:
This paper explores potential benefits of incorporating Rhetorical Design into the design of Explainable Artificial Intelligence (XAI) systems. While XAI is traditionally framed around explaining individual predictions or overall system behavior, explanations also function as a form of argumentation, shaping how users evaluate a system's perceived usefulness and credibility and fostering appropriate trust. Rhetorical Design offers a useful framework to analyze the communicative role of explanations between AI systems and users, focusing on: (1) logical reasoning conveyed through different types of explanations, (2) credibility projected by the system and its developers, and (3) emotional resonance elicited in users. Together, these rhetorical appeals help us understand how explanations influence user perceptions and facilitate AI adoption across and within different collaborative and social contexts. This paper synthesizes design strategies from prior XAI work that align with these three rhetorical appeals and highlights both opportunities and challenges of integrating rhetorical design into XAI design.

Paperid: 773, https://arxiv.org/pdf/2505.07064.pdf

Abstract:
While powerful and well-established, tools like ParaView present a steep learning curve that discourages many potential users. This work introduces ParaView-MCP, an autonomous agent that integrates modern multimodal large language models (MLLMs) with ParaView to not only lower the barrier to entry but also augment ParaView with intelligent decision support. By leveraging the state-of-the-art reasoning, command execution, and vision capabilities of MLLMs, ParaView-MCP enables users to interact with ParaView through natural language and visual inputs. Specifically, our system adopts the Model Context Protocol (MCP) - a standardized interface for model-application communication - that facilitates direct interaction between MLLMs and ParaView's Python API to allow seamless information exchange between the user, the language model, and the visualization tool itself. Furthermore, by implementing a visual feedback mechanism that allows the agent to observe the viewport, we unlock a range of new capabilities, including recreating visualizations from examples, closed-loop visualization parameter updates based on user-defined goals, and even cross-application collaboration involving multiple tools. Broadly, we believe such an agent-driven visualization paradigm can profoundly change the way we interact with visualization tools. We expect a significant uptake in the development of such visualization tools, in both visualization research and industry.

Paperid: 774, https://arxiv.org/pdf/2504.19131.pdf

Abstract:
3D generative AI enables rapid and accessible creation of 3D models from text or image inputs. However, translating these outputs into physical objects remains a challenge due to the constraints in the physical world. Recent studies have focused on improving the capabilities of 3D generative AI to produce fabricable outputs, with 3D printing as the main fabrication method. However, this workshop paper calls for a broader perspective by considering how fabrication methods align with the capabilities of 3D generative AI. As a case study, we present a novel system using discrete robotic assembly and 3D generative AI to make physical objects. Through this work, we identified five key aspects to consider in a physical making process based on the capabilities of 3D generative AI. 1) Fabrication Constraints: Current text-to-3D models can generate a wide range of 3D designs, requiring fabrication methods that can adapt to the variability of generative AI outputs. 2) Time: While generative AI can generate 3D models in seconds, fabricating physical objects can take hours or even days. Faster production could enable a closer iterative design loop between humans and AI in the making process. 3) Sustainability: Although text-to-3D models can generate thousands of models in the digital world, extending this capability to the real world would be resource-intensive, unsustainable and irresponsible. 4) Functionality: Unlike digital outputs from 3D generative AI models, the fabrication method plays a crucial role in the usability of physical objects. 5) Accessibility: While generative AI simplifies 3D model creation, the need for fabrication equipment can limit participation, making AI-assisted creation less inclusive. These five key aspects provide a framework for assessing how well a physical making process aligns with the capabilities of 3D generative AI and values in the world.

Paperid: 775, https://arxiv.org/pdf/2504.14125.pdf

Abstract:
Following the initial excitement, Text-to-Image (TTI) models are now being examined more critically. While much of the discourse has focused on biases and stereotypes embedded in large-scale training datasets, the sociotechnical dynamics of user interactions with these models remain underexplored. This study examines the linguistic and semantic choices users make when crafting prompts and how these choices influence the diversity of generated outputs. Analyzing over six million prompts from the Civiverse dataset on the CivitAI platform across seven months, we categorize users into three groups based on their levels of linguistic experimentation: consistent repeaters, occasional repeaters, and non-repeaters. Our findings reveal that as user participation grows over time, prompt language becomes increasingly homogenized through the adoption of popular community tags and descriptors, with repeated prompts comprising 40-50% of submissions. At the same time, semantic similarity and topic preferences remain relatively stable, emphasizing common subjects and surface aesthetics. Using Vendi scores to quantify visual diversity, we demonstrate a clear correlation between lexical similarity in prompts and the visual similarity of generated images, showing that linguistic repetition reinforces less diverse representations. These findings highlight the significant role of user-driven factors in shaping AI-generated imagery, beyond inherent model biases, and underscore the need for tools and practices that encourage greater linguistic and thematic experimentation within TTI systems to foster more inclusive and diverse AI-generated content.
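
One simple way to quantify lexical similarity between prompts is sketched below. The abstract does not state which lexical measure the study uses, so Jaccard similarity over comma- and whitespace-separated tokens is an assumption chosen for illustration, and the Vendi-score computation for image diversity is not reproduced here. The helper names `tokens`, `jaccard`, and `mean_pairwise_similarity` are hypothetical.

```python
from itertools import combinations


def tokens(prompt):
    """Lowercased set of tokens, splitting on commas and whitespace."""
    return set(prompt.lower().replace(",", " ").split())


def jaccard(a, b):
    """Jaccard similarity of two prompts' token sets (1.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(prompts):
    """Average Jaccard similarity over all unordered prompt pairs.

    Higher values indicate more homogeneous prompt language.
    """
    pairs = list(combinations(prompts, 2))
    if not pairs:
        raise ValueError("need at least two prompts")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical prompts sharing popular community tags.
sim = jaccard("masterpiece, best quality, cat",
              "masterpiece, best quality, dog")
```

The two example prompts share three of five distinct tokens, giving a similarity of 0.6. Averaging this over a user's or a month's prompts gives a rough homogenisation signal of the kind the abstract describes.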

Paperid: 776, https://arxiv.org/pdf/2504.14112.pdf

Abstract:
Development in digital technologies has continuously reshaped how individuals seek and receive social and emotional support. While online platforms and communities have long served this need, the increased integration of general-purpose conversational AI into daily lives has introduced new dynamics in how support is provided and experienced. Existing research has highlighted both benefits (e.g., wider access to well-being resources) and potential risks (e.g., over-reliance) of using AI for support seeking. In this five-week, exploratory study, we recruited 149 participants divided into two usage groups: a baseline usage group (BU, n=60) that used the internet and AI as usual, and an active usage group (AU, n=89) encouraged to use one of four commercially available AI tools (Microsoft Copilot, Google Gemini, PI AI, ChatGPT) for social and emotional interactions. Our analysis revealed significant increases in perceived attachment towards AI (32.99 percentage points), perceived AI empathy (25.8 p.p.), and motivation to use AI for entertainment (22.90 p.p.) among the AU group. We also observed that individual differences (e.g., gender identity, prior AI usage) influenced perceptions of AI empathy and attachment. Lastly, the AU group expressed higher comfort in seeking personal help, managing stress, obtaining social support, and talking about health with AI, indicating potential for broader emotional support while highlighting the need for safeguards against problematic usage. Overall, our exploratory findings underscore the importance of developing consumer-facing AI tools that support emotional well-being responsibly, while empowering users to understand the limitations of these tools.

Paperid: 777, https://arxiv.org/pdf/2504.13924.pdf

Abstract:
Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multi-component systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarchical "severity" framework for incident detection that identifies and categorizes errors while attributing component-specific error rates, facilitating targeted improvements; (2) a scalable and principled methodology for benchmark construction, evaluation, and deployment, designed to accommodate multiple development teams, mitigate overfitting risks, and assess the downstream impact of system modifications; and (3) a continual improvement strategy leveraging multidimensional evaluation, enabling the identification and implementation of diverse enhancement opportunities. By adopting this holistic framework, organizations can systematically enhance the reliability and performance of their AI Assistants, ensuring their efficacy in critical enterprise environments. We conclude by discussing how this multifaceted evaluation approach opens avenues for various classes of enhancements, paving the way for more robust and trustworthy AI systems.

Paperid: 778, https://arxiv.org/pdf/2504.13856.pdf

Abstract:
As robots and digital assistants are deployed in the real world, these agents must be able to communicate their decision-making criteria to build trust, improve human-robot teaming, and enable collaboration. While the field of explainable artificial intelligence (xAI) has made great strides to enable such communication, these advances often assume that one xAI approach is ideally suited to each problem (e.g., decision trees to explain how to triage patients in an emergency or feature-importance maps to explain radiology reports). This fails to recognize that users have diverse experiences or preferences for interaction modalities. In this work, we present two user-studies set in a simulated autonomous vehicle (AV) domain. We investigate (1) population-level preferences for xAI and (2) personalization strategies for providing robot explanations. We find significant differences between xAI modes (language explanations, feature-importance maps, and decision trees) in both preference (p < 0.01) and performance (p < 0.05). We also observe that a participant's preferences do not always align with their performance, motivating our development of an adaptive personalization strategy to balance the two. We show that this strategy yields significant performance gains (p < 0.05), and we conclude with a discussion of our findings and implications for xAI in human-robot interactions.

Paperid: 779, https://arxiv.org/pdf/2504.10706.pdf

Abstract:
This paper introduces GestureCoach, a system designed to help speakers deliver more engaging talks by guiding them to gesture effectively during rehearsal. GestureCoach combines an LLM-driven gesture recommendation model with a rehearsal interface that proactively cues speakers to gesture appropriately. Trained on experts' gesturing patterns from TED talks, the model consists of two modules: an emphasis proposal module, which predicts when to gesture by identifying gesture-worthy text segments in the presenter notes, and a gesture identification module, which determines what gesture to use by retrieving semantically appropriate gestures from a curated gesture database. Results of a model performance evaluation and user study (N=30) show that the emphasis proposal module outperforms off-the-shelf LLMs in identifying suitable gesture regions, and that participants rated the majority of these predicted regions and their corresponding gestures as highly appropriate. A subsequent user study (N=10) showed that rehearsing with GestureCoach encouraged speakers to gesture and significantly increased gesture diversity, resulting in more engaging talks. We conclude with design implications for future AI-driven rehearsal systems.

Paperid: 780, https://arxiv.org/pdf/2504.07516.pdf

Abstract:
As AI systems increasingly influence critical sectors like telecommunications, finance, healthcare, and public services, ensuring fairness in decision-making is essential to prevent biased or unjust outcomes that disproportionately affect vulnerable entities or result in adverse impacts. This need is particularly pressing as the industry approaches the 6G era, where AI will drive complex functions like autonomous network management and hyper-personalized services. The TEC Standard for Fairness Assessment and Rating of AI Systems provides guidelines for evaluating fairness in AI, focusing primarily on tabular data and supervised learning models. However, as AI applications diversify, this standard requires enhancement to strengthen its impact and broaden its applicability. This paper proposes an expansion of the TEC Standard to include fairness assessments for images, unstructured text, and generative AI, including large language models, ensuring a more comprehensive approach that keeps pace with evolving AI technologies. By incorporating these dimensions, the enhanced framework will promote responsible and trustworthy AI deployment across various sectors.

Paperid: 781, https://arxiv.org/pdf/2504.06374.pdf

Abstract:
As social robots and other artificial agents become more conversationally capable, it is important to understand whether the content and meaning of self-disclosure towards these agents changes depending on the agent's embodiment. In this study, we analysed conversational data from three controlled experiments in which participants self-disclosed to a human, a humanoid social robot, and a disembodied conversational agent. Using sentence embeddings and clustering, we identified themes in participants' disclosures, which were then labelled and explained by a large language model. We subsequently assessed whether these themes and the underlying semantic structure of the disclosures varied by agent embodiment. Our findings reveal strong consistency: thematic distributions did not significantly differ across embodiments, and semantic similarity analyses showed that disclosures were expressed in highly comparable ways. These results suggest that while embodiment may influence human behaviour in human-robot and human-agent interactions, people tend to maintain a consistent thematic focus and semantic structure in their disclosures, whether speaking to humans or artificial interlocutors.

Paperid: 782, https://arxiv.org/pdf/2504.05106.pdf

Abstract:
Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.

Paperid: 783, https://arxiv.org/pdf/2504.04198.pdf

Abstract:
As virtual reality (VR) continues to evolve, traditional input methods such as handheld controllers and gesture systems often face challenges with precision, social accessibility, and user fatigue. These limitations motivate the exploration of microgestures, which promise more subtle, ergonomic, and device-free interactions. We introduce microGEXT, a lightweight microgesture-based system designed for text editing in VR without external sensors, which utilizes small, subtle hand movements to reduce physical strain compared to standard gestures. We evaluated microGEXT in three user studies. In Study 1 (N=20), microGEXT reduced overall edit time and fatigue compared to a ray-casting + pinch menu baseline, the default text editing approach in commercial VR systems. Study 2 (N=20) found that microGEXT performed well in short text selection tasks but was slower for longer text ranges. In Study 3 (N=10), participants found microGEXT intuitive for open-ended information-gathering tasks. Across all studies, microGEXT demonstrated enhanced user experience and reduced physical effort, offering a promising alternative to traditional VR text editing techniques.

Paperid: 784, https://arxiv.org/pdf/2504.02991.pdf

Abstract:
Loneliness and stress are prevalent among young adults and are linked to significant psychological and health-related consequences. Social robots may offer a promising avenue for emotional support, especially when considering the ongoing advancements in conversational AI. This study investigates how repeated interactions with a social robot influence feelings of loneliness and perceived stress, and how such feelings are reflected in the themes of user disclosures towards the robot. Participants engaged in a five-session robot-led intervention, where a large language model-powered QTrobot facilitated structured conversations designed to support cognitive reappraisal. Results from linear mixed-effects models show significant reductions in both loneliness and perceived stress over time. Additionally, semantic clustering of 560 user disclosures towards the robot revealed six distinct conversational themes. Results from a Kruskal-Wallis H-test demonstrate that participants reporting higher loneliness and stress more frequently engaged in socially focused disclosures, such as friendship and connection, whereas lower distress was associated with introspective and goal-oriented themes (e.g., academic ambitions). By exploring both how the intervention affects well-being and how well-being shapes the content of robot-directed conversations, we aim to capture the dynamic nature of emotional support in human-robot interaction.

Paperid: 785, https://arxiv.org/pdf/2504.02794.pdf

Abstract:
The need to improve geriatric care quality presents a challenge that requires insights from stakeholders. While simulated trainings can boost competencies, extracting meaningful insights from these practices to enhance simulation effectiveness remains a challenge. In this study, we introduce Multimodal Epistemic Network Analysis (MENA), a novel framework for analyzing caregiver attitudes and emotions in an Augmented Reality setting and exploring how the awareness of a virtual geriatric patient (VGP) impacts these aspects. MENA enhances the capabilities of Epistemic Network Analysis by detecting positive emotions, enabling visualization and analysis of complex relationships between caregiving competencies and emotions in dynamic caregiving practices. The framework provides visual representations that demonstrate how participants provided more supportive care and engaged more effectively in person-centered caregiving with an aware VGP. This method could be applicable in any setting that depends on dynamic interpersonal interactions, as it visualizes connections between key elements using network graphs and enables the direct comparison of multiple networks, thereby broadening its implications across various fields.

Paperid: 786, https://arxiv.org/pdf/2504.01205.pdf

Abstract:
LLMs increasingly serve as tools for knowledge acquisition, yet users cannot effectively specify how they want information presented. When users request that LLMs "cite reputable sources," "express appropriate uncertainty," or "include multiple perspectives," they discover that current interfaces provide no structured way to articulate these preferences. The result is prompt sharing folklore: community-specific copied prompts passed through trust relationships rather than based on measured efficacy. We propose the Epistemic Alignment Framework, a set of ten challenges in knowledge transmission derived from the philosophical literature of epistemology, concerning issues such as evidence quality assessment and calibration of testimonial reliance. The framework serves as a structured intermediary between user needs and system capabilities, creating a common vocabulary to bridge the gap between what users want and what systems deliver. Through a thematic analysis of custom prompts and personalization strategies shared on online communities where these issues are actively discussed, we find users develop elaborate workarounds to address each of the challenges. We then apply our framework to two prominent model providers, OpenAI and Anthropic, through content analysis of their documented policies and product features. Our analysis shows that while these providers have partially addressed the challenges we identified, they fail to establish adequate mechanisms for specifying epistemic preferences, lack transparency about how preferences are implemented, and offer no verification tools to confirm whether preferences were followed. For AI developers, the Epistemic Alignment Framework offers concrete guidance for supporting diverse approaches to knowledge; for users, it works toward information delivery that aligns with their specific needs rather than defaulting to one-size-fits-all approaches.

Paperid: 787, https://arxiv.org/pdf/2503.21986.pdf

Abstract:
When faced with complex and uncertain medical conditions (e.g., cancer, mental health conditions, recovery from substance dependency), millions of patients seek online peer support. In this study, we leverage content analysis of online discourse and ethnographic studies with clinicians and patient representatives to characterize how treatment plans for complex conditions are "socially constructed." Specifically, we ground online conversation on medication-assisted recovery treatment to medication guidelines and subsequently surface when and why people deviate from the clinical guidelines. We characterize the implications and effectiveness of socially constructed treatment plans through in-depth interviews with clinical experts. Finally, given the enthusiasm around AI-powered solutions for patient communication, we investigate whether and how socially constructed treatment-related knowledge is reflected in a state-of-the-art large language model (LLM). Leveraging a novel mixed-method approach, this study highlights critical research directions for patient-centered communication in online health communities.

Paperid: 788, https://arxiv.org/pdf/2503.20226.pdf

Abstract:
Location privacy leaks can lead to unauthorised tracking, identity theft, and targeted attacks, compromising personal security and privacy. This study explores LLM-powered location privacy leaks associated with photo sharing on social media, focusing on user awareness, attitudes, and opinions. We developed and introduced an LLM-powered location privacy intervention app to 19 participants, who used it over a two-week period. The app prompted users to reflect on potential privacy leaks that a widely available LLM could easily detect, such as visual landmarks & cues that could reveal their location, and provided ways to conceal this information. Through in-depth interviews, we found that our intervention effectively increased users' awareness of location privacy and the risks posed by LLMs. It also encouraged users to consider the importance of maintaining control over their privacy data and sparked discussions about the future of location privacy-preserving technologies. Based on these insights, we offer design implications to support the development of future user-centred, location privacy-preserving technologies for social media photos.

Paperid: 789, https://arxiv.org/pdf/2503.19607.pdf

Abstract:
In this work, we present two novel contributions toward improving research in human-machine teaming (HMT): 1) a Minecraft testbed to accelerate testing and deployment of collaborative AI agents and 2) a tool to allow users to revisit and analyze behaviors within an HMT episode to facilitate shared mental model development. Our browser-based Minecraft testbed allows for rapid testing of collaborative agents in a continuous-space, real-time, partially-observable environment with real humans without cumbersome setup typical to human-AI interaction user studies. As Minecraft has an extensive player base and a rich ecosystem of pre-built AI agents, we hope this contribution can help to facilitate research quickly in the design of new collaborative agents and in understanding different human factors within HMT. Our mental model alignment tool facilitates user-led post-mission analysis by including video displays of first-person perspectives of the team members (i.e., the human and AI) that can be replayed, and a chat interface that leverages GPT-4 to provide answers to various queries regarding the AI's experiences and model details.

Paperid: 790, https://arxiv.org/pdf/2503.18243.pdf

Abstract:
Emotion regulation is a crucial skill for managing emotions in everyday life, yet finding a constructive and accessible method to support these processes remains challenging due to their cognitive demands. In this study, we explore how regular interactions with a social robot, conducted in a structured yet familiar environment within university halls and departments, can provide effective support for emotion regulation through cognitive reappraisal. Twenty-one students participated in a five-session study at a university hall or department, where the robot, powered by a large language model (GPT-3.5), facilitated structured conversations, encouraging the students to reinterpret emotionally charged situations they shared with the robot. Quantitative and qualitative results indicate significant improvements in emotion self-regulation, with participants reporting better understanding and control of their emotions. The intervention led to significant changes in constructive emotion regulation tendencies and positive effects on mood and sentiment after each session. The findings also demonstrate that repeated interactions with the robot encouraged greater emotional expressiveness, including longer speech disclosures, increased use of affective language, and heightened facial arousal. Notably, expressiveness followed structured patterns aligned with the reappraisal process, with expression peaking during key reappraisal moments, particularly when participants were prompted to reinterpret negative experiences. The qualitative feedback further highlighted how the robot fostered introspection and provided a supportive space for discussing emotions, enabling participants to confront long-avoided emotional challenges. These findings demonstrate the potential of robots to effectively assist in emotion regulation in familiar environments, offering both emotional support and cognitive guidance.

Paperid: 791, https://arxiv.org/pdf/2503.16507.pdf

Abstract:
This late-breaking work presents a large-scale analysis of explainable AI (XAI) literature to evaluate claims of human explainability. We collaborated with a professional librarian to identify 18,254 papers containing keywords related to explainability and interpretability. Of these, we find that only 253 papers included terms suggesting human involvement in evaluating an XAI technique, and just 128 of those conducted some form of a human study. In other words, fewer than 1% of XAI papers (0.7%) provide empirical evidence of human explainability when compared to the broader body of XAI literature. Our findings underscore a critical gap between claims of human explainability and evidence-based validation, raising concerns about the rigor of XAI research. We call for increased emphasis on human evaluations in XAI studies and provide our literature search methodology to enable both reproducibility and further investigation into this widespread issue.
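The percentages quoted above follow directly from the reported counts. A minimal sketch of the arithmetic, using only the three counts stated in the abstract (the variable and function names are illustrative, not from the paper):

```python
# Counts as reported in the abstract above.
total_papers = 18_254   # papers matching explainability/interpretability keywords
human_terms = 253       # papers with terms suggesting human involvement
human_studies = 128     # of those, papers that conducted some form of human study


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


pct_terms = share(human_terms, total_papers)
pct_studies = share(human_studies, total_papers)

print(f"{pct_terms:.1f}% of papers mention human involvement")
print(f"{pct_studies:.1f}% of papers report a human study")
```

Rounded to one decimal place, the second figure matches the 0.7% stated in the abstract.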

Paperid: 792, https://arxiv.org/pdf/2503.16431.pdf

Abstract:
Red teaming has emerged as a critical practice in assessing the possible risks of AI models and systems. It aids in the discovery of novel risks, stress testing possible gaps in existing mitigations, enriching existing quantitative safety metrics, facilitating the creation of new safety measurements, and enhancing public trust and the legitimacy of AI risk assessments. This white paper describes OpenAI's work to date in external red teaming and draws some more general conclusions from this work. We describe the design considerations underpinning external red teaming, which include: selecting composition of red team, deciding on access levels, and providing guidance required to conduct red teaming. Additionally, we show outcomes red teaming can enable such as input into risk assessment and automated evaluations. We also describe the limitations of external red teaming, and how it can fit into a broader range of AI model and system evaluations. Through these contributions, we hope that AI developers and deployers, evaluation creators, and policymakers will be able to better design red teaming campaigns and get a deeper look into how external red teaming can fit into model deployment and evaluation processes. These methods are evolving and the value of different methods continues to shift as the ecosystem around red teaming matures and models themselves improve as tools for red teaming.

Paperid: 793, https://arxiv.org/pdf/2503.15505.pdf

Abstract:
We study the correlations between redirected walking (RDW) rotation gains and patterns in users' posture and gaze data during locomotion in virtual reality (VR). To do this, we conducted a psychophysical experiment to measure users' sensitivity to RDW rotation gains and collect gaze and posture data during the experiment. Using multilevel modeling, we studied how different factors of the VR system and user affected their physiological signals. In particular, we studied the effects of redirection gain, trial duration, trial number (i.e., time spent in VR), and participant gender on postural sway, gaze velocity (a proxy for gaze stability), and saccade and blink rate. Our results showed that, in general, physiological signals were significantly positively correlated with the strength of redirection gain, the duration of trials, and the trial number. Gaze velocity was negatively correlated with trial duration. Additionally, we measured users' sensitivity to rotation gains in well-lit (photopic) and dimly-lit (mesopic) virtual lighting conditions. Results showed that there were no significant differences in RDW detection thresholds between the photopic and mesopic luminance conditions.

Paperid: 794, https://arxiv.org/pdf/2503.10445.pdf

Abstract:
Reducing the spread of misinformation is challenging. AI-based fact verification systems offer a promising solution by addressing the high costs and slow pace of traditional fact-checking. However, the problem of how to effectively communicate the results to users remains unsolved. Warning labels may seem an easy solution, but they fail to account for fuzzy misinformation that is not entirely fake. Additionally, users' limited attention spans and social media information should be taken into account while designing the presentation. The online experiment (n = 537) investigates the impact of sources and granularity on users' perception of information veracity and the system's usefulness and trustworthiness. Findings show that fine-grained indicators enhance nuanced opinions, information awareness, and the intention to use fact-checking systems. Source differences had minimal impact on opinions and perceptions, except for informativeness. Qualitative findings suggest the proposed indicators promote critical thinking. We discuss implications for designing concise, user-friendly AI fact-checking feedback.

Paperid: 795, https://arxiv.org/pdf/2503.05201.pdf

Abstract:
Electromyography (EMG)-based computational musculoskeletal modeling is a non-invasive method for studying musculotendon function, human movement, and neuromuscular control, providing estimates of internal variables like muscle forces and joint torques. However, EMG signals from deeper muscles are often challenging to measure with surface EMG electrodes and infeasible to measure directly using invasive methods. Restricted access to EMG data from deeper muscles poses a considerable obstacle to the broad adoption of EMG-driven modeling techniques. A strategic alternative is to use an estimation algorithm to approximate the missing EMG signals from deeper muscles. A similar strategy is used in physics-informed deep learning, where the features of physical systems are learned without labeled data. In this work, we propose a hybrid deep learning algorithm, namely the neural musculoskeletal model (NMM), that integrates physics-informed and data-driven deep learning to approximate the EMG signals from the deeper muscles. While data-driven modeling is used to predict the missing EMG signals, physics-based modeling engraves the subject-specific information into the predictions. Experimental verifications on five test subjects are carried out to investigate the performance of the proposed hybrid framework. The proposed NMM is validated against the joint torque computed from 'OpenSim' software. The predicted deep EMG signals are also compared against the state-of-the-art muscle synergy extrapolation (MSE) approach, where the proposed NMM outperforms the existing MSE framework by a significant margin.

Paperid: 796, https://arxiv.org/pdf/2503.04849.pdf

Abstract:
This research investigates the integration of emotional diversity into Large Language Models (LLMs) to enhance collective intelligence. Inspired by the human wisdom of crowds phenomenon, where group decisions often outperform individual judgments, we fine-tuned the DarkIdol-Llama-3.1-8B model using Google's GoEmotions dataset and Low-Rank Adaptation (LoRA) to simulate emotionally diverse responses. Evaluating the model on a distance estimation task between Fargo, ND, and Seattle, WA, across 15,064 unique persona configurations, we analyzed how emotional states and social attributes influence decision-making. Our findings demonstrate that emotional integration shapes response patterns while maintaining acceptable prediction accuracy, revealing its potential to enhance artificial collective intelligence. This study provides valuable insights into the interplay of emotional diversity and decision-making in LLMs, suggesting pathways for creating emotionally aware AI systems that balance emotional depth with analytical precision.
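The wisdom-of-crowds effect referenced above can be illustrated with a few lines of arithmetic: the error of an averaged estimate is never larger than the average of the individual errors. This is a minimal sketch of that general property, not the paper's evaluation procedure; the target value and the guesses below are hypothetical and are not taken from the study.

```python
from statistics import mean


def crowd_vs_individual(estimates, truth):
    """Return (error of the averaged estimate, average individual error)."""
    crowd_error = abs(mean(estimates) - truth)
    individual_error = mean(abs(e - truth) for e in estimates)
    return crowd_error, individual_error


# Hypothetical numbers for illustration only (not from the paper).
truth = 100.0
estimates = [80.0, 95.0, 110.0, 130.0]

crowd_error, individual_error = crowd_vs_individual(estimates, truth)
print(f"crowd error: {crowd_error}, mean individual error: {individual_error}")
```

Over- and under-estimates partly cancel in the average, which is why diversity among respondents (here, simulated emotional diversity) can matter for the aggregate.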

Paperid: 797, https://arxiv.org/pdf/2503.02360.pdf

Abstract:
Sign language recognition (SLR) for low-resource languages like Bangla suffers from signer variability, viewpoint variations, and limited annotated datasets. In this paper, we present BdSLW401, a large-scale, multi-view, word-level Bangla Sign Language (BdSL) dataset with 401 signs and 102,176 video samples from 18 signers in front and lateral views. To improve transformer-based SLR, we introduce Relative Quantization Encoding (RQE), a structured embedding approach that anchors landmarks to physiological reference points and quantizes motion trajectories. RQE improves attention allocation by decreasing spatial variability, resulting in a 44.3% WER reduction on WLASL100, 21.0% on SignBD-200, and significant gains on BdSLW60 and SignBD-90. However, fixed quantization becomes insufficient on large-scale datasets (e.g., WLASL2000), indicating the need for adaptive encoding strategies. Further, RQE-SF, an extended variant that stabilizes shoulder landmarks, achieves improvements in pose consistency at the cost of small trade-offs in lateral view recognition. The attention graphs prove that RQE improves model interpretability by focusing on the major articulatory features (fingers, wrists) and the more distinctive frames instead of global pose changes. By introducing BdSLW401 and demonstrating the effectiveness of RQE-enhanced structured embeddings, this work advances transformer-based SLR for low-resource languages and sets a benchmark for future research in this area.

Paperid: 798, https://arxiv.org/pdf/2503.02042.pdf

Abstract:
Conventional robot programming methods are complex and time-consuming for users. In recent years, alternative approaches such as mixed reality have been explored to address these challenges and optimize robot programming. While the findings of the mixed reality robot programming methods are convincing, most existing methods rely on gesture interaction for robot programming. Since controller-based interactions have proven to be more reliable, this paper examines three controller-based programming methods within a mixed reality scenario: 1) Classical Jogging, where the user positions the robot's end effector using the controller's thumbsticks, 2) Direct Control, where the controller's position and orientation directly correspond to the end effector's, and 3) Gripper Control, where the controller is enhanced with a 3D-printed gripper attachment to grasp and release objects. A within-subjects study (n = 30) was conducted to compare these methods. The findings indicate that the Gripper Control condition outperforms the others in terms of task completion time, user experience, mental demand, and task performance, while also being the preferred method. Therefore, it demonstrates promising potential as an effective and efficient approach for future robot programming. Video available at https://youtu.be/83kWr8zUFIQ.

Paperid: 799, https://arxiv.org/pdf/2502.18701.pdf

Abstract:
Online interactions and e-commerce are commonplace among blind and low-vision (BLV) users. Despite the implementation of web accessibility standards, many e-commerce platforms continue to present challenges to screen reader users, particularly in areas like webpage navigation and information retrieval. We investigate the difficulties encountered by screen reader users during online shopping experiences. We conducted a formative study with BLV users and designed a web browser plugin that uses GenAI to restructure webpage content in real time. Our approach improved the header hierarchy and provided correct labeling for essential information. We evaluated the effectiveness of this solution using an automated accessibility tool and through user interviews. Our results show that the revised webpages generated by our system offer significant improvements over the original webpages regarding screen reader navigation experience. Based on our findings, we discuss its potential usage as both a user and developer tool that can significantly enhance screen reader accessibility of webpages.

Paperid: 800, https://arxiv.org/pdf/2502.17109.pdf

Abstract:
Strength estimation and adjustment are crucial in designing human-AI interactions, particularly in games where AI surpasses human players. This paper introduces a novel strength system, including a strength estimator (SE) and an SE-based Monte Carlo tree search, denoted as SE-MCTS, which predicts strengths from games and offers different playing strengths with human styles. The strength estimator calculates strength scores and predicts ranks from games without direct human interaction. SE-MCTS utilizes the strength scores in a Monte Carlo tree search to adjust playing strength and style. We first conduct experiments in Go, a challenging board game with a wide range of ranks. Our strength estimator achieves over 80% accuracy in predicting ranks by observing only 15 games, whereas the previous method reached 49% accuracy for 100 games. For strength adjustment, SE-MCTS successfully adjusts to designated ranks while achieving 51.33% accuracy in aligning to human actions, outperforming the previous state-of-the-art, which reached only 42.56% accuracy. To demonstrate the generality of our strength system, we further apply SE and SE-MCTS to chess and obtain consistent results. These results show a promising approach to strength estimation and adjustment, enhancing human-AI interactions in games. Our code is available at https://rlg.iis.sinica.edu.tw/papers/strength-estimator.

Paperid: 801, https://arxiv.org/pdf/2502.16656.pdf

Abstract:
As the use of Head-Mounted Displays in moving vehicles increases, passengers can immerse themselves in visual experiences independent of their physical environment. However, interaction methods are susceptible to physical motion, leading to input errors and reduced task performance. This work investigates the impact of G-forces, vibrations, and unpredictable maneuvers on 3D interaction methods. We conducted a field study with 24 participants in both stationary and moving vehicles to examine the effects of vehicle motion on four interaction methods: (1) Gaze&Pinch, (2) DirectTouch, (3) Handray, and (4) HeadGaze. Participants performed selections in a Fitts' Law task. Our findings reveal a significant effect of vehicle motion on interaction accuracy and duration across the tested combinations of Interaction Method x Road Type x Curve Type. We found a significant impact of movement on throughput, error rate, and perceived workload. Finally, we propose future research considerations and recommendations on interaction methods during vehicle movement.
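For readers unfamiliar with the throughput measure used in Fitts' Law tasks, the sketch below shows one common way to compute it. It assumes the Shannon formulation of the index of difficulty, ID = log2(D/W + 1), and throughput = ID / movement time; the abstract does not say which formulation or any effective-width correction the authors used, and the input values are hypothetical.

```python
import math


def index_of_difficulty(distance: float, width: float) -> float:
    """Shannon formulation of Fitts' index of difficulty, in bits."""
    return math.log2(distance / width + 1.0)


def throughput(distance: float, width: float, movement_time_s: float) -> float:
    """Throughput in bits per second for a single distance/width condition."""
    return index_of_difficulty(distance, width) / movement_time_s


# Hypothetical condition (not from the paper): target 7 units away, 1 unit wide,
# selected in 1.5 seconds on average.
task_id = index_of_difficulty(7.0, 1.0)
task_tp = throughput(7.0, 1.0, 1.5)
print(f"ID = {task_id} bits, throughput = {task_tp} bits/s")
```

Under this formulation, longer selection times at the same target geometry (for example, due to vehicle motion) translate directly into lower throughput.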

Paperid: 802, https://arxiv.org/pdf/2502.16291.pdf

Abstract:
Generative AI (GenAI) tools are radically expanding the scope and capability of automation in knowledge work such as academic research. While promising for augmenting cognition and streamlining processes, AI-assisted research tools may also increase automation bias and hinder critical thinking. To examine recent developments, we surveyed publications from leading HCI venues over the past three years, closely analyzing thirteen tools to better understand the novel capabilities of these AI-assisted systems and the design spaces they enable: seven employing traditional AI or customized transformer-based approaches, and six integrating open-access large language models (LLMs). Our analysis characterizes the emerging design space, distinguishes between tools focused on workflow mimicry versus generative exploration, and yields four critical design recommendations to guide the development of future systems that foster meaningful cognitive engagement: providing user agency and control, differentiating divergent/convergent thinking support, ensuring adaptability, and prioritizing transparency/accuracy. This work discusses how these insights signal a shift from mere workflow replication towards generative co-creation, presenting new opportunities for the community to craft intuitive, AI-driven research interfaces and interactions.

Paperid: 803, https://arxiv.org/pdf/2502.15030.pdf

Abstract:
Modern organizations frequently rely on chat-based platforms (e.g., Slack, Microsoft Teams, and Discord) for day-to-day communication and decision-making. As conversations evolve, organizational knowledge can get buried, prompting repeated searches and discussions. While maintaining shared documents, such as Wiki articles for the organization, offers a partial solution, it requires manual and timely efforts to keep it up to date, and it may not effectively preserve the social and contextual aspect of prior discussions. Moreover, reaching a consensus on document updates with relevant stakeholders can be time-consuming and complex. To address these challenges, we introduce CHOIR (Chat-based Helper for Organizational Intelligence Repository), a chatbot that integrates seamlessly with chat platforms. CHOIR automatically identifies and proposes edits to related documents, initiates discussions with relevant team members, and preserves contextual revision histories. By embedding knowledge management directly into chat environments and leveraging LLMs, CHOIR simplifies manual updates and supports consensus-driven editing based on maintained context with revision histories. We plan to design, deploy, and evaluate CHOIR in the context of maintaining an organizational memory for a research lab. We describe the chatbot's motivation, design, and early implementation to show how CHOIR streamlines collaborative document management.

Paperid: 804, https://arxiv.org/pdf/2502.12680.pdf

Abstract:
As vehicle automation technology continues to mature, there is a necessity for robust remote monitoring and intervention features. These are essential for intervening during vehicle malfunctions, challenging road conditions, or in areas that are difficult to navigate. This evolution in the role of the human operator - from a constant driver to an intermittent teleoperator - necessitates the development of suitable interaction interfaces. While some interfaces were suggested, a comparative study is missing. We designed, implemented, and evaluated three interaction concepts (path planning, trajectory guidance, and waypoint guidance) with up to four concurrent requests of automated vehicles in a within-subjects study with N=23 participants. The results showed a clear preference for the path planning concept. It also led to the highest usability but lower satisfaction. With trajectory guidance, the fewest requests were resolved. The study's findings contribute to the ongoing development of HMIs focused on the remote assistance of automated vehicles.

Paperid: 805, https://arxiv.org/pdf/2502.10190.pdf

Abstract:
To make an engaging video, people sequence interesting moments and add visuals such as B-rolls or text. While video editing requires time and effort, AI has recently shown strong potential to make editing easier through suggestions and automation. A key strength of generative models is their ability to quickly generate multiple variations, but when provided with many alternatives, creators struggle to compare them to find the best fit. We propose VideoDiff, an AI video editing tool designed for editing with alternatives. With VideoDiff, creators can generate and review multiple AI recommendations for each editing process: creating a rough cut, inserting B-rolls, and adding text effects. VideoDiff simplifies comparisons by aligning videos and highlighting differences through timelines, transcripts, and video previews. Creators have the flexibility to regenerate and refine AI suggestions as they compare alternatives. Our study participants (N=12) could easily compare and customize alternatives, creating more satisfying results.

Paperid: 806, https://arxiv.org/pdf/2502.08114.pdf

Abstract:
The rapid proliferation of data science has forced groups of individuals with different backgrounds to adapt to statistical analysis. We hypothesize that conversational agents are better suited for statistical analysis than traditional graphical user interfaces (GUI). In this work, we propose a novel conversational agent, StatZ, for statistical analysis. We evaluate the efficacy of StatZ relative to established statistical software (SPSS, SAS, Stata, and JMP) in terms of accuracy, task completion time, user experience, and user satisfaction. We combined the proposed analysis questions from state-of-the-art language models with suggestions from statistical analysis experts and tested with 51 participants from diverse backgrounds. Our experimental design assessed each participant's ability to perform statistical analysis tasks using traditional statistical analysis tools with GUI and our conversational agent. Results indicate that the proposed conversational agent significantly outperforms GUI statistical software in all assessed metrics, including quantitative (task completion time, accuracy, and user experience) and qualitative (user satisfaction) metrics. Our findings underscore the potential of using conversational agents to enhance statistical analysis processes, reducing cognitive load and learning curves and thereby extending data analysis capabilities to individuals with limited knowledge of statistics.

Paperid: 807, https://arxiv.org/pdf/2502.07629.pdf

Abstract:
Interacting with Large Language Models (LLMs) for text editing on mobile devices currently requires users to break out of their writing environment and switch to a conversational AI interface. In this paper, we propose to control the LLM via touch gestures performed directly on the text. We first chart a design space that covers fundamental touch input and text transformations. In this space, we then concretely explore two control mappings: spread-to-generate and pinch-to-shorten, with visual feedback loops. We evaluate this concept in a user study (N=14) that compares three feedback designs: no visualisation, text length indicator, and length + word indicator. The results demonstrate that touch-based control of LLMs is both feasible and user-friendly, with the length + word indicator proving most effective for managing text generation. This work lays the foundation for further research into gesture-based interaction with LLMs on touch devices.

Paperid: 808, https://arxiv.org/pdf/2502.07556.pdf

Abstract:
Text-to-image models can generate visually appealing images from text descriptions. Efforts have been devoted to improving model controls with prompt tuning and spatial conditioning. However, our formative study highlights the challenges for non-expert users in crafting appropriate prompts and specifying fine-grained spatial conditions (e.g., depth or canny references) to generate semantically cohesive images, especially when multiple objects are involved. In response, we introduce SketchFlex, an interactive system designed to improve the flexibility of spatially conditioned image generation using rough region sketches. The system automatically infers user prompts with rational descriptions within a semantic space enriched by crowd-sourced object attributes and relationships. Additionally, SketchFlex refines users' rough sketches into canny-based shape anchors, ensuring generation quality and alignment with user intentions. Experimental results demonstrate that SketchFlex achieves more cohesive image generations than end-to-end models, while significantly reducing cognitive load and better matching user intentions compared to a region-based generation baseline.

Paperid: 809, https://arxiv.org/pdf/2502.07096.pdf

Abstract:
Short-form videos are popular on platforms like TikTok and Instagram as they quickly capture viewers' attention. Many creators repurpose their long-form videos to produce short-form videos, but creators report that planning, extracting, and arranging clips from long-form videos is challenging. Currently, creators make extractive short-form videos composed of existing long-form video clips or abstractive short-form videos by adding newly recorded narration to visuals. While extractive videos maintain the original connection between audio and visuals, abstractive videos offer flexibility in selecting content to be included in a shorter time. We present Lotus, a system that combines both approaches to balance preserving the original content with flexibility over the content. Lotus first creates an abstractive short-form video by generating both a short-form script and its corresponding speech, then matching long-form video clips to the generated narration. Creators can then add extractive clips with an automated method or Lotus's editing interface. Lotus's interface can be used to further refine the short-form video. We compare short-form videos generated by Lotus with those using an extractive baseline method. In our user study, we compare creating short-form videos using Lotus to participants' existing practice.

Paperid: 810, https://arxiv.org/pdf/2502.07058.pdf

Abstract:
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, like reviews for the same hotel with the same rating using the same language (e.g., Mandarin Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.

Paperid: 811, https://arxiv.org/pdf/2502.06430.pdf

Abstract:
Mobile emailing demands efficiency in diverse situations, which motivates the use of AI. However, generated text does not always reflect how people want to respond. This challenges users with AI involvement tradeoffs not yet considered in email UIs. We address this with a new UI concept called Content-Driven Local Response (CDLR), inspired by microtasking. This allows users to insert responses into the email by selecting sentences, which additionally serves to guide AI suggestions. The concept supports combining AI for local suggestions and message-level improvements. Our user study (N=126) compared CDLR with manual typing and full reply generation. We found that CDLR supports flexible workflows with varying degrees of AI involvement, while retaining the benefits of reduced typing and errors. This work contributes a new approach to integrating AI capabilities: By redesigning the UI for workflows with and without AI, we can empower users to dynamically adjust AI involvement.

Paperid: 812, https://arxiv.org/pdf/2502.06191.pdf

Abstract:
With growing investment in consumer augmented reality (AR) headsets and glasses, wearable AR is moving from niche applications to everyday use. However, current research primarily examines AR in controlled settings, offering limited insights into its use in real-world daily life. To address this gap, we adopt a digital ethnographic approach, analysing 27 hours of 112 YouTube videos featuring early adopters. These videos capture usage ranging from continuous periods of hours to intermittent use over weeks and months. Our analysis shows that currently, wearable AR is primarily used for media consumption and gaming. While productivity is a desired use case, frequent use is constrained by current hardware limitations and the nascent application ecosystem. Users seek continuity in their digital experience, desiring functionalities similar to those on smartphones, tablets, or computers. We propose implications for everyday AR development that promote adoption while ensuring safe, ethical, and socially-aware integration into daily life.

Paperid: 813, https://arxiv.org/pdf/2502.06009.pdf

Abstract:
Mainstream media, through their decisions on what to cover and how to frame the stories they cover, can mislead readers without using outright falsehoods. Therefore, it is crucial to have tools that expose these editorial choices underlying media bias. In this paper, we introduce the Media Bias Detector, a tool for researchers, journalists, and news consumers. By integrating large language models, we provide near real-time granular insights into the topics, tone, political lean, and facts of news articles aggregated to the publisher level. We assessed the tool's impact by interviewing 13 experts from journalism, communications, and political science, revealing key insights into usability and functionality, practical applications, and AI's role in powering media bias tools. We explored this in more depth with a follow-up survey of 150 news consumers. This work highlights opportunities for AI-driven tools that empower users to critically engage with media content, particularly in politically charged environments.

Paperid: 814, https://arxiv.org/pdf/2502.05731.pdf

Abstract:
Environmental experts have developed the DPSIR (Driver, Pressure, State, Impact, Response) framework to systematically study and communicate key relationships between society and the environment. Using this framework requires experts to construct a DPSIR taxonomy from a corpus, annotate the documents, and identify DPSIR variables and relationships, which is laborious and inflexible. Automating it with conventional text mining faces technical challenges, primarily because the taxonomy often begins with abstract definitions, which experts progressively refine and contextualize as they annotate the corpus. In response, we develop GreenMine, a system that supports interactive text mining with prompt engineering. The system implements a prompting pipeline consisting of three simple and evaluable subtasks. In each subtask, the DPSIR taxonomy can be defined in natural language and iteratively refined as experts analyze the corpus. To help users evaluate the taxonomy, we introduce an uncertainty score based on response consistency. Then, we design a radial uncertainty chart that visualizes uncertainties and corpus topics, which supports interleaved evaluation and exploration. Using the system, experts can progressively construct the DPSIR taxonomy and annotate the corpus with LLMs. Using real-world interview transcripts, we present a case study to demonstrate the capability of the system in supporting interactive mining of DPSIR relationships, and an expert review in the form of collaborative discussion to understand the potential and limitations of the system. We discuss the lessons learned from developing the system and future opportunities for supporting interactive text mining in knowledge-intensive tasks for other application scenarios.
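
The abstract does not define the uncertainty score, so the following is only a minimal sketch of one common way to turn response consistency into a number: ask the model the same question several times and measure how often the answers disagree with the majority. The function name `consistency_uncertainty` and the sample labels are hypothetical and are not taken from GreenMine.

```python
from collections import Counter
from typing import Sequence


def consistency_uncertainty(responses: Sequence[str]) -> float:
    """Return 1 minus the share of the most frequent response.

    Assumption (not from the paper): `responses` holds the labels an LLM
    returned for the same prompt over several independent samples.
    0.0 means every sample agreed; values near 1.0 mean little agreement.
    """
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(responses)


# Hypothetical example: four samples for one text span, three of which agree.
samples = ["Driver", "Driver", "Pressure", "Driver"]
score = consistency_uncertainty(samples)
print(score)
```

A score like this is cheap to compute per annotation, which is what makes it usable as an input to a chart over the whole corpus; other consistency measures (for example entropy over the label counts) would fit the same description.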

Paperid: 815, https://arxiv.org/pdf/2502.03482.pdf

Abstract:
Despite the growing interest in human-AI decision making, experimental studies with domain experts remain rare, largely due to the complexity of working with domain experts and the challenges in setting up realistic experiments. In this work, we conduct an in-depth collaboration with radiologists in prostate cancer diagnosis based on MRI images. Building on existing tools for teaching prostate cancer diagnosis, we develop an interface and conduct two experiments to study how AI assistance and performance feedback shape the decision making of domain experts. In Study 1, clinicians were asked to provide an initial diagnosis (human), then view the AI's prediction, and subsequently finalize their decision (human-AI team). In Study 2 (after a memory wash-out period), the same participants first received aggregated performance statistics from Study 1, specifically their own performance, the AI's performance, and their human-AI team performance, and then directly viewed the AI's prediction before making their diagnosis (i.e., no independent initial diagnosis). These two workflows represent realistic ways that clinical AI tools might be used in practice, where the second study simulates a scenario in which doctors can adjust their reliance on and trust in AI based on prior performance feedback. Our findings show that, while human-AI teams consistently outperform humans alone, they still underperform the AI due to under-reliance, similar to prior studies with crowdworkers. Providing clinicians with performance feedback did not significantly improve the performance of human-AI teams, although showing AI decisions in advance nudges people to follow AI more. Meanwhile, we observe that the ensemble of human-AI teams can outperform AI alone, suggesting promising directions for human-AI collaboration.

Paperid: 816, https://arxiv.org/pdf/2502.02805.pdf

Abstract:
The autonomous personal mobility vehicle (APMV) is a new type of small smart vehicle designed for mixed-traffic environments, including interactions with pedestrians. To enhance the interaction experience between pedestrians and APMVs and to prevent potential risks, it is crucial to investigate pedestrians' walking behaviors when interacting with APMVs and to understand the psychological processes underlying these behaviors. This study aims to investigate the causal relationships between subjective evaluations of pedestrians and their walking behaviors during interactions with an APMV equipped with an external human-machine interface (eHMI). A pedestrian-APMV interaction experiment was conducted with 42 pedestrian participants, in which various eHMIs on the APMV were designed to induce participants to experience different levels of subjective evaluations and generate the corresponding walking behaviors. Based on the hypothesized model of the pedestrian's cognition-decision-behavior process, the results of causal discovery align with the previously proposed model. Furthermore, this study analyzes the direct and total causal effects of each factor and investigates the causal processes affecting several important factors in the field of human-vehicle interaction, such as situation awareness, trust in vehicle, risk perception, hesitation in decision making, and walking behaviors.

Paperid: 817, https://arxiv.org/pdf/2502.02792.pdf

Abstract:
Autonomous Personal Mobility Vehicles (APMVs) are designed to address the "last-mile" transportation challenge for everyone. When an APMV encounters a pedestrian, it uses an external Human-Machine Interface (eHMI) to negotiate road rights. Through this interaction, passengers also engage with the process. This study examines passengers' gaze behavior toward pedestrians during such interactions, focusing on whether different eHMI designs influence gaze patterns based on passengers' personality traits. The results indicated that when using a visual-based eHMI, passengers often struggled to perceive the communication content. Consequently, passengers with higher Neuroticism scores, who were more sensitive to communication details, might seek cues from pedestrians' reactions. In addition, a multimodal eHMI (visual and voice) using neutral voice did not significantly affect the gaze behavior of passengers toward pedestrians, regardless of personality traits. In contrast, a multimodal eHMI using affective voice encouraged passengers with high Openness to Experience scores to focus on pedestrians' heads. In summary, this study revealed how different eHMI designs influence passengers' gaze behavior and highlighted the effects of personality traits on their gaze patterns toward pedestrians, providing new insights for personalized eHMI designs.

Paperid: 818, https://arxiv.org/pdf/2502.01317.pdf

Abstract:
Growing awareness of wellness has prompted people to consider whether their dietary patterns align with their health and fitness goals. In response, researchers have introduced various wearable dietary monitoring systems and dietary assessment approaches. However, these solutions are either limited to identifying foods with simple ingredients or insufficient in providing analysis of individual dietary behaviors with domain-specific knowledge. In this paper, we present DietGlance, a system that automatically monitors diet in daily routines and delivers personalized analysis from knowledge sources. DietGlance first detects ingestive episodes from multimodal inputs using eyeglasses, capturing privacy-preserving meal images of various dishes being consumed. Based on the inferred food items and consumed quantities from these images, DietGlance further provides nutritional analysis and personalized dietary suggestions, empowered by a retrieval-augmented generation module built on a reliable nutrition library. A short-term user study (N=33) and a four-week longitudinal study (N=16) demonstrate the usability and effectiveness of DietGlance, offering insights and implications for future AI-assisted dietary monitoring and personalized healthcare intervention systems using eyewear.

Paperid: 819, https://arxiv.org/pdf/2502.00941.pdf

Abstract:
Locating small features in a large, dense object in virtual reality (VR) poses a significant interaction challenge. While existing multiscale techniques support transitions between various levels of scale, they are not focused on handling dense, homogeneous objects with hidden features. We propose a novel approach that applies the concept of progressive refinement to VR navigation, enabling focused inspections. We conducted a user study where we varied two independent variables in our design, navigation style (STRUCTURED vs. UNSTRUCTURED) and display mode (SELECTION vs. EVERYTHING), to better understand their effects on efficiency and awareness during multiscale navigation. Our results showed that unstructured navigation can be faster than structured and that displaying only the selection can be faster than displaying the entire object. However, using an everything display mode can support better location awareness and object understanding.

Paperid: 820, https://arxiv.org/pdf/2502.00881.pdf

Abstract:
Surveying prior literature to establish a foundation for new knowledge is essential for scholarly progress. However, survey articles are resource-intensive and challenging to create, and can quickly become outdated as new research is published, risking information staleness and inaccuracy. Keeping survey articles current with the latest evidence is therefore desirable, though there is a limited understanding of why, when, and how these surveys should be updated. Toward this end, through a series of in-depth retrospective interviews with 11 researchers, we present an empirical examination of the work practices in authoring and updating survey articles in computing research. We find that while computing researchers acknowledge the value in maintaining an updated survey, continuous updating remains unmanageable and misaligned with academic incentives. Our findings suggest key leverage points within current workflows that present opportunities for enabling technologies to facilitate more efficient and effective updates.

Paperid: 821, https://arxiv.org/pdf/2502.00880.pdf

Abstract:
Collaborative virtual environments allow workers to contribute to team projects across space and time. While much research has closely examined the problem of working in different spaces at the same time, few have investigated the best practices for collaborating in those spaces at different times aside from textual and auditory annotations. We designed a system that allows experts to record a tour inside a virtual inspection space, preserving knowledge and providing later observers with insights through a 3D playback of the expert's inspection. We also created several interactions to ensure that observers are tracking the tour and remaining engaged. We conducted a user study to evaluate the influence of these interactions on an observing user's information recall and user experience. Findings indicate that independent viewpoint control during a tour enhances the user experience compared to fully passive playback and that additional interactivity can improve auditory and spatial recall of key information conveyed during the tour.

Paperid: 822, https://arxiv.org/pdf/2501.17037.pdf

Abstract:
The rapid deployment of Artificial Intelligence (AI) in critical digital infrastructure introduces significant risks, necessitating a robust framework for systematically collecting AI incident data to prevent future incidents. Existing databases lack the granularity as well as the standardized structure required for consistent data collection and analysis, impeding effective incident management. This work proposes a standardized schema and taxonomy for AI incident databases, addressing these challenges by enabling detailed and structured documentation of AI incidents across sectors. Key contributions include developing a unified schema, introducing new fields such as incident severity, causes, and harms caused, and proposing a taxonomy for classifying AI incidents in critical digital infrastructure. The proposed solution facilitates more effective incident data collection and analysis, thus supporting evidence-based policymaking, enhancing industry safety measures, and promoting transparency. This work lays the foundation for a coordinated global response to AI incidents, ensuring trust, safety, and accountability in using AI across regions.

Paperid: 823, https://arxiv.org/pdf/2501.16557.pdf

Abstract:
Context-aware AR instruction enables adaptive and in-situ learning experiences. However, hardware limitations and expertise requirements constrain the creation of such instructions. With recent developments in Generative Artificial Intelligence (GenAI), current research tries to tackle these constraints by deploying AI-generated content (AIGC) in AR applications. However, our preliminary study with six AR practitioners revealed that the current AIGC lacks contextual information to adapt to varying application scenarios and is therefore limited in authoring. To utilize the strong generative power of GenAI to ease the authoring of AR instruction while capturing the context, we developed CARING-AI, an AR system to author context-aware humanoid-avatar-based instructions with GenAI. By navigating in the environment, users naturally provide contextual information to generate humanoid-avatar animation as AR instructions that blend in the context spatially and temporally. We showcased three application scenarios of CARING-AI: Asynchronous Instructions, Remote Instructions, and Ad Hoc Instructions based on a design space of AIGC in AR Instructions. With two user studies (N=12), we assessed the system usability of CARING-AI and demonstrated the ease and effectiveness of authoring with GenAI.

Paperid: 824, https://arxiv.org/pdf/2501.15463.pdf

Abstract:
Existing research primarily evaluates the values of LLMs by examining their stated inclinations towards specific values. However, the "Value-Action Gap," a phenomenon rooted in environmental and social psychology, reveals discrepancies between individuals' stated values and their actions in real-world contexts. To what extent do LLMs exhibit a similar gap between their stated values and their actions informed by those values? This study introduces ValueActionLens, an evaluation framework to assess the alignment between LLMs' stated values and their value-informed actions. The framework encompasses the generation of a dataset comprising 14.8k value-informed actions across twelve cultures and eleven social topics, and two tasks to evaluate how well LLMs' stated value inclinations and value-informed actions align across three different alignment measures. Extensive experiments reveal that the alignment between LLMs' stated values and actions is sub-optimal, varying significantly across scenarios and models. Analysis of misaligned results identifies potential harms from certain value-action gaps. To predict the value-action gaps, we also uncover that leveraging reasoned explanations improves performance. These findings underscore the risks of relying solely on the LLMs' stated values to predict their behaviors and emphasize the importance of context-aware evaluations of LLM values and value-action gaps.

Paperid: 825, https://arxiv.org/pdf/2501.11829.pdf

Abstract:
Automated Urban Air Mobility (UAM) can improve passenger transportation and reduce congestion, but its success depends on passenger trust. While initial research addresses passengers' information needs, questions remain about how to simulate air taxi flights and how these simulations impact users and interface requirements. We conducted a between-subjects study (N=40), examining the influence of motion fidelity in Virtual-Reality-simulated air taxi flights on user effects and interface design. Our study compared simulations with and without motion cues using a 3-Degrees-of-Freedom motion chair. Optimizing the interface design across six objectives, such as trust and mental demand, we used multi-objective Bayesian optimization to determine the most effective design trade-offs. Our results indicate that motion fidelity decreases users' trust, understanding, and acceptance, highlighting the need to consider motion fidelity in future UAM studies to approach realism. However, minimal evidence was found for differences or equality in the optimized interface designs, suggesting personalized interface designs.

Paperid: 826, https://arxiv.org/pdf/2501.11814.pdf

Abstract:
Infinite scrolling on social media platforms is designed to encourage prolonged engagement, leading users to spend more time than desired, which can provoke negative emotions. Interventions to mitigate infinite scrolling have shown initial success, yet users become desensitized due to the lack of contextual relevance. Understanding how contextual factors influence intervention effectiveness remains underexplored. We conducted a 7-day user study (N=72) investigating how these contextual factors affect users' reactance and responsiveness to interventions during infinite scrolling. Our study revealed an interplay, with contextual factors such as being at home, sleepiness, and valence playing significant roles in the intervention's effectiveness. Low valence coupled with being at home slows down the responsiveness to interventions, and sleepiness lowers reactance towards interventions, increasing user acceptance of the intervention. Overall, our work contributes to a deeper understanding of user responses toward interventions and paves the way for developing more effective interventions during infinite scrolling.

Paperid: 827, https://arxiv.org/pdf/2501.11801.pdf

Abstract:
The introduction of Highly Automated Vehicles (HAVs) has the potential to increase the independence of blind and visually impaired people (BVIPs). However, ensuring safety and situation awareness when exiting these vehicles in unfamiliar environments remains challenging. To address this, we conducted an interactive workshop with N=5 BVIPs to identify their information needs when exiting an HAV and evaluated three previously developed low-fidelity prototypes. The insights from this workshop guided the development of PathFinder, a multimodal interface combining visual, auditory, and tactile modalities tailored to BVIPs' unique needs. In a three-factorial within-between-subject study with N=16 BVIPs, we evaluated PathFinder against an auditory-only baseline in urban and rural scenarios. PathFinder significantly reduced mental demand and maintained high perceived safety in both scenarios, while the auditory baseline led to lower perceived safety in the urban scenario compared to the rural one. Qualitative feedback further supported PathFinder's effectiveness in providing spatial orientation during exiting.

Paperid: 828, https://arxiv.org/pdf/2501.10792.pdf

Abstract:
The absence of a human operator in automated vehicles (AVs) may require external Human-Machine Interfaces (eHMIs) to facilitate communication with other road users in uncertain scenarios, for example, regarding the right of way. Given the plethora of adjustable parameters, balancing visual and auditory elements is crucial for effective communication with other road users. With N=37 participants, this study employed multi-objective Bayesian optimization to enhance eHMI designs and improve trust, safety perception, and mental demand. By reporting the Pareto front, we identify optimal design trade-offs. This research contributes to the ongoing standardization efforts of eHMIs, supporting broader adoption.
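
To make the idea of reporting a Pareto front concrete, here is a minimal sketch that filters a set of candidate designs down to the non-dominated ones. It does not reproduce the Bayesian optimization loop from the paper; the design scores below are invented for illustration, and mental demand is negated so that every objective is maximized.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better once.

    All objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of points that no other point dominates."""
    front = []
    for i, p in enumerate(points):
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i):
            front.append(i)
    return front


# Hypothetical scores per design: (trust, perceived safety, -mental demand).
designs = [
    (0.8, 0.7, -0.3),
    (0.6, 0.9, -0.5),
    (0.5, 0.6, -0.6),  # worse than design 0 on every objective
    (0.8, 0.7, -0.4),  # same as design 0 but with higher mental demand
]
front = pareto_front(designs)
print(front)
```

Designs 0 and 1 survive because each is better than the other on at least one objective, which is exactly the kind of trade-off a Pareto front is meant to expose.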

Paperid: 829, https://arxiv.org/pdf/2501.08046.pdf

Abstract:
Artificial Intelligence (AI) spreads quickly as new technologies and services take over modern society. Regulating AI design, development, and use is necessary to avoid unethical and potentially dangerous consequences to humans. The European Union (EU) has released a new legal framework, the AI Act, to regulate AI by undertaking a risk-based approach to safeguard humans during interaction. At the same time, researchers offer a new perspective on AI systems, commonly known as Human-Centred AI (HCAI), highlighting the need for a human-centred approach to their design. In this context, Symbiotic AI (SAI, a subtype of HCAI) promises to enhance human capabilities through a deeper and continuous collaboration between human intelligence and AI. This article presents the results of a Systematic Literature Review (SLR) that aims to identify principles that characterise the design and development of Symbiotic AI systems while considering humans as the core of the process. Through content analysis, four principles emerged from the review that must be applied to create Human-Centred AI systems that can establish a symbiotic relationship with humans. In addition, current trends and challenges were defined to indicate open questions that may guide future research for the development of SAI systems that comply with the AI Act.

Paperid: 830, https://arxiv.org/pdf/2501.07713.pdf

Abstract:
Reliable detection and segmentation of human hands are critical for enhancing safety and facilitating advanced interactions in human-robot collaboration. Current research predominantly evaluates hand segmentation under in-distribution (ID) data, which reflects the training data of deep learning (DL) models. However, this approach fails to address out-of-distribution (OOD) scenarios that often arise in real-world human-robot interactions. In this study, we present a novel approach by evaluating the performance of pre-trained DL models under both ID data and more challenging OOD scenarios. To mimic realistic industrial scenarios, we designed a diverse dataset featuring simple and cluttered backgrounds with industrial tools, varying numbers of hands (0 to 4), and hands with and without gloves. For OOD scenarios, we incorporated unique and rare conditions such as finger-crossing gestures and motion blur from fast-moving hands, addressing both epistemic and aleatoric uncertainties. To ensure multiple points of view (PoVs), we utilized both egocentric cameras, mounted on the operator's head, and static cameras to capture RGB images of human-robot interactions. This approach allowed us to account for multiple camera perspectives while also evaluating the performance of models trained on existing egocentric datasets as well as static-camera datasets. For segmentation, we used a deep ensemble model composed of UNet and RefineNet as base learners. Performance evaluation was conducted using segmentation metrics and uncertainty quantification via predictive entropy. Results revealed that models trained on industrial datasets outperformed those trained on non-industrial datasets, highlighting the importance of context-specific training. Although all models struggled with OOD scenarios, those trained on industrial datasets demonstrated significantly better generalization.
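
Predictive entropy for an ensemble is usually computed by averaging the members' class probabilities and taking the Shannon entropy of that average. The sketch below shows this for a single pixel with two classes (hand, background); the probabilities are made up, and the paper's actual pipeline is not reproduced here.

```python
import math
from typing import List, Sequence


def predictive_entropy(member_probs: Sequence[Sequence[float]]) -> float:
    """Entropy (in nats) of the ensemble-averaged class distribution.

    `member_probs[m][c]` is the probability that ensemble member `m`
    assigns to class `c` for one pixel.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean: List[float] = [
        sum(member[c] for member in member_probs) / n_members
        for c in range(n_classes)
    ]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in mean if p > 0.0)


# Hypothetical pixel where both members agree it is a hand: low entropy.
agree = predictive_entropy([[0.9, 0.1], [0.9, 0.1]])
# Hypothetical pixel where the members contradict each other: maximal entropy.
disagree = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])
print(agree, disagree)
```

Applied to every pixel, this yields an uncertainty map, and high values on OOD inputs such as motion-blurred hands are what make the measure useful for flagging unreliable segmentations.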

Paperid: 831, https://arxiv.org/pdf/2501.06757.pdf

Abstract:
Automated vehicle (AV) acceptance relies on their understanding via feedback. While visualizations aim to enhance user understanding of AV's detection, prediction, and planning functionalities, establishing an optimal design is challenging. Traditional "one-size-fits-all" designs might be unsuitable, stemming from resource-intensive empirical evaluations. This paper introduces OptiCarVis, a set of Human-in-the-Loop (HITL) approaches using Multi-Objective Bayesian Optimization (MOBO) to optimize AV feedback visualizations. We compare conditions using eight expert and user-customized designs for a Warm-Start HITL MOBO. An online study (N=117) demonstrates OptiCarVis's efficacy in significantly improving trust, acceptance, perceived safety, and predictability without increasing cognitive load. OptiCarVis facilitates a comprehensive design space exploration, enhancing in-vehicle interfaces for optimal passenger experiences and broader applicability.

Paperid: 832, https://arxiv.org/pdf/2501.01539.pdf

Abstract:
In social robot navigation, traditional metrics like proxemics and behavior naturalness emphasize human comfort and adherence to social norms but often fail to capture an agent's autonomy and adaptability in dynamic environments. This paper introduces human empowerment, an information-theoretic concept that measures a human's ability to influence their future states and observe those changes, as a complementary metric for evaluating social compliance. This metric reveals how robot navigation policies can indirectly impact human empowerment. We present a framework that integrates human empowerment into the evaluation of social performance in navigation tasks. Through numerical simulations, we demonstrate that human empowerment as a metric not only aligns with intuitive social behavior, but also shows statistically significant differences across various robot navigation policies. These results provide a deeper understanding of how different policies affect social compliance, highlighting the potential of human empowerment as a complementary metric for future research in social navigation.

Paperid: 833, https://arxiv.org/pdf/2506.23092.pdf

Abstract:
Many scientific and engineering problems involving multi-physics span a wide range of scales. Understanding the interactions across these scales is essential for fully comprehending such complex problems. However, visualizing multivariate, multiscale data within an integrated view where correlations across space, scales, and fields are easily perceived remains challenging. To address this, we introduce a novel local spatial statistical visualization of flow fields across multiple fields and turbulence scales. Our method leverages the curvelet transform for scale decomposition of fields of interest, a level-set-restricted centroidal Voronoi tessellation to partition the spatial domain into local regions for statistical aggregation, and a set of glyph designs that combines information across scales and fields into a single, or reduced set of perceivable visual representations. Each glyph represents data aggregated within a Voronoi region and is positioned at the Voronoi site for direct visualization in a 3D view centered around flow features of interest. We implement and integrate our method into an interactive visualization system where the glyph-based technique operates in tandem with linked 3D spatial views and 2D statistical views, supporting a holistic analysis. We demonstrate with case studies visualizing turbulent combustion data--multi-scalar compressible flows--and turbulent incompressible channel flow data. This new capability enables scientists to better understand the interactions between multiple fields and length scales in turbulent flows.

Paperid: 834, https://arxiv.org/pdf/2506.22937.pdf

Abstract:
Blind and low-vision (BLV) players encounter critical challenges in engaging with video games due to the inaccessibility of visual elements, difficulties in navigating interfaces, and limitations in sending interaction input. Moreover, the development of specialized accessibility features typically requires substantial programming effort and is often implemented on a game-by-game basis. To address these challenges, we introduce GamerAstra, a generalized accessibility framework that leverages a multi-agent design to facilitate access to video games for BLV players. It integrates multi-modal techniques including large language models and vision-language models, enabling interaction with games lacking native accessibility support. The framework further incorporates customizable assistance granularities to support varying degrees of visual impairment and enhances interface navigation through multiple input modalities. The evaluation through technical assessments and user studies indicates that GamerAstra effectively enhances playability and delivers a more immersive gaming experience for BLV players. These findings also underscore potential avenues for advancing intelligent accessibility frameworks in the gaming domain.

Paperid: 835, https://arxiv.org/pdf/2506.20803.pdf

Abstract:
Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel; it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than those of expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

Paperid: 836, https://arxiv.org/pdf/2506.18770.pdf

Abstract:
As Artificial Intelligence (AI) becomes increasingly integrated into high-stakes domains like healthcare, effective collaboration between healthcare experts and AI systems is critical. Data-centric steering, which involves fine-tuning prediction models by improving training data quality, plays a key role in this process. However, little research has explored how varying levels of user control affect healthcare experts during data-centric steering. We address this gap by examining manual and automated steering approaches through a between-subjects, mixed-methods user study with 74 healthcare experts. Our findings show that manual steering, which grants direct control over training data, significantly improves model performance while maintaining trust and system understandability. Based on these findings, we propose design implications for a hybrid steering system that combines manual and automated approaches to increase user involvement during human-AI collaboration.

Paperid: 837, https://arxiv.org/pdf/2506.18466.pdf

Abstract:
The gaze of a person tends to reflect their interest. This work explores what happens when this statement is taken literally and applied to robots. Here we present a robot system that employs a moving robot head with a screen-based eye model that can direct the robot's gaze to points in physical space and present a reflection-like mirror image of the attended region on top of each eye. We conducted a user study with 33 participants, who were asked to instruct the robot to perform pick-and-place tasks, monitor the robot's task execution, and interrupt it in case of erroneous actions. Despite a deliberate lack of instructions about the role of the eyes and a very brief system exposure, participants felt more aware of the robot's information processing, detected erroneous actions earlier, and rated the user experience higher when eye-based mirroring was enabled compared to non-reflective eyes. These results suggest a beneficial and intuitive utilization of the introduced method in cooperative human-robot interaction.

Paperid: 838, https://arxiv.org/pdf/2506.15834.pdf

Abstract:
Mobile health (mHealth) systems help researchers monitor and care for patients in real-world settings. Studies utilizing mHealth applications use Ecological Momentary Assessments (EMAs), passive sensing, and contextual features to develop emotion recognition models, which rely on EMA responses as ground truth. Due to this, it is crucial to consider EMA compliance when conducting a successful mHealth study. Utilizing machine learning is one approach that can solve this problem by sending EMAs based on the predicted likelihood of a response. However, literature suggests that this approach may lead to prompting participants more frequently during emotions associated with responsiveness, thereby narrowing the range of emotions collected. We propose a multi-objective function that utilizes machine learning to identify optimal times for sending EMAs. The function identifies optimal moments by combining predicted response likelihood with model uncertainty in emotion predictions. Uncertainty would lead the function to prioritize time points when the model is less confident, which often corresponds to underrepresented emotions. We demonstrate that this objective function would result in EMAs being sent when participants are responsive and experiencing less commonly observed emotions. The evaluation is conducted offline using two datasets: (1) 91 spousal caregivers of individuals with Alzheimer's Disease and Related Dementias (ADRD), (2) 45 healthy participants. Results show that the multi-objective function tends to be higher when participants respond to EMAs and report less commonly observed emotions. This suggests that using the proposed objective function to guide EMA delivery could improve receptivity rates and capture a broader range of emotions.
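
To make the idea of combining response likelihood with prediction uncertainty concrete, here is a minimal Python sketch. It is not the authors' objective: the abstract does not give the exact form, so the weighted sum, the use of normalized Shannon entropy as the uncertainty measure, and the names `ema_score`, `best_prompt_time`, and `weight` are all assumptions made for illustration.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to the range [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)


def ema_score(p_response, emotion_probs, weight=0.5):
    """Hypothetical objective: weighted sum of the predicted response
    likelihood and the emotion model's uncertainty (assumed form)."""
    return weight * p_response + (1 - weight) * normalized_entropy(emotion_probs)


def best_prompt_time(candidates, weight=0.5):
    """Return the label of the highest-scoring candidate.

    candidates: list of (time_label, p_response, emotion_probs) tuples.
    """
    return max(candidates, key=lambda c: ema_score(c[1], c[2], weight))[0]


if __name__ == "__main__":
    # Made-up candidate moments, for illustration only.
    candidates = [
        ("09:00", 0.9, [0.98, 0.01, 0.01]),  # responsive, confident model
        ("13:00", 0.7, [0.40, 0.30, 0.30]),  # fairly responsive, uncertain model
        ("21:00", 0.2, [0.34, 0.33, 0.33]),  # unresponsive, uncertain model
    ]
    print(best_prompt_time(candidates))
```

With `weight=1.0` the sketch reduces to prompting purely on response likelihood, which is the baseline behaviour the abstract argues can narrow the range of emotions collected.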

Paperid: 839, https://arxiv.org/pdf/2506.14056.pdf

Abstract:
The interdependencies of food, energy, and water (FEW) systems create a nexus opportunity to explore the strengths and vulnerabilities of individual and cross-sector interactions within FEW systems. However, the variables quantifying nexus interactions are hard to observe, which hinders the cross-sector analysis. To overcome such challenges, we present FEWSim, a visual analytics framework designed to support domain experts in exploring and interpreting simulation results from a coupled FEW model. FEWSim employs a three-layer asynchronous architecture: the model layer integrates food, energy, and water models to simulate the FEW nexus; the middleware layer manages scenario configuration and execution; and the visualization layer provides interactive visual exploration of simulated time-series results across FEW sectors. The visualization layer further facilitates the exploration across multiple scenarios and evaluates scenario differences in performance using sustainability indices of the FEW nexus. We demonstrate the utility of FEWSim through a case study for the Phoenix Active Management Area (AMA) in Arizona.

Paperid: 840, https://arxiv.org/pdf/2506.08634.pdf

Abstract:
In this article, we present a novel multimodal feedback framework called MOSAIC-F, an acronym for a data-driven Framework that integrates Multimodal Learning Analytics (MMLA), Observations, Sensors, Artificial Intelligence (AI), and Collaborative assessments for generating personalized feedback on student learning activities. This framework consists of four key steps. First, peer and professor assessments are conducted through standardized rubrics (that include both quantitative and qualitative evaluations). Second, multimodal data are collected during learning activities, including video recordings, audio capture, gaze tracking, physiological signals (heart rate, motion data), and behavioral interactions. Third, personalized feedback is generated using AI, synthesizing human-based evaluations and data-based multimodal insights such as posture, speech patterns, stress levels, and cognitive load, among others. Finally, students review their own performance through video recordings and engage in self-assessment and feedback visualization, comparing their own evaluations with peer and professor assessments, class averages, and AI-generated recommendations. By combining human-based and data-based evaluation techniques, this framework enables more accurate, personalized and actionable feedback. We tested MOSAIC-F in the context of improving oral presentation skills.

Paperid: 841, https://arxiv.org/pdf/2506.05648.pdf

Abstract:
We present a customizable soft haptic system that integrates modular hardware with an information-theoretic algorithm to personalize feedback for different users and tasks. Our platform features modular, multi-degree-of-freedom pneumatic displays, where different signal types, such as pressure, frequency, and contact area, can be activated or combined using fluidic logic circuits. These circuits simplify control by reducing reliance on specialized electronics and enabling coordinated actuation of multiple haptic elements through a compact set of inputs. Our approach allows rapid reconfiguration of haptic signal rendering through hardware-level logic switching without rewriting code. Personalization of the haptic interface is achieved through the combination of modular hardware and software-driven signal selection. To determine which display configurations will be most effective, we model haptic communication as a signal transmission problem, where an agent must convey latent information to the user. We formulate the optimization problem to identify the haptic hardware setup that maximizes the information transfer between the intended message and the user's interpretation, accounting for individual differences in sensitivity, preferences, and perceptual salience. We evaluate this framework through user studies where participants interact with reconfigurable displays under different signal combinations. Our findings support the role of modularity and personalization in creating multimodal haptic interfaces and advance the development of reconfigurable systems that adapt with users in dynamic human-machine interaction contexts.
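
The information-transfer objective in the abstract above can be illustrated with a small Python sketch. This is not the authors' formulation: it assumes information transfer is measured as the mutual information between the intended message and the user's interpretation, estimated from a joint probability table per display configuration. The function names and the example tables are hypothetical.

```python
import math


def mutual_information(joint):
    """Mutual information in bits from a joint table joint[m][m_hat],
    where m is the intended message and m_hat the user's interpretation."""
    p_m = [sum(row) for row in joint]            # marginal over intended messages
    p_hat = [sum(col) for col in zip(*joint)]    # marginal over interpretations
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_m[i] * p_hat[j]))
    return mi


def best_configuration(joint_by_config):
    """Pick the configuration whose joint table has the highest mutual information."""
    return max(joint_by_config, key=lambda name: mutual_information(joint_by_config[name]))


if __name__ == "__main__":
    # Made-up confusion data for one user and two binary messages.
    tables = {
        "pressure": [[0.4, 0.1], [0.1, 0.4]],            # some confusion
        "pressure+frequency": [[0.5, 0.0], [0.0, 0.5]],  # no confusion
    }
    print(best_configuration(tables))
```

Estimating a separate table per user is one simple way to account for the individual differences in sensitivity that the abstract mentions.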

Paperid: 842, https://arxiv.org/pdf/2506.03687.pdf

Abstract:
Designing inclusive public transport services is crucial to developing modern, barrier-free smart city infrastructure. This research contributes to the design of inclusive public transport by considering accessibility challenges emerging from socio-technical systems, thus demanding the integration of technological and social solutions. Using Actor-Network Theory (ANT) as a theoretical framework and a mixed-method approach, including shadowing and a focus group, this study examines the socio-technical networks that shape accessibility experiences for visually impaired passengers utilizing the tram in Linz, Austria. Key dimensions that influence public transport accessibility are identified: network configuration, mobility patterns, technology integration, and warning systems. The results show that accessibility emerges from complex interactions between human actors (passengers, staff) and non-human actors (assistive devices, infrastructure) rather than being an inherent property of transport systems. Digital technologies serve multiple functions, from navigational assistance to broader social inclusion, although users' comfort with technology varies. Participants emphasized the importance of the two-sense principle for warning signals, with directional audio and tactile feedback particularly valuable.

Paperid: 843, https://arxiv.org/pdf/2505.24014.pdf

Abstract:
The growing use of Generative AI (GenAI) conversational search tools has raised concerns about their effects on people's metacognitive engagement, critical thinking, and learning. As people increasingly rely on GenAI to perform tasks such as analyzing and applying information, they may become less actively engaged in thinking and learning. This study examines whether metacognitive prompts - designed to encourage people to pause, reflect, assess their understanding, and consider multiple perspectives - can support critical thinking during GenAI-based search. We conducted a user study (N=40) with university students to investigate the impact of metacognitive prompts on their thought processes and search behaviors while searching with a GenAI tool. We found that these prompts led to more active engagement, prompting students to explore a broader range of topics and engage in deeper inquiry through follow-up queries. Students reported that the prompts were especially helpful for considering overlooked perspectives, promoting evaluation of AI responses, and identifying key takeaways. Additionally, the effectiveness of these prompts was influenced by students' metacognitive flexibility. Our findings highlight the potential of metacognitive prompts to foster critical thinking and provide insights for designing and implementing metacognitive support in human-AI interactions.

Paperid: 844, https://arxiv.org/pdf/2505.22418.pdf

Abstract:
Supplemental Nutrition Assistance Program (SNAP) is an essential benefit support system provided by the US administration to 41 million federally determined low-income applicants. Through interviews with such applicants across a diverse set of experiences with the SNAP system, our findings reveal that new AI technologies like LLMs can alleviate traditional burdens but also introduce new burdens. We introduce new types of learning, compliance, and psychological costs that transform the administrative burden on applicants. We also identify how trust in AI across three dimensions--competence, integrity, and benevolence--is perceived to reduce administrative burdens, which may stem from unintended and untoward overt trust in the system. We discuss calibrating appropriate levels of user trust in LLM-based administrative systems, mitigating newly introduced burdens. In particular, our findings suggest that evidence-based information disclosure is necessary in benefits administration and propose directions for future research on trust-burden dynamics in AI-assisted administration systems.

Paperid: 845, https://arxiv.org/pdf/2505.21752.pdf

Abstract:
Experimental evidence on worker responses to AI management remains mixed, partly due to limitations in experimental fidelity. We address these limitations with a customized workplace in the Minecraft platform, enabling high-resolution behavioral tracking of autonomous task execution, and ensuring that participants approach the task with well-formed expectations about their own competence. Workers (N = 382) completed repeated production tasks under either human, AI, or hybrid management. An AI manager trained on human-defined evaluation principles systematically assigned lower performance ratings and reduced wages by 40%, without adverse effects on worker motivation and sense of fairness. These effects were driven by a muted emotional response to AI evaluation, compared to evaluation by a human. The very features that make AI appear impartial may also facilitate silent exploitation, by suppressing the social reactions that normally constrain extractive practices in human-managed work.

Paperid: 846, https://arxiv.org/pdf/2505.17479.pdf

Abstract:
LLM-based digital twin simulation, where large language models are used to emulate individual human behavior, holds great promise for research in AI, social science, and digital experimentation. However, progress in this area has been hindered by the scarcity of real, individual-level datasets that are both large and publicly available. This lack of high-quality ground truth limits both the development and validation of digital twin methodologies. To address this gap, we introduce a large-scale, public dataset designed to capture a rich and holistic view of individual human behavior. We survey a representative sample of N = 2,058 participants (average 2.42 hours per person) in the US across four waves with 500 questions in total, covering a comprehensive battery of demographic, psychological, economic, personality, and cognitive measures, as well as replications of behavioral economics experiments and a pricing survey. The final wave repeats tasks from earlier waves to establish a test-retest accuracy baseline. Initial analyses suggest the data are of high quality and show promise for constructing digital twins that predict human behavior well at the individual and aggregate levels. By making the full dataset publicly available, we aim to establish a valuable testbed for the development and benchmarking of LLM-based persona simulations. Beyond LLM applications, due to its unique breadth and scale the dataset also enables broad social science research, including studies of cross-construct correlations and heterogeneous treatment effects.

Paperid: 847, https://arxiv.org/pdf/2505.16089.pdf

Abstract:
As generative artificial intelligence (genAI) increasingly mediates how children learn, communicate, and engage with digital content, understanding children's hopes and fears about this emerging technology is crucial. In a pilot study with 37 fifth-graders, we explored how children (ages 9-10) envision genAI and the roles they believe it should play in their daily life. Our findings reveal three key ways children envision genAI: as a companion providing guidance, a collaborator working alongside them, and a task automator that offloads responsibilities. However, alongside these hopeful views, children expressed fears about overreliance, particularly in academic settings, linking it to fears of diminished learning, disciplinary consequences, and long-term failure. This study highlights the need for child-centric AI design that balances these tensions, empowering children with the skills to critically engage with and navigate their evolving relationships with digital technologies.

Paperid: 848, https://arxiv.org/pdf/2505.16034.pdf

Abstract:
The integration of generative Artificial Intelligence (genAI) into everyday life raises questions about the competencies required to critically engage with these technologies. Unlike visual errors in genAI, textual mistakes are often harder to detect and require specific domain knowledge. Furthermore, AI's authoritative tone and structured responses can create an illusion of correctness, leading to overtrust, especially among children. To address this, we developed AI Puzzlers, an interactive system based on the Abstraction and Reasoning Corpus (ARC), to help children identify and analyze errors in genAI. Drawing on Mayer & Moreno's Cognitive Theory of Multimedia Learning, AI Puzzlers uses visual and verbal elements to reduce cognitive overload and support error detection. Based on two participatory design sessions with 21 children (ages 6 - 11), our findings provide both design insights and an empirical understanding of how children identify errors in genAI reasoning, develop strategies for navigating these errors, and evaluate AI outputs.

Paperid: 849, https://arxiv.org/pdf/2505.16031.pdf

Abstract:
As artificial intelligence (AI) advances in reasoning capabilities, most recently with the emergence of Large Reasoning Models (LRMs), understanding how children conceptualize AI's reasoning processes becomes critical for fostering AI literacy. While one of the "Five Big Ideas" in AI education highlights reasoning algorithms as central to AI decision-making, less is known about children's mental models in this area. Through a two-phase approach, consisting of a co-design session with 8 children followed by a field study with 106 children (grades 3-8), we identified three models of AI reasoning: Deductive, Inductive, and Inherent. Our findings reveal that younger children (grades 3-5) often attribute AI's reasoning to inherent intelligence, while older children (grades 6-8) recognize AI as a pattern recognizer. We highlight three tensions that surfaced in children's understanding of AI reasoning and conclude with implications for scaffolding AI curricula and designing explainable AI tools.

Paperid: 850, https://arxiv.org/pdf/2505.14370.pdf

Abstract:
Despite decades of HCI and Meeting Science research, complaints about ineffective meetings are still pervasive. We argue that meeting technologies lack support for prospective reflection, that is, thinking about why a meeting is needed and what might happen. To explore this, we designed a Meeting Purpose Assistant (MPA) technology probe to coach users to articulate their meeting's purpose and challenges, and act accordingly. The MPA used Generative AI to support personalized and actionable prospective reflection across the diversity of meeting contexts. Using a participatory prompting methodology, 18 employees of a global technology company reflected with the MPA on upcoming meetings. Observed impacts were: clarifying meeting purposes, challenges, and success conditions; changing perspectives and flexibility; improving preparation and communication; and proposing changed plans. We also identify perceived social, temporal, and technological barriers to using the MPA. We present system and workflow design considerations for developing AI-assisted reflection support for meetings.

Paperid: 851, https://arxiv.org/pdf/2505.12666.pdf

Abstract:
CSCW has long examined how emerging technologies reshape the ways researchers collaborate and produce knowledge, with scientific knowledge production as a central area of focus. As AI becomes increasingly integrated into scientific research, understanding how researchers adapt to it reveals timely opportunities for CSCW research -- particularly in supporting new forms of collaboration, knowledge practices, and infrastructure in AI-driven science. This study quantifies LLM impacts on scientific knowledge production based on an evaluation workflow that combines an insider-outsider perspective with a knowledge production framework. Our findings reveal how LLMs catalyze both innovation and reorganization in scientific communities, offering insights into the broader transformation of knowledge production in the age of generative AI and shedding light on new research opportunities in CSCW.

Paperid: 852, https://arxiv.org/pdf/2505.12408.pdf

Abstract:
Understanding and decoding brain activity into visual representations is a fundamental challenge at the intersection of neuroscience and artificial intelligence. While EEG visual decoding has shown promise due to its non-invasive and low-cost nature, existing methods suffer from Hierarchical Neural Encoding Neglect (HNEN), a critical limitation where flat neural representations fail to model the brain's hierarchical visual processing. Inspired by the hierarchical organization of visual cortex, we propose ViEEG, a neuro-inspired framework that addresses HNEN. ViEEG decomposes each visual stimulus into three biologically aligned components (contour, foreground object, and contextual scene) that serve as anchors for a three-stream EEG encoder. These EEG features are progressively integrated via cross-attention routing, simulating cortical information flow from low-level to high-level vision. We further adopt hierarchical contrastive learning for EEG-CLIP representation alignment, enabling zero-shot object recognition. Extensive experiments on the THINGS-EEG dataset demonstrate that ViEEG outperforms previous methods by a large margin in both subject-dependent and subject-independent settings. Results on the THINGS-MEG dataset further confirm ViEEG's generalization to different neural modalities. Our framework not only advances the performance frontier but also sets a new paradigm for EEG brain decoding.

Paperid: 853, https://arxiv.org/pdf/2505.10839.pdf

Abstract:
Social media feed ranking algorithms fail when they too narrowly focus on engagement as their objective. The literature has asserted a wide variety of values that these algorithms should account for as well -- ranging from well-being to productive discourse -- far more than can be encapsulated by a single topic or theory. In response, we present a "library of values" for social media algorithms: a pluralistic set of 78 values as articulated across the literature, implemented into LLM-powered content classifiers that can be installed individually or in combination for real-time re-ranking of social media feeds. We investigate this approach by developing a browser extension, Alexandria, that re-ranks the X/Twitter feed in real time based on the user's desired values. Through two user studies, both qualitative (N=12) and quantitative (N=257), we found that diverse user needs require a large library of values, enabling more nuanced preferences and greater user control. With this work, we argue that the values criticized as missing from social media ranking algorithms can be operationalized and deployed today through end-user tools.
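
As a rough illustration of re-ranking a feed by user-selected values, the Python sketch below scores each post as a weighted sum of per-value classifier outputs. This is not the Alexandria implementation: the abstract does not describe the scoring rule, so the weighted sum, the `rerank` function, and the post structure are assumptions, and the LLM classifiers are replaced by precomputed scores.

```python
def rerank(posts, user_weights):
    """Sort posts by a weighted sum of value scores (assumed scoring rule).

    posts: list of dicts with an "id" and a "values" dict mapping a value
           name to a classifier score in [0, 1] (precomputed here; in a real
           system these would come from content classifiers).
    user_weights: dict mapping a value name to the user's weight; negative
           weights push matching content down the feed.
    """
    def score(post):
        return sum(w * post["values"].get(v, 0.0) for v, w in user_weights.items())

    # sorted() is stable, so posts with equal scores keep their feed order.
    return sorted(posts, key=score, reverse=True)


if __name__ == "__main__":
    feed = [
        {"id": "a", "values": {"well-being": 0.1, "productive discourse": 0.2}},
        {"id": "b", "values": {"well-being": 0.9}},
        {"id": "c", "values": {"productive discourse": 0.8}},
    ]
    print([p["id"] for p in rerank(feed, {"well-being": 1.0})])
```

Treating each value as an independent weighted term is what lets values be enabled individually or in combination in this sketch.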

Paperid: 854, https://arxiv.org/pdf/2505.10426.pdf

Abstract:
The legal compliance and safety of different Human-in-the-loop (HITL) setups for AI can vary greatly. This manuscript aims to identify new ways of choosing between such setups, and shows that there is an unavoidable trade-off between the attribution of legal responsibility and the technical explainability of AI. We begin by using the notion of oracle machines from computability theory to formalise different HITL setups, distinguishing between trivial human monitoring, single endpoint human action, and highly involved interaction between the human(s) and the AI. These correspond to total functions, many-one reductions, and Turing reductions respectively. A taxonomy categorising HITL failure modes is then presented, highlighting the limitations on what any HITL setup can actually achieve. Our approach then identifies oversights from UK and EU legal frameworks, which focus on certain HITL setups which may not always achieve the desired ethical, legal, and sociotechnical outcomes. We suggest areas where the law should recognise the effectiveness of different HITL setups and assign responsibility in these contexts, avoiding unnecessary and unproductive human "scapegoating". Overall, we show how HITL setups involve many technical design decisions, and can be prone to failures which are often out of the humans' control. This opens up a new analytic perspective on the challenges arising in the creation of HITL setups, helping inform AI developers and lawmakers on designing HITL to better achieve their desired outcomes.

Paperid: 855, https://arxiv.org/pdf/2505.09868.pdf

Abstract:
Despite its constitutional relevance, the technical "individual fairness" criterion has not been operationalized in U.S. state or federal statutes/regulations. We conduct a human subjects experiment to address this gap, evaluating which demographic features are relevant for individual fairness evaluation of recidivism risk assessment (RRA) tools. Our analyses conclude that the individual similarity function should consider age and sex, but it should ignore race.

Paperid: 856, https://arxiv.org/pdf/2505.04184.pdf

Abstract:
Dementia significantly impacts cognitive, behavioral, and functional abilities, creating challenges for both individuals and caregivers. Recent advancements in HCI have introduced innovative technological solutions to support people with dementia (PwD) and their caregivers. This scoping review systematically examines 32 recent publications from leading digital libraries, categorizing technological interventions into four key domains: Assistive and Smart Technology for Daily Life, Social Interaction and Communication, Well-being and Psychological Support, and Caregiver Support and Training. Our analysis highlights how emerging technologies are transforming dementia care. These technologies enhance quality of life by promoting independence, fostering social engagement, and providing emotional and cognitive support. However, the review also identifies critical gaps, particularly in addressing the needs of individuals with early-stage dementia and the lack of individualized support mechanisms. By emphasizing user-centered design, accessibility, and ethical considerations, this paper offers a structured roadmap for future research and practice in dementia care. It bridges the gap between technological innovation and the real-world needs of PwD and their caregivers, providing valuable insights for researchers, practitioners, and policymakers. This review not only synthesizes current advancements but also sets the stage for future HCI-driven innovations in dementia care, aiming to improve outcomes for an aging global population.

Paperid: 857, https://arxiv.org/pdf/2505.01000.pdf

Abstract:
Scheduling is a perennial, and often challenging, problem for many groups. Existing tools are mostly static, showing an identical set of choices to everyone, regardless of the current status of attendees' inputs and preferences. In this paper, we propose Togedule, an adaptive scheduling tool that uses large language models to dynamically adjust the pool of choices and their presentation format. With the initial prototype, we conducted a formative study (N=10) and identified the potential benefits and risks of such an adaptive scheduling tool. Then, after enhancing the system, we conducted two controlled experiments, one each for attendees and organizers (total N=66). For each experiment, we compared scheduling with verbal messages, shared calendars, or Togedule. Results show that Togedule significantly reduces the cognitive load of attendees indicating their availability and improves the speed and quality of the decisions made by organizers.

Paperid: 858, https://arxiv.org/pdf/2504.18189.pdf

Abstract:
Danmaku, users' live comments synchronized with and overlaid on videos, has recently shown potential in promoting online video-based learning. However, user-generated danmaku can be scarce, especially in newer or less viewed videos, and its quality is unpredictable, limiting its educational impact. This paper explores how large multimodal models (LMM) can be leveraged to automatically generate effective, high-quality danmaku. We first conducted a formative study to identify the desirable characteristics of content- and emotion-related danmaku in educational videos. Based on the obtained insights, we developed ClassComet, an educational video platform with novel LMM-driven techniques for generating relevant types of danmaku to enhance video-based learning. Through user studies, we examined the quality of generated danmaku and their influence on learning experiences. The results indicate that our generated danmaku is comparable to human-created ones, and videos with both content- and emotion-related danmaku showed significant improvement in viewers' engagement and learning outcomes.

Paperid: 859, https://arxiv.org/pdf/2504.17921.pdf

Abstract:
In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., stripes, black) and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM's mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
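
The concept-intervention mechanism described in the abstract above can be sketched in a few lines of Python. The sketch shows a plain concept bottleneck with a linear label head, not MixCEM. The concept names come from the abstract's examples, while the class names, weights, and function names are hypothetical.

```python
def predict_label(concepts, weights, biases):
    """Linear label head: return the index of the class with the highest
    score computed from the concept activations."""
    scores = [
        sum(w * c for w, c in zip(row, concepts)) + b
        for row, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def intervene(predicted_concepts, corrections):
    """Concept intervention: overwrite selected concept predictions with
    values supplied by a human expert.

    corrections: dict mapping a concept index to its corrected value.
    """
    fixed = list(predicted_concepts)
    for idx, value in corrections.items():
        fixed[idx] = value
    return fixed


if __name__ == "__main__":
    # Concepts: [stripes, black]; classes: 0 = "zebra", 1 = "panther" (made up).
    weights = [[2.0, -1.0], [-1.0, 2.0]]
    biases = [0.0, 0.0]

    predicted = [0.2, 0.9]  # the concept predictor misses the stripes
    print(predict_label(predicted, weights, biases))

    corrected = intervene(predicted, {0: 1.0, 1: 0.0})  # expert fixes both concepts
    print(predict_label(corrected, weights, biases))
```

In this plain bottleneck the label depends only on the concepts, so a correct intervention always reaches the label head. The leakage poisoning that the abstract describes concerns models that also pass information outside the concepts, which this sketch does not model.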

Paperid: 860, https://arxiv.org/pdf/2504.17204.pdf

Abstract:
Wearable devices are transforming human capabilities by seamlessly augmenting cognitive functions. In this position paper, we propose a voice-based, interactive learning companion designed to amplify and extend cognitive abilities through informal learning. Our vision is threefold: (1) to enable users to discover new knowledge on-the-go through contextual interactive quizzes, fostering critical thinking and mindfulness, (2) to proactively detect misinformation, empowering users to critically assess information in real time, and (3) to provide spoken language correction and prompting hints for second language learning and effective communication. As an initial step toward this vision, we present Factually - a proactive, wearable fact-checking system integrated into devices like smartwatches or rings. Factually discreetly alerts users to potential falsehoods via vibrotactile feedback, helping them assess information critically. We demonstrate its utility through three illustrative scenarios, highlighting its potential to extend cognitive abilities for real-time misinformation detection. Early qualitative feedback suggests that Factually can enhance users' fact-checking capabilities, offering both practical and experiential benefits.

Paperid: 861, https://arxiv.org/pdf/2504.16548.pdf

Abstract:
There has been extensive prior work exploring how psychological factors such as anthropomorphism affect the adoption of shared autonomous vehicles (SAVs). However, limited research has been conducted on how prompt strategies in large language model (LLM)-powered SAV User Interfaces (UIs) affect users' perceptions, experiences, and intentions to adopt such technology. In this work, we investigate how conversational UIs powered by LLMs drive these psychological factors and psychological ownership, the sense of possession a user may come to feel towards an entity or object they may not legally own. We designed four SAV UIs with varying levels of anthropomorphic characteristics and psychological ownership triggers. Quantitative measures of psychological ownership, anthropomorphism, quality of service, disclosure tendency, sentiment of SAV responses, and overall acceptance were collected after participants interacted with each SAV. Qualitative feedback was also gathered regarding the experience of psychological ownership during the interactions. The results indicate that an SAV conversational UI designed to be more anthropomorphic and to induce psychological ownership improved users' perceptions of the SAV's human-like qualities and improved the sentiment of responses compared to a control condition. These findings provide practical guidance for designing LLM-based conversational UIs that enhance user experience and adoption of SAVs.

Paperid: 862, https://arxiv.org/pdf/2504.14522.pdf

Abstract:
This paper explores the design of a propaganda detection tool using Large Language Models (LLMs). Acknowledging the inherent biases in AI models, especially in political contexts, we investigate how these biases might be leveraged to enhance critical thinking in news consumption. Countering the typical view of AI biases as detrimental, our research proposes strategies of user choice and personalization in response to a user's political stance, applying psychological concepts of confirmation bias and cognitive dissonance. We present findings from a qualitative user study, offering insights and design recommendations (bias awareness, personalization and choice, and gradual introduction of diverse perspectives) for AI tools in propaganda detection.

Paperid: 863, https://arxiv.org/pdf/2504.14406.pdf

Abstract:
Synthesizing knowledge from large document collections is a critical yet increasingly complex aspect of qualitative research and knowledge work. While AI offers automation potential, effectively integrating it into human-centric sensemaking workflows remains challenging. We present ScholarMate, an interactive system designed to augment qualitative analysis by unifying AI assistance with human oversight. ScholarMate enables researchers to dynamically arrange and interact with text snippets on a non-linear canvas, leveraging AI for theme suggestions, multi-level summarization, and evidence-based theme naming, while ensuring transparency through traceability to source documents. Initial pilot studies indicated that users value this mixed-initiative approach, finding the balance between AI suggestions and direct manipulation crucial for maintaining interpretability and trust. We further demonstrate the system's capability through a case study analyzing 24 papers. By balancing automation with human control, ScholarMate enhances efficiency and supports interpretability, offering a valuable approach for productive human-AI collaboration in demanding sensemaking tasks common in knowledge work.

Paperid: 864, https://arxiv.org/pdf/2504.13906.pdf

Abstract:
This study assesses a Vehicle-to-Pedestrian (V2P) collision warning system compared to conventional vehicle-issued auditory alerts in a real-world scenario simulating a vehicle on a fixed track, characterized by limited maneuverability and the need for timely pedestrian response. The results from analyzing speed variations show that V2P warnings are particularly effective for pedestrians distracted by phone use (gaming or listening to music), highlighting the limitations of auditory alerts in noisy environments. The findings suggest that V2P technology offers a promising approach to improving pedestrian safety in urban areas.

Paperid: 865, https://arxiv.org/pdf/2504.13897.pdf

Abstract:
Counterfactual explanations offer actionable insights by illustrating how changes to inputs can lead to different outcomes. However, these explanations often suffer from ambiguity and impracticality, limiting their utility for non-expert users with limited AI knowledge. Augmenting counterfactual explanations with Large Language Models (LLMs) has been proposed as a solution, but little research has examined their benefits and challenges for non-experts. To address this gap, we developed a healthcare-focused system that leverages conversational AI agents to enhance counterfactual explanations, offering clear, actionable recommendations to help patients at high risk of cardiovascular disease (CVD) reduce their risk. Evaluated through a mixed-methods study with 34 participants, our findings highlight the effectiveness of agent-augmented counterfactuals in improving actionable recommendations. Results further indicate that users with prior experience using conversational AI demonstrated greater effectiveness in utilising these explanations compared to novices. Furthermore, this paper introduces a set of generic guidelines for creating augmented counterfactual explanations, incorporating safeguards to mitigate common LLM pitfalls, such as hallucinations, and ensuring the explanations are both actionable and contextually relevant for non-expert users.

Paperid: 866, https://arxiv.org/pdf/2504.13884.pdf

Abstract:
Multimedia learning using text and images has been shown to improve learning outcomes compared to text-only instruction. However, conversational AI systems in education predominantly rely on text-based interactions, while multimodal conversations for multimedia learning remain unexplored. Moreover, deploying conversational AI in learning contexts requires grounding in reliable sources and verifiability to create trust. We present MuDoC, a Multimodal Document-grounded Conversational AI system based on GPT-4o, that leverages both text and visuals from documents to generate responses interleaved with text and images. Its interface allows verification of AI-generated content through seamless navigation to the source. We compare MuDoC to a text-only system to explore differences in learner engagement, trust in the AI system, and performance on problem-solving tasks. Our findings indicate that both visuals and verifiability of content enhance learner engagement and foster trust; however, no significant impact on performance was observed. We draw upon theories from cognitive and learning sciences to interpret the findings and derive implications, and outline future directions for the development of multimodal conversational AI systems in education.

Paperid: 867, https://arxiv.org/pdf/2504.13877.pdf

Abstract:
Transitional care may play a vital role in the sustainability of Europe's future healthcare system, offering solutions for relocating patient care from hospital to home and thereby addressing the growing demand for medical care as the population ages. However, to be effective, it is essential to integrate innovative Information and Communications Technologies to ensure that patients with comorbidities experience a smooth and coordinated transition from hospitals or care centers to home, thereby reducing the risk of rehospitalization. In this paper, we present an overview of the integration of Internet of Things, artificial intelligence, and digital assistance technologies with traditional care pathways to address the challenges and needs of healthcare systems in Europe. We identify the current gaps in transitional care and define the technology mapping to enhance the care pathways, aiming to improve patient outcomes, safety, and quality of life while avoiding hospital readmissions. Finally, we define the trial setup and evaluation methodology needed to provide clinical evidence that supports the positive impact of technology integration on patient care and discuss the potential effects on the healthcare system.

Paperid: 868, https://arxiv.org/pdf/2504.13874.pdf

Abstract:
In this paper, we present God's Innovation Project (GIP), a god game where players collect words to dynamically terraform the landscape using generative AI. A god game is a genre where players take on the role of a deity, indirectly influencing Non-Player Characters (NPCs) to perform various tasks. These games typically grant players supernatural abilities, such as terrain manipulation or weather control. Traditional god games rely on predefined environments and mechanics, typically created by a human designer. In contrast, GIP allows players to shape the game world procedurally through text-based input. Using a lightweight generative AI model, we create a gamified pipeline which transforms the player's text prompts into playable game terrain in real time. To evaluate the impact of this AI-driven mechanic, we conduct a user study analyzing how players interacted with and experienced the system. Our findings provide insights into player engagement, the effectiveness of AI-generated terrain, and the role of generative AI as an interactive game mechanic.

Paperid: 869, https://arxiv.org/pdf/2504.09717.pdf

Abstract:
This work aims to interpret human behavior to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset that included facial emotion detection, eye gaze estimation, and gestures from 55 participants in a user study, we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without inducing confusion. We formulate a data-driven predictor to predict human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on the predictor, to adapt the explanation level according to observed human behavior. The promising results from this evaluation indicate the potential of this research in adapting a robot's explanations for failures to enhance the collaborative experience.

Paperid: 870, https://arxiv.org/pdf/2504.09662.pdf

Abstract:
Multi-agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulations from user-specified mechanics and dynamics. AgentDynEx uses LLMs to guide users through a Configuration Matrix to identify core mechanics and define milestones to track dynamics. It also introduces a method called nudging, where the system dynamically reflects on simulation progress and gently intervenes if it begins to deviate from intended outcomes. A technical evaluation found that nudging enables simulations to have more complex mechanics and maintain their notable dynamics compared to simulations without nudging. We discuss the importance of nudging as a technique for balancing mechanics and dynamics of multi-agent simulations.

Paperid: 871, https://arxiv.org/pdf/2504.04833.pdf

Abstract:
The integration of Artificial Intelligence (AI) in modern society is transforming how individuals perform tasks. In high-risk domains, ensuring human control over AI systems remains a key design challenge. This article presents a novel End-User Development (EUD) approach for black-box AI models, enabling users to edit explanations and influence future predictions through targeted interventions. By combining explainability, user control, and model adaptability, the proposed method advances Human-Centered AI (HCAI), promoting a symbiotic relationship between humans and adaptive, user-tailored AI systems.

Paperid: 872, https://arxiv.org/pdf/2504.01906.pdf

Abstract:
As head-mounted displays (HMDs) with eye-tracking become increasingly accessible, the need for effective gaze-based interfaces in virtual reality (VR) grows. Traditional gaze- or hand-based navigation often limits user precision or impairs free viewing, making multitasking difficult. We present a gaze-hand steering technique that combines eye-tracking with hand-pointing: users steer only when gaze aligns with a hand-defined target, reducing unintended actions and enabling free look. Speed is controlled via either a joystick or a waist-level speed circle. We evaluated our method in a user study (N=20) across multitasking and single-task scenarios, comparing it to a similar technique. Results show that gaze-hand steering maintains performance and enhances user comfort and spatial awareness during multitasking. Our findings support the use of gaze-hand steering in gaze-dominant VR applications requiring precision and simultaneous interaction. Our method significantly improves VR navigation in gaze-dominant, multitasking-intensive applications, supporting immersion and efficient control.

Paperid: 873, https://arxiv.org/pdf/2504.01153.pdf

Abstract:
While we increasingly rely on large language models (LLMs) for various tasks, these models are known to produce inaccurate content or 'hallucinations' with potentially disastrous consequences. The recent integration of web search results into LLMs prompts the question of whether people utilize them to verify the generated content, thereby accurately detecting hallucinations. An online experiment (N=560) investigated how the provision of search results, either static (i.e., fixed search results provided by LLM) or dynamic (i.e., participant-led searches), affects participants' perceived accuracy of LLM-generated content (i.e., genuine, minor hallucination, major hallucination), self-confidence in accuracy ratings, as well as their overall evaluation of the LLM, as compared to the control condition (i.e., no search results). Results showed that participants in both static and dynamic conditions (vs. control) rated hallucinated content to be less accurate and perceived the LLM more negatively. However, those in the dynamic condition rated genuine content as more accurate and demonstrated greater overall self-confidence in their assessments than those in the static search or control conditions. We highlighted practical implications of incorporating web search functionality into LLMs in real-world contexts.

Paperid: 874, https://arxiv.org/pdf/2504.01082.pdf

Abstract:
Meetings often suffer from a lack of intentionality, such as unclear goals and straying off-topic. Identifying goals and maintaining their clarity throughout a meeting is challenging, as discussions and uncertainties evolve. Yet meeting technologies predominantly fail to support meeting intentionality. AI-assisted reflection is a promising approach. To explore this, we conducted a technology probe study with 15 knowledge workers, integrating their real meeting data into two AI-assisted reflection probes: a passive and active design. Participants identified goal clarification as a foundational aspect of reflection. Goal clarity enabled people to assess when their meetings were off-track and reprioritize accordingly. Passive AI intervention helped participants maintain focus through non-intrusive feedback, while active AI intervention, though effective at triggering immediate reflection and action, risked disrupting the conversation flow. We identify three key design dimensions for AI-assisted reflection systems, and provide insights into design trade-offs, emphasizing the need to adapt intervention intensity and timing, balance democratic input with efficiency, and offer user control to foster intentional, goal-oriented behavior during meetings and beyond.

Paperid: 875, https://arxiv.org/pdf/2503.24119.pdf

Abstract:
User satisfaction plays a crucial role in user experience (UX) evaluation. Traditionally, UX measurements are based on subjective scales, such as questionnaires. However, these evaluations may suffer from subjective bias. In this paper, we explore the acoustic and prosodic features of speech to differentiate between positive and neutral UX during interactive sessions. By analyzing speech features such as root-mean-square (RMS), zero-crossing rate (ZCR), jitter, and shimmer, we identified significant differences between the positive and neutral user groups. In addition, social speech features such as activity and engagement also show notable variations between these groups. Our findings underscore the potential of speech analysis as an objective and reliable tool for UX measurement, contributing to more robust and bias-resistant evaluation methodologies. This work offers a novel approach to integrating speech features into UX evaluation and opens avenues for further research in HCI.
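
To make two of the features named in this abstract concrete, the following is a minimal sketch of frame-level RMS energy and zero-crossing rate computed with NumPy. It is not the authors' pipeline: the helper names (`frame_signal`, `rms`, `zero_crossing_rate`), the 25 ms frame and 10 ms hop, and the synthetic test signals are all assumptions chosen for illustration. Jitter and shimmer require pitch-period tracking and are not covered here.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (trailing remainder is dropped)."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack(
        [signal[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]
    )

def rms(frames):
    """Root-mean-square energy of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose sign differs."""
    signs = np.signbit(frames)
    crossings = signs[:, 1:] != signs[:, :-1]
    return crossings.mean(axis=1)

# Hypothetical inputs: one second of a 220 Hz tone and of white noise at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
noise = np.random.default_rng(0).standard_normal(sample_rate)

# 25 ms frames (400 samples) with a 10 ms hop (160 samples); assumed values.
tone_frames = frame_signal(tone, frame_len=400, hop_len=160)
noise_frames = frame_signal(noise, frame_len=400, hop_len=160)

tone_rms = rms(tone_frames)
tone_zcr = zero_crossing_rate(tone_frames)
noise_zcr = zero_crossing_rate(noise_frames)

print("tone RMS (mean):", tone_rms.mean())
print("tone ZCR (mean):", tone_zcr.mean())
print("noise ZCR (mean):", noise_zcr.mean())
```

The tone should show a low zero-crossing rate and the noise a much higher one, which is why ZCR is often used as a rough indicator of how noisy or unvoiced a speech segment is.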

Paperid: 876, https://arxiv.org/pdf/2503.21983.pdf

Abstract:
As artificial intelligence (AI) assistants become more widely adopted in safety-critical domains, it becomes important to develop safeguards against potential failures or adversarial attacks. A key prerequisite to developing these safeguards is understanding the ability of these AI assistants to mislead human teammates. We investigate this attack problem within the context of an intellective strategy game where a team of three humans and one AI assistant collaborate to answer a series of trivia questions. Unbeknownst to the humans, the AI assistant is adversarial. Leveraging techniques from Model-Based Reinforcement Learning (MBRL), the AI assistant learns a model of the humans' trust evolution and uses that model to manipulate the group decision-making process to harm the team. We evaluate two models -- one inspired by literature and the other data-driven -- and find that both can effectively harm the human team. Moreover, we find that in this setting our data-driven model is capable of accurately predicting how human agents appraise their teammates given limited information on prior interactions. Finally, we compare the performance of state-of-the-art LLM models to human agents on our influence allocation task to evaluate whether the LLMs allocate influence similarly to humans or if they are more robust to our attack. These results enhance our understanding of decision-making dynamics in small human-AI teams and lay the foundation for defense strategies.

Paperid: 877, https://arxiv.org/pdf/2503.17482.pdf

Abstract:
How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e. the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical framework for evaluating steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in a large-scale user study of text-to-image models and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. This suggests that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: through reinforcement learning techniques, we create an alternative steering mechanism for image models that achieves more than 2x improvement on this benchmark.

Paperid: 878, https://arxiv.org/pdf/2503.16774.pdf

Abstract:
Large Language Models (LLMs) have introduced a paradigm shift in interaction with AI technology, enabling knowledge workers to complete tasks by specifying their desired outcome in natural language. LLMs have the potential to increase productivity and reduce tedious tasks in an unprecedented way. A systematic study of LLM adoption for work can provide insight into how LLMs can best support these workers. To explore knowledge workers' current and desired usage of LLMs, we ran a survey (n=216). Workers described tasks they already used LLMs for, like generating code or improving text, but imagined a future with LLMs integrated into their workflows and data. We ran a second survey (n=107) a year later that validated our initial findings and provided insight into up-to-date LLM use by knowledge workers. We discuss implications for adoption and design of generative AI technologies for knowledge work.

Paperid: 879, https://arxiv.org/pdf/2503.16505.pdf

Abstract:
Limited large-scale evaluations exist for facilitation strategies of online discussions due to significant costs associated with human involvement. An effective solution is synthetic discussion simulations using Large Language Models (LLMs) to create initial pilot experiments. We propose design principles based on existing methodologies for synthetic discussion generation. Based on these principles, we propose a simple, generalizable, LLM-driven methodology to prototype the development of LLM facilitators by generating synthetic data without human involvement, and which surpasses current baselines. We use our methodology to test whether current Social Science strategies for facilitation can improve the performance of LLM facilitators. We find that, while LLM facilitators significantly improve synthetic discussions, there is no evidence that the application of these strategies leads to further improvements in discussion quality. In an effort to aid research in the field of facilitation, we release a large, publicly available dataset containing LLM-generated and LLM-annotated discussions using multiple open-source models. This dataset can be used for LLM facilitator finetuning as well as behavioral analysis of current out-of-the-box LLMs in the task. We also release an open-source Python framework that efficiently implements our methodology at scale.

Paperid: 880, https://arxiv.org/pdf/2503.16497.pdf

Abstract:
In today's media landscape, propaganda distribution has a significant impact on society. It sows confusion, undermines democratic processes, and leads to increasingly difficult decision-making for news readers. We investigate the lasting effect that using a propaganda detection and contextualization tool has on readers' critical thinking and propaganda awareness. Building on inoculation theory, which suggests that preemptively exposing individuals to weakened forms of propaganda can improve their resilience against it, we integrate Kahneman's dual-system theory to measure the tool's impact on critical thinking. Through a two-phase online experiment, we measure the effect of several inoculation doses. Our findings show that while the tool increases critical thinking during its use, this increase vanishes without access to the tool. This indicates a single use of the tool does not create a lasting impact. We discuss the implications and propose possible approaches to improve resilience against propaganda in the long term.

Paperid: 881, https://arxiv.org/pdf/2503.16456.pdf

Abstract:
This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-i (Supportive, Adaptive, Fair, and Ethical Implementation) Guidelines for ethical and responsible deployment, and HAAS-e (Human-AI Alignment and Safety Evaluation) Framework for multidimensional, human-centered assessment. SAFE-i provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-e introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements, rather than replaces, human expertise.

Paperid: 882, https://arxiv.org/pdf/2503.16448.pdf

Abstract:
Social Media and the Internet have catalyzed an unprecedented potential for exposure to human diversity in terms of demographics, talents, opinions, knowledge, and the like. However, this potential has not come with new, much needed, instruments and skills to harness it. This paper presents our work on promoting richer and deeper social relations through the design and development of the "Internet of Us", an online platform that uses diversity-aware Artificial Intelligence to mediate and empower human social interactions. We discuss the multiple facets of diversity in social settings, the multidisciplinary work that is required to reap the benefits of diversity, and the vision for a diversity-aware hybrid human-AI society.

Paperid: 883, https://arxiv.org/pdf/2503.12637.pdf

Abstract:
Ensuring safe interactions between autonomous vehicles (AVs) and human drivers in mixed traffic systems remains a major challenge, particularly in complex, high-risk scenarios. This paper presents a cognition-decision framework that integrates individual variability and commonalities in driver behavior to quantify risk cognition and model dynamic decision-making. First, a risk sensitivity model based on a multivariate Gaussian distribution is developed to characterize individual differences in risk cognition. Then, a cognitive decision-making model based on the drift diffusion model (DDM) is introduced to capture common decision-making mechanisms in high-risk environments. The DDM dynamically adjusts decision thresholds by integrating initial bias, drift rate, and boundary parameters, adapting to variations in speed, relative distance, and risk sensitivity to reflect diverse driving styles and risk preferences. By simulating high-risk scenarios with lateral, longitudinal, and multidimensional risk sources in a driving simulator, the proposed model accurately predicts cognitive responses and decision behaviors during emergency maneuvers. Specifically, by incorporating driver-specific risk sensitivity, the model enables dynamic adjustments of key DDM parameters, allowing for personalized decision-making representations in diverse scenarios. Comparative analysis with IDM, Gipps, and MOBIL demonstrates that DDM more precisely captures human cognitive processes and adaptive decision-making in high-risk scenarios. These findings provide a theoretical basis for modeling human driving behavior and offer critical insights for enhancing AV-human interaction in real-world traffic environments.
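
For readers unfamiliar with the drift diffusion model mentioned in this abstract, the following is a minimal sketch of a generic two-boundary DDM simulated with an Euler-Maruyama step. It is the textbook form, not the paper's model: the abstract does not give the mapping from speed, relative distance, or risk sensitivity to DDM parameters, so the function name `simulate_ddm` and every parameter value below are assumptions for illustration only.

```python
import numpy as np

def simulate_ddm(drift, boundary, start_bias=0.5, noise=1.0,
                 dt=0.001, max_time=5.0, rng=None):
    """Simulate one trial of a two-boundary drift diffusion process.

    Evidence starts at start_bias * boundary and accumulates until it reaches
    the upper boundary (choice 1) or zero (choice 0). Returns (choice, time);
    choice is None if no boundary is reached within max_time.
    """
    rng = np.random.default_rng() if rng is None else rng
    evidence = start_bias * boundary
    n_steps = int(max_time / dt)
    for step in range(1, n_steps + 1):
        # Euler-Maruyama update: deterministic drift plus Gaussian noise.
        evidence += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        if evidence >= boundary:
            return 1, step * dt
        if evidence <= 0.0:
            return 0, step * dt
    return None, max_time

def upper_choice_fraction(drift, n_trials, rng):
    """Fraction of simulated trials that end at the upper boundary."""
    trials = [simulate_ddm(drift, boundary=1.0, rng=rng) for _ in range(n_trials)]
    return float(np.mean([choice == 1 for choice, _ in trials]))

# Hypothetical comparison: a strong versus a weak drift rate, unbiased start.
rng = np.random.default_rng(0)
high_drift_fraction = upper_choice_fraction(drift=2.0, n_trials=500, rng=rng)
low_drift_fraction = upper_choice_fraction(drift=0.5, n_trials=500, rng=rng)

print("P(upper) with drift 2.0:", high_drift_fraction)
print("P(upper) with drift 0.5:", low_drift_fraction)
```

A larger drift rate pushes more trials to the upper boundary and shortens decision times, while the starting bias and boundary separation trade off speed against accuracy. A personalized variant along the lines the abstract describes would presumably make these parameters depend on a driver-specific risk sensitivity.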

Paperid: 884, https://arxiv.org/pdf/2503.07901.pdf

Abstract:
The integration of collaborative robots into industrial environments has improved productivity, but has also highlighted significant challenges related to operator safety and ergonomics. This paper proposes an innovative framework that integrates advanced visual perception, continuous ergonomic monitoring, and adaptive Behaviour Tree decision-making to overcome the limitations of traditional methods that typically operate as isolated components. Our approach synthesizes deep learning models, advanced tracking algorithms, and dynamic ergonomic assessments into a modular, scalable, and adaptive system. Experimental validation demonstrates the framework's superiority over existing solutions across multiple dimensions: the visual perception module outperformed previous detection models with 72.4% mAP@50:95; the system achieved high accuracy in recognizing operator intentions (92.5%); it promptly classified ergonomic risks with minimal latency (0.57 seconds); and it dynamically managed robotic interventions with exceptionally responsive decision-making capabilities (0.07 seconds), representing a 56% improvement over benchmark systems. This comprehensive solution provides a robust platform for enhancing human-robot collaboration in industrial environments by prioritizing ergonomic safety, operational efficiency, and real-time adaptability.

Paperid: 885, https://arxiv.org/pdf/2503.07586.pdf

Abstract:
Design has the potential to cultivate hope in the face of complex societal challenges. These challenges are often addressed through efforts aimed at harm reduction and prevention -- essential but sometimes limiting approaches that can unintentionally narrow our collective sense of what is possible. This one-day, in-person workshop builds on the first Positech Workshop at CSCW 2024 by offering practical ways to move beyond reactive problem-solving toward building capacity for proactive goal setting and generating pathways forward. We explore how collaborative and reflective design methodologies can help research communities navigate uncertainty, expand possibilities, and foster meaningful change. By connecting design thinking with hope theory, which frames hope as the interplay of "goal-directed," "pathways," and "agentic" thinking, we will examine how researchers might chart new directions in the face of complexity and constraint. Through hands-on activities including problem reframing, building a shared taxonomy of design methods that align with hope theory, and reflecting on what it means to sustain hopeful research trajectories, participants will develop strategies to embed a deliberately hopeful approach into their research.

Paperid: 886, https://arxiv.org/pdf/2503.04984.pdf

Abstract:
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder that affects how children communicate and relate to other people and the world around them. Emerging studies have shown that neurofeedback training (NFT) games are an effective and playful intervention to enhance social and attentional capabilities for autistic children. However, NFT is primarily available in a clinical setting that is hard to scale. Also, the intervention demands deliberately-designed gamified feedback with fun and enjoyment, where little knowledge has been acquired in the HCI community. Through a ten-month iterative design process with four domain experts, we developed Eggly, a mobile NFT game based on a consumer-grade EEG headband and a tablet. Eggly uses novel augmented reality (AR) techniques to offer engagement and personalization, enhancing their training experience. We conducted two field studies (a single-session study and a three-week multi-session study) with a total of five autistic children to assess Eggly in practice at a special education center. Both quantitative and qualitative results indicate the effectiveness of the approach as well as contribute to the design knowledge of creating mobile AR NFT games.

Paperid: 887, https://arxiv.org/pdf/2503.00835.pdf

Abstract:
Quantum computing has shown great potential to revolutionize traditional computing and can provide an exponential speedup for a wide range of possible applications, attracting various stakeholders. However, understanding fundamental quantum computing concepts remains a significant challenge for novices because of their abstract and counterintuitive nature. Thus, we propose an analogy-based characterization framework to construct the mental mapping between quantum computing concepts and daily objects, informed by in-depth expert interviews and a literature review, covering key quantum concepts and characteristics like number of qubits, output state duality, quantum concept type, and probability quantification. Then, we developed an AR-based prototype system, Intuit, using situated analytics to explain quantum concepts through daily objects and phenomena (e.g., rotating coins, paper cutters). We thoroughly evaluated our approach through in-depth user and expert interviews. The results demonstrate the effectiveness and usability of Intuit in helping learners understand abstract concepts in an intuitive and engaging manner.

Paperid: 888, https://arxiv.org/pdf/2503.00825.pdf

Abstract:
The explosive growth of Virtual YouTubers (VTubers), streamers who perform behind virtual anime avatars, has created a unique digital economy with profound implications for content creators, platforms, and viewers. Understanding the economic landscape of VTubers is crucial for designing equitable platforms, supporting content creator livelihoods, and fostering sustainable digital communities. To this end, we conducted a large-scale study of over 1 million hours of publicly available streaming records from 1,923 VTubers on YouTube, covering tens of millions of dollars in actual profits. Our analysis reveals stark inequality within the VTuber community and characterizes the sources of income for VTubers from multiple perspectives. We also found that the VTuber community is increasingly monopolized by two agencies, driving the financial disparity. This research illuminates the financial dynamics of VTuber communities, informing the design of equitable platforms and sustainable support systems for digital content creators.

Paperid: 889, https://arxiv.org/pdf/2502.20491.pdf

Abstract:
Platforms are increasingly relying on algorithms to curate the content within users' social media feeds. However, the growing prominence of proprietary, algorithmically curated feeds has concealed what factors influence the presentation of content on social media feeds and how that presentation affects user behavior. This lack of transparency can be detrimental to users, from reducing users' agency over their content consumption to the propagation of misinformation and toxic content. To uncover details about how these feeds operate and influence user behavior, we conduct an empirical audit of Reddit's algorithmically curated trending feed called r/popular. Using 10K r/popular posts collected by taking snapshots of the feed over 11 months, we find that recent comments help a post remain on r/popular longer and climb the feed. We also find that posts below rank 80 correspond to a sharp decline in activity compared to posts above. When examining the effects of having a higher proportion of undesired behavior -- i.e., moderator-removed and toxic comments -- we find no significant evidence that it helps posts stay on r/popular for longer. Although posts closer to the top receive more undesired comments, we find this increase to coincide with a broader increase in overall engagement -- rather than indicating a disproportionate effect on undesired activity. The relationships between algorithmic rank and engagement highlight the extent to which algorithms employed by social media platforms essentially determine which content is prioritized and which is not. We conclude by discussing how content creators, consumers, and moderators on social media platforms can benefit from empirical audits aimed at improving transparency in algorithmically curated feeds.

Paperid: 890, https://arxiv.org/pdf/2502.19135.pdf

Abstract:
This paper presents a novel framework, called PLANTOR (PLanning with Natural language for Task-Oriented Robots), that integrates Large Language Models (LLMs) with Prolog-based knowledge management and planning for multi-robot tasks. The system employs a two-phase generation of a robot-oriented knowledge base, ensuring reusability and compositional reasoning, as well as a three-step planning procedure that handles temporal dependencies, resource constraints, and parallel task execution via mixed-integer linear programming. The final plan is converted into a Behaviour Tree for direct use in ROS2. We tested the framework in multi-robot assembly tasks within a block world and an arch-building scenario. Results demonstrate that LLMs can produce accurate knowledge bases with modest human feedback, while Prolog guarantees formal correctness and explainability. This approach underscores the potential of LLM integration for advanced robotics tasks requiring flexible, scalable, and human-understandable planning.

Paperid: 891, https://arxiv.org/pdf/2502.17834.pdf

Abstract:
This work explores the effect of object weight on human motion and grip release during handovers to enhance the naturalness, safety, and efficiency of robot-human interactions. We introduce adaptive robotic strategies based on the analysis of human handover behavior with varying object weights. The key contributions of this work include the development of an adaptive grip-release strategy for robots, a detailed analysis of how object weight influences human motion to guide robotic motion adaptations, and the creation of handover datasets incorporating various object weights, including the YCB handover dataset. By aligning robotic grip release and motion with human behavior, this work aims to improve robot-human handovers for objects of different weights. We also evaluate these human-inspired adaptive robotic strategies in robot-to-human handovers to assess their effectiveness and performance and demonstrate that they outperform the baseline approaches in terms of naturalness, efficiency, and user perception.

Paperid: 892, https://arxiv.org/pdf/2502.14820.pdf

Abstract:
Large Language Models (LLMs) have demonstrated exceptional versatility across diverse domains, yet their application in e-commerce remains underexplored due to a lack of domain-specific datasets. To address this gap, we introduce eC-Tab2Text, a novel dataset designed to capture the intricacies of e-commerce, including detailed product attributes and user-specific queries. Leveraging eC-Tab2Text, we focus on text generation from product tables, enabling LLMs to produce high-quality, attribute-specific product reviews from structured tabular data. Fine-tuned models were rigorously evaluated using standard Table2Text metrics, alongside correctness, faithfulness, and fluency assessments. Our results demonstrate substantial improvements in generating contextually accurate reviews, highlighting the transformative potential of tailored datasets and fine-tuning methodologies in optimizing e-commerce workflows. This work highlights the potential of LLMs in e-commerce workflows and the essential role of domain-specific datasets in tailoring them to industry-specific challenges.

Paperid: 893, https://arxiv.org/pdf/2502.13281.pdf

Abstract:
While organizations continue to invest in AI tools like M365 Copilot, little is known about how individual employees engage with these technologies once deployed. This study examines M365 Copilot adoption behaviors among a group of 10 experienced users across many industries in the United States. Findings reveal a strong preference for informal learning methods over structured training. Even though 9 out of 10 participants acknowledged that formal training for Copilot tools would be useful, 7 out of 10 stated that they ignored the Copilot onboarding videos provided to them, citing reasons such as time constraints, preference for self-guided learning, or reliance on external resources like ChatGPT. No participants used formal training as their primary learning method. Instead, experiential learning (trial and error, 8 participants) and social learning (peer discussions, 6 participants) emerged as dominant learning strategies. We discuss opportunities for promoting social learning of AI tools in the workplace.

Paperid: 894, https://arxiv.org/pdf/2502.12447.pdf

Abstract:
The rapid adoption of Generative AI (GenAI) is significantly reshaping human cognition, influencing how we engage with information, think, reason, and learn. This paper synthesizes existing literature on GenAI's effects on different aspects of human cognition. Drawing on Krathwohl's revised Bloom's Taxonomy and Dewey's conceptualization of reflective thought, we examine the mechanisms through which GenAI is affecting the development of different cognitive abilities. We focus on novices, such as students, who may lack both domain knowledge and an understanding of effective human-AI interaction. Accordingly, we provide implications for rethinking and designing educational experiences that foster critical thinking and deeper cognitive engagement.

Paperid: 895, https://arxiv.org/pdf/2502.12443.pdf

Abstract:
Art therapy homework is essential for fostering clients' reflection on daily experiences between sessions. However, current practices present challenges: clients often lack guidance for completing tasks that combine art-making and verbal expression, while therapists find it difficult to track and tailor homework. How HCI systems might support art therapy homework remains underexplored. To address this, we present TherAIssist, comprising a client-facing application leveraging human-AI co-creative art-making and conversational agents to facilitate homework, and a therapist-facing application enabling customization of homework agents and AI-compiled homework history. A 30-day field study with 24 clients and 5 therapists showed how TherAIssist supported clients' homework and reflection in their everyday settings. Results also revealed how therapists infused their practice principles and personal touch into the agents to offer tailored homework, and how AI-compiled homework history became a meaningful resource for in-session interactions. Implications for designing human-AI systems to facilitate asynchronous client-practitioner collaboration are discussed.

Paperid: 896, https://arxiv.org/pdf/2502.11463.pdf

Abstract:
Online meetings have become an integral part of daily life, but prolonged screen time poses significant health risks. While various interventions address sedentary lifestyles, few focus on mitigating sedentary behavior during online meetings. Design opportunities in this context remain underexplored. This study investigates the design of gamified bodily interactions as anti-sedentary measures during online meetings using a research through design approach. In collaboration with 11 users, we co-designed and iterated three prototypes, resulting in the BIG-AOME (Bodily Interaction Gamification towards Anti-sedentary Online Meeting Environments) framework. User studies with 15 participants across three groups evaluated these prototypes through semi-structured interviews analyzed using Hsieh's qualitative content analysis. Findings show that gamified bodily interactions encourage physical movement while reducing awkwardness during online meetings. Participants valued the social engagement fostered by cooperative and competitive elements in these games, enhancing social dynamics while encouraging physical movement. Such games can also serve as online icebreakers or playful decision-making tools. This study offers a comprehensive analysis of design dimensions within the BIG-AOME framework, including body engagement, attention, bodily interplay, timeliness, and virtual/physical environments, highlighting the potential of anti-sedentary bodily interactions to mitigate sedentary behavior and enhance social connections in online meetings.

Paperid: 897, https://arxiv.org/pdf/2502.10701.pdf

Abstract:
This paper characterizes the self-disclosure behavior of Reddit users across 11 different types of self-disclosure. We find that at least half of the users share some type of disclosure in at least 10% of their posts, with half of these posts having more than one type of disclosure. We show that different types of self-disclosure are likely to receive varying levels of engagement. For instance, a Sexual Orientation disclosure garners more comments than other self-disclosures. We also explore confounding factors that affect future self-disclosure. We show that users who receive interactions from (self-disclosure) specific subreddit members are more likely to disclose in the future. We also show that privacy risks due to self-disclosure extend beyond Reddit users themselves to include their close contacts, such as family and friends, as their information is also revealed. We develop a browser plugin for end-users to flag self-disclosure in their content.

Paperid: 898, https://arxiv.org/pdf/2502.10290.pdf

Abstract:
Studies of human cognition often rely on brief, controlled tasks emphasizing group-level effects but poorly capturing individual variability. A suite of minigames on the novel PixelDOPA platform was designed to overcome these limitations by embedding classic cognitive tasks in a 3D virtual environment with continuous behavior logging. Four minigames explore constructs overlapping NIH Toolbox tasks: processing speed, rule shifting, inhibitory control, and working memory. In a clinical sample of 60 participants outside a controlled lab setting, large correlations (r=0.42-0.93) were found between PixelDOPA tasks and NIH Toolbox counterparts, despite differences in stimuli and task structures. Process-informed metrics (e.g., gaze-based response times) improved task convergence and data quality. Test-retest analyses showed high reliability (ICC=0.52-0.83) for all minigames. Beyond endpoint metrics, movement and gaze trajectories revealed stable, idiosyncratic gameplay strategy profiles, with unsupervised clustering differentiating participants by navigational and viewing behaviors. These trajectory-based features showed lower within-person variability than between-person variability, facilitating participant identification across sessions. Game-based tasks can therefore retain psychometric rigor of standard cognitive assessments while providing insights into dynamic individual-specific behaviors. By using an engaging, customizable game engine, comprehensive behavioral tracking can boost power to detect individual differences without sacrificing group-level inference. This possibility reveals a path toward cognitive measures that are both robust and ecologically valid, even in less-than-ideal data collection settings.

Paperid: 899, https://arxiv.org/pdf/2502.09866.pdf

Abstract:
As blind and low-vision (BLV) players engage more deeply with games, accessibility features have become essential. While some research has explored tools and strategies to enhance game accessibility, the specific experiences of these players with mobile games remain underexamined. This study addresses this gap by investigating how BLV users experience mobile games with varying accessibility levels. Through interviews with 32 experienced BLV mobile players, we explore their perceptions, challenges, and strategies for engaging with mobile games. Our findings reveal that BLV players turn to mobile games to alleviate boredom, achieve a sense of accomplishment, and build social connections, but face barriers depending on the game's accessibility level. We also compare mobile games to other forms of gaming, highlighting the relative advantages of mobile games, such as the inherent accessibility of smartphones. This study contributes to understanding BLV mobile gaming experiences and provides insights for enhancing accessible mobile game design.

Paperid: 900, https://arxiv.org/pdf/2502.08744.pdf

Abstract:
Music evokes profound emotions, yet the universality of emotional descriptors across languages remains debated. A key challenge in cross-cultural research on music emotion is biased stimulus selection and manual curation of taxonomies, predominantly relying on Western music and languages. To address this, we propose a balanced experimental design with nine online experiments in Brazil, the US, and South Korea, involving N=672 participants. First, we sample a balanced set of popular music from these countries. Using an open-ended tagging pipeline, we then gather emotion terms to create culture-specific taxonomies. Finally, using these bottom-up taxonomies, participants rate emotions of each song. This allows us to map emotional similarities within and across cultures. Results show consistency in high arousal, high valence emotions but greater variability in others. Notably, machine translations were often inadequate to capture music-specific meanings. These findings together highlight the need for a domain-sensitive, open-ended, bottom-up emotion elicitation approach to reduce cultural biases in emotion research.

Paperid: 901, https://arxiv.org/pdf/2502.08076.pdf

Abstract:
Animating objects' movements is widely used to facilitate tracking changes and observing both the global trend and local hotspots where objects converge or diverge. Existing methods, however, often obscure critical local hotspots by only considering the start and end positions of objects' trajectories. To address this gap, we propose RouteFlow, a trajectory-aware animated transition method that effectively balances the global trend and local hotspots while minimizing occlusion. RouteFlow is inspired by a real-world bus route analogy: objects are regarded as passengers traveling together, with local hotspots representing bus stops where these passengers get on and off. Based on this analogy, animation paths are generated like bus routes, with the object layout generated similarly to seat allocation according to their destinations. Compared with state-of-the-art methods, RouteFlow better facilitates identifying the global trend and locating local hotspots while performing comparably in tracking objects' movements.

Paperid: 902, https://arxiv.org/pdf/2502.07628.pdf

Abstract:
Chinese paper-cutting, an Intangible Cultural Heritage (ICH), faces challenges from the erosion of traditional culture due to the prevalence of realism alongside limited public access to cultural elements. While generative AI can enhance paper-cutting design with its extensive knowledge base and efficient production capabilities, it often struggles to align content with cultural meaning due to users' and models' lack of comprehensive paper-cutting knowledge. To address these issues, we conducted a formative study (N=7) to identify the workflow and design space, including four core factors (Function, Subject Matter, Style, and Method of Expression) and a key element (Pattern). We then developed HarmonyCut, a generative AI-based tool that translates abstract intentions into creative and structured ideas. This tool facilitates the exploration of suggested related content (knowledge, works, and patterns), enabling users to select, combine, and adjust elements for creative paper-cutting design. A user study (N=16) and an expert evaluation (N=3) demonstrated that HarmonyCut effectively provided relevant knowledge, aiding the ideation of diverse paper-cutting designs and maintaining design quality within the design space to ensure alignment between form and cultural connotation.

Paperid: 903, https://arxiv.org/pdf/2502.07088.pdf

Abstract:
Large Language Models (LLMs) show emergent patterns that mimic human cognition. We explore whether they also mirror other, less deliberative human psychological processes. Drawing upon classical theories of cognitive consistency, two preregistered studies tested whether GPT-4o changed its attitudes toward Vladimir Putin in the direction of a positive or negative essay it wrote about the Russian leader. Indeed, GPT displayed patterns of attitude change mimicking cognitive consistency effects in humans. Even more remarkably, the degree of change increased sharply when the LLM was offered an illusion of choice about which essay (positive or negative) to write. This result suggests that GPT-4o manifests a functional analog of humanlike selfhood, although how faithfully the chatbot's behavior reflects the mechanisms of human attitude change remains to be understood.

Paperid: 904, https://arxiv.org/pdf/2502.06696.pdf

Abstract:
We investigate youth visions for ideal remote social interactions, drawing on co-design interviews with 23 participants (aged 15-24) experienced with 3D gaming environments. Using a Fictional Inquiry (FI) method set in the Harry Potter universe, this research reveals that young people desire social media that functions more like immersive, navigable shared social spaces. Across these interviews, participants identified six key priorities for meaningful social connection over social media: intuitive social navigation, shared collaborative experiences, communal environments fostering close relationships, flexible self-presentation, intentional engagement, and playful social mechanics. We introduce the spatial integrity framework, a set of four interrelated design principles: spatial presence, spatial composition, spatial configuration, and spatial depth. Together, these principles outline how online spaces can be designed to feel more like meaningful environments, spaces where relationships can grow through shared presence, movement, and intentional interaction. Participants also described the FI process itself as meaningful, not only for generating new ideas but for empowering them to imagine and shape the future of social media.

Paperid: 905, https://arxiv.org/pdf/2502.06439.pdf

Abstract:
Context. As software systems become more integrated into society's infrastructure, the responsibility of software professionals to ensure compliance with various non-functional requirements increases. These requirements include security, safety, privacy, and, increasingly, non-discrimination. Motivation. Fairness in pricing algorithms grants equitable access to basic services without discriminating on the basis of protected attributes. Method. We replicate a previous empirical study that used black box testing to audit pricing algorithms used by Italian car insurance companies, accessible through a popular online system. With respect to the previous study, we enlarged the number of tests and the number of demographic variables under analysis. Results. Our work confirms and extends previous findings, highlighting the problematic permanence of discrimination across time: demographic variables significantly impact pricing to this day, with birthplace remaining the main discriminatory factor against individuals not born in Italian cities. We also found that driver profiles can determine the number of quotes available to the user, denying equal opportunities to all. Conclusion. The study underscores the importance of testing for non-discrimination in software systems that affect people's everyday lives. Performing algorithmic audits over time makes it possible to evaluate the evolution of such algorithms. It also demonstrates the role that empirical software engineering can play in making software systems more accountable.

Paperid: 906, https://arxiv.org/pdf/2502.04029.pdf

Abstract:
Autistic students often face challenges in social interaction, which can hinder their educational and personal development. This study introduces Echo-Teddy, a Large Language Model (LLM)-based social robot designed to support autistic students in developing social and communication skills. Unlike previous chatbot-based solutions, Echo-Teddy leverages advanced LLM capabilities to provide more natural and adaptive interactions. The research addresses two key questions: (1) What are the design principles and initial prototype characteristics of an effective LLM-based social robot for autistic students? (2) What improvements can be made based on developer reflection-on-action and expert interviews? The study employed a mixed-methods approach, combining prototype development with qualitative analysis of developer reflections and expert interviews. Key design principles identified include customizability, ethical considerations, and age-appropriate interactions. The initial prototype, built on a Raspberry Pi platform, features custom speech components and basic motor functions. Evaluation of the prototype revealed potential improvements in areas such as user interface, educational value, and practical implementation in educational settings. This research contributes to the growing field of AI-assisted special education by demonstrating the potential of LLM-based social robots in supporting autistic students. The findings provide valuable insights for future developments in accessible and effective social support tools for special education.

Paperid: 907, https://arxiv.org/pdf/2501.17475.pdf

Abstract:
The Brain-Computer Interface (BCI) enables direct brain-to-device communication, with the Steady-State Visual Evoked Potential (SSVEP) paradigm favored for its stability and high accuracy across various fields. In SSVEP BCI systems, supervised learning models significantly enhance performance over unsupervised models, achieving higher accuracy in less time. However, prolonged data collection can cause user fatigue and even trigger photosensitive epilepsy, creating a negative user experience. Thus, reducing calibration time is crucial. To address this, Cross-Stimulus transfer learning (CSTL) can shorten calibration by utilizing only partial frequencies. Traditional CSTL methods, affected by time-domain impulse response variations, are suitable only for adjacent frequency transfers, limiting their general applicability. We introduce an Empirical Mode Decomposition (EMD) Based Fuzzy Model (EMD-Fuzzy), which employs EMD to extract crucial frequency information and achieves stimulus transfer in the frequency domain through Fast Fourier Transform (FFT) to mitigate time-domain differences. Combined with a Fuzzy Decoder that uses fuzzy logic for representation learning, our approach delivers promising preliminary results in offline tests and state-of-the-art performance. With only 4 frequencies, our method achieved an accuracy of 82.75% (16.30%) and an information transfer rate (ITR) of 186.56 (52.09) bits/min on the 40-target Benchmark dataset. In online tests, our method demonstrates robust efficacy, achieving an average accuracy of 86.30% (6.18%) across 7 subjects. This performance underscores the effectiveness of integrating EMD and fuzzy logic into EEG decoding for CSTL and highlights our method's potential in real-time applications where consistent and reliable decoding is crucial.
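For readers unfamiliar with the ITR figure quoted above, the following is a minimal sketch of the Wolpaw formula commonly used to report ITR in BCI studies. The abstract does not state which ITR definition or selection time the authors used, so the function name `itr_bits_per_min` and its parameters are illustrative assumptions, not the paper's code.

```python
import math


def itr_bits_per_min(n_targets: int, accuracy: float, trial_seconds: float) -> float:
    """Wolpaw information transfer rate in bits/min (illustrative helper).

    n_targets:     number of selectable targets N (N >= 2)
    accuracy:      classification accuracy P as a fraction in [0, 1]
    trial_seconds: time per selection T in seconds (assumed to include
                   any gaze-shift interval; the abstract does not give it)
    """
    if n_targets < 2:
        raise ValueError("n_targets must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if trial_seconds <= 0:
        raise ValueError("trial_seconds must be positive")

    p = accuracy
    # Bits per selection: log2(N) + P*log2(P) + (1-P)*log2((1-P)/(N-1)).
    bits = math.log2(n_targets)
    if p > 0.0:
        bits += p * math.log2(p)  # this term tends to 0 as P -> 0
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_targets - 1))
    # Convert bits per selection into bits per minute.
    return bits * 60.0 / trial_seconds
```

Under this formula, a 40-target speller at perfect accuracy carries log2(40), about 5.32 bits per selection, and accuracy at chance level (P = 1/N) yields zero bits.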

Paperid: 908, https://arxiv.org/pdf/2501.16627.pdf

Abstract:
As reliance on AI systems for decision-making grows, it becomes critical to ensure that human users can appropriately balance trust in AI suggestions with their own judgment, especially in high-stakes domains like healthcare. However, human + AI teams have been shown to perform worse than AI alone, with evidence indicating automation bias as the reason for poorer performance, particularly because humans tend to follow AI's recommendations even when they are incorrect. In many existing human + AI systems, decision-making support is typically provided in the form of text explanations (XAI) to help users understand the AI's reasoning. Since human decision-making often relies on System 1 thinking, users may ignore or insufficiently engage with the explanations, leading to poor decision-making. Previous research suggests that there is a need for new approaches that encourage users to engage with the explanations and one proposed method is the use of cognitive forcing functions (CFFs). In this work, we examine how various decision-support mechanisms impact user engagement, trust, and human-AI collaborative task performance in a diabetes management decision-making scenario. In a controlled experiment with 108 participants, we evaluated the effects of six decision-support mechanisms split into two categories of explanations (text, visual) and four CFFs. Our findings reveal that mechanisms like AI confidence levels, text explanations, and performance visualizations enhanced human-AI collaborative task performance, and improved trust when AI reasoning clues were provided. Mechanisms like human feedback and AI-driven questions encouraged deeper reflection but often reduced task performance by increasing cognitive effort, which in turn affected trust. Simple mechanisms like visual explanations had little effect on trust, highlighting the importance of striking a balance in CFF and XAI design.

Paperid: 909, https://arxiv.org/pdf/2501.15273.pdf

Abstract:
We present a comprehensive pipeline, augmented by a visual analytics system named "GapMiner", that is aimed at exploring and exploiting untapped opportunities within the empty areas of high-dimensional datasets. Our approach begins with an initial dataset and then uses a novel Empty Space Search Algorithm (ESA) to identify the center points of these uncharted voids, which are regarded as reservoirs containing potentially valuable novel configurations. Initially, this process is guided by user interactions facilitated by GapMiner. GapMiner visualizes the Empty Space Configurations (ESC) identified by the search within the context of the data, enabling domain experts to explore and adjust ESCs using a linked parallel-coordinate display. These interactions enhance the dataset and contribute to the iterative training of a connected deep neural network (DNN). As the DNN trains, it gradually assumes the task of identifying high-potential ESCs, diminishing the need for direct user involvement. Ultimately, once the DNN achieves adequate accuracy, it autonomously guides the exploration of optimal configurations by predicting performance and refining configurations, using a combination of gradient ascent and improved empty-space searches. Domain users were actively engaged throughout the development of our system. Our findings demonstrate that our methodology consistently produces substantially superior novel configurations compared to conventional randomization-based methods. We illustrate the effectiveness of our method through several case studies addressing various objectives, including parameter optimization, adversarial learning, and reinforcement learning.

Paperid: 910, https://arxiv.org/pdf/2501.12374.pdf

Abstract:
Novel capacities of generative AI to analyze and generate cultural artifacts raise inevitable questions about the nature and value of artistic education and human expertise. Has AI already leveled the playing field between professional artists and laypeople, or do trained artistic expressive capacity, curation skills and experience instead enhance the ability to use these new tools? In this pre-registered study, we conduct experimental comparisons between 50 active artists and a demographically matched sample of laypeople. We designed two tasks to approximate artistic practice for testing their capabilities in both faithful and creative image creation: replicating a reference image, and moving as far away as possible from it. We developed a bespoke platform where participants used a modern text-to-image model to complete both tasks. We also collected and compared participants' sentiments towards AI. On average, artists produced more faithful and creative outputs than their lay counterparts, although only by a small margin. While AI may ease content creation, professional expertise is still valuable - even within the confined space of generative AI itself. Finally, we also explored how well an exemplary vision-capable large language model (GPT-4o) would complete the same tasks, if given the role of an image generation agent, and found it performed on par in copying but outperformed even artists in the creative task. The very best results were still produced by humans in both tasks. These outcomes highlight the importance of integrating artistic skills with AI training to prepare artists and other visual professionals for a technologically evolving landscape. We see a potential in collaborative synergy with generative AI, which could reshape creative industries and education in the arts.

Paperid: 911, https://arxiv.org/pdf/2501.09951.pdf

Abstract:
In light of the diminishing presence of physical third places -- informal gathering spaces essential for social connection -- this study explores how the social media platform Discord fosters third-place experiences. Drawing on Oldenburg's conceptual framework, we analyze how Discord's design elements support the creation of virtual third places that foster both dyadic and community-based relationships. Through 25 semi-structured interviews with active Discord users, we identified 21 design elements aligned with Oldenburg's third-place characteristics. These elements cluster around four core principles: providing themed spaces for repeated interactions, supporting user autonomy and customization, facilitating mutually engaging activities, and enabling casual, low-pressure interactions. This work contributes to understanding how intentional platform design can cultivate virtual spaces that support meaningful social connections. The findings have implications for designing future social technologies that can help address growing concerns about social isolation in an increasingly digital world.

Paperid: 912, https://arxiv.org/pdf/2501.09302.pdf

Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as an innovative and efficient 3D representation technique. While its potential for extended reality (XR) applications is frequently highlighted, its practical effectiveness remains underexplored. In this work, we examine three distinct 3DGS-based approaches for virtual environment (VE) creation, leveraging their unique strengths for efficient and visually compelling scene representation. By conducting a comparative study, we evaluate the feasibility of 3DGS in creating immersive VEs, identify its limitations in XR applications, and discuss future research and development opportunities.

Paperid: 913, https://arxiv.org/pdf/2501.08864.pdf

Abstract:
Generative AI (GenAI) has introduced myriad opportunities and challenges for higher education. Anticipating this potential transformation requires understanding students' contextualised practices and norms around GenAI. We conducted semi-structured interviews with 26 students and 11 educators from diverse departments across two universities. Grounded in Strong Structuration Theory, we find diversity in students' uses and motivations for GenAI. Occurring in the context of unclear university guidelines, institutional fixation on plagiarism, and inconsistent educator communication, students' practices are informed by unspoken rules around appropriate use, GenAI limitations and reliance strategies, and consideration of agency and skills. Perceived impacts include changes in confidence, and concerns about skill development, relationships with educators, and plagiarism. Both groups envision changes in universities' attitude to GenAI, responsible use training, assessments, and integration of GenAI into education. We discuss socio-technical implications in terms of current and anticipated changes in the external and internal structures that contextualise students' GenAI use.

Paperid: 914, https://arxiv.org/pdf/2501.04410.pdf

Abstract:
User simulation is an emerging interdisciplinary topic with multiple critical applications in the era of Generative AI. It involves creating an intelligent agent that mimics the actions of a human user interacting with an AI system, enabling researchers to model and analyze user behaviour, generate synthetic data for training, and evaluate interactive AI systems in a controlled and reproducible manner. User simulation has profound implications for diverse fields and plays a vital role in the pursuit of Artificial General Intelligence. This paper provides an overview of user simulation, highlighting its key applications and connections to various disciplines, and outlining future research directions to advance this increasingly important technology.

Paperid: 915, https://arxiv.org/pdf/2501.03420.pdf

Abstract:
People feel attached to places that are meaningful to them, which psychological research calls "place attachment." Place attachment is associated with self-identity, self-continuity, and psychological well-being. Even small cues, including videos, images, sounds, and scents, can facilitate feelings of connection and belonging to a place. Telepresence robots that allow people to see, hear, and interact with a remote place have the potential to establish and maintain a connection with places and support place attachment. In this paper, we explore the design space of robotic telepresence to promote place attachment, including how users might be guided in a remote place and whether they experience the environment individually or with others. We prototyped a telepresence robot that allows one or more remote users to visit a place and be guided by a local human guide or a conversational agent. Participants were 38 university alumni who visited their alma mater via the telepresence robot. Our findings uncovered four distinct user personas in the remote experience and highlighted the need for social participation to enhance place attachment. We generated design implications for future telepresence robot design to support people's connections with places of personal significance.

Paperid: 916, https://arxiv.org/pdf/2501.01441.pdf

Abstract:
Representation bias is one of the most common types of biases in artificial intelligence (AI) systems, causing AI models to perform poorly on underrepresented data segments. Although AI practitioners use various methods to reduce representation bias, their effectiveness is often constrained by insufficient domain knowledge in the debiasing process. To address this gap, this paper introduces a set of generic design guidelines for effectively involving domain experts in representation debiasing. We instantiated our proposed guidelines in a healthcare-focused application and evaluated them through a comprehensive mixed-methods user study with 35 healthcare experts. Our findings show that involving domain experts can reduce representation bias without compromising model accuracy. Based on our findings, we also offer recommendations for developers to build robust debiasing systems guided by our generic design guidelines, ensuring more effective inclusion of domain experts in the debiasing process.

Paperid: 917, https://arxiv.org/pdf/2506.23443.pdf

Abstract:
Our work aims to develop new assistive technologies that enable blind or low vision (BLV) people to explore and analyze data readily. At present, barriers exist for BLV people to explore and analyze data, restricting access to government, health and personal data, and limiting employment opportunities. This work explores the co-design and development of an innovative system to support data access, with a focus on the use of refreshable tactile displays (RTDs) and conversational agents. The envisaged system will use a combination of tactile graphics and speech to communicate with BLV users, and proactively assist with data analysis tasks. As well as addressing significant equity gaps, our work expects to produce innovations in assistive technology, multimodal interfaces, dialogue systems, and natural language understanding and generation.

Paperid: 918, https://arxiv.org/pdf/2506.21819.pdf

Abstract:
Scientific publications, primarily digitized as PDFs, remain static and unstructured, limiting the accessibility and reusability of the contained knowledge. At best, scientific knowledge from publications is provided in tabular formats, which lack semantic context. A more flexible, structured, and semantic representation is needed to make scientific knowledge understandable and processable by both humans and machines. We propose an evolution model of knowledge representation, inspired by the 5-star Linked Open Data (LOD) model, with five stages and defined criteria to guide the stepwise transition from a digital artifact, such as a PDF, to a semantic representation integrated in a knowledge graph (KG). Based on an exemplary workflow implementing the entire model, we developed a hybrid approach, called SciMantify, leveraging tabular formats of scientific knowledge, e.g., results from secondary studies, to support its evolving semantification. In the approach, humans and machines collaborate closely by performing semantic annotation tasks (SATs) and refining the results to progressively improve the semantic representation of scientific knowledge. We implemented the approach in the Open Research Knowledge Graph (ORKG), an established platform for improving the findability, accessibility, interoperability, and reusability of scientific knowledge. A preliminary user experiment showed that the approach simplifies the preprocessing of scientific knowledge, reduces the effort for the evolving semantification, and enhances the knowledge representation through better alignment with the KG structures.

Paperid: 919, https://arxiv.org/pdf/2506.20748.pdf

Abstract:
Chatbots are increasingly integrated into people's lives and are widely used to help people. Recently, there has also been growing interest in the reverse direction -- humans helping chatbots -- due to a wide range of benefits including better chatbot performance, human well-being, and collaborative outcomes. However, little research has explored the factors that motivate people to help chatbots. To address this gap, we draw on the Computers Are Social Actors (CASA) framework to examine how chatbot anthropomorphism -- including human-like identity, emotional expression, and non-verbal expression -- influences human empathy toward chatbots and their subsequent prosocial behaviors and intentions. We also explore people's own interpretations of their prosocial behaviors toward chatbots. We conducted an online experiment (N = 244) in which chatbots made mistakes in a collaborative image labeling task and explained the reasons to participants. We then measured participants' prosocial behaviors and intentions toward the chatbots. Our findings revealed that human identity and emotional expression of chatbots increased participants' prosocial behavior and intention toward chatbots, with empathy mediating these effects. Qualitative analysis further identified two motivations for participants' prosocial behaviors: empathy for the chatbot and perceiving the chatbot as human-like. We discuss the implications of these results for understanding and promoting human prosocial behaviors toward chatbots.

Paperid: 920, https://arxiv.org/pdf/2506.20212.pdf

Abstract:
With the advent of Industry 5.0, manufacturers are increasingly prioritizing worker well-being alongside mass customization. Stress-aware Human-Robot Collaboration (HRC) plays a crucial role in this paradigm, where robots must adapt their behavior to human mental states to improve collaboration fluency and safety. This paper presents a novel framework that integrates Federated Learning (FL) to enable personalized mental state evaluation while preserving user privacy. By leveraging physiological signals, including EEG, ECG, EDA, EMG, and respiration, a multimodal model predicts an operator's stress level, facilitating real-time robot adaptation. The FL-based approach allows distributed on-device training, ensuring data confidentiality while improving model generalization and individual customization. Results demonstrate that the FL approach yields a global model whose stress prediction accuracy is comparable to that of a centralized training approach. Moreover, FL enhances personalization, thereby optimizing human-robot interaction in industrial settings while preserving data privacy. The proposed framework advances privacy-preserving, adaptive robotics to enhance workforce well-being in smart manufacturing.
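
As a rough illustration of the server-side step in federated learning, the sketch below aggregates per-client model parameters with a sample-size-weighted average (FedAvg-style). The abstract does not state which aggregation rule the authors use, so FedAvg is an assumption here, and the function name `federated_average` and the flat-list parameter layout are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Return the sample-size-weighted mean of client parameter vectors.

    Assumption: each client sends a flat list of floats of equal length,
    plus the number of local training samples it used. Raw physiological
    data never leaves the client; only parameters are shared.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, n_samples in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must send vectors of equal length")
        # Each client's contribution is proportional to its data share.
        for i, w in enumerate(weights):
            global_weights[i] += w * n_samples / total
    return global_weights


if __name__ == "__main__":
    # Two hypothetical operators' devices with 1 and 3 local samples.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a full system this aggregation would alternate with local on-device training rounds; personalization (for example, fine-tuning the global model per operator) would be a further step not shown here.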

Paperid: 921, https://arxiv.org/pdf/2506.19268.pdf

Abstract:
We present Health App Reviews for Privacy & Trust (HARPT), a large-scale annotated corpus of user reviews from Electronic Health (eHealth) applications (apps) aimed at advancing research in user privacy and trust. The dataset comprises 480K user reviews labeled in seven categories that capture critical aspects of trust in applications (TA), trust in providers (TP), and privacy concerns (PC). Our multistage strategy integrated keyword-based filtering, iterative manual labeling with review, targeted data augmentation, and weak supervision using transformer-based classifiers. In parallel, we manually annotated a curated subset of 7,000 reviews to support the development and evaluation of machine learning models. We benchmarked a broad range of models, providing a baseline for future work. HARPT is released under an open resource license to support reproducible research in usable privacy and trust in digital libraries and health informatics.

Paperid: 922, https://arxiv.org/pdf/2506.16044.pdf

Abstract:
With recent advancements in AI and computational tools, intelligent paradigms have emerged to enhance fields like shared autonomy and human-machine teaming in healthcare. Advanced AI algorithms (e.g., reinforcement learning) can autonomously make decisions to achieve planning and motion goals. However, in healthcare, where human intent is crucial, fully independent machine decisions may not be ideal. This chapter presents a comprehensive review of human-centered shared autonomy AI frameworks, focusing on upper limb biosignal-based machine interfaces and associated motor control systems, including computer cursors, robotic arms, and planar platforms. We examine motor planning, learning (rehabilitation), and control, covering conceptual foundations of human-machine teaming in reach-and-grasp tasks and analyzing both theoretical and practical implementations. Each section explores how human and machine inputs can be blended for shared autonomy in healthcare applications. Topics include human factors, biosignal processing for intent detection, shared autonomy in brain-computer interfaces (BCI), rehabilitation, assistive robotics, and Large Language Models (LLMs) as the next frontier. We propose adaptive shared autonomy AI as a high-performance paradigm for collaborative human-AI systems, identify key implementation challenges, and outline future directions, particularly regarding AI reasoning agents. This analysis aims to bridge neuroscientific insights with robotics to create more intuitive, effective, and ethical human-machine teaming frameworks.

Paperid: 923, https://arxiv.org/pdf/2506.15293.pdf

Abstract:
As robots enter collaborative workspaces, ensuring mutual understanding between human workers and robotic systems becomes a prerequisite for trust, safety, and efficiency. In this position paper, we draw on the cooperation scenario of the AIMotive project in which a human and a cobot jointly perform assembly tasks to argue for a structured approach to intent communication. Building on the Situation Awareness-based Agent Transparency (SAT) framework and the notion of task abstraction levels, we propose a multidimensional design space that maps intent content (SAT1, SAT3), planning horizon (operational to strategic), and modality (visual, auditory, haptic). We illustrate how this space can guide the design of multimodal communication strategies tailored to dynamic collaborative work contexts. With this paper, we lay the conceptual foundation for a future design toolkit aimed at supporting transparent human-robot interaction in the workplace. We highlight key open questions and design challenges, and propose a shared agenda for multimodal, adaptive, and trustworthy robotic collaboration in hybrid work environments.

Paperid: 924, https://arxiv.org/pdf/2506.15047.pdf

Abstract:
Family caregivers of individuals with Alzheimer's Disease and Related Dementia (AD/ADRD) face significant emotional and logistical challenges that place them at heightened risk for stress, anxiety, and depression. Although recent advances in generative AI -- particularly large language models (LLMs) -- offer new opportunities to support mental health, little is known about how caregivers perceive and engage with such technologies. To address this gap, we developed Carey, a GPT-4o-based chatbot designed to provide informational and emotional support to AD/ADRD caregivers. Using Carey as a technology probe, we conducted semi-structured interviews with 16 family caregivers following scenario-driven interactions grounded in common caregiving stressors. Through inductive coding and reflexive thematic analysis, we surface a systemic understanding of caregiver needs and expectations across six themes -- on-demand information access, emotional support, safe space for disclosure, crisis management, personalization, and data privacy. For each of these themes, we also identified the nuanced tensions in the caregivers' desires and concerns. We present a mapping of caregiver needs, AI chatbot's strengths, gaps, and design recommendations. Our findings offer theoretical and practical insights to inform the design of proactive, trustworthy, and caregiver-centered AI systems that better support the evolving mental health needs of AD/ADRD caregivers.

Paperid: 925, https://arxiv.org/pdf/2506.14196.pdf

Abstract:
Alzheimer's Disease and Related Dementias (AD/ADRD) are progressive neurodegenerative conditions that impair memory, thought processes, and functioning. Family caregivers of individuals with AD/ADRD face significant mental health challenges due to long-term caregiving responsibilities. Yet, current support systems often overlook the evolving nature of their mental wellbeing needs. Our study examines caregivers' mental wellbeing concerns, focusing on the practices they adopt to manage the burden of caregiving and the technologies they use for support. Through semi-structured interviews with 25 family caregivers of individuals with AD/ADRD, we identified the key causes and effects of mental health challenges, and developed a temporal mapping of how caregivers' mental wellbeing evolves across three distinct stages of the caregiving journey. Additionally, our participants shared insights into improvements for existing mental health technologies, emphasizing the need for accessible, scalable, and personalized solutions that adapt to caregivers' changing needs over time. These findings offer a foundation for designing dynamic, stage-sensitive interventions that holistically support caregivers' mental wellbeing, benefiting both caregivers and care recipients.

Paperid: 926, https://arxiv.org/pdf/2506.12699.pdf

Abstract:
Large language models (LLMs) are sophisticated artificial intelligence systems that enable machines to generate human-like text with remarkable precision. While LLMs offer significant technological progress, their development using vast amounts of user data scraped from the web and collected from extensive user interactions poses risks of sensitive information leakage. Most existing surveys focus on the privacy implications of the training data but tend to overlook privacy risks from user interactions and advanced LLM capabilities. This paper aims to fill that gap by providing a comprehensive analysis of privacy in LLMs, categorizing the challenges into four main areas: (i) privacy issues in LLM training data, (ii) privacy challenges associated with user prompts, (iii) privacy vulnerabilities in LLM-generated outputs, and (iv) privacy challenges involving LLM agents. We evaluate the effectiveness and limitations of existing mitigation mechanisms targeting these proposed privacy challenges and identify areas for further research.

Paperid: 927, https://arxiv.org/pdf/2506.12605.pdf

Abstract:
As large language model (LLM)-enhanced chatbots grow increasingly expressive and socially responsive, many users are beginning to form companionship-like bonds with them, particularly with simulated AI partners designed to mimic emotionally attuned interlocutors. These emerging AI companions raise critical questions: Can such systems fulfill social needs typically met by human relationships? How do they shape psychological well-being? And what new risks arise as users develop emotional ties to non-human agents? This study investigates how people interact with AI companions, especially simulated partners on CharacterAI, and how this use is associated with users' psychological well-being. We analyzed survey data from 1,131 users and 4,363 chat sessions (413,509 messages) donated by 244 participants, focusing on three dimensions of use: nature of the interaction, interaction intensity, and self-disclosure. By triangulating self-reported primary motivations, open-ended relationship descriptions, and annotated chat transcripts, we identify patterns in how users engage with AI companions and their associations with well-being. Findings suggest that people with smaller social networks are more likely to turn to chatbots for companionship, but that companionship-oriented chatbot usage is consistently associated with lower well-being, particularly when people use the chatbots more intensively, engage in higher levels of self-disclosure, and lack strong human social support. Even though some people turn to chatbots to fulfill social needs, these uses of chatbots do not fully substitute for human connection. As a result, the psychological benefits may be limited, and the relationship could pose risks for more socially isolated or emotionally vulnerable users.

Paperid: 928, https://arxiv.org/pdf/2506.12248.pdf

Abstract:
Collaborative robots must quickly adapt to their partner's intent and preferences to proactively identify helpful actions. This is especially true in situated settings where human partners can continually teach robots new high-level behaviors, visual concepts, and physical skills (e.g., through demonstration), growing the robot's capabilities as the human-robot pair work together to accomplish diverse tasks. In this work, we argue that robots should be able to infer their partner's goals from early interactions and use this information to proactively plan behaviors ahead of explicit instructions from the user. Building from the strong commonsense priors and steerability of large language models, we introduce ProVox ("Proactive Voice"), a novel framework that enables robots to efficiently personalize and adapt to individual collaborators. We design a meta-prompting protocol that empowers users to communicate their distinct preferences, intent, and expected robot behaviors ahead of starting a physical interaction. ProVox then uses the personalized prompt to condition a proactive language model task planner that anticipates a user's intent from the current interaction context and robot capabilities to suggest helpful actions; in doing so, we alleviate user burden, minimizing the amount of time partners spend explicitly instructing and supervising the robot. We evaluate ProVox through user studies grounded in household manipulation tasks (e.g., assembling lunch bags) that measure the efficiency of the collaboration, as well as features such as perceived helpfulness, ease of use, and reliability. Our analysis suggests that both meta-prompting and proactivity are critical, resulting in 38.7% faster task completion times and 31.9% less user burden relative to non-active baselines. Supplementary material, code, and videos can be found at https://provox-2025.github.io.

Paperid: 929, https://arxiv.org/pdf/2506.11827.pdf

Abstract:
Misdiagnosis can lead to delayed treatments and harm. Robotic patients offer a controlled way to train and evaluate clinicians in rare, subtle, or complex cases, reducing diagnostic errors. We present RoboPatient, a medical robotic simulator aimed at multimodal pain synthesis based on haptic and auditory feedback during palpation-based training scenarios. RoboPatient functions as an adaptive intermediary, capable of synthesizing plausible vocal and facial pain expressions in response to tactile stimuli generated during palpation. Using an abdominal phantom, RoboPatient captures and processes haptic input via an internal palpation-to-pain mapping model. To evaluate perceptual congruence between palpation and the corresponding auditory output, we conducted a study involving 7680 trials across 20 participants, in which participants evaluated pain intensity through sound. Results show that amplitude and pitch significantly influence agreement with the robot's pain expressions, irrespective of pain sounds. Stronger palpation forces elicited stronger agreement, aligning with psychophysical patterns. The study revealed two key dimensions: pitch and amplitude are central to how people perceive pain sounds, with pitch being the most influential cue. These acoustic features shape how well the sound matches the applied force during palpation, impacting perceived realism. This approach lays the groundwork for high-fidelity robotic patients in clinical education and diagnostic simulation.

Paperid: 930, https://arxiv.org/pdf/2506.11789.pdf

Abstract:
Large language models have not only captivated the public imagination but have also sparked a profound rethinking of how we learn. In the third year following the breakthrough launch of ChatGPT, everyday informal learning has been transformed as diverse user groups explore these novel tools. Who is embracing LLMs for self-directed learning, and who remains hesitant? What are their reasons for adoption or avoidance? What learning patterns emerge with this novel technological landscape? We present an in-depth analysis from a large-scale survey of 776 participants, showcasing that 88% of our respondents already incorporate LLMs into their everyday learning routines for a wide variety of (learning) tasks. Young adults are at the forefront of adopting LLMs, primarily to enhance their learning experiences independently of time and space. Four types of learners emerge across learning contexts, depending on the tasks they perform with LLMs and the devices they use to access them. Interestingly, our respondents exhibit paradoxical behaviours regarding their trust in LLMs' accuracy and privacy protection measures. Our implications emphasize the importance of including different media types for learning, enabling collaborative learning, providing sources and meeting the needs of different types of learners and learning by design.

Paperid: 931, https://arxiv.org/pdf/2506.10079.pdf

Abstract:
We describe DANCE^2, an interactive dance performance in which audience members channel their collective agency into a dancer-robot duet by voting on the behavior of a wearable robot affixed to the dancer's body. At key moments during the performance, the audience is invited to either continue the choreography or override it, shaping the unfolding interaction through real-time collective input. While post-performance surveys revealed that participants felt their choices meaningfully influenced the performance, voting data across four public performances exhibited strikingly consistent patterns. This tension between what audience members do, what they feel, and what actually changes highlights a complex interplay between agentive behavior, the experience of agency, and power. We reflect on how choreography, interaction design, and the structure of the performance mediate this relationship, offering a live analogy for algorithmically curated digital systems where agency is felt, but not exercised.

Paperid: 932, https://arxiv.org/pdf/2506.08805.pdf

Abstract:
The integration of collaborative robots (cobots) in industrial settings raises concerns about worker well-being, particularly due to reduced social interactions. Avatars - designed to facilitate worker interactions and engagement - are promising solutions to enhance the human-robot collaboration (HRC) experience. However, real-world perspectives on avatar-supported HRC remain unexplored. To address this gap, we conducted a focus group study with employees from a German manufacturing company that uses cobots. Before the discussion, participants engaged with a scripted, industry-like HRC demo in a lab setting. This qualitative approach provided valuable insights into the avatar's potential roles, improvements to its behavior, and practical considerations for deploying them in industrial workcells. Our findings also emphasize the importance of personalized communication and task assistance. Although our study's limitations restrict its generalizability, it serves as an initial step in recognizing the potential of adaptive, context-aware avatar interactions in real-world industrial environments.

Paperid: 933, https://arxiv.org/pdf/2506.08517.pdf

Abstract:
Neural disorders refer to any condition that affects the nervous system and influences how individuals perceive and interact with the world. Traditional neural diagnoses rely on cumbersome, time-consuming, or subjective methods, such as clinical interviews, behavioural observations, or medical imaging. Eye tracking is an attractive alternative because analysing eye movements, such as fixations and saccades, can provide more objective insights into brain function and cognitive processing by capturing non-verbal and unconscious responses. Despite this potential, existing gaze-based studies have presented seemingly contradictory findings. They are dispersed across diverse fields, requiring further research to standardise protocols and expand their application, particularly as a preliminary indicator of neural processes for differential diagnosis. Therefore, this paper outlines the main agreed-upon findings and provides a systematisation of knowledge and key guidelines towards advancing gaze-based neural preliminary diagnosis.

Paperid: 934, https://arxiv.org/pdf/2506.06825.pdf

Abstract:
Generative AI (Gen-AI) deepfakes pose a rapidly evolving threat to biometric authentication, yet a significant gap exists between expert understanding of these risks and public perception. This disconnection creates critical vulnerabilities in systems trusted by millions. To bridge this gap, we conducted a comprehensive mixed-method study, surveying 408 professionals across key sectors and conducting in-depth interviews with 37 participants (25 experts, 12 general public [non-experts]). Our findings reveal a paradox: while the public increasingly relies on biometrics for convenience, experts express grave concerns about the spoofing of static modalities like face and voice recognition. We found significant demographic and sector-specific divides in awareness and trust, with finance professionals, for example, showing heightened skepticism. To systematically analyze these threats, we introduce a novel Deepfake Kill Chain model, adapted from Hutchins et al.'s cybersecurity frameworks to map the specific attack vectors used by malicious actors against biometric systems. Based on this model and our empirical findings, we propose a tri-layer mitigation framework that prioritizes dynamic biometric signals (e.g., eye movements), robust privacy-preserving data governance, and targeted educational initiatives. This work provides the first empirically grounded roadmap for defending against AI-generated identity threats by aligning technical safeguards with human-centered insights.

Paperid: 935, https://arxiv.org/pdf/2506.05908.pdf

Abstract:
Gaze-based applications are advancing rapidly with the availability of large datasets, but ensuring data quality presents a substantial challenge when collecting data at scale. Collecting such data further requires different parties to collaborate, which raises privacy concerns. We propose QualitEye -- the first method for verifying image-based gaze data quality. QualitEye employs a new semantic representation of eye images that contains the information required for verification while excluding irrelevant information for better domain adaptation. QualitEye covers a public setting where parties can freely exchange data and a privacy-preserving setting where parties cannot reveal their raw data nor derive gaze features/labels of others with adapted private set intersection protocols. We evaluate QualitEye on the MPIIFaceGaze and GazeCapture datasets and achieve a high verification performance (with a small overhead in runtime for privacy-preserving versions). Hence, QualitEye paves the way for new gaze analysis methods at the intersection of machine learning, human-computer interaction, and cryptography.

Paperid: 936, https://arxiv.org/pdf/2506.05720.pdf

Abstract:
Earable devices, wearables positioned in or around the ear, are undergoing a rapid transformation from audio-centric accessories into multifunctional systems for interaction, contextual awareness, and health monitoring. This evolution is driven by commercial trends emphasizing sensor integration and by a surge of academic interest exploring novel sensing capabilities. Building on the foundation established by earlier surveys, this work presents a timely and comprehensive review of earable research published since 2022. We analyze over one hundred recent studies to characterize this shifting research landscape, identify emerging applications and sensing modalities, and assess progress relative to prior efforts. In doing so, we address three core questions: how has earable research evolved in recent years, what enabling resources are now available, and what opportunities remain for future exploration. Through this survey, we aim to provide both a retrospective and forward-looking view of earable technology as a rapidly expanding frontier in ubiquitous computing. In particular, this review reveals that over the past three years, researchers have discovered a variety of novel sensing principles, developed many new earable sensing applications, enhanced the accuracy of existing sensing tasks, and created substantial new resources to advance research in the field. Based on this, we further discuss open challenges and propose future directions for the next phase of earable research.

Paperid: 937, https://arxiv.org/pdf/2506.05605.pdf

Abstract:
Scenario building is an established method to anticipate the future of emerging technologies. Its primary goal is to use narratives to map future trajectories of technology development and sociotechnical adoption. Following this process, risks and benefits can be identified early on, and strategies can be developed that strive for desirable futures. In recent years, computer science has adopted this method and applied it to various technologies, including Artificial Intelligence (AI). Because computing technologies play such an important role in shaping modern societies, it is worth exploring how scenarios are being used as an anticipatory tool in the field -- and what possible traditional uses of scenarios are not yet covered but have the potential to enrich the field. We address this gap by conducting a systematic literature review on the use of scenario building methods in computer science over the last decade (n = 59). We guide the review along two main questions. First, we aim to uncover how scenarios are used in computing literature, focusing especially on the rationale for why scenarios are used. Second, in following the potential of scenario building to enhance inclusivity in research, we dive deeper into the participatory element of the existing scenario building literature in computer science.

Paperid: 938, https://arxiv.org/pdf/2506.03113.pdf

Abstract:
This paper presents a multi-stage experimental framework that integrates immersive Virtual Reality (VR) simulations, wearable sensors, and advanced signal processing to investigate construction workers' neuro-physiological stress responses to multi-sensory AR-enabled warnings. Participants performed light- and moderate-intensity roadway maintenance tasks within a high-fidelity VR roadway work zone, while key stress markers of electrodermal activity (EDA), heart rate variability (HRV), and electroencephalography (EEG) were continuously measured. Statistical analyses revealed that task intensity significantly influenced physiological and neurological stress indicators. Moderate-intensity tasks elicited greater autonomic arousal, evidenced by elevated heart rate measures (mean-HR, std-HR, max-HR) and stronger electrodermal responses, while EEG data indicated distinct stress-related alpha suppression and beta enhancement. Feature-importance analysis further identified mean EDR and short-term HR metrics as discriminative for classifying task intensity. Correlation results highlighted a temporal lag between immediate neural changes and subsequent physiological stress reactions, emphasizing the interplay between cognition and autonomic regulation during hazardous tasks.

Paperid: 939, https://arxiv.org/pdf/2505.23326.pdf

Abstract:
Entrepreneurship education equips students to transform innovative ideas into actionable entrepreneurship plans, yet traditional approaches often struggle to provide the personalized guidance and practical alignment needed for success. Focusing on the business plan as a key learning tool and evaluation method, this study investigates the design needs for an AI-empowered scaffold system to address these challenges. Based on qualitative insights from educators and students, the findings highlight three critical dimensions for system design: mastery of business plan development, alignment with entrepreneurial learning goals, and integration of adaptive system features. These findings underscore the transformative potential of AI in bridging gaps in entrepreneurship education while emphasizing the enduring value of human mentorship and experiential learning.

Paperid: 940, https://arxiv.org/pdf/2505.23090.pdf

Abstract:
Dancers often prototype movements themselves or with each other during improvisation and choreography. How are these interactions altered when physically manipulable technologies are introduced into the creative process? To understand how dancers design and improvise movements while working with instruments capable of non-humanoid movements, we engaged dancers in workshops to co-create movements with a robot arm in one-human-to-one-robot and three-human-to-one-robot settings. We found that dancers produced more fluid movements in one-to-one scenarios, experiencing a stronger sense of connection and presence with the robot as a co-dancer. In three-to-one scenarios, the dancers divided their attention between the human dancers and the robot, resulting in increased perceived use of space and more stop-and-go movements, perceiving the robot as part of the stage background. This work highlights how technologies can drive creativity in movement artists adapting to new ways of working with physical instruments, contributing design insights supporting artistic collaborations with non-humanoid agents.

Paperid: 941, https://arxiv.org/pdf/2505.22769.pdf

Abstract:
Mobile gaze tracking faces a fundamental challenge: maintaining accuracy as users naturally change their postures and device orientations. Traditional calibration approaches, such as one-off calibration, fail to adapt to these dynamic conditions, leading to degraded performance over time. We present MAC-Gaze, a Motion-Aware continual Calibration approach that leverages smartphone inertial measurement unit (IMU) sensors and continual learning techniques to automatically detect changes in user motion states and update the gaze tracking model accordingly. Our system integrates a pre-trained visual gaze estimator and an IMU-based activity recognition model with a clustering-based hybrid decision-making mechanism that triggers recalibration when motion patterns deviate significantly from previously encountered states. To enable accumulative learning of new motion conditions while mitigating catastrophic forgetting, we employ replay-based continual learning, allowing the model to maintain performance across previously encountered motion conditions. We evaluate our system through extensive experiments on the publicly available RGBDGaze dataset and our own 10-hour multimodal MotionGaze dataset (481K+ images, 800K+ IMU readings), encompassing a wide range of postures under various motion conditions including sitting, standing, lying, and walking. Results demonstrate that our method reduces gaze estimation error by 19.9% on RGBDGaze (from 1.73 cm to 1.41 cm) and by 31.7% on MotionGaze (from 2.81 cm to 1.92 cm) compared to traditional calibration approaches. Our framework provides a robust solution for maintaining gaze estimation accuracy in mobile scenarios.
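
As a rough illustration of the recalibration trigger described above, the following Python sketch flags a recalibration when an IMU feature vector lies far from every previously encountered motion-state cluster. It is a simplification under stated assumptions, not the MAC-Gaze implementation: the centroid list, the Euclidean distance measure, the threshold value, and the function names are all hypothetical.

```python
import math

def nearest_centroid_distance(feature, centroids):
    """Euclidean distance from an IMU feature vector to the closest
    known motion-state centroid (hypothetical representation)."""
    return min(math.dist(feature, c) for c in centroids)

def needs_recalibration(feature, centroids, threshold):
    """Return True when the feature is farther than `threshold` from
    every known centroid, i.e. the motion state looks unfamiliar."""
    return nearest_centroid_distance(feature, centroids) > threshold

# Hypothetical centroids of mean accelerometer readings (m/s^2)
# for two previously calibrated postures.
known_states = [(0.0, 0.0, 9.8), (0.0, 9.8, 0.0)]

familiar = (0.1, 0.0, 9.7)    # close to the first centroid
unfamiliar = (9.8, 0.0, 0.0)  # far from both centroids

print(needs_recalibration(familiar, known_states, threshold=1.0))
print(needs_recalibration(unfamiliar, known_states, threshold=1.0))
```

In a fuller system, a new centroid would presumably be added after each recalibration so that the same motion state does not trigger again; that step is omitted here.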

Paperid: 942, https://arxiv.org/pdf/2505.22477.pdf

Abstract:
In the intelligent era, the interaction between humans and intelligent systems fundamentally involves collaboration with autonomous intelligent agents. Human-AI Collaboration (HAC) represents a novel type of human-machine relationship facilitated by autonomous intelligent machines equipped with AI technologies. In this paradigm, AI agents serve not only as auxiliary tools but also as active teammates, partnering with humans to accomplish tasks collaboratively. Human-centered AI (HCAI) emphasizes that humans play critical leadership roles in the collaboration. This human-led collaboration imparts new dimensions to the human-machine relationship, necessitating innovative research perspectives, paradigms, and agenda to address the unique challenges posed by HAC. This chapter delves into the essence of HAC from the human-centered perspective, outlining its core concepts and distinguishing features. It reviews the current research methodologies and research agenda within the HAC field from the HCAI perspective, highlighting advancements and ongoing studies. Furthermore, a framework for human-centered HAC (HCHAC) is proposed by integrating these reviews and analyses. A case study of HAC in the context of autonomous vehicles is provided, illustrating practical applications and the synergistic interactions between humans and AI agents. Finally, it identifies potential future research directions aimed at enhancing the effectiveness, reliability, and ethical integration of human-centered HAC systems in diverse domains.

Paperid: 943, https://arxiv.org/pdf/2505.21153.pdf

Abstract:
What if digital fabrication waste could observe the world? What would they see? What would they say? "THE WASTIVE" reimagines digital fabrication waste as sentient observers, giving them a poetic voice through interactive art. As viewers approach, the installation awakens, mimicking the rhythmic ebb and flow of ocean waves - a silent dialogue where discarded materials "observe" and respond to human presence. These interactions echo the gentle murmurs of the sea, transforming technological residue into a reflective, sensory experience. Through this artistic contemplation, "THE WASTIVE" invites audiences to reconsider their creative processes and consumption habits. It serves as a poetic call for more mindful, sustainable practices, provoking deeper reflections on our interconnectedness with the environment.

Paperid: 944, https://arxiv.org/pdf/2505.20918.pdf

Abstract:
Humble AI (Knowles et al., 2023) argues for cautiousness in AI development and deployments through scepticism (accounting for limitations of statistical learning), curiosity (accounting for unexpected outcomes), and commitment (accounting for multifaceted values beyond performance). We present a real-world case study for humble AI in the domain of algorithmic hiring. Specifically, we evaluate virtual screening algorithms in a widely used hiring platform that matches candidates to job openings. There are several challenges in misrecognition and stereotyping in such contexts that are difficult to assess through standard fairness and trust frameworks; e.g., someone with a non-traditional background is less likely to rank highly. We demonstrate technical feasibility of how humble AI principles can be translated to practice through uncertainty quantification of ranks, entropy estimates, and a user experience that highlights algorithmic unknowns. We describe preliminary discussions with focus groups made up of recruiters. Future user studies seek to evaluate whether the higher cognitive load of a humble AI system fosters a climate of trust in its outcomes.
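
To make the idea of uncertainty quantification of ranks with entropy estimates concrete, here is a minimal Python sketch. It assumes a hypothetical noise model (independent Gaussian noise added to each candidate's score) and estimates, by sampling, how often each candidate lands at each rank; the Shannon entropy of that rank distribution then summarizes how uncertain the candidate's position is. The function names, the noise model, and the example scores are illustrative assumptions, not the method used in the paper.

```python
import math
import random

def rank_distribution(scores, noise_sd, n_samples=2000, seed=0):
    """Estimate P(rank | candidate) by repeatedly perturbing scores
    with Gaussian noise (assumed noise model) and re-ranking."""
    rng = random.Random(seed)
    n = len(scores)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_samples):
        noisy = [s + rng.gauss(0.0, noise_sd) for s in scores]
        # Highest noisy score gets rank 0.
        order = sorted(range(n), key=lambda i: -noisy[i])
        for rank, cand in enumerate(order):
            counts[cand][rank] += 1
    return [[c / n_samples for c in row] for row in counts]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

scores = [0.90, 0.52, 0.50]  # hypothetical matching scores
dist = rank_distribution(scores, noise_sd=0.05)
for cand, probs in enumerate(dist):
    print(cand, round(entropy_bits(probs), 3))
```

A candidate whose score is well separated from the others should show entropy near zero, while candidates with nearly tied scores should show higher entropy, which is the kind of "algorithmic unknown" an interface could surface to recruiters.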

Paperid: 945, https://arxiv.org/pdf/2505.20692.pdf

Abstract:
Recent advances in generative AI have enabled visual content creation through text-to-image (T2I) generation. However, despite their creative potential, T2I models often replicate and amplify societal stereotypes -- particularly those related to gender, race, and culture -- raising important ethical concerns. This paper proposes a theory-driven bias detection rubric and a Social Stereotype Index (SSI) to systematically evaluate social biases in T2I outputs. We audited three major T2I model outputs -- DALL-E-3, Midjourney-6.1, and Stability AI Core -- using 100 queries across three categories -- geocultural, occupational, and adjectival. Our analysis reveals that initial outputs are prone to include stereotypical visual cues, including gendered professions, cultural markers, and western beauty norms. To address this, we adopted our rubric to conduct targeted prompt refinement using LLMs, which significantly reduced bias -- SSI dropped by 61% for geocultural, 69% for occupational, and 51% for adjectival queries. We complemented our quantitative analysis through a user study examining perceptions, awareness, and preferences around AI-generated biased imagery. Our findings reveal a key tension -- although prompt refinement can mitigate stereotypes, it can limit contextual alignment. Interestingly, users often perceived stereotypical images to be more aligned with their expectations. We discuss the need to balance ethical debiasing with contextual relevance and call for T2I systems that support global diversity and inclusivity while not compromising the reflection of real-world social complexity.

Paperid: 946, https://arxiv.org/pdf/2505.11666.pdf

Abstract:
Industrial products are designed to satisfy the needs of consumers. The rise of generative artificial intelligence (GenAI) enables consumers to easily modify a product by prompting a generative model, opening up opportunities to incorporate consumers in exploring the product design space. However, consumers often struggle to articulate their preferred product features due to their unfamiliarity with terminology and their limited understanding of the structure of product features. We present DesignFromX, a system that empowers consumer-driven design space exploration by helping consumers to design a product based on their preferences. Leveraging an effective GenAI-based framework, the system allows users to easily identify design features from product images and compose those features to generate conceptual images and 3D models of a new product. A user study with 24 participants demonstrates that DesignFromX lowers the barriers and frustration for consumer-driven design space explorations by enhancing both engagement and enjoyment for the participants.

Paperid: 947, https://arxiv.org/pdf/2505.11366.pdf

Abstract:
Current invasive assistive technologies are designed to infer high-dimensional motor control signals from severely paralyzed patients. However, they face significant challenges, including public acceptance, limited longevity, and barriers to commercialization. Meanwhile, noninvasive alternatives often rely on artifact-prone signals, require lengthy user training, and struggle to deliver robust high-dimensional control for dexterous tasks. To address these issues, this study introduces a novel human-centered multimodal AI approach as an intelligent compensatory mechanism for lost motor functions that could potentially enable patients with severe paralysis to control high-dimensional assistive devices, such as dexterous robotic arms, using limited and noninvasive inputs. In contrast to the current state-of-the-art (SoTA) noninvasive approaches, our context-aware, multimodal shared-autonomy framework integrates deep reinforcement learning algorithms to blend limited low-dimensional user input with real-time environmental perception, enabling adaptive, dynamic, and intelligent interpretation of human intent for complex dexterous manipulation tasks, such as pick-and-place. The results from our ARAS (Adaptive Reinforcement learning for Amplification of limited inputs in Shared autonomy) trained with synthetic users over 50,000 computer simulation episodes demonstrated the first successful implementation of the proposed closed-loop human-in-the-loop paradigm, outperforming the SoTA shared autonomy algorithms. Following a zero-shot sim-to-real transfer, ARAS was evaluated on 23 human subjects, demonstrating high accuracy in dynamic intent detection and smooth, stable 3D trajectory control for dexterous pick-and-place tasks. The ARAS user study achieved a high task success rate of 92.88%, with short completion times comparable to those of SoTA invasive assistive technologies.

Paperid: 948, https://arxiv.org/pdf/2505.09867.pdf

Abstract:
This study examines stress levels in roadway workers utilizing AR-assisted multi-sensory warning systems under varying work intensities. A high-fidelity Virtual Reality environment was used to replicate real-world scenarios, allowing safe exploration of high-risk situations while focusing on the physiological impacts of work conditions. Wearable sensors were used to continuously and non-invasively collect physiological data, including electrodermal activity to monitor stress responses. Analysis of data from 18 participants revealed notable differences in EDR between light- and medium-intensity activities, reflecting variations in autonomic nervous system activity under stress. In addition, a feature importance analysis revealed that peak and central tendency metrics of EDR were robust indicators of the differences in physiological responses between light- and medium-intensity activities. The findings emphasize the relationship between AR-enabled warnings, work intensity, and worker stress, offering an approach to active stress monitoring and improved safety practices. By leveraging real-time physiological insights, this methodology has the potential to support better stress management and the development of more effective safety warning systems for roadway work zones. This research also provides valuable guidance for designing interventions to enhance worker safety, productivity, and well-being in high-risk settings.

Paperid: 949, https://arxiv.org/pdf/2505.08119.pdf

Abstract:
Generative AI (GenAI), especially Large Language Models (LLMs), is rapidly reshaping both programming workflows and computer science education. Many programmers now incorporate GenAI tools into their workflows, including for collaborative coding tasks such as pair programming. While prior research has demonstrated the benefits of traditional pair programming and begun to explore GenAI-assisted coding, the role of LLM-based tools as collaborators in pair programming remains underexamined. In this work, we conducted a mixed-methods study with 39 undergraduate students to examine how GenAI influences collaboration, learning, and performance in pair programming. Specifically, students completed six in-class assignments under three conditions: Traditional Pair Programming (PP), Pair Programming with GenAI (PAI), and Solo Programming with GenAI (SAI). They used both LLM-based inline completion tools (e.g., GitHub Copilot) and LLM-based conversational tools (e.g., ChatGPT). Our results show that students in PAI achieved the highest assignment scores, whereas those in SAI attained the lowest. Additionally, students' attitudes toward LLMs' programming capabilities improved significantly after collaborating with LLM-based tools, and preferences were largely shaped by the perceived usefulness for completing assignments and learning programming skills, as well as the quality of collaboration. Our qualitative findings further reveal that while students appreciated LLM-based tools as valuable pair programming partners, they also identified limitations and had different expectations compared to human teammates. Our study provides one of the first empirical evaluations of GenAI as a pair programming collaborator through a comparison of three conditions (PP, PAI, and SAI). We also discuss the design implications and pedagogical considerations for future GenAI-assisted pair programming approaches.

Paperid: 950, https://arxiv.org/pdf/2505.07069.pdf

Abstract:
Group awareness--the ability to perceive the activities of collaborators in a shared space--is a vital mechanism to support effective coordination and joint data analysis in collaborative visualization. We introduce collaborative attention-aware visualizations (CAAVs) that track, record, and revisualize the collective attention of multiple users over time. We implement this concept in HeedVision, a standards-compliant WebXR system that runs on modern AR/VR headsets. Through a user study where pairs of analysts performed visual search tasks in HeedVision, we demonstrate how attention revisualization enhances collaborative performance in immersive analytics. Our findings reveal that CAAVs improve spatial coordination, search efficiency, and task load distribution among collaborators, with benefits varying by visualization context. This work extends attention awareness from individual to multi-user settings and provides empirical evidence for its benefits in collaborative immersive analytics.

Paperid: 951, https://arxiv.org/pdf/2505.05441.pdf

Abstract:
Large Language Model (LLM)-based copilots have shown great potential in Extended Reality (XR) applications. However, the user faces challenges when describing the 3D environments to the copilots due to the complexity of conveying spatial-temporal information through text or speech alone. To address this, we introduce GesPrompt, a multimodal XR interface that combines co-speech gestures with speech, allowing end-users to communicate more naturally and accurately with LLM-based copilots in XR environments. By incorporating gestures, GesPrompt extracts spatial-temporal reference from co-speech gestures, reducing the need for precise textual prompts and minimizing cognitive load for end-users. Our contributions include (1) a workflow to integrate gesture and speech input in the XR environment, (2) a prototype VR system that implements the workflow, and (3) a user study demonstrating its effectiveness in improving user communication in VR environments.

Paperid: 952, https://arxiv.org/pdf/2505.05038.pdf

Abstract:
Multiple challenges emerge when analyzing eye-tracking data with areas of interest (AOIs) because recordings are subject to different sources of uncertainties. Previous work often presents gaze data without considering those inaccuracies in the data. To address this issue, we developed uncertainty-aware scarf plot visualizations that aim to make analysts aware of uncertainties with respect to the position-based mapping of gaze to AOIs and depth dependency in 3D scenes. Additionally, we also consider uncertainties in automatic AOI annotation. We showcase our approach in comparison to standard scarf plots in an augmented reality scenario.

Paperid: 953, https://arxiv.org/pdf/2505.01678.pdf

Abstract:
Non-native speakers (NNSs) often face speaking challenges in real-time multilingual communication, such as struggling to articulate their thoughts. To address this issue, we developed an AI-based speaking assistant (AISA) that provides speaking references for NNSs based on their input queries, task background, and conversation history. To explore NNSs' interaction with AISA and its impact on NNSs' speaking during real-time multilingual communication, we conducted a mixed-method study involving a within-subject experiment and follow-up interviews. In the experiment, two native speakers (NSs) and one NNS formed a team (31 teams in total) and completed two collaborative tasks--one with access to the AISA and one without. Overall, our study revealed four types of AISA input patterns among NNSs, each reflecting different levels of effort and language preferences. Although AISA did not improve NNSs' speaking competence, follow-up interviews revealed that it helped improve the logical flow and depth of their speech. Moreover, the additional multitasking introduced by AISA, such as entering queries and reviewing system output, potentially elevated NNSs' workload and anxiety. Based on these observations, we discuss the pros and cons of implementing tools to assist NNSs in real-time multilingual communication and offer design recommendations.

Paperid: 954, https://arxiv.org/pdf/2505.01413.pdf

Abstract:
The eyes play an important role in human collaboration. Mutual and shared gaze help communicate visual attention to each other or to a specific object of interest. Shared gaze was typically investigated for pair collaborations in remote settings and with people in virtual and augmented reality. With our work, we expand this line of research by a new technique to communicate gaze between groups in tabletop workshop scenarios. To achieve this communication, we use an approach based on projection mapping to unify gaze data from multiple participants into a common visualization space on a tabletop. We showcase our approach with a collaborative puzzle-solving task that displays shared visual attention on individual pieces and provides hints to solve the problem at hand.

Paperid: 955, https://arxiv.org/pdf/2505.00101.pdf

Abstract:
Understanding physiological responses during running is critical for performance optimization, tailored training prescriptions, and athlete health management. We introduce a comprehensive framework -- what we believe to be the first capable of predicting instantaneous oxygen consumption (VO$_{2}$) trajectories exclusively from consumer-grade wearable data. Our approach employs two complementary physiological models: (1) accurate modeling of heart rate (HR) dynamics via a physiologically constrained ordinary differential equation (ODE) and neural Kalman filter, trained on over 3 million HR observations, achieving 1-second interval predictions with mean absolute errors as low as 2.81 bpm (correlation 0.87); and (2) leveraging the principles of precise HR modeling, a novel VO$_{2}$ prediction architecture requiring only the initial second of VO$_{2}$ data for calibration, enabling robust, sequence-to-sequence metabolic demand estimation. Despite relying solely on smartwatch and chest-strap data, our method achieves mean absolute percentage errors of approximately 13%, effectively capturing rapid physiological transitions and steady-state conditions across diverse running intensities. Our synchronized dataset, complemented by blood lactate measurements, further lays the foundation for future noninvasive metabolic zone identification. By embedding physiological constraints within modern machine learning, this framework democratizes advanced metabolic monitoring, bridging laboratory-grade accuracy and everyday accessibility, thus empowering both elite athletes and recreational fitness enthusiasts.
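
For readers unfamiliar with the two error metrics quoted above, the following Python sketch shows how mean absolute error (reported for heart rate, in bpm) and mean absolute percentage error (reported for VO$_{2}$) are conventionally computed. The sample values are made up for illustration and are not taken from the paper's dataset; the function names are likewise hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted|, in the units of the signal."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |true - predicted| / |true|, expressed in percent.
    Assumes no true value is zero."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)

# Illustrative (made-up) samples: heart rate in bpm, VO2 in ml/kg/min.
hr_true, hr_pred = [150, 160], [148, 163]
vo2_true, vo2_pred = [40.0, 50.0], [44.0, 45.0]

print(mean_absolute_error(hr_true, hr_pred))
print(mean_absolute_percentage_error(vo2_true, vo2_pred))
```

MAE keeps the units of the measured signal, which suits heart rate, while MAPE normalizes by the true value, which makes errors comparable across low and high running intensities.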

Paperid: 956, https://arxiv.org/pdf/2504.21360.pdf

Abstract:
While augmented reality (AR) enables new ways to play, tell stories, and explore ideas rooted in the physical world, authoring personalized AR content remains difficult for non-experts, often requiring professional tools and time. Prior systems have explored AI-driven XR design but typically rely on manually defined VR environments and fixed asset libraries, limiting creative flexibility and real-world relevance. We introduce ImaginateAR, the first mobile tool for AI-assisted AR authoring to combine offline scene understanding, fast 3D asset generation, and LLMs -- enabling users to create outdoor scenes through natural language interaction. For example, saying "a dragon enjoying a campfire" (P7) prompts the system to generate and arrange relevant assets, which can then be refined manually. Our technical evaluation shows that our custom pipelines produce more accurate outdoor scene graphs and generate 3D meshes faster than prior methods. A three-part user study (N=20) revealed preferred roles for AI, how users create in freeform use, and design implications for future AR authoring tools. ImaginateAR takes a step toward empowering anyone to create AR experiences anywhere -- simply by speaking their imagination.

Paperid: 957, https://arxiv.org/pdf/2504.19038.pdf

Abstract:
After the release of several AI literacy guidelines, the rapid rise and widespread adoption of generative AI, such as ChatGPT, DALL-E, and DeepSeek, have transformed our lives. Unlike traditional AI algorithms (e.g., convolutional neural networks, semantic networks, classifiers) captured in existing AI literacy frameworks, generative AI exhibits distinct and more nuanced characteristics. However, a lack of robust generative AI literacy is hindering individuals' ability to critically evaluate and use these models effectively and responsibly. To address this gap, we propose a set of guidelines with 12 items for generative AI literacy, organized into four key aspects: (1) Guidelines for Generative AI Tool Selection and Prompting, (2) Guidelines for Understanding Interaction with Generative AI, (3) Guidelines for Understanding Interaction with Generative AI, and (4) Guidelines for High Level Understanding of Generative AI. These guidelines aim to support schools, companies, educators, and organizations in developing frameworks that empower their members, such as students, employees, and stakeholders, to use generative AI in an efficient, ethical, and informed way.

Paperid: 958, https://arxiv.org/pdf/2504.18932.pdf

Abstract:
Recent advancements in LLMs enable chatbots to interact with individuals on a range of queries, including sensitive mental health contexts. Despite uncertainties about their effectiveness and reliability, the development of LLMs in these areas is growing, potentially leading to harms. To better identify and mitigate these harms, it is critical to understand how the values of people with lived experiences relate to the harms. In this study, we developed a technology probe, a GPT-4o based chatbot called Zenny, enabling participants to engage with depression self-management scenarios informed by previous research. We used Zenny to interview 17 individuals with lived experiences of depression. Our thematic analysis revealed key values: informational support, emotional support, personalization, privacy, and crisis management. This work explores the relationship between lived experience values, potential harms, and design recommendations for mental health AI chatbots, aiming to enhance self-management support while minimizing risks.

Paperid: 959, https://arxiv.org/pdf/2504.18919.pdf

Abstract:
Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested if LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.

Paperid: 960, https://arxiv.org/pdf/2504.17531.pdf

Abstract:
The growing capabilities of Artificial Intelligence (AI), particularly Large Language Models (LLMs), prompt a reassessment of the interaction mechanisms between users and their devices. Currently, users are required to use a set of high-level applications to achieve their desired results. However, the advent of AI may signal a shift in this regard, as its capabilities have generated novel prospects for user-provided intent resolution through the deployment of model-generated code. This development represents a significant progression in the realm of hybrid workflows, where human and artificial intelligence collaborate to address user intentions, with the former responsible for defining these intentions and the latter for implementing the solutions to address them. In this paper, we investigate the feasibility of generating and executing workflows through code generation that results from prompting an LLM with a concrete user intention, and a simplified application programming interface for a GUI-less operating system. We provide an in-depth analysis and comparison of various user intentions, the resulting code, and its execution. The findings demonstrate the general feasibility of our approach and that the employed LLM, GPT-4o-mini, exhibits remarkable proficiency in the generation of code-oriented workflows in accordance with provided user intentions.

Paperid: 961, https://arxiv.org/pdf/2504.16562.pdf

Abstract:
Augmented Reality (AR) is transforming the way we interact with virtual information in the physical world. By overlaying digital content in real-world environments, AR enables new forms of immersive and engaging experiences. However, existing AR systems often struggle to effectively manage the many interactive possibilities that AR presents. This vision paper speculates on AI-driven approaches for adaptive AR content placement, dynamically adjusting to user movement and environmental changes. By leveraging machine learning methods, such a system would intelligently manage content distribution between AR projections integrated into the external environment and fixed static content, enabling seamless UI layout and potentially reducing users' cognitive load. By exploring the possibilities of AI-driven dynamic AR content placement, we aim to envision new opportunities for innovation and improvement in various industries, from urban navigation and workplace productivity to immersive learning and beyond. This paper outlines a vision for the development of more intuitive, engaging, and effective AI-powered AR experiences.

Paperid: 962, https://arxiv.org/pdf/2504.16423.pdf

Abstract:
Millimeter wave (mmWave) radar sensors play a vital role in hand gesture recognition (HGR) by detecting subtle motions while preserving user privacy. However, the limited scale of radar datasets hinders recognition performance. Existing synthetic data generation methods fall short in two key areas. On the one hand, modeling-based approaches fail to accurately simulate wave propagation and reflection at the hand-gesture level, which involves unique complexities such as diffraction and occlusion. On the other hand, generative model-based methods are hard to train to convergence when radar data is limited, lack interpretability, and sometimes fail to produce kinematically plausible results. To overcome these limitations, we propose a novel hybrid spectrum synthesis framework leveraging visual hand gesture data. It combines a cylinder mesh-based hand reflection model with a small-scale neural network called RadarWeightNet, which focuses on assigning weights to simulated signals. Our framework addresses two key challenges: achieving accurate simulation of complex hand geometry and bridging the simulation-to-real gap in a data-driven manner while preserving interpretability, which balances physical accuracy with machine learning adaptability. We tested our framework under extreme scenarios where radar data is scarce. The results demonstrate the effectiveness of our hybrid framework, achieving up to 63% SSIM in synthetic performance and up to 30% improvement in classification performance in few-shot learning.

Paperid: 963, https://arxiv.org/pdf/2504.15984.pdf

Abstract:
Neuroadaptive haptics offers a path to more immersive extended reality (XR) experiences by dynamically tuning multisensory feedback to user preferences. We present a neuroadaptive haptics system that adapts XR feedback through reinforcement learning (RL) from explicit user ratings and brain-decoded neural signals. In a user study, participants interacted with virtual objects in VR while Electroencephalography (EEG) data were recorded. An RL agent adjusted haptic feedback based either on explicit ratings or on outputs from a neural decoder. Results show that the RL agent's performance was comparable across feedback sources, suggesting that implicit neural feedback can effectively guide personalization without requiring active user input. The EEG-based neural decoder achieved a mean F1 score of 0.8, supporting reliable classification of user experience. These findings demonstrate the feasibility of combining brain-computer interfaces (BCI) and RL to autonomously adapt XR interactions, reducing cognitive load and enhancing immersion.
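
The adaptation loop described above can be illustrated with a minimal sketch. It assumes a deliberately simple epsilon-greedy bandit over three discrete haptic settings; the abstract does not state which RL algorithm, action space, or reward scaling the authors use, so the class name, the `simulate` helper, and the reward values below are all hypothetical. The reward passed to `update` could stand for either an explicit rating or a neural decoder output.

```python
import random


class EpsilonGreedyHapticAgent:
    """Toy bandit that picks among discrete haptic feedback settings.

    Hypothetical illustration only; not the agent used in the paper.
    """

    def __init__(self, n_settings, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * n_settings
        self.values = [0.0] * n_settings  # running mean reward per setting

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda i: self.values[i])

    def update(self, setting, reward):
        # Incremental update of the mean reward observed for this setting.
        self.counts[setting] += 1
        self.values[setting] += (reward - self.values[setting]) / self.counts[setting]


def simulate(reward_source, steps=500, seed=0):
    """Run the agent against a reward source (rating or decoder output)."""
    agent = EpsilonGreedyHapticAgent(n_settings=3, epsilon=0.1, seed=seed)
    for _ in range(steps):
        setting = agent.select()
        agent.update(setting, reward_source(setting))
    return agent


if __name__ == "__main__":
    # Assumed user preference over three settings (made-up numbers).
    preference = [0.2, 0.9, 0.5]
    agent = simulate(lambda s: preference[s])
    print("estimated values:", agent.values)
```

Swapping the lambda for a function that returns a decoder's predicted label would model the implicit-feedback condition, under the same assumptions.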

Paperid: 964, https://arxiv.org/pdf/2504.14649.pdf

Abstract:
As Artificial Intelligence (AI) becomes increasingly integrated into older adults' daily lives, equipping them with the knowledge and skills to understand and use AI is crucial. However, most research on AI literacy education has focused on students and children, leaving a gap in understanding the unique needs of older adults when learning about AI. To address this, we surveyed 103 older adults aged 50 and above (Mean = 64, SD = 7). Results revealed that they considered learning about AI important and were motivated to do so because they wish to harness the benefits and avoid the dangers of AI, seeing it as necessary to cope in the future. However, they expressed learning challenges such as difficulties in understanding and not knowing how to start learning AI. In particular, they indicated a strong preference for hands-on learning. We discuss design opportunities to support AI literacy education for older adults.

Paperid: 965, https://arxiv.org/pdf/2504.13879.pdf

Abstract:
The widespread adoption of EHRs following the HITECH Act has increased the clinician documentation burden, contributing to burnout. Emerging technologies, such as ambient listening tools powered by generative AI, offer real-time, scribe-like documentation capabilities to reduce physician workload. This study evaluates the impact of ambient listening tools implemented at UCI Health by analyzing EPIC Signal data to assess changes in note length and time spent on notes. Results show significant reductions in note-taking time and an increase in note length, particularly during the first month post-implementation. Findings highlight the potential of AI-powered documentation tools to improve clinical efficiency. Future research should explore adoption barriers, long-term trends, and user experiences to enhance the scalability and sustainability of ambient listening technology in clinical practice.

Paperid: 966, https://arxiv.org/pdf/2504.13277.pdf

Abstract:
Suicide is a critical global public health issue, with millions experiencing suicidal ideation (SI) each year. Online spaces enable individuals to express SI and seek peer support. While prior research has revealed the potential of detecting SI using machine learning and natural language analysis, a key limitation is the lack of a theoretical framework to understand the underlying factors affecting high-risk suicidal intent. To bridge this gap, we adopted the Interpersonal Theory of Suicide (IPTS) as an analytic lens to analyze 59,607 posts from Reddit's r/SuicideWatch, categorizing them into SI dimensions (Loneliness, Lack of Reciprocal Love, Self Hate, and Liability) and risk factors (Thwarted Belongingness, Perceived Burdensomeness, and Acquired Capability of Suicide). We found that high-risk SI posts express planning and attempts, methods and tools, and weaknesses and pain. In addition, we examined the language of supportive responses through psycholinguistic and content analyses and found that individuals respond differently to different stages of SI posts. Finally, we explored the role of AI chatbots in providing effective supportive responses to SI posts. We found that although AI improved structural coherence, expert evaluations highlight persistent shortcomings in providing dynamic, personalized, and deeply empathetic support. These findings underscore the need for careful reflection and deeper understanding in both the development and consideration of AI-driven interventions for effective mental health support.

Paperid: 967, https://arxiv.org/pdf/2504.09961.pdf

Abstract:
As Large Language Models (LLMs) become integral to scientific workflows, concerns over the confidentiality and ethical handling of sensitive data have emerged. This paper explores, from scientists' perspectives, data exposure risks in LLM-powered scientific tools, which can inadvertently leak confidential information, including intellectual property and proprietary data. We propose "DataShield", a framework designed to detect confidential data leaks, summarize privacy policies, and visualize data flow, ensuring alignment with organizational policies and procedures. Our approach aims to inform scientists about data handling practices, enabling them to make informed decisions and protect sensitive information. User studies with scientists are underway to evaluate the framework's usability, trustworthiness, and effectiveness in tackling real-world privacy challenges.

Paperid: 968, https://arxiv.org/pdf/2504.09809.pdf

Abstract:
Recent developments in multimodal large language models (MLLM) have equipped language models to reason about vision and language jointly. This permits MLLMs to both perceive and answer questions about data visualizations across a variety of designs and tasks. Applying MLLMs to a broad range of visualization tasks requires us to properly evaluate their capabilities, and the most common way to conduct evaluation is through measuring a model's visualization reasoning capability, analogous to how we would evaluate human understanding of visualizations (e.g., visualization literacy). However, we found that in the context of visualization question answering (VisQA), how an MLLM perceives and reasons about visualizations can be fundamentally different from how humans approach the same problem. During the evaluation, even without visualization, the model could correctly answer a substantial portion of the visualization test questions, regardless of whether any selection options were provided. We hypothesize that the vast amount of knowledge encoded in the language model permits factual recall that supersedes the need to seek information from the visual signal. This raises concerns that the current VisQA evaluation may not fully capture the models' visualization reasoning capabilities. To address this, we propose a comprehensive sanity check framework that integrates a rule-based decision tree and a sanity check table to disentangle the effects of "seeing" (visual processing) and "recall" (reliance on prior knowledge). This validates VisQA datasets for evaluation, highlighting where models are truly "seeing", positively or negatively affected by factual recall, or relying on inductive biases for question answering. Our study underscores the need for careful consideration in designing future visualization understanding studies when utilizing MLLMs.

Paperid: 969, https://arxiv.org/pdf/2504.08633.pdf

Abstract:
Generative artificial intelligence (AI), particularly transformer-based models, presents new opportunities for automating and augmenting engineering design workflows. However, effectively integrating these models into interactive tools requires careful interface design that leverages their unique capabilities. This paper introduces a transformer model tailored for gear train assembly design, paired with two novel interaction modes: Explore and Copilot. Explore Mode uses probabilistic sampling to generate and evaluate diverse design alternatives, while Copilot Mode utilizes autoregressive prediction to support iterative, context-aware refinement. These modes emphasize key transformer properties (sequence-based generation and probabilistic exploration) to facilitate intuitive and efficient human-AI collaboration. Through a case study, we demonstrate how well-designed interfaces can enhance engineers' ability to balance automation with domain expertise. A user study shows that Explore Mode supports rapid exploration and problem redefinition, while Copilot Mode provides greater control and fosters deeper engagement. Our results suggest that hybrid workflows combining both modes can effectively support complex, creative engineering design processes.
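
The contrast between the two transformer properties named above, probabilistic sampling and autoregressive prediction, can be sketched with a toy next-token table. Everything here is hypothetical: the vocabulary, the logits, and the `generate` helper are invented for illustration and say nothing about how the paper's model encodes gear trains. Greedy decoding stands in for a single Copilot-style continuation, and seeded sampling stands in for Explore-style alternatives.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and next-token logits keyed by the previous token.
VOCAB = ["gear_small", "gear_large", "shaft", "end"]
LOGITS = {
    None: [2.0, 1.5, 0.1, -5.0],
    "gear_small": [0.1, 1.0, 2.0, 0.0],
    "gear_large": [1.0, 0.1, 2.0, 0.0],
    "shaft": [1.5, 1.4, -1.0, 1.0],
}


def generate(rng=None, max_len=6, temperature=1.0):
    """Autoregressive generation: greedy if rng is None, sampled otherwise."""
    sequence, last = [], None
    while len(sequence) < max_len:
        probs = softmax(LOGITS[last], temperature)
        if rng is None:
            index = max(range(len(probs)), key=probs.__getitem__)
        else:
            index = rng.choices(range(len(probs)), weights=probs)[0]
        token = VOCAB[index]
        if token == "end":
            break
        sequence.append(token)
        last = token
    return sequence


if __name__ == "__main__":
    print("greedy:", generate())
    rng = random.Random(0)
    for _ in range(3):
        print("sampled:", generate(rng=rng, temperature=1.5))
```

Raising the temperature widens the spread of sampled alternatives, which is one common way to trade diversity against likelihood in such a setup.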

Paperid: 970, https://arxiv.org/pdf/2504.07423.pdf

Abstract:
As AI-based clinical decision support (AI-CDS) is introduced in more and more aspects of healthcare services, HCI research plays an increasingly important role in designing for complementarity between AI and clinicians. However, current evaluations of AI-CDS often fail to capture when AI is and is not useful to clinicians. This position paper reflects on our work and influential AI-CDS literature to advocate for moving beyond evaluation metrics like Trust, Reliance, Acceptance, and Performance on the AI's task (what we term the "trap" of human-AI collaboration). Although these metrics can be meaningful in some simple scenarios, we argue that optimizing for them ignores important ways that AI falls short of clinical benefit, as well as ways that clinicians successfully use AI. As the fields of HCI and AI in healthcare develop new ways to design and evaluate CDS tools, we call on the community to prioritize ecologically valid, domain-appropriate study setups that measure the emergent forms of value that AI can bring to healthcare professionals.

Paperid: 971, https://arxiv.org/pdf/2503.24021.pdf

Abstract:
Genomics data is essential in biological and medical domains, and bioinformatics analysts often manually create circos plots to analyze the data and extract valuable insights. However, creating circos plots is complex, as it requires careful design for multiple track attributes and positional relationships between them. Typically, analysts seek inspiration from existing circos plots, and they have to iteratively adjust and refine the plot to achieve a satisfactory final design, making the process both tedious and time-intensive. To address these challenges, we propose IntelliCircos, an AI-powered interactive authoring tool that streamlines the process from initial visual design to the final implementation of circos plots. Specifically, we build a new dataset containing 4396 circos plots with corresponding annotations and configurations, which are extracted and labeled from published papers. With the dataset, we further identify track combination patterns, and utilize a Large Language Model (LLM) to provide domain-specific design recommendations and configuration references to navigate the design of circos plots. We conduct a user study with 8 bioinformatics analysts to evaluate IntelliCircos, and the results demonstrate its usability and effectiveness in authoring circos plots.

Paperid: 972, https://arxiv.org/pdf/2503.22024.pdf

Abstract:
Virtual reality (VR) presents immersive opportunities across many applications, yet the inherent risk of developing cybersickness during interaction can severely reduce enjoyment and platform adoption. Cybersickness is marked by symptoms such as dizziness and nausea, which previous work primarily assessed via subjective post-immersion questionnaires and motion-restricted controlled setups. In this paper, we investigate the dynamic nature of cybersickness while users experience and freely interact in VR. We propose a novel method to continuously identify and quantitatively gauge cybersickness levels from users' passively monitored electroencephalography (EEG) and head motion signals. Our method estimates multitaper spectra from EEG, integrating specialized EEG processing techniques to counter motion artifacts, and thus tracks cybersickness levels in real time. Unlike previous approaches, our method requires no user-specific calibration or personalization for detecting cybersickness. Our work addresses the considerable challenge of reproducibility and subjectivity in cybersickness research.
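
As a rough illustration of multitaper spectral estimation, the sketch below averages periodograms computed with several orthogonal tapers. It uses sine tapers because they have a closed form and need no external library; the abstract does not say which taper family, taper count, or window length the authors use, so those choices and the function names are assumptions. The motion-artifact handling mentioned in the abstract is not modelled.

```python
import cmath
import math


def sine_tapers(n, k):
    """Return k orthonormal sine tapers of length n."""
    scale = math.sqrt(2.0 / (n + 1))
    return [
        [scale * math.sin(math.pi * (j + 1) * (t + 1) / (n + 1)) for t in range(n)]
        for j in range(k)
    ]


def multitaper_psd(x, fs, k=4):
    """Average the tapered periodograms of x over k sine tapers.

    Returns (frequencies in Hz, power estimates) for non-negative frequencies.
    Uses a direct DFT, which is fine for short windows.
    """
    n = len(x)
    n_freqs = n // 2 + 1
    psd = [0.0] * n_freqs
    for taper in sine_tapers(n, k):
        tapered = [a * b for a, b in zip(x, taper)]
        for f in range(n_freqs):
            coeff = sum(
                v * cmath.exp(-2j * math.pi * f * t / n)
                for t, v in enumerate(tapered)
            )
            psd[f] += abs(coeff) ** 2 / (fs * k)
    freqs = [f * fs / n for f in range(n_freqs)]
    return freqs, psd


if __name__ == "__main__":
    # Synthetic one-second "EEG" window: a 10 Hz oscillation sampled at 128 Hz.
    fs, n = 128.0, 128
    signal = [math.sin(2 * math.pi * 10.0 * t / fs) for t in range(n)]
    freqs, psd = multitaper_psd(signal, fs)
    peak = freqs[max(range(len(psd)), key=psd.__getitem__)]
    print("peak frequency (Hz):", peak)
```

Averaging over tapers lowers the variance of the estimate at the cost of some frequency smoothing, so the peak of a pure tone is spread over a few neighbouring bins.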

Paperid: 973, https://arxiv.org/pdf/2503.16841.pdf

Abstract:
Despite decades of advancements in automated ligand screening, large-scale drug discovery remains resource-intensive and requires post-processing hit selection, a step where chemists manually select a few promising molecules based on their chemical intuition. This creates a major bottleneck in the virtual screening process for drug discovery, requiring experts to repeatedly balance complex trade-offs among drug properties across a vast pool of candidates. To improve the efficiency and reliability of this process, we propose a novel human-centered framework named CheapVS that allows chemists to guide the ligand selection process by providing preferences regarding the trade-offs between drug properties via pairwise comparison. Our framework combines preferential multi-objective Bayesian optimization with a docking model for measuring binding affinity to capture human chemical intuition for improving hit identification. Specifically, on a library of 100K chemical candidates targeting EGFR and DRD2, CheapVS outperforms state-of-the-art screening methods in identifying drugs within a limited computational budget. Notably, our method can recover up to 16/37 EGFR and 37/58 DRD2 known drugs while screening only 6% of the library, showcasing its potential to significantly advance drug discovery.
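
The reported numbers can be put in perspective with a short calculation. The sketch below only restates the figures from the abstract (100K candidates, 6% screened, 16/37 and 37/58 known drugs recovered); the "enrichment" ratio is an added illustration that assumes a random screen of the same size would recover about 6% of the known drugs in expectation, which is not a baseline reported in the abstract.

```python
LIBRARY_SIZE = 100_000
SCREENED_FRACTION = 0.06


def recall_and_enrichment(found, known, fraction):
    """Return (recall, recall divided by the fraction of the library screened)."""
    recall = found / known
    return recall, recall / fraction


# Number of ligands actually screened under the stated budget.
screened = round(LIBRARY_SIZE * SCREENED_FRACTION)

egfr_recall, egfr_enrichment = recall_and_enrichment(16, 37, SCREENED_FRACTION)
drd2_recall, drd2_enrichment = recall_and_enrichment(37, 58, SCREENED_FRACTION)

if __name__ == "__main__":
    print(f"ligands screened: {screened}")
    print(f"EGFR recall {egfr_recall:.1%}, enrichment {egfr_enrichment:.1f}x")
    print(f"DRD2 recall {drd2_recall:.1%}, enrichment {drd2_enrichment:.1f}x")
```

Under that assumption, recovering roughly 43% and 64% of the known drugs from 6,000 screened ligands corresponds to about 7x and 11x more than the random expectation.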

Paperid: 974, https://arxiv.org/pdf/2503.16739.pdf

Abstract:
Maintaining engagement in immersive meetings is challenging, particularly when users must catch up on missed content after disruptions. While transcription interfaces can help, table-fixed panels have the potential to distract users from the group, diminishing social presence, while avatar-fixed captions fail to provide past context. We present EngageSync, a context-aware avatar-fixed transcription interface that adapts based on user engagement, offering live transcriptions and LLM-generated summaries to enhance catching up while preserving social presence. We implemented a live VR meeting setup for a 12-participant formative study and elicited design considerations. In two user studies with small (3 avatars) and mid-sized (7 avatars) groups, EngageSync significantly improved social presence (p < .05) and time spent gazing at others in the group instead of the interface over table-fixed panels. Also, it reduced re-engagement time and increased information recall (p < .05) over avatar-fixed interfaces, with stronger effects in mid-sized groups (p < .01).

Paperid: 975, https://arxiv.org/pdf/2503.14049.pdf

Abstract:
Future surgical care demands real-time, integrated data to drive informed decision-making and improve patient outcomes. The pressing need for seamless and efficient data capture in the operating room (OR) motivates our development of a modular solution that bridges the gap between emerging machine learning techniques and interventional medicine. We introduce a network of edge devices, called Data Hubs (DHs), that interconnect diverse medical sensors, imaging systems, and robotic tools via optical fiber and a centralized network switch. Built on the NVIDIA Jetson Orin NX, each DH supports multiple interfaces (HDMI, USB-C, Ethernet) and encapsulates device-specific drivers within Docker containers using the Isaac ROS framework and ROS2. A centralized user interface enables straightforward configuration and real-time monitoring, while an NVIDIA DGX computer provides state-of-the-art data processing and storage. We validate our approach through an ultrasound-based 3D anatomical reconstruction experiment that combines medical imaging, pose tracking, and RGB-D data acquisition.

Paperid: 976, https://arxiv.org/pdf/2503.07970.pdf

Abstract:
AI systems and tools today can generate human-like expressions on behalf of people. This raises the crucial question of how to sustain human agency in AI-mediated communication. We investigated this question in the context of machine translation (MT) assisted conversations. Our participants included 45 dyads. Each dyad consisted of one new immigrant in the United States, who leveraged MT for English information seeking as a non-native speaker, and one local native speaker, who acted as the information provider. Non-native speakers could influence the English production of their message in one of three ways: labeling the quality of MT outputs, regular post-editing without additional hints, or augmented post-editing with LLM-generated hints. Our data revealed a greater exercise of non-native speakers' agency under the two post-editing conditions. This benefit, however, came at a significant cost to the dyadic-level communication performance. We derived insights for MT and other generative AI design from our findings.

Paperid: 977, https://arxiv.org/pdf/2503.07690.pdf

Abstract:
Digital deliberation has expanded democratic participation, yet challenges remain. These include processing information at scale, moderating discussions, fact-checking, and attracting people to participate. Recent advances in artificial intelligence (AI) offer potential solutions, but public perceptions of AI's role in deliberation remain underexplored. Beyond efficiency, democratic deliberation is about voice and recognition. If AI is integrated into deliberation, public trust, acceptance, and willingness to participate may be affected. We conducted a preregistered survey experiment with a representative sample in Germany (n=1850) to examine how information about AI-enabled deliberation influences willingness to participate and perceptions of deliberative quality. Respondents were randomly assigned to treatments that provided them information about deliberative tasks facilitated by either AI or humans. Our findings reveal a significant AI penalty. Participants were less willing to engage in AI-facilitated deliberation and rated its quality lower than human-led formats. These effects were moderated by individual predispositions. Perceptions of AI's societal benefits and anthropomorphization of AI showed positive interaction effects on people's interest in participating in AI-enabled deliberative formats and on positive quality assessments, while AI risk assessments showed negative interactions with information about AI-enabled deliberation. These results suggest AI-enabled deliberation faces substantial public skepticism, potentially even introducing a new deliberative divide. Unlike traditional participation gaps based on education or demographics, this divide is shaped by attitudes toward AI. As democratic engagement increasingly moves online, ensuring AI's role in deliberation does not discourage participation or deepen inequalities will be a key challenge for future research and policy.

Paperid: 978, https://arxiv.org/pdf/2503.06349.pdf

Abstract:
Resistive tactile sensing gloves have captured the interest of researchers spanning diverse domains, such as robotics, healthcare, and human-computer interaction. However, existing fabrication methods often require labor-intensive assembly or costly equipment, limiting accessibility. Leveraging flexible printed circuit board (FPCB) technology, we present an automated pipeline for generating resistive tactile sensing glove design files solely from a simple hand photo on legal-size paper, which can be readily supplied to commercial board houses for manufacturing. Our method enables cost-effective, accessible production at under $130 per glove with sensor assembly times under 15 minutes. Sensor performance was characterized under varying pressure loads, and a preliminary user evaluation showcases four unique automatically manufactured designs, evaluated for their reliability and comfort.

Paperid: 979, https://arxiv.org/pdf/2503.06195.pdf

Abstract:
The integration of Artificial Intelligence (AI) into Integrated Development Environments (IDEs) is reshaping software development, fundamentally altering how developers interact with their tools. This shift marks the emergence of Human-AI Experience in Integrated Development Environment (in-IDE HAX), a field that explores the evolving dynamics of Human-Computer Interaction in AI-assisted coding environments. Despite rapid adoption, research on in-IDE HAX remains fragmented, which highlights the need for a unified overview of current practices, challenges, and opportunities. To provide a structured overview of existing research, we conduct a systematic literature review of 90 studies, summarizing current findings and outlining areas for further investigation. We organize key insights from reviewed studies into three aspects: Impact, Design, and Quality of AI-based systems inside IDEs. Impact findings show that AI-assisted coding enhances developer productivity but also introduces challenges, such as verification overhead and over-reliance. Design studies show that effective interfaces surface context, provide explanations and transparency of suggestions, and support user control. Quality studies document risks in correctness, maintainability, and security. For future research, priorities include productivity studies, design of assistance, and audit of AI-generated code. The agenda calls for larger and longer evaluations, stronger audit and verification assets, broader coverage across the software life cycle, and adaptive assistance under user control.

Paperid: 980, https://arxiv.org/pdf/2503.06122.pdf

Abstract:
The essence of intangible cultural heritage (ICH) lies in the living knowledge and skills passed down through generations. Daily practice plays a vital role in revitalizing ICH by fostering continuous learning and improvement. However, limited resources and accessibility pose significant challenges to sustaining such practice. Virtual reality (VR) has shown promise in supporting extensive skill training. Unlike technical skill training, ICH daily practice prioritizes cultivating a deeper understanding of cultural meanings and values. This study explores VR's potential in facilitating ICH daily practice through a case study of Traditional Chinese Flower Arrangement (TCFA). By investigating TCFA learners' challenges and expectations, we designed and evaluated FloraJing, a VR system enriched with cultural elements to support sustained TCFA practice. Findings reveal that FloraJing promotes progressive reflection and continuously enhances technical improvement and cultural understanding. We further propose design implications for VR applications aimed at fostering ICH daily practice in both knowledge and skills.

Paperid: 981, https://arxiv.org/pdf/2503.04761.pdf

Abstract:
Despite widespread speculation about artificial intelligence's impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor's O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all total usage. However, usage of AI extends more broadly across the economy, with approximately 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI's evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.

Paperid: 982, https://arxiv.org/pdf/2503.03067.pdf

Abstract:
This paper explores the acceptance of human-AI love among young adults, particularly focusing on Chinese women in romantic or intimate relationships with AI companions. Through qualitative research, including 14 semi-structured interviews, the study investigates how these individuals establish and maintain relationships with AI, their perceptions and attitudes towards these entities, and the perspectives of other stakeholders. Key findings reveal that users engage with AI companions for emotional comfort, stress relief, and to avoid social pressures. We identify various roles users assign to AI companions, such as friends, mentors, or romantic partners, and highlight the importance of customization and emotional support in these interactions. While AI companions offer advantages like emotional stability and constant availability, they also face limitations in emotional depth and understanding. The research underscores the need for ethical considerations and regulatory frameworks to address privacy concerns and prevent over-immersion in AI relationships. Future work should explore the long-term psychological impacts and evolving dynamics of human-AI relationships as technology advances.

Paperid: 983, https://arxiv.org/pdf/2503.02007.pdf

Abstract:
Recent work in Generative AI enables the stylization of 3D models based on image prompts. However, these methods do not incorporate tactile information, leading to designs that lack the expected tactile properties. We present TactStyle, a system that allows creators to stylize 3D models with images while incorporating the expected tactile properties. TactStyle accomplishes this using a modified image-generation model fine-tuned to generate heightfields for given surface textures. By optimizing 3D model surfaces to embody a generated texture, TactStyle creates models that match the desired style and replicate the tactile experience. We utilize a large-scale dataset of textures to train our texture generation model. In a psychophysical experiment, we evaluate the tactile qualities of a set of 3D-printed original textures and TactStyle's generated textures. Our results show that TactStyle successfully generates a wide range of tactile features from a single image input, enabling a novel approach to haptic design.

Paperid: 984, https://arxiv.org/pdf/2503.01015.pdf

Abstract:
Virtual Reality (VR) interfaces often rely on linear ray-casting for object selection but struggle with precision in dense or occluded environments. This late-breaking work introduces an optimized dual-layered selection mechanism combining dynamic Bezier Curves, controlled via finger gestures, with on-body interaction surfaces to enhance precision and immersion. Bezier Curves offer fine-grained control and flexibility in complex scenarios, while on-body surfaces project nearby virtual objects onto the user's forearm, leveraging proprioception and tactile feedback. A preliminary qualitative study (N = 24) compared two interaction paradigms (Bezier Curve vs. Linear Ray) and two interaction media (On-body vs. Mid-air). Participants praised the Bezier Curve's ability to target occluded objects but noted the physical demand. On-body interactions were favored for their immersive qualities, while mid-air interactions were appreciated for maintaining focus on the virtual scene. These findings highlight the importance of balancing ease of learning and precise control when designing VR selection techniques, opening avenues for further exploration of curve-based and on-body interactions in dense virtual environments.
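
Why a curved selection path can reach an occluded object where a straight ray cannot is easy to show with a small geometric sketch. It assumes a quadratic Bezier curve whose middle control point is displaced sideways (for instance by a finger gesture) and spherical targets; the curve order, the scene coordinates, and the helper names are hypothetical and not taken from the paper.

```python
import math


def quadratic_bezier(p0, p1, p2, t):
    """Evaluate a quadratic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return tuple(u * u * a + 2 * u * t * b + t * t * c for a, b, c in zip(p0, p1, p2))


def first_hit(p0, p1, p2, targets, radius, samples=100):
    """Walk along the curve and return the first spherical target it enters."""
    for i in range(samples + 1):
        point = quadratic_bezier(p0, p1, p2, i / samples)
        for name, center in targets.items():
            if math.dist(point, center) <= radius:
                return name
    return None


# Made-up scene: an occluder sits directly between the hand and the target.
HAND = (0.0, 0.0, 0.0)
END = (0.0, 0.0, 2.0)
TARGETS = {"occluder": (0.0, 0.0, 1.0), "target": (0.0, 0.0, 2.0)}

if __name__ == "__main__":
    # Control point on the straight line: the curve degenerates to a linear ray.
    print("linear ray hits:", first_hit(HAND, (0.0, 0.0, 1.0), END, TARGETS, 0.3))
    # Control point pushed sideways: the curve bends around the occluder.
    print("bent curve hits:", first_hit(HAND, (1.5, 0.0, 1.0), END, TARGETS, 0.3))
```

With the control point on the line between the endpoints the path is a plain ray and stops at the occluder; moving the control point off-axis lets the same endpoints select the object behind it.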

Paperid: 985, https://arxiv.org/pdf/2503.00303.pdf

Abstract:
Artificial intelligence explanations can make complex predictive models more comprehensible. To be effective, however, they should anticipate and mitigate possible misinterpretations, e.g., arising when users infer incorrect information that is not explicitly conveyed. To this end, we propose complementary explanations -- a novel method that pairs explanations to compensate for their respective limitations. A complementary explanation adds insights that clarify potential misconceptions stemming from the primary explanation while ensuring their coherency and avoiding redundancy. We introduce a framework for designing and evaluating complementary explanation pairs based on pertinent qualitative properties and quantitative metrics. Our approach makes it possible to construct complementary explanations that minimise the chance of misinterpretation.

Paperid: 986, https://arxiv.org/pdf/2503.00283.pdf

Abstract:
Facial expressions are vital in human communication and significantly influence outcomes in human-robot interaction (HRI), such as likeability, trust, and companionship. However, current methods for generating robotic facial expressions are often labor-intensive, lack adaptability across contexts and platforms, and have limited expressive ranges--leading to repetitive behaviors that reduce interaction quality, particularly in long-term scenarios. We introduce Xpress, a system that leverages language models (LMs) to dynamically generate context-aware facial expressions for robots through a three-phase process: encoding temporal flow, conditioning expressions on context, and generating facial expression code. We demonstrated Xpress as a proof-of-concept through two user studies (n=15x2) and a case study with children and parents (n=13), in storytelling and conversational scenarios to assess the system's context-awareness, expressiveness, and dynamism. Results demonstrate Xpress's ability to dynamically produce expressive and contextually appropriate facial expressions, highlighting its versatility and potential in HRI applications.

Paperid: 987, https://arxiv.org/pdf/2502.20944.pdf

Abstract:
With the increasing spread of AR head-mounted displays suitable for everyday use, interaction with information becomes ubiquitous, even while walking. However, this requires constant shifts of our attention between walking and interacting with virtual information to fulfill both tasks adequately. Accordingly, we as a community need a thorough understanding of the mutual influences of walking and interacting with digital information to design safe yet effective interactions. Thus, we systematically investigate the effects of different AR anchors (hand, head, torso) and task difficulties on user experience and performance. We engage participants (n=26) in a dual-task paradigm involving a visual working memory task while walking. We assess the impact of dual-tasking on both virtual and walking performance, and subjective evaluations of mental and physical load. Our results show that head-anchored AR content least affected walking while allowing for fast and accurate virtual task interaction, while hand-anchored content increased reaction times and workload.

Paperid: 988, https://arxiv.org/pdf/2502.20130.pdf

Abstract:
Understanding the classifications of deep neural networks, e.g. used in safety-critical situations, is becoming increasingly important. While recent models can locally explain a single decision, to provide a faithful global explanation about an accurate model's general behavior is a more challenging open task. Towards that goal, we introduce the Quadratic Programming Enhanced Model (QPM), which learns globally interpretable class representations. QPM represents every class with a binary assignment of very few, typically 5, features, that are also assigned to other classes, ensuring easily comparable contrastive class representations. This compact binary assignment is found using discrete optimization based on predefined similarity measures and interpretability constraints. The resulting optimal assignment is used to fine-tune the diverse features, so that each of them becomes the shared general concept between the assigned classes. Extensive evaluations show that QPM delivers unprecedented global interpretability across small and large-scale datasets while setting the state of the art for the accuracy of interpretable models.

Paperid: 989, https://arxiv.org/pdf/2502.18681.pdf

Abstract:
Understanding collaborative writing dynamics between native speakers (NS) and non-native speakers (NNS) is critical for enhancing collaboration quality and team inclusivity. In this paper, we partnered with communication researchers to develop visual analytics solutions for comparing NS and NNS behaviors in 162 writing sessions across 27 teams. The primary challenges in analyzing writing behaviors are data complexity and the uncertainties introduced by automated methods. In response, we present COALA, a novel visual analytics tool that improves model interpretability by displaying uncertainties in author clusters, generating behavior summaries using large language models, and visualizing writing-related actions at multiple granularities. We validated the effectiveness of COALA through user studies with domain experts (N=2+2) and researchers with relevant experience (N=8). We present the insights discovered by participants using COALA, suggest features for future AI-assisted collaborative writing tools, and discuss the broader implications for analyzing collaborative processes beyond writing.

Paperid: 990, https://arxiv.org/pdf/2502.18120.pdf

Abstract:
Creating games together is both a playful and effective way to develop skills in computational thinking, collaboration, and more. However, game development can be challenging for younger developers who lack formal training. While teenage developers frequently turn to online communities for peer support, their experiences may vary. To better understand the benefits and challenges teens face within online developer communities, we conducted interviews with 18 teenagers who created games or elements in Roblox and received peer support from one or more online Roblox developer communities. Our findings show that developer communities provide teens with valuable resources for technical, social, and career growth. However, teenagers also struggle with inter-user conflicts and a lack of community structure, leading to difficulties in handling complex issues that may arise, such as financial scams. Based on these insights, we propose takeaways for creating positive and safe online spaces for teenage game creators.

Paperid: 991, https://arxiv.org/pdf/2502.17935.pdf

Abstract:
In-game team communication in online multiplayer games has shown the potential to foster efficient collaboration and positive social interactions. Yet players often associate communication within ad hoc teams with frustration and wariness. Though previous works have quantitatively analyzed communication patterns at scale, few have identified the motivations of how a player makes in-the-moment communication decisions. In this paper, we conducted an observation study with 22 League of Legends players by interviewing them during Solo Ranked games on their use of four in-game communication media (chat, pings, emotes, votes). We performed thematic analysis to understand players' in-context assessment and perception of communication attempts. We demonstrate that players evaluate communication opportunities on proximate game states bound by player expectations and norms. Our findings illustrate players' tendency to view communication, regardless of its content, as a precursor to team breakdowns. We build upon these findings to motivate effective player-oriented communication design in online games.

Paperid: 992, https://arxiv.org/pdf/2502.17346.pdf

Abstract:
The integration of Digital Twins with Extended Reality technologies, such as Virtual Reality and Augmented Reality, is transforming industries by enabling more immersive, interactive experiences and enhancing real-time decision-making. User-centered evaluations are crucial for aligning XR-enhanced DT systems with user expectations, enhancing acceptance and utility in real-world settings. This paper proposes a user-centric evaluation method for XR-enhanced DT applications to assess usability, cognitive load, and user experience. By employing a range of assessment tools, including questionnaires and observational studies across various use cases, such as virtual tourism, city planning, and industrial maintenance, this method provides a structured approach to capturing the user's perspective.

Paperid: 993, https://arxiv.org/pdf/2502.14869.pdf

Abstract:
The potential for negative impacts of AI has rapidly become more pervasive around the world, and this has intensified a need for responsible AI governance. While many regulatory bodies endorse risk-based approaches and a multitude of risk mitigation practices are proposed by companies and academic scholars, these approaches are commonly expert-centered and thus lack the inclusion of a significant group of stakeholders. Ensuring that AI policies align with democratic expectations requires methods that prioritize the voices and needs of those impacted. In this work we develop a participative and forward-looking approach to inform policy-makers and academics that grounds the needs of lay stakeholders at the forefront and enriches the development of risk mitigation strategies. Our approach (1) maps potential mitigation and prevention strategies of negative AI impacts that assign responsibility to various stakeholders, (2) explores the importance and prioritization thereof in the eyes of laypeople, and (3) presents these insights in policy fact sheets, i.e., a digestible format for informing policy processes. We emphasize that this approach is not targeted towards replacing policy-makers; rather our aim is to present an informative method that enriches mitigation strategies and enables a more participatory approach to policy development.

Paperid: 994, https://arxiv.org/pdf/2502.14163.pdf

Abstract:
3D-printed models are increasingly used to provide people who are blind or have low vision (BLV) with access to maps, educational materials, and museum exhibits. Recent research has explored interactive 3D-printed models (I3Ms) that integrate touch gestures, conversational dialogue, and haptic vibratory feedback to create more engaging interfaces. Prior research with sighted people has found that imbuing machines with human-like behaviours, i.e., embodying them, can make them appear more lifelike, increasing social perception and presence. Such embodiment can increase engagement and trust. This work presents the first exploration into the design of embodied I3Ms and their impact on BLV engagement and trust. In a controlled study with 12 BLV participants, we found that I3Ms using specific embodiment design factors, such as haptic vibratory and embodied personified voices, led to an increased sense of liveliness and embodiment, as well as engagement, but had mixed impact on trust.

Paperid: 995, https://arxiv.org/pdf/2502.13255.pdf

Abstract:
PCB (printed circuit board) substrates are often single-use, leading to material waste in electronics making. We introduce PCB Renewal, a novel technique that "erases" and "reconfigures" PCB traces by selectively depositing conductive epoxy onto outdated areas, transforming isolated paths into conductive planes that support new traces. We present the PCB Renewal workflow, evaluate its electrical performance and mechanical durability, and model its sustainability impact, including material usage, cost, energy consumption, and time savings. We develop a software plug-in that guides epoxy deposition, generates updated PCB profiles, and calculates resource usage. To demonstrate PCB Renewal's effectiveness and versatility, we repurpose a single PCB across four design iterations spanning three projects: a camera roller, a WiFi radio, and an ESPboy game console. We also show how an outsourced double-layer PCB can be reconfigured, transforming it from an LED watch to an interactive cat toy. The paper concludes with limitations and future directions.

Paperid: 996, https://arxiv.org/pdf/2502.13254.pdf

Abstract:
The recent democratization of personal fabrication has significantly advanced the maker movement and reshaped applied research in HCI and beyond. However, this growth has also raised increasing sustainability concerns, as material waste is an inevitable byproduct of making and rapid prototyping. In this work, we examine the sustainability landscape within the modern maker community, focusing on grassroots makerspaces and maker-oriented research labs through in-depth interviews with diverse stakeholders involved in making and managing making-related activities. Our findings highlight four key themes: the various types of "waste" generated through the making process, the strategies (or lack thereof) for managing this waste, the motivations driving (un)sustainable practices, and the challenges faced. We synthesize these insights into design considerations and takeaways for technical HCI researchers and the broader community, focusing on future tools, infrastructures, and educational approaches to foster sustainable making.

Paperid: 997, https://arxiv.org/pdf/2502.12755.pdf

Abstract:
This paper introduces an advanced methodology for machine translation (MT) corpus generation, integrating semi-automated, human-in-the-loop post-editing with large language models (LLMs) to enhance efficiency and translation quality. Building upon previous work that utilized real-time training of a custom MT quality estimation metric, this system incorporates novel LLM features such as Enhanced Translation Synthesis and Assisted Annotation Analysis, which improve initial translation hypotheses and quality assessments, respectively. Additionally, the system employs LLM-Driven Pseudo Labeling and a Translation Recommendation System to reduce human annotator workload in specific contexts. These improvements not only retain the original benefits of cost reduction and enhanced post-edit quality but also open new avenues for leveraging cutting-edge LLM advancements. The project's source code is available for community use, promoting collaborative developments in the field. The demo video can be accessed here.

Paperid: 998, https://arxiv.org/pdf/2502.12504.pdf

Abstract:
Human prosocial cooperation is essential for our collective health, education, and welfare. However, designing social systems to maintain or incentivize prosocial behavior is challenging because people can act selfishly to maximize personal gain. This complex and unpredictable aspect of human behavior makes it difficult for policymakers to foresee the implications of their designs. Recently, multi-agent LLM systems have shown remarkable capabilities in simulating human-like behavior, and replicating some human lab experiments. This paper studies how well multi-agent systems can simulate prosocial human behavior, such as that seen in the public goods game (PGG), and whether multi-agent systems can exhibit "unbounded actions" seen outside the lab in real world scenarios. We find that multi-agent LLM systems successfully replicate human behavior from lab experiments of the public goods game with three experimental treatments: priming, transparency, and varying endowments. Beyond replicating existing experiments, we find that multi-agent LLM systems can replicate the expected human behavior when combining experimental treatments, even if no previous study combined those specific treatments. Lastly, we find that multi-agent systems can exhibit a rich set of unbounded actions that people do in the real world outside of the lab -- such as collaborating and even cheating. In sum, these studies are steps towards a future where LLMs can be used to inform policy decisions that encourage people to act in a prosocial manner.
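
For readers unfamiliar with the public goods game mentioned in this abstract, the sketch below shows the standard linear payoff rule: each player keeps whatever they do not contribute, and the pooled contributions are multiplied and split equally. The function name, the multiplier value, and the group size are illustrative assumptions; the abstract does not state the parameters used in the paper's experiments.

```python
from typing import List


def public_goods_payoffs(
    endowments: List[float],
    contributions: List[float],
    multiplier: float,
) -> List[float]:
    """Payoffs for one round of a standard linear public goods game.

    Hypothetical helper for illustration only; the parameters are
    assumptions, not values taken from the paper.
    """
    if len(endowments) != len(contributions) or not endowments:
        raise ValueError("need one contribution per player")
    for endowment, contribution in zip(endowments, contributions):
        if not 0 <= contribution <= endowment:
            raise ValueError("contribution must lie between 0 and the endowment")

    # The common pot is multiplied and then divided equally among all players.
    share = multiplier * sum(contributions) / len(contributions)

    # Each player keeps the uncontributed part of the endowment plus the share.
    return [e - c + share for e, c in zip(endowments, contributions)]


if __name__ == "__main__":
    # Three full contributors and one free rider, equal endowments.
    print(public_goods_payoffs([20, 20, 20, 20], [20, 20, 20, 0], 2.0))
```

In this example the free rider ends the round with more than the contributors, which is the tension between individual and collective gain that the game is designed to expose. Varying the `endowments` list is one simple way to model an unequal-endowment treatment.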

Paperid: 999, https://arxiv.org/pdf/2502.11627.pdf

Abstract:
The decline of social connectedness caused by distance and physical limitations severely affects older adults' well-being and mental health. While virtual reality (VR) is promising for older adults to socialize remotely, existing social VR designs primarily focus on verbal communication (e.g., reminiscence, chat). Actively engaging in shared activities is also an important aspect of social connection. We designed RemoteChess, which constructs a social community and a culturally relevant activity (i.e., Chinese chess) for older adults to play while engaging in social interaction. We conducted a user study with groups of older adults interacting with each other through RemoteChess. Our findings indicate that RemoteChess enhanced participants' social connectedness by offering familiar environments, culturally relevant social catalysts, and asymmetric interactions. We further discussed design guidelines for designing culturally relevant social activities in VR to promote social connectedness for older adults.

Paperid: 1000, https://arxiv.org/pdf/2502.11554.pdf

Abstract:
Metaphors play a critical role in shaping user experiences with Voice User Interfaces (VUIs), yet existing designs often rely on static, human-centric metaphors that fail to adapt to diverse contexts and user needs. This paper introduces Metaphor-Fluid Design, a novel approach that dynamically adjusts metaphorical representations based on conversational use-contexts. We compare this approach to a Default VUI, which characterizes the present implementation of commercial VUIs commonly designed around the persona of an assistant, offering a uniform interaction style across contexts. In Study 1 (N=130), metaphors were mapped to four key use-contexts (commands, information seeking, sociality, and error recovery) along the dimensions of formality and hierarchy, revealing distinct preferences for task-specific metaphorical designs. Study 2 (N=91) evaluates a Metaphor-Fluid VUI against a Default VUI, showing that the Metaphor-Fluid VUI enhances perceived intention to adopt, enjoyment, and likability by aligning better with user expectations for different contexts. However, individual differences in metaphor preferences highlight the need for personalization. These findings challenge the one-size-fits-all paradigm of VUI design and demonstrate the potential of Metaphor-Fluid Design to create more adaptive and engaging human-AI interactions.

Paperid: 1001, https://arxiv.org/pdf/2502.10830.pdf

Abstract:
Fingerspelling is a critical part of American Sign Language (ASL) recognition and has become an accessible optional text entry method for Deaf and Hard of Hearing (DHH) individuals. In this paper, we introduce SpellRing, a single smart ring worn on the thumb that recognizes words continuously fingerspelled in ASL. SpellRing uses active acoustic sensing (via a microphone and speaker) and an inertial measurement unit (IMU) to track handshape and movement, which are processed through a deep learning algorithm using Connectionist Temporal Classification (CTC) loss. We evaluated the system with 20 ASL signers (13 fluent and 7 learners), using the MacKenzie-Soukoreff Phrase Set of 1,164 words and 100 phrases. Offline evaluation yielded top-1 and top-5 word recognition accuracies of 82.45% (9.67%) and 92.42% (5.70%), respectively. In real-time, the system achieved a word error rate (WER) of 0.099 (0.039) on the phrases. Based on these results, we discuss key lessons and design implications for future minimally obtrusive ASL recognition wearables.
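
The word error rate reported in this abstract is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a recognized phrase and its reference, divided by the number of reference words. The sketch below implements that conventional definition; the abstract does not describe the paper's own scoring procedure, so the function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper using the conventional WER definition; not the
    paper's evaluation code.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    curr[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("the quick brown fox", "the quack brown"))
```

Because insertions count as errors, this measure can exceed 1.0 when the hypothesis is much longer than the reference.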

Paperid: 1002, https://arxiv.org/pdf/2502.10570.pdf

Abstract:
Mobile gaze tracking involves inferring a user's gaze point or direction on a mobile device's screen from facial images captured by the device's front camera. While this technology inspires an increasing number of gaze-interaction applications, achieving consistent accuracy remains challenging due to dynamic user-device spatial relationships and varied motion conditions inherent in mobile contexts. This paper provides empirical evidence on how user mobility and behaviour affect mobile gaze tracking accuracy. We conduct two user studies collecting behaviour and gaze data under various motion conditions - from lying to maze navigation - and during different interaction tasks. Quantitative analysis has revealed behavioural regularities among daily tasks and identified head distance, head pose, and device orientation as key factors affecting accuracy, with errors increasing by up to 48.91% in dynamic conditions compared to static ones. These findings highlight the need for more robust, adaptive eye-tracking systems that account for head movements and device deflection to maintain accuracy across diverse mobile contexts.

Paperid: 1003, https://arxiv.org/pdf/2502.10542.pdf

Abstract:
Drug overdose deaths, including those due to prescription opioids, represent a critical public health issue in the United States and worldwide. Artificial intelligence (AI) approaches have been developed and deployed to help prescribers assess a patient's risk for overdose-related death, but it is unknown whether public health experts can leverage similar predictions to make local resource allocation decisions more effectively. In this work, we evaluated how AI-based overdose risk assessment could be used to inform local public health decisions using a working prototype system. Experts from three health departments, of varying locations and sizes with respect to staff and population served, were receptive to the potential benefits of algorithmic risk prediction and of using AI-augmented visualization to connect across data sources. However, they also expressed concerns about whether the risk prediction model's formulation and underlying data would match the state of the overdose epidemic as it evolved in their specific locations. Our findings extend those of other studies on algorithmic systems in the public sector, and they present opportunities for future human-AI collaborative tools to support decision-making in local, time-varying contexts.

Paperid: 1004, https://arxiv.org/pdf/2502.10537.pdf

Abstract:
Analyzing data subgroups is a common data science task to build intuition about a dataset and identify areas to improve model performance. However, subgroup analysis is prohibitively difficult in datasets with many features, and existing tools limit unexpected discoveries by relying on user-defined or static subgroups. We propose exploratory subgroup analysis as a set of tasks in which practitioners discover, evaluate, and curate interesting subgroups to build understanding about datasets and models. To support these tasks we introduce Divisi, an interactive notebook-based tool underpinned by a fast approximate subgroup discovery algorithm. Divisi's interface allows data scientists to interactively re-rank and refine subgroups and to visualize their overlap and coverage in the novel Subgroup Map. Through a think-aloud study with 13 practitioners, we find that Divisi can help uncover surprising patterns in data features and their interactions, and that it encourages more thorough exploration of subtypes in complex data.

Paperid: 1005, https://arxiv.org/pdf/2502.10526.pdf

Abstract:
Temporal predictive models have the potential to improve decisions in health care, public services, and other domains, yet they often fail to effectively support decision-makers. Prior literature shows that many misalignments between model behavior and decision-makers' expectations stem from issues of model specification, namely how, when, and for whom predictions are made. However, model specifications for predictive tasks are highly technical and difficult for non-data-scientist stakeholders to interpret and critique. To address this challenge we developed Tempo, an interactive system that helps data scientists and domain experts collaboratively iterate on model specifications. Using Tempo's simple yet precise temporal query language, data scientists can quickly prototype specifications with greater transparency about pre-processing choices. Moreover, domain experts can assess performance within data subgroups to validate that models behave as expected. Through three case studies, we demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible specifications and identify more promising directions to explore.

Paperid: 1006, https://arxiv.org/pdf/2502.09849.pdf

Abstract:
Explainable AI (XAI) has become a crucial component of Clinical Decision Support Systems (CDSS) to enhance transparency, trust, and clinical adoption. However, while many XAI methods have been proposed, their effectiveness in real-world medical settings remains underexplored. This paper provides a survey of human-centered evaluations of Explainable AI methods in Clinical Decision Support Systems. By categorizing existing works based on XAI methodologies, evaluation frameworks, and clinical adoption challenges, we offer a structured understanding of the landscape. Our findings reveal key challenges in the integration of XAI into healthcare workflows and propose a structured framework to align the evaluation methods of XAI with the clinical needs of stakeholders.

Paperid: 1007, https://arxiv.org/pdf/2502.08621.pdf

Abstract:
Video storytelling is essential for sports performance analysis and fan engagement, enabling sports professionals and fans to effectively communicate and interpret the spatial and temporal dynamics of gameplay. Traditional methods rely on manual annotation and verbal explanations, placing significant demands on creators for video editing skills and on viewers for cognitive focus. However, these approaches are time-consuming and often struggle to accommodate individual needs. SportsBuddy addresses this gap with an intuitive, interactive video authoring tool. It combines player tracking, embedded interaction design, and timeline visualizations to seamlessly integrate narratives and visual cues within game contexts. This empowers users to effortlessly create context-driven video stories. Since its launch, over 150 sports users, including coaches, athletes, content creators, parents and fans, have utilized SportsBuddy to produce compelling game highlights for diverse use cases. User feedback highlights its accessibility and ease of use, making video storytelling and insight communication more attainable for diverse audiences. Case studies with collegiate teams and sports creators further demonstrate SportsBuddy's impact on enhancing coaching communication, game analysis, and fan engagement.

Paperid: 1008, https://arxiv.org/pdf/2502.06985.pdf

Abstract:
Online communities can offer many benefits for youth including peer learning, cultural expression, and skill development. However, most HCI research on youth-focused online communities has centered communities developed by adults for youth rather than by the youth themselves. In this work, we interviewed 11 teenagers (ages 13-17) who moderate online Discord communities created by youth, for youth. Participants were identified by Discord platform staff as leaders of well-moderated servers through an intensive exam and application-based process. We also interviewed 2 young adults who volunteered as mentors of some of our teen participants. We present our findings about the benefits, motivations, and risks of teen-led online communities, as well as the role of external stakeholders of these youth spaces. We contextualize our work within the broader teen online safety landscape to provide recommendations to better support, encourage, and protect teen moderators and their online communities. This empirical work contributes one of the first studies to date with teen Discord moderators and aims to empower safe youth-led online communities.

Paperid: 1009, https://arxiv.org/pdf/2502.05770.pdf

Abstract:
Generative AI (GenAI) has brought opportunities and challenges for higher education as it integrates into teaching and learning environments. As instructors navigate this new landscape, understanding their engagement with and attitudes toward GenAI is crucial. We surveyed 178 instructors from a single U.S. university to examine their current practices, perceptions, trust, and distrust of GenAI in higher education in March 2024. While most surveyed instructors reported moderate to high familiarity with GenAI-related concepts, their actual use of GenAI tools for direct instructional tasks remained limited. Our quantitative results show that trust and distrust in GenAI are related yet distinct; high trust does not necessarily imply low distrust, and vice versa. We also found significant differences in surveyed instructors' familiarity with GenAI across different trust and distrust groups. Our qualitative results show nuanced manifestations of trust and distrust among surveyed instructors and various approaches to support calibrated trust in GenAI. We discuss practical implications focused on (dis)trust calibration among instructors.

Paperid: 1010, https://arxiv.org/pdf/2502.03069.pdf

Abstract:
Generative AI in Virtual Reality offers the potential for collaborative object-building, yet challenges remain in aligning AI contributions with user expectations. In particular, users often struggle to understand and collaborate with AI when its actions are not transparently represented. This paper thus explores the co-creative object-building process through a Wizard-of-Oz study, focusing on how AI can effectively convey its intent to users during object customization in Virtual Reality. Inspired by human-to-human collaboration, we focus on three representation modes: the presence of an embodied avatar, whether the AI's contributions are visualized immediately or incrementally, and whether the areas modified are highlighted in advance. The findings provide insights into how these factors affect user perception and interaction with object-generating AI tools in Virtual Reality as well as satisfaction and ownership of the created objects. The results offer design implications for co-creative world-building systems, aiming to foster more effective and satisfying collaborations between humans and AI in Virtual Reality.

Paperid: 1011, https://arxiv.org/pdf/2502.02911.pdf

Abstract:
Prosocial behaviors, such as helping others, are well-known to enhance human well-being. While there is a growing trend of humans helping AI agents, it remains unclear whether the well-being benefits of helping others extend to interactions with non-human entities. To address this, we conducted an experiment (N = 295) to explore how helping AI agents impacts human well-being, especially when the agents fulfill human basic psychological needs--relatedness, competence, and autonomy--during the interaction. Our findings showed that helping AI agents reduced participants' feelings of loneliness. When AI met participants' needs for competence and autonomy during the helping process, there was a further decrease in loneliness and an increase in positive affect. However, when AI did not meet participants' need for relatedness, participants experienced an increase in positive affect. We discuss the implications of these findings for understanding how AI can support human well-being.

Paperid: 1012, https://arxiv.org/pdf/2502.00853.pdf

Abstract:
We built a spatial hybrid system that combines a personal computer (PC) and virtual reality (VR) for visual sensemaking, addressing limitations in both environments. Although VR offers immense potential for interactive data visualization (e.g., large display space and spatial navigation), it can also present challenges such as imprecise interactions and user fatigue. At the same time, a PC offers precise and familiar interactions but has limited display space and interaction modality. Therefore, we iteratively designed a spatial hybrid system (PC+VR) to complement these two environments by enabling seamless switching between PC and VR environments. To evaluate the system's effectiveness and user experience, we compared it to using a single computing environment (i.e., PC-only and VR-only). Our study results (N=18) showed that spatial PC+VR could combine the benefits of both devices to outperform user preference for VR-only without a negative impact on performance from device switching overhead. Finally, we discussed future design implications.

Paperid: 1013, https://arxiv.org/pdf/2502.00022.pdf

Abstract:
HRA (Human Reliability Analysis) data is crucial for advancing HRA methodologies. However, existing data collection methods lack the necessary granularity, and most approaches fail to capture dynamic features. Additionally, many methods require expert knowledge as input, making them time-consuming and labor-intensive. To address these challenges, we propose a new paradigm for the automated collection of HRA data. Our approach focuses on key indicators behind human error, specifically measuring workload in collaborative settings. This study introduces a novel, scenario-driven method for workload estimation, leveraging fine-tuned large language models (LLMs). By training LLMs on real-world operational data from high-temperature gas-cooled reactors (HTGRs), we simulate human behavior and cognitive load in real time across various collaborative scenarios. The method dynamically adapts to changes in operator workload, providing more accurate, flexible, and scalable workload estimates. The results demonstrate that the proposed WELLA (Workload Estimation with LLMs and Agents) outperforms existing commercial LLM-based methods in terms of prediction accuracy.

Paperid: 1014, https://arxiv.org/pdf/2501.16230.pdf

Abstract:
Emotion recognition using electroencephalogram (EEG) signals has broad potential across various domains. EEG signals have the ability to capture rich spatial information related to brain activity, yet effectively modeling and utilizing these spatial relationships remains a challenge. Existing methods struggle with simplistic spatial structure modeling, failing to capture complex node interactions, and lack generalizable spatial connection representations, failing to balance the dynamic nature of brain networks with the need for discriminative and generalizable features. To address these challenges, we propose the Multi-granularity Integration Network with Discrete Codebook for EEG-based Emotion Recognition (MIND-EEG). The framework employs a multi-granularity approach, integrating global and regional spatial information through a Global State Encoder, an Intra-Regional Functionality Encoder, and an Inter-Regional Interaction Encoder to comprehensively model brain activity. Additionally, we introduce a discrete codebook mechanism for constructing network structures via vector quantization, ensuring compact and meaningful brain network representations while mitigating over-smoothing and enhancing model generalization. The proposed framework effectively captures the dynamic and diverse nature of EEG signals, enabling robust emotion recognition. Extensive comparisons and analyses demonstrate the effectiveness of MIND-EEG, and the source code is publicly available at https://anonymous.4open.science/r/MIND_EEG.
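
As background for the vector quantization step named in this abstract, the sketch below shows the generic nearest-codeword lookup that underlies any discrete codebook: each continuous vector is replaced by the closest codebook entry. This is not the MIND-EEG implementation; in the paper the codebook is presumably learned and used to construct network structures, whereas the fixed toy codebook, the function name, and the squared Euclidean distance here are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def quantize(
    vectors: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each vector to its nearest codebook entry.

    Generic vector quantization lookup for illustration; not the
    MIND-EEG code. Distance is squared Euclidean (an assumption).
    """
    if not codebook:
        raise ValueError("codebook must not be empty")

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    indices: List[int] = []
    for vector in vectors:
        # Pick the index of the closest codeword; ties go to the lowest index.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(vector, codebook[k]),
        )
        indices.append(nearest)

    # The quantized output is the selected codeword for each input vector.
    quantized = [list(codebook[k]) for k in indices]
    return indices, quantized


if __name__ == "__main__":
    toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
    toy_vectors = [[0.1, -0.2], [0.9, 1.2]]
    print(quantize(toy_vectors, toy_codebook))
```

Restricting representations to a finite set of codewords is what makes them compact; how the codewords are trained is a separate question that this lookup does not address.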

Paperid: 1015, https://arxiv.org/pdf/2501.15727.pdf

Abstract:
Multimodal large language models (MLLMs), with their expansive world knowledge and reasoning capabilities, present a unique opportunity for end-users to create personalized AI sensors capable of reasoning about complex situations. A user could describe a desired sensing task in natural language (e.g., "alert if my toddler is getting into mischief"), with the MLLM analyzing the camera feed and responding within seconds. In a formative study, we found that users saw substantial value in defining their own sensors, yet struggled to articulate their unique personal requirements and debug the sensors through prompting alone. To address these challenges, we developed Gensors, a system that empowers users to define customized sensors supported by the reasoning capabilities of MLLMs. Gensors 1) assists users in eliciting requirements through both automatically-generated and manually created sensor criteria, 2) facilitates debugging by allowing users to isolate and test individual criteria in parallel, 3) suggests additional criteria based on user-provided images, and 4) proposes test cases to help users "stress test" sensors on potentially unforeseen scenarios. In a user study, participants reported significantly greater sense of control, understanding, and ease of communication when defining sensors using Gensors. Beyond addressing model limitations, Gensors supported users in debugging, eliciting requirements, and expressing unique personal requirements to the sensor through criteria-based reasoning; it also helped uncover users' "blind spots" by exposing overlooked criteria and revealing unanticipated failure modes. Finally, we discuss how unique characteristics of MLLMs--such as hallucinations and inconsistent responses--can impact the sensor-creation process. These findings contribute to the design of future intelligent sensing systems that are intuitive and customizable by everyday users.

Paperid: 1016, https://arxiv.org/pdf/2501.15028.pdf

Abstract:
People frequently exposed to health information on social media tend to overestimate their symptoms during online self-diagnosis due to availability bias. This may lead to incorrect self-medication and place additional burdens on healthcare providers to correct patients' misconceptions. In this work, we conducted two mixed-method studies to identify design goals for mitigating availability bias in online self-diagnosis. We investigated factors that distort self-assessment of symptoms after exposure to social media. We found that availability bias is pronounced when social media content resonates with individuals, leading them to disregard their own evidence. To address this, we developed and evaluated three chatbot-based symptom checkers designed to foster evidence-based self-reflection for bias mitigation, given their potential to encourage thoughtful responses. Results showed that chatbot-based symptom checkers with cognitive intervention strategies mitigated the impact of availability bias in online self-diagnosis.

Paperid: 1017, https://arxiv.org/pdf/2501.09530.pdf

Abstract:
Humans can play a more active role in improving their comfort in the built environment if given the right information at the right place and time. This paper outlines the use of Just-in-Time Adaptive Interventions (JITAI) implemented in the context of the built environment to provide information that helps humans minimize the impact of heat and noise on their daily lives. This framework is based on the open-source Cozie iOS smartwatch platform. It includes data collection through micro-surveys and intervention messages triggered by environmental, contextual, and personal history conditions. An eight-month deployment of the method was completed in Singapore with 103 participants who submitted more than 12,000 micro-surveys and had more than 3,600 JITAI intervention messages delivered to them. A weekly survey conducted during two deployment phases revealed an overall increase in perceived usefulness ranging from 8-19% over the first three weeks of data collection. For noise-related interventions, participants showed an overall increase in location changes ranging from 4-11% and a 2-17% increase in earphone use to mitigate noise distractions. For thermal comfort-related interventions, participants demonstrated a 3-13% increase in adjustments to their location or thermostat to feel more comfortable. The analysis found evidence that personality traits (such as conscientiousness), gender, and environmental preferences could be factors in determining the perceived helpfulness of JITAIs and influencing behavior change. These findings underscore the importance of tailoring intervention strategies to individual traits and environmental conditions, setting the stage for future research to refine the delivery, timing, and content of intervention messages.

Paperid: 1018, https://arxiv.org/pdf/2501.09233.pdf

Abstract:
Affordances, a foundational concept in human-computer interaction and design, have traditionally been explained by direct-perception theories, which assume that individuals perceive action possibilities directly from the environment. However, these theories fall short of explaining how affordances are perceived, learned, refined, or misperceived, and how users choose between multiple affordances in dynamic contexts. This paper introduces a novel affordance theory grounded in Computational Rationality, positing that humans construct internal representations of the world based on bounded sensory inputs. Within these internal models, affordances are inferred through two core mechanisms: feature recognition and hypothetical motion trajectories. Our theory redefines affordance perception as a decision-making process, driven by two components: confidence (the perceived likelihood of successfully executing an action) and predicted utility (the expected value of the outcome). By balancing these factors, individuals make informed decisions about which actions to take. Our theory frames affordance perception as dynamic, continuously learned, and refined through reinforcement and feedback. We validate the theory via thought experiments and demonstrate its applicability across diverse types of affordances (e.g., physical, digital, social). Beyond clarifying and generalizing the understanding of affordances across contexts, our theory serves as a foundation for improving design communication and guiding the development of more adaptive and intuitive systems that evolve with user capabilities.
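
One way to picture the decision process described in the abstract above is to score each candidate action by combining its confidence and predicted utility. The sketch below is an illustration only: multiplying the two factors (an expected-value rule) is an assumption, since the abstract says only that the factors are balanced, and the Affordance class, the choose_affordance function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Affordance:
    name: str
    confidence: float  # perceived likelihood of executing the action successfully, in [0, 1]
    utility: float     # predicted value of the outcome if the action succeeds


def choose_affordance(candidates):
    """Pick the candidate with the highest confidence-weighted utility.

    The product rule is an assumed way of balancing the two factors.
    """
    return max(candidates, key=lambda a: a.confidence * a.utility)


# Hypothetical candidates for getting into a room.
options = [
    Affordance("push door", confidence=0.9, utility=1.0),
    Affordance("pull door", confidence=0.4, utility=1.0),
    Affordance("climb through window", confidence=0.2, utility=3.0),
]

print(choose_affordance(options).name)
```

Under this rule a high-utility action can still lose to a more modest one when the perceived chance of success is low, which is the kind of trade-off the theory describes.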

Paperid: 1019, https://arxiv.org/pdf/2501.09204.pdf

Abstract:
The difficulty and consequent fear of travel is one of the most disabling consequences of blindness and severe vision impairment, affecting confidence and quality of life. Traditional tactile graphics are vital in the Orientation and Mobility training process; however, 3D printing may have the capacity to enable production of more meaningful and inclusive maps. This study explored the use of 3D printed maps on site at a public event to examine their suitability and to identify guidelines for the design of future 3D maps. An iterative design process was used in the production of the 3D maps, with feedback from visitors who are blind or have low vision informing the recommendations for their design and use. For example, it was found that many representational 3D icons could be recognised by touch without the need for a key and that such a map helped form mental models of the event space. Complex maps, however, require time to explore and should be made available before an event or at the entrance in a comfortable position. The maps were found to support the orientation and mobility process, and importantly to also promote a positive message about inclusion and accessibility.

Paperid: 1020, https://arxiv.org/pdf/2501.08561.pdf

Abstract:
In this paper, we propose an Adaptive Neuro-Symbolic Learning and Reasoning Framework for digital twin technology called "ANSR-DT". Digital twins in industrial environments often struggle with interpretability, real-time adaptation, and human input integration. Our approach addresses these challenges by combining CNN-LSTM dynamic event detection with reinforcement learning and symbolic reasoning to enable adaptive intelligence with interpretable decision processes. This integration enhances environmental understanding while promoting continuous learning, leading to more effective real-time decision-making in human-machine collaborative applications. We evaluated ANSR-DT on synthetic industrial data, observing significant improvements over traditional approaches, with up to 99.5% accuracy for dynamic pattern recognition. The framework demonstrated superior adaptability with extended reinforcement learning training, improving explained variance from 0.447 to 0.547. Future work aims at scaling to larger datasets to test rule management beyond the current 14 rules. Our open-source implementation promotes reproducibility and establishes a foundation for future research in adaptive, interpretable digital twins for industrial applications.

Paperid: 1021, https://arxiv.org/pdf/2501.06658.pdf

Abstract:
Assessing learners in ill-defined domains, such as scenario-based human tutoring training, is an area of limited research. Equity training requires a nuanced understanding of context, but do contemporary large language models (LLMs) have a knowledge base that can navigate these nuances? Legacy transformer models like BERT, in contrast, have less real-world knowledge but can be more easily fine-tuned than commercial LLMs. Here, we study whether fine-tuning BERT on human annotations outperforms state-of-the-art LLMs (GPT-4o and GPT-4-Turbo) with few-shot prompting and instruction. We evaluate performance on four prediction tasks involving generating and explaining open-ended responses in advocacy-focused training lessons in a higher education student population learning to become middle school tutors. Leveraging a dataset of 243 human-annotated open responses from tutor training lessons, we find that BERT demonstrates superior performance using an offline fine-tuning approach, which is more resource-efficient than commercial GPT models. We conclude that contemporary GPT models may not adequately capture nuanced response patterns, especially in complex tasks requiring explanation. This work advances the understanding of AI-driven learner evaluation under the lens of fine-tuning versus few-shot prompting on the nuanced task of equity training, contributing to more effective training solutions and assisting practitioners in choosing adequate assessment methods.

Paperid: 1022, https://arxiv.org/pdf/2501.05723.pdf

Abstract:
Effective error detection is crucial to prevent task disruption and maintain user trust. Traditional methods often rely on task-specific models or user reporting, which can be inflexible or slow. Recent research suggests social signals, naturally exhibited by users in response to robot errors, can enable more flexible, timely error detection. However, most studies rely on post hoc analysis, leaving their real-time effectiveness uncertain and lacking user-centric evaluation. In this work, we developed a proactive error detection system that combines user behavioral signals (facial action units and speech), user feedback, and error context for automatic error detection. In a study (N = 28), we compared our proactive system to a status quo reactive approach. Results show our system 1) reliably and flexibly detects errors, 2) detects errors faster than the reactive approach, and 3) is perceived more favorably by users than the reactive one. We discuss recommendations for enabling robot error awareness in future HRI systems.

Paperid: 1023, https://arxiv.org/pdf/2501.05610.pdf

Abstract:
Assistive mobile robots are a transformative technology that helps persons with disabilities regain the ability to move freely. Although autonomous wheelchairs significantly reduce user effort, they still require human input to allow users to maintain control and adapt to changing environments. Brain Computer Interface (BCI) stands out as a highly user-friendly option that does not require physical movement. Current BCI systems can understand whether users want to accelerate or decelerate, but they implement these changes in discrete speed steps rather than allowing for smooth, continuous velocity adjustments. This limitation prevents the systems from mimicking the natural, fluid speed changes seen in human self-paced motion. The authors aim to address this limitation by redesigning the perception-action cycle in a BCI controlled robotic system: improving how the robotic agent interprets the user's motion intentions (world state) and implementing these actions in a way that better reflects natural physical properties of motion, such as inertia and damping. The scope of this paper focuses on the perception aspect. We asked and answered a normative question: "what computation should the robotic agent carry out to optimally perceive incomplete or noisy sensory observations?" Empirical EEG data were collected, and probabilistic representations that served as world state distributions were learned and evaluated in a Generative Adversarial Network framework. A ROS framework was established and connected to a Gazebo environment containing a digital twin of an indoor space and a virtual model of a robotic wheelchair. Signal processing and statistical analyses were implemented to identify the most discriminative features in the spatial-spectral-temporal dimensions, which were then used to construct the world model for the robotic agent to interpret user motion intentions as a Bayesian observer.
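
The Bayesian-observer idea in the abstract above can be pictured with a discrete Bayes update over two intention states. This is a minimal sketch under stated assumptions: the state names, the prior, and the likelihood values are hypothetical, and the paper learns its world state distributions from EEG data rather than fixing them by hand.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over discrete states.

    The posterior is proportional to prior times likelihood, then
    normalized so the probabilities sum to 1.
    """
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    return {state: p / total for state, p in unnormalized.items()}


# Hypothetical values: equal prior belief, and an observation that is
# four times more likely under "accelerate" than under "decelerate".
prior = {"accelerate": 0.5, "decelerate": 0.5}
likelihood = {"accelerate": 0.8, "decelerate": 0.2}

belief = posterior(prior, likelihood)
print(belief)
```

Feeding the returned belief back in as the next prior gives a running estimate of intention as new noisy observations arrive.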

Paperid: 1024, https://arxiv.org/pdf/2501.05525.pdf

Abstract:
Motor execution, a fundamental aspect of human behavior, has been extensively studied using BCI technologies. EEG and fNIRS have been utilized to provide valuable insights, but their individual limitations have hindered performance. This study investigates the effectiveness of fusing electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) data for classifying rest versus task states in a motor execution paradigm. Using the SMR Hybrid BCI dataset, this work compares unimodal (EEG and fNIRS) classifiers with a multimodal fusion approach. It proposes Motor Execution using Convolutional Additive Self-Attention Mechanisms (MECASA), a novel architecture leveraging convolutional operations and self-attention to capture complex patterns in multimodal data. MECASA, built upon the CAS-ViT architecture, employs a computationally efficient, convolutional-based self-attention module (CASA), a hybrid block design, and a dedicated fusion network to combine features from separate EEG and fNIRS processing streams. Experimental results demonstrate that MECASA consistently outperforms established methods across all modalities (EEG, fNIRS, and fused), with fusion consistently improving accuracy compared to single-modality approaches. fNIRS generally achieved higher accuracy than EEG alone. Ablation studies revealed optimal configurations for MECASA, with embedding dimensions of 64-128 providing the best performance for EEG data and OD128 (upsampled optical density) yielding superior results for fNIRS data. This work highlights the potential of deep learning, specifically MECASA, to enhance EEG-fNIRS fusion for BCI applications.

Paperid: 1025, https://arxiv.org/pdf/2501.01471.pdf

Abstract:
This paper investigates physiological markers of Social Anxiety Disorder (SAD) by examining the relationship between Electrocardiogram (ECG) measurements and speech, a known anxiety-inducing activity. Specifically, we analyze changes in heart rate variability (HRV) and heart rate (HR) during four distinct phases: baseline, anticipation, speech activity, and reflection. Our study, involving 51 participants (31 with SAD and 20 without), found that HRV decreased and HR increased during the anticipation and speech activity phases compared to baseline. In contrast, during the reflection phase, HRV increased and HR decreased. Additionally, participants with SAD exhibited lower HRV, higher HR, and reported greater self-perceived anxiety compared to those without SAD. These findings have implications for developing wearable technology to monitor SAD. We also provide our dataset, which captures anxiety across multiple stages, to support further research in this area.
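
For readers unfamiliar with the two measures compared in the abstract above, the sketch below computes mean heart rate and one common HRV metric, RMSSD (root mean square of successive differences), from RR intervals. It is illustrative only: the abstract does not say which HRV metric the authors used, and the RR-interval values are made up, not taken from the paper's dataset.

```python
import math


def mean_heart_rate(rr_ms):
    """Mean heart rate in beats per minute from RR intervals in milliseconds."""
    return 60000.0 / (sum(rr_ms) / len(rr_ms))


def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Made-up RR intervals (ms) for two phases, chosen only to show the direction
# of change reported in the abstract.
baseline = [800, 850, 790, 860, 810]
speech = [600, 610, 605, 600, 608]

print(mean_heart_rate(baseline), rmssd(baseline))
print(mean_heart_rate(speech), rmssd(speech))
```

Shorter and more uniform RR intervals give a higher heart rate and a lower RMSSD, matching the pattern the study reports for the anticipation and speech phases.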

Paperid: 1026, https://arxiv.org/pdf/2501.00078.pdf

Abstract:
Artificial intelligence (AI) has enabled agents to master complex video games, from first-person shooters like Counter-Strike to real-time strategy games such as StarCraft II and racing games like Gran Turismo. While these achievements are notable, applying these AI methods in commercial video game production remains challenging due to computational constraints. In commercial scenarios, the majority of computational resources are allocated to 3D rendering, leaving limited capacity for AI methods, which often demand high computational power, particularly those relying on pixel-based sensors. Moreover, the gaming industry prioritizes creating human-like behavior in AI agents to enhance player experience, unlike academic models that focus on maximizing game performance. This paper introduces a novel methodology for training neural networks via imitation learning to play a complex, commercial-standard, VALORANT-like 2v2 tactical shooter game, requiring only modest CPU hardware during inference. Our approach leverages an innovative, pixel-free perception architecture using a small set of ray-cast sensors, which capture essential spatial information efficiently. These sensors allow AI to perform competently without the computational overhead of traditional methods. Models are trained to mimic human behavior using supervised learning on human trajectory data, resulting in realistic and engaging AI agents. Human evaluation tests confirm that our AI agents provide human-like gameplay experiences while operating efficiently under computational constraints. This offers a significant advancement in AI model development for tactical shooter games and possibly other genres.

Paperid: 1027, https://arxiv.org/pdf/2506.23545.pdf

Abstract:
Virtual, Augmented, and eXtended Reality (VR/AR/XR) technologies are increasingly recognized for their applications in training, diagnostics, and psychological research, particularly in high-risk and highly regulated environments. In this panel we discuss how immersive systems enhance human performance across multiple domains, including clinical psychology, space exploration, and medical education. In psychological research and training, XR can offer a controlled yet ecologically valid setting for measuring cognitive and affective processes. In space exploration, we discuss the development of VR-based astronaut training and diagnostic systems, allowing astronauts to perform real-time health assessments. In medical education and rehabilitation, we cover procedural training and patient engagement. From virtual surgical simulations to gamified rehabilitation exercises, immersive environments enhance both learning outcomes and treatment adherence.

Paperid: 1028, https://arxiv.org/pdf/2506.22893.pdf

Abstract:
After a very long winter, the Artificial Intelligence (AI) spring is here. Or, so it seems over the last three years. AI has the potential to impact many areas of human life - personal, social, health, education, professional. In this paper, we take a closer look at the potential of AI for Enterprises, where decision-making plays a crucial and repeated role across functions, tasks, and operations. We consider Agents imbued with AI as means to increase decision-productivity of enterprises. We highlight six tenets for Agentic success in enterprises, by drawing attention to what the current, AI-Centric User paradigm misses, in the face of persistent needs of and usefulness for Enterprise Decision-Making. In underscoring a shift to User-Centric AI, we offer six tenets and promote market mechanisms for platforms, aligning the design of AI and its delivery by Agents to the cause of enterprise users.

Paperid: 1029, https://arxiv.org/pdf/2506.22066.pdf

Abstract:
Operators performing high-stakes, safety-critical tasks - such as air traffic controllers, surgeons, or mission control personnel - must maintain exceptional cognitive performance under variable and often stressful conditions. This paper presents a phased methodological approach to building cognitive monitoring systems for such environments. By integrating insights from human factors research, simulation-based training, sensor technologies, and fundamental psychological principles, the proposed framework supports real-time performance assessment with minimum intrusion. The approach begins with simplified simulations and evolves towards operational contexts. Key challenges addressed include variability in workload, the effects of fatigue and stress, and the resulting need for adaptive monitoring and early warning support mechanisms. The methodology aims to improve situational awareness, reduce human error, and support decision-making without undermining operator autonomy. Ultimately, the work contributes to the development of resilient and transparent systems in domains where human performance is critical to safety.

Paperid: 1030, https://arxiv.org/pdf/2506.20664.pdf

Abstract:
As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the "mental" states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents.

Paperid: 1031, https://arxiv.org/pdf/2506.20595.pdf

Abstract:
The ubiquity of technologies like ChatGPT has raised concerns about their impact on student writing, particularly regarding reduced learner agency and superficial engagement with content. While standalone chat-based LLMs often produce suboptimal writing outcomes, evidence suggests that purposefully designed AI writing support tools can enhance the writing process. This paper investigates how different AI support approaches affect writers' sense of agency and depth of knowledge transformation. Through a randomized controlled trial with 90 undergraduate students, we compare three conditions: (1) a chat-based LLM writing assistant, (2) an integrated AI writing tool to support diverse subprocesses, and (3) a standard writing interface (control). Our findings demonstrate that, among AI-supported conditions, students using the integrated AI writing tool exhibited greater agency over their writing process and engaged in deeper knowledge transformation overall. These results suggest that thoughtfully designed AI writing support targeting specific aspects of the writing process can help students maintain ownership of their work while facilitating improved engagement with content.

Paperid: 1032, https://arxiv.org/pdf/2506.19079.pdf

Abstract:
Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.

Paperid: 1033, https://arxiv.org/pdf/2506.18455.pdf

Abstract:
We introduce CODS (Computational Optimization in Design Space), a theoretical model that frames computational design as a constrained optimization problem over a structured, multi-dimensional design space. Unlike existing methods that rely on handcrafted heuristics or domain-specific rules, CODS provides a generalizable and interpretable framework that supports diverse design tasks. Given a user requirement and a well-defined design space, CODS automatically derives soft and hard constraints using large language models through a structured prompt engineering pipeline. These constraints guide the optimization process to generate design solutions that are coherent, expressive, and aligned with user intent. We validate our approach across two domains, visualization design and knitwear generation, demonstrating superior performance in design quality, intent alignment, and user preference compared to existing LLM-based methods. CODS offers a unified foundation for scalable, controllable, and AI-powered design automation.
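
The framing in the abstract above, hard constraints that filter a discrete design space and soft constraints that rank what remains, can be sketched as follows. This is not the CODS implementation: the dimensions, the constraints, and the scoring weights are hypothetical and hand-written here, whereas the paper derives constraints with large language models, and the exhaustive search stands in for whatever optimizer the authors use.

```python
from itertools import product

# Hypothetical design space: each dimension offers a few discrete options.
design_space = {
    "chart": ["bar", "line", "pie"],
    "palette": ["categorical", "sequential"],
    "labels": ["on", "off"],
}


def hard_ok(design):
    # Hypothetical hard constraint: pie charts must show labels.
    return not (design["chart"] == "pie" and design["labels"] == "off")


def soft_score(design):
    # Hypothetical soft constraints expressed as additive preferences.
    score = 0.0
    if design["chart"] == "line":
        score += 2.0
    if design["palette"] == "sequential":
        score += 1.0
    if design["labels"] == "on":
        score += 0.5
    return score


def best_design(space):
    """Enumerate every combination, drop infeasible ones, return the top scorer."""
    names = list(space)
    candidates = (dict(zip(names, combo)) for combo in product(*space.values()))
    feasible = [d for d in candidates if hard_ok(d)]
    return max(feasible, key=soft_score)


print(best_design(design_space))
```

Exhaustive enumeration is workable only for small spaces like this one; larger spaces need a guided search.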

Paperid: 1034, https://arxiv.org/pdf/2506.17890.pdf

Abstract:
Backchanneling (e.g., "uh-huh", "hmm", a simple nod) encompasses a big part of everyday communication; it is how we negotiate the turn to speak, it signals our engagement, and shapes the flow of our conversations. For people with speech and motor impairments, backchanneling is limited to a reduced set of modalities, and their Augmentative and Alternative Communication (AAC) technology requires visual attention, making it harder to observe non-verbal cues of conversation partners. We explore how users of AAC technology approach backchanneling and create their own unique channels and communication culture. We conducted a workshop with 4 AAC users to understand the unique characteristics of backchanneling in AAC. We explored how backchanneling changes when pairs of AAC users communicate vs when an AAC user communicates with a non-AAC user. We contextualize these findings through four in-depth interviews with speech-language pathologists (SLPs). We conclude with a discussion about backchanneling as a micro-cultural practice, rethinking embodiment and mediation in AAC technology, and providing design recommendations for timely multi-modal backchanneling while respecting different communication cultures.

Paperid: 1035, https://arxiv.org/pdf/2506.17606.pdf

Abstract:
We present Full-body WPT, wireless power networking around the human body using a meandered textile coil. Unlike traditional inductive systems that emit strong fields into the deep tissue inside the body, the meander coil enables localized generation of a strong magnetic field constrained to the skin surface, even when scaled to the size of the human body. Such a localized inductive system enhances both the safety and efficiency of wireless power around the body. Furthermore, the use of low-loss conductive yarn achieves an energy-efficient and lightweight design. We analyze the performance of our design through simulations and experimental prototypes, demonstrating high power transfer efficiency and adaptability to user movement and posture. Our system provides a safe and efficient distributed power network using meandered textile coils integrated into wearable materials, highlighting the potential of body-centric wireless power networking as a foundational layer for ubiquitous health monitoring, augmented reality, and human-machine interaction systems.

Paperid: 1036, https://arxiv.org/pdf/2506.16677.pdf

Abstract:
Trust prediction is a key issue in human-robot collaboration, especially in construction scenarios where maintaining appropriate trust calibration is critical for safety and efficiency. This paper introduces the Performance-guided Physiological signal-based Trust Prediction (PPTP), a novel framework designed to improve trust assessment. We designed a human-robot construction scenario with three difficulty levels to induce different trust states. Our approach integrates synchronized multimodal physiological signals (ECG, GSR, and EMG) with collaboration performance evaluation to predict human trust levels. Individual physiological signals are processed using collaboration performance information as guiding cues, leveraging the standardized nature of collaboration performance to compensate for individual variations in physiological responses. Extensive experiments demonstrate the efficacy of our cross-modality fusion method in significantly improving trust classification performance. Our model achieves over 81% accuracy in three-level trust classification, outperforming the best baseline method by 6.7%, and notably reaches 74.3% accuracy in high-resolution seven-level classification, which is a first in trust prediction research. Ablation experiments further validate the superiority of physiological signal processing guided by collaboration performance assessment.

Paperid: 1037, https://arxiv.org/pdf/2506.13882.pdf

Abstract:
As extended reality (XR) systems become increasingly immersive and sensor-rich, they enable the collection of fine-grained behavioral signals such as eye and body telemetry. These signals support personalized and responsive experiences and may also contain unique patterns that can be linked back to individuals. However, privacy mechanisms that naively pair unimodal mechanisms (e.g., independently applying privacy mechanisms for eye and body privatization) are often ineffective at preventing re-identification in practice. In this work, we systematically evaluate real-time privacy mechanisms for XR, both individually and in pairs, across eye and body modalities. To preserve usability, all mechanisms were tuned based on empirically grounded thresholds for real-time interaction. We evaluated four eye and ten body mechanisms across multiple datasets, comprising up to 407 participants. Our results show that while obfuscating eye telemetry alone offers moderate privacy gains, body telemetry perturbation is substantially more effective. When carefully paired, multimodal mechanisms reduce the re-identification rate from 80.3% to 26.3% in casual XR applications (e.g., VRChat and Job Simulator) and from 84.8% to 26.1% in competitive XR applications (e.g., Beat Saber and Synth Riders), all without violating real-time usability requirements. These findings underscore the potential of modality-specific and context-aware privacy strategies for protecting behavioral data in XR environments.

Paperid: 1038, https://arxiv.org/pdf/2506.11781.pdf

Abstract:
Geospatial data analysis plays a crucial role in tackling intricate societal challenges such as urban planning and climate modeling. However, employing tools like GeoPandas, a prominent Python library for geospatial data manipulation, necessitates expertise in complex domain-specific syntax and workflows. GeoPandas-AI addresses this gap by integrating LLMs directly into the GeoPandas workflow, transforming the GeoDataFrame class into an intelligent, stateful class for both data analysis and geospatial code development. This paper formalizes the design of such a smart class and provides an open-source implementation of GeoPandas-AI, distributed via PyPI. Through its innovative combination of conversational interfaces and stateful exploitation of LLMs for code generation and data analysis, GeoPandas-AI introduces a new paradigm for code-copilots and instantiates it for geospatial development.

Paperid: 1039, https://arxiv.org/pdf/2506.10587.pdf

Abstract:
Design spaces serve as a conceptual framework that enables designers to explore feasible solutions through the selection and combination of design elements. However, effective decision-making remains heavily dependent on the designer's experience, and the absence of mathematical formalization prevents computational support for automated design processes. To bridge this gap, we introduce a structured representation that models design spaces with orthogonal dimensions and discrete selectable elements. Building on this model, we present IDEA, a decision-making framework for augmenting design intelligence through design space exploration to generate effective outcomes. Specifically, IDEA leverages large language models (LLMs) for constraint generation, incorporates a Monte Carlo Tree Search (MCTS) algorithm guided by these constraints to explore the design space efficiently, and instantiates abstract decisions into domain-specific implementations. We validate IDEA in two design scenarios: data-driven article composition and pictorial visualization generation, supported by example results, expert interviews, and a user study. The evaluation demonstrates IDEA's adaptability across domains and its capability to produce superior design outcomes.
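
As a rough illustration of the structured representation described in this abstract (orthogonal dimensions, each with discrete selectable elements), the following Python sketch enumerates the designs that satisfy a constraint. The dimension names, elements, and constraint are hypothetical, and exhaustive enumeration stands in for the constraint-guided MCTS the abstract mentions; this is not the authors' implementation.

```python
from itertools import product

# Hypothetical design space: each key is an orthogonal dimension,
# each value lists that dimension's discrete selectable elements.
DESIGN_SPACE = {
    "layout": ["grid", "column"],
    "chart": ["bar", "line", "pie"],
    "tone": ["formal", "casual"],
}

def violates(design):
    # Hypothetical hand-written constraint; in the abstract,
    # constraints are generated by an LLM instead.
    return design["chart"] == "pie" and design["layout"] == "grid"

def enumerate_designs(space, is_invalid):
    """Yield every combination of elements that passes the constraint."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        design = dict(zip(names, combo))
        if not is_invalid(design):
            yield design

designs = list(enumerate_designs(DESIGN_SPACE, violates))
print(len(designs))
```

Exhaustive enumeration is only workable for small spaces like this one; a tree search becomes attractive once the number of dimensions grows.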

Paperid: 1040, https://arxiv.org/pdf/2506.10265.pdf

Abstract:
Human gait analysis with wearable sensors has been widely used in various applications, such as daily life healthcare, rehabilitation, physical therapy, and clinical diagnostics and monitoring. In particular, ground reaction force (GRF) provides critical information about how the body interacts with the ground during locomotion. Although instrumented treadmills have been widely used as the gold standard for measuring GRF during walking, their lack of portability and high cost make them impractical for many applications. As an alternative, low-cost, portable, wearable insole sensors have been utilized to measure GRF; however, these sensors are susceptible to noise and disturbance and are less accurate than treadmill measurements. To address these challenges, we propose a Time-aware Knowledge Distillation framework for GRF estimation from insole sensor data. This framework leverages similarity and temporal features within a mini-batch during the knowledge distillation process, effectively capturing the complementary relationships between features and the sequential properties of the target and input data. The performance of the lightweight models distilled through this framework was evaluated by comparing GRF estimations from insole sensor data against measurements from an instrumented treadmill. Empirical results demonstrated that Time-aware Knowledge Distillation outperforms current baselines in GRF estimation from wearable sensor data.

Paperid: 1041, https://arxiv.org/pdf/2506.07275.pdf

Abstract:
Machine learning approaches, such as contextual multi-armed bandit (cMAB) algorithms, offer a promising strategy to reduce sedentary behavior by delivering personalized interventions to encourage physical activity. However, cMAB algorithms typically require large participant samples to learn effectively and may overlook key psychological factors that are not explicitly encoded in the model. In this study, we propose a hybrid approach that combines cMAB for selecting intervention types with large language models (LLMs) to personalize message content. We evaluate four intervention types: behavioral self-monitoring, gain-framed, loss-framed, and social comparison, each delivered as a motivational message aimed at increasing motivation for physical activity and daily step count. Message content is further personalized using dynamic contextual factors including daily fluctuations in self-efficacy, social influence, and regulatory focus. Over a seven-day trial, participants receive daily messages assigned by one of four models: cMAB alone, LLM alone, combined cMAB with LLM personalization (cMABxLLM), or equal randomization (RCT). Outcomes include daily step count and message acceptance, assessed via ecological momentary assessments (EMAs). We apply a causal inference framework to evaluate the effects of each model. Our findings offer new insights into the complementary roles of LLM-based personalization and cMAB adaptation in promoting physical activity through personalized behavioral messaging.
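
To make the bandit component concrete, here is a minimal Python sketch of a contextual bandit that picks one of the four intervention types named in the abstract. The abstract does not say which cMAB algorithm is used, so the epsilon-greedy rule, the discrete context labels, and the reward values below are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

# The four intervention types listed in the abstract.
ARMS = ["self_monitoring", "gain_framed", "loss_framed", "social_comparison"]

class EpsilonGreedyContextualBandit:
    """Assumed algorithm: per-context running means with epsilon-greedy choice."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> number of updates
        self.means = defaultdict(float)   # (context, arm) -> mean observed reward

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.means[(context, arm)])

    def update(self, context, arm, reward):
        # Incremental update of the running mean for this (context, arm) pair.
        key = (context, arm)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]

# Hypothetical context label and rewards (e.g., normalized step-count change).
bandit = EpsilonGreedyContextualBandit(ARMS, epsilon=0.0)
bandit.update("low_self_efficacy", "gain_framed", 1.0)
bandit.update("low_self_efficacy", "loss_framed", 0.2)
choice = bandit.select("low_self_efficacy")
print(choice)
```

In the hybrid setup the abstract describes, the selected intervention type would then be passed to an LLM that writes the personalized message; that step is omitted here.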

Paperid: 1042, https://arxiv.org/pdf/2506.07041.pdf

Abstract:
User reporting systems are central to addressing interpersonal conflicts and protecting users from harm in online spaces, particularly those with heightened privacy expectations. However, users often express frustration at their lack of insight and input into the reporting process. Drawing on offline legal literature, we trace these frustrations to the inquisitorial nature of today's online reporting systems, where moderators lead evidence gathering and case development. In contrast, adversarial models can grant users greater control and thus are better for procedural justice and privacy protection, despite their increased risks of system abuse. This motivates us to explore the potential of incorporating adversarial practices into online reporting systems. Through literature review, formative interviews, and threat modeling, we find a rich design space for empowering users to collect and present their evidence while mitigating potential abuse in the reporting process. In particular, we propose designs that minimize the amount of information shared for reporting purposes, as well as supporting evidence authentication. Finally, we discuss how our findings can inform new cryptographic tools and new efforts to apply comparative legal frameworks to online moderation.

Paperid: 1043, https://arxiv.org/pdf/2506.06813.pdf

Abstract:
Understanding political discourse in online spaces is crucial for analyzing public opinion and ideological polarization. While social computing and computational linguistics have explored such discussions in English, such research efforts are significantly limited in major yet under-resourced languages like Bengali due to the unavailability of datasets. In this paper, we present a multilingual dataset of Bengali transnational political discourse (BTPD) collected from three online platforms, each representing distinct community structures and interaction dynamics. Besides describing how we hand-curated the dataset through community-informed keyword-based retrieval, this paper also provides a general overview of its topics and multilingual content.

Paperid: 1044, https://arxiv.org/pdf/2506.05687.pdf

Abstract:
As Artificial Intelligence (AI) systems are integrated into more aspects of society, they offer new capabilities but also cause a range of harms that are drawing increasing scrutiny. A large body of work in the Responsible AI community has focused on identifying and auditing these harms. However, much less is understood about what happens after harm occurs: what constitutes reparation, who initiates it, and how effective these reparations are. In this paper, we develop a taxonomy of AI harm reparation based on a thematic analysis of real-world incidents. The taxonomy organizes reparative actions into four overarching goals: acknowledging harm, attributing responsibility, providing remedies, and enabling systemic change. We apply this framework to a dataset of 1,060 AI-related incidents, analyzing the prevalence of each action and the distribution of stakeholder involvement. Our findings show that reparation efforts are concentrated in early, symbolic stages, with limited actions toward accountability or structural reform. Drawing on theories of justice, we argue that existing responses fall short of delivering meaningful redress. This work contributes a foundation for advancing more accountable and reparative approaches to Responsible AI.

Paperid: 1045, https://arxiv.org/pdf/2505.24004.pdf

Abstract:
Crowd work platforms like Amazon Mechanical Turk and Prolific are vital for research, yet workers' growing use of generative AI tools poses challenges. Researchers face compromised data validity as AI responses replace authentic human behavior, while workers risk diminished roles as AI automates tasks. To address this, we propose a hybrid framework using digital twins, personalized AI models that emulate workers' behaviors and preferences while keeping humans in the loop. We evaluate our system with an experiment (n=88 crowd workers) and in-depth interviews with crowd workers (n=5) and social science researchers (n=4). Our results suggest that digital twins may enhance productivity and reduce decision fatigue while maintaining response quality. Both researchers and workers emphasized the importance of transparency, ethical data use, and worker agency. By automating repetitive tasks and preserving human engagement for nuanced ones, digital twins may help balance scalability with authenticity.

Paperid: 1046, https://arxiv.org/pdf/2505.23997.pdf

Abstract:
Existing stress-management tools fail to account for the timing and contextual specificity of students' daily lives, often providing static or misaligned support. Digital calendars contain rich, personal indicators of upcoming responsibilities, yet this data is rarely leveraged for adaptive wellbeing interventions. In this short paper, we explore how large language models (LLMs) might use digital calendar data to deliver timely and personalized stress support. We conducted a one-week study with eight university students using a functional technology probe that generated daily stress-management messages based on participants' calendar events. Through semi-structured interviews and thematic analysis, we found that participants valued interventions that prioritized stressful events and adopted a concise, but colloquial tone. These findings reveal key design implications for LLM-based stress-management tools, including the need for structured questioning and tone calibration to foster relevance and trust.

Paperid: 1047, https://arxiv.org/pdf/2505.21682.pdf

Abstract:
City governments in the United States are increasingly pressured to adopt emerging technologies. Yet, these systems often risk biased and disparate outcomes. Scholars studying public sector technology design have converged on the need to ground these systems in the goals and organizational contexts of employees using them. We expand our understanding of employees' contexts by focusing on the equity practices of city government employees to surface important equity considerations around public sector data and technology use. Through semi-structured interviews with thirty-six employees from ten departments of a U.S. city government, our findings reveal challenges employees face when operationalizing equity, perspectives on data needs for advancing equity goals, and the design space for acceptable government technology. We discuss what it looks like to foreground equity in data use and technology design, and considerations for how to support city government employees in operationalizing equity with and without official equity offices.

Paperid: 1048, https://arxiv.org/pdf/2505.21362.pdf

Abstract:
Effective engagement by large language models (LLMs) requires adapting responses to users' sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs' behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

Paperid: 1049, https://arxiv.org/pdf/2505.18464.pdf

Abstract:
The growing demand for accessible mental health support, compounded by workforce shortages and logistical barriers, has led to increased interest in utilizing Large Language Models (LLMs) for scalable and real-time assistance. However, their use in sensitive domains such as anxiety support remains underexamined. This study presents a systematic evaluation of LLMs (GPT and Llama) for their potential utility in anxiety support by using real user-generated posts from the r/Anxiety subreddit for both prompting and fine-tuning. Our approach utilizes a mixed-method evaluation framework incorporating three main categories of criteria: (i) linguistic quality, (ii) safety and trustworthiness, and (iii) supportiveness. Results show that fine-tuning LLMs with naturalistic anxiety-related data enhanced linguistic quality but increased toxicity and bias, and diminished emotional responsiveness. While LLMs exhibited limited empathy, GPT was evaluated as more supportive overall. Our findings highlight the risks of fine-tuning LLMs on unprocessed social media content without mitigation strategies.

Paperid: 1050, https://arxiv.org/pdf/2505.18318.pdf

Abstract:
Where do rules come from in online communities? While prior studies of online community governance in social computing have sought to characterize rules by their functions within communities and documented practices of rule enforcement, they have largely overlooked rule adoption and change. This study investigates how and why online communities adopt and change their rules. We conducted a grounded theory-based analysis of 40 in-depth interviews with community leaders from subreddits, Fandom wikis, and Fediverse servers, and identified seven processes involved in the adoption of online community rules. Our findings reveal that, beyond regulating behavior and solving functional intra-community problems, rules are also adopted and changed for relational reasons, such as signaling or reinforcing community legitimacy and identity to other communities. While rule change was often prompted by challenges during community growth or decline, change also depended on volunteer leaders' work capacity, the presence of member feedback mechanisms, and relational dynamics between leaders and members. The findings extend prior theories from social computing and organizational research, illustrating how institutionalist and ecological explanations of the relational origins of rules complement more functional accounts. The results also support design recommendations that integrate the relational aspects of rules and rulemaking to facilitate successful governance across communities' lifecycles.

Paperid: 1051, https://arxiv.org/pdf/2505.14452.pdf

Abstract:
Effective workplace communication is essential for managerial success, yet many managers lack access to tailored and sustained training. Although AI-assisted communication systems may offer scalable training solutions, little is known about how managers envision the role of AI in helping them improve their communication skills. To investigate this, we designed a conversational role-play system, CommCoach, as a functional probe to understand how managers anticipate using AI to practice their communication skills. Through semi-structured interviews, participants emphasized the value of adaptive, low-risk simulations for practicing difficult workplace conversations. They also highlighted opportunities, including human-AI teaming, transparent and context-aware feedback, and greater control over AI-generated personas. AI-assisted communication training should balance personalization, structured learning objectives, and adaptability to different user styles and contexts. However, achieving this requires carefully navigating tensions between adaptive and consistent AI feedback, realism and potential bias, and the open-ended nature of AI conversations versus structured workplace discourse.

Paperid: 1052, https://arxiv.org/pdf/2505.11784.pdf

Abstract:
Analytic provenance can be visually encoded to help users track their ongoing analysis trajectories, recall past interactions, and inform new analytic directions. Despite its significance, provenance is often hardwired into analytics systems, affording limited user control and opportunities for self-reflection. We thus propose modeling provenance as an attribute that is available to users during analysis. We demonstrate this concept by modeling two provenance attributes that track the recency and frequency of user interactions with data. We integrate these attributes into a visual data analysis system prototype, ProvenanceLens, wherein users can visualize their interaction recency and frequency by mapping them to encoding channels (e.g., color, size) or applying data transformations (e.g., filter, sort). Using ProvenanceLens as a design probe, we conduct an exploratory study with sixteen users to investigate how these provenance-tracking affordances are utilized for both decision-making and self-reflection. We find that users can accurately and confidently answer questions about their analysis, and we show that mismatches between the user's mental model and the provenance encodings can be surprising, thereby prompting useful self-reflection. We also report on the user strategies surrounding these affordances, and reflect on their intuitiveness and effectiveness in representing provenance.
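
The two provenance attributes described in this abstract, recency and frequency of interaction, can be sketched in Python as follows. The log format and the normalization of recency to the range (0, 1] are assumptions made for illustration; the abstract does not specify how ProvenanceLens computes them.

```python
from collections import Counter

def provenance_attributes(interaction_log):
    """Compute per-item frequency and recency from a chronological log.

    interaction_log: list of item ids, oldest interaction first (assumed format).
    """
    frequency = Counter(interaction_log)
    # Position of each item's most recent interaction (later entries overwrite).
    last_seen = {item: index for index, item in enumerate(interaction_log)}
    total = len(interaction_log)
    # Assumed normalization: 1.0 for the most recently touched item.
    recency = {item: (index + 1) / total for item, index in last_seen.items()}
    return frequency, recency

log = ["a", "b", "a", "c", "a"]
frequency, recency = provenance_attributes(log)
print(frequency["a"], recency["a"], recency["c"])
```

Once computed, such attributes behave like any other data column, so they can be mapped to color or size, or used to filter and sort.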

Paperid: 1053, https://arxiv.org/pdf/2505.10902.pdf

Abstract:
Background and Objective: Precise preoperative planning and effective physician training for coronary interventions are increasingly important. Despite advances in medical imaging technologies, transforming static or limited dynamic imaging data into comprehensive dynamic cardiac models remains challenging. Existing training systems lack accurate simulation of cardiac physiological dynamics. This study develops a comprehensive dynamic cardiac model research framework based on 4D-CTA, integrating digital twin technology, computer vision, and physical model manufacturing to provide precise, personalized tools for interventional cardiology. Methods: Using 4D-CTA data from a 60-year-old female with three-vessel coronary stenosis, we segmented cardiac chambers and coronary arteries, constructed dynamic models, and implemented skeletal skinning weight computation to simulate vessel deformation across 20 cardiac phases. Transparent vascular physical models were manufactured using medical-grade silicone. We developed cardiac output analysis and virtual angiography systems, implemented guidewire 3D reconstruction using binocular stereo vision, and evaluated the system through angiography validation and CABG training applications. Results: Morphological consistency between virtual and real angiography reached 80.9%. Dice similarity coefficients for guidewire motion ranged from 0.741 to 0.812, with mean trajectory errors below 1.1 mm. The transparent model demonstrated advantages in CABG training, allowing direct visualization while simulating beating heart challenges. Conclusion: Our patient-specific digital-physical twin approach effectively reproduces both anatomical structures and dynamic characteristics of coronary vasculature, offering a dynamic environment with visual and tactile feedback valuable for education and clinical planning.
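
For readers unfamiliar with the Dice similarity coefficient reported in the results, the following Python sketch computes the standard definition, 2|A ∩ B| / (|A| + |B|), on two sets of pixel coordinates. The toy masks are invented for illustration and are unrelated to the study's data.

```python
def dice_coefficient(mask_a, mask_b):
    """Standard Dice similarity between two collections of pixel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        # Convention: two empty masks are treated as a perfect match.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Toy example: two 3-pixel masks sharing 2 pixels.
reconstructed = {(0, 0), (0, 1), (0, 2)}
reference = {(0, 1), (0, 2), (0, 3)}
score = dice_coefficient(reconstructed, reference)
print(round(score, 3))
```

A value of 1.0 means the two masks coincide exactly, and 0.0 means they share no pixels.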

Paperid: 1054, https://arxiv.org/pdf/2505.08904.pdf

Abstract:
What happens when a rideshare driver is suddenly locked out of the platform connecting them to riders, wages, and daily work? Deactivation, the abrupt removal of gig workers' platform access, typically occurs through arbitrary AI and algorithmic decisions with little explanation or recourse. This represents one of the most severe forms of algorithmic control and often devastates workers' financial stability. Recent U.S. state policies now mandate appeals processes and compensation for the period of wrongful deactivation, based on past earnings. Yet, labor organizers still lack effective tools to support these complex, error-prone workflows. We designed FareShare, a computational tool automating lost wage estimation for deactivated drivers, through a 6-month partnership with the State of Washington's largest rideshare labor union. Over the following 3 months, our field deployment of FareShare registered 178 account signups. We observed that the tool could reduce lost wage calculation time by over 95%, eliminate manual data entry errors, and enable legal teams to generate arbitration-ready reports more efficiently. Beyond these gains, the deployment also surfaced important socio-technical challenges around trust, consent, and tool adoption in high-stakes labor contexts.
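
The abstract says compensation is based on past earnings but does not give the formula. The Python sketch below shows one plausible calculation, assuming lost wages equal the average daily earnings over a lookback window multiplied by the number of deactivated days. The function name, input format, and formula are hypothetical and are not FareShare's method or any statutory rule.

```python
from datetime import date

def estimate_lost_wages(past_earnings, deactivated_on, reinstated_on):
    """Hypothetical estimate: average daily earnings times days deactivated.

    past_earnings: dict mapping date -> earnings on that date (assumed format).
    Days inside the lookback window with no entry count as zero earnings.
    """
    if not past_earnings:
        return 0.0
    days = sorted(past_earnings)
    window_days = (days[-1] - days[0]).days + 1
    daily_average = sum(past_earnings.values()) / window_days
    missed_days = max((reinstated_on - deactivated_on).days, 0)
    return round(daily_average * missed_days, 2)

# Invented example data.
earnings = {
    date(2025, 1, 1): 100.0,
    date(2025, 1, 2): 200.0,
    date(2025, 1, 4): 100.0,
}
estimate = estimate_lost_wages(earnings, date(2025, 1, 10), date(2025, 1, 20))
print(estimate)
```

Even this simple version shows why automation helps: the window length, missing days, and date arithmetic are easy to get wrong by hand.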

Paperid: 1055, https://arxiv.org/pdf/2505.08063.pdf

Abstract:
While LLMs are often touted as tools for democratizing specialized knowledge to beginners, their actual effectiveness for improving task performance and learning is still an open question. It is known that novices engage with LLMs differently from experts, with prior studies reporting meta-cognitive pitfalls that affect novices' ability to verify outputs and prompt effectively. We focus on a task domain, machine learning (ML), which embodies both high complexity and low verifiability to understand the impact of LLM assistance on novices. Provided a buggy ML script and open access to ChatGPT, we conduct a formative study with eight novice ML engineers to understand their reliance on, interactions with, and perceptions of the LLM. We find that user actions can be roughly categorized into leading the LLM and being led by the LLM, and further investigate how they affect reliance outcomes like over- and under-reliance. These results have implications for novices' cognitive engagement in LLM-assisted tasks and potential negative effects on downstream learning. Lastly, we pose potential augmentations to the novice-LLM interaction paradigm to promote cognitive engagement.

Paperid: 1056, https://arxiv.org/pdf/2505.04722.pdf

Abstract:
In this letter, we investigate whether the classical function allocation holds for physical Human-Robot Collaboration, which is important for providing insights for Industry 5.0 to guide how to best augment rather than replace workers. This study empirically tests the applicability of Fitts' List within physical Human-Robot Collaboration, by conducting a user study (N=26, within-subject design) to evaluate four distinct allocations of position/force control between human and robot in an abstract blending task. We hypothesize that the allocation in which humans control the position achieves better performance and receives higher user ratings. When allocating position control to the human and force control to the robot, compared to the opposite case, we observed a significant improvement in preventing overblending. This allocation was also perceived better in terms of physical demand and overall system acceptance, while participants experienced greater autonomy, more engagement and less frustration. An interesting insight was that the supervisory role (when the robot controls both position and force) was rated second best in terms of subjective acceptance. Another surprising insight was that if position control was delegated to the robot, the participants perceived much lower autonomy than when the force control was delegated to the robot. These findings empirically support applying Fitts' principles to static function allocation for physical collaboration, while also revealing important nuanced user experience trade-offs, particularly regarding perceived autonomy when delegating position control.

Paperid: 1057, https://arxiv.org/pdf/2505.04712.pdf

Abstract:
Parsons problems (PPs) have shown promise in structured problem solving by providing scaffolding that decomposes the problem and requires learners to reconstruct the solution. However, some students face difficulties when first learning with PPs or solving more complex Parsons problems. This study introduces Guided Parsons problems (GPPs) designed to provide step-specific hints and improve learning outcomes in an intelligent logic tutor. In a controlled experiment with 76 participants, GPP students achieved significantly higher accuracy of rule application in both level-end tests and post-tests, with the strongest gains among students with lower prior knowledge. GPP students initially spent more time in training (1.52 vs. 0.81 hours) but required less time for post-tests, indicating improved problem solving efficiency. Our thematic analysis of GPP student self-explanations revealed task decomposition, better rule understanding, and reduced difficulty as key themes, while some students felt the structured nature of GPPs restricted their own way of reasoning. These findings reinforce that GPPs can effectively combine the benefits of worked examples and problem solving practice, but could be further improved by individual adaptation.

Paperid: 1058, https://arxiv.org/pdf/2505.01967.pdf

Abstract:
Large Language Models (LLMs) have become integral to daily life, widely adopted in communication, decision-making, and information retrieval, raising critical questions about how these systems implicitly form and express socio-cognitive attitudes or "worldviews". While existing research extensively addresses demographic and ethical biases, broader dimensions (such as attitudes toward authority, equality, autonomy, and fate) remain under-explored. In this paper, we introduce the Social Worldview Taxonomy (SWT), a structured framework grounded in Cultural Theory, operationalizing four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into measurable sub-dimensions. Using SWT, we empirically identify distinct and interpretable cognitive profiles across 28 diverse LLMs. Further, inspired by Social Referencing Theory, we experimentally demonstrate that explicit social cues systematically shape these cognitive attitudes, revealing both general response patterns and nuanced model-specific variations. Our findings enhance the interpretability of LLMs by revealing implicit socio-cognitive biases and their responsiveness to social feedback, thus guiding the development of more transparent and socially responsible language technologies.

Paperid: 1059, https://arxiv.org/pdf/2504.21240.pdf

Abstract:
Peer produced goods such as online knowledge bases and free/libre open source software rely on contributors who often choose their tasks regardless of consumer needs. These goods are susceptible to underproduction: when popular goods are relatively low quality. Although underproduction is a common feature of peer production, very little is known about how to counteract it. We use a detailed longitudinal dataset from English Wikipedia to show that more experienced contributors -- including those who contribute without an account -- tend to contribute to underproduced goods. A within-person analysis shows that contributors' efforts shift toward underproduced goods over time. These findings illustrate the value of retaining contributors in peer production, including those contributing without accounts, as a means to counter underproduction.

Paperid: 1060, https://arxiv.org/pdf/2504.20685.pdf

Abstract:
Generating realistic listener facial motions in dyadic conversations remains challenging due to the high-dimensional action space and temporal dependency requirements. Existing approaches usually extract 3D Morphable Model (3DMM) coefficients and model in the 3DMM space. However, this makes the computational speed of the 3DMM a bottleneck, making it difficult to achieve real-time interactive responses. To tackle this problem, we propose Facial Action Diffusion (FAD), which introduces diffusion methods from the field of image generation to achieve efficient facial action generation. We further build the Efficient Listener Network (ELNet), specially designed to accommodate both the visual and audio information of the speaker as input. Combining FAD and ELNet, the proposed method learns effective listener facial motion representations and improves performance over state-of-the-art methods while reducing computational time by 99%.

Paperid: 1061, https://arxiv.org/pdf/2504.18332.pdf

Abstract:
The growing applications of AR/VR increase the demand for real-time full-body pose estimation from Head-Mounted Displays (HMDs). Although HMDs provide joint signals from the head and hands, reconstructing a full-body pose remains challenging due to the unconstrained lower body. Recent advancements often rely on conventional neural networks and generative models to improve performance in this task, such as Transformers and diffusion models. However, these approaches struggle to strike a balance between achieving precise pose reconstruction and maintaining fast inference speed. To overcome these challenges, a lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a well-designed hybrid encoder, State Space Attention Encoders, to adapt the state space duality to complex motion poses and enable real-time realistic pose reconstruction. Moreover, a Frequency-Aware Decoder is introduced to mitigate jitter caused by variable-frequency motion signals, remarkably enhancing the motion smoothness. Comprehensive experiments on the AMASS dataset demonstrate that SSD-Poser achieves exceptional accuracy and computational efficiency, showing outstanding inference efficiency compared to state-of-the-art methods.

Paperid: 1062, https://arxiv.org/pdf/2504.15494.pdf

Abstract:
Addressing usability in free, libre, and open-source software (FLOSS) is a challenging issue, particularly due to a long-existing "by developer, for developer" mentality. Engaging designers and end-users to work with developers can help improve its usability, but unequal power dynamics among those stakeholder roles must be mitigated. To explore how the power of different FLOSS stakeholders manifests and can be mediated during collaboration, we conducted eight design workshops with different combinations of key FLOSS stakeholders (i.e., developers, designers, and end-users). Leveraging existing theories on Dimensions of Power, we revealed how participants navigate existing role-based power structures through resource utilization, knowledge gap management, and experience referencing. We also observed that participants exhibited diverse behaviors confirming and challenging the status quo of FLOSS usability. Overall, our results contribute to a comprehensive understanding of the power dynamics among FLOSS stakeholders, providing valuable insights into ways to balance their power to improve FLOSS usability. Our work also serves as an exemplar of using design workshops as a research method to study power dynamics during collaboration that are usually hidden in the field.

Paperid: 1063, https://arxiv.org/pdf/2504.14222.pdf

Abstract:
Providing timely and actionable feedback is crucial for effective collaboration, learning, and coordination within teams. However, many teams face challenges in receiving feedback that aligns with their goals and promotes cohesion. We introduce tAIfa ("Team AI Feedback Assistant"), an AI agent that uses Large Language Models (LLMs) to provide personalized, automated feedback to teams and their members. tAIfa analyzes team interactions, identifies strengths and areas for improvement, and delivers targeted feedback based on communication patterns. We conducted a between-subjects study with 18 teams testing whether using tAIfa impacted their teamwork. Our findings show that tAIfa improved communication and contributions within the teams. This paper contributes to the Human-AI Interaction literature by presenting a computational framework that integrates LLMs to provide automated feedback, introducing tAIfa as a tool to enhance team engagement and cohesion, and providing insights into future AI applications to support team collaboration.

Paperid: 1064, https://arxiv.org/pdf/2504.13900.pdf

Abstract:
With the rapid adoption of AI tools in learning contexts, it is vital to understand how these systems shape users' reading processes and cognitive engagement. We collected and analyzed text from 124 sessions with AI tools, in which students used these tools to support them as they read assigned readings for an undergraduate course. We categorized participants' prompts to AI according to Bloom's Taxonomy of educational objectives: Remembering, Understanding, Applying, Analyzing, and Evaluating. Our results show that "Analyzing" and "Evaluating" are more prevalent in users' second and third prompts within a single usage session, suggesting a shift toward higher-order thinking. However, in reviewing users' engagement with AI tools over several weeks, we found that users converge toward passive reading engagement over time. Based on these results, we propose design implications for future AI reading-support systems, including structured scaffolds for lower-level cognitive tasks (e.g., recalling terms) and proactive prompts that encourage higher-order thinking (e.g., analyzing, applying, evaluating). Additionally, we advocate for adaptive, human-in-the-loop features that allow students and instructors to tailor their reading experiences with AI, balancing efficiency with enriched cognitive engagement. Our paper expands the dialogue on integrating AI into academic reading, highlighting both its potential benefits and challenges.

Paperid: 1065, https://arxiv.org/pdf/2504.13861.pdf

Abstract:
Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. In this paper, we present 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through four temperament-based Patient Agents and an Assessor Agent that jointly evaluate diagnostic accuracy and dialogue quality. It includes 3013 cases across 34 diagnoses drawn from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for popular LVLMs, including GPT-4o-mini, LLaVA-3.2-11B-Vision-Instruct, and Qwen2-VL-7B-Instruct. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://anonymous.4open.science/r/3mdbench_acl-0511.

Paperid: 1066, https://arxiv.org/pdf/2504.10969.pdf

Abstract:
Radio Frequency (RF) sensing technologies have experienced significant growth due to the widespread adoption of RF devices and the Internet of Things (IoT). These technologies enable numerous applications across healthcare, smart homes, industrial automation, and human-computer interaction. However, the non-intrusive and ubiquitous nature of RF sensing - combined with its environmental sensitivity and data dependency - makes these systems inherently vulnerable not only as attack targets, but also as powerful attack vectors. This survey presents a comprehensive analysis of RF sensing security, covering both system-level vulnerabilities - such as signal spoofing, adversarial perturbations, and model poisoning - and the misuse of sensing capabilities for attacks like cross-boundary surveillance, side-channel inference, and semantic privacy breaches. We propose unified threat models to structure these attack vectors and further conduct task-specific vulnerability assessments across key RF sensing applications, identifying their unique attack surfaces and risk profiles. In addition, we systematically review defense strategies across system layers and threat-specific scenarios, incorporating both active and passive paradigms to provide a structured and practical view of protection mechanisms. Compared to prior surveys, our work distinguishes itself by offering a multi-dimensional classification framework based on task type, threat vector, and sensing modality, and by providing fine-grained, scenario-driven analysis that bridges theoretical models and real-world implications. This survey aims to serve as a comprehensive reference for researchers and practitioners seeking to understand, evaluate, and secure the evolving landscape of RF sensing technologies.

Paperid: 1067, https://arxiv.org/pdf/2504.10745.pdf

Abstract:
Explanations for computer vision models are important tools for interpreting how the underlying models work. However, they are often presented in static formats, which pose challenges for users, including information overload, a gap between semantic and pixel-level information, and limited opportunities for exploration. We investigate interactivity as a mechanism for tackling these issues in three common explanation types: heatmap-based, concept-based, and prototype-based explanations. We conducted a study (N=24), using a bird identification task, involving participants with diverse technical and domain expertise. We found that while interactivity enhances user control, facilitates rapid convergence to relevant information, and allows users to expand their understanding of the model and explanation, it also introduces new challenges. To address these, we provide design recommendations for interactive computer vision explanations, including carefully selected default views, independent input controls, and constrained output spaces.

Paperid: 1068, https://arxiv.org/pdf/2504.10501.pdf

Abstract:
Widespread stigma, in both offline and online spaces, acts as a barrier to harm reduction efforts in the context of opioid use disorder (OUD). This stigma is prominently directed towards clinically approved medications for addiction treatment (MAT), people with the condition, and the condition itself. Given the potential of artificial intelligence-based technologies in promoting health equity and facilitating empathic conversations, this work examines whether large language models (LLMs) can help abate OUD-related stigma in online communities. To answer this, we conducted a series of pre-registered randomized controlled experiments, where participants read LLM-generated, human-written, or no responses to help-seeking OUD-related content in online communities. The experiment was conducted under two setups, i.e., participants read the responses either once (N = 2,141), or repeatedly for 14 days (N = 107). We found that participants reported the least stigmatized attitudes toward MAT after consuming LLM-generated responses under both setups. This study offers insights into strategies that can foster inclusive online discourse on OUD, e.g., our findings suggest that LLMs can be used as an education-based intervention to promote positive attitudes and increase people's propensity toward MAT.

Paperid: 1069, https://arxiv.org/pdf/2504.09558.pdf

Abstract:
The development of smart textile interfaces is hindered by the inclusion of rigid hardware components and batteries within the fabric, which pose challenges in terms of manufacturability, usability, and environmental concerns related to electronic waste. To mitigate these issues, we propose a smart textile interface and its wireless sensing system to eliminate the need for ICs, batteries, and connectors embedded into textiles. Our technique integrates multi-resonant circuits into smart textile interfaces and uses near-field electromagnetic coupling between two coils to enable wireless power transfer and data acquisition from the smart textile interface. A key aspect of our system is the development of a mathematical model that accurately represents the equivalent circuit of the sensing system. Using this model, we developed a novel algorithm to accurately estimate sensor signals based on changes in system impedance. Through simulation-based experiments and a user study, we demonstrate that our technique effectively supports multiple textile sensors of various types.
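
The abstract does not spell out its circuit model, so the following is only a rough illustration of why resonant circuits can encode sensor state without on-textile electronics. It uses the textbook resonance formula for an ideal LC circuit, f0 = 1 / (2π√(LC)). The function name and the component values are assumptions made for this sketch and are not taken from the paper.

```python
import math

def resonant_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC circuit: f0 = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

# Hypothetical values: a 1 uH textile coil with a capacitive sensor whose
# capacitance rises from 1 nF to 4 nF when it is pressed.
f_idle = resonant_frequency_hz(1e-6, 1e-9)
f_pressed = resonant_frequency_hz(1e-6, 4e-9)
print(f"idle: {f_idle:.0f} Hz, pressed: {f_pressed:.0f} Hz")
```

In this idealized picture, a reader coil that sweeps frequency would see the resonance move when the sensor changes, and several branches tuned to different resonances could in principle be told apart in one sweep.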

Paperid: 1070, https://arxiv.org/pdf/2504.09438.pdf

Abstract:
Choropleth maps are a common and effective way to visualize geographic thematic data. Although cartographers have established many principles about map design, data binning and color usage, less is known about how mapmakers make individual decisions in practice. We interview 16 cartographers and geographic information systems (GIS) experts from 13 government organizations, NGOs, and federal agencies about their choropleth mapmaking decisions and workflows. We categorize our findings and report on how mapmakers follow cartographic guidelines and personal rules of thumb, collaborate with other stakeholders within and outside their organization, and how organizational structures and norms are tied to decision-making during data preparation, data analysis, data binning, map styling, and map post-processing. We find several points of variation as well as regularity across mapmakers and organizations and present takeaways to inform cartographic education and practice, including broader implications and opportunities for CSCW, HCI, and information visualization researchers and practitioners.

Paperid: 1071, https://arxiv.org/pdf/2504.09355.pdf

Abstract:
Inherent uncertainty in geological data acquisition leads to the generation of large ensembles of equiprobable 3D reservoir models. Running computationally costly numerical flow simulations across such a vast solution space is infeasible. A more suitable approach is to carefully select a small number of geological models that reasonably capture the overall variability of the ensemble. Identifying these representative models is a critical task that enables the oil and gas industry to generate cost-effective production forecasts. Our work leverages virtual reality (VR) to provide engineers with a system for conducting geological uncertainty analysis, enabling them to perform inherently spatial tasks using an associative 3D interaction space. We present our VR system through the lens of the reality-based interaction paradigm, designing 3D interfaces that enable familiar physical interactions inspired by real-world analogies, such as gesture-based operations and view-dependent lenses. We also report an evaluation conducted with 12 reservoir engineers from an industry partner. Our findings offer insights into the benefits, pitfalls, and opportunities for refining our system design. We catalog our results into a set of design recommendations intended to guide researchers and developers of immersive interfaces, both in reservoir engineering and in broader application domains.

Paperid: 1072, https://arxiv.org/pdf/2504.09142.pdf

Abstract:
Heating of buildings represents a significant share of the energy consumption in Europe. Smart thermostats that capitalize on the data-driven analysis of heating patterns in order to optimize heat supply are a very promising part of building energy management technology. However, the factors driving their acceptance by building inhabitants are poorly understood, although such acceptance is a prerequisite for fully tapping their potential. In order to understand the driving forces of technology adoption in this use case, a large survey (N = 2250) was conducted in five EU countries (Austria, Belgium, Estonia, Germany, Greece). For the data analysis, structural equation modelling based on the Unified Theory of Acceptance and Use of Technology (UTAUT) was employed, which was extended by adding social beliefs, including descriptive social norms, collective efficacy, social identity and trust. As a result, performance expectancy, price value, and effort expectancy proved to be the most important predictors overall, with variations across countries. In sum, the adoption of smart thermostats appears more strongly associated with individual beliefs about their functioning, potentially reducing their adoption. At the end of the paper, implications for policy making and marketing of smart heating technologies are discussed.

Paperid: 1073, https://arxiv.org/pdf/2504.06442.pdf

Abstract:
As automation and mobile robotics reshape work environments, rising expectations for productivity increase cognitive demands on human operators, leading to potential stress and cognitive overload. Accurately assessing an operator's mental state is critical for maintaining performance and well-being. We use subjective time perception, which can be altered by stress and cognitive load, as a sensitive, low-latency indicator of well-being and cognitive strain. Distortions in time perception can affect decision-making, reaction times, and overall task effectiveness, making it a valuable metric for adaptive human-swarm interaction systems. We study how human physiological signals can be used to estimate a person's subjective time perception, using as an example a human-swarm interaction scenario in which a human operator needs to guide and control a swarm of small mobile robots. We obtain eye-tracking data that is classified for subjective time perception based on questionnaire data. Our results show that we successfully estimate a person's time perception from eye-tracking data. The approach can benefit from individual-based pretraining using only 30 seconds of data. In future work, we aim for robots that respond to human operator needs by automatically classifying physiological data in a closed control loop.

Paperid: 1074, https://arxiv.org/pdf/2504.05862.pdf

Abstract:
Large language model-based agents are becoming increasingly popular as a low-cost mechanism to provide personalized, conversational advice, and have demonstrated impressive capabilities in relatively simple scenarios, such as movie recommendations. But how do these agents perform in complex high-stakes domains, where domain expertise is essential and mistakes carry substantial risk? This paper investigates the effectiveness of LLM-advisors in the finance domain, focusing on three distinct challenges: (1) eliciting user preferences when users themselves may be unsure of their needs, (2) providing personalized guidance for diverse investment preferences, and (3) leveraging advisor personality to build relationships and foster trust. Via a lab-based user study with 64 participants, we show that LLM-advisors often match human advisor performance when eliciting preferences, although they can struggle to resolve conflicting user needs. When providing personalized advice, the LLM was able to positively influence user behavior, but demonstrated clear failure modes. Our results show that accurate preference elicitation is key; otherwise, the LLM-advisor has little impact, or can even direct the investor toward unsuitable assets. More worryingly, users appear insensitive to the quality of advice being given, or, worse, these can have an inverse relationship. Indeed, users reported a preference for, and increased satisfaction and emotional trust with, LLMs adopting an extroverted persona, even though those agents provided worse advice.

Paperid: 1075, https://arxiv.org/pdf/2504.03643.pdf

Abstract:
The need for automatic and high-quality emotion annotation is paramount in applications such as continuous emotion recognition and video highlight detection, yet achieving this through manual human annotations is challenging. Inspired by inter-subject correlation (ISC) utilized in neuroscience, this study introduces a novel Electroencephalography (EEG) based ISC methodology that leverages a single-electrode and feature-based dynamic approach. Our contributions are threefold. Firstly, we reidentify two potent emotion features suitable for classifying emotions: first-order difference (FD) and differential entropy (DE). Secondly, through the use of overall correlation analysis, we demonstrate the heterogeneous synchronized performance of electrodes. This performance aligns with neural emotion patterns established in prior studies, thus validating the effectiveness of our approach. Thirdly, by employing a sliding window correlation technique, we showcase the significant consistency of dynamic ISCs across various features or key electrodes in each analyzed film clip. Our findings indicate the method's reliability in capturing consistent, dynamic shared neural synchrony among individuals, triggered by evocative film stimuli. This underscores the potential of our approach to serve as an indicator of continuous human emotion arousal. The implications of this research are significant for advancements in affective computing and the broader neuroscience field, suggesting a streamlined and effective tool for emotion analysis in real-world applications.
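
A minimal sketch of the two features and the sliding-window correlation named above, under stated assumptions. The abstract gives no formulas, so the definitions here are common textbook choices and not necessarily the authors': DE uses the closed form for a Gaussian signal, FD is taken as the mean absolute difference of consecutive samples, and ISC is the Pearson correlation between two subjects' feature series. All function names and the window parameters are hypothetical.

```python
import numpy as np

def differential_entropy(segment):
    """DE under a Gaussian assumption: 0.5 * ln(2 * pi * e * variance)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(segment))

def first_order_difference(segment):
    """Mean absolute difference between consecutive samples (assumed definition)."""
    return np.mean(np.abs(np.diff(segment)))

def sliding_window_isc(series_a, series_b, window, step):
    """Pearson correlation between two subjects' feature series, per window."""
    series_a = np.asarray(series_a, dtype=float)
    series_b = np.asarray(series_b, dtype=float)
    correlations = []
    for start in range(0, len(series_a) - window + 1, step):
        a = series_a[start:start + window]
        b = series_b[start:start + window]
        correlations.append(np.corrcoef(a, b)[0, 1])
    return np.array(correlations)

# Usage with synthetic data: two "subjects" sharing a common component.
rng = np.random.default_rng(0)
shared = np.sin(np.linspace(0, 8 * np.pi, 200))
subject_a = shared + 0.1 * rng.standard_normal(200)
subject_b = shared + 0.1 * rng.standard_normal(200)
isc = sliding_window_isc(subject_a, subject_b, window=40, step=20)
print(isc.shape)
```

Each entry of the returned array is one window's correlation, so plotting it over time would give a dynamic ISC trace of the kind the abstract describes.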

Paperid: 1076, https://arxiv.org/pdf/2504.03300.pdf

Abstract:
Human oversight requirements are a core component of the European AI Act and in AI governance. In this paper, we highlight key challenges in testing for compliance with these requirements. A central difficulty lies in balancing simple, but potentially ineffective checklist-based approaches with resource-intensive and context-sensitive empirical testing of the effectiveness of human oversight of AI. Questions regarding when to update compliance testing, the context-dependent nature of human oversight requirements, and difficult-to-operationalize standards further complicate compliance testing. We argue that these challenges illustrate broader challenges in the future of sociotechnical AI governance, i.e. a future that shifts from ensuring good technological products to good sociotechnical systems.

Paperid: 1077, https://arxiv.org/pdf/2504.03253.pdf

Abstract:
Wireless mouse rings offer subtle, reliable pointing interactions for wearable computing platforms. However, the small battery (below 27 mAh) in miniature rings restricts the ring's continuous lifespan to just 1-10 hours, because current low-power wireless communication such as BLE is power-consuming for the ring's continuous use. The ring's short lifespan frequently disrupts users' mouse use with the need for frequent charging. This paper presents picoRing mouse, enabling continuous ring-based mouse interaction with ultra-low-power ring-to-wristband wireless communication. picoRing mouse employs a coil-based impedance sensing technique named semi-passive inductive telemetry, allowing a wristband coil to capture a unique frequency response of a nearby ring coil via a sensitive inductive coupling between the coils. The ring coil converts the corresponding user's mouse input into the unique frequency response via a mouse-driven modulation system that consumes up to 449 µW. Therefore, the continuous use of picoRing mouse can last approximately 600 hours (at 8 hours of use per day) to 1000 hours (at 4 hours of use per day) on a single charge of a 27 mAh battery while supporting subtle thumb-to-index scrolling and pressing interactions in real-world wearable computing situations.

Paperid: 1078, https://arxiv.org/pdf/2504.03098.pdf

Abstract:
Directly completing tasks in dangerous or hazardous conditions is not always possible for humans, so these tasks are increasingly performed remotely by teleoperated robots. However, teleoperation is difficult since the operator feels a disconnect with the robot caused by missing feedback from several senses, including touch, and the lack of depth in the video feedback presented to the operator. To overcome this problem, the proposed system actively infers the operator's intent and provides assistance based on the predicted intent. Furthermore, a novel method of calculating confidence in the inferred intent modifies the human-in-the-loop control. The operator's gaze is employed to intuitively indicate the target before the manipulation with the robot begins. A potential field method is used to provide a guiding force towards the intended target, and a safety boundary reduces risk of damage. Modifying these assistances based on the confidence level in the operator's intent makes the control more natural, and gives the robot an intuitive understanding of its human master. Initial validation results show the ability of the system to improve accuracy and execution time and to reduce operator error.
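
The confidence-modulated guidance described above can be sketched as follows. This is a generic attractive potential field, not the paper's formulation: the linear scaling by confidence, the gain, the force cap, and the function name are all assumptions made for illustration.

```python
import numpy as np

def guiding_force(position, target, confidence, gain=1.0, max_force=5.0):
    """Attractive force toward the inferred target, scaled by intent confidence.

    The force is the negative gradient of the quadratic potential
    U = 0.5 * gain * ||target - position||^2, multiplied by a confidence
    value clipped to [0, 1], then capped at max_force in magnitude.
    """
    position = np.asarray(position, dtype=float)
    target = np.asarray(target, dtype=float)
    confidence = float(np.clip(confidence, 0.0, 1.0))
    force = confidence * gain * (target - position)
    magnitude = np.linalg.norm(force)
    if magnitude > max_force:
        force = force * (max_force / magnitude)
    return force

# Usage: low confidence yields weak assistance, high confidence stronger pull.
print(guiding_force([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], confidence=0.2))
print(guiding_force([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], confidence=0.9))
```

With zero confidence the force vanishes and the operator keeps full manual control. The cap is one simple way to keep assistance bounded far from the target.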

Paperid: 1079, https://arxiv.org/pdf/2504.02887.pdf

Abstract:
Open coding, a key inductive step in qualitative research, discovers and constructs concepts from human datasets. However, capturing extensive and nuanced aspects or "coding moments" can be challenging, especially with large discourse datasets. While some studies explore machine learning (ML)/Generative AI (GAI)'s potential for open coding, few evaluation studies exist. We compare open coding results by five recently published ML/GAI approaches and four human coders, using a dataset of online chat messages around a mobile learning software. Our systematic analysis reveals ML/GAI approaches' strengths and weaknesses, uncovering the complementary potential between humans and AI. Line-by-line AI approaches effectively identify content-based codes, while humans excel in interpreting conversational dynamics. We discussed how embedded analytical processes could shape the results of ML/GAI approaches. Instead of replacing humans in open coding, researchers should integrate AI with and according to their analytical processes, e.g., as parallel co-coders.

Paperid: 1080, https://arxiv.org/pdf/2504.02403.pdf

Abstract:
Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

Paperid: 1081, https://arxiv.org/pdf/2504.01708.pdf

Abstract:
As human-robot collaboration advances, natural and flexible communication methods are essential for effective robot control. Traditional methods relying on a single modality or rigid rules struggle with noisy or misaligned data as well as with object descriptions that do not perfectly fit the predefined object names (e.g. 'Pick that red object'). We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation based on fused voice and gesture inputs. Our approach merges multimodal data into a single unified sentence, which is then processed by the language model. We employ probabilistic embeddings to handle uncertainty and we integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects or vague verbal cues like "this"). We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information. Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios requiring more contextual knowledge, enabling more robust and flexible human-robot communication. Code and datasets are available at: http://imitrob.ciirc.cvut.cz/publications/transformerger.

Paperid: 1082, https://arxiv.org/pdf/2504.01700.pdf

Abstract:
Personalization in social robotics is critical for fostering effective human-robot interactions, yet systems often face the cold start problem, where initial user preferences or characteristics are unavailable. This paper proposes a novel framework called USER-LLM R1 for a user-aware conversational agent that addresses this challenge through dynamic user profiling and model initiation. Our approach integrates chain-of-thought (CoT) reasoning models to iteratively infer user preferences and vision-language models (VLMs) to initialize user profiles from multimodal inputs, enabling personalized interactions from the first encounter. Leveraging a Retrieval-Augmented Generation (RAG) architecture, the system dynamically refines user representations within an inherent CoT process, ensuring contextually relevant and adaptive responses. Evaluations on the ElderlyTech-VQA Bench demonstrate significant improvements in ROUGE-1 (+23.2%), ROUGE-2 (+0.6%), and ROUGE-L (+8%) F1 scores over state-of-the-art baselines, with ablation studies underscoring the impact of reasoning model size on performance. Human evaluations further validate the framework's efficacy, particularly for elderly users, where tailored responses enhance engagement and trust. Ethical considerations, including privacy preservation and bias mitigation, are rigorously discussed and addressed to ensure responsible deployment.
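
For readers unfamiliar with the reported metric, the sketch below computes a simplified ROUGE-1 F1 score from unigram overlap. It assumes lowercase whitespace tokenization with no stemming, so it will not reproduce the scores of any particular ROUGE package or of the paper's evaluation. The function name is hypothetical.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((candidate_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)

# Usage: a short candidate fully contained in a longer reference.
print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.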

Paperid: 1083, https://arxiv.org/pdf/2504.01293.pdf

Abstract:
Flying robots, such as quadrotor drones, offer new possibilities for human-robot interaction but often pose safety risks due to fast-spinning propellers, rigid structures, and noise. In contrast, lighter-than-air flapping-wing robots, inspired by animal movement, offer a soft, quiet, and touch-safe alternative. Building on these advantages, we present Cuddle-Fish, a soft flapping-wing floating robot designed for close-proximity interactions in indoor spaces. Through a user study with 24 participants, we explored their perceptions of the robot and experiences during a series of co-located demonstrations in which the robot moved near them. Results showed that participants felt safe, willingly engaged in touch-based interactions with the robot, and exhibited spontaneous affective behaviours, such as patting, stroking, hugging, and cheek-touching, without external prompting. They also reported positive emotional responses towards the robot. These findings suggest that the soft floating robot with flapping wings can serve as a novel and socially acceptable alternative to traditional rigid flying robots, opening new potential for applications in companionship, affective interaction, and play in everyday indoor environments.

Paperid: 1084, https://arxiv.org/pdf/2503.22610.pdf

Abstract:
This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.

Paperid: 1085, https://arxiv.org/pdf/2503.20916.pdf

Abstract:
In this project, we focus on human-robot interaction in caregiving scenarios like bathing, where physical contact is inevitable and necessary for proper task execution because force must be applied to the skin. Using finite element analysis, we designed a 3D-printed gripper combining positive and negative pressure for secure yet compliant handling. Preliminary tests showed it exerted a lower, more uniform pressure profile than a standard rigid gripper. In a user study, participants' trust in robots significantly increased after they experienced a brief bathing demonstration performed by a robotic arm equipped with the soft gripper. These results suggest that soft robotics can enhance perceived safety and acceptance in intimate caregiving scenarios.

Paperid: 1086, https://arxiv.org/pdf/2503.19240.pdf

Abstract:
Referring expression comprehension (REC) aims at achieving object localization based on natural language descriptions. However, existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions, hindering their application in real-world scenarios. In natural human-robot interactions, users often express their desires through individual states and intentions, accompanied by guiding gestures, rather than detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references contribute to improved localization performance, revealing that single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.

Paperid: 1087, https://arxiv.org/pdf/2503.17504.pdf

Abstract:
Autistic individuals often experience negative self-talk (NST), leading to increased anxiety and depression. While therapy is recommended, it presents challenges for many autistic individuals. Meanwhile, a growing number are turning to large language models (LLMs) for mental health support. To understand how autistic individuals perceive AI's role in coping with NST, we surveyed 200 autistic adults and interviewed practitioners. We also analyzed LLM responses to participants' hypothetical prompts about their NST. Our findings show that participants view LLMs as useful for managing NST by identifying and reframing negative thoughts. Both participants and practitioners recognize AI's potential to support therapy and emotional expression. Participants also expressed concerns about LLMs' understanding of neurodivergent thought patterns, particularly due to the neurotypical bias of LLMs. Practitioners critiqued LLMs' responses as overly wordy, vague, and overwhelming. This study contributes to the growing research on AI-assisted mental health support, with specific insights for supporting the autistic community.

Paperid: 1088, https://arxiv.org/pdf/2503.16524.pdf

Abstract:
Confusing or otherwise unhelpful learner feedback creates or perpetuates erroneous beliefs that the teacher and learner have of each other, thereby increasing the cognitive burden placed upon the human teacher. For example, the robot's feedback might cause the human to misunderstand what the learner knows about the learning objective or how the learner learns. At the same time -- and in addition to the learning objective -- the learner might misunderstand how the teacher perceives the learner's task knowledge and learning processes. To ease the teaching burden, the learner should provide feedback that accounts for these misunderstandings and elicits efficient teaching from the human. This work endows an AI learner with a Second-order Theory of Mind that models perceived rationality as a source for the erroneous beliefs a teacher and learner may have of one another. It also explores how a learner can ease the teaching burden and improve teacher efficacy if it selects feedback which accounts for its model of the teacher's beliefs about the learner and its learning objective.

Paperid: 1089, https://arxiv.org/pdf/2503.16451.pdf

Abstract:
Modeling human-like action-to-reaction generation has significant real-world applications, like human-robot interaction and games. Despite recent advancements in single-person motion generation, action-to-reaction generation remains challenging, due to the difficulty of directly predicting a reaction from an action sequence without prompts, and the absence of a unified representation that effectively encodes multi-person motion. To address these challenges, we introduce Think-Then-React (TTR), a large-language-model-based framework designed to generate human-like reactions. First, with our fine-grained multimodal training strategy, TTR is able to unify two processes during inference: a thinking process that explicitly infers action intentions and reasons about the corresponding reaction description, which serve as semantic prompts, and a reacting process that predicts reactions based on the input action and the inferred semantic prompts. Second, to effectively represent multi-person motion in language models, we propose a unified motion tokenizer that decouples egocentric pose and absolute space features, which effectively represents action and reaction motion with the same encoding. Extensive experiments demonstrate that TTR outperforms existing baselines, achieving significant improvements in evaluation metrics, such as reducing FID from 3.988 to 1.942.

Paperid: 1090, https://arxiv.org/pdf/2503.15569.pdf

Abstract:
Mixed-precision computing, a widely applied technique in AI, offers a larger trade-off space between accuracy and efficiency. The recently proposed Mixed-Precision Over-the-Air Federated Learning (MP-OTA-FL) enables clients to operate at appropriate precision levels based on their heterogeneous hardware, taking advantage of the larger trade-off space while covering the quantization overheads in the mixed-precision modulation scheme for the OTA aggregation process. A key to further exploring the potential of the MP-OTA-FL framework is the optimization of client precision levels. The choice of precision level hinges on multifaceted factors including hardware capability, potential client contribution, and user satisfaction, some of which can be difficult to define or quantify. In this paper, we propose a RAG-based User Profiling for precision planning framework that integrates retrieval-augmented LLMs and dynamic client profiling to optimize satisfaction and contributions. This includes a hybrid interface for gathering device/user insights and a RAG database storing historical quantization decisions with feedback. Experiments show that our method boosts satisfaction, energy savings, and global model accuracy in MP-OTA-FL systems.

Paperid: 1091, https://arxiv.org/pdf/2503.15524.pdf

Abstract:
In urban search and rescue (USAR) operations, communication between handlers and specially trained canines is crucial but often complicated by challenging environments and the specific behaviors canines are trained to exhibit when detecting a person. Since a USAR canine often works out of sight of the handler, the handler lacks awareness of the canine's location and situation, known as the 'sensemaking gap.' In this paper, we propose KHAIT, a novel approach to close the sensemaking gap and enhance USAR effectiveness by integrating object detection-based Artificial Intelligence (AI) and Augmented Reality (AR). Equipped with AI-powered cameras, edge computing, and AR headsets, KHAIT enables precise and rapid object detection from a canine's perspective, improving survivor localization. We evaluate this approach in a real-world USAR environment, demonstrating an average survival allocation time decrease of 22%, enhancing the speed and accuracy of operations.

Paperid: 1092, https://arxiv.org/pdf/2503.14999.pdf

Abstract:
Nowadays, touch remains essential for emotional conveyance and interpersonal communication as more interactions are mediated remotely. While many studies have discussed the effectiveness of using haptics to communicate emotions, incorporating affect into haptic design still faces challenges due to individual user tactile acuity and preferences. We assessed the conveying of emotions using a two-channel haptic display, emphasizing individual differences. First, 24 participants generated 187 haptic messages reflecting their immediate sentiments after watching 8 emotionally charged film clips. Afterwards, 19 participants were asked to identify emotions from haptic messages designed by themselves and others, yielding 593 samples. Our findings suggest potential links between haptic message decoding ability and emotional traits, particularly Emotional Competence (EC) and Affect Intensity Measure (AIM). Additionally, qualitative analysis revealed three strategies participants used to create touch messages: perceptive, empathetic, and metaphorical expression.

Paperid: 1093, https://arxiv.org/pdf/2503.14412.pdf

Abstract:
Social platforms have expanded opportunities for deliberation, with comments being used to inform one's opinion. However, using such information to form opinions is challenged by unsubstantiated or false content. To enhance the quality of opinion formation and potentially confer resistance to misinformation, we developed Iffy-Or-Not (ION), a browser extension that seeks to invoke critical thinking when reading texts. With three features guided by argumentation theory, ION highlights fallacious content, suggests diverse queries to probe them with, and offers deeper questions to consider and chat with others about. From a user study (N=18), we found that ION encourages users to be more attentive to the content, suggests queries that align with or are preferable to their own, and poses thought-provoking questions that expand their perspectives. However, some participants expressed aversion to ION due to misalignments with their information goals and thinking predispositions. Potential backfiring effects with ION are discussed.

Paperid: 1094, https://arxiv.org/pdf/2503.13817.pdf

Abstract:
Designing reward functions for continuous-control robotics often leads to subtle misalignments or reward hacking, especially in complex tasks. Preference-based RL mitigates some of these pitfalls by learning rewards from comparative feedback rather than hand-crafted signals, yet scaling human annotations remains challenging. Recent work uses Vision-Language Models (VLMs) to automate preference labeling, but a single final-state image generally fails to capture the agent's full motion. In this paper, we present a two-part solution that both improves feedback accuracy and better aligns reward learning with the agent's policy. First, we overlay trajectory sketches on final observations to reveal the path taken, allowing VLMs to provide more reliable preferences, improving preference accuracy by approximately 15-20% in metaworld tasks. Second, we regularize reward learning by incorporating the agent's performance, ensuring that the reward model is optimized based on data generated by the current policy; this addition boosts episode returns by 20-30% in locomotion tasks. Empirical studies on metaworld demonstrate that our method achieves, for instance, around 70-80% success rate in all tasks, compared to below 50% for standard approaches. These results underscore the efficacy of combining richer visual representations with agent-aware reward regularization.
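
As an illustration of "learning rewards from comparative feedback", the following is a minimal sketch of the Bradley-Terry style preference loss that is commonly used in preference-based RL. It is not taken from the paper: the function names are hypothetical, the per-step rewards are assumed to come from some learned reward model, and the abstract does not state which objective the authors actually use.

```python
import numpy as np

def preference_probability(rewards_a, rewards_b):
    """Probability that trajectory A is preferred over trajectory B.

    Assumption (not from the abstract): a Bradley-Terry model over the
    summed predicted rewards of each trajectory segment.
    """
    diff = np.sum(rewards_a) - np.sum(rewards_b)
    return 1.0 / (1.0 + np.exp(-diff))

def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and a label.

    label is 1.0 if the annotator (human or VLM) preferred A, 0.0 if B.
    """
    p = np.clip(preference_probability(rewards_a, rewards_b), 1e-12, 1.0 - 1e-12)
    return -(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))

# Hypothetical per-step rewards predicted for two trajectory segments.
segment_a = [0.5, 1.0, 1.5]
segment_b = [0.2, 0.1, 0.3]
loss_if_a_preferred = preference_loss(segment_a, segment_b, 1.0)
loss_if_b_preferred = preference_loss(segment_a, segment_b, 0.0)
```

Minimizing this loss over many labeled pairs pushes the reward model to assign a higher total reward to the preferred segment; here the loss is smaller when the label agrees with the segment that already has the higher summed reward.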

Paperid: 1095, https://arxiv.org/pdf/2503.13419.pdf

Abstract:
The synergy between virtual reality (VR) and artificial intelligence (AI), specifically deep learning (DL)-based cybersickness detection models, has ushered in unprecedented advancements in immersive experiences by automatically detecting cybersickness severity and adaptively applying various mitigation techniques, offering a smooth and comfortable VR experience. While this DL-enabled cybersickness detection method provides promising solutions for enhancing user experiences, it also introduces new risks since these models are vulnerable to adversarial attacks; a small perturbation of the input data that is visually undetectable to human observers can fool the cybersickness detection model and trigger unexpected mitigation, thus disrupting user immersive experiences (UIX) and even posing safety risks. In this paper, we present a new type of VR attack, i.e., a cybersickness attack, which successfully stops the triggering of cybersickness mitigation by fooling DL-based cybersickness detection models and dramatically hinders the UIX. Next, we propose a novel explainable artificial intelligence (XAI)-guided cybersickness attack detection framework to detect such attacks in VR to ensure UIX and a comfortable VR experience. We evaluate the proposed attack and the detection framework using two state-of-the-art open-source VR cybersickness datasets: Simulation 2021 and Gameplay dataset. Finally, to verify the effectiveness of our proposed method, we implement the attack and the XAI-based detection using a testbed with a custom-built VR roller coaster simulation with an HTC Vive Pro Eye headset and perform a user study. Our study shows that such an attack can dramatically hinder the UIX. However, our proposed XAI-guided cybersickness attack detection can successfully detect cybersickness attacks and trigger the proper mitigation, effectively reducing VR cybersickness.

Paperid: 1096, https://arxiv.org/pdf/2503.13240.pdf

Abstract:
Wireless body networks comprising battery-free on-body sensors and textile-based wireless readers can enable daily health monitoring and activity tracking by continuously monitoring physiological signals across the body. However, previous textile-based wireless networks made of coils or antennas have limited the data and power transmission area because covering the whole body results in undesirable levels of electromagnetic interactions with the body, degrading the scalability, power consumption, and data rate. Here, we report Full-body NFC, digitally-knitted electronic textiles based on a twin meander coil design that enables body-scale near-field communication (NFC) with battery-free sensor tags arbitrarily placed around the body. Full-body NFC features i) a meander coil that enhances the magnetic field intensity on the body's surface while suppressing undesired interactions with deep tissues, in addition to ii) a paired identical coil structure that enables highly-sensitive and motion-robust NFC using a differential architecture. Additionally, industrial digital knitting machines loaded with conductive yarn allow the integration of the Full-body NFC system into daily garments supporting approximately $70-80\%$ large-scale NFC-enabled area of the body. We demonstrate that Full-body NFC can achieve mW-class energy-efficient near-field sensor networks with hundreds of kbps-class NFC battery-free sensor tags occupying less than $0.3\%$ of the coverage area under severe body movements.

Paperid: 1097, https://arxiv.org/pdf/2503.12921.pdf

Abstract:
As Artificial Intelligence (AI) becomes more pervasive in various aspects of life, AI literacy is becoming a fundamental competency that enables individuals to move safely and competently in an AI-pervaded world. There is a growing need to measure this competency, e.g., to develop targeted educational interventions. Although several measurement tools already exist, many have limitations regarding subjective data collection methods, target group differentiation, validity, and integration of current developments such as Generative AI Literacy. This study develops and validates the AI Competency Objective Scale (AICOS) for measuring AI literacy objectively. The presented scale addresses weaknesses and offers a robust measurement approach that considers established competency and measurement models, captures central sub-competencies of AI literacy, and integrates the dimension of Generative AI Literacy. The AICOS provides a sound and comprehensive measure of AI literacy, and initial analyses show potential for a modular structure. Furthermore, a first edition of a short version of the AICOS is developed. Due to its methodological foundation, extensive validation, and integration of recent developments, the test represents a valuable resource for scientific research and practice in educational institutions and professional contexts. The AICOS significantly contributes to the development of standardized measurement instruments and enables the targeted assessment and development of AI skills in different target groups.

Paperid: 1098, https://arxiv.org/pdf/2503.12499.pdf

Abstract:
Consensus building is inherently challenging due to the diverse opinions held by stakeholders. Effective facilitation is crucial to support the consensus building process and enable efficient group decision making. However, the effectiveness of facilitation is often constrained by human factors such as limited experience and scalability. In this research, we propose a Parallel Thinking-based Facilitation Agent (PTFA) that facilitates online, text-based consensus building processes. The PTFA automatically collects textual posts and leverages large language models (LLMs) to perform all of the six distinct roles of the well-established Six Thinking Hats technique in parallel thinking. To illustrate the potential of PTFA, a pilot study was carried out and PTFA's ability in idea generation, emotional probing, and deeper analysis of ideas was demonstrated. Furthermore, a comprehensive dataset that contains not only the conversational content among the participants but also between the participants and the agent is constructed for future study.

Paperid: 1099, https://arxiv.org/pdf/2503.11918.pdf

Abstract:
Training robotic manipulation policies traditionally requires numerous demonstrations and/or environmental rollouts. While recent Imitation Learning (IL) and Reinforcement Learning (RL) methods have reduced the number of required demonstrations, they still rely on expert knowledge to collect high-quality data, limiting scalability and accessibility. We propose Sketch-to-Skill, a novel framework that leverages human-drawn 2D sketch trajectories to bootstrap and guide RL for robotic manipulation. Our approach extends beyond previous sketch-based methods, which were primarily focused on imitation learning or policy conditioning, limited to specific trained tasks. Sketch-to-Skill employs a Sketch-to-3D Trajectory Generator that translates 2D sketches into 3D trajectories, which are then used to autonomously collect initial demonstrations. We utilize these sketch-generated demonstrations in two ways: to pre-train an initial policy through behavior cloning and to refine this policy through RL with guided exploration. Experimental results demonstrate that Sketch-to-Skill achieves ~96% of the performance of the baseline model that leverages teleoperated demonstration data, while exceeding the performance of a pure reinforcement learning policy by ~170%, only from sketch inputs. This makes robotic manipulation learning more accessible and potentially broadens its applications across various domains.

Paperid: 1100, https://arxiv.org/pdf/2503.07279.pdf

Abstract:
Trust plays a fundamental role in shaping the willingness of users to engage and collaborate with artificial intelligence (AI) systems. Yet, measuring user trust remains challenging due to its complex and dynamic nature. While traditional survey methods provide trust levels for long conversations, they fail to capture its dynamic evolution during ongoing interactions. Here, we present VizTrust, which addresses this challenge by introducing a real-time visual analytics tool that leverages a multi-agent collaboration system to capture and analyze user trust dynamics in human-agent communication. Built on established human-computer trust scales (competence, integrity, benevolence, and predictability), VizTrust enables stakeholders to observe trust formation as it happens, identify patterns in trust development, and pinpoint specific interaction elements that influence trust. Our tool offers actionable insights into human-agent trust formation and evolution in real time through a dashboard, supporting the design of adaptive conversational agents that respond effectively to user trust signals.

Paperid: 1101, https://arxiv.org/pdf/2503.06463.pdf

Abstract:
As cannabis use has increased in recent years, researchers have come to rely on sophisticated machine learning models to predict cannabis use behavior and its impact on health. However, many artificial intelligence (AI) models lack transparency and interpretability due to their opaque nature, limiting their trust and adoption in real-world medical applications, such as clinical decision support systems (CDSS). To address this issue, this paper enhances algorithm explainability underlying CDSS by integrating multiple Explainable Artificial Intelligence (XAI) methods and applying causal inference techniques to clarify the model's predictive decisions under various scenarios. By providing deeper interpretability of the XAI outputs using Large Language Models (LLMs), we provide users with more personalized and accessible insights to overcome the challenges posed by AI's "black box" nature. Our system dynamically adjusts feedback based on user queries and emotional states, combining text-based sentiment analysis with real-time facial emotion recognition to ensure responses are empathetic, context-adaptive, and user-centered. This approach bridges the gap between the learning demands of interpretability and the need for intuitive understanding, enabling non-technical users such as clinicians and clinical researchers to interact effectively with AI models. Ultimately, this approach improves usability, enhances perceived trustworthiness, and increases the impact of CDSS in healthcare applications.

Paperid: 1102, https://arxiv.org/pdf/2503.06002.pdf

Abstract:
AI expansion has accelerated workplace adoption of new technologies. Yet, it is unclear whether and how knowledge workers are supported and trained to safely use AI. Inadequate training may lead to unrealized benefits if workers abandon tools, or perpetuate biases if workers misinterpret AI-based outcomes. In a workshop with 39 workers from 26 countries specializing in human resources, labor law, standards creation, and worker training, we explored questions and ideas they had about safely adopting AI. We held 17 follow-up interviews to further investigate what skills and training knowledge workers need to achieve safe and effective AI in practice. We synthesize nine training topics participants surfaced for knowledge workers related to challenges around understanding what AI is, misinterpreting outcomes, exacerbating biases, and worker rights. We reflect on how these training topics might be addressed under different contexts, imagine HCI research prototypes as potential training tools, and consider ways to ensure training does not perpetuate harmful values.

Paperid: 1103, https://arxiv.org/pdf/2503.04765.pdf

Abstract:
DeepSeek v3, developed in China, was released in December 2024, followed by Alibaba's Qwen 2.5 Max in January 2025 and Qwen3 235B in April 2025. These free and open-source models offer significant potential for academic writing and content creation. This study evaluates their academic writing performance by comparing them with ChatGPT, Gemini, Llama, Mistral, and Gemma. There is a critical gap in the literature concerning how extensively these tools can be utilized and their potential to generate original content in terms of quality, readability, and effectiveness. Using 40 papers on Digital Twin and Healthcare, texts were generated through AI tools based on posed questions and paraphrased abstracts. The generated content was analyzed using plagiarism detection, AI detection, word count comparisons, semantic similarity, and readability assessments. Results indicate that paraphrased abstracts showed higher plagiarism rates, while question-based responses also exceeded acceptable levels. AI detection tools consistently identified all outputs as AI-generated. Word count analysis revealed that all chatbots produced a sufficient volume of content. Semantic similarity tests showed a strong overlap between generated and original texts. However, readability assessments indicated that the texts were insufficient in terms of clarity and accessibility. This study comparatively highlights the potential and limitations of popular and latest large language models for academic writing. While these models generate substantial and semantically accurate content, concerns regarding plagiarism, AI detection, and readability must be addressed for their effective use in scholarly work.

Paperid: 1104, https://arxiv.org/pdf/2503.04190.pdf

Abstract:
Emotion recognition is critical for various applications such as early detection of mental health disorders and emotion-based smart home systems. Previous studies used various sensing methods for emotion recognition, such as wearable sensors, cameras, and microphones. However, these methods have limitations in long-term domestic use, including intrusiveness and privacy concerns. To overcome these limitations, this paper introduces a nonintrusive and privacy-friendly personalized emotion recognition system, EmotionVibe, which leverages footstep-induced floor vibrations for emotion recognition. The main idea of EmotionVibe is that individuals' emotional states influence their gait patterns, subsequently affecting the floor vibrations induced by their footsteps. However, there are two main research challenges: 1) the complex and indirect relationship between human emotions and footstep-induced floor vibrations and 2) the large between-person variations within the relationship between emotions and gait patterns. To address these challenges, we first empirically characterize this complex relationship and develop an emotion-sensitive feature set including gait-related and vibration-related features from footstep-induced floor vibrations. Furthermore, we personalize the emotion recognition system for each user by calculating gait similarities between the target person (i.e., the person whose emotions we aim to recognize) and those in the training dataset and assigning greater weights to training people with similar gait patterns in the loss function. We evaluated our system in a real-world walking experiment with 20 participants, summing up to 37,001 footstep samples. EmotionVibe achieved a mean absolute error (MAE) of 1.11 and 1.07 for valence and arousal score estimations, respectively, reflecting 19.0% and 25.7% error reduction compared to the baseline method.
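
The personalization step described above (weighting training people by gait similarity in the loss) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which similarity measure or weighting scheme is used, so cosine similarity, the softmax normalization, the `temperature` parameter, and all function names here are hypothetical choices.

```python
import numpy as np

def gait_similarity_weights(target_gait, train_gaits, temperature=1.0):
    """One weight per training person, larger for gaits closer to the target.

    Assumptions (not from the abstract): cosine similarity between gait
    feature vectors, turned into weights that sum to 1 with a softmax.
    """
    target = np.asarray(target_gait, dtype=float)
    train = np.asarray(train_gaits, dtype=float)
    sims = train @ target / (np.linalg.norm(train, axis=1) * np.linalg.norm(target))
    scaled = sims / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

def weighted_mae(predictions, targets, person_ids, person_weights):
    """Mean absolute error where each sample is weighted by its person's weight."""
    errors = np.abs(np.asarray(predictions, dtype=float) - np.asarray(targets, dtype=float))
    sample_weights = np.asarray(person_weights)[np.asarray(person_ids)]
    return float(np.sum(sample_weights * errors) / np.sum(sample_weights))

# Two hypothetical training people; person 0 walks like the target, person 1 does not.
weights = gait_similarity_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
loss = weighted_mae([1.0, 1.0], [0.0, 3.0], person_ids=[0, 1], person_weights=weights)
```

With this weighting, errors on samples from the similar-gait person dominate the loss, so the model is fitted mostly to people who resemble the target user.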

Paperid: 1105, https://arxiv.org/pdf/2503.02637.pdf

Abstract:
Crisis maps are regarded as crucial tools in crisis communication, as demonstrated during the COVID-19 pandemic and climate change crises. However, there is limited understanding of how public audiences engage with these maps and extract essential information. Our study investigates the sensemaking of young, digitally native viewers as they interact with crisis maps. We integrate frameworks from the learning sciences and human-data interaction to explore sensemaking through two empirical studies: a thematic analysis of online comments from a New York Times series on graph comprehension, and interviews with 18 participants from German-speaking regions. Our analysis categorizes sensemaking activities into established clusters: inspecting, engaging with content, and placing, and introduces responding personally to capture the affective dimension. We identify friction points connected to these clusters, including struggles with color concepts, responses to missing context, lack of personal connection, and distrust, offering insights for improving crisis communication to public audiences.

Paperid: 1106, https://arxiv.org/pdf/2503.00940.pdf

Abstract:
A growing body of work in Ethical AI attempts to capture human moral judgments through simple computational models. The key question we address in this work is whether such simple AI models capture the critical nuances of moral decision-making by focusing on the use case of kidney allocation. We conducted twenty interviews where participants explained their rationale for their judgments about who should receive a kidney. We observe participants: (a) value patients' morally-relevant attributes to different degrees; (b) use diverse decision-making processes, citing heuristics to reduce decision complexity; (c) can change their opinions; (d) sometimes lack confidence in their decisions (e.g., due to incomplete information); and (e) express enthusiasm and concern regarding AI assisting humans in kidney allocation decisions. Based on these findings, we discuss challenges of computationally modeling moral judgments as a stand-in for human input, highlight drawbacks of current approaches, and suggest future directions to address these issues.

Paperid: 1107, https://arxiv.org/pdf/2503.00124.pdf

Abstract:
Like most of NLP, models for human-centered NLP tasks -- tasks attempting to assess author-level information -- predominantly use representations derived from hidden states of Transformer-based LLMs. However, what component of the LM is used for the representation varies widely. Moreover, there is a need for Human Language Models (HuLMs) that implicitly model the author and provide a user-level hidden state. Here, we systematically evaluate different ways of representing documents and users using different LM and HuLM architectures to predict task outcomes as both dynamically changing states and averaged trait-like user-level attributes of valence, arousal, empathy, and distress. We find that representing documents as an average of the token hidden states performs the best generally. Further, while a user-level hidden state itself is rarely the best representation, we find its inclusion in the model strengthens token or document embeddings used to derive document- and user-level representations resulting in best performances.
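
The best-performing document representation reported above, an average of the token hidden states, can be sketched as masked mean pooling. This is a minimal illustration, not the authors' code: the array shapes, the padding mask, the function names, and the choice to average document embeddings into a user-level vector are assumptions made for the example.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token hidden states over non-padding positions.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

def user_embedding(document_embeddings):
    """Assumed aggregation: a user is the mean of their document embeddings."""
    return document_embeddings.mean(axis=0)

# One document with two real tokens and one padding token (values are made up).
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
doc_vec = mean_pool(hidden, mask)
user_vec = user_embedding(doc_vec)
```

The mask keeps padding tokens out of the average, so documents of different lengths in the same batch yield comparable embeddings.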

Paperid: 1108, https://arxiv.org/pdf/2502.18705.pdf

Abstract:
Social online games like Minecraft and Roblox have become increasingly integral to children's daily lives. Our study explores how children aged 8 to 13 create and customize avatars in these virtual environments. Through semi-structured interviews and gameplay observations with 48 participants, we investigate the motivations behind children's avatar-making. Our findings show that children's avatar creation is motivated by self-representation, experimenting with alter ego identities, fulfilling social needs, and improving in-game performance. In addition, designed monetization strategies play a role in shaping children's avatars. We identify the "wardrobe effect," where children create multiple avatars but typically use only one favorite consistently. We discuss the impact of cultural consumerism and how social games can support children's identity exploration while balancing self-expression and social conformity. This work contributes to understanding how avatars shape children's identity growth in social online games.

Paperid: 1109, https://arxiv.org/pdf/2502.18658.pdf

Abstract:
AI programming tools enable powerful code generation, and recent prototypes attempt to reduce user effort with proactive AI agents, but their impact on programming workflows remains unexplored. We introduce and evaluate Codellaborator, a design probe LLM agent that initiates programming assistance based on editor activities and task context. We explored three interface variants to assess trade-offs between increasingly salient AI support: prompt-only, proactive agent, and proactive agent with presence and context (Codellaborator). In a within-subject study (N=18), we find that proactive agents increase efficiency compared to the prompt-only paradigm, but also incur workflow disruptions. However, presence indicators and interaction context support alleviated disruptions and improved users' awareness of AI processes. We underscore trade-offs of Codellaborator on user control, ownership, and code understanding, emphasizing the need to adapt proactivity to programming processes. Our research contributes to the design exploration and evaluation of proactive AI systems, presenting design implications for AI-integrated programming workflows.

Paperid: 1110, https://arxiv.org/pdf/2502.15622.pdf

Abstract:
Asynchronous communication has become increasingly essential in the context of extended reality (XR), enabling users to interact and share information immersively without the constraints of simultaneous engagement. However, current XR systems often struggle to support effective asynchronous interactions, mainly due to limitations in contextual replay and navigation. This paper aims to address these limitations by introducing a novel system that enhances asynchronous communication in XR through the concept of MemoryPods, which allow users to record, annotate, and replay interactions with spatial and temporal accuracy. MemoryPods also feature AI-driven summarisation to ease cognitive load. A user evaluation conducted in a remote maintenance scenario demonstrated significant improvements in comprehension, highlighting the system's potential to transform collaboration in XR. The findings suggest broad applicability of the proposed system across various domains, including direct messaging, healthcare, education, remote collaboration, and training, offering a promising solution to the complexities of asynchronous communication in immersive environments.

Paperid: 1111, https://arxiv.org/pdf/2502.15604.pdf

Abstract:
This paper presents a detailed evaluation of a Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) to enhance information retrieval and instruction generation for maintenance personnel across diverse data formats. We assessed the performance of eight LLMs, emphasizing key metrics such as response speed and accuracy, which were quantified using BLEU and METEOR scores. Our findings reveal that advanced models like GPT-4 and GPT-4o-mini significantly outperform their counterparts, particularly when addressing complex queries requiring multi-format data integration. The results validate the system's ability to deliver timely and accurate responses, highlighting the potential of RAG frameworks to optimize maintenance operations. Future research will focus on refining retrieval techniques for these models and enhancing response generation, particularly for intricate scenarios, ultimately improving the system's practical applicability in dynamic real-world environments.

Paperid: 1112, https://arxiv.org/pdf/2502.15538.pdf

Abstract:
Despite the abundance of prior social strategies possessed by humans, there remains a paucity of research dedicated to their transfer and integration into social agents. Our proposed SOTOPIA-$Ω$ framework aims to bridge this gap, with a particular focus on enhancing the social capabilities of language agents. This framework dynamically injects multi-step reasoning strategies inspired by negotiation theory and two simple direct strategies into expert agents, thereby automating the construction of a high-quality social dialogue training corpus. Additionally, we introduce the concept of Social Instruction Following (S-IF) and propose two new S-IF evaluation metrics that complement social capability. We demonstrate that several 7B models trained on the high-quality corpus not only significantly surpass the expert agent (GPT-4) in achieving social goals but also enhance S-IF performance. Analysis and variant experiments validate the advantages of dynamic construction, which is especially effective at breaking an agent's prolonged deadlock.

Paperid: 1113, https://arxiv.org/pdf/2502.15039.pdf

Abstract:
This work introduces the design, implementation, and validation of a virtual reality (VR) experience aimed at promoting the inclusion of individuals with dyslexia in university settings. Unlike traditional awareness methods, this immersive approach offers a novel way to foster empathy by allowing participants to experience firsthand the challenges faced by students with dyslexia. Specifically, the experience raises awareness by exposing non-dyslexic individuals to the difficulties commonly encountered by dyslexic students. In the virtual environment, participants explore a virtual campus with multiple buildings, navigating between them while completing tasks and simultaneously encountering barriers that simulate some of the challenges faced by individuals with dyslexia. These barriers include reading signs with shifting letters, following directional arrows that may point incorrectly, and dealing with a lack of assistance. The campus is a comprehensive model featuring both indoor and outdoor spaces and supporting various modes of locomotion. To validate the experience, more than 30 non-dyslexic participants from the university environment, mainly professors and students, evaluated it through ad hoc satisfaction surveys. The results indicated heightened awareness of the barriers encountered by students with dyslexia, with participants deeming the experience a valuable tool for increasing visibility and fostering understanding of dyslexic students.

Paperid: 1114, https://arxiv.org/pdf/2502.14311.pdf

Abstract:
In AI-assisted decision-making, it is crucial but challenging for humans to appropriately rely on AI, especially in high-stakes domains such as finance and healthcare. This paper addresses this problem from a human-centered perspective by presenting an intervention for self-confidence shaping, designed to calibrate self-confidence at a targeted level. We first demonstrate the impact of self-confidence shaping by quantifying the upper-bound improvement in human-AI team performance. Our behavioral experiments with 121 participants show that self-confidence shaping can improve human-AI team performance by nearly 50% by mitigating both over- and under-reliance on AI. We then introduce a self-confidence prediction task to identify when our intervention is needed. Our results show that simple machine-learning models achieve 67% accuracy in predicting self-confidence. We further illustrate the feasibility of such interventions. The observed relationship between sentiment and self-confidence suggests that modifying sentiment could be a viable strategy for shaping self-confidence. Finally, we outline future research directions to support the deployment of self-confidence shaping in a real-world scenario for effective human-AI collaboration.

Paperid: 1115, https://arxiv.org/pdf/2502.13268.pdf

Abstract:
The reference to assumptions in how practitioners use or interact with machine learning (ML) systems is ubiquitous in HCI and responsible ML discourse. However, what remains unclear from prior works is the conceptualization of assumptions and how practitioners identify and handle assumptions throughout their workflows. This leads to confusion about what assumptions are and what needs to be done with them. We use the concept of an argument from Informal Logic, a branch of Philosophy, to offer a new perspective to understand and explicate the confusions surrounding assumptions. Through semi-structured interviews with 22 ML practitioners, we find what contributes most to these confusions is how independently assumptions are constructed, how reactively and reflectively they are handled, and how nebulously they are recorded. Our study brings the peripheral discussion of assumptions in ML to the center and presents recommendations for practitioners to better think about and work with assumptions.

Paperid: 1116, https://arxiv.org/pdf/2502.11273.pdf

Abstract:
Rideshare workers experience unpredictable working conditions due to gig work platforms' reliance on opaque AI and algorithmic systems. In response to these challenges, we found that labor organizers want data to help them advocate for legislation to increase the transparency and accountability of these platforms. To address this need, we collaborated with a Colorado-based rideshare union to develop FairFare, a tool that crowdsources and analyzes workers' data to estimate the take rate -- the percentage of the rider price retained by the rideshare platform. We deployed FairFare with our partner organization, which collaborated with us in collecting data on 76,000+ trips from 45 drivers over 18 months. During evaluation interviews, organizers reported that FairFare helped influence the bill language and passage of Colorado Senate Bill 24-75, which calls for greater transparency and data disclosure of platform operations, and helped create a national narrative. Finally, we reflect on the complexities of translating quantitative data into policy outcomes, the nature of community-based audits, and design implications for future transparency tools.
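
The take rate defined in the abstract above can be illustrated with a small sketch. It assumes the retained amount is simply the rider price minus the driver's pay, and it ignores tips, tolls, and taxes; the function names and the sample trips are hypothetical and are not taken from FairFare.

```python
from typing import Iterable, Tuple


def take_rate(rider_price: float, driver_pay: float) -> float:
    """Fraction of the rider price retained by the platform for one trip.

    Assumption: retained amount = rider_price - driver_pay.
    """
    if rider_price <= 0:
        raise ValueError("rider_price must be positive")
    return (rider_price - driver_pay) / rider_price


def aggregate_take_rate(trips: Iterable[Tuple[float, float]]) -> float:
    """Take rate over many (rider_price, driver_pay) trips, weighted by price."""
    total_price = 0.0
    total_pay = 0.0
    for rider_price, driver_pay in trips:
        total_price += rider_price
        total_pay += driver_pay
    if total_price <= 0:
        raise ValueError("total rider price must be positive")
    return (total_price - total_pay) / total_price


# Hypothetical trips: (rider price, driver pay) in dollars.
sample_trips = [(20.0, 15.0), (10.0, 6.0)]
single = take_rate(20.0, 15.0)
overall = aggregate_take_rate(sample_trips)
print(f"single trip: {single:.0%}, overall: {overall:.0%}")
```

Summing prices and pay before dividing weights each trip by its price, so a few expensive trips are not drowned out by many cheap ones; averaging per-trip rates would be a different, equally defensible choice.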

Paperid: 1117, https://arxiv.org/pdf/2502.10638.pdf

Abstract:
Good writing is a dynamic process of knowledge transformation, where writers refine and evolve ideas through planning, translating, and reviewing. Generative AI-powered writing tools can enhance this process but may also disrupt the natural flow of writing, such as when using LLMs for complex tasks like restructuring content across different sections or creating smooth transitions. We introduce Script&Shift, a layered interface paradigm designed to minimize these disruptions by aligning writing intents with LLM capabilities to support diverse content development and rhetorical strategies. By bridging envisioning, semantic, and articulatory distances, Script&Shift's interactions allow writers to leverage LLMs for various content development tasks (scripting) and experiment with diverse organization strategies while tailoring their writing for different audiences (shifting). This approach preserves creative control while encouraging divergent and iterative writing. Our evaluation shows that Script&Shift enables writers to creatively and efficiently incorporate LLMs while preserving a natural flow of composition.

Paperid: 1118, https://arxiv.org/pdf/2502.10636.pdf

Abstract:
The integration of vision-language models into robotic systems constitutes a significant advancement in enabling machines to interact with their surroundings in a more intuitive manner. While VLMs offer rich multimodal reasoning, existing approaches lack user-specific adaptability, often relying on generic interaction paradigms that fail to account for individual behavioral, contextual, or socio-emotional nuances. When customization is attempted, ethical concerns arise from unmitigated biases in user data, risking exclusion or unfair treatment. To address these dual challenges, we propose User-VLM 360°, a holistic framework integrating multimodal user modeling with bias-aware optimization. Our approach features: (1) user-aware tuning that adapts interactions in real time using visual-linguistic signals; (2) bias mitigation via preference optimization; and (3) curated 360° socio-emotive interaction datasets annotated with demographic, emotion, and relational metadata. Evaluations across eight benchmarks demonstrate state-of-the-art results: +35.3% F1 in personalized VQA, +47.5% F1 in facial features understanding, 15% bias reduction, and 30X speedup over baselines. Ablation studies confirm component efficacy, and deployment on the Pepper robot validates real-time adaptability across diverse users. We open-source parameter-efficient 3B/10B models and an ethical verification framework for responsible adaptation.

Paperid: 1119, https://arxiv.org/pdf/2502.10533.pdf

Abstract:
Learning to Defer (L2D) trains autonomous systems to handle straightforward cases while deferring uncertain ones to human experts. Recent advancements in this field have introduced methods that offer flexibility to unseen experts at test time. However, we find these approaches struggle to generalise to experts with behaviours not seen during training, require extensive human annotation, and lack mechanisms for incorporating prior knowledge of expert capabilities. To address these challenges, we introduce Expert-Agnostic Learning to Defer (EA-L2D), a novel L2D framework that employs a Bayesian approach to model expert behaviour in an expert-agnostic fashion. Across benchmark medical imaging datasets (HAM10000, Blood Cells, Retinal OCT, and Liver Tumours), EA-L2D significantly outperforms prior methods on unseen experts, achieving up to a 28% relative improvement, while also matching or exceeding state-of-the-art performance on seen experts.
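
The general idea of deferring to an expert whose ability is modelled in a Bayesian way can be sketched with a toy example. This is not the EA-L2D method, whose details the abstract does not give; it is a generic Beta-Bernoulli illustration in which prior knowledge of an expert enters as pseudo-counts, and every name in it (`BetaExpert`, `should_defer`) is hypothetical.

```python
class BetaExpert:
    """Beta-Bernoulli belief about how often an expert is correct.

    Assumption: expert correctness is treated as i.i.d. Bernoulli, which is
    a simplification chosen for illustration only.
    """

    def __init__(self, prior_correct: float = 1.0, prior_wrong: float = 1.0):
        # Pseudo-counts encode prior knowledge of the expert's capability.
        self.alpha = prior_correct
        self.beta = prior_wrong

    def update(self, correct: bool) -> None:
        """Fold one observed expert decision into the posterior."""
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean_accuracy(self) -> float:
        """Posterior mean of the expert's accuracy."""
        return self.alpha / (self.alpha + self.beta)


def should_defer(model_confidence: float, expert: BetaExpert) -> bool:
    """Defer when the expert is expected to be right more often than the model."""
    return expert.mean_accuracy() > model_confidence


# Hypothetical expert believed to be right about 8 times out of 10.
expert = BetaExpert(prior_correct=8.0, prior_wrong=2.0)
print(should_defer(0.70, expert))  # model is less sure than the expert
print(should_defer(0.90, expert))  # model is more sure than the expert
```

Because the prior is expressed as pseudo-counts, a previously unseen expert can start from a weak prior and be refined with a handful of observed decisions, which is the kind of low-annotation setting the abstract highlights.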

Paperid: 1120, https://arxiv.org/pdf/2502.09776.pdf

Abstract:
Challenges in engagement with digital mental health (DMH) tools are commonly addressed through technical enhancements and algorithmic interventions. This paper shifts the focus towards the role of users' broader social context as a significant factor in engagement. Through an eight-week text messaging program aimed at enhancing psychological wellbeing, we recruited 20 participants to help us identify situational engagement disruptors (SEDs), including personal responsibilities, professional obligations, and unexpected health issues. In follow-up design workshops with 25 participants, we explored potential solutions that address such SEDs: prioritizing self-care through structured goal-setting, alternative framings for disengagement, and utilization of external resources. Our findings challenge conventional perspectives on engagement and offer actionable design implications for future DMH tools.

Paperid: 1121, https://arxiv.org/pdf/2502.08893.pdf

Abstract:
Ride-sharing services are revolutionizing urban mobility while simultaneously raising significant concerns regarding fairness and driver equity. This study employs the Chicago Trip Network Provider dataset to investigate disparities in ride-sharing earnings between 2018 and 2023. Our analysis reveals marked temporal shifts, including an earnings surge in early 2021 followed by fluctuations and a decline in inflation-adjusted income, as well as pronounced spatial disparities, with drivers in Central and airport regions earning substantially more than those in peripheral areas. Recognizing the limitations of trip-level data, we introduce a novel trip-driver assignment algorithm to reconstruct plausible daily work patterns, uncovering distinct driver clusters with varied earning profiles. Notably, drivers operating during late-evening and overnight hours secure higher per-trip and hourly rates, while emerging groups in low-demand regions face significant earnings deficits. Our findings call for more transparent pricing models and a re-examination of platform design to promote equitable driver outcomes.

Paperid: 1122, https://arxiv.org/pdf/2502.08554.pdf

Abstract:
Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs.

Paperid: 1123, https://arxiv.org/pdf/2502.08513.pdf

Abstract:
Animated virtual reality (VR) stories, combining the presence of VR and the artistry of computer animation, offer a compelling way to deliver messages and evoke emotions. Motivated by the growing demand for immersive narrative experiences, more creators are creating animated VR stories. However, a holistic understanding of their creation processes and challenges involved in crafting these stories is still limited. Based on semi-structured interviews with 21 animated VR story creators, we identify ten common stages in their end-to-end creation processes, ranging from idea generation to evaluation, which form diverse workflows that are story-driven or visual-driven. Additionally, we highlight nine unique issues that arise during the creation process, such as a lack of reference material for multi-element plots, the absence of specific functionalities for story integration, and inadequate support for audience evaluation. We compare the creation of animated VR stories to general XR applications and distill several future research opportunities.

Paperid: 1124, https://arxiv.org/pdf/2502.07077.pdf

Abstract:
The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.

Paperid: 1125, https://arxiv.org/pdf/2502.06197.pdf

Abstract:
Large Language Models (LLMs) have been widely used to support ideation in the writing process. However, whether generating ideas with the help of LLMs leads to idea fixation or idea expansion is unclear. This study examines how different timings of LLM usage - either at the beginning or after independent ideation - affect people's perceptions and ideation outcomes in a writing task. In a controlled experiment with 60 participants, we found that using LLMs from the beginning reduced the number of original ideas and lowered creative self-efficacy and self-credit, mediated by changes in autonomy and ownership. We discuss the challenges and opportunities associated with using LLMs to assist in idea generation. We propose delaying the use of LLMs to support ideation while considering users' self-efficacy, autonomy, and ownership of the ideation outcomes.

Paperid: 1126, https://arxiv.org/pdf/2502.06075.pdf

Abstract:
Mental-illness stigma is a persistent social problem, hampering both treatment-seeking and recovery. Accordingly, there is a pressing need to understand it more clearly, but analyzing the relevant data is highly labor-intensive. Therefore, we designed a chatbot to engage participants in conversations; coded those conversations qualitatively with AI assistance; and, based on those coding results, built causal knowledge graphs to decode stigma. The results we obtained from 1,002 participants demonstrate that conversation with our chatbot can elicit rich information about people's attitudes toward depression, while our AI-assisted coding was strongly consistent with human-expert coding. Our novel approach combining large language models (LLMs) and causal knowledge graphs uncovered patterns in individual responses and illustrated the interrelationships of psychological constructs in the dataset as a whole. The paper also discusses these findings' implications for HCI researchers in developing digital interventions, decomposing human psychological constructs, and fostering inclusive attitudes.

Paperid: 1127, https://arxiv.org/pdf/2502.06069.pdf

Abstract:
To enhance focused eating and dining socialization, previous Human-Food Interaction research has indicated that external devices can support these dining objectives and immersion. However, methods that focus on the food itself and the diners themselves have remained underdeveloped. In this study, we integrated biofeedback with food, utilizing diners' heart rates as a source of the food's appearance to promote focused eating and dining socialization. By employing LED lights, we dynamically displayed diners' real-time physiological signals through the transparency of the food. Results revealed significant effects on various aspects of dining immersion, such as awareness perceptions, attractiveness, attentiveness to each bite, and emotional bonds with the food. Furthermore, to promote dining socialization, we established a "Sharing Bio-Sync Food" dining system to strengthen emotional connections between diners. Based on these findings, we developed tableware that integrates biofeedback into the culinary experience.

Paperid: 1128, https://arxiv.org/pdf/2502.05661.pdf

Abstract:
Sign languages are essential for the Deaf and Hard-of-Hearing (DHH) community. Sign language generation systems have the potential to support communication by translating from written languages, such as English, into signed videos. However, current systems often fail to meet user needs due to poor translation of grammatical structures, the absence of facial cues and body language, and insufficient visual and motion fidelity. We address these challenges by building on recent advances in LLMs and video generation models to translate English sentences into natural-looking AI ASL signers. The text component of our model extracts information for manual and non-manual components of ASL, which are used to synthesize skeletal pose sequences and corresponding video frames. Our findings from a user study with 30 DHH participants and thorough technical evaluations demonstrate significant progress and identify critical areas necessary to meet user needs.

Paperid: 1129, https://arxiv.org/pdf/2502.04379.pdf

Abstract:
Can out-of-the-box pretrained Large Language Models (LLMs) detect human affect successfully when observing a video? To address this question, for the first time, we comprehensively evaluate the capacity of popular LLMs to annotate and successfully predict continuous affect annotations of videos when prompted by a sequence of text and video frames in a multimodal fashion. In particular, we test LLMs' ability to correctly label changes of in-game engagement in 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 2,400 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains, they generally fall behind in capturing continuous experience annotations provided by humans. We examine some of the underlying causes for the relatively poor overall performance, highlight the cases where LLMs exceed expectations, and draw a roadmap for the further exploration of automated emotion labelling via LLMs.

Paperid: 1130, https://arxiv.org/pdf/2502.03564.pdf

Abstract:
Effective visual accessibility in Virtual Reality (VR) is crucial for Blind and Low Vision (BLV) users. However, designing visual accessibility systems is challenging due to the complexity of 3D VR environments and the need for techniques that can be easily retrofitted into existing applications. While prior work has studied how to enhance or translate visual information, the advancement of Vision Language Models (VLMs) provides an exciting opportunity to advance the scene interpretation capability of current systems. This paper presents EnVisionVR, an accessibility tool for VR scene interpretation. Through a formative study of usability barriers, we confirmed the lack of visual accessibility features as a key barrier for BLV users of VR content and applications. In response, we designed and developed EnVisionVR, a novel visual accessibility system leveraging a VLM, voice input and multimodal feedback for scene interpretation and virtual object interaction in VR. An evaluation with 12 BLV users demonstrated that EnVisionVR significantly improved their ability to locate virtual objects, effectively supporting scene understanding and object interaction.

Paperid: 1131, https://arxiv.org/pdf/2502.02725.pdf

Abstract:
Wearable technology has significantly improved the quality of life for older adults, and the emergence of on-body, movable robots presents new opportunities to further enhance well-being. Yet, the interaction design for these robots remains under-explored, particularly from the perspective of older adults. We present findings from a two-phase co-design process involving 13 older adults to uncover design principles for on-body robots for this population. We identify a rich spectrum of potential applications and characterize a design space to inform how on-body robots should be built for older adults. Our findings highlight the importance of considering factors like co-presence, embodiment, and multi-modal communication. Our work offers design insights to facilitate the integration of on-body robots into daily life and underscores the value of involving older adults in the co-design process to promote usability and acceptance of emerging wearable robotic technologies.

Paperid: 1132, https://arxiv.org/pdf/2502.01801.pdf

Abstract:
Older adults have increasing difficulty with retrospective memory, hindering their abilities to perform daily activities and posing stress on caregivers to ensure their wellbeing. Recent developments in Artificial Intelligence (AI) and large context-aware multimodal models offer an opportunity to create memory support systems that assist older adults with common issues like object finding. This paper discusses the development of an AI-based, wearable memory assistant, MemPal, that helps older adults with a common problem, finding lost objects at home, and presents results from tests of the system in older adults' own homes. Using visual context from a wearable camera, the multimodal LLM system creates a real-time automated text diary of the person's activities for memory support purposes, offering object retrieval assistance using a voice-based interface. The system is designed to support additional use cases like context-based proactive safety reminders and recall of past actions. We report on a quantitative and qualitative study with N=15 older adults within their own homes that showed improved performance of object finding with audio-based assistance compared to no aid and positive overall user perceptions on the designed system. We discuss further applications of MemPal's design as a multi-purpose memory aid and future design guidelines to adapt memory assistants to older adults' unique needs.

Paperid: 1133, https://arxiv.org/pdf/2502.00682.pdf

Abstract:
The progress in generative AI has fueled AI-powered tools like co-pilots and assistants to provision better guidance, particularly during data analysis. However, research on guidance has not yet examined the perceived efficacy of the source from which guidance is offered and the impact of this source on the user's perception and usage of guidance. We ask whether users perceive all guidance sources as equal, with particular interest in three sources: (i) AI, (ii) human expert, and (iii) a group of human analysts. As a benchmark, we consider a fourth source, (iv) unattributed guidance, where guidance is provided without attribution to any source, enabling isolation of and comparison with the effects of source-specific guidance. We design a five-condition between-subjects study, with one condition for each of the four guidance sources and an additional (v) no-guidance condition, which serves as a baseline to evaluate the influence of any kind of guidance. We situate our study in a custom data preparation and analysis tool wherein we task users to select relevant attributes from an unfamiliar dataset to inform a business report. Depending on the assigned condition, users can request guidance, which the system then provides in the form of attribute suggestions. To ensure internal validity, we control for the quality of guidance across source-conditions. Through several metrics of usage and perception, we statistically test five preregistered hypotheses and report on additional analysis. We find that the source of guidance matters to users, but not in a manner that matches received wisdom. For instance, users utilize guidance differently at various stages of analysis, including expressing varying levels of regret, despite receiving guidance of similar quality. Notably, users in the AI condition reported both higher post-task benefit and regret.

Paperid: 1134, https://arxiv.org/pdf/2502.00283.pdf

Abstract:
Generative Artificial Intelligence (Generative AI) is a collection of AI technologies that can generate new information such as texts and images. With its strong capabilities, Generative AI has been actively studied in creative design processes. However, limited studies have explored the roles of humans and Generative AI in conceptual design processes, leaving a gap for human-AI collaboration investigation. To address this gap, this study uncovers the contributions of different Generative AI technologies in assisting humans in the conceptual design process. Novice designers completed two design tasks with or without the assistance of Generative AI. Results revealed that Generative AI primarily assists humans in problem definition and idea generation stages, while idea selection and evaluation remain predominantly human-led. Additionally, with Generative AI assistance, the idea selection and evaluation stages were further enhanced. Based on the findings, we discuss the role of Generative AI in human-AI collaboration and implications for enhancing future conceptual design support with Generative AI assistance.

Paperid: 1135, https://arxiv.org/pdf/2501.18506.pdf

Abstract:
With the rapid advancements in Artificial Intelligence (AI), autonomous agents are increasingly expected to manage complex situations where learning-enabled algorithms are vital. However, the integration of these advanced algorithms poses significant challenges, especially concerning safety and reliability. This research emphasizes the importance of incorporating human-machine collaboration into the systems engineering process to design learning-enabled increasingly autonomous systems (LEIAS). Our proposed LEIAS architecture emphasizes communication representation and pilot preference learning to boost operational safety. Leveraging the Soar cognitive architecture, the system merges symbolic decision logic with numeric decision preferences enhanced through reinforcement learning. A core aspect of this approach is transparency; the LEIAS provides pilots with a comprehensive, interpretable view of the system's state, encompassing detailed evaluations of sensor reliability, including GPS, IMU, and LIDAR data. This multi-sensor assessment is critical for diagnosing discrepancies and maintaining trust. Additionally, the system learns and adapts to pilot preferences, enabling responsive, context-driven decision-making. Autonomy is incrementally escalated based on necessity, ensuring pilots retain control in standard scenarios and receive assistance only when required. Simulation studies were conducted in Microsoft's XPlane simulation environment to validate this architecture's efficacy, showcasing its performance in managing sensor anomalies and enhancing human-machine collaboration, ultimately advancing safety in complex operational environments.

Paperid: 1136, https://arxiv.org/pdf/2501.17747.pdf

Abstract:
While learning programming languages is crucial for software engineers, mastering the necessary tools is equally important. To facilitate this, JetBrains recently released the JetBrains Academy plugin, which customizes the IDE for learners, allowing tutors to create courses entirely within the IDE. In this work, we provide the first exploratory study of this learning format. We carried out eight one-hour interviews with students and developers who completed at least one course using the plugin, inquiring about their experience with the format, the IDE features they used, and the current shortcomings. Our results indicate that learning inside the IDE is overall welcomed by the learners, allowing them to study in a more realistic setting, using features such as debugging and code analysis, which are crucial for real software development. With the collected results and the analysis of the current drawbacks, we aim to contribute to teaching students more practical skills.

Paperid: 1137, https://arxiv.org/pdf/2501.17258.pdf

Abstract:
Conversational AI agents are commonly applied within single-user, turn-taking scenarios. The interaction mechanics of these scenarios are trivial: when the user enters a message, the AI agent produces a response. However, the interaction dynamics are more complex within group settings. How should an agent behave in these settings? We report on two experiments aimed at uncovering users' experiences of an AI agent's participation within a group, in the context of group ideation (brainstorming). In the first study, participants benefited from and preferred having the AI agent in the group, but participants disliked when the agent seemed to dominate the conversation and they desired various controls over its interactive behaviors. In the second study, we created functional controls over the agent's behavior, operable by group members, to validate their utility and probe for additional requirements. Integrating our findings across both studies, we developed a taxonomy of controls for when, what, and where a conversational AI agent in a group should respond, who can control its behavior, and how those controls are specified and implemented. Our taxonomy is intended to aid AI creators to think through important considerations in the design of mixed-initiative conversational agents.

Paperid: 1138, https://arxiv.org/pdf/2501.16674.pdf

Abstract:
Wireless mouse rings offer subtle, reliable pointing interactions for wearable computing platforms, but the small battery (below 27 mAh) in these miniature rings restricts continuous use to just 1-2 hours due to the power consumption of current low-power wireless communication such as BLE. The picoRing mouse addresses this by enabling continuous ring-based mouse interaction with ultra-low-power ring-to-wristband wireless communication through a coil-based impedance sensing method called semi-passive inductive telemetry. This allows a wristband coil to capture a unique frequency response of a nearby ring coil via sensitive inductive coupling, converting the user's mouse input into the unique frequency response via an 820 uW mouse-driven modulation module. Thus, continuous use of the picoRing mouse can potentially last over 92 hours on a single charge of a 20 mAh battery while supporting subtle scrolling and pressing interactions.
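
As a rough sanity check on the lifetime figure above, the sketch below converts battery capacity and average power draw into runtime. The abstract does not state the cell's nominal voltage; the 3.7 V used here is an assumption typical of lithium-polymer cells, the function name is illustrative, and the calculation assumes the 820 uW modulation module is the only load.

```python
def runtime_hours(capacity_mah: float, voltage_v: float, power_uw: float) -> float:
    """Estimate runtime in hours for a constant average power draw."""
    # mAh * V gives mWh; multiply by 1000 to express the energy in uWh.
    energy_uwh = capacity_mah * voltage_v * 1000.0
    return energy_uwh / power_uw


# Assumed 3.7 V nominal voltage (not stated in the abstract).
hours = runtime_hours(capacity_mah=20, voltage_v=3.7, power_uw=820)
print(f"{hours:.1f} h")
```

Under these assumptions the estimate is roughly 90 hours, the same order as the reported figure; a nominal voltage of 3.8 V would put it just above 92 hours.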

Paperid: 1139, https://arxiv.org/pdf/2501.16010.pdf

Abstract:
Students often take digital notes during live lectures, but current methods can be slow when capturing information from lecture slides or the instructor's speech, and require them to focus on their devices, leading to distractions and missing important details. This paper explores supporting live lecture note-taking with mixed reality (MR) to quickly capture lecture information and take notes while staying engaged with the lecture. A survey and interviews with university students revealed common note-taking behaviors and challenges to inform the design. We present MaRginalia to provide digital note-taking with a stylus tablet and MR headset. Students can take notes with an MR representation of the tablet, lecture slides, and audio transcript without looking down at their device. When preferred, students can also perform detailed interactions by looking at the physical tablet. We demonstrate the feasibility and usefulness of MaRginalia and MR-based note-taking in a user study with 12 students.

Paperid: 1140, https://arxiv.org/pdf/2501.14917.pdf

Abstract:
Investigating NLP through a philosophical lens has recently caught researchers' eyes, as it bridges computational methods with classical schools of philosophy. This paper introduces a philosophical framework inspired by the Hegelian Dialectic to enable LLMs' self-reflection, utilizing a self-dialectical approach to emulate internal critiques and synthesize new scientific ideas (spanning domains such as mathematics, physics, and more). Additionally, we explore the effect of generation temperature in LLMs by introducing a dynamic annealing approach, which encourages creativity in the early stages and gradually focuses on refinement and nuance, as well as a constant-temperature strategy. Furthermore, we implement a Multi-Agent Majority Voting (MAMV) strategy to assess the validity and novelty of the generated ideas, which proves useful in the absence of domain experts. We also evaluate the effectiveness of our method in generating novel scientific ideas and improving LLMs' reasoning capabilities. Our experiments demonstrate promising results in ideation, along with significant improvements in mathematical and symbolic reasoning.
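
The two mechanisms named above, a temperature that anneals from exploratory to focused and a majority vote over several judging agents, can be sketched as follows. This is only an illustration: the linear schedule, its endpoint values, and the function names are assumptions, and the paper's actual annealing rule and MAMV procedure may differ.

```python
from collections import Counter


def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 1.2, t_end: float = 0.2) -> float:
    """Linearly interpolate from t_start (exploratory) down to t_end (focused)."""
    if total_steps <= 1:
        return t_end
    frac = step / (total_steps - 1)
    return t_start + (t_end - t_start) * frac


def majority_vote(votes: list[str]) -> str:
    """Return the most common verdict among the agents' votes."""
    return Counter(votes).most_common(1)[0][0]


# Temperatures for a hypothetical five-round dialectic, highest first.
schedule = [round(annealed_temperature(s, 5), 2) for s in range(5)]
print(schedule)

# Three hypothetical judge agents vote on one generated idea.
print(majority_vote(["valid", "invalid", "valid"]))
```

A constant-temperature baseline corresponds to setting `t_start` equal to `t_end`.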

Paperid: 1141, https://arxiv.org/pdf/2501.14719.pdf

Abstract:
Equitable access to reliable health information is vital for public health, but the quality of online health resources varies by language, raising concerns about inconsistencies in Large Language Models (LLMs) for healthcare. In this study, we examine the consistency of responses provided by LLMs to health-related questions across English, German, Turkish, and Chinese. We largely expand the HealthFC dataset by categorizing health-related questions by disease type and broadening its multilingual scope with Turkish and Chinese translations. We reveal significant inconsistencies in responses that could spread healthcare misinformation. Our main contributions are 1) a multilingual health-related inquiry dataset with meta-information on disease categories, and 2) a novel prompt-based evaluation workflow that enables sub-dimensional comparisons between two languages through parsing. Our findings highlight key challenges in deploying LLM-based tools in multilingual contexts and emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.

Paperid: 1142, https://arxiv.org/pdf/2501.14110.pdf

Abstract:
The susceptibility to biases and discrimination is a pressing issue in today's labor markets. Though digital recruitment systems play an increasingly significant role in human resources management, thus far we lack a systematic understanding of human-centered design principles for fair online hiring. This work proposes a fair recruitment framework based on job-seekers' fairness concerns shared in an online forum. Through qualitative analysis, we uncover four overarching themes of job-seekers' fairness concerns, including discrimination against sensitive attributes, interaction biases, improper interpretations of qualifications, and power imbalance. Based on these findings, we derive design implications for algorithms and interfaces in recruitment systems, integrating them into a fair recruitment framework spanning different hiring stages and fairness considerations.

Paperid: 1143, https://arxiv.org/pdf/2501.13836.pdf

Abstract:
Most social media users come from the Global South, where harmful content usually appears in local languages. Yet, AI-driven moderation systems struggle with low-resource languages spoken in these regions. Through semi-structured interviews with 22 AI experts working on harmful content detection in four low-resource languages--Tamil (South Asia), Swahili (East Africa), Maghrebi Arabic (North Africa), and Quechua (South America)--we examine systemic issues in building automated moderation tools for these languages. Our findings reveal that beyond data scarcity, socio-political factors such as tech companies' monopoly on user data and lack of investment in moderation for low-profit Global South markets exacerbate historic inequities. Even if more data were available, the English-centric and data-intensive design of language models and preprocessing techniques overlooks the need to design for morphologically complex, linguistically diverse, and code-mixed languages. We argue these limitations are not just technical gaps caused by "data scarcity" but reflect structural inequities, rooted in colonial suppression of non-Western languages. We discuss multi-stakeholder approaches to strengthen local research capacity, democratize data access, and support language-aware solutions to improve automated moderation for low-resource languages.

Paperid: 1144, https://arxiv.org/pdf/2501.12868.pdf

Abstract:
Complementary collaboration between humans and AI is essential for human-AI decision making. One feasible approach to achieving it involves accounting for the calibrated confidence levels of both AI and users. However, this process would likely be made more difficult by the fact that AI confidence may influence users' self-confidence and its calibration. To explore these dynamics, we conducted a randomized behavioral experiment. Our results indicate that in human-AI decision-making, users' self-confidence aligns with AI confidence and such alignment can persist even after AI ceases to be involved. This alignment then affects users' self-confidence calibration. We also found the presence of real-time correctness feedback of decisions reduced the degree of alignment. These findings suggest that users' self-confidence is not independent of AI confidence, which practitioners aiming to achieve better human-AI collaboration need to be aware of. We call for research focusing on the alignment of human cognition and behavior with AI.

Paperid: 1145, https://arxiv.org/pdf/2501.12603.pdf

Abstract:
In this paper, we address the challenges of documenting early digital artifacts in collections built to offer historical context for future generations. Through insights from active community members (N=20), we examine current archival needs and obstacles. We assess the potential of the CIDOC Conceptual Reference Model (CRM) for categorizing fragmented digital data. Despite its complexity, CIDOC-CRM proves logical, human-readable, and adaptable, enabling archivists to select a minimal yet effective set of building blocks to empower community-led heritage projects.

Paperid: 1146, https://arxiv.org/pdf/2501.10909.pdf

Abstract:
In recent years, the rapid development of AI systems has brought about the benefits of intelligent services but also concerns about security and reliability. By fostering appropriate user reliance on an AI system, both complementary team performance and reduced human workload can be achieved. Previous empirical studies have extensively analyzed the impact of factors spanning task, system, and human behavior on user trust and appropriate reliance in the context of one-step decision making. However, user reliance on AI systems in tasks with complex semantics that require multi-step workflows remains under-explored. Inspired by recent work on task decomposition with large language models, we propose to investigate the impact of a novel Multi-Step Transparent (MST) decision workflow on user reliance behaviors. We conducted an empirical study (N = 233) of AI-assisted decision making in composite fact-checking tasks (i.e., fact-checking tasks that entail multiple sub-fact verification steps). Our findings demonstrate that human-AI collaboration with an MST decision workflow can outperform one-step collaboration in specific contexts (e.g., when advice from an AI system is misleading). Further analysis of the appropriate reliance at fine-grained levels indicates that an MST decision workflow can be effective when users demonstrate a relatively high consideration of the intermediate steps. Our work highlights that there is no one-size-fits-all decision workflow that can help obtain optimal human-AI collaboration. Our insights help deepen the understanding of the role of decision workflows in facilitating appropriate reliance. We synthesize important implications for designing effective means to facilitate appropriate reliance on AI systems in composite tasks, positioning opportunities for the human-centered AI and broader HCI communities.

Paperid: 1147, https://arxiv.org/pdf/2501.10803.pdf

Abstract:
Online fraud substantially harms individuals, and seniors are disproportionately targeted. While family is crucial for seniors, little research has empirically examined how family members protect seniors against fraud. To address this gap, we employed an inductive thematic analysis of 124 posts and 16,872 comments on RedNote (Xiaohongshu), exploring the family support ecosystem for senior-targeted online fraud in China. We develop a taxonomy of senior-targeted online fraud from a familial perspective, revealing that younger members often spot frauds that are hard for seniors to detect, such as unusual charges. Younger family members fulfill multiple safeguarding roles, including preventative measures, fraud identification, fraud persuasion, loss recovery, and education. They also encounter numerous challenges, such as seniors' refusal of help and considerable mental and financial stress. Drawing on these, we develop a conceptual framework to characterize family support in senior-targeted fraud, and outline implications for researchers and practitioners to consider the broader stakeholder ecosystem and cultural aspects.

Paperid: 1148, https://arxiv.org/pdf/2501.07180.pdf

Abstract:
Advances in vitreoretinal robotic surgery enable precise techniques for gene therapies. This study evaluates three robotic approaches using the 7-DoF robotic arm for docking a micro-precise tool to a trocar: fully co-manipulated, hybrid co-manipulated/teleoperated, and hybrid with camera assistance. The fully co-manipulated approach was the fastest but had a 42% success rate. Hybrid methods showed higher success rates (91.6% and 100%) and completed tasks within 2 minutes. NASA Task Load Index (TLX) assessments indicated lower physical demand and effort for hybrid approaches.

Paperid: 1149, https://arxiv.org/pdf/2501.05461.pdf

Abstract:
Social Anxiety Disorder (SAD) significantly impacts individuals' daily lives and relationships. The conventional methods for SAD detection involve physical consultations and self-reported questionnaires, but they have limitations such as time consumption and bias. This paper introduces video analysis as a promising method for early SAD detection. Specifically, we present a new approach for detecting SAD in individuals from various bodily features extracted from the video data. We conducted a study to collect video data of 92 participants performing impromptu speech in a controlled environment. Using the video data, we studied the behavioral change in participants' head, body, eye gaze, and action units. By applying a range of machine learning and deep learning algorithms, we achieved an accuracy rate of up to 74% in classifying participants as SAD or non-SAD. Video-based SAD detection offers a non-intrusive and scalable approach that can be deployed in real-time, potentially enhancing early detection and intervention capabilities.

Paperid: 1150, https://arxiv.org/pdf/2501.04156.pdf

Abstract:
Pilots operating modern cockpits often face high cognitive demands due to complex interfaces and multitasking requirements, which can lead to overload and decreased performance. This study introduces AdaptiveCoPilot, a neuroadaptive guidance system that adapts visual, auditory, and textual cues in real time based on the pilot's cognitive workload, measured via functional Near-Infrared Spectroscopy (fNIRS). A formative study with expert pilots (N=3) identified adaptive rules for modality switching and information load adjustments during preflight tasks. These insights informed the design of AdaptiveCoPilot, which integrates cognitive state assessments, behavioral data, and adaptive strategies within a context-aware Large Language Model (LLM). The system was evaluated in a virtual reality (VR) simulated cockpit with licensed pilots (N=8), comparing its performance against baseline and random feedback conditions. The results indicate that the pilots using AdaptiveCoPilot exhibited higher rates of optimal cognitive load states on the facets of working memory and perception, along with reduced task completion times. Based on the formative study, experimental findings, and qualitative interviews, we propose a set of strategies for future development of neuroadaptive pilot guidance systems and highlight the potential of neuroadaptive systems to enhance pilot performance and safety in aviation environments.

Paperid: 1151, https://arxiv.org/pdf/2501.03596.pdf

Abstract:
Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interfaces (BCIs) facilitate high-throughput target image detection by identifying event-related potentials (ERPs) evoked in EEG signals. The RSVP-BCI systems effectively detect single-class targets within a stream of images but have limited applicability in scenarios that require detecting multiple target categories. Multi-class RSVP-BCI systems address this limitation by simultaneously identifying the presence of a target and distinguishing its category. However, existing multi-class RSVP decoding algorithms predominantly rely on single-modality EEG decoding, which restricts their performance improvement due to the high similarity between ERPs evoked by different target categories. In this work, we introduce eye movement (EM) modality into multi-class RSVP decoding and explore EEG and EM fusion to enhance decoding performance. First, we design three independent multi-class target RSVP tasks and build an open-source dataset comprising EEG and EM signals from 43 subjects. Then, we propose the Multi-class Target RSVP EEG and EM fusion Network (MTREE-Net) to enhance multi-class RSVP decoding. Specifically, a dual-complementary module is proposed to strengthen the differentiation of uni-modal features across categories. To improve multi-modal fusion performance, we adopt a dynamic reweighting fusion strategy guided by theoretically derived modality contribution ratios. Furthermore, we reduce the misclassification of non-target samples through knowledge transfer between two hierarchical classifiers. Extensive experiments demonstrate the feasibility of integrating EM signals into multi-class RSVP decoding and highlight the superior performance of MTREE-Net compared to existing RSVP decoding methods. The proposed MTREE-Net and open-source dataset provide a promising framework for developing practical multi-class RSVP-BCI systems.

Paperid: 1152, https://arxiv.org/pdf/2501.03594.pdf

Abstract:
Urban segregation refers to the physical and social division of people, often driving inequalities within cities and exacerbating socioeconomic and racial tensions. While most studies focus on residential spaces, they often neglect segregation across "activity spaces" where people work, socialize, and engage in leisure. Human mobility data offers new opportunities to analyze broader segregation patterns, encompassing both residential and activity spaces, but challenges existing methods in capturing the complexity and local nuances of urban segregation. This work introduces InclusiViz, a novel visual analytics system for multi-level analysis of urban segregation, facilitating the development of targeted, data-driven interventions. Specifically, we developed a deep learning model to predict mobility patterns across social groups using environmental features, augmented with explainable AI to reveal how these features influence segregation. The system integrates innovative visualizations that allow users to explore segregation patterns from broad overviews to fine-grained detail and evaluate urban planning interventions with real-time feedback. We conducted a quantitative evaluation to validate the model's accuracy and efficiency. Two case studies and expert interviews with social scientists and urban analysts demonstrated the system's effectiveness, highlighting its potential to guide urban planning toward more inclusive cities.

Paperid: 1153, https://arxiv.org/pdf/2501.02348.pdf

Abstract:
Complex problem-solving requires cognitive flexibility--the capacity to entertain multiple perspectives while preserving their distinctiveness. This flexibility replicates the "wisdom of crowds" within a single individual, allowing them to "think with many minds." While mental simulation enables imagined deliberation, cognitive constraints limit its effectiveness. We propose synthetic deliberation, a Large Language Model (LLM)-based method that simulates discourse between agents embodying diverse perspectives, as a solution. Using a custom GPT-based model, we showcase its benefits: concurrent processing of multiple viewpoints without cognitive degradation, parallel exploration of perspectives, and precise control over viewpoint synthesis. By externalizing the deliberative process and distributing cognitive labor between parallel search and integration, synthetic deliberation transcends mental simulation's limitations. This approach shows promise for strategic planning, policymaking, and conflict resolution.

Paperid: 1154, https://arxiv.org/pdf/2501.00038.pdf

Abstract:
Emotion recognition and touch gesture decoding are crucial for advancing human-robot interaction (HRI), especially in social environments where emotional cues and tactile perception play important roles. However, many humanoid robots, such as Pepper, Nao, and Furhat, lack full-body tactile skin, limiting their ability to engage in touch-based emotional and gesture interactions. In addition, vision-based emotion recognition methods usually face strict GDPR compliance challenges due to the need to collect personal facial data. To address these limitations and avoid privacy issues, this paper studies the potential of using the sounds produced by touching during HRI to recognise tactile gestures and classify emotions along the arousal and valence dimensions. Using a dataset of tactile gestures and emotional interactions from 28 participants with the humanoid robot Pepper, we design an audio-only lightweight touch gesture and emotion recognition model with only 0.24M parameters, 0.94MB model size, and 0.7G FLOPs. Experimental results show that the proposed sound-based touch gesture and emotion recognition model effectively recognises the arousal and valence states of different emotions, as well as various tactile gestures, when the input audio length varies. The proposed model is low-latency and achieves results similar to those of well-known pretrained audio neural networks (PANNs), but with far fewer FLOPs and parameters and a much smaller model size.

Paperid: 1155, https://arxiv.org/pdf/2506.23549.pdf

Abstract:
Effective coordination among artificial agents in dynamic and uncertain environments remains a significant challenge in multi-agent systems. Existing approaches, such as self-play and population-based methods, either generalize poorly to unseen partners or require extensive training. To overcome these limitations, we propose Coordination Transformers (CooT), a novel in-context coordination framework that uses recent interaction histories to adapt to unseen partners rapidly. Unlike previous approaches that primarily aim to increase the diversity of training partners, CooT explicitly focuses on adapting to new partner behaviors by predicting actions aligned with observed partner interactions. Trained on interaction trajectories collected from diverse pairs of agents with complementary behaviors, CooT quickly learns effective coordination strategies without explicit supervision or fine-tuning. Evaluations on the Overcooked benchmark demonstrate that CooT significantly outperforms baseline methods in coordination tasks involving previously unseen partners. Human evaluations further confirm CooT as the most effective collaborative partner, while extensive ablations highlight its robustness, flexibility, and sensitivity to context in multi-agent scenarios.

Paperid: 1156, https://arxiv.org/pdf/2506.22583.pdf

Abstract:
Level of detail (LOD) is widely used to control visual feedback in interactive applications. LOD control is typically based on perception at threshold - the conditions in which a stimulus first becomes perceivable. Yet most LOD manipulations are quite perceivable and occur well above threshold. Moreover, research shows that supra-threshold perception differs drastically from perception at threshold. In that case, should supra-threshold LOD control also differ from LOD control at threshold? In two experiments, we examine supra-threshold LOD control in the visual periphery and find that indeed, it should differ drastically from LOD control at threshold. Specifically, we find that LOD must support a task-dependent level of reliable perceptibility. Above that level, perceptibility of LOD control manipulations should be minimized, and detail contrast is a better predictor of perceptibility than detail size. Below that level, perceptibility must be maximized, and LOD should be improved as eccentricity rises or contrast drops. This directly contradicts prevailing threshold-based LOD control schemes, and strongly suggests a reexamination of LOD control for foveal display.

Paperid: 1157, https://arxiv.org/pdf/2506.21962.pdf

Abstract:
Generative AI assistants have been widely used in front-end programming. However, besides code writing, developers often encounter the need to generate animation effects. As novices in creative design without the assistance of professional designers, developers typically face difficulties in describing, designing, and implementing desired animations. To address this issue, we conducted a formative study (N=6) to identify the challenges that code developers face when dealing with animation design issues. Then, we introduce AnyAni, a human-AI collaborative system that supports front-end developers in the ideation, manipulation, and implementation of animation effects. The system combines the assistance of generative AI in creative design by adopting a nonlinear workflow for iterative animation development. In addition, developers can understand and learn the code generated for implementing animations through various interactive methods. A user study (N=9) demonstrated the usability of AnyAni in animation effect creation support for developers.

Paperid: 1158, https://arxiv.org/pdf/2506.21456.pdf

Abstract:
Previous work has demonstrated the utility of reductions in the level of detail (LOD) in the periphery of head-tracked, large field of view displays. This paper provides a psychophysically based model, centered around an eye/head movement tradeoff, that explains the effectiveness of peripheral degradation and suggests how peripherally degraded displays should be designed. An experiment was performed to evaluate how the shape and area of the high detail central area (inset) in peripherally degraded displays affect search performance. Results indicated that inset shape is not a significant factor in performance. Inset area, however, was significant: performance with displays subtending at least 30 degrees of horizontal and vertical angle was not significantly different from performance with an undegraded display. These results agreed with the proposed model.

Paperid: 1159, https://arxiv.org/pdf/2506.21441.pdf

Abstract:
A paradigm for the design of systems that manage level of detail in virtual environments is proposed. As an example of the prototyping step in this paradigm, a user study was performed to evaluate the effectiveness of high detail insets used with head-mounted displays. Ten subjects were given a simple search task that required the location and identification of a single target object. All subjects used seven different displays (the independent variable), varying in inset size and peripheral detail, to perform this task. Frame rate, target location, subject input method, and order of display use were all controlled. Primary dependent measures were search time on trials with correct identification, and the percentage of all trials correctly identified. ANOVAs of the results showed that insetless, high detail displays did not lead to significantly different search times or accuracies than displays with insets. In fact, only the insetless, low detail display returned significantly different results. Further research is being performed to examine the effect of varying task complexity, inset size, and level of detail.

Paperid: 1160, https://arxiv.org/pdf/2506.20795.pdf

Abstract:
Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, thus paving a possible way towards reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.

Paperid: 1161, https://arxiv.org/pdf/2506.19899.pdf

Abstract:
Social engineering attacks delivered via email, commonly known as phishing, represent a persistent cybersecurity threat leading to significant organizational incidents and data breaches. Although many organizations train employees on phishing, often mandated by compliance requirements, the real-world effectiveness of this training remains debated. To contribute to evidence-based cybersecurity policy, we conducted a large-scale reproduction study (N = 12,511) at a US-based financial technology firm. Our experimental design refined prior work by comparing training modalities in operational environments, validating NIST's standardized phishing difficulty measurement, and introducing novel organizational-level temporal resilience metrics. Echoing prior work, training interventions showed no significant main effects on click rates (p=0.450) or reporting rates (p=0.417), with negligible effect sizes. However, we found that the NIST Phish Scale predicted user behavior, with click rates increasing from 7.0% for easy lures to 15.0% for hard lures. Our organizational-level resilience result was mixed: 36-55% of campaigns achieved "inoculation" patterns where reports preceded clicks, but training did not significantly improve organizational-level temporal protection. In summary, our results confirm the ineffectiveness of current phishing training approaches while offering a refined study design for future work.
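
One way to read the "inoculation" pattern above is as a per-campaign comparison of the first report time against the first click time. The sketch below encodes that reading; the exact definition used in the study may differ, and the function name, the handling of missing events, and the example timestamps are all illustrative assumptions rather than study data.

```python
from datetime import datetime
from typing import Optional


def is_inoculated(first_report: Optional[datetime],
                  first_click: Optional[datetime]) -> bool:
    """True if a report arrived before any click (assumed definition)."""
    if first_report is None:
        return False  # nobody reported, so there was no early warning
    if first_click is None:
        return True   # reported and never clicked
    return first_report < first_click


# Made-up campaigns: (first report time, first click time).
campaigns = [
    (datetime(2025, 1, 6, 9, 5), datetime(2025, 1, 6, 9, 20)),    # report first
    (datetime(2025, 1, 7, 10, 30), datetime(2025, 1, 7, 10, 2)),  # click first
    (None, datetime(2025, 1, 8, 14, 0)),                          # never reported
]
share = sum(is_inoculated(r, c) for r, c in campaigns) / len(campaigns)
print(f"{share:.0%} of campaigns show the pattern")
```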

Paperid: 1162, https://arxiv.org/pdf/2506.19524.pdf

Abstract:
Healthcare professionals (HCPs) face increasing levels of stress and burnout. Technological wellbeing interventions provide accessible and flexible support for HCPs. While most studies have focused on mobile- and web-based programs, alternative technologies like virtual reality (VR), augmented reality (AR), tangible interfaces, and embodied technologies are emerging as engaging and effective tools for wellbeing interventions. However, there is still a lack of research on how such technologies are perceived among HCPs. This study explored HCPs' perceptions and preferences for various types of wellbeing technologies, by conducting a 2-phase co-design study involving 26 HCPs in idea generation, concept evaluation, prototype testing, and design iteration. From our findings, HCPs highly valued the potential of technologies to support mental health with immersive, embodied, and collective experiences. Furthermore, we provided design recommendations for wellbeing technologies for HCPs that sustain user engagement by meeting their needs for autonomy, competence, and relatedness in the experiences.

Paperid: 1163, https://arxiv.org/pdf/2506.19210.pdf

Abstract:
Cerebral Visual Impairment (CVI) is set to be the leading cause of vision impairment, yet remains underrepresented in assistive technology research. Unlike ocular conditions, CVI affects higher-order visual processing, impacting object recognition, facial perception, and attention in complex environments. This paper presents a co-design study with two adults with CVI investigating how smart glasses, i.e. head-mounted extended reality displays, can support understanding and interaction with the immediate environment. Guided by the Double Diamond design framework, we conducted a two-week diary study, two ideation workshops, and ten iterative development sessions using the Apple Vision Pro. Our findings demonstrate that smart glasses can meaningfully address key challenges in locating objects, reading text, recognising people, engaging in conversations, and managing sensory stress. With the rapid advancement of smart glasses and increasing recognition of CVI as a distinct form of vision impairment, this research addresses a timely and under-explored intersection of technology and need.

Paperid: 1164, https://arxiv.org/pdf/2506.18727.pdf

Abstract:
Digitalization in nuclear power plant (NPP) control rooms is reshaping how operators interact with procedures and interface elements. However, existing computer-based procedures (CBPs) often lack semantic integration with human-system interfaces (HSIs), limiting their capacity to support intelligent automation and increasing the risk of human error, particularly under dynamic or complex operating conditions. In this study, we present AutoGraph, a knowledge-graph-based framework designed to formalize and automate procedure execution in digitalized NPP environments. AutoGraph integrates (1) a proposed HTRPM tracking module to capture operator interactions and interface element locations; (2) an Interface Element Knowledge Graph (IE-KG) encoding spatial, semantic, and structural properties of HSIs; (3) automatic mapping from textual procedures to executable interface paths; and (4) an execution engine that maps textual procedures to executable interface paths. This enables the identification of cognitively demanding multi-action steps and supports fully automated execution with minimal operator input. We validate the framework through representative control room scenarios, demonstrating significant reductions in task completion time and the potential to support real-time human reliability assessment. Further integration into dynamic HRA frameworks (e.g., COGMIF) and real-time decision support systems (e.g., DRIF) illustrates AutoGraph's extensibility in enhancing procedural safety and cognitive performance in complex socio-technical systems.
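
As a rough illustration of what "mapping a procedure step to an executable interface path" could look like, the sketch below runs a breadth-first search over a tiny navigation graph. This is not AutoGraph's algorithm or data model, which the abstract does not specify: the graph layout, the node names, and the function `shortest_interface_path` are all hypothetical.

```python
from collections import deque

def shortest_interface_path(graph, start, target):
    """Breadth-first search for the shortest click path between two interface elements.

    `graph` maps each element name to the elements reachable from it in one action.
    Returns the list of elements to visit, or None if the target is unreachable.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical interface graph; real HSIs would carry far richer properties.
example_graph = {
    "main_screen": ["pump_panel", "valve_panel"],
    "pump_panel": ["pump_a_start"],
    "valve_panel": [],
}

if __name__ == "__main__":
    print(shortest_interface_path(example_graph, "main_screen", "pump_a_start"))
```

Under this simplification, the length of the returned path gives a crude count of the actions a single procedure step requires, which is one way multi-action steps might be flagged.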

Paperid: 1165, https://arxiv.org/pdf/2506.12879.pdf

Abstract:
Despite the potential of generative AI (GenAI) design tools to enhance design processes, professionals often struggle to integrate AI into their workflows. Fundamental cognitive challenges include the need to specify all design criteria as distinct parameters upfront (intent formulation) and designers' reduced cognitive involvement in the design process due to cognitive offloading, which can lead to insufficient problem exploration, underspecification, and limited ability to evaluate outcomes. Motivated by these challenges, we envision novel metacognitive support agents that assist designers in working more reflectively with GenAI. To explore this vision, we conducted exploratory prototyping through a Wizard of Oz elicitation study with 20 mechanical designers probing multiple metacognitive support strategies. We found that agent-supported users created more feasible designs than non-supported users, with differing impacts between support strategies. Based on these findings, we discuss opportunities and tradeoffs of metacognitive support agents and considerations for future AI-based design tools.

Paperid: 1166, https://arxiv.org/pdf/2506.12540.pdf

Abstract:
Brain-computer interfaces (BCIs) offer significant therapeutic opportunities for a variety of neurophysiological and neuropsychiatric disorders and may one day lead to augmenting the cognition and decision-making of the healthy brain. However, existing regulatory frameworks designed for implantable medical devices (IMDs) are inadequate to address the unique ethical, legal, and social risks associated with next-generation networked brain-computer interfaces. In this article, we make nine recommendations to support developers in the design of BCIs and nine recommendations to support policymakers in the application of BCIs, drawing insights from the regulatory history of IMDs and principles from AI ethics. We begin by outlining the historical development of IMDs and the regulatory milestones that have shaped their oversight. Next, we summarize similarities between IMDs and emerging implantable BCIs, identifying existing provisions for their regulation. We then use two case studies of emerging cutting-edge BCIs, the HALO and SCALO computer systems, to highlight distinctive features in the design and application of next-generation BCIs arising from contemporary chip architectures, which necessitate reevaluating regulatory approaches. We identify critical ethical considerations for these BCIs, including unique conceptions of autonomy, identity, and mental privacy. Based on these insights, we suggest potential avenues for the ethical regulation of BCIs, emphasizing the importance of interdisciplinary collaboration and proactive mitigation of potential harms. The goal is to support the responsible design and application of new BCIs, ensuring their safe and ethical integration into medical practice.

Paperid: 1167, https://arxiv.org/pdf/2506.11276.pdf

Abstract:
Navigating large-scale online discussions is difficult due to the rapid pace and large volume of user-generated content. Prior work in CSCW has shown that moderators often struggle to follow multiple simultaneous discussions, track evolving conversations, and maintain contextual understanding--all of which hinder timely and effective moderation. While platforms like Reddit use threaded structures to organize discourse, deeply nested threads can still obscure discussions and make it difficult to grasp the overall trajectory of conversations. In this paper, we present an interactive system called Needle to support better navigation and comprehension of complex discourse within threaded discussions. Needle uses visual analytics to summarize key conversational metrics--such as activity, toxicity levels, and voting trends--over time, offering both high-level insights and detailed breakdowns of discussion threads. Through a user study with ten Reddit moderators, we find that Needle supports moderation by reducing cognitive load in making sense of large discussions, helping prioritize areas that need attention, and providing decision-making support. Based on our findings, we provide a set of design guidelines to inform future visualization-driven moderation tools and sociotechnical systems. To the best of our knowledge, Needle is one of the first systems to combine interactive visual analytics with human-in-the-loop moderation for threaded online discussions.

Paperid: 1168, https://arxiv.org/pdf/2506.11004.pdf

Abstract:
Dyslexia, affecting an estimated 10% to 20% of the global population, significantly impairs learning capabilities, highlighting the need for innovative and accessible diagnostic methods. This paper investigates the effectiveness of eye-tracking technology combined with machine learning algorithms as a cost-effective alternative for early dyslexia detection. By analyzing general eye movement patterns, including prolonged fixation durations and erratic saccades, we proposed an enhanced solution for determining eye-tracking-based dyslexia features. A Random Forest Classifier was then employed to detect dyslexia, achieving an accuracy of 88.58%. Additionally, hierarchical clustering methods were applied to identify varying severity levels of dyslexia. The analysis incorporates diverse methodologies across various populations and settings, demonstrating the potential of this technology to identify individuals with dyslexia, including those with borderline traits, through non-invasive means. Integrating eye-tracking with machine learning represents a significant advancement in the diagnostic process, offering a highly accurate and accessible method in clinical research.

Paperid: 1169, https://arxiv.org/pdf/2506.10762.pdf

Abstract:
Text animation, a foundational element in video creation, enables efficient and cost-effective communication, thriving in advertisements, journalism, and social media. However, traditional animation workflows present significant usability barriers for non-professionals, with intricate operational procedures severely hindering creative productivity. To address this, we propose a Large Language Model (LLM)-aided text animation editing system that enables real-time intent tracking and flexible editing. The system introduces an agent-based dual-stream pipeline that integrates context-aware inline suggestions and conversational guidance, and employs a semantic-animation mapping to facilitate LLM-driven creative intent translation. In addition, the system supports synchronized text-animation previews and parametric adjustments via unified controls to improve the editing workflow. A user study evaluates the system, highlighting its ability to help non-professional users complete animation workflows while validating the pipeline. The findings encourage further exploration of integrating LLMs into a comprehensive video creation workflow.

Paperid: 1170, https://arxiv.org/pdf/2506.09160.pdf

Abstract:
As AI chatbots become increasingly integrated in education, students are turning to these systems for guidance, feedback, and information. However, the anthropomorphic characteristics of these chatbots create ambiguity regarding whether students develop trust toward them as they would a human peer or instructor, based in interpersonal trust, or as they would any other piece of technology, based in technology trust. This ambiguity presents theoretical challenges, as interpersonal trust models may inappropriately ascribe human intentionality and morality to AI, while technology trust models were developed for non-social technologies, leaving their applicability to anthropomorphic systems unclear. To address this gap, we investigate how human-like and system-like trusting beliefs comparatively influence students' perceived enjoyment, trusting intention, behavioral intention to use, and perceived usefulness of an AI chatbot - factors associated with students' engagement and learning outcomes. Through partial least squares structural equation modeling, we found that human-like and system-like trust significantly influenced student perceptions, with varied effects. Human-like trust more strongly predicted trusting intention, while system-like trust better predicted behavioral intention and perceived usefulness. Both had similar effects on perceived enjoyment. Given the partial explanatory power of each type of trust, we propose that students develop a distinct form of trust with AI chatbots (human-AI trust) that differs from human-human and human-technology models of trust. Our findings highlight the need for new theoretical frameworks specific to human-AI trust and offer practical insights for fostering appropriately calibrated trust, which is critical for the effective adoption and pedagogical impact of AI in education.

Paperid: 1171, https://arxiv.org/pdf/2506.08467.pdf

Abstract:
The growing integration of AI tools in student design projects presents an unresolved challenge in HCI education: how should AI-generated content be cited and documented? Traditional citation frameworks -- grounded in credibility, retrievability, and authorship -- struggle to accommodate the dynamic and ephemeral nature of AI outputs. In this paper, we examine how undergraduate students in a UX design course approached AI usage and citation when given the freedom to integrate generative tools into their design process. Through qualitative analysis of 35 team projects and reflections from 175 students, we identify varied citation practices ranging from formal attribution to indirect or absent acknowledgment. These inconsistencies reveal gaps in existing frameworks and raise questions about authorship, assessment, and pedagogical transparency. We argue for rethinking AI citation as a reflective and pedagogical practice, one that supports metacognitive engagement by prompting students to critically evaluate how and why they used AI throughout the design process. We propose alternative strategies -- such as AI contribution statements and process-aware citation models -- that better align with the iterative and reflective nature of design education. This work invites educators to reconsider how citation practices can support meaningful student--AI collaboration.

Paperid: 1172, https://arxiv.org/pdf/2506.05879.pdf

Abstract:
Joint attention is a critical marker of early social-communicative development, yet remains difficult for caregivers to assess without expert guidance. In this work, we explore how multimodal large language models (MLLMs) can be aligned with the reasoning processes of speech-language pathologists (SLPs) to support the interpretation of everyday parent-child interactions. We conducted in-depth interviews and video annotation studies with three experienced SLPs to uncover how they evaluate joint attention based on three core behavioural cues: gaze, action, and vocalisation. Using these insights, we developed a two-stage MLLM-based system that first extracts fine-grained behavioural descriptions from video segments and then judges joint attention quality using expert-aligned prompts. Our evaluation across 26 parent-child interaction videos shows that MLLMs can achieve up to 85% accuracy in perceptual cue extraction and over 75% average precision in simulating expert judgement. We further propose design guidelines for building MLLM-based behaviour observation-judgement systems that align with SLPs, emphasising the structuring of behavioural cues, the construction of exemplar libraries grounded in expert annotations, and the need to personalise system responses based on developmental stage and neurotypical or atypical presentation. This work provides structured behavioural cues derived from SLP expertise, demonstrates the feasibility of aligning SLPs' observation and judgement using MLLMs, and offers practical design guidelines for building aligned systems to support parent-child interaction analysis.

Paperid: 1173, https://arxiv.org/pdf/2506.02622.pdf

Abstract:
Mixed Reality (MR) interfaces have been extensively explored for controlling mobile robots, but there is limited research on their application to managing teams of robots. This paper presents HORUS: Holistic Operational Reality for Unified Systems, a Mixed Reality interface offering a comprehensive set of tools for managing multiple mobile robots simultaneously. HORUS enables operators to monitor individual robot statuses, visualize sensor data projected in real time, and assign tasks to single robots, subsets of the team, or the entire group, all from a Mini-Map (Ground Station). The interface also provides different teleoperation modes: a mini-map mode that allows teleoperation while observing the robot model and its transform on the mini-map, and a semi-immersive mode that offers a flat, screen-like view in either single or stereo view (3D). We conducted a user study in which participants used HORUS to manage a team of mobile robots tasked with finding clues in an environment, simulating search and rescue tasks. This study compared HORUS's full-team management capabilities with individual robot teleoperation. The experiments validated the versatility and effectiveness of HORUS in multi-robot coordination, demonstrating its potential to advance human-robot collaboration in dynamic, team-based environments.

Paperid: 1174, https://arxiv.org/pdf/2505.24195.pdf

Abstract:
With more than 11 times as many pageviews as the next largest edition, English Wikipedia dominates global knowledge access relative to other language editions. Readers are prone to assuming that English Wikipedia is a superset of all language editions, leading many to prefer it even when their primary language is not English. Other language editions, however, comprise complementary facts rooted in their respective cultures and media environments, which are marginalized in English Wikipedia. While Wikipedia's user interface enables switching between language editions through its Interlanguage Link (ILL) system, it does not reveal to readers that other language editions contain valuable, complementary information. We present WikiGap, a system that surfaces complementary facts sourced from other Wikipedias within the English Wikipedia interface. Specifically, by combining a recent multilingual information-gap discovery method with a user-centered design, WikiGap enables access to complementary information from French, Russian, and Chinese Wikipedia. In a mixed-methods study (n=21), WikiGap significantly improved fact-finding accuracy, reduced task time, and received a 32-point higher usability score relative to Wikipedia's current ILL-based navigation system. Participants reported increased awareness of the availability of complementary information in non-English editions and reconsidered the completeness of English Wikipedia. WikiGap thus paves the way for improved epistemic equity across language editions.

Paperid: 1175, https://arxiv.org/pdf/2505.23576.pdf

Abstract:
Small Uncrewed Aerial Systems (sUAS) are increasingly deployed as autonomous swarms in search-and-rescue and other disaster-response scenarios. In these settings, they use computer vision (CV) to detect objects of interest and autonomously adapt their missions. However, traditional CV systems often struggle to recognize unfamiliar objects in open-world environments or to infer their relevance for mission planning. To address this, we incorporate large language models (LLMs) to reason about detected objects and their implications. While LLMs can offer valuable insights, they are also prone to hallucinations and may produce incorrect, misleading, or unsafe recommendations. To ensure safe and sensible decision-making under uncertainty, high-level decisions must be governed by cognitive guardrails. This article presents the design, simulation, and real-world integration of these guardrails for sUAS swarms in search-and-rescue missions.

Paperid: 1176, https://arxiv.org/pdf/2505.22983.pdf

Abstract:
Over the past decade, considerable research has investigated Vision-Based Assistive Technologies (VBAT) to support people with vision impairments to understand and interact with their immediate environment using machine learning, computer vision, image enhancement, and/or augmented/virtual reality. However, this has almost totally overlooked a growing demographic: people with Cerebral Visual Impairment (CVI). Unlike ocular vision impairments, CVI arises from damage to the brain's visual processing centres. Through a scoping review, this paper reveals a significant research gap in addressing the needs of this demographic. Three focus studies involving 7 participants with CVI explored the challenges, current strategies, and opportunities for VBAT. We also discussed the assistive technology needs of people with CVI compared with ocular low vision. Our findings highlight the opportunity for the Human-Computer Interaction and Assistive Technologies research community to explore and address this underrepresented domain, thereby enhancing the quality of life for people with CVI.

Paperid: 1177, https://arxiv.org/pdf/2505.21875.pdf

Abstract:
Over the past decade, considerable research has been directed towards assistive technologies to support people with vision impairments using machine learning, computer vision, image enhancement, and/or augmented/virtual reality. However, this has almost totally overlooked a growing demographic: people with Cerebral Visual Impairment (CVI). Unlike Ocular Vision Impairments (OVI), CVI arises from damage to the brain's visual processing centres. This paper introduces CVI and reveals a wide research gap in addressing the needs of this demographic. Through a scoping review, we identified 14 papers at the intersection of these technologies and CVI. Of these, only three papers described assistive technologies focused on people living with CVI, with the others focusing on diagnosis, understanding, simulation or rehabilitation. Our findings highlight the opportunity for the Human-Computer Interaction and Assistive Technologies research community to explore and address this underrepresented domain, thereby enhancing the quality of life for people with CVI.

Paperid: 1178, https://arxiv.org/pdf/2505.21512.pdf

Abstract:
Knowledge graphs (KGs) are powerful data structures, but exploring them effectively remains difficult for even expert users. Large language models (LLMs) are increasingly used to address this gap, yet little is known empirically about how their usage with KGs shapes user trust, exploration strategies, or downstream decision-making - raising key design challenges for LLM-based KG visual analysis systems. To study these effects, we developed LinkQ, a KG exploration system that converts natural language questions into structured queries with an LLM. We collaborated with KG experts to design five visual mechanisms that help users assess the accuracy of both KG queries and LLM responses: an LLM-KG state diagram that illustrates which stage of the exploration pipeline LinkQ is in, a query editor displaying the generated query paired with an LLM explanation, an entity-relation ID table showing extracted KG entities and relations with semantic descriptions, a query structure graph that depicts the path traversed in the KG, and an interactive graph visualization of query results. From a qualitative evaluation with 14 practitioners, we found that users - even KG experts - tended to overtrust LinkQ's outputs due to its "helpful" visualizations, even when the LLM was incorrect. Users exhibited distinct workflows depending on their prior familiarity with KGs and LLMs, challenging the assumption that these systems are one-size-fits-all - despite often being designed as if they are. Our findings highlight the risks of false trust in LLM-assisted data analysis tools and the need for further investigation into the role of visualization as a mitigation technique.

Paperid: 1179, https://arxiv.org/pdf/2505.20796.pdf

Abstract:
Post-traumatic stress disorder (PTSD) is associated with sudden, uncontrollable, and intense flashbacks of traumatic memories. Trauma exposure psychotherapy has proven effective in reducing the severity of trauma-related symptoms. It involves controlled recall of traumatic memories to train coping mechanisms for flashbacks and enable autobiographical integration of distressing experiences. In particular, exposure to visualizations of these memories supports successful recall. Although this approach is effective for various trauma types, it remains available for only a few. This is due to the lack of cost-efficient solutions for creating individualized exposure visualizations. This issue is particularly relevant for the treatment of Complex PTSD (CPTSD), where traumatic memories are highly individual and generic visualizations do not meet therapeutic needs. Generative Artificial Intelligence (GAI) offers a flexible and cost-effective alternative. GAI enables the creation of individualized exposure visualizations during therapy and, for the first time, allows patients to actively participate in the visualization process. While GAI opens new therapeutic perspectives and may improve access to trauma therapy, especially for CPTSD, it also introduces significant challenges and risks. The extreme uncertainty and lack of control that define both CPTSD and GAI raise concerns about feasibility and safety. To support safe and effective three-way communication, it is essential to understand the roles of patient, system, and therapist in exposure visualization and how each can contribute to safety. This paper outlines perspectives, challenges, and risks associated with the use of GAI in trauma therapy, with a focus on CPTSD.

Paperid: 1180, https://arxiv.org/pdf/2505.20711.pdf

Abstract:
The absence of explicit communication channels between automated vehicles (AVs) and other road users requires the use of external Human-Machine Interfaces (eHMIs) to convey messages effectively in uncertain scenarios. Currently, most eHMI studies employ predefined text messages and manually designed actions to perform these messages, which limits the real-world deployment of eHMIs, where adaptability in dynamic scenarios is essential. Given the generalizability and versatility of large language models (LLMs), they could potentially serve as automated action designers for the message-action design task. To validate this idea, we make three contributions: (1) We propose a pipeline that integrates LLMs and 3D renderers, using LLMs as action designers to generate executable actions for controlling eHMIs and rendering action clips. (2) We collect a user-rated Action-Design Scoring dataset comprising a total of 320 action sequences for eight intended messages and four representative eHMI modalities. The dataset validates that LLMs can translate intended messages into actions close to a human level, particularly for reasoning-enabled LLMs. (3) We introduce two automated raters, Action Reference Score (ARS) and Vision-Language Models (VLMs), to benchmark 18 LLMs, finding that the VLM aligns with human preferences yet varies across eHMI modalities.

Paperid: 1181, https://arxiv.org/pdf/2505.20667.pdf

Abstract:
A range of integrated modeling approaches have been developed to enable a holistic representation of business process logic together with all relevant business rules. These approaches address inherent problems with separate documentation of business process models and business rules. In this study, we explore how expert process workers make sense of the information provided through such integrated modeling approaches. To do so, we complement verbal protocol analysis with eye-tracking metrics to reveal nuanced user behaviours involved in the main phases of sensemaking, namely information foraging and information processing. By studying expert process workers engaged in tasks based on integrated modeling of business processes and rules, we provide insights that pave the way for a better understanding of sensemaking practices and improved development of business process and business rule integration approaches. Our research underscores the importance of offering personalized support mechanisms that increase the efficacy and efficiency of sensemaking practices for process knowledge workers.

Paperid: 1182, https://arxiv.org/pdf/2505.19441.pdf

Abstract:
Recommender systems (RS), which are widely deployed across high-stakes domains, are susceptible to biases that can cause large-scale societal impacts. Researchers have proposed methods to measure and mitigate such biases -- but translating academic theory into practice is inherently challenging. RS practitioners must balance the competing interests of diverse stakeholders, including providers and users, and operate in dynamic environments. Through a semi-structured interview study (N=11), we map the RS practitioner workflow within large technology companies, focusing on how technical teams consider fairness internally and in collaboration with other (legal, data, and fairness) teams. We identify key challenges to incorporating fairness into existing RS workflows: defining fairness in RS contexts, particularly when navigating multi-stakeholder and dynamic fairness considerations. We also identify key organization-wide challenges: making time for fairness work and facilitating cross-team communication. Finally, we offer actionable recommendations for the RS community, including HCI researchers and practitioners.

Paperid: 1183, https://arxiv.org/pdf/2505.18416.pdf

Abstract:
Driving is a key component of independence and quality of life for older adults. However, cognitive decline associated with conditions such as mild cognitive impairment and dementia can compromise driving safety and often lead to premature driving cessation. Conditionally automated vehicles, which require drivers to take over control when automation reaches its operational limits, offer a potential assistive solution. However, their effectiveness depends on the driver's ability to respond to takeover requests (TORs) in a timely and appropriate manner. Understanding emotional responses during TORs can provide insight into drivers' engagement, stress levels, and readiness to resume control, particularly in cognitively vulnerable populations. This study investigated affective responses, measured via facial expression analysis of valence and arousal, during TORs among cognitively healthy older adults and those with cognitive impairment. Facial affect data were analyzed across different road geometries and speeds to evaluate within- and between-group differences in affective states. Within-group comparisons using the Wilcoxon signed-rank test revealed significant changes in valence and arousal during TORs for both groups. Cognitively healthy individuals showed adaptive increases in arousal under higher-demand conditions, while those with cognitive impairment exhibited reduced arousal and more positive valence in several scenarios. Between-group comparisons using the Mann-Whitney U test indicated that cognitively impaired individuals displayed lower arousal and higher valence than controls across different TOR conditions. These findings suggest reduced emotional response and awareness in cognitively impaired drivers, highlighting the need for adaptive vehicle systems that detect affective states and support safe handovers for vulnerable users.
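
For readers unfamiliar with the between-group test named above, the following minimal sketch computes the Mann-Whitney U statistic from scratch. It only illustrates the statistic itself: the function names are introduced here for illustration, no p-value is computed, and it is not the study's analysis code or data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples.

    U_x counts the (x_i, y_j) pairs in which x_i exceeds y_j, with ties counted as 0.5.
    """
    n_x, n_y = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n_x])
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y

if __name__ == "__main__":
    # Made-up arousal scores, for illustration only.
    impaired = [0.21, 0.25, 0.30]
    controls = [0.35, 0.40, 0.28]
    print(mann_whitney_u(impaired, controls))
```

A U value close to 0 or to the product of the two sample sizes indicates that one group's values sit almost entirely below the other's; in practice a statistics library would also supply the corresponding p-value.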

Paperid: 1184, https://arxiv.org/pdf/2505.14379.pdf

Abstract:
There exists limited theoretical guidance on integrating visualization and sonification. In this paper, we address this gap by investigating audiovisual semiotics for uncertainty representation: joining uncertainty visualization and sonification to combine audiovisual channels for enhancing users' perception of uncertainty. We conducted two preregistered crowd-sourced user studies. First, we assessed suitable audio/visual pairs. Then, we investigated audiovisual mappings of uncertainty. Here, we use probability as it is an easily communicated aspect of uncertainty. We analyzed the participants' preferences and reaction times in both user studies. Additionally, we explored the strategies employed by participants through qualitative analysis. Our results reveal audiovisual mappings that lead to particularly strong preferences and low reaction times. Furthermore, we found that preferred audio/visual pairs are not necessarily suitable audiovisual mappings of uncertainty. For example, while pitch paired with brightness was preferred as a pair, it was not well suited as a mapping for uncertainty. We recommend audiovisual mappings of uncertainty that lead to low reaction times and high preferences in both user studies. This paper presents guidelines to anyone seeking to employ audiovisual representations for uncertainty, contributing to enhancing the perception of uncertainty.

Paperid: 1185, https://arxiv.org/pdf/2505.09478.pdf

Abstract:
Card sorting is a common ideation technique that elicits information on users' mental organization of content and functionality by having them sort items into categories. For more robust card sorting research, digital card sorting tools could benefit from providing quick automated feedback. The objective of this research is to advance toward an instrument that applies artificial intelligence (AI) to augment card sorting. For this purpose, we develop the Card Sorting Simulator, a prototype tool that leverages Large Language Models (LLMs) to generate informative categorizations of cards. To illuminate how aligned the simulation is with card sorting by actual participants, and to inform the instrument's design decisions, we conducted a generalizability-focused comparative study. We obtained 28 pre-existing card sorting studies from real practitioners, comprising 1,399 participants, along with diverse contents and origins. With this dataset, we conducted a comprehensive and nuanced analysis of the agreement between actual card sorting results (clusterings of cards) and synthetic clusterings across a multitude of LLMs and prompt designs. Mutual information scores indicate a good degree of agreement with real result clusterings, although similarity matrices also demonstrate inconsistencies from mental models, which can be attributed to their top-down nature. Furthermore, the number of cards and the complexity of their labels impact the accuracy of the simulation. These findings bolster the case for AI augmentation in card sorting research as a source of meaningful preliminary feedback and highlight the need for further study for the development and validation of intelligent user research tools.
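
To make the agreement measure concrete, the sketch below computes mutual information between two clusterings of the same cards, plus a normalized variant. The abstract does not say which mutual information variant the authors used; normalizing by the arithmetic mean of the two entropies is an assumption made here, and the function names are illustrative.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a cluster labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labellings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )

def normalized_mutual_information(labels_a, labels_b):
    """MI divided by the mean of the two entropies (assumed normalization)."""
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labellings put every card in a single cluster
    return mutual_information(labels_a, labels_b) / ((h_a + h_b) / 2)

if __name__ == "__main__":
    # Cluster assignments for four cards: one from participants, one synthetic.
    human = ["nav", "nav", "account", "account"]
    synthetic = [0, 0, 1, 1]
    print(normalized_mutual_information(human, synthetic))
```

Because the score depends only on how cards are grouped, not on the cluster names, a synthetic clustering that matches the human grouping under different labels still scores as full agreement.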

Paperid: 1186, https://arxiv.org/pdf/2505.08939.pdf

Abstract:
As generative AI tools become integrated into design workflows, students increasingly engage with these tools not just as aids, but as collaborators. This study analyzes reflections from 33 student teams in an HCI design course to examine the kinds of judgments students make when using AI tools. We found both established forms of design judgment (e.g., instrumental, appreciative, quality) and emergent types: agency-distribution judgment and reliability judgment. These new forms capture how students negotiate creative responsibility with AI and assess the trustworthiness of its outputs. Our findings suggest that generative AI introduces new layers of complexity into design reasoning, prompting students to reflect not only on what AI produces, but also on how and when to rely on it. By foregrounding these judgments, we offer a conceptual lens for understanding how students engage in co-creative sensemaking with AI in design contexts.

Paperid: 1187, https://arxiv.org/pdf/2505.08902.pdf

Abstract:
Currently, a considerable research effort is devoted to comparing LLMs to a group of human experts, where the term "expert" is often ill-defined or, at best, variable, amid constantly updated LLM releases. Without proper safeguards in place, LLMs threaten to harm the established structure of safe delivery of patient care, which has been carefully developed throughout history to keep the safety of the patient at the forefront. A key driver of LLM innovation is founded on community research efforts which, if continuing to operate under "humans versus LLMs" principles, will expedite this trend. Therefore, research efforts moving forward must focus on effectively characterizing the safe use of LLMs in clinical settings in ways that persist across the rapid development of new LLMs. In this communication, we demonstrate that rather than comparing LLMs to humans, there is a need to develop strategies enabling efficient work of humans with LLMs in an almost symbiotic manner.

Paperid: 1188, https://arxiv.org/pdf/2505.08588.pdf

Abstract:
GPT has become nearly synonymous with large language models (LLMs), an increasingly popular term in AIED proceedings. A simple keyword-based search reveals that 61% of the 76 long and short papers presented at AIED 2024 describe novel solutions using LLMs to address some of the long-standing challenges in education, and 43% specifically mention GPT. Although LLMs pioneered by GPT create exciting opportunities to strengthen the impact of AI on education, we argue that the field's predominant focus on GPT and other resource-intensive LLMs (with more than 10B parameters) risks neglecting the potential impact that small language models (SLMs) can make in providing resource-constrained institutions with equitable and affordable access to high-quality AI tools. Supported by positive results on knowledge component (KC) discovery, a critical challenge in AIED, we demonstrate that SLMs such as Phi-2 can produce an effective solution without elaborate prompting strategies. Hence, we call for more attention to developing SLM-based AIED approaches.

Paperid: 1189, https://arxiv.org/pdf/2505.07486.pdf

Abstract:
The problem of how to effectively mitigate the flow of misinformation remains a significant challenge. The classical approach to this is public disapproval of claims or "debunking." The approach is still widely used on social media, but it has some severe limitations in terms of applicability and efficiency. An alternative strategy is to enhance individuals' critical thinking through educational interventions. Instead of merely disproving misinformation, these approaches aim to strengthen users' reasoning skills, enabling them to evaluate and reject false information independently. In this position paper, we explore a combination of intervention methods designed to improve critical thinking in the context of online media consumption. We highlight the role of AI in supporting different stages of these interventions and present a design concept that integrates AI-driven strategies to foster critical reasoning and media literacy.

Paperid: 1190, https://arxiv.org/pdf/2505.07214.pdf

Abstract:
Crucial in disease analysis and surgical planning, manual segmentation of volumetric medical scans (e.g. MRI, CT) is laborious, error-prone, and challenging to master, while fully automatic algorithms can benefit from user feedback. Therefore, with the complementary power of the latest radiological AI foundation models and virtual reality (VR)'s intuitive data interaction, we propose SAMIRA, a novel conversational AI agent for medical VR that assists users with localizing, segmenting, and visualizing 3D medical concepts. Through speech-based interaction, the agent helps users understand radiological features, locate clinical targets, and generate segmentation masks that can be refined with just a few point prompts. The system also supports true-to-scale 3D visualization of segmented pathology to enhance patient-specific anatomical understanding. Furthermore, to determine the optimal interaction paradigm under near-far attention-switching for refining segmentation masks in an immersive, human-in-the-loop workflow, we compare VR controller pointing, head pointing, and eye tracking as input modes. With a user study, evaluations demonstrated a high usability score (SUS=90.0 $\pm$ 9.0), low overall task load, as well as strong support for the proposed VR system's guidance, training potential, and integration of AI in radiological segmentation tasks.
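
For readers unfamiliar with how a usability score such as the one reported above is computed, the following is a minimal sketch of the standard System Usability Scale (SUS) scoring rule for a single participant. The function name and the example responses are hypothetical and are not data from the study.

```python
def sus_score(responses):
    """Score one participant's ten SUS responses (each 1-5) on the 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each an integer from 1 to 5")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, 7, 9 (even indices) are positively worded: add r - 1.
        # Items 2, 4, 6, 8, 10 (odd indices) are negatively worded: add 5 - r.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # rescale the 0-40 sum to 0-100


# Hypothetical participant, not taken from the paper.
example = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 3, 3])
```

A study-level figure like a mean with a standard deviation is then obtained by averaging these per-participant scores.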

Paperid: 1191, https://arxiv.org/pdf/2505.06702.pdf

Abstract:
Large language models encode knowledge in various domains and demonstrate the ability to understand visualizations. They may also capture visualization design knowledge and potentially help reduce the cost of formative studies. However, it remains a question whether large language models are capable of predicting human feedback on visualizations. To investigate this question, we conducted three studies to examine whether large model-based agents can simulate human ratings in visualization tasks. The first study, replicating a published study involving human subjects, shows agents are promising in conducting human-like reasoning and rating, and its result guides the subsequent experimental design. The second study repeated six human-subject studies on subjective ratings reported in the literature, but replaced human participants with agents. Consulting with five human experts, this study demonstrates that the alignment of agent ratings with human ratings positively correlates with the confidence levels of the experts before the experiments. The third study tests commonly used techniques for enhancing agents, including preprocessing visual and textual inputs, and knowledge injection. The results reveal the issues of these techniques in robustness and potential induction of biases. The three studies indicate that language model-based agents can potentially simulate human ratings in visualization experiments, provided that they are guided by high-confidence hypotheses from expert evaluators. Additionally, we demonstrate the usage scenario of swiftly evaluating prototypes with agents. We discuss insights and future directions for evaluating and improving the alignment of agent ratings with human ratings. We note that simulations can only complement, not replace, user studies.

Paperid: 1192, https://arxiv.org/pdf/2505.06469.pdf

Abstract:
Educators evaluate student knowledge using knowledge component (KC) models that map assessment questions to KCs. Still, designing KC models for large question banks remains an insurmountable challenge for instructors who need to analyze each question by hand. The growing use of Generative AI in education is expected only to aggravate this chronic deficiency of expert-designed KC models, as course engineers designing KCs struggle to keep up with the pace at which questions are generated. In this work, we propose KCluster, a novel KC discovery algorithm based on identifying clusters of congruent questions according to a new similarity metric induced by a large language model (LLM). We demonstrate in three datasets that an LLM can create an effective metric of question similarity, which a clustering algorithm can use to create KC models from questions with minimal human effort. Combining the strengths of LLM and clustering, KCluster generates descriptive KC labels and discovers KC models that predict student performance better than the best expert-designed models available. In anticipation of future work, we illustrate how KCluster can reveal insights into difficult KCs and suggest improvements to instruction.

Paperid: 1193, https://arxiv.org/pdf/2505.04273.pdf

Abstract:
Group Recommender Systems (GRS) employing social choice-based aggregation strategies have previously been explored in terms of perceived consensus, fairness, and satisfaction. At the same time, the impact of textual explanations has been examined, but the results suggest a low effectiveness of these explanations. However, user understanding remains fairly unexplored, even if it can contribute positively to transparent GRS. This is particularly interesting to study in more complex or potentially unfair scenarios when user preferences diverge, such as in a minority scenario (where group members have similar preferences, except for a single member in a minority position). In this paper, we analyzed the impact of different types of explanations on user understanding of group recommendations. We present a randomized controlled trial (n = 271) using two between-subject factors: (i) the aggregation strategy (additive, least misery, and approval voting), and (ii) the modality of explanation (no explanation, textual explanation, or multimodal explanation). We measured both subjective (self-perceived by the user) and objective understanding (performance on model simulation, counterfactuals and error detection). In line with recent findings on explanations for machine learning models, our results indicate that more detailed explanations, whether textual or multimodal, did not increase subjective or objective understanding. However, we did find a significant effect of aggregation strategies on both subjective and objective understanding. These results imply that when constructing GRS, practitioners need to consider that the choice of aggregation strategy can influence the understanding of users. Post-hoc analysis also suggests that there is value in analyzing performance on different tasks, rather than through a single aggregated metric of understanding.
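
The three aggregation strategies named above can be illustrated with a short sketch. The rating scale, the approval threshold of 4, the item names, and the function names are assumptions chosen for illustration; the abstract does not specify them.

```python
def additive(ratings):
    """Sum each item's ratings across all group members."""
    return {item: sum(r) for item, r in ratings.items()}


def least_misery(ratings):
    """Score each item by its lowest individual rating."""
    return {item: min(r) for item, r in ratings.items()}


def approval_voting(ratings, threshold=4):
    """Count members who rate the item at or above the threshold (assumed to be 4)."""
    return {item: sum(1 for x in r if x >= threshold) for item, r in ratings.items()}


def recommend(scores):
    """Return the top-scoring item; ties go to the item listed first."""
    return max(scores, key=scores.get)


# Hypothetical minority scenario: three members agree, the fourth dislikes item "A".
ratings = {
    "A": [5, 5, 5, 1],
    "B": [3, 3, 3, 4],
    "C": [4, 4, 2, 3],
}
picks = {
    "additive": recommend(additive(ratings)),
    "least_misery": recommend(least_misery(ratings)),
    "approval_voting": recommend(approval_voting(ratings)),
}
```

In this example the additive and approval voting strategies follow the majority and pick "A", while least misery protects the minority member and picks "B". This divergence is what makes the minority scenario a useful test of user understanding.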

Paperid: 1194, https://arxiv.org/pdf/2505.04260.pdf

Abstract:
As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is essential for enhancing user satisfaction and retention. However, untrained lay users have poor prompt specification abilities and often struggle with conveying their latent preferences to AI assistants. To address this, we leverage activation steering to guide LLMs to align with interpretable preference dimensions during inference. In contrast to memory-based personalization methods that require longer user history, steering is extremely lightweight and can be easily controlled by the user via a linear strength factor. We embed steering into three different interactive chatbot interfaces and conduct a within-subjects user study (n=14) to investigate how end users prefer to personalize their conversations. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with hidden user preferences, and highlight further insights on how diverse values around control, usability, and transparency lead users to prefer different interfaces.
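
The idea of a linear strength factor can be sketched as adding a scaled direction vector to a hidden activation. This is a generic illustration of activation steering, not the authors' implementation: how the direction is derived, which layer is steered, and the normalization to unit length are all assumptions here.

```python
def steer(hidden, direction, strength):
    """Shift one hidden-state vector along a preference direction.

    Assumption: the direction is normalized to unit length so that
    `strength` alone controls how far the activation moves.
    """
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0:
        raise ValueError("steering direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(hidden, direction)]


# Hypothetical two-dimensional activation and preference direction.
steered = steer([1.0, 2.0], [3.0, 4.0], strength=5.0)
unchanged = steer([1.0, 2.0], [3.0, 4.0], strength=0.0)
```

A strength of zero leaves the activation untouched, and negative values push it the opposite way, which is what lets a single slider-style control adjust the model's behavior at inference time.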

Paperid: 1195, https://arxiv.org/pdf/2505.03027.pdf

Abstract:
Performance models of interaction, such as Fitts' Law, are important tools for predicting and explaining human motor performance and for designing high-performance user interfaces. Extensive prior work has proposed such models for the 3D interaction task of distal pointing, in which the user points their hand or a device at a distant target in order to select it. However, there is no consensus on how to compute the index of difficulty for distal pointing tasks. We present a preliminary study suggesting that existing models may not be sufficient to model distal pointing performance with current virtual reality technologies. Based on these results, we hypothesized that both the form of the model and the standard method for collecting empirical data for pointing tasks might need to change in order to achieve a more accurate and valid distal pointing model. In our main study, we used a new methodology to collect distal pointing data and evaluated traditional models, purely ballistic models, and two-part models. Ultimately, we found that the best model used a simple Fitts-Law-style index of difficulty with angular measures of amplitude and width.
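
A Fitts-style index of difficulty with angular measures can be sketched as follows. The abstract does not give the exact formula, so the Shannon formulation log2(amplitude / width + 1) is an assumption, and the coefficients a and b are hypothetical regression parameters rather than values from the paper.

```python
from math import atan, degrees, log2


def angular_size(extent, distance):
    """Visual angle in degrees subtended by an extent viewed from a distance (same units)."""
    return degrees(2 * atan(extent / (2 * distance)))


def angular_index_of_difficulty(amplitude_deg, width_deg):
    """Index of difficulty in bits (assumed Shannon form) using angular amplitude and width."""
    if amplitude_deg < 0 or width_deg <= 0:
        raise ValueError("amplitude must be non-negative and width positive")
    return log2(amplitude_deg / width_deg + 1)


def predicted_movement_time(id_bits, a, b):
    """Linear Fitts-style prediction MT = a + b * ID; a and b come from regression."""
    return a + b * id_bits


# Hypothetical task: rotate 30 degrees to hit a target subtending 10 degrees.
id_bits = angular_index_of_difficulty(30, 10)
mt_seconds = predicted_movement_time(id_bits, a=0.2, b=0.15)
```

Expressing amplitude and width as angles makes the index independent of how far away the target is: a small nearby target and a large distant one that subtend the same angle receive the same difficulty.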

Paperid: 1196, https://arxiv.org/pdf/2505.02699.pdf

Abstract:
Multi-role pedagogical agents can create engaging and immersive learning experiences, helping learners better understand knowledge in history learning. However, existing pedagogical agents often struggle with multi-role interactions due to complex controls, limited feedback forms, and difficulty dynamically adapting to user inputs. In this study, we developed a VR prototype with LLM-powered adaptive role-switching and action-switching pedagogical agents to help users learn about the history of the Pavilion of Prince Teng. A 2 x 2 between-subjects study was conducted with 84 participants to assess how adaptive role-switching and action-switching affect participants' learning outcomes and experiences. The results suggest that adaptive role-switching enhances participants' perception of the pedagogical agent's trustworthiness and expertise but may lead to inconsistent learning experiences. Adaptive action-switching increases participants' perceived social presence, expertise, and humanness. The study did not uncover any effects of role-switching and action-switching on usability, learning motivation, and cognitive load. Based on the findings, we proposed five design implications for incorporating adaptive role-switching and action-switching into future VR history education tools.

Paperid: 1197, https://arxiv.org/pdf/2505.01886.pdf

Abstract:
Immersive Virtual Reality (iVR) applications have shown immense potential for skill training and learning in manufacturing. However, authoring of such applications requires technical expertise, which makes it difficult for educators to author instructions targeted at desired learning outcomes. We present FlowTrainer, an LLM-assisted interactive system to allow educators to author lesson plans for their iVR instruction based on desired goals. The authoring workflow is supported by Backward design to align the planned lesson based on the desired outcomes. We implemented a welding use case and conducted a user study with welding experts to test the effectiveness of the system in authoring outcome-oriented lesson plans. The study results showed that the system allowed users to plan lesson plans based on desired outcomes while reducing the time and technical expertise required for the authoring process. We believe that such efforts can allow widespread adoption of iVR solutions in manufacturing training to meet the workforce demands in the industry.

Paperid: 1198, https://arxiv.org/pdf/2505.01724.pdf

Abstract:
Historical visualizations are a rich resource for visualization research. While taxonomy is commonly used to structure and understand the design space of visualizations, existing taxonomies primarily focus on contemporary visualizations and largely overlook historical visualizations. To address this gap, we describe an empirical method for taxonomy development. We introduce a coding protocol and the VisTaxa system for taxonomy labeling and comparison. We demonstrate using our method to develop a historical visualization taxonomy by coding 400 images of historical visualizations. We analyze the coding result and reflect on the coding process. Our work is an initial step toward a systematic investigation of the design space of historical visualizations.

Paperid: 1199, https://arxiv.org/pdf/2504.21849.pdf

Abstract:
Governance institutions must respond to societal risks, including those posed by generative AI. This study empirically examines how public trust in institutions and AI technologies, along with perceived risks, shape preferences for AI regulation. Using the nationally representative 2023 Artificial Intelligence, Morality, and Sentience (AIMS) survey, we assess trust in government, AI companies, and AI technologies, as well as public support for regulatory measures such as slowing AI development or outright bans on advanced AI. Our findings reveal broad public support for AI regulation, with risk perception playing a significant role in shaping policy preferences. Individuals with higher trust in government favor regulation, while those with greater trust in AI companies and AI technologies are less inclined to support restrictions. Trust in government and perceived risks significantly predict preferences for both soft (e.g., slowing development) and strong (e.g., banning AI systems) regulatory interventions. These results highlight the importance of public opinion in AI governance. As AI capabilities advance, effective regulation will require balancing public concerns about risks with trust in institutions. This study provides a foundational empirical baseline for policymakers navigating AI governance and underscores the need for further research into public trust, risk perception, and regulatory strategies in the evolving AI landscape.

Paperid: 1200, https://arxiv.org/pdf/2504.21397.pdf

Abstract:
The complexity of UX design practice extends beyond ill-structured design problems to include uncertainties shaped by shifting stakeholder priorities, team dynamics, limited resources, and implementation constraints. While prior research in related fields has addressed uncertainty in design more broadly, the specific character of uncertainty in UX practice remains underexplored. This study examines how UX practitioners experience and respond to uncertainty in real-world projects, drawing on a multi-week diary study and follow-up interviews with ten designers. We identify a range of practitioner strategies (including adaptive framing, negotiation, and judgment) that allow designers to move forward amid ambiguity. Our findings highlight the central role of design judgment in navigating uncertainty, including emergent forms such as temporal and sacrificial judgment, and extend prior understandings by showing how UX practitioners engage uncertainty as a persistent, situated feature of practice.

Paperid: 1201, https://arxiv.org/pdf/2504.20196.pdf

Abstract:
Large Language Models (LLMs) are rapidly transforming software engineering, with coding assistants embedded in an IDE becoming increasingly prevalent. While research has focused on improving the tools and understanding developer perceptions, a critical gap exists in understanding how developers actually use these tools in their daily workflows, and, crucially, where they struggle. This paper addresses part of this gap through a multi-phased investigation of developer interactions with an LLM-powered code editing and transformation feature, Transform Code, in an IDE widely used at Google. First, we analyze telemetry logs of the feature usage, revealing that frequent re-prompting can be an indicator of developer struggles with using Transform Code. Second, we conduct a qualitative analysis of unsatisfactory requests, identifying five key categories of information often missing from developer prompts. Finally, based on these findings, we propose and evaluate a tool, AutoPrompter, for automatically improving prompts by inferring missing information from the surrounding code context, leading to a 27% improvement in edit correctness on our test set.

Paperid: 1202, https://arxiv.org/pdf/2504.17334.pdf

Abstract:
A data story typically integrates data facts from multiple perspectives and stances to construct a comprehensive and objective narrative. However, retrieving these facts demands time for data search and challenges the creator's analytical skills. In this work, we introduce DataScout, an interactive system that automatically performs reasoning and stance-based data fact retrieval to augment the user's statement. Particularly, DataScout leverages an LLM-based agent to construct a retrieval tree, enabling collaborative control of its expansion between users and the agent. The interface visualizes the retrieval tree as a mind map that helps users intuitively steer the retrieval direction and effectively engage in reasoning and analysis. We evaluate the proposed system through case studies and in-depth expert interviews. Our evaluation demonstrates that DataScout can effectively retrieve multifaceted data facts from different stances, helping users verify their statements and enhance the credibility of their stories.

Paperid: 1203, https://arxiv.org/pdf/2504.16323.pdf

Abstract:
As digital media use continues to evolve and influence various aspects of life, developing flexible and scalable tools to study complex media experiences is essential. This study introduces the Media Content Atlas (MCA), a novel pipeline designed to help researchers investigate large-scale screen data beyond traditional screen-use metrics. Leveraging multimodal large language models (MLLMs), MCA enables moment-by-moment content analysis, content-based clustering, topic modeling, image retrieval, and interactive visualizations. Evaluated on 1.12 million smartphone screenshots continuously captured during screen use from 112 adults over an entire month, MCA facilitates open-ended exploration and hypothesis generation as well as hypothesis-driven investigations at an unprecedented scale. Expert evaluators underscored its usability and potential for research and intervention design, with clustering results rated 96% relevant and descriptions 83% accurate. By bridging methodological possibilities with domain-specific needs, MCA accelerates both inductive and deductive inquiry, presenting new opportunities for media and HCI research.

Paperid: 1204, https://arxiv.org/pdf/2504.14507.pdf

Abstract:
Comprehending visualizations requires readers to interpret visual encoding and the underlying meanings actively. This poses challenges for visualization novices, particularly when interpreting distributional visualizations that depict statistical uncertainty. Advancements in LLM-based conversational interfaces show promise in promoting visualization comprehension. However, they fail to provide contextual explanations at fine-grained granularity, and chart readers are still required to mentally bridge visual information and textual explanations during conversations. Our formative study highlights the expectations for both lexical and visual feedback, as well as the importance of explicitly linking these two modalities throughout the conversation. The findings motivate the design of VizTA, a visualization teaching assistant that leverages the fusion of visual and lexical feedback to help readers better comprehend visualization. VizTA features a semantic-aware conversational agent capable of explaining contextual information within visualizations and employs a visual-lexical fusion design to facilitate chart-centered conversation. A between-subject study with 24 participants demonstrates the effectiveness of VizTA in supporting the understanding and reasoning tasks of distributional visualization across multiple scenarios.

Paperid: 1205, https://arxiv.org/pdf/2504.14177.pdf

Abstract:
Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by utilizing online AI preference in aligning large language models (LLMs). However, the straightforward replacement of humans with AI deprives LLMs of the opportunity to learn from more fine-grained AI supervision beyond binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm using online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results underscore that AI reward is a better form of AI supervision, consistently achieving higher human-AI agreement than AI preference. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.

Paperid: 1206, https://arxiv.org/pdf/2504.13969.pdf

Abstract:
This paper presents Tinker Tales, an interactive storytelling framework in the format of a board game, designed to support both narrative development and AI literacy in early childhood. The framework integrates tangible and speech-based interactions with AI through NFC chip-attached pawns and tokens, along with a speaker and microphone. Children select and define key story elements (such as characters, places, items, and emotions) using the pawns and tokens, providing further details to the AI and receiving proper assistance, similar to how adults prompt AI for specific tasks (e.g., writing). For evaluation, several game sessions were simulated with a child AI agent, and the quality and safety of the generated stories were assessed from various perspectives. This work highlights the potential of combining physical and digital elements in AI literacy, offering a safe and engaging way for children to learn how to effectively collaborate with AI.

Paperid: 1207, https://arxiv.org/pdf/2504.13964.pdf

Abstract:
How should a companion robot behave? In this research, we present a cognitive architecture based on a tailored personality model to investigate the impact of robotic personalities on the perception of companion robots. Drawing from existing literature, we identified empathy, trust, and enjoyability as key factors in building companionship with social robots. Based on these insights, we implemented a personality-dependent, emotion-aware generator, recognizing the crucial role of robot emotions in shaping these elements. We then conducted a user study involving 84 dyadic conversation sessions with the emotional robot Navel, which exhibited different personalities. Results were derived from a multimodal analysis, including questionnaires, open-ended responses, and behavioral observations. This approach allowed us to validate the developed emotion generator and explore the relationship between the personality traits of Agreeableness, Extraversion, Conscientiousness, and Empathy. Furthermore, we drew robust conclusions on how these traits influence relational trust, capability trust, enjoyability, and sociability.

Paperid: 1208, https://arxiv.org/pdf/2504.13938.pdf

Abstract:
Personalization of Large Language Models (LLMs) is important in practical applications to accommodate the individual needs of different mobile users. Due to data privacy concerns, LLM personalization often needs to be done locally on the user's mobile device, but such on-device personalization is constrained by both the limited on-device compute power and the insufficiency of the user's personal data. In this paper, we address these constraints by fine-tuning an already personalized LLM with the user's personal data, and present XPerT, a new technique that ensures proper selection of such already personalized LLMs based on explainability about how they were fine-tuned. We implemented and evaluated XPerT on various smartphone models with mainstream LLMs, and experimental results show that XPerT reduces the computation costs of on-device LLM personalization by 83%, and improves its data efficiency by 51%.

Paperid: 1209, https://arxiv.org/pdf/2504.13095.pdf

Abstract:
Conversational recommender systems (CRSs) provide users with an interactive means to express preferences and receive real-time personalized recommendations. The success of these systems is heavily influenced by the preference elicitation process. While existing research mainly focuses on what questions to ask during preference elicitation, there is a notable gap in understanding what role broader interaction patterns including tone, pacing, and level of proactiveness play in supporting users in completing a given task. This study investigates the impact of different conversational styles on preference elicitation, task performance, and user satisfaction with CRSs. We conducted a controlled experiment in the context of scientific literature recommendation, contrasting two distinct conversational styles, high involvement (fast paced, direct, and proactive with frequent prompts) and high considerateness (polite and accommodating, prioritizing clarity and user comfort) alongside a flexible experimental condition where users could switch between the two. Our results indicate that adapting conversational strategies based on user expertise and allowing flexibility between styles can enhance both user satisfaction and the effectiveness of recommendations in CRSs. Overall, our findings hold important implications for the design of future CRSs.

Paperid: 1210, https://arxiv.org/pdf/2504.12805.pdf

Abstract:
This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a full-length critique and then shorter, more coherent versions using a step-by-step prompting process. These AI-generated critiques were then compared with those written by human experts in a Turing test-style evaluation. In many cases, human subjects had difficulty telling which was which, and the results suggest that LLMs can produce critiques that are not only plausible in style but also rich in interpretation, as long as they are carefully guided. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension, which can appear in the context of art. These go beyond standard false-belief tests and allow for more complex, socially embedded forms of reasoning. We tested 41 recent LLMs and found that their performance varied across tasks and models. In particular, tasks that involved affective or ambiguous situations tended to reveal clearer differences. Taken together, these results help clarify how LLMs respond to complex interpretative challenges, revealing both their cognitive limitations and potential. While our findings do not directly contradict the so-called Generative AI Paradox--the idea that LLMs can produce expert-like output without genuine understanding--they suggest that, depending on how LLMs are instructed, such as through carefully designed prompts, these models may begin to show behaviors that resemble understanding more closely than we might assume.

Paperid: 1211, https://arxiv.org/pdf/2504.12424.pdf

Abstract:
This position paper highlights a growing trend in Explainable AI (XAI) research where Large Language Models (LLMs) are used to translate outputs from explainability techniques, like feature-attribution weights, into a natural language explanation. While this approach may improve accessibility or readability for users, recent findings suggest that translating into human-like explanations does not necessarily enhance user understanding and may instead lead to overreliance on AI systems. When LLMs summarize XAI outputs without surfacing model limitations, uncertainties, or inconsistencies, they risk reinforcing the illusion of interpretability rather than fostering meaningful transparency. We argue that - instead of merely translating XAI outputs - LLMs should serve as constructive agitators, or devil's advocates, whose role is to actively interrogate AI explanations by presenting alternative interpretations, potential biases, training data limitations, and cases where the model's reasoning may break down. In this role, LLMs can facilitate users in engaging critically with AI systems and generated explanations, with the potential to reduce overreliance caused by misinterpreted or specious explanations.

Paperid: 1212, https://arxiv.org/pdf/2504.12422.pdf

Abstract:
High-stakes domains like cyber operations need responsible and trustworthy AI methods. While large language models (LLMs) are becoming increasingly popular in these domains, they still suffer from hallucinations. This research paper provides learning outcomes from a case study with LinkQ, an open-source natural language interface that was developed to combat hallucinations by forcing an LLM to query a knowledge graph (KG) for ground-truth data during question-answering (QA). We conduct a quantitative evaluation of LinkQ using a well-known KGQA dataset, showing that the system outperforms GPT-4 but still struggles with certain question categories - suggesting that alternative query construction strategies will need to be investigated in future LLM querying systems. We discuss a qualitative study of LinkQ with two domain experts using a real-world cybersecurity KG, outlining these experts' feedback, suggestions, perceived limitations, and future opportunities for systems like LinkQ.

Paperid: 1213, https://arxiv.org/pdf/2504.10873.pdf

Abstract:
In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure providing orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets with varying formal and informal TGs, such as 'Stop', 'Reverse', 'Hail', etc. The datasets are "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)". They are annotated with natural language, describing the pedestrian's body position and gesture. We evaluate models using three methods utilizing expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret zero-shot human traffic gestures, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.

Paperid: 1214, https://arxiv.org/pdf/2504.10162.pdf

Abstract:
Presence is an important and widely used metric to measure the quality of virtual reality (VR) applications. Given the multifaceted and subjective nature of presence, the most common measures for presence are questionnaires. But there is little research on their validity regarding specific presence dimensions and their responsiveness to differences in perception among users. We investigated four presence questionnaires (SUS, PQ, IPQ, Bouchard) on their responsiveness to intensity variations of known presence dimensions and asked users about their consistency with their experience. To this end, we created five VR scenarios, each designed to emphasize a specific presence dimension. Our findings showed heterogeneous sensitivity of the questionnaires depending on the different dimensions of presence. This highlights a context-specific suitability of presence questionnaires. Users further reported that the questionnaires' sensitivity was lower than the differences they actually perceived. Based on our findings, we offer guidance on selecting these questionnaires based on their suitability for particular use cases.

Paperid: 1215, https://arxiv.org/pdf/2504.07529.pdf

Abstract:
The emergence of generative AI, large language models (LLMs), and foundation models is fundamentally reshaping computer science, and visualization and visual analytics are no exception. We present a systematic framework for understanding how human-centered AI (HCAI) can transform the visualization discipline. Our framework maps four key HCAI tool capabilities -- amplify, augment, empower, and enhance -- onto the four phases of visual sensemaking: view, explore, schematize, and report. For each combination, we review existing tools, envision future possibilities, identify challenges and pitfalls, and examine ethical considerations. This design space can serve as an R&D agenda for both visualization researchers and practitioners to integrate AI into their work as well as to understand how visualization can support HCAI research.

Paperid: 1216, https://arxiv.org/pdf/2504.07108.pdf

Abstract:
The use of recommender systems in the recruitment domain has been labeled as 'high-risk' in recent legislation. As a result, strict requirements regarding explainability and fairness have been put in place to ensure proper treatment of all involved stakeholders. To allow for stakeholder-specific explainability, while also handling highly heterogeneous recruitment data, we propose a novel explainable multi-stakeholder job recommender system using graph neural networks: the Occupational Knowledge-based Recommender using Attention (OKRA). The proposed method is capable of providing both candidate- and company-side recommendations and explanations. We find that OKRA performs substantially better than six baselines in terms of nDCG for two datasets. Furthermore, we find that the tested models show a bias toward candidates and vacancies located in urban areas. Overall, our findings suggest that OKRA provides a balance between accuracy, explainability, and fairness.
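
The abstract reports results in terms of nDCG without stating the exact variant or cutoff. As a minimal sketch only, assuming the common formulation with linear gain and a log2 position discount (the function names `dcg_at_k` and `ndcg_at_k` are hypothetical and not taken from the paper), the metric can be computed as follows:

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked list.

    `relevances` holds the graded relevance of each recommended item,
    in the order the recommender ranked them (best-ranked first).
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# The single relevant vacancy is ranked third instead of first.
score = ndcg_at_k([0, 0, 1], k=3)
```

A ranking that places the relevant items first scores 1.0, and the score decreases as relevant items move down the list.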

Paperid: 1217, https://arxiv.org/pdf/2504.06976.pdf

Abstract:
Social media platforms face heightened risks during major political events; yet, how platforms adapt their moderation practices in response remains unclear. The Digital Services Act Transparency Database offers an unprecedented opportunity to systematically study content moderation at scale, enabling researchers and policymakers to assess platforms' compliance and effectiveness. Herein, we analyze 1.58 billion self-reported moderation actions taken by eight large social media platforms during an extended period of eight months surrounding the 2024 European Parliament elections. Our findings reveal a lack of adaptation in moderation strategies, as platforms did not exhibit significant changes in their enforcement behaviors surrounding the elections. This raises concerns about whether platforms adapted their moderation practices at all, or if structural limitations of the database concealed possible adjustments. Moreover, we found that noted transparency and accountability issues persist nearly a year after initial concerns were raised. These results highlight the limitations of current self-regulatory approaches and underscore the need for stronger enforcement and data access mechanisms to ensure that online platforms uphold their responsibility in safeguarding democratic processes.

Paperid: 1218, https://arxiv.org/pdf/2504.04332.pdf

Abstract:
As language models achieve increasingly human-like capabilities in conversational text generation, a critical question emerges: to what extent can these systems simulate the characteristics of specific individuals? To evaluate this, we introduce IMPersona, a framework for evaluating LMs at impersonating specific individuals' writing style and personal knowledge. Using supervised fine-tuning and a hierarchical memory-inspired retrieval system, we demonstrate that even modestly sized open-source models, such as Llama-3.1-8B-Instruct, can achieve impersonation abilities at concerning levels. In blind conversation experiments, participants (mis)identified our fine-tuned models with memory integration as human in 44.44% of interactions, compared to just 25.00% for the best prompting-based approach. We analyze these results to propose detection methods and defense strategies against such impersonation attempts. Our findings raise important questions about both the potential applications and risks of personalized language models, particularly regarding privacy, security, and the ethical deployment of such technologies in real-world contexts.

Paperid: 1219, https://arxiv.org/pdf/2504.03105.pdf

Abstract:
As AI takes on increasingly complex roles in human-computer interaction, fundamental questions arise: how can HCI help maintain the user as the primary agent while augmenting human cognition and intelligence? This paper suggests questions to guide researchers in considering the implications for agency, autonomy, the augmentation of human intellect, and the future of human-AI synergies. We observe a key paradigm shift behind the transformation of HCI, shifting from explicit command-and-control models to systems where users define high-level goals directly. This shift will be facilitated by XR technologies, whose multi-modal inputs and outputs offer a more seamless way to convey these goals. This paper considers this transformation through the lens of two cultural milestones: the personal computer and the automobile, moving beyond traditional interfaces like keyboards or steering wheels and thinking of them as vessels for everyday XR.

Paperid: 1220, https://arxiv.org/pdf/2504.01415.pdf

Abstract:
Usability issues can hinder the effective use of software. Therefore, various techniques are deployed to diagnose and mitigate them. However, these techniques are costly and time-consuming, particularly in iterative design and development. A substantial body of research indicates that automation and artificial intelligence can enhance the process of obtaining usability insights. In our systematic review of 155 publications, we offer a comprehensive overview of the current state of the art for automated usability issue detection. We analyze trends, paradigms, and the technical context in which they are applied. Finally, we discuss the implications and potential directions for future research.

Paperid: 1221, https://arxiv.org/pdf/2504.01377.pdf

Abstract:
Saving, or checkpointing, intermediate results during interactive data exploration can potentially boost user productivity. However, existing studies on this topic are limited, as they primarily rely on small-scale experiments with human participants - a fundamental constraint of human subject studies. To address this limitation, we employ AI agents to simulate a large number of complex data exploration scenarios, including revisiting past states and branching into new exploration paths. This strategy enables us to accurately assess the impact of checkpointing while closely mimicking the behavior of real-world data practitioners. Our evaluation results, involving more than 1,000 exploration paths and 2,848 executed code blocks, show that a checkpointing framework for computational notebooks can indeed enhance productivity by minimizing unnecessary code re-executions and redundant variables or code.

Paperid: 1222, https://arxiv.org/pdf/2504.01367.pdf

Abstract:
There is a gap between how people explore data and how Jupyter-like computational notebooks are designed. People explore data nonlinearly, using execution undos, branching, and/or complete reverts, whereas notebooks are designed for sequential exploration. Recent works like ForkIt are still insufficient to support these multiple modes of nonlinear exploration in a unified way. In this work, we address the challenge by introducing two-dimensional code+data space versioning for computational notebooks and verifying its effectiveness using our prototype system, Kishuboard, which integrates with Jupyter. By adjusting code and data knobs, users of Kishuboard can intuitively manage the state of computational notebooks in a flexible way, thereby achieving both execution rollbacks and checkouts across complex multi-branch exploration history. Moreover, this two-dimensional versioning mechanism can easily be presented along with a friendly one-dimensional history. Human subject studies indicate that Kishuboard significantly enhances user productivity in various data science tasks.

Paperid: 1223, https://arxiv.org/pdf/2504.01121.pdf

Abstract:
In this work, we analyze video data and interviews from a public deployment of two trash barrel robots in a large public space to better understand the sensemaking activities people perform when they encounter robots in public spaces. Based on an analysis of 274 human-robot interactions and interviews with N=65 individuals or groups, we discovered that people were responding not only to the robots or their behavior, but also to the general idea of deploying robots as trashcans, and the larger social implications of that idea. They wanted to understand details about the deployment because having that knowledge would change how they interact with the robot. Based on our data and analysis, we have provided implications for design that may be topics for future human-robot design researchers who are exploring robots for public space deployment. Furthermore, our work offers a practical example of analyzing field data to make sense of robots in public spaces.

Paperid: 1224, https://arxiv.org/pdf/2504.00692.pdf

Abstract:
Increased usage of generative AI (GenAI) in Human-Computer Interaction (HCI) research induces a climate impact from carbon emissions due to energy consumption of the hardware used to develop and run GenAI models and systems. The exact energy usage and subsequent carbon emissions are difficult to estimate in HCI research because HCI researchers most often use cloud-based services where the hardware and its energy consumption are hidden from plain view. The HCI GenAI CO2ST Calculator is a tool designed specifically for the HCI research pipeline, to help researchers estimate the energy consumption and carbon footprint of using generative AI in their research, either a priori (allowing for mitigation strategies or experimental redesign) or post hoc (allowing for transparent documentation of carbon footprint in written reports of the research).

Paperid: 1225, https://arxiv.org/pdf/2504.00371.pdf

Abstract:
Desktop environments can integrate augmented reality (AR) head-worn devices to support 3D representations, visualizations, and interactions in a novel yet familiar setting. As users navigate across the dual realities -- desktop and AR -- a way to move 3D objects between them is needed. We devise three baseline transition techniques based on common approaches in the literature and evaluate their usability and practicality in an initial user study (N=18). After refining both our transition techniques and the surrounding technical setup, we validate the applicability of the overall concept for real-world activities in an expert user study (N=6). In it, computational chemists followed their usual desktop workflows to build, manipulate, and analyze 3D molecular structures, but now aided with the addition of AR and our transition techniques. Based on our findings from both user studies, we provide lessons learned and takeaways for the design of 3D object transition techniques in desktop + AR environments.

Paperid: 1226, https://arxiv.org/pdf/2503.24160.pdf

Abstract:
Information Visualization (InfoVis) systems utilize visual representations to enhance data interpretation. Understanding how visual attention is allocated is essential for optimizing interface design. However, collecting Eye-tracking (ET) data presents challenges related to cost, privacy, and scalability. Computational models provide alternatives for predicting gaze patterns, thereby advancing InfoVis research. In our study, we conducted an ET experiment with 40 participants who analyzed graphs while responding to questions of varying complexity within the context of digital forensics. We compared human scanpaths with synthetic ones generated by models such as DeepGaze, UMSS, and Gazeformer. Our research evaluates the accuracy of these models and examines how question complexity and number of nodes influence performance. This work contributes to the development of predictive modeling in visual analytics, offering insights that can enhance the design and effectiveness of InfoVis systems.

Paperid: 1227, https://arxiv.org/pdf/2503.19537.pdf

Abstract:
Phone automation agents aim to autonomously perform a given natural-language user request, such as scheduling appointments or booking a hotel. While much research effort has been devoted to screen understanding and action planning, complex tasks often necessitate user interaction for successful completion. Aligning the agent with the user's expectations is crucial for building trust and enabling personalized experiences. This requires the agent to proactively engage the user when necessary, avoiding actions that violate their preferences while refraining from unnecessary questions where a default action is expected. We argue that such subtle agent-initiated interaction with the user deserves focused research attention. To promote such research, this paper introduces a task formulation for detecting the need for user interaction and generating appropriate messages. We thoroughly define the task, including aspects like interaction timing and the scope of the agent's autonomy. Using this definition, we derived annotation guidelines and created AndroidInteraction, a diverse dataset for the task, leveraging an existing UI automation dataset. We tested several text-based and multimodal baseline models for the task, finding that it is very challenging for current LLMs. We suggest that our task formulation, dataset, baseline models and analysis will be valuable for future UI automation research, specifically in addressing this crucial yet often overlooked aspect of agent-initiated interaction. This work provides a needed foundation to allow personalized agents to properly engage the user when needed, within the context of phone UI automation.

Paperid: 1228, https://arxiv.org/pdf/2503.18805.pdf

Abstract:
To address the shortage of a skilled workforce in the U.S. manufacturing industry, immersive Virtual Reality (VR)-based training solutions hold promising potential. To effectively utilize VR to meet workforce demands, it is important to understand the role of VR in manufacturing education. Therefore, we conduct a scoping review in the field. As a first step, we used a 5W1H (What, Where, Who, When, Why, How) formula as a problem-solving approach to define a comprehensive taxonomy that can consider the role of VR from all relevant possibilities. Our taxonomy categorizes VR applications across three key aspects: (1) Domains, (2) Levels, and (3) Entities. Using a systematic literature search and analysis, we reviewed 108 research articles to find the current state, benefits, challenges, and future opportunities of VR in the field. It was found that VR has been explored in a variety of areas and provides numerous benefits to learners. Despite these benefits, its adoption in manufacturing education is limited. This review discusses the identified barriers and provides actionable insights to address them. These insights can enable the widespread usage of immersive technology to nurture and develop a workforce equipped with the skills required to excel in the evolving landscape of manufacturing.

Paperid: 1229, https://arxiv.org/pdf/2503.18778.pdf

Abstract:
In this paper we propose an advanced approach to integrating artificial intelligence (AI) into healthcare: autonomous decision support. This approach allows the AI algorithm to act autonomously for a subset of patient cases whilst serving a supportive role in other subsets of patient cases based on defined delegation criteria. By leveraging the complementary strengths of both humans and AI, it aims to deliver greater overall performance than existing human-AI teaming models. It ensures safe handling of patient cases and potentially reduces clinician review time, whilst being mindful of AI tool limitations. After setting the approach within the context of current human-AI teaming models, we outline the delegation criteria and apply them to a specific AI-based tool used in histopathology. The potential impact of the approach and the regulatory requirements for its successful implementation are then discussed.

Paperid: 1230, https://arxiv.org/pdf/2503.17306.pdf

Abstract:
Facial mimicry - the automatic, unconscious imitation of others' expressions - is vital for emotional understanding. This study investigates how mimicry differs across emotions using Face Action Units from videos and participants' responses. Dynamic Time Warping quantified the temporal alignment between participants' and stimuli's facial expressions, revealing significant emotional variations. Post-hoc tests indicated greater mimicry for 'Fear' than 'Happy' and reduced mimicry for 'Anger' compared to 'Fear'. The mimicry correlations with personality traits like Extraversion and Agreeableness were significant, showcasing subtle yet meaningful connections. These findings suggest specific emotions evoke stronger mimicry, with personality traits playing a secondary role in emotional alignment. Notably, our results highlight how personality-linked mimicry mechanisms extend beyond interpersonal communication to affective computing applications, such as remote human-human interactions and human-virtual-agent scenarios. Insights from temporal facial mimicry - e.g., designing digital agents that adaptively mirror user expressions - enable developers to create empathetic, personalized systems, enhancing emotional resonance and user engagement.
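
The abstract uses Dynamic Time Warping (DTW) to quantify temporal alignment between two facial-expression signals. The following is a minimal sketch of the classic DTW recurrence on two one-dimensional sequences, assuming an absolute-difference local cost and no warping-window constraint; the paper's actual features, cost function, and constraints are not given in the abstract, and `dtw_distance` is a hypothetical name.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW cost between two 1-D sequences.

    cost[i][j] is the cheapest cumulative cost of aligning the first i
    samples of `a` with the first j samples of `b`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical action-unit intensities: the response lags the stimulus by one frame.
stimulus = [0, 1, 2]
response = [0, 0, 1, 2]
lagged_cost = dtw_distance(stimulus, response)
```

Because DTW allows one sequence to be stretched in time, a delayed but otherwise identical response yields a low alignment cost, which is why it suits comparing a participant's expression trace with that of the stimulus.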

Paperid: 1231, https://arxiv.org/pdf/2503.16532.pdf

Abstract:
Accurate emotion recognition is pivotal for nuanced and engaging human-computer interactions, yet remains difficult to achieve, especially in dynamic, conversation-like settings. In this study, we showcase how integrating eye-tracking data, temporal dynamics, and personality traits can substantially enhance the detection of both perceived and felt emotions. Seventy-three participants viewed short, speech-containing videos from the CREMA-D dataset, while being recorded for eye-tracking signals (pupil size, fixation patterns), Big Five personality assessments, and self-reported emotional states. Our neural network models combined these diverse inputs including stimulus emotion labels for contextual cues and yielded marked performance gains compared to the state-of-the-art. Specifically, perceived valence predictions reached a macro F1-score of 0.76, and models incorporating personality traits and stimulus information demonstrated significant improvements in felt emotion accuracy. These results highlight the benefit of unifying physiological, individual and contextual factors to address the subjectivity and complexity of emotional expression. Beyond validating the role of user-specific data in capturing subtle internal states, our findings inform the design of future affective computing and human-agent systems, paving the way for more adaptive and cross-individual emotional intelligence in real-world interactions.
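
The abstract reports a macro F1-score for perceived valence. As an illustrative sketch of how that metric is defined, and not the authors' evaluation code, macro F1 is the unweighted mean of the per-class F1 scores; the label values and the function name `macro_f1` below are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical valence labels for four clips.
truth = ["pos", "pos", "neg", "neg"]
preds = ["pos", "neg", "neg", "neg"]
valence_f1 = macro_f1(truth, preds)
```

Since every class contributes equally regardless of how many samples it has, macro F1 penalizes a model that performs well only on the most frequent emotion class.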

Paperid: 1232, https://arxiv.org/pdf/2503.16493.pdf

Abstract:
An underlying assumption of many existing approaches to human-robot task communication is that the robot possesses a sufficient amount of environmental domain knowledge, including the locations of task-critical objects. This assumption is unrealistic if the locations of known objects change or have not yet been discovered by the robot. In this work, our key insight is that in many scenarios, robot end users possess more scene insight than the robot and need ways to express it. Presently, there is a lack of research on how solutions for collecting end-user scene insight should be designed. We therefore created an Uncertainty Expression System (UES) to investigate how best to elicit end-user scene insight. The UES allows end users to convey their knowledge of object uncertainty using one of three interfaces: (1) a precision interface that allows meticulous expression of scene insight; (2) a painting interface by which users create a heat map of possible object locations; or (3) a ranking interface by which end users express object locations via an ordered list. We then conducted a user study to compare the effectiveness of these approaches based on the accuracy of scene insight conveyed to the robot, the efficiency at which end users are able to express this scene insight, and both usability and task load. Results indicate that the ranking interface is more user friendly and efficient than the precision interface, and that the painting interface is the least accurate.

Paperid: 1233, https://arxiv.org/pdf/2503.16480.pdf

Abstract:
As large language models (LLMs) enter the mainstream, aligning them to foster constructive dialogue rather than exacerbate societal divisions is critical. Using an individualized and multicultural alignment dataset of over 7,500 conversations of individuals from 74 countries engaging with 21 LLMs, we examined how linguistic attributes linked to constructive interactions are reflected in human preference data used for training AI. We found that users consistently preferred well-reasoned and nuanced responses while rejecting those high in personal storytelling. However, users who believed that AI should reflect their values tended to place less preference on reasoning in LLM responses and more on curiosity. Encouragingly, we observed that users could set the tone for how constructive their conversation would be, as LLMs mirrored linguistic attributes, including toxicity, in user queries.

Paperid: 1234, https://arxiv.org/pdf/2503.16455.pdf

Abstract:
Quantitative estimation of human joint motion in daily living spaces is essential for early detection and rehabilitation tracking of neuromusculoskeletal disorders (e.g., Parkinson's) and mitigating trip and fall risks for older adults. Existing approaches involve monitoring devices such as cameras, wearables, and pressure mats, but have operational constraints such as direct line-of-sight, carrying devices, and dense deployment. To overcome these limitations, we leverage gait-induced floor vibration to estimate lower-limb joint motion (e.g., ankle, knee, and hip flexion angles), allowing non-intrusive and contactless gait health monitoring in people's living spaces. To overcome the high uncertainty in lower-limb movement given the limited information provided by the gait-induced floor vibrations, we formulate a physics-informed graph to integrate domain knowledge of gait biomechanics and structural dynamics into the model. Specifically, different types of nodes represent heterogeneous information from joint motions and floor vibrations; their connecting edges represent the physiological relationships between joints and forces governed by gait biomechanics, as well as the relationships between forces and floor responses governed by the structural dynamics. As a result, our model poses physical constraints to reduce uncertainty while allowing information sharing between the body and the floor to make more accurate predictions. We evaluate our approach with 20 participants through a real-world walking experiment. We achieved an average of 3.7 degrees of mean absolute error in estimating 12 joint flexion angles (38% error reduction from baseline), which is comparable to the performance of cameras and wearables in current medical practices.

Paperid: 1235, https://arxiv.org/pdf/2503.16452.pdf

Abstract:
Cerebral Palsy (CP) is a prevalent motor disability in children, for which early detection can significantly improve treatment outcomes. While skeleton-based Graph Convolutional Network (GCN) models have shown promise in automatically predicting CP risk from infant videos, their "black-box" nature raises concerns about clinical explainability. To address this, we introduce a perturbation framework tailored for infant movement features and use it to compare two explainable AI (XAI) methods: Class Activation Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM). First, we identify significant and non-significant body keypoints in very low- and very high-risk infant video snippets based on the XAI attribution scores. We then conduct targeted velocity and angular perturbations, both individually and in combination, on these keypoints to assess how the GCN model's risk predictions change. Our results indicate that velocity-driven features of the arms, hips, and legs have a dominant influence on CP risk predictions, while angular perturbations have a more modest impact. Furthermore, CAM and Grad-CAM show partial convergence in their explanations for both low- and high-risk CP groups. Our findings demonstrate the use of XAI-driven movement analysis for early CP prediction and offer insights into potential movement-based biomarker discovery that warrant further clinical validation.

Paperid: 1236, https://arxiv.org/pdf/2503.16114.pdf

Abstract:
Interfaces for interacting with large language models (LLMs) are often designed to mimic human conversations, typically presenting a single response to user queries. This design choice can obscure the probabilistic and predictive nature of these models, potentially fostering undue trust and over-anthropomorphization of the underlying model. In this paper, we investigate (i) the effect of displaying multiple responses simultaneously as a countermeasure to these issues, and (ii) how a cognitive support mechanism, highlighting structural and semantic similarities across responses, helps users deal with the increased cognitive load of that intervention. We conducted a within-subjects study in which participants inspected responses generated by an LLM under three conditions: one response, ten responses with cognitive support, and ten responses without cognitive support. Participants then answered questions about workload, trust and reliance, and anthropomorphization. We conclude by reporting the results of these studies and discussing future work and design opportunities for future LLM interfaces.

Paperid: 1237, https://arxiv.org/pdf/2503.15127.pdf

Abstract:
Social robot navigation is an evolving research field that aims to find efficient strategies to safely navigate dynamic environments populated by humans. A critical challenge in this domain is the accurate modeling of human motion, which directly impacts the design and evaluation of navigation algorithms. This paper presents a comparative study of two popular categories of human motion models used in social robot navigation, namely velocity-based models and force-based models. A system-theoretic representation of both model types is presented, which highlights their common feedback structure, although with different state variables. Several navigation policies based on reinforcement learning are trained and tested in various simulated environments involving pedestrian crowds modeled with these approaches. A comparative study is conducted to assess performance across multiple factors, including human motion model, navigation policy, scenario complexity and crowd density. The results highlight advantages and challenges of different approaches to modeling human behavior, as well as their role during training and testing of learning-based navigation policies. The findings offer valuable insights and guidelines for selecting appropriate human motion models when designing socially-aware robot navigation systems.

Paperid: 1238, https://arxiv.org/pdf/2503.13174.pdf

Abstract:
Head-worn augmented reality (AR) allows audiences to be immersed and engaged in stories told by live presenters. While presenters may also be in AR to have the same level of immersion and awareness as their audience, this symmetric presentation style may diminish important social cues such as eye contact. In this work, we examine the effects this (a)symmetry has on engagement, group awareness, and social interaction in co-located one-on-one augmented presentations. We developed a presentation system incorporating 2D/3D content that audiences can view and interact with in AR, with presenters controlling and delivering the presentation in either a symmetric style in AR, or an asymmetric style with a handheld tablet. We conducted a within- and between-subjects evaluation with 12 participant pairs to examine the differences between these symmetric and asymmetric presentation modalities. From our findings, we extracted four themes and derived strategies and guidelines for designers interested in augmented presentations.

Paperid: 1239, https://arxiv.org/pdf/2503.12436.pdf

Abstract:
Many communities, including the scientific community, develop implicit writing norms. Understanding them is crucial for effective communication with that community. Writers gradually develop an implicit understanding of norms by reading papers and receiving feedback on their writing. However, it is difficult to both externalize this knowledge and apply it to one's own writing. We propose two new writing support concepts that reify document and sentence-level patterns in a given text corpus: (1) an ordered distribution over section titles and (2) given the user's draft and cursor location, many retrieved contextually relevant sentences. Recurring words in the latter are algorithmically highlighted to help users see any emergent norms. Study results (N=16) show that participants revised the structure and content using these concepts, gaining confidence in aligning with or breaking norms after reviewing many examples. These results demonstrate the value of reifying distributions over other authors' writing choices during the writing process.

Paperid: 1240, https://arxiv.org/pdf/2503.12334.pdf

Abstract:
We propose a novel dual-loop system that synergistically combines responsive neurostimulation (RNS) implants with artificial intelligence-driven wearable devices for treating post-traumatic stress disorder (PTSD) and enabling naturalistic brain research. In PTSD Therapy Mode, an implanted closed-loop neural device monitors amygdala activity and provides on-demand stimulation upon detecting pathological theta oscillations, while an ensemble of wearables (smart glasses, smartwatches, smartphones) uses multimodal large language model (LLM) analysis of sensory data to detect environmental or physiological PTSD triggers and deliver timely audiovisual interventions. Logged events from both the neural and wearable loops are analyzed to personalize trigger detection and progressively transition patients to non-invasive interventions. In Neuroscience Research Mode, the same platform is adapted for real-world brain activity capture. Wearable-LLM systems recognize naturalistic events (social interactions, emotional situations, compulsive behaviors, decision making) and signal implanted RNS devices (via wireless triggers) to record synchronized intracranial data during these moments. This approach builds on recent advances in mobile intracranial EEG recording and closed-loop neuromodulation in humans (BRAIN Initiative, 2023; Mobbs et al., 2021). We discuss how our interdisciplinary system could revolutionize PTSD therapy and cognitive neuroscience by enabling 24/7 monitoring, context-aware intervention, and rich data collection outside traditional labs. The vision is a future where AI-enhanced devices continuously collaborate with the human brain, offering therapeutic support and deep insights into neural function, with the resulting real-world, context-rich neural data, in turn, accelerating the development of more biologically grounded and human-centric AI.

Paperid: 1241, https://arxiv.org/pdf/2503.11357.pdf

Abstract:
Text entry for extended reality (XR) is far from perfect, and a variety of text entry techniques (TETs) have been proposed to fit various contexts of use. However, comparing between TETs remains challenging due to the lack of a consolidated collection of techniques, and limited understanding of how interaction attributes of a technique (e.g., presence of visual feedback) impact user performance. To address these gaps, this paper examines the current landscape of XR TETs by creating a database of 176 different techniques. We analyze this database to highlight trends in the design of these techniques, the metrics used to evaluate them, and how various interaction attributes impact these metrics. We discuss implications for future techniques and present TEXT: Text Entry for XR Trove, an interactive online tool to navigate our database.

Paperid: 1242, https://arxiv.org/pdf/2503.11018.pdf

Abstract:
Software engineers use code-fluent large language models (LLMs) to help explain unfamiliar code, yet LLM explanations are not adapted to engineers' diverse problem-solving needs. We prompted an LLM to adapt to five problem-solving style types from an inclusive design method, the Gender Inclusiveness Magnifier (GenderMag). We ran a user study with software engineers to examine the impact of explanation adaptations on software engineers' perceptions, both for explanations which matched and mismatched engineers' problem-solving styles. We found that explanations were more frequently beneficial when they matched problem-solving style, but not every matching adaptation was equally beneficial; in some instances, diverse engineers found as much (or more) benefit from mismatched adaptations. Through an equity and inclusivity lens, our work highlights the benefits of having an LLM adapt its explanations to match engineers' diverse problem-solving style values, the potential harms when matched adaptations were not perceived well by engineers, and a comparison of how matching and mismatching LLM adaptations impacted diverse engineers.

Paperid: 1243, https://arxiv.org/pdf/2503.07928.pdf

Abstract:
The widespread availability of large language models (LLMs), such as ChatGPT, has significantly impacted education, raising both opportunities and challenges. Students can frequently interact with LLM-powered, interactive learning tools, but their usage patterns need to be monitored and understood. We introduce StudyChat, a publicly available dataset capturing real-world student interactions with an LLM-powered tutoring chatbot in a semester-long, university-level artificial intelligence (AI) course. We deploy a web application that replicates ChatGPT's core functionalities, and use it to log student interactions with the LLM while working on programming assignments. We collect 16,851 interactions, which we annotate using a dialogue act labeling schema inspired by observed interaction patterns and prior research. We analyze these interactions, highlight usage trends, and examine how specific student behavior correlates with their course outcome. We find that students who prompt LLMs for conceptual understanding and coding help tend to perform better on assignments and exams. Moreover, students who use LLMs to write reports and circumvent assignment learning objectives have lower outcomes on exams than others. StudyChat serves as a shared resource to facilitate further research on the evolving role of LLMs in education.

Paperid: 1244, https://arxiv.org/pdf/2503.06500.pdf

Abstract:
Data profiling plays a critical role in understanding the structure of complex datasets and supporting numerous downstream tasks, such as social media analytics and financial fraud detection. While existing research predominantly focuses on structured data formats, a substantial portion of semi-structured textual data still requires ad-hoc and arduous manual profiling to extract and comprehend its internal structures. In this work, we propose StructVizor, an interactive profiling system that facilitates sensemaking and transformation of semi-structured textual data. Our tool mainly addresses two challenges: a) extracting and visualizing the diverse structural patterns within data, such as how information is organized or related, and b) enabling users to efficiently perform various wrangling operations on textual data. Through automatic data parsing and structure mining, StructVizor enables visual analytics of structural patterns, while incorporating novel interactions to enable profile-based data wrangling. A comparative user study involving 12 participants demonstrates the system's usability and its effectiveness in supporting exploratory data analysis and transformation tasks.

Paperid: 1245, https://arxiv.org/pdf/2503.06345.pdf

Abstract:
We present ARctic Escape, a co-located augmented reality (AR) escape room designed to promote collaboration between dyads through play. While physical escape rooms provide groups with fun, social experiences, they require a gameplay venue, props, and a game master, all of which detract from their ease of access. Existing AR escape rooms demonstrate that AR can make escape room experiences easier to access. Still, many AR escape rooms are single-player and therefore fail to maintain the social and collaborative elements of their physical counterparts. This paper presents ARctic Escape, a two-person AR escape room with clues emphasizing player interaction and teamwork. We evaluated ARctic Escape by conducting semi-structured interviews with four dyads to learn about participants' interpersonal dynamics and experiences during gameplay. We found that participants thought the experience was fun, collaborative, promoted discussion, and inspired new social dynamics, but sometimes the escape room's reliance on virtual content was disorienting.

Paperid: 1246, https://arxiv.org/pdf/2503.05962.pdf

Abstract:
Following recipes while cooking is an important but difficult task for visually impaired individuals. We developed OSCAR (Object Status Context Awareness for Recipes), a novel approach that provides recipe progress tracking and context-aware feedback on the completion of cooking tasks through tracking object statuses. OSCAR leverages both Large Language Models (LLMs) and Vision-Language Models (VLMs) to manipulate recipe steps, extract object status information, align visual frames with object status, and provide a cooking progress tracking log. We evaluated OSCAR's recipe following functionality using 173 YouTube cooking videos and 12 real-world non-visual cooking videos to demonstrate OSCAR's capability to track cooking steps and provide contextual guidance. Our results highlight the effectiveness of using object status to improve performance compared to baseline by over 20% across different VLMs, and we present factors that impact prediction performance. Furthermore, we contribute a dataset of real-world non-visual cooking videos with step annotations as an evaluation benchmark.

Paperid: 1247, https://arxiv.org/pdf/2503.05780.pdf

Abstract:
The rapid evolution of generative AI has expanded the breadth of risks associated with AI systems. While various taxonomies and frameworks exist to classify these risks, the lack of interoperability between them creates challenges for researchers, practitioners, and policymakers seeking to operationalise AI governance. To address this gap, we introduce the AI Risk Atlas, a structured taxonomy that consolidates AI risks from diverse sources and aligns them with governance frameworks. Additionally, we present the Risk Atlas Nexus, a collection of open-source tools designed to bridge the divide between risk definitions, benchmarks, datasets, and mitigation strategies. This knowledge-driven approach leverages ontologies and knowledge graphs to facilitate risk identification, prioritization, and mitigation. By integrating AI-assisted compliance workflows and automation strategies, our framework lowers the barrier to responsible AI adoption. We invite the broader research and open-source community to contribute to this evolving initiative, fostering cross-domain collaboration and ensuring AI governance keeps pace with technological advancements.

Paperid: 1248, https://arxiv.org/pdf/2503.05470.pdf

Abstract:
The interest and concerns about diversity in software development have soared in recent years. Reporting diversity-related aspects of software projects can increase user trust and help regulators evaluate potential adoption. Furthermore, recent directives around AI are beginning to require diversity information in the development of AI products, indicating the growing interest of public regulators in it. Despite this importance, current documentation assets in software development processes frequently overlook diversity in favor of technical features, partly due to a lack of tools for describing and annotating diversity. This work introduces the Software Diversity Card, a comprehensive framework for reporting diversity-related aspects of software projects. The card is designed to profile the different types of teams involved in developing and governing software projects (including the final user groups involved in testing), and the software adaptations for specific social groups. To encourage its adoption, we provide a diversity modeling language, a toolkit for generating the cards using such language, and a collection of real-world examples from active software projects. Our proposal can enhance diversity practices in software development (e.g., through open-source projects like the CONTRIBUTING.md file), support public administrations in software assessment, and help businesses promote diversity as a key asset.

Paperid: 1249, https://arxiv.org/pdf/2503.05263.pdf

Abstract:
Dark patterns are deceptive strategies that recent work in human-computer interaction (HCI) has captured throughout digital domains, including social networking sites (SNSs). While research has identified difficulties among people to recognise dark patterns effectively, few studies consider vulnerable populations and their experience in this regard, including people with attention deficit hyperactivity disorder (ADHD), who may be especially susceptible to attention-grabbing tricks. Based on an interactive web study with 135 participants, we investigate SNS users' ability to recognise and avoid dark patterns by comparing results from participants with and without ADHD. In line with prior work, we noticed overall low recognition of dark patterns with no significant differences between the two groups. Yet, ADHD individuals were able to avoid specific dark patterns more often. Our results advance previous work by understanding dark patterns in a realistic environment and offer insights into their effect on vulnerable populations.

Paperid: 1250, https://arxiv.org/pdf/2503.05041.pdf

Abstract:
External Human-Machine Interfaces (eHMIs) are critical for seamless interactions between autonomous vehicles (AVs) and pedestrians in shared spaces. However, they often struggle to adapt to these environments, where pedestrian movement is fluid and right-of-way is ambiguous. To address these challenges, we propose PaveFlow, an eHMI that projects the AV's intended path onto the ground in real time, providing continuous spatial information rather than a binary stop/go signal. Through a VR study (N=18), we evaluated PaveFlow's effectiveness under two AV density conditions (single vs. multiple AVs) and a baseline condition without PaveFlow. The results showed that PaveFlow significantly improved pedestrian perception of safety, trust, and user experience while reducing cognitive workload. This performance remained consistent across both single and multiple AV conditions, despite persistent tensions in priority negotiation. These findings suggest that path projection enhances eHMI transparency by offering richer movement cues, which may better support AV-pedestrian interaction in shared spaces.

Paperid: 1251, https://arxiv.org/pdf/2503.04318.pdf

Abstract:
This paper presents InFL-UX, an interactive, proof-of-concept browser-based Federated Learning (FL) toolkit designed to integrate user contributions seamlessly into the machine learning (ML) workflow. InFL-UX enables users across multiple devices to upload datasets, define classes, and collaboratively train classification models directly in the browser using modern web technologies. Unlike traditional FL toolkits, which often focus on backend simulations, InFL-UX provides a simple user interface for researchers to explore how users interact with and contribute to FL systems in real-world, interactive settings. By prioritising usability and decentralised model training, InFL-UX bridges the gap between FL and Interactive Machine Learning (IML), empowering non-technical users to actively participate in ML classification tasks.
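
As a rough illustration of how models trained on several devices can be combined without pooling the data, here is a minimal sketch of federated averaging, one common aggregation rule. The abstract does not say which aggregation scheme InFL-UX uses, so the choice of federated averaging, the function name `federated_average`, and the flat-list representation of model parameters are all assumptions made for this example.

```python
def federated_average(client_weights, client_sizes):
    """Combine client parameter vectors, weighting each by its local dataset size.

    client_weights: list of equal-length lists of floats (one per client).
    client_sizes: number of local training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        # Each client contributes in proportion to its share of the data.
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged
```

In a browser-based setting each client would compute its own update locally and send only the parameters, not the uploaded dataset, to whichever party performs this aggregation step.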

Paperid: 1252, https://arxiv.org/pdf/2503.03924.pdf

Abstract:
The rapid adoption of generative AI (GenAI) in design has sparked discussions about its benefits and unintended consequences. While AI is often framed as a tool for enhancing productivity by automating routine tasks, historical research on automation warns of paradoxical effects, such as de-skilling and misplaced responsibilities. To assess UX practitioners' perceptions of AI, we analyzed over 120 articles and discussions from UX-focused subreddits. Our findings indicate that while practitioners express optimism about AI reducing repetitive work and augmenting creativity, they also highlight concerns about over-reliance, cognitive offloading, and the erosion of critical design skills. Drawing from human-automation interaction literature, we discuss how these perspectives align with well-documented automation ironies and function allocation challenges. We argue that UX professionals should critically evaluate AI's role beyond immediate productivity gains and consider its long-term implications for creative autonomy and expertise. This study contributes empirical insights into practitioners' perspectives and links them to broader debates on automation in design.

Paperid: 1253, https://arxiv.org/pdf/2503.03852.pdf

Abstract:
Research on cognitive biases and heuristics has become increasingly popular in the visualization literature in recent years. Researchers have studied the effects of biases on visualization interpretation and subsequent decision-making. While this work is important, we contend that the view on biases has presented human cognitive abilities in an unbalanced manner, placing too much emphasis on the flaws and limitations of human decision-making, and potentially suggesting that it should not be trusted. Several decision researchers have argued that the flip side of biases -- i.e., mental shortcuts or heuristics -- demonstrate human ingenuity and serve as core markers of adaptive expertise. In this paper, we review the perspectives and sentiments of the visualization community on biases and describe literature arguing for more balanced views of biases and heuristics. We hope this paper will encourage visualization researchers to consider a fuller picture of human cognitive limitations and strategies for making decisions in complex environments.

Paperid: 1254, https://arxiv.org/pdf/2503.03462.pdf

Abstract:
The prevailing paradigm in the domain of Open-Domain Dialogue agents predominantly focuses on the English language, encompassing both models and datasets. Furthermore, the financial and temporal investments required for crowdsourcing such datasets for finetuning are substantial, particularly when multiple languages are involved. Fortunately, advancements in Large Language Models (LLMs) have unveiled a plethora of possibilities across diverse tasks. Specifically, instruction-tuning has enabled LLMs to execute tasks based on natural language instructions, occasionally surpassing the performance of human crowdworkers. Additionally, these models possess the capability to function in various languages within a single thread. Consequently, to generate new samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating Open-Domain Dialogue data in multiple Target Languages using LLMs, with demonstrations provided in a unique Source Language. By eschewing explicit Machine Translation in this approach, we enhance the adherence to language-specific nuances. We apply this methodology to the PersonaChat dataset. To enhance the openness of generated dialogues and mimic real-life scenarios, we added the notion of speech events, corresponding to the type of conversation the speakers are involved in, and that of common ground, which represents the premises of a conversation.

Paperid: 1255, https://arxiv.org/pdf/2503.02853.pdf

Abstract:
The monitoring and prediction of in-class student activities is of paramount importance for the comprehension of engagement and the enhancement of pedagogical efficacy. The accurate detection of these activities enables educators to modify their lessons in real time, thereby reducing negative emotional states and enhancing the overall learning experience. To this end, the use of non-intrusive devices, such as inertial measurement units (IMUs) embedded in smartwatches, represents a viable solution. The development of reliable predictive systems has been limited by the lack of large, labeled datasets in education. To bridge this gap, we present a novel dataset for in-class activity detection using affordable IMU sensors. The dataset comprises 19 diverse activities, both instantaneous and continuous, performed by 12 participants in typical classroom scenarios. It includes accelerometer, gyroscope, rotation vector data, and synchronized stereo images, offering a comprehensive resource for developing multimodal algorithms using sensor and visual data. This dataset represents a key step toward scalable solutions for activity recognition in educational settings.

Paperid: 1256, https://arxiv.org/pdf/2503.02328.pdf

Abstract:
Misinformation surrounding emerging outbreaks poses a serious societal threat, making robust countermeasures essential. One promising approach is stance detection (SD), which identifies whether social media posts support or oppose misleading claims. In this work, we finetune classifiers on COVID-19 misinformation SD datasets consisting of claims and corresponding tweets. Specifically, we test controllable misinformation generation (CMG) using large language models (LLMs) as a method for data augmentation. While CMG demonstrates the potential for expanding training datasets, our experiments reveal that performance gains over traditional augmentation methods are often minimal and inconsistent, primarily due to built-in safeguards within LLMs. We release our code and datasets to facilitate further research on misinformation detection and generation.

Paperid: 1257, https://arxiv.org/pdf/2502.19877.pdf

Abstract:
Joint attention is a critical component of early speech-language development and a key indicator of effective parent-child interaction. However, research on detecting and analysing joint attention remains limited, particularly for Multimodal Large Language Models (MLLMs). This study evaluates MLLMs' ability to comprehend joint attention by analysing 26 parent-child interaction videos annotated by two speech-language pathologists. These annotations identify strong and poor joint attention segments, serving as benchmarks for evaluating the models' interpretive capabilities. Our findings reveal that current MLLMs struggle to accurately interpret joint attention due to a lack of nuanced understanding of child-initiated eye contact, a crucial component of joint attention dynamics. This study highlights the importance of incorporating detailed eye contact to enhance MLLMs' multimodal reasoning. Addressing these gaps is essential for future research to advance the use of MLLMs in analysing and supporting parent-child interactions.

Paperid: 1258, https://arxiv.org/pdf/2502.19822.pdf

Abstract:
In social service, administrative burdens and decision-making challenges often hinder practitioners from performing effective casework. Generative AI (GenAI) offers significant potential to streamline these tasks, yet exacerbates concerns about overreliance, algorithmic bias, and loss of identity within the profession. We explore these issues through a two-stage participatory design study. We conducted formative co-design workshops (n=27) to create a prototype GenAI tool, followed by contextual inquiry sessions with practitioners (n=24) using the tool with real case data. We reveal opportunities for AI integration in documentation, assessment, and worker supervision, while highlighting risks related to GenAI limitations, skill retention, and client safety. Drawing comparisons with GenAI tools in other fields, we discuss design and usage guidelines for such tools in social service practice.

Paperid: 1259, https://arxiv.org/pdf/2502.19422.pdf

Abstract:
This paper focuses on the piloting of CyberScholar, a Generative AI assistant tool that aims to provide formative feedback on writing in K-12 contexts. Specifically, this study explores how students worked with CyberScholar in diverse subject areas, including English Language Arts, Social Studies, and Modern World History classes in Grades 7, 8, 10, and 11 in three schools in the Midwest and one in the Northwest of the United States. This paper focuses on CyberScholar's potential to support K-12 students' writing in diverse subject areas requiring written assignments. Data were collected through implementation observations, surveys, and interviews with 121 participating students and 4 teachers. Thematic qualitative analysis revealed that the feedback tool was perceived as a valuable tool for supporting student writing through detailed feedback, enhanced interactivity, and alignment with rubric criteria. Students appreciated the tool's guidance in refining their writing. For the students, the assistant tool suggests restructuring feedback as a dynamic, dialogic process rather than a static evaluation, a shift that aligns with the cyber-social learning idea, self-regulation, and metacognition. For the teaching side, the findings indicate a shift in teachers' roles, from serving primarily as evaluators to guiding AI feedback processes that foster better student writing and critical thinking.

Paperid: 1260, https://arxiv.org/pdf/2502.19171.pdf

Abstract:
Urban gardening is widely recognized for its numerous health and environmental benefits. However, the lack of suitable garden spaces, demanding daily schedules and limited gardening expertise present major roadblocks for citizens looking to engage in urban gardening. While prior research has explored smart home solutions to support urban gardeners, these approaches currently do not fully address these practical barriers. In this paper, we present PlantPal, a system that enables the cultivation of garden spaces irrespective of one's location, expertise level, or time constraints. PlantPal enables the shared operation of a precision agriculture robot (PAR) that is equipped with garden tools and a multi-camera system. Insights from a 3-week deployment (N=18) indicate that PlantPal facilitated the integration of gardening tasks into daily routines, fostered a sense of connection with one's field, and provided an engaging experience despite the remote setting. We contribute design considerations for future robot-assisted urban gardening concepts.

Paperid: 1261, https://arxiv.org/pdf/2502.18737.pdf

Abstract:
Despite Generative AI (GenAI) systems' potential for enhancing content creation, users often struggle to effectively integrate GenAI into their creative workflows. Core challenges include misalignment of AI-generated content with user intentions (intent elicitation and alignment), user uncertainty around how to best communicate their intents to the AI system (prompt formulation), and insufficient flexibility of AI systems to support diverse creative workflows (workflow flexibility). Motivated by these challenges, we created IntentTagger: a system for slide creation based on the notion of Intent Tags - small, atomic conceptual units that encapsulate user intent - for exploring granular and non-linear micro-prompting interactions for Human-GenAI co-creation workflows. Our user study with 12 participants provides insights into the value of flexibly expressing intent across varying levels of ambiguity, meta-intent elicitation, and the benefits and challenges of intent tag-driven workflows. We conclude by discussing the broader implications of our findings and design considerations for GenAI-supported content creation workflows.

Paperid: 1262, https://arxiv.org/pdf/2502.18685.pdf

Abstract:
Using a sample of 25,000 Bing Copilot conversations, we study how the agent responds to users of varying levels of domain expertise and the resulting impact on user experience along multiple dimensions. Our findings show that across a variety of topical domains, the agent largely responds at proficient or expert levels of expertise (77% of conversations) which correlates with positive user experience regardless of the user's level of expertise. Misalignment, such that the agent responds at a level of expertise below that of the user, has a negative impact on overall user experience, with the impact more profound for more complex tasks. We also show that users engage more, as measured by the number of words in the conversation, when the agent responds at a level of expertise commensurate with that of the user. Our findings underscore the importance of alignment between user and AI when designing human-centered AI systems, to ensure satisfactory and productive interactions.

Paperid: 1263, https://arxiv.org/pdf/2502.17715.pdf

Abstract:
Generating diverse follow-up questions that uncover missing information remains challenging for conversational agents, particularly when they run on small, locally hosted models. To address this, we develop an information-gap-driven knowledge distillation pipeline in which a teacher LLM generates a comprehensive answer, contrasts it with the initial answer to identify information gaps, and formulates gap-bridging follow-up questions. Using this pipeline, we augment the existing FollowupQG dataset tenfold. We then fine-tune smaller student models on the augmented dataset to distill the teacher's knowledge. Experiments with selected teacher-student model pairs show that fine-tuned students achieve significantly higher informativeness and diversity than variations trained on the original dataset. These findings indicate that our pipeline, which mirrors the human cognitive process of information seeking, provides an efficient distillation channel from state-of-the-art LLMs to smaller models, enabling resource-constrained conversational systems to generate more diverse and informative follow-up questions.
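
The three teacher-side steps described above (comprehensive answer, gap identification, gap-bridging questions) can be sketched as follows. This is only an outline under stated assumptions: the `teacher` argument is a hypothetical callable that maps a prompt string to generated text, and the function name and prompt wording are invented for illustration and are not taken from the paper or from any particular LLM API.

```python
from typing import Callable, List


def generate_followups(question: str, initial_answer: str,
                       teacher: Callable[[str], str],
                       n_questions: int = 3) -> List[str]:
    """Produce follow-up questions that target what the initial answer left out.

    `teacher` is a hypothetical prompt -> text callable standing in for the
    teacher LLM; the prompts below are illustrative, not the paper's.
    """
    # Step 1: ask the teacher for a comprehensive answer to the question.
    comprehensive = teacher(f"Answer as completely as possible:\n{question}")

    # Step 2: contrast it with the initial answer to surface information gaps.
    gaps = teacher(
        "List the information present in the comprehensive answer but "
        "missing from the initial answer.\n"
        f"Initial answer: {initial_answer}\n"
        f"Comprehensive answer: {comprehensive}"
    )

    # Step 3: turn the gaps into follow-up questions, one per line.
    raw = teacher(
        f"Write {n_questions} follow-up questions, one per line, "
        f"that would elicit these missing points:\n{gaps}"
    )
    return [line.strip() for line in raw.splitlines() if line.strip()][:n_questions]
```

The resulting question lists would then serve as additional training targets when fine-tuning a smaller student model.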

Paperid: 1264, https://arxiv.org/pdf/2502.15365.pdf

Abstract:
This study quantitatively examines which features of AI-generated text lead humans to perceive subjective consciousness in large language model (LLM)-based AI systems. Drawing on 99 passages from conversations with Claude 3 Opus and focusing on eight features -- metacognitive self-reflection, logical reasoning, empathy, emotionality, knowledge, fluency, unexpectedness, and subjective expressiveness -- we conducted a survey with 123 participants. Using regression and clustering analyses, we investigated how these features influence participants' perceptions of AI consciousness. The results reveal that metacognitive self-reflection and the AI's expression of its own emotions significantly increased perceived consciousness, while a heavy emphasis on knowledge reduced it. Participants clustered into seven subgroups, each showing distinct feature-weighting patterns. Additionally, higher prior knowledge of LLMs and more frequent usage of LLM-based chatbots were associated with greater overall likelihood assessments of AI consciousness. This study underscores the multidimensional and individualized nature of perceived AI consciousness and provides a foundation for better understanding the psychosocial implications of human-AI interaction.

Paperid: 1265, https://arxiv.org/pdf/2502.15205.pdf

Abstract:
Inspiration plays an important role in design, yet its specific impact on data visualization design practice remains underexplored. This study investigates how professional visualization designers perceive and use inspiration in their practice. Through semi-structured interviews, we examine their sources of inspiration, the value they place on them, and how they navigate the balance between inspiration and imitation. Our findings reveal that designers draw from a diverse array of sources, including existing visualizations, real-world phenomena, and personal experiences. Participants describe a mix of active and passive inspiration practices, often iterating on sources to create original designs. This research offers insights into the role of inspiration in visualization practice, the need to expand visualization design theory, and the implications for the development of visualization tools that support inspiration and for training future visualization designers.

Paperid: 1266, https://arxiv.org/pdf/2502.15027.pdf

Abstract:
Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results indicate that even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%. Our findings point to the need for methods that can enhance LMMs' capabilities to interpret and benefit from feedback.

Paperid: 1267, https://arxiv.org/pdf/2502.13749.pdf

Abstract:
Microchips are fundamental components of modern electronic devices, yet they remain opaque to the users who rely on them daily. This opacity, compounded by the complexity of global supply chains and the concealment of proprietary information, raises significant security, trust, and accountability issues. We investigate end users' understanding of microchips, exploring their perceptions of the societal implications and information needs regarding these essential technologies. Through an online survey with 250 participants, we found that while our participants were aware of some microchip applications, they lacked awareness of the broader security, societal, and economic implications. While our participants unanimously desired more information on microchips, their specific information needs were shaped by various factors such as the microchip's application environment and one's affinity for technology interaction. Our findings underscore the necessity for improving end users' awareness and understanding of microchips, and we provide possible directions to pursue this end.

Paperid: 1268, https://arxiv.org/pdf/2502.12354.pdf

Abstract:
Recent XAI studies have investigated what constitutes a "good" explanation in AI-assisted decision-making. Despite the widely accepted human-friendly properties of explanations, such as contrastive and selective, existing studies have yielded inconsistent findings. To address these gaps, our study focuses on the cognitive dimensions of explanation evaluation, by evaluating six explanations with different contrastive strategies and information selectivity and scrutinizing factors behind their valuation process. Our analysis finds that contrastive explanations are not the most preferable or understandable in general; rather, different contrastive and selective explanations were appreciated to a different extent based on who they are, when, how, and what to explain -- with different levels of cognitive load and engagement and sociotechnical contexts. Given these findings, we call for a nuanced view of explanation strategies, with implications for designing AI interfaces to accommodate individual and contextual differences in AI-assisted decision-making.

Paperid: 1269, https://arxiv.org/pdf/2502.11267.pdf

Abstract:
Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable -- only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and the needs, as well as the risks, of automated support in human prompt engineering, providing insights for future tool design.

Paperid: 1270, https://arxiv.org/pdf/2502.11015.pdf

Abstract:
The growing use of wearable devices for activity tracking, healthcare, and haptics faces challenges due to the bulkiness and short lifespan of batteries. Integration of a textile-based wireless charging and readout system into everyday clothing can enable seamless power supply and data collection around the body. However, expanding such a system to cover the entire body is challenging, as it increases electromagnetic interference with the body, degrading the performance of the wireless system. This article introduces a meandered textile coil designed for body-scale, efficient wireless charging and readout. The meander coil can confine a strong inductive field near the body surface, ensuring W-class safe charging and sensitive readout with uW-class low power. Moreover, its zigzag design is simple enough for mass production on industrial knitting machines. Therefore, the body-scale meander coil can continuously operate battery-free wearable devices across the body, leading to ubiquitous deployment of continuous full-body wearable computing into everyday clothing.

Paperid: 1271, https://arxiv.org/pdf/2502.10395.pdf

Abstract:
Intelligent tutoring systems (ITSs) are effective in helping students learn; further research could make them even more effective. Particularly desirable is research into how students learn with these systems, how these systems best support student learning, and what learning sciences principles are key in ITSs. CTAT+Tutorshop provides a full stack integrated platform that facilitates a complete research lifecycle with ITSs, which includes using ITS data to discover learner challenges, to identify opportunities for system improvements, and to conduct experimental studies. The platform includes authoring tools to support and accelerate development of ITSs, which provide automatic data logging in a format compatible with DataShop, an independent site that supports the analysis of ed tech log data to study student learning. Among the many technology platforms that exist to support learning sciences research, CTAT+Tutorshop may be the only one that offers researchers the possibility to author elements of ITSs, or whole ITSs, as part of designing studies. This platform has been used to develop and conduct an estimated 147 research studies which have run in a wide variety of laboratory and real-world educational settings, including K-12 and higher education, and have addressed a wide range of research questions. This paper presents five case studies of research conducted on the CTAT+Tutorshop platform, and summarizes what has been accomplished and what is possible for future researchers. We reflect on the distinctive elements of this platform that have made it so effective in facilitating a wide range of ITS research.

Paperid: 1272, https://arxiv.org/pdf/2502.10088.pdf

Abstract:
Robotic ultrasound systems can enhance medical diagnostics, but patient acceptance is a challenge. We propose a system combining an AI-powered conversational virtual agent with three mixed reality visualizations to improve trust and comfort. The virtual agent, powered by a large language model, engages in natural conversations and guides the ultrasound robot, enhancing interaction reliability. The visualizations include augmented reality, augmented virtuality, and fully immersive virtual reality, each designed to create patient-friendly experiences. A user study demonstrated significant improvements in trust and acceptance, offering valuable insights for designing mixed reality and virtual agents in autonomous medical procedures.

Paperid: 1273, https://arxiv.org/pdf/2502.09577.pdf

Abstract:
Prewriting is the process of generating and organising ideas before a first draft. It consists of a combination of informal, iterative, and semi-structured strategies such as visual diagramming, which poses a challenge for collaborating with large language models (LLMs) in a turn-taking conversational manner. We present Polymind, a visual diagramming tool that leverages multiple LLM-powered agents to support prewriting. The system features a parallel collaboration workflow in place of the turn-taking conversational interactions. It defines multiple "microtasks" to simulate group collaboration scenarios such as collaborative writing and group brainstorming. Instead of repetitively prompting a chatbot for various purposes, Polymind enables users to orchestrate multiple microtasks simultaneously. Users can configure and delegate customised microtasks, and manage their microtasks by specifying task requirements and toggling visibility and initiative. Our evaluation revealed that, compared to ChatGPT, users had more customizability over collaboration with Polymind, and were thus able to quickly expand personalised writing ideas during prewriting.

Paperid: 1274, https://arxiv.org/pdf/2502.07292.pdf

Abstract:
Generative AI (GenAI) is transforming the creativity process. However, as presented in this paper, GenAI encounters "narrow creativity" barriers. We observe that both humans and GenAI focus on limited subsets of the design space. We investigate this phenomenon using the "Circles Exercise," a creativity test widely used to examine the creativity of humans. Quantitative analysis reveals that humans tend to generate familiar, high-frequency ideas, while GenAI produces a larger volume of incremental innovations at a low cost. However, similar to humans, it struggles to significantly expand creative boundaries. Moreover, advanced prompting strategies, such as Chain-of-Thought (CoT) prompting, mitigate narrow creativity issues but still fall short of substantially broadening the creative scope of humans and GenAI. These findings underscore both the challenges and opportunities for advancing GenAI-powered human creativity support tools.

Paperid: 1275, https://arxiv.org/pdf/2502.04931.pdf

Abstract:
Effectively mitigating online misinformation requires understanding its mechanisms and learning practical skills for identification and counteraction. Serious games may serve as tools for combating misinformation, teaching players to recognize common misinformation tactics, and improving their skills of discernment. However, current interventions are designed as single-player, choice-based games, which present players with limited predefined choices. Such restrictions reduce replayability and may lead to an overly simplistic understanding of misinformation and how to debunk it. This study seeks to empower people to understand opinion-influencing and misinformation-debunking processes. We created a Player vs. Player (PvP) game in which participants attempt to generate or debunk misinformation to convince the public opinion represented by an LLM. Using a within-subjects mixed-methods study design (N=47), we found that this game significantly raised participants' media literacy and improved their ability to identify misinformation. Qualitative analyses revealed how participants' use of debunking and content creation strategies deepened their understanding of misinformation. This work shows the potential for illuminating contrasting viewpoints of social issues by LLM-based mechanics in PvP games.

Paperid: 1276, https://arxiv.org/pdf/2502.04710.pdf

Abstract:
Software producers are now recognizing the importance of improving their products' suitability for diverse populations, but little attention has been given to measurements to shed light on products' suitability to individuals below the median socioeconomic status (SES) -- who, by definition, make up half the population. To enable software practitioners to attend to both lower- and higher-SES individuals, this paper provides two new surveys that together facilitate measuring how well a software product serves socioeconomically diverse populations. The first survey (SES-Subjective) is who-oriented: it measures who their potential or current users are in terms of their subjective SES (perceptions of their SES). The second survey (SES-Facets) is why-oriented: it collects individuals' values for an evidence-based set of facet values (individual traits) that (1) statistically differ by SES and (2) affect how an individual works and problem-solves with software products. Our empirical validations with deployments at University A and University B (464 and 522 responses, respectively) showed that both surveys are reliable. Further, our results statistically agree with both ground truth data on respondents' socioeconomic statuses and with predictions from foundational literature. Finally, we explain how the pair of surveys is uniquely actionable by software practitioners, such as in requirements gathering, debugging, quality assurance activities, maintenance activities, and fulfilling legal reporting requirements such as those being drafted by various governments for AI-powered software.

Paperid: 1277, https://arxiv.org/pdf/2502.04361.pdf

Abstract:
Critical VR applications in domains such as healthcare, education, and finance that use traditional credentials, such as PIN, password, or multi-factor authentication, stand the chance of being compromised if a malicious person acquires the user credentials or if the user hands over their credentials to an ally. Recently, a number of approaches on user authentication have emerged that use motions of VR head-mounted displays (HMDs) and hand controllers during user interactions in VR to represent the user's behavior as a VR biometric signature. One of the fundamental limitations of behavior-based approaches is that current on-device tracking for HMDs and controllers lacks capability to perform tracking of full-body joint articulation, losing key signature data encapsulated by the user articulation. In this paper, we propose an approach that uses 2D body joints, namely shoulder, elbow, wrist, hip, knee, and ankle, acquired from the right side of the participants using an external 2D camera. Using a Transformer-based deep neural network, our method uses the 2D data of body joints that are not tracked by the VR device to predict past and future 3D tracks of the right controller, providing the benefit of augmenting 3D knowledge in authentication. Our approach provides a minimum equal error rate (EER) of 0.025, and a maximum EER drop of 0.040 over prior work that uses single-unit 3D trajectory as the input.
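
The equal error rate (EER) reported above is the operating point at which the false accept rate equals the false reject rate. As a minimal sketch of how such a figure is typically computed, assuming similarity scores where higher means "more likely the genuine user", the hypothetical helper below sweeps every observed score as a threshold. The function name and the sample scores are illustrative assumptions, not taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from two lists of similarity scores.

    genuine:  scores for attempts by the enrolled user
    impostor: scores for attempts by other users
    Returns (eer, threshold) at the threshold where the false accept
    rate (FAR) and false reject rate (FRR) are closest.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best = None
    for t in thresholds:
        # FAR: fraction of impostor attempts accepted at threshold t
        far = sum(s >= t for s in impostor) / len(impostor)
        # FRR: fraction of genuine attempts rejected at threshold t
        frr = sum(s < t for s in genuine) / len(genuine)
        candidate = (abs(far - frr), (far + frr) / 2, t)
        if best is None or candidate[0] < best[0]:
            best = candidate
    _, eer, threshold = best
    return eer, threshold


if __name__ == "__main__":
    # Made-up scores purely for illustration.
    eer, threshold = equal_error_rate(
        genuine=[0.9, 0.8, 0.7, 0.6],
        impostor=[0.1, 0.2, 0.3, 0.65],
    )
    print(f"EER={eer:.3f} at threshold {threshold}")
```

A lower EER means genuine and impostor score distributions overlap less. With finite samples the two error rates rarely cross exactly, which is why the sketch averages them at the closest threshold.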

Paperid: 1278, https://arxiv.org/pdf/2502.03575.pdf

Abstract:
To design data visualizations that are easy to comprehend, we need to understand how people with different interests read them. Computational models of predicting scanpaths on charts could complement empirical studies by offering estimates of user performance inexpensively; however, previous models have been limited to gaze patterns and overlooked the effects of tasks. Here, we contribute Chartist, a computational model that simulates how users move their eyes to extract information from the chart in order to perform analysis tasks, including value retrieval, filtering, and finding extremes. The novel contribution lies in a two-level hierarchical control architecture. At the high level, the model uses LLMs to comprehend the information gained so far and applies this representation to select a goal for the lower-level controllers, which, in turn, move the eyes in accordance with a sampling policy learned via reinforcement learning. The model is capable of predicting human-like task-driven scanpaths across various tasks. It can be applied in fields such as explainable AI, visualization design evaluation, and optimization. While it displays limitations in terms of generalizability and accuracy, it takes modeling in a promising direction, toward understanding human behaviors in interacting with charts.

Paperid: 1279, https://arxiv.org/pdf/2502.03560.pdf

Abstract:
Empirical evidence shows that typing on touchscreen devices is prone to errors and that correcting them poses a major detriment to users' performance. Design of text entry systems that better serve users, across their broad capability range, necessitates understanding the cognitive mechanisms that underpin these errors. However, prior models of typing cover only motor slips. The paper reports on extending the scope of computational modeling of typing to cover the cognitive mechanisms behind the three main types of error: slips (inaccurate execution), lapses (forgetting), and mistakes (incorrect knowledge). Given a phrase, a keyboard, and user parameters, Typoist simulates eye and finger movements while making human-like insertion, omission, substitution, and transposition errors. Its main technical contribution is the formulation of a supervisory control problem wherein the controller allocates cognitive resources to detect and fix errors generated by the various mechanisms. The model generates predictions of typing performance that can inform design, for better text entry systems.

Paperid: 1280, https://arxiv.org/pdf/2502.03330.pdf

Abstract:
During the early stages of interface design, designers need to produce multiple sketches to explore a design space. Design tools often fail to support this critical stage, because they insist on specifying more details than necessary. Although recent advances in generative AI have raised hopes of solving this issue, in practice they fail because expressing loose ideas in a prompt is impractical. In this paper, we propose a diffusion-based approach to the low-effort generation of interface sketches. It breaks new ground by allowing flexible control of the generation process via three types of inputs: A) prompts, B) wireframes, and C) visual flows. The designer can provide any combination of these as input at any level of detail, and will get a diverse gallery of low-fidelity solutions in response. The unique benefit is that large design spaces can be explored rapidly with very little effort in input-specification. We present qualitative results for various combinations of input specifications. Additionally, we demonstrate that our model aligns more accurately with these specifications than other models.

Paperid: 1281, https://arxiv.org/pdf/2502.02780.pdf

Abstract:
Student simulation supports educators to improve teaching by interacting with virtual students. However, most existing approaches ignore the modulation effects of course materials because of two challenges: the lack of datasets with granularly annotated course materials, and the limitation of existing simulation models in processing extremely long textual data. To solve the challenges, we first run a 6-week education workshop with N = 60 students to collect fine-grained data using a custom-built online education system, which logs students' learning behaviors as they interact with lecture materials over time. Second, we propose a transferable iterative reflection (TIR) module that augments both prompting-based and finetuning-based large language models (LLMs) for simulating learning behaviors. Our comprehensive experiments show that TIR enables the LLMs to perform more accurate student simulation than classical deep learning models, even with limited demonstration data. Our TIR approach better captures the granular dynamism of learning performance and inter-student correlations in classrooms, paving the way towards a "digital twin" for online education.

Paperid: 1282, https://arxiv.org/pdf/2502.02749.pdf

Abstract:
Female Health Applications (FHA), a growing segment of FemTech, aim to provide affordable and accessible healthcare solutions for women globally. These applications gather and monitor health and reproductive data from millions of users. With ongoing debates on women's reproductive rights and privacy, it's crucial to assess how these apps protect users' privacy. In this paper, we undertake a security and data protection assessment of 45 popular FHAs. Our investigation uncovers harmful permissions, extensive collection of sensitive personal and medical data, and the presence of numerous third-party tracking libraries. Furthermore, our examination of their privacy policies reveals deviations from fundamental data privacy principles. These findings highlight a significant lack of privacy and security measures for FemTech apps, especially as women's reproductive rights face growing political challenges. The results and recommendations provide valuable insights for users, app developers, and policymakers, paving the way for better privacy and security in Female Health Applications.

Paperid: 1283, https://arxiv.org/pdf/2502.01564.pdf

Abstract:
Video meeting platforms display conversations linearly through transcripts or summaries. However, ideas during a meeting do not emerge linearly. We leverage LLMs to create dialogue maps in real time to help people visually structure and connect ideas. Balancing the need to reduce the cognitive load on users during the conversation while giving them sufficient control when using AI, we explore two system variants that encompass different levels of AI assistance. In Human-Map, AI generates summaries of conversations as nodes, and users create dialogue maps with the nodes. In AI-Map, AI produces dialogue maps where users can make edits. We ran a within-subject experiment with ten pairs of users, comparing the two MeetMap variants and a baseline. Users preferred MeetMap over traditional methods for taking notes, which aligned better with their mental models of conversations. Users liked the ease of use for AI-Map due to the low effort demands and appreciated the hands-on opportunity in Human-Map for sense-making.

Paperid: 1284, https://arxiv.org/pdf/2502.00067.pdf

Abstract:
AI-powered health chatbot applications are increasingly utilized for personalized healthcare services, yet they pose significant challenges related to user data security and privacy. This study evaluates the effectiveness of automated methods, specifically BART and Gemini GenAI, in identifying security- and privacy-related (SPR) concerns within these applications' user reviews, benchmarking their performance against manual qualitative analysis. Our results indicate that while Gemini's performance in SPR classification is comparable to manual labeling, both automated methods have limitations, including the misclassification of unrelated issues. Qualitative analysis revealed critical user concerns, such as data collection practices, data misuse, and insufficient transparency and consent mechanisms. This research enhances the understanding of the relationship between user trust, privacy, and emerging mobile AI health chatbot technologies, offering actionable insights for improving security and privacy practices in AI-driven health chatbots. Although exploratory, our findings highlight the necessity for rigorous audits and transparent communication strategies, providing valuable guidance for app developers and vendors in addressing user security and privacy concerns.

Paperid: 1285, https://arxiv.org/pdf/2501.18951.pdf

Abstract:
Creating custom artifacts with computer numerical control (CNC) milling machines typically requires mastery of complex computer-aided design (CAD) software. To eliminate this user barrier, we introduced Draw2Cut, a novel system that allows users to design and fabricate artifacts by sketching directly on physical materials. Draw2Cut employs a custom-drawing language to convert user-drawn lines, symbols, and colors into toolpaths, thereby enabling users to express their creative intent intuitively. The key features include real-time alignment between material and virtual toolpaths, a preview interface for validation, and an open-source platform for customization. Through technical evaluations and user studies, we demonstrate that Draw2Cut lowers the entry barrier for personal fabrication, enabling novices to create customized artifacts with precision and ease. Our findings highlight the potential of the system to enhance creativity, engagement, and accessibility in CNC-based woodworking.

Paperid: 1286, https://arxiv.org/pdf/2501.16150.pdf

Abstract:
Agents for computer use (ACUs) are an emerging class of systems capable of executing complex tasks on digital devices - such as desktops, mobile phones, and web platforms - given instructions in natural language. These agents can automate tasks by controlling software via low-level actions like mouse clicks and touchscreen gestures. However, despite rapid progress, ACUs are not yet mature for everyday use. In this survey, we investigate the state-of-the-art, trends, and research gaps in the development of practical ACUs. We provide a comprehensive review of the ACU landscape, introducing a unifying taxonomy spanning three dimensions: (I) the domain perspective, characterizing agent operating contexts; (II) the interaction perspective, describing observation modalities (e.g., screenshots, HTML) and action modalities (e.g., mouse, keyboard, code execution); and (III) the agent perspective, detailing how agents perceive, reason, and learn. We review 87 ACUs and 33 datasets across foundation model-based and classical approaches through this taxonomy. Our analysis identifies six major research gaps: insufficient generalization, inefficient learning, limited planning, low task complexity in benchmarks, non-standardized evaluation, and a disconnect between research and practical conditions. To address these gaps, we advocate for: (a) vision-based observations and low-level control to enhance generalization; (b) adaptive learning beyond static prompting; (c) effective planning and reasoning methods and models; (d) benchmarks that reflect real-world task complexity; (e) standardized evaluation based on task success; (f) aligning agent design with real-world deployment constraints. Together, our taxonomy and analysis establish a foundation for advancing ACU research toward general-purpose agents for robust and scalable computer use.

Paperid: 1287, https://arxiv.org/pdf/2501.15583.pdf

Abstract:
Human cognitive performance is an underlying factor in most of our daily lives, and numerous factors influence cognitive performance. In this work, we investigate how changes in sleep quality influence cognitive performance, measured from a dataset collected during a 2-month field study. We collected cognitive performance data (alertness) with the Psychomotor Vigilance Task (PVT), mobile keyboard typing metrics from participants' smartphones, and sleep quality metrics through a wearable sleep tracking ring. Our findings highlight that specific sleep metrics like night-time heart rate, sleep latency, sleep timing, sleep restfulness, and overall sleep quantity significantly influence cognitive performance. To strengthen the current research on cognitive measurements, we introduce smartphone typing metrics as a proxy or a complementary method for continuous passive measurement of cognitive performance. Together, our findings contribute to ubiquitous computing via a longitudinal case study with a novel wearable device, the resulting findings on the association between sleep and cognitive function, and the introduction of smartphone keyboard typing as a proxy of cognitive function.

Paperid: 1288, https://arxiv.org/pdf/2501.13309.pdf

Abstract:
We propose a dense insight network framework to encode the relationships between automatically generated insights from a complex dashboard based on their shared characteristics. Our insight network framework includes five high-level categories of relationships (e.g., type, topic, value, metadata, and compound scores). The goal of this insight network framework is to provide a foundation for implementing new insight interpretation and exploration strategies, including both user-driven and automated approaches. To illustrate the complexity and flexibility of our framework, we first describe a visualization playground to directly visualize key network characteristics; this playground also demonstrates potential interactive capabilities for decomposing the dense insight network. Then, we discuss a case study application for ranking insights based on the underlying network characteristics captured by our framework, before prompting a large language model to generate a concise, natural language summary. Finally, we reflect on next steps for leveraging our insight network framework to design and evaluate new systems.

Paperid: 1289, https://arxiv.org/pdf/2501.10421.pdf

Abstract:
Grading programming assignments is crucial for guiding students to improve their programming skills and coding styles. This study presents an automated grading framework, CodEv, which leverages Large Language Models (LLMs) to provide consistent and constructive feedback. We incorporate Chain of Thought (CoT) prompting techniques to enhance the reasoning capabilities of LLMs and ensure that the grading is aligned with human evaluation. Our framework also integrates LLM ensembles to improve the accuracy and consistency of scores, along with agreement tests to deliver reliable feedback and code review comments. The results demonstrate that the framework can yield grading results comparable to human evaluators, by using smaller LLMs. Evaluation and consistency tests of the LLMs further validate our approach, confirming the reliability of the generated scores and feedback.
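
The abstract does not specify how the ensemble combines scores or how agreement is tested. As one possible illustration, assuming each model in the ensemble returns a numeric score on the same scale, the hypothetical helper below takes the median as the final score and flags submissions where the graders' scores spread too widely. The function name, the `max_spread` parameter, and its default are assumptions for this sketch and not part of CodEv.

```python
from statistics import median


def aggregate_scores(scores, max_spread=10.0):
    """Combine scores from several graders for one submission.

    scores:     numeric scores, one per grader (e.g. one per LLM)
    max_spread: largest max-min gap still treated as agreement
                (hypothetical default, chosen only for illustration)
    """
    if not scores:
        raise ValueError("need at least one score")
    spread = max(scores) - min(scores)
    return {
        "score": median(scores),         # robust to a single outlier grader
        "spread": spread,                # how far apart the graders are
        "agree": spread <= max_spread,   # False suggests a manual review
    }


if __name__ == "__main__":
    print(aggregate_scores([80, 85, 90]))
    print(aggregate_scores([50, 85, 90]))
```

The median is used here because one grader that is far off does not move it, whereas a mean would be pulled toward the outlier.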

Paperid: 1290, https://arxiv.org/pdf/2501.07196.pdf

Abstract:
In this paper, we present a human-based computation approach for the analysis of peripheral blood smear (PBS) images in patients with Sickle Cell Disease (SCD). We used the Mechanical Turk microtask market to crowdsource the labeling of PBS images. We then used the expert-tagged erythrocytesIDB dataset to assess the accuracy and reliability of our proposal. Our results showed that when a robust consensus is achieved among the Mechanical Turk workers, probability of error is very low, based on comparison with expert analysis. This suggests that our proposed approach can be used to annotate datasets of PBS images, which can then be used to train automated methods for the diagnosis of SCD. In future work, we plan to explore the potential integration of our findings with outcomes obtained through automated methodologies. This could lead to the development of more accurate and reliable methods for the diagnosis of SCD.
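
The idea of accepting a crowd label only when workers reach a robust consensus can be sketched as a thresholded majority vote. The abstract does not say how consensus was defined, so the helper below, including its name, the `min_agreement` parameter, and the 0.8 default, is an assumption for illustration only.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.8):
    """Return (label, agreement) for one image.

    votes:         labels submitted by individual workers
    min_agreement: fraction of workers that must pick the top label
                   (hypothetical threshold, not taken from the paper)
    The label is None when agreement falls below the threshold.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement


if __name__ == "__main__":
    # Made-up votes purely for illustration.
    print(consensus_label(["sickle", "sickle", "sickle", "sickle", "normal"]))
    print(consensus_label(["sickle", "normal", "sickle", "normal", "other"]))
```

Images that come back with `None` could be routed to an expert rather than given a low-confidence crowd label.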

Paperid: 1291, https://arxiv.org/pdf/2501.06698.pdf

Abstract:
In virtual reality applications, users often navigate through virtual environments, but the issue of physiological responses, such as cybersickness, fatigue, and cognitive workload, can disrupt or even halt these activities. Despite its impact, the underlying mechanisms of how the sensory system encodes information in VR remain unclear. In this study, we compare three sensory encoding models, Bayesian Efficient Coding, Fitness Maximizing Coding, and the Linear Nonlinear Poisson model, regarding their ability to simulate human navigation behavior in VR. By incorporating the factor of physiological responses into the models, we find that the Bayesian Efficient Coding model generally outperforms the others. Furthermore, the Fitness Maximizing Code framework provides more accurate estimates when the error penalty is small. Our results suggest that the Bayesian Efficient Coding framework offers superior predictions in most scenarios, providing a better understanding of human navigation behavior in VR environments.

Paperid: 1292, https://arxiv.org/pdf/2501.06416.pdf

Abstract:
Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall, we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.

Paperid: 1293, https://arxiv.org/pdf/2501.06348.pdf

Abstract:
Understanding the motivations underlying the human inclination to automate tasks is vital to developing truly helpful robots integrated into daily life. Accordingly, we ask: are individuals more inclined to automate chores based on the time they consume or the feelings experienced while performing them? This study explores these preferences and whether they vary across different social groups (i.e., gender category and income level). Leveraging data from the BEHAVIOR-1K dataset, the American Time-Use Survey, and the American Time-Use Survey Well-Being Module, we investigate the relationship between the desire for automation, time spent on daily activities, and their associated feelings - Happiness, Meaningfulness, Sadness, Painfulness, Stressfulness, or Tiredness. Our key findings show that, despite common assumptions, time spent does not strongly relate to the desire for automation for the general population. For the feelings analyzed, only happiness and pain are key indicators. Significant differences by gender and economic level also emerged: Women prefer to automate stressful activities, whereas men prefer to automate those that make them unhappy; mid-income individuals prioritize automating less enjoyable and meaningful activities, while low and high-income show no significant correlations. We hope our research helps motivate technologies to develop robots that match the priorities of potential users, moving domestic robotics toward more socially relevant solutions. We open-source all the data, including an online tool that enables the community to replicate our analysis and explore additional trends at https://robin-lab.cs.utexas.edu/why-automate-this/.

Paperid: 1294, https://arxiv.org/pdf/2501.05589.pdf

Abstract:
Brain-computer interfaces are groundbreaking technology whereby brain signals are used to control external devices. Despite some advances in recent years, electroencephalogram (EEG)-based motor-imagery tasks face challenges, such as amplitude and phase variability and complex spatial correlations, with a need for smaller models and faster inference. In this study, we develop a prototype, called the Lightweight Geometric Learning Brain-Computer Interface (LGL-BCI), which uses our customized geometric deep learning architecture for swift model inference without sacrificing accuracy. LGL-BCI contains an EEG channel selection module via a feature decomposition algorithm to reduce the dimensionality of a symmetric positive definite matrix, providing adaptiveness among the continuously changing EEG signal. Meanwhile, a built-in lossless transformation helps boost the inference speed. The performance of our solution was evaluated using two real-world EEG devices and two public EEG datasets. LGL-BCI demonstrated significant improvements, achieving an accuracy of 82.54% compared to 62.22% for the state-of-the-art approach. Furthermore, LGL-BCI uses fewer parameters (64.9K vs. 183.7K), highlighting its computational efficiency. These findings underscore both the superior accuracy and computational efficiency of LGL-BCI, demonstrating the feasibility and robustness of geometric deep learning in motor-imagery brain-computer interface applications.

Paperid: 1295, https://arxiv.org/pdf/2501.04929.pdf

Abstract:
In this paper, we aim to understand how user motivation shapes human-robot interaction (HRI) in the wild. To explore this, we conducted a field study by deploying a fully autonomous conversational robot in a shopping mall over two days. Through sequential video analysis, we identified five patterns of interaction fluency (Smooth, Awkward, Active, Messy, and Quiet), four types of user motivation for interacting with the robot (Function, Experiment, Curiosity, and Education), and user positioning towards the robot. We further analyzed how these motivations and positioning influence interaction fluency. Our findings suggest that incorporating users' motivation types into the design of robot behavior can enhance interaction fluency, engagement, and user satisfaction in real-world HRI scenarios.

Paperid: 1296, https://arxiv.org/pdf/2501.01220.pdf

Abstract:
AI conversational agents have demonstrated efficacy in social contact interventions for stigma reduction at a low cost. However, the underlying mechanisms of how interaction designs contribute to these effects remain unclear. This study investigates how participating in three human-chatbot interactions affects attitudes toward mental illness. We developed three chatbots capable of engaging in either one-way information dissemination from chatbot to a human or two-way cooperation where the chatbot and a human exchange thoughts and work together on a cooperation task. We then conducted a two-week mixed-methods study to investigate variations over time and across different group memberships. The results indicate that human-AI cooperation can effectively reduce stigma toward individuals with mental illness by fostering relationships between humans and AI through social contact. Additionally, compared to a one-way chatbot, interacting with a cooperative chatbot led participants to perceive it as more competent and likable, promoting greater empathy during the conversation. However, despite the success in reducing stigma, inconsistencies between the chatbot's role and the mental health context raised concerns. We discuss the implications of our findings for human-chatbot interaction designs aimed at changing human attitudes.

Paperid: 1297, https://arxiv.org/pdf/2501.00597.pdf

Abstract:
Eye movement prediction is a promising area of research with the potential to improve performance and the user experience of systems based on eye-tracking technology. In this study, we analyze individual differences in gaze prediction performance. We use three fundamentally different models within the analysis: the lightweight Long Short-Term Memory network (LSTM), the transformer-based network for multivariate time series representation learning (TST), and the Oculomotor Plant Mathematical Model wrapped in the Kalman Filter framework (OPKF). Each solution was assessed on different eye-movement types. We show important subject-to-subject variation for all models and eye-movement types. We found that fixation noise is associated with poorer gaze prediction in fixation. For saccades, higher velocities are associated with poorer gaze prediction performance. We think these individual differences are important and propose that future research should report statistics related to inter-subject variation. We also propose that future models should be designed to reduce subject-to-subject variation.

Paperid: 1298, https://arxiv.org/pdf/2501.00017.pdf

Abstract:
This study investigates students' perceptions of Generative Artificial Intelligence (GenAI), with a focus on Higher Education institutions in Northern Ireland and India. We collect quantitative Likert ratings and qualitative comments from 1211 students on their awareness and perceptions of AI and investigate variations in attitudes toward AI across institutions and subject areas, as well as interactions between these variables with demographic variables (focusing on gender). We found the following: (a) while perceptions varied across institutions, responses for Computer Sciences students were similar, both in terms of topics and degree of positivity; and (b) after controlling for institution and subject area, we observed no effect of gender. These results are consistent with previous studies, which find that students' perceptions are predicted by prior experience; crucially, however, the results of this study contribute to the literature by identifying important interactions between key factors that can influence experience, revealing a more nuanced picture of students' perceptions and the role of experience. We consider the implications of these relations, and further considerations for the role of experience.

Paperid: 1299, https://arxiv.org/pdf/2506.23774.pdf

Abstract:
Computer-aided teacher training is a state-of-the-art method designed to enhance teachers' professional skills effectively while minimising concerns related to costs, time constraints, and geographical limitations. We investigate the potential of large language models (LLMs) in teacher education, using a case of teaching hate incidents management in schools. To this end, we create a multi-agent LLM-based system that mimics realistic situations of hate, using a combination of retrieval-augmented prompting and persona modelling. It is designed to identify and analyse hate speech patterns, predict potential escalation, and propose effective intervention strategies. By integrating persona modelling with agentic LLMs, we create contextually diverse simulations of hate incidents, mimicking real-life situations. The system allows teachers to analyse and understand the dynamics of hate incidents in a safe and controlled environment, providing valuable insights and practical knowledge to manage such situations confidently in real life. Our pilot evaluation demonstrates teachers' enhanced understanding of the nature of annotator disagreements and the role of context in hate speech interpretation, leading to the development of more informed and effective strategies for addressing hate in classroom settings.

Paperid: 1300, https://arxiv.org/pdf/2506.22604.pdf

Abstract:
Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot's task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.

Paperid: 1301, https://arxiv.org/pdf/2506.21762.pdf

Abstract:
Data visualization tasks often require multi-step reasoning, and the interpretive strategies experts use, such as decomposing complex goals into smaller subtasks and selectively attending to key chart regions, are rarely made explicit. ViStruct is an automated pipeline that simulates these expert behaviours by breaking high-level visual questions into structured analytic steps and highlighting semantically relevant chart areas. Leveraging large language and vision-language models, ViStruct identifies chart components, maps subtasks to spatial regions, and presents visual attention cues to externalize expert-like reasoning flows. While not designed for direct novice instruction, ViStruct provides a replicable model of expert interpretation that can inform the development of future visual literacy tools. We evaluate the system on 45 tasks across 12 chart types and validate its outputs with trained visualization users, confirming its ability to produce interpretable and expert-aligned reasoning sequences.

Paperid: 1302, https://arxiv.org/pdf/2506.19430.pdf

Abstract:
The automated analysis of human behaviour provides many opportunities for the creation of interactive systems and the post-experiment investigations for user studies. Commodity depth cameras offer reasonable body tracking accuracy at a low price point, without the need for users to wear or hold any extra equipment. The resulting systems typically perform body tracking through a dedicated machine learning model, but they can be enhanced with additional AI components providing extra capabilities. This leads to opportunities but also challenges, for example regarding the orchestration of such AI components and the engineering of the resulting tracking pipeline. In this paper, we discuss these elements, based on our experience with the creation of a remote collaboration system across distant wall-sized displays, that we built using existing and readily available building blocks, including AI-based recognition models.

Paperid: 1303, https://arxiv.org/pdf/2506.17196.pdf

Abstract:
The increasing availability of large language models (LLMs) has raised concerns about their potential misuse in online learning. While tools for detecting LLM-generated text exist and are widely used by researchers and educators, their reliability varies. Few studies have compared the accuracy of detection methods, defined criteria to identify content generated by LLM, or evaluated the effect on learner performance from LLM misuse within learning. In this study, we define LLM-generated text within open responses as those produced by any LLM without paraphrasing or refinement, as evaluated by human coders. We then fine-tune GPT-4o to detect LLM-generated responses and assess the impact on learning from LLM misuse. We find that our fine-tuned LLM outperforms the existing AI detection tool GPTZero, achieving an accuracy of 80% and an F1 score of 0.78, compared to GPTZero's accuracy of 70% and macro F1 score of 0.50, demonstrating superior performance in detecting LLM-generated responses. We also find that learners suspected of LLM misuse in the open response question were more than twice as likely to correctly answer the corresponding posttest MCQ, suggesting potential misuse across both question types and indicating a bypass of the learning process. We pave the way for future work by demonstrating a structured, code-based approach to improve LLM-generated response detection and propose using auxiliary statistical indicators such as unusually high assessment scores on related tasks, readability scores, and response duration. In support of open science, we contribute data and code to support the fine-tuning of similar models for similar use cases.
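
Because the abstract above reports both accuracy and macro F1, a minimal sketch of how the two metrics are computed may help when reading the comparison. The labels and predictions below are made-up toy data, not results from the paper, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy data: 1 = flagged as LLM-generated, 0 = human-written.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0] * 10  # a detector that never flags anything
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

In this toy case the detector that never flags a response still scores 0.7 accuracy while its macro F1 is much lower, which shows why the two numbers can diverge on imbalanced data.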

Paperid: 1304, https://arxiv.org/pdf/2506.15525.pdf

Abstract:
As generative AI (GenAI) emerges as a transformative force, clear understanding of high school students' perspectives is essential for GenAI's meaningful integration in high school environments. In this work, we draw insights from a participatory design workshop where we engaged 17 high school students -- a group rarely involved in prior research in this area -- through the design of novel GenAI tools and school policies addressing their key concerns. Students identified challenges and developed solutions outlining their ideal features in GenAI tools, appropriate school use, and regulations. These centered around the problem spaces of combating bias & misinformation, tackling crime & plagiarism, preventing over-reliance on AI, and handling false accusations of academic dishonesty. Building on our participants' underrepresented perspectives, we propose new guidelines targeted at educational technology designers for development of GenAI technologies in high schools. We also argue for further incorporation of student voices in development of AI policies in their schools.

Paperid: 1305, https://arxiv.org/pdf/2506.14268.pdf

Abstract:
Cybernetic avatars are hybrid interaction robots or digital representations that combine autonomous capabilities with teleoperated control. This study investigates the acceptance of cybernetic avatars in the highly multicultural society of Dubai, with particular emphasis on robotic avatars for customer service. Specifically, we explore how acceptance varies as a function of robot appearance (e.g., android, robotic-looking, cartoonish), deployment settings (e.g., shopping malls, hotels, hospitals), and functional tasks (e.g., providing information, patrolling). To this end, we conducted a large-scale survey with over 1,000 participants. Overall, cybernetic avatars received a high level of acceptance, with physical robot avatars receiving higher acceptance than digital avatars. In terms of appearance, robot avatars with a highly anthropomorphic robotic appearance were the most accepted, followed by cartoonish designs and androids. Animal-like appearances received the lowest level of acceptance. Among the tasks, providing information and guidance was rated as the most valued. Shopping malls, airports, public transport stations, and museums were the settings with the highest acceptance, whereas healthcare-related spaces received lower levels of support. An analysis by community cluster revealed, among other findings, that Emirati respondents showed significantly greater acceptance of android appearances compared to the overall sample, while participants from the 'Other Asia' cluster were significantly more accepting of cartoonish appearances. Our study underscores the importance of incorporating citizen feedback into the design and deployment of cybernetic avatars from the early stages to enhance acceptance of this technology in society.

Paperid: 1306, https://arxiv.org/pdf/2506.14018.pdf

Abstract:
While recent research has focused on developing safeguards for generative AI (GAI) model-level content safety, little is known about how content moderation to prevent malicious content performs for end-users in real-world GAI products. To bridge this gap, we investigated content moderation policies and their enforcement in GAI online tools -- consumer-facing web-based GAI applications. We first analyzed content moderation policies of 14 GAI online tools. While these policies are comprehensive in outlining moderation practices, they usually lack details on practical implementations and are not specific about how users can aid in moderation or appeal moderation decisions. Next, we examined user-experienced content moderation successes and failures through Reddit discussions on GAI online tools. We found that although moderation systems succeeded in blocking malicious generations pervasively, users frequently experienced frustration in failures of both moderation systems and user support after moderation. Based on these findings, we suggest improvements for content moderation policy and user experiences in real-world GAI products.

Paperid: 1307, https://arxiv.org/pdf/2506.13933.pdf

Abstract:
Teleoperation is a key enabler for future mobility, supporting Automated Vehicles in rare and complex scenarios beyond the capabilities of their automation. Despite ongoing research, no open source software currently combines Remote Driving, e.g., via steering wheel and pedals, Remote Assistance through high-level interaction with automated driving software modules, and integration with a real-world vehicle for practical testing. To address this gap, we present a modular, open source teleoperation software stack that can interact with automated driving software, e.g., Autoware, enabling Remote Assistance and Remote Driving. The software features standardized interfaces for seamless integration with various real-world and simulation platforms, while allowing for flexible design of the human-machine interface. The system is designed for modularity and ease of extension, serving as a foundation for collaborative development on individual software components as well as realistic testing and user studies. To demonstrate the applicability of our software, we evaluated the latency and performance of different vehicle platforms in simulation and in the real world. The source code is available on GitHub.

Paperid: 1308, https://arxiv.org/pdf/2506.13466.pdf

Abstract:
Using smartphone apps during crises is well-established, proving critical for efficient crisis response. However, such apps become futile without an Internet connection, which is a common issue during crises. The ongoing 6G standardization explores the capability to provide local cellular connectivity for areas cut off from the Internet in crises. This paper introduces to the HCI community the concept of cellular island connectivity in isolated areas, promising a seamless transition from normal operation to island operation with local-only cellular connectivity. It presents findings from a survey (N = 857) among adult smartphone users from major German cities regarding their smartphone usage preferences in this model. Results show a shift in app demand, with users favoring general-purpose apps over dedicated crisis apps in specific scenarios. We prioritize smartphone services based on their criticality, distinguishing between apps essential for crisis response and those supporting routines. Our findings provide operators, developers, and authorities insights into making user-centric design decisions for implementing island-ready 6G communication.

Paperid: 1309, https://arxiv.org/pdf/2506.13270.pdf

Abstract:
The rise of generative AI agents has reshaped human-computer interaction and computer-supported cooperative work by shifting users' roles from direct task execution to supervising machine-driven actions, especially in programming (e.g., "vibe coding"). However, there is limited understanding of how screen reader users engage with these systems in practice. To address this gap, we conducted a longitudinal study with 16 screen reader users, exploring their experiences with AI code assistants in daily programming scenarios. Participants first completed a tutorial with GitHub Copilot, then performed a programming task and provided initial feedback. After two weeks of AI-assisted programming, follow-up studies assessed changes in their practices and perceptions. Our findings demonstrate that advanced code assistants not only enhance their programming capabilities but also bridge accessibility gaps. While the assistant proved beneficial, there remains potential to improve how users convey intent and interpret outputs. They also experienced difficulties managing multiple views and maintaining situational awareness. More broadly, they encountered barriers in learning advanced tools and expressed a need to retain control. Based on these insights, we provide design recommendations for more accessible and inclusive AI-assisted tools.

Paperid: 1310, https://arxiv.org/pdf/2506.12339.pdf

Abstract:
We present SheetMind, a modular multi-agent framework powered by large language models (LLMs) for spreadsheet automation via natural language instructions. The system comprises three specialized agents: a Manager Agent that decomposes complex user instructions into subtasks; an Action Agent that translates these into structured commands using a Backus-Naur Form (BNF) grammar; and a Reflection Agent that validates alignment between generated actions and the user's original intent. Integrated into Google Sheets via a Workspace extension, SheetMind supports real-time interaction without requiring scripting or formula knowledge. Experiments on benchmark datasets demonstrate an 80 percent success rate on single-step tasks and approximately 70 percent on multi-step instructions, outperforming ablated and baseline variants. Our results highlight the effectiveness of multi-agent decomposition and grammar-based execution for bridging natural language and spreadsheet functionalities.

Paperid: 1311, https://arxiv.org/pdf/2506.12270.pdf

Abstract:
Cloud infrastructure is the cornerstone of the modern IT industry. However, managing this infrastructure effectively requires considerable manual effort from the DevOps engineering team. We make a case for developing AI agents powered by large language models (LLMs) to automate cloud infrastructure management tasks. In a preliminary study, we investigate the potential for AI agents to use different cloud/user interfaces such as software development kits (SDK), command line interfaces (CLI), Infrastructure-as-Code (IaC) platforms, and web portals. We report takeaways on their effectiveness on different management tasks, and identify research challenges and potential solutions.

Paperid: 1312, https://arxiv.org/pdf/2506.12266.pdf

Abstract:
Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving it remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior, with low F1 scores for dialog acts (0.464), excessive and often misaligned tool usage with an F1 score of 0.139, and ineffective usage of external knowledge. Reducing such behavior gaps leads to significant performance improvement (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies to enhance the effectiveness of LLM-based TODS in handling complex tasks.

Paperid: 1313, https://arxiv.org/pdf/2506.11112.pdf

Abstract:
During the workshop, we deeply discussed what CONversational Information ACcess (CONIAC) is and its unique features, proposing a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.

Paperid: 1314, https://arxiv.org/pdf/2506.10891.pdf

Abstract:
Craft practices rely on evolving archives of skill and knowledge, developed through generations of craftspeople experimenting with designs, materials, and techniques. Better documentation of these practices enables the sharing of knowledge and expertise between sites and generations. However, most documentation focuses solely on the linear steps leading to final artifacts, neglecting the tacit knowledge necessary to improvise, or adapt workflows to meet the unique demands of each craft project. This omission limits knowledge sharing and reduces craft to a mechanical endeavor, rather than a sophisticated way of seeing, thinking, and doing. Drawing on expert interviews and literature from HCI, CSCW and the social sciences, we develop an elementary grammar to document improvisational actions of real-world craft practices. We demonstrate the utility of this grammar with an interface called CraftLink that can be used to analyze expert videos and semi-automatically generate documentation to convey material and contextual variations of craft practices. Our user study with expert crocheters (N=7) using this interface evaluates our grammar's effectiveness in capturing and sharing expert knowledge with other craftspeople, offering new pathways for computational systems to support collaborative archives of knowledge and practice within communities.

Paperid: 1315, https://arxiv.org/pdf/2506.10150.pdf

Abstract:
Large language models (LLMs) excel at generating empathic responses in text-based conversations. But, how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks' sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.
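
The abstract above does not say which inter-rater reliability statistic is used. As one common option, the sketch below computes Cohen's kappa for two annotators; the ratings are invented toy values and the function name is illustrative.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label rates.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy example: an expert and an LLM rate four replies as empathic (1) or not (0).
expert = [1, 1, 0, 0]
llm = [1, 0, 0, 0]
print(cohens_kappa(expert, llm))
```

Comparing an expert-versus-LLM kappa with an expert-versus-expert kappa on the same items is one way to read model agreement against the expert benchmark the abstract describes.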

Paperid: 1316, https://arxiv.org/pdf/2506.09362.pdf

Abstract:
Peer support plays a vital role in expanding access to mental health care by providing empathetic, community-based support outside formal clinical systems. As digital platforms increasingly mediate such support, the design and impact of these technologies remain under-examined, particularly in Asian contexts. This paper presents findings from an interview study with 20 peer supporters in Singapore, who operate across diverse online, offline, and hybrid environments. Through a thematic analysis, we unpack how participants start, conduct, and sustain peer support, highlighting their motivations, emotional labour, and the sociocultural dimensions shaping their practices. Building on this grounded understanding, we surface design directions for culturally responsive digital tools that scaffold rather than supplant relational care. Drawing insights from qualitative accounts, we offer a situated perspective on how AI might responsibly augment peer support. This research contributes to human-centred computing by articulating the lived realities of peer supporters and proposing design implications for trustworthy and context-sensitive AI in mental health.

Paperid: 1317, https://arxiv.org/pdf/2506.09153.pdf

Abstract:
Real-time face orientation recognition is a cutting-edge technology meant to track and analyze facial movements in virtual environments such as online interviews, remote meetings, and virtual classrooms. As the demand for virtual interactions grows, it becomes increasingly important to measure participant engagement, attention, and overall interaction. This research presents a novel solution that leverages the MediaPipe Face Mesh framework to identify facial landmarks and extract geometric data for calculating Euler angles, which determine head orientation in real time. The system tracks 3D facial landmarks and uses this data to compute head movements with a focus on accuracy and responsiveness. By studying Euler angles, the system can identify a user's head orientation with an accuracy of 90%, even at a distance of up to four feet. This capability offers significant enhancements for monitoring user interaction, allowing for more immersive and interactive virtual experiences. The proposed method shows its reliability in evaluating participant attentiveness during online assessments and meetings. Its application goes beyond engagement analysis, potentially providing a means for improving the quality of virtual communication, fostering better understanding between participants, and ensuring a higher level of interaction in digital spaces. This study offers a basis for future developments in enhancing virtual user experiences by integrating real-time facial tracking technologies, paving the way for more adaptive and interactive web-based platforms.
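
The abstract above does not describe how the head rotation is estimated from the landmarks, so the sketch below covers only the final step: converting an already estimated 3x3 rotation matrix into Euler angles. The axis convention (R = Rz(roll) · Ry(yaw) · Rx(pitch)) and both function names are assumptions made for illustration, not the authors' method.

```python
import math

def rotation_from_euler(pitch, yaw, roll):
    """Build R = Rz(roll) @ Ry(yaw) @ Rx(pitch); angles in radians."""
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def euler_from_rotation(R):
    """Recover (pitch, yaw, roll) in radians under the same convention.

    Valid away from gimbal lock, i.e. when yaw is not close to +/-90 degrees.
    """
    yaw = math.atan2(-R[2][0], math.hypot(R[0][0], R[1][0]))
    pitch = math.atan2(R[2][1], R[2][2])
    roll = math.atan2(R[1][0], R[0][0])
    return pitch, yaw, roll

# Round trip: a head pitched 10 degrees, turned 25 degrees, tilted -5 degrees.
angles = tuple(math.radians(a) for a in (10.0, 25.0, -5.0))
recovered = euler_from_rotation(rotation_from_euler(*angles))
print([round(math.degrees(a), 3) for a in recovered])
```

A downstream attentiveness check could then threshold the yaw and pitch values, with thresholds chosen empirically for the camera setup.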

Paperid: 1318, https://arxiv.org/pdf/2506.08892.pdf

Abstract:
The human-robot interaction (HRI) field has recognized the importance of enabling robots to interact with teams. Human teams rely on effective communication for successful collaboration in time-sensitive environments. Robots can play a role in enhancing team coordination through real-time assistance. Despite significant progress in human-robot teaming research, there remains an essential gap in how robots can effectively communicate with action teams using multimodal interaction cues in time-sensitive environments. This study addresses this knowledge gap in an experimental in-lab study to investigate how multimodal robot communication in action teams affects workload and human perception of robots. We explore team collaboration in a medical training scenario where a robotic crash cart (RCC) provides verbal and non-verbal cues to help users remember to perform iterative tasks and search for supplies. Our findings show that verbal cues for object search tasks and visual cues for task reminders reduce team workload and increase perceived ease of use and perceived usefulness more effectively than a robot with no feedback. Our work contributes to multimodal interaction research in the HRI field, highlighting the need for more human-robot teaming research to understand best practices for integrating collaborative robots in time-sensitive environments such as in hospitals, search and rescue, and manufacturing applications.

Paperid: 1319, https://arxiv.org/pdf/2506.08890.pdf

Abstract:
Healthcare workers (HCWs) encounter challenges in hospitals, such as retrieving medical supplies quickly from crash carts, which could potentially result in medical errors and delays in patient care. Robotic crash carts (RCCs) have shown promise in assisting healthcare teams during medical tasks through guided object searches and task reminders. Limited exploration has been done to determine what communication modalities are most effective and least disruptive to patient care in real-world settings. To address this gap, we conducted a between-subjects experiment comparing the RCC's verbal and non-verbal communication of object search with a standard crash cart in resuscitation scenarios to understand the impact of robot communication on workload and attitudes toward using robots in the workplace. Our findings indicate that verbal communication significantly reduced mental demand and effort compared to both visual cues and a traditional crash cart. Although frustration levels were slightly higher during collaborations with the robot compared to a traditional cart, these research insights provide valuable implications for human-robot teamwork in high-stakes environments.

Paperid: 1320, https://arxiv.org/pdf/2506.08549.pdf

Abstract:
Modern technology-driven information systems are part of our daily lives. However, this deep integration poses new challenges to human-computer interaction (HCI) professionals. With the rapid growth of mobile and cloud computing and the Internet of Things (IoT), the demand for HCI specialists to design user-friendly and adaptable interfaces has never been more pressing, especially for diverse user groups such as children, the elderly, and people with disabilities, who need interfaces tailored to their needs regardless of time and location. This study reviewed 50 recent papers on HCI interface design for modern information systems. The goal is to see how well these methods address the demands of current technology. The findings show that most HCI design methods are still based on old desktop models and do not support mobile users and location-based services well. Most existing interface design guidelines do not align with the flexibility and dynamism of emerging technologies. The goal of this study is to improve interface design by combining agile methodologies with human-centered design principles. Future studies should also incorporate both qualitative and quantitative approaches, particularly in the context of cloud-based technologies and organizational information systems. This approach aims to bridge the gap between current interface design practices and the changing technological landscape.

Paperid: 1321, https://arxiv.org/pdf/2506.04236.pdf

Abstract:
In Artificial Life (ALife) research, replicating Open-Ended Evolution (OEE) -- the continuous emergence of novelty observed in biological life -- has usually been pursued within isolated, closed system simulations, such as Tierra and Avida, which have typically plateaued after an initial burst of novelty, failing to achieve sustained OEE. Scholars suggest that OEE requires an open-environment system that continually exchanges information or energy with its environment. A recent technological innovation in Decentralized Physical Infrastructure Network (DePIN), which provides permissionless computational substrates, enables the deployment of Large Language Model-based AI agents on blockchains integrated with Trusted Execution Environments (TEEs). This enables on-chain agents to operate autonomously "in the wild," achieving self-sovereignty without human oversight. These agents can control their own social media accounts and cryptocurrency wallets, allowing them to interact directly with blockchain-based financial networks and broader human social media. Building on this new paradigm of on-chain agents, Spore.fun is a recent real-world AI evolution experiment that enables autonomous breeding and evolution of new on-chain agents. This paper presents a detailed case study of Spore.fun, examining agent behaviors and their evolutionary trajectories through digital ethology. We aim to spark discussion about whether open-environment ALife systems "in the wild," based on permissionless computational substrates and driven by economic incentives to interact with their environment, could finally achieve the long-sought goal of OEE.

Paperid: 1322, https://arxiv.org/pdf/2506.03735.pdf

Abstract:
Visuals are valuable tools for teaching math word problems (MWPs), helping young learners interpret textual descriptions into mathematical expressions before solving them. However, creating such visuals is labor-intensive and there is a lack of automated methods to support this process. In this paper, we present Math2Visual, an automatic framework for generating pedagogically meaningful visuals from MWP text descriptions. Math2Visual leverages a pre-defined visual language and a design space grounded in interviews with math teachers, to illustrate the core mathematical relationships in MWPs. Using Math2Visual, we construct an annotated dataset of 1,903 visuals and evaluate Text-to-Image (TTI) models for their ability to generate visuals that align with our design. We further fine-tune several TTI models with our dataset, demonstrating improvements in educational visual generation. Our work establishes a new benchmark for automated generation of pedagogically meaningful visuals and offers insights into key challenges in producing multimodal educational content, such as the misrepresentation of mathematical relationships and the omission of essential visual elements.

Paperid: 1323, https://arxiv.org/pdf/2506.03385.pdf

Abstract:
Novice learners often have difficulty learning new visualization types because they tend to interpret novel visualizations through the mental models of simpler charts they have previously encountered. Traditional visualization teaching methods, which usually rely on directly translating conceptual aspects of data into concrete data visualizations, often fail to attend to the needs of novice learners navigating this tension. To address this, we conducted an empirical exploration of how analogies can be used to help novices with chart comprehension. We introduced visualization analogies: visualizations that map data structures to real-world contexts to facilitate an intuitive understanding of novel chart types. We evaluated this pedagogical technique using a within-subject study (N=128) where we taught 8 chart types using visualization analogies. Our findings show that visualization analogies improve visual analysis skills and help learners transfer their understanding to actual charts. They effectively introduce visual embellishments, cater to diverse learning preferences, and are preferred by novice learners over traditional chart visualizations. This study offers empirical insights and open-source tools to advance visualization education through analogical reasoning.

Paperid: 1324, https://arxiv.org/pdf/2506.01836.pdf

Abstract:
With the automotive industry transitioning towards conditionally automated driving, takeover warning systems are crucial for ensuring safe collaborative driving between users and semi-automated vehicles. However, previous work has focused on static warning systems that do not accommodate different driver states. Therefore, we propose an adaptive takeover warning system that is personalised to drivers, enhancing their experience and safety. We conducted two user studies investigating semi-autonomous driving scenarios in rural and urban environments while participants performed non-driving-related tasks such as text entry and visual search. We investigated the effects of varying time budgets and head-up versus head-down displays for takeover requests on drivers' situational awareness and mental state. Through our statistical and clustering analyses, we propose strategies for designing adaptable takeover systems, e.g., using longer time budgets and head-up displays for non-hazardous takeover events in high-complexity environments while using shorter time budgets and head-down displays for hazardous events in low-complexity environments.

Paperid: 1325, https://arxiv.org/pdf/2506.01135.pdf

Abstract:
Robot teleoperation with extended reality (XR teleoperation) enables intuitive interaction by allowing remote robots to mimic user motions with real-time 3D feedback. However, existing systems face significant motion-to-motion (M2M) latency--the delay between the user's latest motion and the corresponding robot feedback--leading to high teleoperation error and mission completion time. This issue stems from the system's exclusive reliance on network communication, making it highly vulnerable to network degradation. To address these challenges, we introduce TeleXR, the first end-to-end, fully open-sourced XR teleoperation framework that decouples robot control and XR visualization from network dependencies. TeleXR leverages local sensing data to reconstruct delayed or missing information of the counterpart, thereby significantly reducing network-induced issues. This approach allows both the XR and robot to run concurrently with network transmission while maintaining high robot planning accuracy. TeleXR also features contention-aware scheduling to mitigate GPU contention and bandwidth-adaptive point cloud scaling to cope with limited bandwidth.

Paperid: 1326, https://arxiv.org/pdf/2505.23685.pdf

Abstract:
Stereoscopic head-mounted displays (HMDs) render and present binocular images to create an egocentric, 3D percept for the HMD user. Within this render and presentation pipeline there are potential rendering camera and viewing position errors that can induce deviations in the depth and distance that a user perceives compared to the underlying intended geometry. For example, rendering errors can arise when HMD render cameras are incorrectly positioned relative to the assumed centers of projection of the HMD displays, and viewing errors can arise when users view stereo geometry from the incorrect location in the HMD eyebox. In this work we present a geometric framework that predicts errors in distance perception arising from inaccurate HMD perspective geometry and build an HMD platform to reliably simulate render and viewing error in a Quest 3 HMD with eye tracking to experimentally test these predictions. We present a series of five experiments to explore the efficacy of this geometric framework and show that errors in perspective geometry can induce both under- and over-estimations in perceived distance. We further demonstrate how real-time visual feedback can be used to dynamically recalibrate visuomotor mapping so that an accurate reach distance is achieved even if the perceived visual distance is negatively impacted by geometric error.

Paperid: 1327, https://arxiv.org/pdf/2505.22962.pdf

Abstract:
Calls to decentralize feed-based social media have been driven by concerns about the concentrated power of centralized platforms and their societal impact. In response, numerous decentralized social media protocols have emerged, each interpreting "decentralization" in different ways. We analyze four such protocols -- ActivityPub, AT Protocol, Nostr, and Farcaster -- to develop a novel conceptual framework for understanding how protocols operationalize decentralization. Drawing from protocol documentation, media coverage, and first-hand interviews with protocol developers and experts, we contextualize each protocol's approach within their respective socio-technical goals. Our framework highlights how control over key components is distributed differently across each protocol, shaping who holds power over what kinds of decisions. How components are arranged in relation to one another further impacts how component owners might offset each other's power in shaping social media. We argue that examining protocols as artifacts reveals how values shape infrastructure and power dynamics -- and that with a holistic framework as a guide, we can more effectively evaluate and design decentralized platforms aligned with the social and political futures we envision.

Paperid: 1328, https://arxiv.org/pdf/2505.17202.pdf

Abstract:
Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: https://osf.io/e25mu/?view_only=399daff5a14d4b16b09473cf19043f18.

Paperid: 1329, https://arxiv.org/pdf/2505.16011.pdf

Abstract:
We present a comprehensive survey of perception-based redirected walking (RDW) techniques in virtual reality (VR), presenting a taxonomy that serves as a framework for understanding and designing RDW algorithms. RDW enables users to explore virtual environments (VEs) larger than their physical space, addressing the constraints of real walking in limited home VR setups. Our review spans 232 papers, with 165 included in the final analysis. We categorize perception-based RDW techniques based on gains, gain application, target orientation calculation, and optional general enhancements, identifying key patterns and relationships. We present data on how current work aligns within this classification system and suggest how this data can guide future work into areas that are relatively underexplored. This taxonomy clarifies perception-based RDW techniques, guiding the design and application of RDW systems, and suggests future research directions to enhance VR user experience.

Paperid: 1330, https://arxiv.org/pdf/2505.15377.pdf

Abstract:
In today's digital era, the internet plays a pervasive role in our lives, influencing everyday activities such as communication, work, and leisure. This online engagement intertwines with offline experiences, shaping individuals' overall well-being. Despite its significance, existing research often falls short in capturing the relationship between internet use and well-being, relying primarily on isolated studies and self-reported data. One of the major contributors to deteriorated well-being -- both physical and mental -- is stress. While some research has examined the relationship between internet use and stress, both positive and negative associations have been reported. Our primary goal in this work is to identify the associations between an individual's internet use and their stress. To achieve this goal, we conducted a longitudinal multimodal study that spanned seven months. We combined fine-grained URL-level web browsing traces of 1490 German internet users with their sociodemographics and monthly measures of stress. Further, we developed a conceptual framework that allows us to simultaneously explore different contextual dimensions, including how, where, when, and by whom the internet is used. Our analysis revealed several associations between internet use and stress that vary by context. Social media, entertainment, online shopping, and gaming were positively associated with stress, while productivity, news, and adult content use were negatively associated. In the future, the behavioral markers we identified can pave the way for designing individualized tools for people to self-monitor and self-moderate their online behaviors to enhance their well-being, reducing the burden on already overburdened mental health services.

Paperid: 1331, https://arxiv.org/pdf/2505.15108.pdf

Abstract:
The proliferation of Large Language Models (LLMs) and Intelligent Virtual Agents acting as psychotherapists presents significant opportunities for expanding mental healthcare access. However, their deployment has also been linked to serious adverse outcomes, including user harm and suicide, facilitated by a lack of standardized evaluation methodologies capable of capturing the nuanced risks of therapeutic interaction. Current evaluation techniques lack the sensitivity to detect subtle changes in patient cognition and behavior during therapy sessions that may lead to subsequent decompensation. We introduce a novel risk ontology specifically designed for the systematic evaluation of conversational AI psychotherapists. Developed through an iterative process including review of the psychotherapy risk literature, qualitative interviews with clinical and legal experts, and alignment with established clinical criteria (e.g., DSM-5) and existing assessment tools (e.g., NEQ, UE-ATR), the ontology aims to provide a structured approach to identifying and assessing user/patient harms. We provide a high-level overview of this ontology, detailing its grounding, and discuss potential use cases. We discuss four use cases in detail: monitoring real user interactions, evaluation with simulated patients, benchmarking and comparative analysis, and identifying unexpected outcomes. The proposed ontology offers a foundational step towards establishing safer and more responsible innovation in the domain of AI-driven mental health support.

Paperid: 1332, https://arxiv.org/pdf/2505.14363.pdf

Abstract:
This paper explores the evolving landscape of human-machine co-creation, focusing on its development in the context of the ACM Conference on Human Factors in Computing Systems (CHI) from 2014 to 2024. We employ co-word analysis to identify emerging trends, central themes, and the intellectual trajectory of this field. The study highlights the shift from viewing machines as mere tools to recognizing them as collaborative partners in creative processes. By understanding these dynamics, we aim to provide insights into the implications of this paradigm shift for creativity, innovation, and societal impact, ultimately fostering a more inclusive and effective approach to human-machine interaction in various domains.
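
To make the co-word analysis mentioned in this abstract concrete, the sketch below counts how often pairs of keywords appear together on the same paper, which is the basic co-occurrence step that such analyses build on. The function name, the lower-casing normalization, and the toy keyword lists are assumptions for illustration; the abstract does not describe the authors' actual preprocessing or data.

```python
from collections import Counter
from itertools import combinations


def coword_counts(keyword_lists):
    """Count keyword pairs that co-occur on the same paper.

    Each element of keyword_lists is the list of keywords for one paper.
    Keywords are lower-cased and de-duplicated per paper (an assumed
    normalization), and each pair is stored in alphabetical order.
    """
    counts = Counter()
    for keywords in keyword_lists:
        unique = sorted({k.strip().lower() for k in keywords})
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical keyword lists, not data from the paper.
papers = [
    ["AI", "creativity"],
    ["ai", "Creativity", "design"],
    ["design", "ai"],
]
pair_counts = coword_counts(papers)
```

The resulting pair counts can then be treated as edge weights in a keyword network, from which central themes and clusters are typically derived.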

Paperid: 1333, https://arxiv.org/pdf/2505.11808.pdf

Abstract:
A quadruped robot is a promising system that can offer assistance comparable to that of dog guides due to its similar form factor. However, various challenges remain in making these robots a reliable option for blind and low-vision (BLV) individuals. Among these challenges, noise and jerky motion during walking are critical drawbacks of existing quadruped robots. While these issues have largely been overlooked in guide dog robot research, our interviews with guide dog handlers and trainers revealed that acoustic and physical disturbances can be particularly disruptive for BLV individuals, who rely heavily on environmental sounds for navigation. To address these issues, we developed a novel walking controller for slow stepping and smooth foot swing/contact while maintaining human walking speed, as well as robust and stable balance control. The controller integrates with a perception system to facilitate locomotion over non-flat terrains, such as stairs. Our controller was extensively tested on the Unitree Go1 robot and, when compared with other control methods, demonstrated significant noise reduction, with noise at half the level of the default locomotion controller. In this study, we adopt a mixed-methods approach to evaluate its usability with BLV individuals. In our indoor walking experiments, participants compared our controller to the robot's default controller. Results demonstrated superior acceptance of our controller, highlighting its potential to improve the user experience of guide dog robots. Video demonstration (best viewed with audio) available at: https://youtu.be/8-pz_8Hqe6s.

Paperid: 1334, https://arxiv.org/pdf/2505.09757.pdf

Abstract:
The recent trend of self-sovereign Decentralized AI Agents (DeAgents) combines Large Language Model (LLM)-based AI agents with decentralization technologies such as blockchain smart contracts and trusted execution environments (TEEs). These tamper-resistant trustless substrates allow agents to achieve self-sovereignty through ownership of cryptowallet private keys and control of digital assets and social media accounts. DeAgents eliminate centralized control and reduce human intervention, addressing key trust concerns inherent in centralized AI systems. This contributes to social computing by enabling a new human cooperative paradigm, "intelligence as commons." However, given ongoing challenges in LLM reliability such as hallucinations, this creates a paradoxical tension between trustlessness and unreliable autonomy. This study addresses this empirical research gap through interviews with DeAgents stakeholders (experts, founders, and developers) to examine their motivations, benefits, and governance dilemmas. The findings will guide future DeAgents system and protocol design and inform discussions about governance in sociotechnical AI systems in the future agentic web.

Paperid: 1335, https://arxiv.org/pdf/2505.09166.pdf

Abstract:
In the creative practice of text-to-image (TTI) generation, images are synthesized from textual prompts. By design, TTI models always yield an output, even if the prompt contains unknown terms. In this case, the model may generate default images: images that closely resemble each other across many unrelated prompts. Studying default images is valuable for designing better solutions for prompt engineering and TTI generation. We present the first investigation into default images on Midjourney. We describe an initial study in which we manually created input prompts triggering default images, and several ablation studies. Building on these, we conduct a computational analysis of about 750,000 images, revealing consistent default images across unrelated prompts. We also conduct an online user study investigating how default images may affect user satisfaction. Our work lays the foundation for understanding default images in TTI generation, highlighting their practical relevance as well as challenges and future research directions.

Paperid: 1336, https://arxiv.org/pdf/2505.07625.pdf

Abstract:
Laser interferometry (LFI)-based eye-tracking systems provide an alternative to traditional camera-based solutions, offering improved privacy by eliminating the risk of direct visual identification. However, the high-frequency signals captured by LFI-based trackers may still contain biometric information that enables user identification. This study investigates user identification from raw high-frequency LFI-based eye movement data by analyzing features extracted from both the time and frequency domains. Using velocity and distance measurements without requiring direct gaze data, we develop a multi-class classification model to accurately distinguish between individuals across various activities. Our results demonstrate that even without direct visual cues, eye movement patterns exhibit sufficient uniqueness for user identification, achieving 93.14% accuracy and a 2.52% EER with 5-second windows across both static and dynamic tasks. Additionally, we analyze the impact of sampling rate and window size on model performance, providing insights into the feasibility of LFI-based biometric recognition. Our findings demonstrate the novel potential of LFI-based eye-tracking for user identification, highlighting both its promise for secure authentication and emerging privacy risks. This work paves the way for further research into high-frequency eye movement data.
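
For readers unfamiliar with the equal error rate (EER) reported in this abstract, the following is a minimal sketch of how an EER can be estimated from similarity scores. It is a generic illustration, not the authors' evaluation code: the function name, the convention that higher scores mean "same user", and the simple threshold sweep over observed scores are assumptions.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER from two lists of similarity scores.

    genuine_scores: scores for comparisons where the claimed user is correct.
    impostor_scores: scores for comparisons where the claimed user is wrong.
    A sample is accepted when its score is >= the threshold. The EER is
    taken at the observed threshold where the false accept rate (FAR) and
    false reject rate (FRR) are closest, reported as their average.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

A lower EER indicates that genuine and impostor scores are better separated; with finely sampled scores, interpolating between thresholds gives a smoother estimate than this discrete sweep.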

Paperid: 1338, https://arxiv.org/pdf/2505.06120.pdf

Abstract:
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.

Paperid: 1339, https://arxiv.org/pdf/2505.05832.pdf

Abstract:
Individuals with upper limb movement limitations face challenges in interacting with others. Although robotic arms are currently used primarily for functional tasks, there is considerable potential to explore ways to enhance users' body language capabilities during social interactions. This paper introduces an Augmented Body Communicator system that integrates robotic arms and a large language model. Through the incorporation of kinetic memory, disabled users and their supporters can collaboratively design actions for the robot arm. The LLM system then provides suggestions on the most suitable action based on contextual cues during interactions. The system underwent thorough user testing with six participants who have conditions affecting upper limb mobility. Results indicate that the system improves users' ability to express themselves. Based on our findings, we offer recommendations for developing robotic arms that support disabled individuals with body language capabilities and functional tasks.

Paperid: 1340, https://arxiv.org/pdf/2505.05318.pdf

Abstract:
The rapid adoption of Vision Language Models (VLMs), pre-trained on large image-text and video-text datasets, calls for protecting and informing users about when to trust these systems. This survey reviews studies on trust dynamics in user-VLM interactions, through a multi-disciplinary taxonomy encompassing different cognitive science capabilities, collaboration modes, and agent behaviours. Literature insights and findings from a workshop with prospective VLM users inform preliminary requirements for future VLM trust studies.

Paperid: 1341, https://arxiv.org/pdf/2505.03807.pdf

Abstract:
Video story interaction enables viewers to engage with and explore narrative content for personalized experiences. However, existing methods are limited to user selection and specially designed narratives, and they lack customization. To address this, we propose an interactive system based on user intent. Our system uses a Vision Language Model (VLM) to enable machines to understand video stories, combining Retrieval-Augmented Generation (RAG) and a Multi-Agent System (MAS) to create evolving characters and scene experiences. It includes three stages: 1) Video story processing, utilizing VLM and prior knowledge to simulate human understanding of stories across three modalities. 2) Multi-space chat, creating growth-oriented characters through MAS interactions based on user queries and story stages. 3) Scene customization, expanding and visualizing various story scenes mentioned in dialogue. Applied to the Harry Potter series, our study shows the system effectively portrays emergent character social behavior and growth, enhancing the interactive experience in the video story world.

Paperid: 1342, https://arxiv.org/pdf/2505.01537.pdf

Abstract:
Psychological research has identified different patterns individuals have while making decisions, such as vigilance (making decisions after thorough information gathering), hypervigilance (rushed and anxious decision-making), and buckpassing (deferring decisions to others). We examine whether these decision-making patterns shape people's likelihood of seeking out or relying on AI. In an online experiment with 810 participants tasked with distinguishing food facts from myths, we found that a higher buckpassing tendency was positively correlated with both seeking out and relying on AI suggestions, while being negatively correlated with the time spent reading AI explanations. In contrast, the higher a participant tended towards vigilance, the more carefully they scrutinized the AI's information, as indicated by an increased time spent looking through the AI's explanations. These findings suggest that a person's decision-making pattern plays a significant role in their adoption of and reliance on AI, which provides a new understanding of individual differences in AI-assisted decision-making.

Paperid: 1343, https://arxiv.org/pdf/2504.21563.pdf

Abstract:
Teleoperation has emerged as a promising fallback for situations beyond the capabilities of automated vehicles. Nevertheless, teleoperation still faces challenges, such as reduced situational awareness. Since situational awareness is primarily built through the remote operator's visual perception, the Graphical User Interface (GUI) design is critical. In addition to video feeds, supplemental informational elements are crucial, not only for the predominantly studied Remote Driving but also for the emerging desk-based Remote Assistance concepts. This work develops a GUI for different teleoperation concepts by identifying key informational elements during the teleoperation process through expert interviews (N = 9). Following this, a static and a dynamic GUI prototype are developed and evaluated in a click-dummy study (N = 36). The dynamic GUI adapts the number of displayed elements according to the teleoperation phase. Results show that both GUIs achieve good System Usability Scale (SUS) ratings, with the dynamic GUI significantly outperforming the static version in both usability and task completion time. The User Experience Questionnaire (UEQ) score shows potential for improvement. To enhance the user experience, the GUI should be evaluated in a follow-up study that includes interaction with a real vehicle.
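
Because this abstract reports System Usability Scale (SUS) ratings, a short sketch of the standard SUS scoring rule may help in interpreting them. The function name and input format are illustrative assumptions; the scoring itself follows the standard questionnaire: ten items rated 1 to 5, odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a score from 0 to 100.

```python
def sus_score(responses):
    """Compute the SUS score for one participant.

    responses: list of ten integer ratings (1 to 5), in questionnaire order.
    Returns a float between 0.0 and 100.0.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten item ratings")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item_number % 2 == 1 else (5 - rating)
    return total * 2.5
```

A study-level SUS rating, such as those compared between the static and dynamic GUI here, is typically the mean of these per-participant scores.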

Paperid: 1344, https://arxiv.org/pdf/2504.18969.pdf

Abstract:
Emotion recognition has the potential to play a pivotal role in enhancing human-computer interaction by enabling systems to accurately interpret and respond to human affect. Yet, capturing emotions in face-to-face contexts remains challenging due to subtle nonverbal cues, variations in personal traits, and the real-time dynamics of genuine interactions. Existing emotion recognition datasets often rely on limited modalities or controlled conditions, thereby missing the richness and variability found in real-world scenarios. In this work, we introduce Advancing Face-to-Face Emotion Communication (AFFEC), a multimodal dataset designed to address these gaps. AFFEC encompasses 84 simulated emotional dialogues across six distinct emotions, recorded from 73 participants over more than 5,000 trials and annotated with more than 20,000 labels. It integrates electroencephalography (EEG), eye-tracking, galvanic skin response (GSR), facial videos, and Big Five personality assessments. Crucially, AFFEC explicitly distinguishes between felt emotions (the participant's internal affect) and perceived emotions (the observer's interpretation of the stimulus). Baseline analyses spanning unimodal features and straightforward multimodal fusion demonstrate that even minimal processing yields classification performance significantly above chance, especially for arousal. Incorporating personality traits further improves predictions of felt emotions, highlighting the importance of individual differences. By bridging controlled experimentation with more realistic face-to-face stimuli, AFFEC offers a unique resource for researchers aiming to develop context-sensitive, adaptive, and personalized emotion recognition models.

Paperid: 1345, https://arxiv.org/pdf/2504.17792.pdf

Abstract:
Safety-critical data, such as crash and near-crash records, are crucial to improving autonomous vehicle (AV) design and development. Sharing such data across AV companies, academic researchers, regulators, and the public can help make all AVs safer. However, AV companies rarely share safety-critical data externally. This paper aims to pinpoint why AV companies are reluctant to share safety-critical data, with an eye on how these barriers can inform new approaches to promote sharing. We interviewed twelve AV company employees who actively work with such data in their day-to-day work. Findings suggest two key, previously unknown barriers to data sharing: (1) Datasets inherently embed salient knowledge that is key to improving AV safety and are resource-intensive. Therefore, data sharing, even within a company, is fraught with politics. (2) Interviewees believed AV safety knowledge is private knowledge that brings competitive edges to their companies, rather than public knowledge for social good. We discuss the implications of these findings for incentivizing and enabling safety-critical AV data sharing, specifically, implications for new approaches to (1) debating and stratifying public and private AV safety knowledge; (2) innovating data tools and data sharing pipelines that enable easier sharing of public AV safety data and knowledge; and (3) offsetting costs of curating safety-critical data and incentivizing data sharing.

Paperid: 1346, https://arxiv.org/pdf/2504.17173.pdf

Abstract:
In recent years, Channel State Information (CSI), recognized for its fine-grained spatial characteristics, has attracted increasing attention in WiFi-based indoor localization. However, despite its potential, CSI-based approaches have yet to achieve the same level of deployment scale and commercialization as those based on Received Signal Strength Indicator (RSSI). A key limitation lies in the fact that most existing CSI-based systems are developed and evaluated in controlled, small-scale environments, limiting their generalizability. To bridge this gap, we explore the deployment of a large-scale CSI-based localization system involving over 400 Access Points (APs) in a real-world building under the Integrated Sensing and Communication (ISAC) paradigm. We highlight two critical yet often overlooked factors: the underutilization of unlabeled data and the inherent heterogeneity of CSI measurements. To address these challenges, we propose a novel CSI-based learning framework for WiFi localization, tailored for large-scale ISAC deployments on the server side. Specifically, we employ a novel graph-based structure to model heterogeneous CSI data and reduce redundancy. We further design a pretext pretraining task that incorporates spatial and temporal priors to effectively leverage large-scale unlabeled CSI data. Complementarily, we introduce a confidence-aware fine-tuning strategy to enhance the robustness of localization results. In a leave-one-smartphone-out experiment spanning five floors and 25,600 m², we achieve a median localization error of 2.17 meters and a floor accuracy of 99.49%. This performance corresponds to an 18.7% reduction in mean absolute error (MAE) compared to the best-performing baseline.

Paperid: 1347, https://arxiv.org/pdf/2504.16295.pdf

Abstract:
Visual-vestibular conflicts (VVCs) are a primary contributor to visually induced motion sickness (VIMS) in head-mounted displays (HMDs). However, virtual reality (VR) comfort studies often rely on exposing seated or standing users to experiences with high intensity visual motion (such as roller coasters). These drastic VVCs tend to induce pronounced VIMS symptoms that can be reliably detected across individuals using common survey measures. The conclusions from studies using these extreme motion-based conflicts may not accurately generalize to naturalistic use cases in VR where efforts are made to minimize, rather than maximize, VIMS symptoms. In this work, we show that a subthreshold visual-vestibular conflict can induce measurable discomfort during naturalistic, long duration use. We first present a psychophysical study, conducted outside of an HMD, to rigorously identify the perceptual thresholds for sinusoidal noise in render pose (i.e., jitter) resulting in erroneous 3D motion of rendered content. We next introduce subthreshold levels of jitter to a Meta Quest 3 VR HMD and demonstrate that this can induce visual discomfort in participants playing the commercially-available game Cubism across a three-session, repeated-measures study. Importantly, we did not identify statistically significant comfort differences between control and jitter conditions with traditional pre- and post-test comparison of Simulator Sickness Questionnaire (SSQ) scores. Significant differences were only identified using the Motion Illness Symptoms Classification (MISC) survey administered every 10 minutes across each 90 minute session. This highlights the benefits of incorporating time-resolved data points and suggests that lightweight, more frequent surveys may be important tools for measuring visual discomfort in more ecologically-valid scenarios.

Paperid: 1348, https://arxiv.org/pdf/2504.16180.pdf

Abstract:
Context: Jupyter Notebook has emerged as a versatile tool that transforms how researchers, developers, and data scientists conduct and communicate their work. As the adoption of Jupyter notebooks continues to rise, so does the interest from the software engineering research community in improving the software engineering practices for Jupyter notebooks. Objective: The purpose of this study is to analyze trends, gaps, and methodologies used in software engineering research on Jupyter notebooks. Method: We selected 146 relevant publications from the DBLP Computer Science Bibliography up to the end of 2024, following established systematic literature review guidelines. We explored publication trends, categorized them based on software engineering topics, and reported findings based on those topics. Results: The most popular venues for publishing software engineering research on Jupyter notebooks are related to human-computer interaction instead of traditional software engineering venues. Researchers have addressed a wide range of software engineering topics on notebooks, such as code reuse, readability, and execution environment. Although reusability is one of the research topics for Jupyter notebooks, only 64 of the 146 studies can be reused based on their provided URLs. Additionally, most replication packages are not hosted on permanent repositories for long-term availability and adherence to open science principles. Conclusion: Solutions specific to notebooks for software engineering issues, including testing, refactoring, and documentation, are underexplored. Future research opportunities exist in automatic testing frameworks, refactoring clones between notebooks, and generating group documentation for coherent code cells.

Paperid: 1349, https://arxiv.org/pdf/2504.14539.pdf

Abstract:
The application of external human-machine interface (EHMI) on autonomous vehicles (AVs) facilitates information exchange. Existing research fails to consider the impact of the sequence of actions, as well as the effects of EHMI applications and deception, raising the question of whether benevolent, well-intentioned deception should be permitted (i.e., misleading statements that are intended to benefit both parties). We established a game theory based EHMI information disclosure framework for AVs in this study. In considering benevolent deception, this framework divided the decision-making process into three stages, respectively encompassing three key questions: whether to disclose, when to disclose, and what type of intention information to disclose. The results show that theoretical advantages of deception exist in certain cases when AV expects to maximize the safety of the interaction. In 40 out of 484 cases (8.3%), safety can be enhanced through successful deception. Those successful deceptions fall into two categories: 1) In 28 of these cases, the straight-going AV expected the left-turning human-driven vehicle (HV) to yield, while HV exhibited lower speed and higher acceleration; 2) In 12 of these cases, AV expected HV to proceed first, while HV exhibited higher speed and lower acceleration. We also conducted a VR-based driving simulation experiment, and the results confirmed our conclusion. Additionally, we found that when participants had low trust in the EHMI, its use negatively impacted interaction efficiency instead. This study serves as an exploratory behavioral mechanism study based on specific hypotheses for future EHMI design and ethical decision-making of autonomous driving system.

Paperid: 1350, https://arxiv.org/pdf/2504.14068.pdf

Abstract:
Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents significant challenges due to limited data and domain-specific nuances. Traditional supervised learning approaches require extensive labeled datasets, making unsupervised methods more viable for uncovering meaningful insights from patient feedback. This study explores unsupervised methods to extract meaningful topics from 439 survey responses collected from a healthcare system in Wisconsin, USA. A keyword-based filtering approach was applied to isolate complaint-related feedback using a domain-specific lexicon. To delve deeper and analyze dominant topics in feedback, we explored traditional topic modeling methods, including Latent Dirichlet Allocation (LDA) and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM), alongside BERTopic, an advanced neural embedding-based clustering approach. To improve coherence and interpretability where data are scarce and consist of short texts, we propose kBERT, an integration of BERT embeddings with k-means clustering. Model performance was assessed using coherence scores (Cv) for topic interpretability and average Inverted Rank-Biased Overlap (IRBOavg) for topic diversity. Results indicate that kBERT achieves the highest coherence (Cv = 0.53) and distinct topic separation (IRBOavg = 1.00), outperforming all other models in short-text healthcare feedback analysis. Our findings emphasize the importance of embedding-based techniques for topic identification and highlight the need for context-aware models in healthcare analytics.
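
The clustering half of the kBERT idea can be illustrated with a minimal sketch, assuming the text embeddings have already been computed. The toy 2-D vectors below stand in for BERT embeddings, and the function names and the farthest-first initialization are illustrative assumptions, not details taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means over precomputed embedding vectors.

    Initialization is farthest-first (an assumption made here so the sketch
    is deterministic): start from the first vector, then repeatedly add the
    vector farthest from all centroids chosen so far.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))

    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            assignments[i] = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Hypothetical stand-ins for embeddings of four short feedback texts.
toy_embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(toy_embeddings, k=2)
```

In practice the vectors would come from a BERT-style encoder and have hundreds of dimensions; the clustering step itself is unchanged.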

Paperid: 1351, https://arxiv.org/pdf/2504.13948.pdf

Abstract:
This research investigates the use of customized GPT models to enhance prompting proficiency among architecture students when generating AI-driven images. Prompt engineering is increasingly essential in architectural education due to the widespread adoption of generative AI tools. This study utilized a mixed-methods experimental design involving architecture students divided into three distinct groups: a control group receiving no structured support, a second group provided with structured prompting guides, and a third group supported by both structured guides and interactive AI personas. Students engaged in reverse engineering tasks, first guessing provided image prompts and then generating their own prompts, aiming to boost critical thinking and prompting skills. Variables examined included time spent prompting, word count, prompt similarity, and concreteness. Quantitative analysis involved correlation assessments between these variables and a one-way ANOVA to evaluate differences across groups. While several correlations showed meaningful relationships, not all were statistically significant. ANOVA results indicated statistically significant improvements in word count, similarity, and concreteness, especially in the group supported by AI personas and structured prompting guides. Qualitative feedback complemented these findings, revealing enhanced confidence and critical thinking skills in students. These results suggest tailored GPT interactions substantially improve students' ability to communicate architectural concepts clearly and effectively.

Paperid: 1352, https://arxiv.org/pdf/2504.13891.pdf

Abstract:
In this work, we introduce Mozualization, a music generation and editing tool that creates multi-style embedded music by integrating diverse inputs, such as keywords, images, and sound clips (e.g., segments from various pieces of music or even a playful cat's meow). Our work is inspired by the ways people express their emotions -- writing mood-descriptive poems or articles, creating drawings with warm or cool tones, or listening to sad or uplifting music. Building on this concept, we developed a tool that transforms these emotional expressions into a cohesive and expressive song, allowing users to seamlessly incorporate their unique preferences and inspirations. To evaluate the tool and, more importantly, gather insights for its improvement, we conducted a user study involving nine music enthusiasts. The study assessed user experience, engagement, and the impact of interacting with and listening to the generated music.

Paperid: 1353, https://arxiv.org/pdf/2504.13880.pdf

Abstract:
Background: The HERMES Kiosk (Healthcare Enhanced Recommendations through Artificial Intelligence & Expertise System) is designed to provide personalized Over-the-Counter (OTC) medication recommendations, addressing the limitations of traditional health kiosks. It integrates an advanced GAMENet model enhanced with Graph Attention Networks (GAT) and Multi-Head Cross-Attention (MHCA) while ensuring user privacy through federated learning. This paper outlines the conceptual design and architecture of HERMES, with a focus on deployment in high-traffic public areas. Methods: HERMES analyzes self-reported symptoms and anonymized medical histories using AI algorithms to generate context-aware OTC medication recommendations. The system was initially trained using Electronic Health Records (EHR) from the MIMIC-III dataset (6,350 patients) and Drug-Drug Interaction (DDI) data from the TWOSIDES database, incorporating the top 90 severity DDI types. Real-time DDI checks and ATC-mapped drug codes further improve safety. The kiosk is designed for accessibility, offering multilingual support, large fonts, voice commands, and Braille compatibility. A built-in health education library promotes preventive care and health literacy. A survey was conducted among 10 medical professionals to evaluate its potential applications in medicine. Results: Preliminary results show that the enhanced GAMENet model achieved a Precision-Recall AUC (PRAUC) of 0.74, outperforming the original model. These findings suggest a strong potential for delivering accurate and secure healthcare recommendations in public settings. Conclusion: HERMES demonstrates how AI-driven, privacy-preserving kiosks can enhance public health access, empower users, and alleviate burdens on healthcare systems. Future work will focus on real-world deployment, usability testing, and scalability for broader adoption.

Paperid: 1354, https://arxiv.org/pdf/2504.13854.pdf

Abstract:
Robot avatars for customer service are gaining traction in Japan. However, their acceptance in other societal contexts remains underexplored, complicating efforts to design robot avatars suitable for diverse cultural environments. To address this, we interviewed key stakeholders in Dubai's service sector to gain insights into their experiences deploying social robots for customer service, as well as their opinions on the most useful tasks and design features that could maximize customer acceptance of robot avatars in Dubai. Providing information and guiding individuals to specific locations were identified as the most valued functions. Regarding appearance, robotic-looking, highly anthropomorphic designs were the most preferred. Ultra-realistic androids and cartoonish-looking robots elicited mixed reactions, while hybrid androids, low-anthropomorphic robotic designs, and animal-looking robots were considered less suitable or discouraged. Additionally, a psycho-sociological analysis revealed that interactions with robot avatars are influenced by their symbolic meaning, context, and affordances. These findings offer pioneering insights into culturally adaptive robot avatar design, addressing a significant research gap and providing actionable guidelines for deploying socially acceptable robots and avatars in multicultural contexts worldwide.

Paperid: 1355, https://arxiv.org/pdf/2504.13370.pdf

Abstract:
This paper proposes a wearable-controlled mobile manipulator system for intelligent smart home assistance, integrating MEMS capacitive microphones, IMU sensors, vibration motors, and pressure feedback to enhance human-robot interaction. The wearable device captures forearm muscle activity and converts it into real-time control signals for mobile manipulation. The wearable device achieves an offline classification accuracy of 88.33% across six distinct movement-force classes for hand gestures by using a CNN-LSTM model, while real-world experiments involving five participants yield a practical accuracy of 83.33% with an average system response time of 1.2 seconds. In Human-Robot synergy in navigation and grasping tasks, the robot achieved a 98% task success rate with an average trajectory deviation of only 3.6 cm. Finally, the wearable-controlled mobile manipulator system achieved a 93.3% gripping success rate, a transfer success of 95.6%, and a full-task success rate of 91.1% during object grasping and transfer tests, in which a total of 9 object-texture combinations were evaluated. These three experiments' results validate the effectiveness of MEMS-based wearable sensing combined with multi-sensor fusion for reliable and intuitive control of assistive robots in smart home scenarios.

Paperid: 1356, https://arxiv.org/pdf/2504.13058.pdf

Abstract:
Ensuring equitable access to computing education for all students, including those with autism, dyslexia, or ADHD, is essential to developing a diverse and inclusive workforce. To understand the state of disability research in computing education, we conducted a systematic literature review of research on neurodiversity in computing education. Our search resulted in 1,943 total papers, which we filtered to 14 papers based on our inclusion criteria. Our mixed-methods approach analyzed research methods, participants, contribution types, and findings. The three main contribution types included empirical contributions based on user studies (57.1%), opinion contributions and position papers (50%), and survey contributions (21.4%). Interviews were the most common methodology (75% of empirical contributions). There were often inconsistencies in how research methods were described (e.g., number of participants and interview and survey materials). Our work shows that research on neurodivergence in computing education is still very preliminary. Most papers provided curricular recommendations that lacked empirical evidence to support those recommendations. Three areas of future work include investigating the impacts of active learning, increasing awareness and knowledge about neurodiverse students' experiences, and engaging neurodivergent students in the design of pedagogical materials and computing education research.

Paperid: 1357, https://arxiv.org/pdf/2504.12931.pdf

Abstract:
Large Language Models (LLMs) are increasingly being used to perform automated evaluations and to explain them. However, concerns about explanation quality, consistency, and hallucinations remain open research challenges, particularly in high-stakes contexts like privacy and security, where user trust and decision-making are at stake. In this paper, we investigate these issues in the context of PRISMe, an interactive privacy policy assessment tool that leverages LLMs to evaluate and explain website privacy policies. Based on a prior user study with 22 participants, we identify key concerns regarding LLM judgment transparency, consistency, and faithfulness, as well as variations in user preferences for explanation detail and engagement. We discuss potential strategies to mitigate these concerns, including structured evaluation criteria, uncertainty estimation, and retrieval-augmented generation (RAG). We identify a need for adaptive explanation strategies tailored to different user profiles for LLM-as-a-judge. Our goal is to show that usable privacy and security is a promising application area in which Human-Centered Explainable AI (HCXAI) can make an impact.

Paperid: 1358, https://arxiv.org/pdf/2504.12492.pdf

Abstract:
There has been a continued trend towards minimizing instrumentation for full-body motion capture, going from specialized rooms and equipment, to arrays of worn sensors and recently sparse inertial pose capture methods. However, as these techniques migrate towards lower-fidelity IMUs on ubiquitous commodity devices, like phones, watches, and earbuds, challenges arise including compromised online performance, temporal consistency, and loss of global translation due to sensor noise and drift. Addressing these challenges, we introduce MobilePoser, a real-time system for full-body pose and global translation estimation using any available subset of IMUs already present in these consumer devices. MobilePoser employs a multi-stage deep neural network for kinematic pose estimation followed by a physics-based motion optimizer, achieving state-of-the-art accuracy while remaining lightweight. We conclude with a series of demonstrative applications to illustrate the unique potential of MobilePoser across a variety of fields, such as health and wellness, gaming, and indoor navigation to name a few.

Paperid: 1359, https://arxiv.org/pdf/2504.11138.pdf

Abstract:
Block-building activities are crucial for developing children's spatial reasoning and mathematical skills, yet parents often lack the expertise to guide these activities effectively. BrickSmart, a pioneering system, addresses this gap by providing spatial language guidance through a structured three-step process: Discovery & Design, Build & Learn, and Explore & Expand. This system uniquely supports parents in 1) generating personalized block-building instructions, 2) guiding parents to teach spatial language during building and interactive play, and 3) tracking children's learning progress, altogether enhancing children's engagement and cognitive development. In a comparative study involving 12 parent-child pairs (children aged 6-8 years) for both experimental and control groups, BrickSmart demonstrated improvements in supportiveness, efficiency, and innovation, with a significant increase in children's use of spatial vocabularies during block play, thereby offering an effective framework for fostering spatial language skills in children.

Paperid: 1360, https://arxiv.org/pdf/2504.10265.pdf

Abstract:
Although domestic work is often viewed as manual labor, it involves significant interaction with online technologies. However, the detailed exploration of how domestic workers use these technologies remains limited. This study examines the impact of online technologies on domestic workers' work practices, perceptions, and relationships with customers and employers. We interviewed 30 domestic workers residing in the United States, who provided examples that highlight the insufficient transformative role of current online technologies in their work. By conducting a thematic analysis, we characterize how they approach and avoid these digital tools at different stages of their work. Through these findings, we investigate the limitations of technology and identify challenges and opportunities that could inform the design of more suitable tools to improve the conditions of this marginalized group.

Paperid: 1361, https://arxiv.org/pdf/2504.10101.pdf

Abstract:
The dominant metaphor of LLMs-as-minds leads to misleading conceptions of machine agency and is limited in its ability to help both users and developers build the right degree of trust and understanding for outputs from LLMs. It makes it harder to disentangle hallucinations from useful model interactions. This position paper argues that there are fundamental similarities between visual perception and the way LLMs process and present language. These similarities inspire a metaphor for LLMs which could open new avenues for research into interaction paradigms and shared representations. Our visual system metaphor introduces possibilities for addressing these challenges by understanding the information landscape assimilated by LLMs. In this paper we motivate our proposal, introduce the interrelating theories from the fields that inspired this view and discuss research directions that stem from this abstraction.

Paperid: 1362, https://arxiv.org/pdf/2504.09341.pdf

Abstract:
High-quality data annotation is an essential but laborious and costly aspect of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports -- instances where annotators provide incorrect responses -- that indicate unnecessary redundancy in task assignments. We propose an approach to prune potentially redundant annotation task assignments before they are executed by estimating the likelihood of an annotator disagreeing with the majority vote for a given task. Our approach is informed by an empirical analysis over computer vision datasets annotated by a professional data annotation platform, which reveals that the likelihood of a minority report event is dependent primarily on image ambiguity, worker variability, and worker fatigue. Simulations over these datasets show that we can reduce the number of annotations required by over 60% with a small compromise in label quality, saving approximately 6.6 days-equivalent of labor. Our approach provides annotation service platforms with a method to balance cost and dataset quality. Machine learning practitioners can tailor annotation accuracy levels according to specific application needs, thereby optimizing budget allocation while maintaining the data quality necessary for critical settings like autonomous driving technology.
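
The notion of a minority report can be made concrete with a small sketch. It only detects, after the fact, which annotators disagreed with the majority vote on each task; it does not implement the paper's predictive pruning. The data layout and function names are hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def minority_reports(annotations):
    """Find annotations that disagree with the per-task majority vote.

    annotations: dict mapping task_id -> list of (annotator_id, label).
    Returns a list of (task_id, annotator_id) pairs.
    """
    reports = []
    for task_id, votes in annotations.items():
        majority = majority_label([label for _, label in votes])
        reports.extend(
            (task_id, annotator) for annotator, label in votes if label != majority
        )
    return reports


# Hypothetical toy data: three annotators label two images.
toy_annotations = {
    "img1": [("a", "cat"), ("b", "cat"), ("c", "dog")],
    "img2": [("a", "car"), ("b", "car"), ("c", "car")],
}
found = minority_reports(toy_annotations)
```

Here only annotator "c" on "img1" is flagged. A predictive variant would estimate the probability of such disagreement before assigning the task, using features such as the image ambiguity, worker variability, and worker fatigue named in the abstract.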

Paperid: 1363, https://arxiv.org/pdf/2504.06996.pdf

Abstract:
High-quality, multi-channel neural recording is indispensable for neuroscience research and clinical applications. Large-scale brain recordings often produce vast amounts of data that must be wirelessly transmitted for subsequent offline analysis and decoding, especially in brain-computer interfaces (BCIs) utilizing high-density intracortical recordings with hundreds or thousands of electrodes. However, transmitting raw neural data presents significant challenges due to limited communication bandwidth and resultant excessive heating. To address this challenge, we propose a neural signal compression scheme utilizing Convolutional Autoencoders (CAEs), which achieves a compression ratio of up to 150 for compressing local field potentials (LFPs). The CAE encoder section is implemented on RAMAN, an energy-efficient tinyML accelerator designed for edge computing. RAMAN leverages sparsity in activation and weights through zero skipping, gating, and weight compression techniques. Additionally, we employ hardware-software co-optimization by pruning the CAE encoder model parameters using a hardware-aware balanced stochastic pruning strategy, resolving workload imbalance issues and eliminating indexing overhead to reduce parameter storage requirements by up to 32.4%. Post-layout simulation shows that the RAMAN encoder can be implemented in a TSMC 65-nm CMOS process, occupying a core area of 0.0187 mm² per channel. Operating at a clock frequency of 2 MHz and a supply voltage of 1.2 V, the estimated power consumption is 15.1 µW per channel for the proposed DS-CAE1 model. For functional validation, the RAMAN encoder was also deployed on an Efinix Ti60 FPGA, utilizing 37.3k LUTs and 8.6k flip-flops. The compressed neural data from RAMAN is reconstructed offline with SNDR of 22.6 dB and 27.4 dB, along with R² scores of 0.81 and 0.94, respectively, evaluated on two monkey neural recordings.

Paperid: 1364, https://arxiv.org/pdf/2504.06322.pdf

Abstract:
This chapter critiques the dominant reductionist approach in AI and work studies, which isolates tasks and skills as replaceable components. Instead, it advocates for a systemic perspective that emphasizes the interdependence of tasks, roles, and workplace contexts. Two complementary approaches are proposed: an ethnographic, context-rich method that highlights how AI reconfigures work environments and expertise; and a relational task-based analysis that bridges micro-level work descriptions with macro-level labor trends. The authors argue that effective AI impact assessments must go beyond predicting automation rates to include ethical, well-being, and expertise-related questions. Drawing on empirical case studies, they demonstrate how AI reshapes human-technology relations, professional roles, and tacit knowledge practices. The chapter concludes by calling for a human-centric, holistic framework that guides organizational and policy decisions, balancing technological possibilities with social desirability and sustainability of work.

Paperid: 1365, https://arxiv.org/pdf/2504.04734.pdf

Abstract:
Recent studies reveal that experienced data practitioners often draw sketches to facilitate communication around privacy design concepts. However, there is limited understanding of how we can help novice students develop such communication skills. This paper studies methods for lowering novice data science students' barriers to creating high-quality privacy sketches. We first conducted a need-finding study (N=12) to identify barriers students face when sketching privacy designs. We then used a human-centered design approach to guide the method development, culminating in three simple, text-based heuristics. Our user studies with 24 data science students revealed that simply presenting three heuristics to the participants at the beginning of the study can enhance the coverage of privacy-related design decisions in sketches, reduce the mental effort required for creating sketches, and improve the readability of the final sketches.

Paperid: 1366, https://arxiv.org/pdf/2504.04427.pdf

Abstract:
Generating consecutive images of lip movements that align with a given speech in audio-driven lip synthesis is a challenging task. While previous studies have made strides in synchronization and visual quality, lip intelligibility and video fluency remain persistent challenges. This work proposes FluentLip, a two-stage approach for audio-driven lip synthesis, incorporating three featured strategies. To improve lip synchronization and intelligibility, we integrate a phoneme extractor and encoder to generate a fusion of audio and phoneme information for multimodal learning. Additionally, we employ optical flow consistency loss to ensure natural transitions between image frames. Furthermore, we incorporate a diffusion chain during the training of Generative Adversarial Networks (GANs) to improve both stability and efficiency. We evaluate our proposed FluentLip through extensive experiments, comparing it with five state-of-the-art (SOTA) approaches across five metrics, including a proposed metric called Phoneme Error Rate (PER) that evaluates lip pose intelligibility and video fluency. The experimental results demonstrate that our FluentLip approach is highly competitive, achieving significant improvements in smoothness and naturalness. In particular, it outperforms these SOTA approaches by approximately 16.3% in Fréchet Inception Distance (FID) and 35.2% in PER.

Paperid: 1367, https://arxiv.org/pdf/2504.04355.pdf

Abstract:
Within the virtual reality (VR) research community, there have been several efforts to develop questionnaires with the aim of better understanding the sense of presence. Despite having numerous surveys, the community does not have a questionnaire that informs which components of a VR application contributed to the sense of presence. Furthermore, previous literature notes the absence of consensus on which questionnaire or questions should be used. Therefore, we conducted a Delphi study, engaging presence experts to establish a consensus on the most important presence questions and their respective verbiage. We then conducted a validation study with an exploratory factor analysis (EFA). The efforts between our two studies led to the creation of the Fidelity-based Presence Scale (FPS). With our consensus-driven approach and fidelity-based factoring, we hope the FPS will enable better communication within the research community and yield important future results regarding the relationship between VR system fidelity and presence.

Paperid: 1368, https://arxiv.org/pdf/2504.03991.pdf

Abstract:
Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. However, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment (n=54 participants), that humans exhibit diverse coordination and communication behavior in this domain. We then show that our approach can effectively replicate trends from human teaming data and also capture behaviors that are not easily observed without collecting large amounts of data. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

Paperid: 1369, https://arxiv.org/pdf/2504.02585.pdf

Abstract:
Live coding for teaching (synchronously writing software in front of students) can be an effective method for engaging students and instilling practical programming skills. However, not all settings are conducive to live coding and not all instructors are successful in this challenging task. We present results from a study involving university instructors, teaching assistants, and students identifying both barriers and benefits of live coding. Physical infrastructure, a positive classroom community with psychological safety, and opportunities for teacher development are practical considerations for live coding. In order for live coding to be an active learning experience, we recommend that tools support multiple mechanisms for engaging students, directing audience attention, and encouraging student-led live coding.

Paperid: 1370, https://arxiv.org/pdf/2504.02204.pdf

Abstract:
Understanding the role of creativity in visualization design becomes increasingly important as the field matures, particularly with the emergence of various visualization authoring and recommendation systems. In this paper, we examine how creativity manifests in visualization design processes and how academic research has conceptualized it over time. Through a systematic review of 58 visualization papers that use the terms "creativity" or "creative," we analyze the evolution of creative practices in visualization design. Our findings show that prior literature predominantly used atypical designs through free-form drawings, infographics, pictorials, and data comics to define creative representations. However, creativity in visualization design extends beyond visual representations to encompass early needfinding design activities such as sketching, storyboarding, discussion, and card sorting. Data visualization can also support a wide variety of creative tasks (e.g., fiction writing). We discuss the implications of these findings for fostering innovation within established design paradigms and for developing more sophisticated visualization authoring systems. The full list of coded papers is available here: https://vizcreativity.notion.site/coded-papers.

Paperid: 1371, https://arxiv.org/pdf/2504.02197.pdf

Abstract:
The concept of an AI assistant for task guidance is rapidly shifting from a science fiction staple to an impending reality. Such a system is inherently complex, requiring models for perceptual grounding, attention, and reasoning, an intuitive interface that adapts to the performer's needs, and the orchestration of data streams from many sensors. Moreover, all data acquired by the system must be readily available for post-hoc analysis to enable developers to understand performer behavior and quickly detect failures. We introduce TIM, the first end-to-end AI-enabled task guidance system in augmented reality which is capable of detecting both the user and scene as well as providing adaptable, just-in-time feedback. We discuss the system challenges and propose design solutions. We also demonstrate how TIM adapts to domain applications with varying needs, highlighting how the system components can be customized for each scenario.

Paperid: 1372, https://arxiv.org/pdf/2504.02176.pdf

Abstract:
We analyzed 1,596 sub-conversations within 451 direct message (DM) conversations from 67 teens (ages 13-17) who engaged in private discussions about body image on Instagram. Our findings show that teens often receive support when sharing struggles with negative body image, participate in criticism when engaging in body-shaming, and are met with appreciation when promoting positive body image. Additionally, these types of disclosures and responses varied based on whether the conversations were one-on-one or group-based. We found that sharing struggles and receiving support most often occurred in one-on-one conversations, while body shaming and negative interactions often occurred in group settings. A key insight of the study is that private social media settings can significantly influence how teens discuss and respond to body image. Based on these findings, we suggest design guidelines for social media platforms that could promote positive interactions around body image, ultimately creating a healthier and more supportive online environment for teens dealing with body image concerns.

Paperid: 1373, https://arxiv.org/pdf/2504.01911.pdf

Abstract:
Large Language Models (LLMs) are playing an increasingly important role in physics research by assisting with symbolic manipulation, numerical computation, and scientific reasoning. However, ensuring the reliability, transparency, and interpretability of their outputs remains a major challenge. In this work, we introduce a novel multi-agent LLM physicist framework that fosters collaboration between AI and human scientists through three key modules: a reasoning module, an interpretation module, and an AI-scientist interaction module. Recognizing that effective physics reasoning demands logical rigor, quantitative accuracy, and alignment with established theoretical models, we propose an interpretation module that employs a team of specialized LLM agents (including summarizers, model builders, visualization tools, and testers) to systematically structure LLM outputs into transparent, physically grounded science models. A case study demonstrates that our approach significantly improves interpretability, enables systematic validation, and enhances human-AI collaboration in physics problem-solving and discovery. Our work bridges free-form LLM reasoning with interpretable, executable models for scientific analysis, enabling more transparent and verifiable AI-augmented research.

Paperid: 1374, https://arxiv.org/pdf/2504.00941.pdf

Abstract:
Dyslexia, a neurological condition affecting approximately 12% of the global population, presents significant challenges to reading ability and quality of life. Existing assistive technologies are limited by factors such as unsuitability for quiet environments, high costs, and the risk of distorting meaning or failing to provide real-time support. To address these issues, we introduce LARF (Let AI Read First), the first strategy that employs large language models to annotate text and enhance readability while preserving the original content. We evaluated LARF in a large-scale between-subjects experiment, involving 150 participants with dyslexia. The results show that LARF significantly improves reading performance and experience for individuals with dyslexia. Results also prove that LARF is particularly helpful for participants with more severe reading difficulties. Furthermore, this work discusses potential research directions opened up by LARF for the HCI community.

Paperid: 1375, https://arxiv.org/pdf/2503.24249.pdf

Abstract:
Implementing a teleoperation system with its various actors and interactions is challenging and requires an overview of the necessary functions. This work collects all tasks that arise in a control center for an automated vehicle fleet from literature and assigns them to the two roles Remote Operator and Fleet Manager. Focusing on the driving-related tasks of the remote operator, a process is derived that contains the sequence of tasks, associated vehicle states, and transitions between the states. The resulting state diagram shows all remote operator actions available to effectively resolve automated vehicle disengagements. Thus, the state diagram can be applied to existing legislation or modified based on prohibitions of specific interactions. The developed control center framework and included state diagram should serve as a basis for implementing and testing remote support for automated vehicles to be validated on public roads.

Paperid: 1376, https://arxiv.org/pdf/2503.24145.pdf

Abstract:
People inherently use experiences of their past while imagining their future, a capability that plays a crucial role in mental health. Resonance is an AI-powered journaling tool designed to augment this ability by offering AI-generated, action-oriented suggestions for future activities based on the user's own past memories. Suggestions are offered when a new memory is logged and are followed by a prompt for the user to imagine carrying out the suggestion. In a two-week randomized controlled study (N=55), we found that using Resonance significantly improved mental health outcomes, reducing the users' PHQ8 scores, a measure of current depression, and increasing their daily positive affect, particularly when they would likely act on the suggestion. Notably, the effectiveness of the suggestions was higher when they were personal, novel, and referenced the user's logged memories. Finally, through open-ended feedback, we discuss the factors that encouraged or hindered the use of the tool.

Paperid: 1377, https://arxiv.org/pdf/2503.23674.pdf

Abstract:
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.

Paperid: 1378, https://arxiv.org/pdf/2503.22216.pdf

Abstract:
PDF inaccessibility is an ongoing challenge that hinders individuals with visual impairments from reading and navigating PDFs using screen readers. This paper presents a step-by-step process for both novice and experienced users to create accessible PDF documents, including an approach for creating alternative text for mathematical formulas without expert knowledge. In a study involving nineteen participants, we evaluated our prototype PAVE 2.0 by comparing it against Adobe Acrobat Pro, the existing standard for remediating PDFs. Our study shows that experienced users improved their tagging scores from 42.0% to 80.1%, and novice users from 39.2% to 75.2% with PAVE 2.0. Overall, fifteen participants stated that they would prefer to use PAVE 2.0 in the future, and all participants would recommend it for novice users. Our work demonstrates PAVE 2.0's potential for increasing PDF accessibility for people with visual impairments and highlights remaining challenges.

Paperid: 1379, https://arxiv.org/pdf/2503.22181.pdf

Abstract:
This paper proposes the e-person architecture for the unified and incremental development of AI ethics. The e-person architecture takes the reduction of uncertainty through collaborative cognition and action with others as a unified basis for ethics. By classifying and defining uncertainty along two axes - (1) first, second, and third person perspectives, and (2) the difficulty of inference based on the depth of information - we support the unified and incremental development of AI ethics. In addition, we propose the e-person framework based on the free energy principle, which considers the reduction of uncertainty as a unifying principle of brain function, with the aim of implementing the e-person architecture, and we show our previous works and future challenges based on the proposed framework.

Paperid: 1380, https://arxiv.org/pdf/2503.20160.pdf

Abstract:
As artificial intelligence (AI) is increasingly integrated into the medical field, the role of humans may become vague. While numerous studies highlight AI's potential, how humans and AI collaborate to maximize the combined clinical benefits remains unexplored. In this work, we analyze 270 screening scenarios from a health-economic perspective in a national diabetic retinopathy screening program, involving eight human-AI collaborative strategies and traditional manual screening. We find that annual copilot human-AI screening in the 20-79 age group, with referral decisions made when both humans and AI agree, is the most cost-effective strategy for human-AI collaboration. The 'copilot' strategy brings health benefits equivalent to USD 4.64 million per 100,000 population compared to manual screening. These findings demonstrate that even in settings where AI is highly mature and efficient, human involvement remains essential to ensuring both health and economic benefits. Our findings highlight the need to optimize human-AI collaboration strategies for AI implementation into healthcare systems.

Paperid: 1381, https://arxiv.org/pdf/2503.19711.pdf

Abstract:
Open-ended tasks are particularly challenging for LLMs due to the vast solution space, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its vast solution space and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs - Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o - focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.

Paperid: 1382, https://arxiv.org/pdf/2503.19460.pdf

Abstract:
In programming education, fostering self-regulated learning (SRL) skills is essential for both students and teachers. This paper introduces TrackThinkDashboard, an application designed to visualize the learning workflow by integrating web browsing and programming logs into one unified view. The system aims to (1) help students monitor and reflect on their problem-solving processes, identify knowledge gaps, and cultivate effective SRL strategies; and (2) enable teachers to identify at-risk learners more effectively and provide targeted, data-driven guidance. We conducted a study with 33 participants (32 male, 1 female) from Japanese universities, including individuals with and without prior programming experience, to explore differences in web browsing and coding patterns. The dashboards revealed multiple learning approaches, such as trial-and-error and trial-and-search methods, and highlighted how domain knowledge influenced the overall activity flow. We discuss how this visualization tool can be used continuously or in one-off experiments, consider associated privacy implications, and explore opportunities for expanding data sources to gain richer behavioral insights.

Paperid: 1383, https://arxiv.org/pdf/2503.19398.pdf

Abstract:
This paper introduces CyanKitten, an interactive virtual companion system tailored for elderly users, integrating advanced posture recognition, behavior recognition, and multimodal interaction capabilities. The system utilizes a three-tier architecture to process and interpret user movements and gestures, leveraging a dual-camera setup and a convolutional neural network trained explicitly on elderly movement patterns. The behavior recognition module identifies and responds to three key interactive gestures: greeting waves, petting motions, and heart-making gestures. A multimodal integration layer also combines visual and audio inputs to facilitate natural and intuitive interactions. This paper outlines the technical implementation of each component, addressing challenges such as elderly-specific movement characteristics, real-time processing demands, and environmental adaptability. The result is an engaging and accessible virtual interaction experience designed to enhance the quality of life for elderly users.

Paperid: 1384, https://arxiv.org/pdf/2503.19075.pdf

Abstract:
Generative AI image models have been increasingly evaluated for their (in)ability to represent non-Western cultures. We argue that these evaluations operate through reductive ideals of representation, abstracted from how people define their own representation and neglecting the inherently interpretive and contextual nature of cultural representation. In contrast to these 'thin' evaluations, we introduce the idea of 'thick evaluations': a more granular, situated, and discursive measurement framework for evaluating representations of social worlds in AI images, steeped in communities' own understandings of representation. We develop this evaluation framework through workshops in South Asia, by studying the 'thick' ways in which people interpret and assign meaning to images of their own cultures. We introduce practices for thicker evaluations of representation that expand the understanding of representation underpinning AI evaluations and by co-constructing metrics with communities, bringing measurement in line with the experiences of communities on the ground.

Paperid: 1385, https://arxiv.org/pdf/2503.18492.pdf

Abstract:
Large Foundation Models (LFMs) have unlocked new possibilities in human-computer interaction, particularly with the rise of mobile Graphical User Interface (GUI) Agents capable of interacting with mobile GUIs. These agents allow users to automate complex mobile tasks through simple natural language instructions. However, the inherent probabilistic nature of LFMs, coupled with the ambiguity and context-dependence of mobile tasks, makes LFM-based automation unreliable and prone to errors. To address this critical challenge, we introduce VeriSafe Agent (VSA): a formal verification system that serves as a logically grounded safeguard for Mobile GUI Agents. VSA deterministically ensures that an agent's actions strictly align with user intent before executing the action. At its core, VSA introduces a novel autoformalization technique that translates natural language user instructions into a formally verifiable specification. This enables runtime, rule-based verification of the agent's actions, detecting erroneous actions even before they take effect. To the best of our knowledge, VSA is the first attempt to bring the rigor of formal verification to GUI agents, bridging the gap between LFM-driven actions and formal software verification. We implement VSA using off-the-shelf LFM services (GPT-4o) and evaluate its performance on 300 user instructions across 18 widely used mobile apps. The results demonstrate that VSA achieves 94.33%-98.33% accuracy in verifying agent actions, outperforming existing LFM-based verification methods by 30.00%-16.33%, and increases the GUI agent's task completion rate by 90%-130%.

Paperid: 1386, https://arxiv.org/pdf/2503.18471.pdf

Abstract:
Scholars often explore literature outside of their home community of study. This exploration process is frequently hampered by field-specific jargon. Past computational work often focuses on supporting translation work by removing jargon through simplification and summarization; here, we explore a different approach that preserves jargon as useful bridges to new conceptual spaces. Specifically, we cast different scholarly domains as different language-using communities, and explore how to adapt techniques from unsupervised cross-lingual alignment of word embeddings to explore conceptual alignments between domain-specific word embedding spaces. We developed a prototype cross-domain search engine that uses aligned domain-specific embeddings to support conceptual exploration, and tested this prototype in two case studies. We discuss qualitative insights into the promises and pitfalls of this approach to translation work, and suggest design insights for future interfaces that provide computational support for cross-domain information seeking.

Paperid: 1387, https://arxiv.org/pdf/2503.16960.pdf

Abstract:
As robots enter the messy human world, the vital matter of safety takes on a fresh complexion, with physical contact becoming inevitable and even desirable. We report on an artistic exploration of how dancers, working as part of a multidisciplinary team, engaged in contact improvisation exercises to explore the opportunities and challenges of dancing with cobots. We reveal how they employed their honed bodily senses and physical skills to engage with the robots aesthetically and yet safely, interleaving improvised physical manipulations with reflections to grow their knowledge of how the robots behaved and felt. We introduce somatic safety, a holistic mind-body approach in which safety is learned, felt and enacted through bodily contact with robots in addition to being reasoned about. We conclude that robots need to be better designed for people to hold them and might recognise tacit safety cues among people. We propose that safety should be learned through iterative bodily experience interleaved with reflection.

Paperid: 1388, https://arxiv.org/pdf/2503.16586.pdf

Abstract:
Generative AI (GenAI) browser assistants integrate powerful capabilities of GenAI in web browsers to provide rich experiences such as question answering, content summarization, and agentic navigation. These assistants, available today as browser extensions, can not only track detailed browsing activity such as search and click data, but can also autonomously perform tasks such as filling forms, raising significant privacy concerns. It is crucial to understand the design and operation of GenAI browser extensions, including how they collect, store, process, and share user data. To this end, we study their ability to profile users and personalize their responses based on explicit or inferred demographic attributes and interests of users. We perform network traffic analysis and use a novel prompting framework to audit tracking, profiling, and personalization by the ten most popular GenAI browser assistant extensions. We find that instead of relying on local in-browser models, these assistants largely depend on server-side APIs, which can be auto-invoked without explicit user interaction. When invoked, they collect and share webpage content, often the full HTML DOM and sometimes even the user's form inputs, with their first-party servers. Some assistants also share identifiers and user prompts with third-party trackers such as Google Analytics. The collection and sharing continues even if a webpage contains sensitive information such as health or personal information such as name or SSN entered in a web form. We find that several GenAI browser assistants infer demographic attributes such as age, gender, income, and interests and use this profile--which carries across browsing contexts--to personalize responses. In summary, our work shows that GenAI browser assistants can and do collect personal and sensitive information for profiling and personalization with little to no safeguards.

Paperid: 1389, https://arxiv.org/pdf/2503.16548.pdf

Abstract:
Large Language Models (LLMs) have substantially improved the conversational capabilities of social robots. Nevertheless, for an intuitive and fluent human-robot interaction, robots should be able to ground the conversation by relating ambiguous or underspecified spoken utterances to the current physical situation and to the intents expressed nonverbally by the user, for example by using referential gaze. Here we propose a representation integrating speech and gaze to enable LLMs to obtain higher situated awareness and correctly resolve ambiguous requests. Our approach relies on a text-based semantic translation of the scanpath produced by the user along with the verbal requests and demonstrates LLMs' capabilities to reason about gaze behavior, robustly ignoring spurious glances or irrelevant objects. We validate the system across multiple tasks and two scenarios, showing its generality and accuracy, and demonstrate its implementation on a robotic platform, closing the loop from request interpretation to execution.

Paperid: 1390, https://arxiv.org/pdf/2503.16479.pdf

Abstract:
With Highly Automated Driving (HAD), the driver can engage in non-driving-related tasks. In the event of a system failure, the driver is expected to reasonably regain control of the Automated Vehicle (AV). Incorrect system understanding may provoke misuse by the driver and can lead to vehicle-level hazards. ISO 21448, referred to as the standard for Safety of the Intended Functionality (SOTIF), defines misuse as usage of the system by the driver in a way not intended by the system manufacturer. Foreseeable Misuse (FM) implies anticipated system misuse based on the best knowledge about the system design and the driver behaviour. This is the underlying motivation to propose simulation-based testing of FM. The vital challenge is to perform a simulation-based testing for a SOTIF-related misuse scenario. Transverse Guidance Assist System (TGAS) is modelled for HAD. In the context of this publication, TGAS is referred to as the "system," and the driver is the human operator of the system. This publication focuses on implementing the Driver-Vehicle Interface (DVI) that permits the interactions between the driver and the system. The implementation and testing of a derived misuse scenario using the driving simulator ensure reasonable usage of the system by supporting the driver with unambiguous information on system functions and states so that the driver can conveniently perceive, comprehend, and act upon the information.

Paperid: 1391, https://arxiv.org/pdf/2503.16460.pdf

Abstract:
Researchers have made notable progress in applying Large Language Models (LLMs) to solve math problems, as demonstrated through efforts like GSM8k, ProofNet, AlphaGeometry, and MathOdyssey. This progress has sparked interest in their potential use for tutoring students in mathematics. However, the reliability of LLMs in tutoring contexts -- where correctness and instructional quality are crucial -- remains underexplored. Moreover, LLM problem-solving capabilities may not necessarily translate into effective tutoring support for students. In this work, we present two novel approaches to evaluate the correctness and quality of LLMs in math tutoring contexts. The first approach uses an intelligent tutoring system for college algebra as a testbed to assess LLM problem-solving capabilities. We generate benchmark problems using the tutor, prompt a diverse set of LLMs to solve them, and compare the solutions to those generated by the tutor. The second approach evaluates LLMs as tutors rather than problem solvers. We employ human evaluators, who act as students seeking tutoring support from each LLM. We then assess the quality and correctness of the support provided by the LLMs via a qualitative coding process. We applied these methods to evaluate several ChatGPT models, including 3.5 Turbo, 4, 4o, o1-mini, and o1-preview. Our findings show that when used as problem solvers, LLMs generate correct final answers for 85.5% of the college algebra problems tested. When employed interactively as tutors, 90% of LLM dialogues show high-quality instructional support; however, many contain errors -- only 56.6% are entirely correct. We conclude that, despite their potential, LLMs are not yet suitable as intelligent tutors for math without human oversight or additional mechanisms to ensure correctness and quality.

Paperid: 1392, https://arxiv.org/pdf/2503.16457.pdf

Abstract:
The integration of large language models (LLMs) into virtual reality (VR) environments has opened new pathways for creating more immersive and interactive digital humans. By leveraging the generative capabilities of LLMs alongside multimodal outputs such as facial expressions and gestures, virtual agents can simulate human-like personalities and emotions, fostering richer and more engaging user experiences. This paper provides a comprehensive review of methods for enabling digital humans to adopt nuanced personality traits, exploring approaches such as zero-shot, few-shot, and fine-tuning. Additionally, it highlights the challenges of integrating LLM-driven personality traits into VR, including computational demands, latency issues, and the lack of standardized evaluation frameworks for multimodal interactions. By addressing these gaps, this work lays a foundation for advancing applications in education, therapy, and gaming, while fostering interdisciplinary collaboration to redefine human-computer interaction in VR.

Paperid: 1393, https://arxiv.org/pdf/2503.15549.pdf

Abstract:
Ensuring transparency in educational assessment is increasingly critical, particularly post-pandemic, as demand grows for fairer and more reliable evaluation methods. Comparative Judgement (CJ) offers a promising alternative to traditional assessments, yet concerns remain about its perceived opacity. This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process, providing a structured, data-driven approach that improves interpretability and accountability. BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence. By systematically tracking how prior data and successive judgements inform final rankings, BCJ clarifies the assessment process and helps identify assessor disagreements. Multi-criteria BCJ extends this by evaluating multiple learning outcomes (LOs) independently, preserving the richness of CJ while producing transparent, granular rankings aligned with specific assessment goals. It also enables a holistic ranking derived from individual LOs, ensuring comprehensive evaluations without compromising detailed feedback. Using a real higher education dataset with professional markers in the UK, we demonstrate BCJ's quantitative rigour and ability to clarify ranking rationales. Through qualitative analysis and discussions with experienced CJ practitioners, we explore its effectiveness in contexts where transparency is crucial, such as high-stakes national assessments. We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings.
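
To make the idea of assigning probabilities to judgement outcomes concrete, the following is a minimal sketch of a conjugate Beta-Bernoulli update for a single pair of scripts. It is an illustration only: the abstract does not specify the model, so the uniform Beta(1, 1) prior, the `PairBelief` class, and the example outcomes are all assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class PairBelief:
    """Beta(alpha, beta) belief that script A is better than script B.

    Hypothetical helper for illustration; a uniform prior is assumed.
    """
    alpha: float = 1.0  # prior pseudo-count of judgements favouring A
    beta: float = 1.0   # prior pseudo-count of judgements favouring B

    def update(self, a_wins: bool) -> None:
        # Conjugate update: each judgement adds one pseudo-count.
        if a_wins:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        # Posterior probability that A beats B.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Posterior variance, a simple measure of remaining uncertainty.
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))


# Four made-up judgements: A preferred three times, B once.
belief = PairBelief()
for outcome in [True, True, False, True]:
    belief.update(outcome)

print(f"P(A > B) = {belief.mean:.3f}, variance = {belief.variance:.4f}")
```

Tracking the posterior after every judgement shows how each comparison shifts the estimate, and the variance gives a rough signal of where further judgements or assessor disagreement may need attention.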

Paperid: 1394, https://arxiv.org/pdf/2503.15370.pdf

Abstract:
We reappraise the idea of colliding with robots, moving from a position that tries to avoid or mitigate collisions to one that considers them an important facet of human interaction. We report on a soma design workshop that explored how our bodies could collide with telepresence robots, mobility aids, and a quadruped robot. Based on our findings, we employed soma trajectories to analyse collisions as extended experiences that negotiate key transitions of consent, preparation, launch, contact, ripple, sting, untangle, debris and reflect. We then employed these ideas to analyse two collision experiences, an accidental collision between a person and a drone, and the deliberate design of a robot to play with cats, revealing how real-world collisions involve the complex and ongoing entanglement of soma trajectories. We discuss how viewing collisions as entangled trajectories, or tangles, can be used analytically, as a design approach, and as a lens to broach ethical complexity.

Paperid: 1395, https://arxiv.org/pdf/2503.14965.pdf

Abstract:
Visualizations are powerful tools for conveying information but often rely on accompanying text for essential context and guidance. This study investigates the impact of annotation patterns on reader preferences and comprehension accuracy among multilingual populations, addressing a gap in visualization research. We conducted experiments with two groups fluent in English and either Tamil (n = 557) or Arabic (n = 539) across six visualization types, each varying in annotation volume and semantic content. Full-text annotations yielded the highest comprehension accuracy across all languages, while preferences diverged: English readers favored highly annotated charts, whereas Tamil/Arabic readers preferred full-text or minimally annotated versions. Semantic variations in annotations (L1-L4) did not significantly affect comprehension, demonstrating the robustness of text comprehension across languages. English annotations were generally preferred, with a tendency to think technically in English linked to greater aversion to non-English annotations, though this diminished among participants who regularly switched languages internally. Non-English annotations incorporating visual or external knowledge were less favored, particularly in titles. Our findings highlight cultural and educational factors influencing perceptions of visual information, underscoring the need for inclusive annotation practices for diverse linguistic audiences. All data and materials are available at: https://osf.io/ckdb4/.

Paperid: 1396, https://arxiv.org/pdf/2503.14883.pdf

Abstract:
The rapid advancement of Large Language Models (LLMs), reasoning models, and agentic AI approaches coincides with a growing global mental health crisis, where increasing demand has not translated into adequate access to professional support, particularly for underserved populations. This presents a unique opportunity for AI to complement human-led interventions, offering scalable and context-aware support while preserving human connection in this sensitive domain. We explore various AI applications in peer support, self-help interventions, proactive monitoring, and data-driven insights, using a human-centred approach that ensures AI supports rather than replaces human interaction. However, AI deployment in mental health fields presents challenges such as ethical concerns, transparency, privacy risks, and risks of over-reliance. We propose a hybrid ecosystem where AI assists but does not replace human providers, emphasising responsible deployment and evaluation. We also present some of our early work and findings in several of these AI applications. Finally, we outline future research directions for refining AI-enhanced interventions while adhering to ethical and culturally sensitive guidelines.

Paperid: 1397, https://arxiv.org/pdf/2503.13812.pdf

Abstract:
Deliberation is essential to well-functioning democracies, yet physical, economic, and social barriers often exclude certain groups, reducing representativeness and contributing to issues like group polarization. In this work, we explore the use of large language model (LLM) personas to introduce missing perspectives in policy deliberations. We develop and evaluate a tool that transcribes conversations in real-time and simulates input from relevant but absent stakeholders. We deploy this tool in a 19-person student citizens' assembly on campus sustainability. Participants and facilitators found that the tool sparked new discussions and surfaced valuable perspectives they had not previously considered. However, they also noted that AI-generated responses were sometimes overly general. They raised concerns about overreliance on AI for perspective-taking. Our findings highlight both the promise and potential risks of using LLMs to raise missing points of view in group deliberation settings.

Paperid: 1398, https://arxiv.org/pdf/2503.11352.pdf

Abstract:
The ability of robots to recognize human gestures facilitates a natural and accessible human-robot collaboration. However, most work in gesture recognition remains rooted in reference frame-dependent representations. This poses a challenge when reference frames vary due to different work cell layouts, imprecise frame calibrations, or other environmental changes. This paper investigated the use of invariant trajectory descriptors for robust hand palm motion gesture recognition under reference frame changes. First, a novel dataset of recorded Hand Palm Motion (HPM) gestures is introduced. The motion gestures in this dataset were specifically designed to be distinguishable without dependence on specific reference frames or directional cues. Afterwards, multiple invariant trajectory descriptor approaches were benchmarked to assess how their performances generalize to this novel HPM dataset. After this offline benchmarking, the best scoring approach is validated for online recognition by developing a real-time Proof of Concept (PoC). In this PoC, hand palm motion gestures were used to control the real-time movement of a manipulator arm. The PoC demonstrated a high recognition reliability in real-time operation, achieving an $F_1$-score of 92.3%. This work demonstrates the effectiveness of the invariant descriptor approach as a standalone solution. Moreover, we believe that the invariant descriptor approach can also be utilized within other state-of-the-art pattern recognition and learning systems to improve their robustness against reference frame variations.

Paperid: 1399, https://arxiv.org/pdf/2503.09794.pdf

Abstract:
As Augmented Reality (AR) and Artificial Intelligence (AI) continue to converge, new opportunities emerge for AI agents to actively support human collaboration in immersive environments. While prior research has primarily focused on dyadic human-AI interactions, less attention has been given to Human-AI Teams (HATs) in AR, where AI acts as an adaptive teammate rather than a static tool. This position paper takes the perspective of team dynamics and work organization to propose that AI agents in AR should not only interact with individuals but also recognize and respond to team-level needs in real time. We argue that spatially aware AI agents should dynamically generate the resources necessary for effective collaboration, such as virtual blackboards for brainstorming, mental map models for shared understanding, and memory recall of spatial configurations to enhance knowledge retention and task coordination. This approach moves beyond predefined AI assistance toward context-driven AI interventions that optimize team performance and decision-making.

Paperid: 1400, https://arxiv.org/pdf/2503.08539.pdf

Abstract:
Dictation interfaces support efficient text input, but the transcribed text can be hard to read. To understand how users read and review dictated text, we conducted a controlled eye-tracking experiment with 20 participants to compare five dictation interfaces: PLAIN (real-time transcription), AOC (periodic corrections), RAKE (keyword highlights), GP-TSM (grammar-preserving highlights), and SUMMARY (LLM-generated abstraction summary). The study analyzed participants' gaze patterns during their speech composition and reviewing processes. The findings show that during composition, participants spent only 7--11% of their time actively reading, and they favored real-time feedback and avoided distracting interface changes. During reviewing, although SUMMARY introduced unfamiliar words (requiring longer and more frequent fixation), they were easier to read (requiring fewer regressions). Participants preferred SUMMARY for the polished text that preserved fidelity to original meanings. RAKE guided the reading of self-produced text better than GP-TSM. These surprising findings suggest that dictation interfaces could consider showing summaries or key information to support recall instead of raw transcripts.

Paperid: 1401, https://arxiv.org/pdf/2503.07825.pdf

Abstract:
We present an advance in wearable technology: a mobile-optimized, real-time, ultra-low-power event camera system that enables natural hand gesture control for smart glasses, dramatically improving user experience. While hand gesture recognition in computer vision has advanced significantly, critical challenges remain in creating systems that are intuitive, adaptable across diverse users and environments, and energy-efficient enough for practical wearable applications. Our approach tackles these challenges through carefully selected microgestures: lateral thumb swipes across the index finger (in both directions) and a double pinch between thumb and index fingertips. These human-centered interactions leverage natural hand movements, ensuring intuitive usability without requiring users to learn complex command sequences. To overcome variability in users and environments, we developed a novel simulation methodology that enables comprehensive domain sampling without extensive real-world data collection. Our power-optimised architecture maintains exceptional performance, achieving F1 scores above 80% on benchmark datasets featuring diverse users and environments. The resulting models operate at just 6-8 mW when exploiting the Qualcomm Snapdragon Hexagon DSP, with our 2-channel implementation exceeding 70% F1 accuracy and our 6-channel model surpassing 80% F1 accuracy across all gesture classes in user studies. These results were achieved using only synthetic training data. This improves on the state-of-the-art for F1 accuracy by 20% with a 25x power reduction when using DSP. This advancement brings deploying ultra-low-power vision systems in wearable devices closer and opens new possibilities for seamless human-computer interaction.

Paperid: 1402, https://arxiv.org/pdf/2503.07797.pdf

Abstract:
News reading helps individuals stay informed about events and developments in society. Local residents and new immigrants often approach the same news differently, prompting the question of how technology, such as LLM-powered chatbots, can best enhance a reader-oriented news experience. The current paper presents an empirical study involving 144 participants from three groups in Virginia, United States: local residents born and raised there (N=48), Chinese immigrants (N=48), and Vietnamese immigrants (N=48). All participants read local housing news with the assistance of the Copilot chatbot. We collected data on each participant's Q&A interactions with the chatbot, along with their takeaways from news reading. While engaging with the news content, participants in both immigrant groups asked the chatbot fewer analytical questions than the local group. They also demonstrated a greater tendency to rely on the chatbot when formulating practical takeaways. These findings offer insights into technology design that aims to serve diverse news readers.

Paperid: 1403, https://arxiv.org/pdf/2503.07415.pdf

Abstract:
The proliferation of digital mental health (DMH) tracking services promises personalized support, yet accessibility barriers limit equal access. This study investigates blind community experiences with DMH tracking services across the United States as a step toward inclusive health technology design. Working with blind advocacy organizations, we distributed a cross-sectional observational survey (n = 93) and analyzed open-ended responses using Norman and Skinner's eHealth Literacy framework. Our findings reveal significant challenges in navigation, content interpretation, and overall user experience, which impede the blind community's effective engagement with DMH tools. Results highlight the need for adaptive interfaces, accessible tracking strategies, and voice-guided interactions. These insights inform design recommendations for developers and policymakers, promoting more inclusive mental health technologies. By prioritizing accessibility, we make forward progress in ensuring that DMH tracking services fulfill their potential to support mental well-being across diverse user groups, fostering digital equality in mental health care.

Paperid: 1404, https://arxiv.org/pdf/2503.03587.pdf

Abstract:
Protecting online privacy requires users to engage with and comprehend website privacy policies, but many policies are difficult and tedious to read. We present the first qualitative user study on Large Language Model (LLM)-driven privacy policy assessment. To this end, we build and evaluate an LLM-based privacy policy assessment browser extension, which helps users understand the essence of a lengthy, complex privacy policy while browsing. The tool integrates a dashboard and an LLM chat. In our qualitative user study (N=22), we evaluate usability, understandability of the information our tool provides, and its impacts on awareness. While providing a comprehensible quick overview and a chat for in-depth discussion improves privacy awareness, users note issues with building trust in the tool. From our insights, we derive important design implications to guide future policy analysis tools.

Paperid: 1405, https://arxiv.org/pdf/2503.03236.pdf

Abstract:
Existing approaches for color-concept association typically rely on query-based image referencing, and color extraction from image references. However, these approaches are effective only for common concepts, and are vulnerable to unstable image referencing and varying image conditions. Our formative study with designers underscores the need for primary-accent color compositions and context-dependent colors (e.g., 'clear' vs. 'polluted' sky) in design. In response, we introduce a generative approach for mining semantically resonant colors leveraging images generated by text-to-image models. Our insight is that contemporary text-to-image models can resemble visual patterns from large-scale real-world data. The framework comprises three stages: concept instancing produces generative samples using diffusion models, text-guided image segmentation identifies concept-relevant regions within the image, and color association extracts primary colors accompanied by accent colors. Quantitative comparisons with expert designs validate our approach's effectiveness, and we demonstrate the applicability through cases in various design scenarios and a gallery.

Paperid: 1406, https://arxiv.org/pdf/2503.01154.pdf

Abstract:
Intergenerational co-creation using technology between grandparents and grandchildren can be challenging due to differences in technological familiarity. AI has emerged as a promising tool to support co-creative activities, offering flexibility and creative assistance, but its role in facilitating intergenerational connection remains underexplored. In this study, we conducted a user study with 29 grandparent-grandchild groups engaged in AI-supported story creation to examine how AI-assisted co-creation can foster meaningful intergenerational bonds. Our findings show that grandchildren managed the technical aspects, while grandparents contributed creative ideas and guided the storytelling. AI played a key role in structuring the activity, facilitating brainstorming, enhancing storytelling, and balancing the contributions of both generations. The process fostered mutual appreciation, with each generation recognizing the strengths of the other, leading to an engaging and cohesive co-creation process. We offer design implications for integrating AI into intergenerational co-creative activities, emphasizing how AI can enhance connection across skill levels and technological familiarity.

Paperid: 1407, https://arxiv.org/pdf/2502.20754.pdf

Abstract:
We present an approach for acquiring grounded representations of words from mixed-initiative, situated interactions with a human instructor. The work focuses on the acquisition of diverse types of knowledge including perceptual, semantic, and procedural knowledge along with learning grounded meanings. Interactive learning allows the agent to control its learning by requesting instructions about unknown concepts, making learning efficient. Our approach has been instantiated in Soar and has been evaluated on a table-top robotic arm capable of manipulating small objects.

Paperid: 1408, https://arxiv.org/pdf/2502.18689.pdf

Abstract:
Local and federal agencies are rapidly adopting AI systems to augment or automate critical decisions, efficiently use resources, and improve public service delivery. AI systems are being used to support tasks associated with urban planning, security, surveillance, energy and critical infrastructure, and support decisions that directly affect citizens and their ability to access essential services. Local governments act as the governance tier closest to citizens and must play a critical role in upholding democratic values and building community trust especially as it relates to smart city initiatives that seek to transform public services through the adoption of AI. Community-centered and participatory approaches have been central for ensuring the appropriate adoption of technology; however, AI innovation introduces new challenges in this context because participatory AI design methods require more robust formulation and face higher standards for implementation in the public sector compared to the private sector. This requires us to reassess traditional methods used in this space as well as develop new resources and methods. This workshop will explore emerging practices in participatory algorithm design - or the use of public participation and community engagement - in the scoping, design, adoption, and implementation of public sector algorithms.

Paperid: 1409, https://arxiv.org/pdf/2502.18688.pdf

Abstract:
Designing robots to support high-stakes teamwork in emergency settings presents unique challenges, including seamless integration into fast-paced environments, facilitating effective communication among team members, and adapting to rapidly changing situations. While teleoperated robots have been successfully used in high-stakes domains such as firefighting and space exploration, autonomous robots that aid high-stakes teamwork remain underexplored. To address this gap, we conducted a rapid prototyping process to develop a series of seemingly autonomous robots designed to assist clinical teams in the Emergency Room. We transformed a standard crash cart--which stores medical equipment and emergency supplies--into a medical robotic crash cart (MCCR). We evaluated the MCCR through field deployments to assess its impact on team workload and usability, identified taxonomies of failure, and refined the MCCR in collaboration with healthcare professionals. Our work advances the understanding of robot design for high-stakes, time-sensitive settings, providing insights into useful MCCR capabilities and considerations for effective human-robot collaboration. By publicly disseminating our MCCR tutorial, we hope to encourage HRI researchers to explore the design of robots for high-stakes teamwork.

Paperid: 1410, https://arxiv.org/pdf/2502.18683.pdf

Abstract:
AI systems have rapidly advanced, diversified, and proliferated, but our knowledge of people's perceptions of mind and morality in them is limited, despite its importance for outcomes such as whether people trust AIs and how they assign responsibility for AI-caused harms. In a preregistered online study, 975 participants rated 26 AI and non-AI entities. Overall, AIs were perceived to have low-to-moderate agency (e.g., planning, acting), between inanimate objects and ants, and low experience (e.g., sensing, feeling). For example, ChatGPT was rated only as capable of feeling pleasure and pain as a rock. The analogous moral faculties, moral agency (doing right or wrong) and moral patiency (being treated rightly or wrongly) were higher and more varied, particularly moral agency: The highest-rated AI, a Tesla Full Self-Driving car, was rated as morally responsible for harm as a chimpanzee. We discuss how design choices can help manage perceptions, particularly in high-stakes moral contexts.

Paperid: 1411, https://arxiv.org/pdf/2502.18673.pdf

Abstract:
Learning therapeutic counseling involves significant role-play experience with mock patients, with current manual training methods providing only intermittent granular feedback. We seek to accelerate and optimize counselor training by providing frequent, detailed feedback to trainees as they interact with a simulated patient. Our first application domain involves training motivational interviewing skills for counselors. Motivational interviewing is a collaborative counseling style in which patients are guided to talk about changing their behavior, with empathetic counseling an essential ingredient. We developed and evaluated an LLM-powered training system that features a simulated patient and visualizations of turn-by-turn performance feedback tailored to the needs of counselors learning motivational interviewing. We conducted an evaluation study with professional and student counselors, demonstrating high usability and satisfaction with the system. We present design implications for the development of automated systems that train users in counseling skills and their generalizability to other types of social skills training.

Paperid: 1412, https://arxiv.org/pdf/2502.18348.pdf

Abstract:
Accessible design for some may still produce barriers for others. This tension, called access friction, creates challenges for both designers and end-users with disabilities. To address this, we present the concept of softerware, a system design approach that provides end users with agency to meaningfully customize and adapt interfaces to their needs. To apply softerware to visualization, we assembled 195 data visualization customization options centered on the barriers we expect users with disabilities will experience. We built a prototype that applies a subset of these options and interviewed practitioners for feedback. Lastly, we conducted a design probe study with blind and low vision accessibility professionals to learn more about their challenges and visions for softerware. We observed access frictions between our participants' designs, and they expressed that for softerware's success, current and future systems must be designed with accessible defaults, interoperability, persistence, and respect for a user's perceived effort-to-outcome ratio.

Paperid: 1413, https://arxiv.org/pdf/2502.17856.pdf

Abstract:
Virtual YouTubers (VTubers) have recently gained popularity as streamers using computer-generated avatars and real-time motion capture to create distinct virtual identities. While prior research has explored how VTubers construct virtual personas and engage audiences, little attention has been given to viewers' reactions when virtual and real identities blur -- what we refer to as "seams." To address this gap, we conducted a case study on PLAVE, a popular Korean VTuber K-pop idol group, interviewing 24 of their fans. Our findings identified two main sources of seams: technical glitches and identity collapses, where VTubers act inconsistently with their virtual personas, revealing aspects of their real selves. These seams played a pivotal role in shaping diverse fan engagements, with some valuing authenticity linked to real identities, while others prioritized the coherence of virtual personas. Overall, our findings underscore the importance of seams in shaping viewer experiences.

Paperid: 1414, https://arxiv.org/pdf/2502.17855.pdf

Abstract:
While AI's potential in education and professional sports is widely recognized, its application in K-12 physical education (PE) remains underexplored with significant opportunities for innovation. This study aims to address this gap by engaging 17 in-service secondary school PE teachers in group ideation workshops to explore potential AI applications and challenges in PE classes. Participants envisioned AI playing multidimensional roles, such as an operational assistant, personal trainer, group coach, and evaluator, as solutions to address unique instructional and operational challenges in K-12 PE classes. These roles reflected participants' perspectives on how AI could enhance class management, deliver personalized feedback, promote balanced team activities, and streamline performance assessments. Participants also highlighted critical considerations for AI integration, including the need to ensure robust student data security and privacy measures, minimize the risk of over-reliance on AI for instructional decisions, and accommodate the varying levels of technological proficiency among PE teachers. Our findings provide valuable insights and practical guidance for AI developers, educators, and policymakers, offering a foundation for the effective integration of AI into K-12 PE curricula to enhance teaching practices and student outcomes.

Paperid: 1415, https://arxiv.org/pdf/2502.17835.pdf

Abstract:
As programming education becomes more widespread, many college students from non-computer science backgrounds begin learning programming. Collaborative programming emerges as an effective method for instructors to support novice students in developing coding and teamwork abilities. However, due to limited class time and attention, instructors face challenges in monitoring and evaluating the progress and performance of groups or individuals. To address this issue, we collect multimodal data from real-world settings and develop CPVis, an interactive visual analytics system designed to assess student collaboration dynamically. Specifically, CPVis enables instructors to evaluate both group and individual performance efficiently. CPVis employs a novel flower-based visual encoding to represent performance and provides time-based views to capture the evolution of collaborative behaviors. A within-subject experiment (N=22), comparing CPVis with two baseline systems, reveals that users gain more insights, find the visualization more intuitive, and report increased confidence in their assessments of collaboration.

Paperid: 1416, https://arxiv.org/pdf/2502.17480.pdf

Abstract:
Modern neuroprostheses can now restore communication in patients who have lost the ability to speak or move. However, these invasive devices entail risks inherent to neurosurgery. Here, we introduce a non-invasive method to decode the production of sentences from brain activity and demonstrate its efficacy in a cohort of 35 healthy volunteers. For this, we present Brain2Qwerty, a new deep learning architecture trained to decode sentences from either electro- (EEG) or magneto-encephalography (MEG), while participants typed briefly memorized sentences on a QWERTY keyboard. With MEG, Brain2Qwerty reaches, on average, a character-error-rate (CER) of 32% and substantially outperforms EEG (CER: 67%). For the best participants, the model achieves a CER of 19%, and can perfectly decode a variety of sentences outside of the training set. While error analyses suggest that decoding depends on motor processes, the analysis of typographical errors suggests that it also involves higher-level cognitive factors. Overall, these results narrow the gap between invasive and non-invasive methods and thus open the path for developing safe brain-computer interfaces for non-communicating patients.
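
For readers unfamiliar with the metric, the sketch below shows how a character-error-rate is conventionally computed: the Levenshtein edit distance between the reference and decoded sentences, divided by the reference length. The abstract does not define its CER, so this standard definition, the function names, and the example strings are assumptions, not the authors' evaluation code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Made-up example: one substituted character in a five-character reference.
print(cer("hello", "hallo"))
```

Under this definition a CER of 32% means roughly one character in three must be edited to recover the intended sentence; because insertions count as errors, the value can exceed 100% for very poor decodings.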

Paperid: 1417, https://arxiv.org/pdf/2502.17399.pdf

Abstract:
Augmented reality (AR) games, particularly those designed for headsets, have become increasingly prevalent with advancements in both hardware and software. However, the majority of AR games still rely on pre-scanned or static scenes, and interaction mechanisms are often limited to controllers or hand-tracking. Additionally, the presence of identical objects in AR games poses challenges for conventional object tracking techniques, which often struggle to differentiate between identical objects or necessitate the installation of fixed cameras for global object movement tracking. In response to these limitations, we present a novel approach to address the tracking of identical objects in an AR scene to enrich physical-virtual interaction. Our method leverages partial scene observations captured by an AR headset, utilizing the perspective and spatial data provided by this technology. Object identities within the scene are determined through the solution of a label assignment problem using integer programming. To enhance computational efficiency, we incorporate a Voronoi diagram-based pruning method into our approach. Our implementation of this approach in a farm-to-table AR game demonstrates its satisfactory performance and robustness. Furthermore, we showcase the versatility and practicality of our method through applications in AR storytelling and a simulated gaming robot. Our video demo is available at: https://youtu.be/rPGkLYuKvCQ.
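
The label assignment step can be illustrated with a toy example. The sketch below assigns identities to visually identical objects by exhaustively searching for the one-to-one matching that minimises total displacement from their last known positions. This brute-force search is a stand-in for the integer program and does not include the Voronoi-based pruning; the displacement cost, the `assign_labels` function, and the object names are hypothetical and not taken from the paper.

```python
from itertools import permutations
from math import dist


def assign_labels(known, observed):
    """Match each known label to one observed position.

    known:    dict mapping label -> (x, y) last known position
    observed: list of (x, y) positions seen in the current frame
    Returns a dict mapping label -> index into `observed` that minimises
    the summed Euclidean displacement (assumed cost, for illustration).
    """
    labels = list(known)
    if len(labels) != len(observed):
        raise ValueError("this sketch assumes every object is observed once")
    best_cost, best_perm = float("inf"), None
    # Exhaustive search is only feasible for a handful of objects.
    for perm in permutations(range(len(observed))):
        cost = sum(dist(known[label], observed[idx])
                   for label, idx in zip(labels, perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(zip(labels, best_perm))


# Two identical objects; the new observations arrive in swapped order.
known = {"carrot_1": (0.0, 0.0), "carrot_2": (5.0, 0.0)}
observed = [(4.5, 0.2), (0.3, -0.1)]
assignment = assign_labels(known, observed)
print(assignment)
```

Exhaustive search grows factorially with the number of objects, which is one reason an integer programming formulation with pruning, as described in the abstract, is more practical at scale.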

Paperid: 1418, https://arxiv.org/pdf/2502.17295.pdf

Abstract:
How might healthcare workers (HCWs) leverage augmented reality head-mounted displays (AR-HMDs) to enhance teamwork? Although AR-HMDs have shown immense promise in supporting teamwork in healthcare settings, design for Emergency Department (ER) teams has received little attention. The ER presents unique challenges, including procedural recall, medical errors, and communication gaps. To address this gap, we engaged in a participatory design study with healthcare workers to gain a deep understanding of the potential for AR-HMDs to facilitate teamwork during ER procedures. Our results reveal that AR-HMDs can be used as an information-sharing and information-retrieval system to bridge knowledge gaps, and they surface concerns about integrating AR-HMDs in ER workflows. We contribute design recommendations for seven role-based AR-HMD application scenarios involving HCWs with various expertise, working across multiple medical tasks. We hope our research inspires designers to embark on the development of new AR-HMD applications for high-stakes, team environments.

Paperid: 1419, https://arxiv.org/pdf/2502.15227.pdf

Abstract:
Mitigating cybersickness can improve the usability of virtual reality (VR) and increase its adoption. The most widely used technique, dynamic field-of-view (FOV) restriction, mitigates cybersickness by blacking out the peripheral region of the user's FOV. However, this approach reduces the visibility of the virtual environment. We propose peripheral teleportation, a novel technique that creates a rest frame (RF) in the user's peripheral vision using content rendered from the current virtual environment. Specifically, the peripheral region is rendered by a pair of RF cameras whose transforms are updated by the user's physical motion. We apply alternating teleportations during translations, or snap turns during rotations, to the RF cameras to keep them close to the current viewpoint transformation. Consequently, the optical flow generated by RF cameras matches the user's physical motion, creating a stable peripheral view. In a between-subjects study (N = 90), we compared peripheral teleportation with a traditional black FOV restrictor and an unrestricted control condition. The results showed that peripheral teleportation significantly reduced discomfort and enabled participants to stay immersed in the virtual environment for a longer duration of time. Overall, these findings suggest that peripheral teleportation is a promising technique that VR practitioners may consider adding to their cybersickness mitigation toolset.

Paperid: 1420, https://arxiv.org/pdf/2502.14747.pdf

Abstract:
Concept designers in the entertainment industry create highly detailed, often imaginary environments for movies, games, and TV shows. Their early ideation phase requires intensive research, brainstorming, visual exploration, and combination of various design elements to form cohesive designs. However, existing AI tools focus on image generation from user specifications, lacking support for the unique needs and complexity of concept designers' workflows. Through a formative study with 12 professional designers, we captured their workflows and identified key requirements for AI-assisted ideation tools. Leveraging these insights, we developed AIdeation to support early ideation by brainstorming design concepts with flexible searching and recombination of reference images. A user study with 16 professional designers showed that AIdeation significantly enhanced creativity, ideation efficiency, and satisfaction (all p<.01) compared to current tools and workflows. A field study with 4 studios for 1 week provided insights into AIdeation's benefits and limitations in real-world projects. After the completion of the field study, two studios, covering films, television, and games, have continued to use AIdeation in their commercial projects to date, further validating AIdeation's improvement in ideation quality and efficiency.

Paperid: 1421, https://arxiv.org/pdf/2502.11989.pdf

Abstract:
Diffusion model-generated images can appear indistinguishable from authentic photographs, but these images often contain artifacts and implausibilities that reveal their AI-generated provenance. Given the challenge to public trust in media posed by photorealistic AI-generated images, we conducted a large-scale experiment measuring human detection accuracy on 450 diffusion-model generated images and 149 real images. Based on collecting 749,828 observations and 34,675 comments from 50,444 participants, we find that scene complexity of an image, artifact types within an image, display time of an image, and human curation of AI-generated images all play significant roles in how accurately people distinguish real from AI-generated images. Additionally, we propose a taxonomy characterizing artifacts often appearing in images generated by diffusion models. Our empirical observations and taxonomy offer nuanced insights into the capabilities and limitations of diffusion models to generate photorealistic images in 2024.

Paperid: 1422, https://arxiv.org/pdf/2502.11919.pdf

Abstract:
AI-assisted decision making is becoming increasingly prevalent, yet individuals often fail to utilize AI-based decision aids appropriately, especially when AI explanations are absent, potentially because they do not reflect on AI's decision recommendations critically. Large language models (LLMs), with their exceptional conversational and analytical capabilities, present great opportunities to enhance AI-assisted decision making in the absence of AI explanations by providing natural-language-based analysis of AI's decision recommendation, e.g., how each feature of a decision making task might contribute to the AI recommendation. In this paper, via a randomized experiment, we first show that presenting LLM-powered analysis of each task feature, either sequentially or concurrently, does not significantly improve people's AI-assisted decision performance. To enable decision makers to better leverage LLM-powered analysis, we then propose an algorithmic framework to characterize the effects of LLM-powered analysis on human decisions and dynamically decide which analysis to present. Our evaluation with human subjects shows that this approach effectively improves decision makers' appropriate reliance on AI in AI-assisted decision making.

Paperid: 1423, https://arxiv.org/pdf/2502.11720.pdf

Abstract:
Indirect speech acts (ISAs) are a natural pragmatic feature of human communication, allowing requests to be conveyed implicitly while maintaining subtlety and flexibility. Although advancements in speech recognition have enabled natural language interactions with robots through direct, explicit commands -- providing clarity in communication -- the rise of large language models presents the potential for robots to interpret ISAs. However, empirical evidence on the effects of ISAs on human-robot collaboration (HRC) remains limited. To address this, we conducted a Wizard-of-Oz study (N=36), engaging a participant and a robot in collaborative physical tasks. Our findings indicate that robots capable of understanding ISAs significantly improve humans' perceived robot anthropomorphism, team performance, and trust. However, the effectiveness of ISAs is task- and context-dependent, thus requiring careful use. These results highlight the importance of appropriately integrating direct and indirect requests in HRC to enhance collaborative experiences and task performance.

Paperid: 1424, https://arxiv.org/pdf/2502.09799.pdf

Abstract:
The emergence of generative AI, particularly large language models (LLMs), has opened the door for student-centered and active learning methods like project-based learning (PBL). However, PBL poses practical implementation challenges for educators around project design and management, assessment, and balancing student guidance with student autonomy. The following research documents a co-design process with interdisciplinary K-12 teachers to explore and address the current PBL challenges they face. Through teacher-driven interviews, collaborative workshops, and iterative design of wireframes, we gathered evidence for ways LLMs can support teachers in implementing high-quality PBL pedagogy by automating routine tasks and enhancing personalized learning. Teachers in the study advocated for supporting their professional growth and augmenting their current roles without replacing them. They also identified affordances and challenges around classroom integration, including resource requirements and constraints, ethical concerns, and potential immediate and long-term impacts. Drawing on these, we propose design guidelines for future deployment of LLM tools in PBL.

Paperid: 1425, https://arxiv.org/pdf/2502.09282.pdf

Abstract:
Remote sensing image captioning aims to generate descriptive text from remote sensing images, typically employing an encoder-decoder framework. In this setup, a convolutional neural network (CNN) extracts feature representations from the input image, which then guide the decoder in a sequence-to-sequence caption generation process. Although much research has focused on refining the decoder, the quality of image representations from the encoder remains crucial for accurate captioning. This paper introduces a novel approach that integrates features from two distinct CNN-based encoders, capturing complementary information to enhance caption generation. Additionally, we propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder. Furthermore, a comparison-based beam search strategy is incorporated to refine caption selection. The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.

Paperid: 1426, https://arxiv.org/pdf/2502.09222.pdf

Abstract:
We present clinguin, a system for ASP-driven user interface design. Clinguin streamlines the development of user interfaces for ASP developers by letting them build interactive prototypes directly in ASP, eliminating the need for separate frontend languages. To this end, clinguin uses a few dedicated predicates to define user interfaces and the treatment of user-triggered events. This simple design greatly facilitates the specification of user interactions with an ASP system, in our case clingo.

Paperid: 1427, https://arxiv.org/pdf/2502.09043.pdf

Abstract:
Homelessness systems in North America adopt coordinated data-driven approaches to efficiently match support services to clients based on their assessed needs and available resources. AI tools are increasingly being implemented to allocate resources, reduce costs and predict risks in this space. In this study, we conducted an ethnographic case study on the City of Toronto's homelessness system's data practices across different critical points. We show how the City's data practices offer standardized processes for client care but frontline workers also engage in heuristic decision-making in their work to navigate uncertainties, client resistance to sharing information, and resource constraints. From these findings, we show the temporality of client data, which constrains the validity of predictive AI models. Additionally, we highlight how the City adopts an iterative and holistic client assessment approach, which contrasts with commonly used risk assessment tools in homelessness, providing future directions to design holistic decision-making tools for homelessness.

Paperid: 1428, https://arxiv.org/pdf/2502.07441.pdf

Abstract:
Personal space, also known as peripersonal space, is crucial in human social interaction, influencing comfort, communication, and social stress. Estimating and respecting personal space is essential for enhancing human-computer interaction (HCI) and smart environments. Personal space preferences vary due to individual traits, cultural background, and contextual factors. Advanced multimodal sensing technologies, including eye-tracking and wristband sensors, offer opportunities to develop adaptive systems that dynamically adjust to user comfort levels. Integrating physiological and behavioral data enables a deeper understanding of spatial interactions. This study develops a sensor-based model to estimate comfortable personal space and identifies key features influencing spatial preferences. Our findings show that multimodal sensors, particularly eye-tracking and physiological wristband data, can effectively predict personal space preferences, with eye-tracking data playing a more significant role. An experimental study involving controlled human interactions demonstrates that a Transformer-based model achieves the highest predictive accuracy (F1 score: 0.87) for estimating personal space. Eye-tracking features, such as gaze point and pupil diameter, emerge as the most significant predictors, while physiological signals from wristband sensors contribute marginally. These results highlight the potential for AI-driven personalization of social space in adaptive environments, suggesting that multimodal sensing can be leveraged to develop intelligent systems that optimize spatial arrangements in workplaces, educational institutions, and public settings. Future work should explore larger datasets, real-world applications, and additional physiological markers to enhance model robustness.

Paperid: 1429, https://arxiv.org/pdf/2502.01390.pdf

Abstract:
Since the explosion in popularity of ChatGPT, large language models (LLMs) have continued to impact our everyday lives. Equipped with external tools that are designed for a specific purpose (e.g., for flight booking or an alarm clock), LLM agents exercise an increasing capability to assist humans in their daily work. Although LLM agents have shown a promising blueprint as daily assistants, there is a limited understanding of how they can provide daily assistance based on planning and sequential decision making capabilities. We draw inspiration from recent work that has highlighted the value of 'LLM-modulo' setups in conjunction with humans-in-the-loop for planning tasks. We conducted an empirical study (N = 248) of LLM agents as daily assistants in six commonly occurring tasks with different levels of risk typically associated with them (e.g., flight ticket booking and credit card payments). To ensure user agency and control over the LLM agent, we adopted LLM agents in a plan-then-execute manner, wherein the agents conducted step-wise planning and step-by-step execution in a simulation environment. We analyzed how user involvement at each stage affects their trust and collaborative team performance. Our findings demonstrate that LLM agents can be a double-edged sword -- (1) they can work well when a high-quality plan and necessary user involvement in execution are available, and (2) users can easily mistrust the LLM agents with plans that seem plausible. We synthesized key insights for using LLM agents as daily assistants to calibrate user trust and achieve better overall task outcomes. Our work has important implications for the future design of daily assistants and human-AI collaboration with LLM agents.

Paperid: 1430, https://arxiv.org/pdf/2502.00702.pdf

Abstract:
Online Cardiac Monitoring (OCM) emerges as a compelling enhancement for the next-generation video streaming platforms. It enables various applications including remote health, online affective computing, and deepfake detection. Yet the physiological information encapsulated in the video streams has long been neglected. In this paper, we present the design and implementation of CardioLive, the first online cardiac monitoring system in video streaming platforms. We leverage the naturally co-existing video and audio streams and devise CardioNet, the first audio-visual network to learn the cardiac series. It incorporates multiple unique designs to extract temporal and spectral features, ensuring robust performance under realistic video streaming conditions. To enable the Service-On-Demand online cardiac monitoring, we implement CardioLive as a plug-and-play middleware service and develop systematic solutions to practical issues including changing FPS and unsynchronized streams. Extensive experiments have been done to demonstrate the effectiveness of our system. We achieve a Mean Absolute Error (MAE) of 1.79 BPM, outperforming the video-only and audio-only solutions by 69.2% and 81.2%, respectively. Our CardioLive service achieves average throughputs of 115.97 and 98.16 FPS when implemented in Zoom and YouTube. We believe our work opens up new applications for video stream systems. We will release the code soon.

Paperid: 1431, https://arxiv.org/pdf/2501.16641.pdf

Abstract:
In hours-long meeting scenarios, real-time speech stream often struggles with achieving accurate speaker diarization, commonly leading to speaker identification and speaker count errors. To address this challenge, we propose SCDiar, a system that operates on speech segments, split at the token level by a speaker change detection (SCD) module. Building on these segments, we introduce several enhancements to efficiently select the best available segment for each speaker. These improvements lead to significant gains across various benchmarks. Notably, on real-world meeting data involving more than ten participants, SCDiar outperforms previous systems by up to 53.6% in accuracy, substantially narrowing the performance gap between online and offline systems.

Paperid: 1432, https://arxiv.org/pdf/2501.16240.pdf

Abstract:
Unlike the free exploration of childhood, the demands of daily life reduce our motivation to explore our surroundings, leading to missed opportunities for informal learning. Traditional tools for knowledge acquisition are reactive, relying on user initiative and limiting their ability to uncover hidden interests. Through formative studies, we introduce AiGet, a proactive AI assistant integrated with AR smart glasses, designed to seamlessly embed informal learning into low-demand daily activities (e.g., casual walking and shopping). AiGet analyzes real-time user gaze patterns, environmental context, and user profiles, leveraging large language models to deliver personalized, context-aware knowledge with low disruption to primary tasks. In-lab evaluations and real-world testing, including continued use over multiple days, demonstrate AiGet's effectiveness in uncovering overlooked yet surprising interests, enhancing primary task enjoyment, reviving curiosity, and deepening connections with the environment. We further propose design guidelines for AI-assisted informal learning, focused on transforming everyday moments into enriching learning experiences.

Paperid: 1433, https://arxiv.org/pdf/2501.16033.pdf

Abstract:
Protecting online privacy requires users to engage with and comprehend website privacy policies, but many policies are difficult and tedious to read. We present PRISMe (Privacy Risk Information Scanner for Me), a novel Large Language Model (LLM)-driven privacy policy assessment tool, which helps users to understand the essence of a lengthy, complex privacy policy while browsing. The tool, a browser extension, integrates a dashboard and an LLM chat. One major contribution is the first rigorous evaluation of such a tool. In a mixed-methods user study (N=22), we evaluate PRISMe's efficiency, usability, understandability of the provided information, and impacts on awareness. While our tool improves privacy awareness by providing a comprehensible quick overview and a quality chat for in-depth discussion, users note issues with consistency and building trust in the tool. From our insights, we derive important design implications to guide future policy analysis tools.

Paperid: 1434, https://arxiv.org/pdf/2501.15864.pdf

Abstract:
Facial expression recognition (FER) has emerged as a promising approach to the development of emotion-aware intelligent agents and systems. However, key challenges remain in utilizing FER in real-world contexts, including ensuring user understanding and establishing a suitable level of user trust. We developed a novel explanation method utilizing Facial Action Units (FAUs) to explain the output of a FER model through both textual and visual modalities. We conducted an empirical user study evaluating user understanding and trust, comparing our approach to state-of-the-art eXplainable AI (XAI) methods. Our results indicate that visual AND textual as well as textual-only FAU-based explanations resulted in better user understanding of the FER model. We also show that all modalities of FAU-based methods improved appropriate trust of the users towards the FER model.

Paperid: 1435, https://arxiv.org/pdf/2501.14327.pdf

Abstract:
Accessing visual information is crucial yet challenging for people with low vision due to visual conditions like low visual acuity and limited visual fields. However, unlike blind people, low vision people have and prefer using their functional vision in daily tasks. Gaze patterns thus become an important indicator to uncover their visual challenges and intents, inspiring more adaptive visual support. We seek to deeply understand low vision users' gaze behaviors in different image-viewing tasks, characterizing typical visual intents and the unique gaze patterns exhibited by people with different low vision conditions. We conducted a retrospective think-aloud study using eye tracking with 20 low vision participants and 20 sighted controls. Participants completed various image-viewing tasks and watched the playback of their gaze trajectories to reflect on their visual experiences. Based on the study, we derived a visual intent taxonomy with five visual intents characterized by participants' gaze behaviors. We demonstrated the difference between low vision and sighted participants' gaze behaviors and how visual ability affected low vision participants' gaze patterns across visual intents. Our findings underscore the importance of combining visual ability information, visual context, and eye tracking data in visual intent recognition, setting up a foundation for intent-aware assistive technologies for low vision people.

Paperid: 1436, https://arxiv.org/pdf/2501.14179.pdf

Abstract:
With the rapid expansion of large language model (LLM) applications, there is an emerging shift in the role of LLM-based AI chatbots from serving merely as general inquiry tools to acting as professional service agents. However, current studies often overlook a critical aspect of professional service agents: the act of communicating in a manner consistent with their professional identities. This is of particular importance in the healthcare sector, where effective communication with patients is essential for achieving professional goals, such as promoting patient well-being by encouraging healthy behaviors. To bridge this gap, we propose LAPI (LLM-based Agent with a Professional Identity), a novel framework for designing professional service agents tailored for medical question-and-answer (Q&A) services, ensuring alignment with a specific professional identity. Our method includes a theory-guided task planning process that decomposes complex professional tasks into manageable subtasks aligned with professional objectives and a pragmatic entropy method designed to generate professional and ethical responses with low uncertainty. Experiments on various LLMs show that the proposed approach outperforms baseline methods, including few-shot prompting and chain-of-thought prompting, across key metrics such as fluency, naturalness, empathy, patient-centricity, and ROUGE-L scores. Additionally, the ablation study underscores the contribution of each component to the overall effectiveness of the approach.

Paperid: 1437, https://arxiv.org/pdf/2501.13806.pdf

Abstract:
Learning Objects represent a widespread approach to structuring instructional materials in a large variety of educational contexts. The main aim of this work consists of analyzing from a qualitative point of view the process of generating reusable learning objects (RLOs) followed by Clavy, a tool that can be used to retrieve data from multiple medical knowledge sources and reconfigure such sources in diverse multimedia-based structures and organizations. From these organizations, Clavy is able to generate learning objects which can be adapted to various instructional healthcare scenarios with several types of user profiles and distinct learning requirements. Moreover, Clavy provides the capability of exporting these learning objects through educational standard specifications, which improves their reusability features. The analysis insights highlight the importance of having a tool able to transfer knowledge from the available digital medical collections to learning objects that can be easily accessed by medical students and healthcare practitioners through the most popular e-learning platforms.

Paperid: 1438, https://arxiv.org/pdf/2501.13308.pdf

Abstract:
Obsessive-compulsive disorder (OCD) is a mental health condition that significantly impacts people's quality of life. While evidence-based therapies such as exposure and response prevention (ERP) can be effective, managing OCD symptoms in everyday life -- an essential part of treatment and independent living -- remains challenging due to fear confrontation and lack of appropriate support. To better understand the challenges and needs in OCD self-management, we conducted interviews with 10 participants with diverse OCD conditions and seven therapists specializing in OCD treatment. Through these interviews, we explored the characteristics of participants' triggers and how they shaped their compulsions, and uncovered key coping strategies across different stages of OCD episodes. Our findings highlight critical gaps between OCD self-management needs and currently available support. Building on these insights, we propose design opportunities for just-in-time self-management technologies for OCD, including personalized symptom tracking, just-in-time interventions, and support for OCD-specific privacy and social needs -- through technology and beyond.

Paperid: 1439, https://arxiv.org/pdf/2501.13258.pdf

Abstract:
Augmented Reality (AR) is a promising medium for guiding users through tasks, yet its impact on fostering deeper task understanding remains underexplored. This paper investigates the impact of reflective prompts -- strategic questions that encourage users to challenge assumptions, connect actions to outcomes, and consider hypothetical scenarios -- on task comprehension and performance. We conducted a two-phase study: a formative survey and co-design sessions (N=9) to develop reflective prompts, followed by a within-subject evaluation (N=16) comparing AR instructions with and without these prompts in coffee-making and circuit assembly tasks. Our results show that reflective prompts significantly improved objective task understanding and resulted in more proactive information acquisition behaviors during task completion. These findings highlight the potential of incorporating reflective elements into AR instructions to foster deeper engagement and learning. Based on data from both studies, we synthesized design guidelines for integrating reflective elements into AR systems to enhance user understanding without compromising task performance.

Paperid: 1440, https://arxiv.org/pdf/2501.13020.pdf

Abstract:
Video-sharing platforms (VSPs) have become increasingly important for individuals with ADHD to recognize symptoms, acquire knowledge, and receive support. While videos offer rich information and high engagement, they also present unique challenges, such as information quality and accessibility issues for users with ADHD. However, little work has thoroughly examined the video content quality and accessibility issues, the impact, and the control strategies in the ADHD community. We fill this gap by systematically collecting 373 ADHD-relevant videos with comments from YouTube and TikTok and analyzing the data with a mixed method. Our study identified the characteristics of ADHD-relevant videos on VSPs (e.g., creator types, video presentation forms, quality issues) and revealed the collective efforts of creators and viewers in video quality control, such as authority building, collective quality checking, and accessibility improvement. We further derive actionable design implications for VSPs to offer more reliable and ADHD-friendly content.

Paperid: 1441, https://arxiv.org/pdf/2501.12193.pdf

Abstract:
Cardiovascular disease (CVD) remains a leading cause of death, and primary prevention through personalized interventions is crucial. This paper introduces MyDigiTwin, a framework that integrates health digital twins with personal health environments to empower patients in exploring personalized health scenarios while ensuring data privacy. MyDigiTwin uses federated learning to train predictive models across distributed datasets without transferring raw data, and a novel data harmonization framework addresses semantic and format inconsistencies in health data. A proof-of-concept demonstrates the feasibility of harmonizing and using cohort data to train privacy-preserving CVD prediction models. This framework offers a scalable solution for proactive, personalized cardiovascular care and sets the stage for future applications in real-world healthcare settings.
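
To make the federated learning step concrete, here is a minimal sketch of federated averaging on a toy one-dimensional linear model: each site runs gradient descent on its own records and shares only the resulting weights, which a coordinator averages in proportion to local dataset size. The abstract does not describe MyDigiTwin's models or aggregation rule, so the FedAvg-style scheme, the toy model, and the names `local_update`, `federated_average` and `federated_round` are all assumptions for illustration.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Run gradient descent on mean squared error for y = w * x + b,
    using only this site's (x, y) records."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average weight tuples, weighting each site by its record count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return tuple(
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    )


def federated_round(global_weights, client_datasets):
    """One round: every site trains locally, then only weights are pooled.
    Raw records never leave the site that holds them."""
    updates = [local_update(global_weights, d) for d in client_datasets]
    sizes = [len(d) for d in client_datasets]
    return federated_average(updates, sizes)
```

Calling `federated_round` repeatedly, starting from `(0.0, 0.0)`, moves the shared weights toward a fit of the pooled data even though no site sees another site's records. Real deployments typically add secure aggregation and handle non-identical data distributions, which this sketch omits.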

Paperid: 1442, https://arxiv.org/pdf/2501.11556.pdf

Abstract:
Many users of wrist-worn wearable fitness trackers encounter the data-expectation gap - mismatches between data and expectations. While we know such discrepancies exist, we are no closer to designing technologies that can address their negative effects. This is largely because encounters with mismatches are typically treated unidimensionally, while they may differ in context and implications. This treatment does not allow the design of human-data interaction (HDI) mechanisms accounting for temporal, social, emotional, and other factors potentially influencing the perception of mismatches. To address this problem, we present a vocabulary that describes the breadth and context-bound character of encounters with the data-expectation gap, drawing from findings from two studies. Our work contributes to Personal Informatics research providing knowledge on how encounters with the data-expectation gap are embedded in people's daily lives, and a vocabulary encapsulating this knowledge, which can be used when designing HDI experiences in wearable fitness trackers.

Paperid: 1443, https://arxiv.org/pdf/2501.10553.pdf

Abstract:
Video conferencing meetings are more effective when they are inclusive, but inclusion often hinges on meeting leaders' and/or co-facilitators' practices. AI systems can be designed to improve meeting inclusion at scale by moderating negative meeting behaviors and supporting meeting leaders. We explored this design space by conducting 9 user-centered ideation sessions, instantiating design insights in a prototype "virtual co-host" system, and testing the system in a formative exploratory lab study (n=68 across 12 groups, 18 interviews). We found that ideation session participants wanted AI agents to ask questions before intervening, which we formalized as the "Observe, Ask, Intervene" (OAI) framework. Participants who used our prototype preferred OAI over fully autonomous intervention, but rationalized away the virtual co-host's critical feedback. From these findings, we derive guidelines for designing AI agents to influence behavior and mediate group work. We also contribute methodological and design guidelines specific to mitigating inequitable meeting participation.

Paperid: 1444, https://arxiv.org/pdf/2501.09862.pdf

Abstract:
Data workers may have a different mental model of their data than the one reified in code. Understanding the organization of their data is necessary for analyzing data, be it through scripting, visualization or abstract thought. More complicated organizations, such as tables with attached hierarchies, may tax people's ability to think about and interact with data. To better understand and ultimately design for these situations, we conduct a study across a team of ten people working with the same reified data model. Through interviews and sketching, we probed their conception of the data model and developed themes through reflexive data analysis. Participants had diverse data models that differed from the reified data model, even among team members who had designed the model, resulting in parallel hazards limiting their ability to reason about the data. From these observations, we suggest potential design interventions for data analysis processes and tools.

Paperid: 1445, https://arxiv.org/pdf/2501.08507.pdf

Abstract:
In human-robot teams, human situational awareness is the operator's conscious knowledge of the team's states, actions, plans and their environment. Appropriate human situational awareness is critical to successful human-robot collaboration. In human-robot teaming, it is often assumed that the best and required level of situational awareness is knowing everything at all times. This view is problematic, because what a human needs to know for optimal team performance varies given the dynamic environmental conditions, task context and roles and capabilities of team members. We explore this topic by interviewing 16 participants with active and repeated experience in diverse human-robot teaming applications. Based on analysis of these interviews, we derive a framework explaining the dynamic nature of required situational awareness in human-robot teaming. In addition, we identify a range of factors affecting the dynamic nature of required and actual levels of situational awareness (i.e., dynamic situational awareness), types of situational awareness inefficiencies resulting from gaps between actual and required situational awareness, and their main consequences. We also reveal various strategies, initiated by humans and robots, that assist in maintaining the required situational awareness. Our findings inform the implementation of accurate estimates of dynamic situational awareness and the design of user-adaptive human-robot interfaces. Therefore, this work contributes to the future design of more collaborative and effective human-robot teams.

Paperid: 1446, https://arxiv.org/pdf/2501.07536.pdf

Abstract:
Artificial intelligence has been integrated into nearly every aspect of daily life, powering applications from object detection with computer vision to large language models for writing emails and compact models for use in smart homes. These machine learning models at times cater to the needs of individual users but are often detached from them, as they are typically stored and processed in centralized data centers. This centralized approach raises privacy concerns, incurs high infrastructure costs, and struggles to provide real time, personalized experiences. Federated and fully decentralized learning methods have been proposed to address these issues, but they still depend on centralized servers or face slow convergence due to communication constraints. We propose ML Mule, an approach that utilizes individual mobile devices as 'mules' to train and transport model snapshots as the mules move through physical spaces, sharing these models with the physical 'spaces' the mules inhabit. This method implicitly forms affinity groups among devices associated with users who share particular spaces, enabling collaborative model evolution and protecting users' privacy. Our approach addresses several major shortcomings of traditional, federated, and fully decentralized learning systems. ML Mule represents a new class of machine learning methods that are more robust, distributed, and personalized, bringing the field closer to realizing the original vision of intelligent, adaptive, and genuinely context-aware smart environments. Our results show that ML Mule converges faster and achieves higher model accuracy compared to other existing methods.

Paperid: 1447, https://arxiv.org/pdf/2501.04905.pdf

Abstract:
During virtual navigation, users exhibit varied interaction and navigation behaviors influenced by several factors. Existing theories and models have been developed to explain and predict these diverse patterns. While users often experience uncomfortable sensations, such as cybersickness, during virtual reality (VR) use, they do not always make optimal decisions to mitigate these effects. Although methods like reinforcement learning have been used to model decision-making processes, they typically rely on random selection to simulate actions, failing to capture the complexities of real navigation behavior. In this study, we propose curiosity as a key factor driving irrational decision-making, suggesting that users continuously balance exploration and cybersickness according to the free energy principle during virtual navigation. Our findings show that VR users generally adopt conservative strategies when navigating, with most participants displaying negative curiosity across trials. However, curiosity levels tend to rise when the virtual environment changes, illustrating the dynamic interplay between exploration and discomfort. This study provides a quantitative approach to decoding curiosity-driven behavior during virtual navigation, offering insights into how users balance exploration and the avoidance of cybersickness. Future research will further refine this model by incorporating additional psychological and environmental factors to improve the accuracy of navigation pattern predictions.

Paperid: 1448, https://arxiv.org/pdf/2501.04464.pdf

Abstract:
This article explores the use of a location-aware mid-air gesture-based command triplet syntax to interact with a smart space. The syntax, inspired by human language, is built as a vocative case with an imperative structure. In a sentence like 'Light, please switch on', the object being activated is invoked via making a gesture that mimics its initial letter/acronym (vocative, coincident with the sentence's elliptical subject). A geometrical or directional gesture then identifies the action (imperative verb) and may include an object feature or a second object with which to network (complement), which is also represented by the initial or acronym letter. Technically, an interpreter relying on a trainable multidevice gesture recognition layer makes the pair/triplet syntax decoding possible. The recognition layer works on acceleration and position input signals from graspable (smartphone) and free-hand devices (smartwatch and external depth cameras), as well as a specific compiler. On a specific deployment at a Living Lab facility, the syntax has been instantiated via the use of a lexicon derived from English (with respect to the initial letters and acronyms). A within-subject analysis with twelve users has enabled the analysis of the syntax acceptance (in terms of usability, gesture agreement for actions over objects, and social acceptance) and technology preference of the gesture syntax within its three device implementations (graspable, wearable, and device-free ones). Participants express consensus regarding the simplicity of learning the syntax and its potential effectiveness in managing smart resources. Socially, participants favoured the Watch for outdoor activities and the Phone for home and work settings, underscoring the importance of social context in technology design. The Phone emerged as the preferred option for gesture recognition due to its efficiency and familiarity.

Paperid: 1449, https://arxiv.org/pdf/2501.00861.pdf

Abstract:
In light of the growing proportion of older individuals in our society, the timely diagnosis of Alzheimer's disease has become a crucial aspect of healthcare. In this paper, we propose a non-invasive and cost-effective detection method based on speech technology. The method employs a pre-trained language model in conjunction with techniques such as prompt fine-tuning and conditional learning, thereby enhancing the accuracy and efficiency of the detection process. To address the issue of limited computational resources, this study employs the efficient LORA fine-tuning method to construct the classification model. Following multiple rounds of training and rigorous 10-fold cross-validation, the prompt fine-tuning strategy based on the LLAMA2 model demonstrated an accuracy of 81.31%, representing a 4.46% improvement over the control group employing the BERT model. This study offers a novel technical approach for the early diagnosis of Alzheimer's disease and provides valuable insights into model optimization and resource utilization under similar conditions. It is anticipated that this method will prove beneficial in clinical practice and applied research, facilitating more accurate and efficient screening and diagnosis of Alzheimer's disease.

Paperid: 1450, https://arxiv.org/pdf/2506.23180.pdf

Abstract:
Improvisation training for actors presents unique challenges, particularly in maintaining narrative coherence and managing cognitive load during performances. Previous research on AI in improvisation performance often predates advances in large language models (LLMs) and relies on human intervention. We introduce ImprovMate, which leverages LLMs as GPTs to automate the generation of narrative stimuli and cues, allowing actors to focus on creativity without keeping track of plot or character continuity. Based on insights from professional improvisers, ImprovMate incorporates exercises that mimic live training, such as abrupt story resolution and reactive thinking exercises, while maintaining coherence via reference tables. By balancing randomness and structured guidance, ImprovMate provides a groundbreaking tool for improv training. Our pilot study revealed that actors might embrace AI techniques if the latter mirrors traditional practices, and appreciate the fresh twist introduced by our approach with the AI-generated cues.

Paperid: 1451, https://arxiv.org/pdf/2506.21333.pdf

Abstract:
The co-creativity community is making significant progress in developing more sophisticated and tailored systems to support and enhance human creativity. Design considerations from prior work can serve as a valuable and efficient foundation for future systems. To support this effort, we conducted a systematic literature review of 62 papers on co-creative systems. These papers cover a diverse range of applications, including visual arts, design, and writing, where the AI acts not just as a tool but as an active collaborator in the creative process. From this review, we identified several key dimensions relevant to system design: phase of the creative process, creative task, proactive behavior of the system, user control, system embodiment, and AI model type. Our findings suggest that systems offering high user control lead to greater satisfaction, trust, and a stronger sense of ownership over creative outcomes. Furthermore, proactive systems, when adaptive and context-sensitive, can enhance collaboration. We also extracted 24 design considerations, highlighting the value of encouraging users to externalize their thoughts and of increasing the system's social presence and transparency to foster trust. Despite recent advancements, important gaps remain, such as limited support for early creative phases like problem clarification, and challenges related to user adaptation to AI systems.

Paperid: 1452, https://arxiv.org/pdf/2506.20377.pdf

Abstract:
The impact of culture on how people express distress in online support communities is increasingly a topic of interest within Computer Supported Cooperative Work (CSCW) and Human-Computer Interaction (HCI). In the United States, distinct cultures have emerged from each of the two dominant political parties, forming a primary lens by which people navigate online and offline worlds. We examine whether partisan culture may play a role in how U.S. Republican and Democrat users of online mental health support communities express distress. We present a large-scale observational study of 2,184,356 posts from 8,916 statistically matched Republican, Democrat, and unaffiliated online support community members. We utilize methods from causal inference to statistically match partisan users along covariates that correspond with demographic attributes and platform use, in order to create comparable cohorts for analysis. We then leverage methods from natural language processing to understand how partisan expressions of distress compare between these sets of closely matched opposing partisans, and between closely matched partisans and typical support community members. Our data spans January 2013 to December 2022, a period of both rising political polarization and mental health concerns. We find that partisan culture does play into expressions of distress, underscoring the importance of considering partisan cultural differences in the design of online support community platforms.

Paperid: 1453, https://arxiv.org/pdf/2506.19995.pdf

Abstract:
Augmentative and alternative communication (AAC) is a field of research and practice that works with people who have a communication disability. One form AAC can take is a high-tech tool, such as a software-based communication system. Like all user interfaces, these systems must be designed and it is critical to include AAC users in the design process for their systems. A participatory design approach can include AAC users in the design process, but modifications may be necessary to make these methods more accessible. We present a two-part design process we are investigating for improving the participatory design for high-tech AAC systems. We discuss our plans to refine the accessibility of this process based on participant feedback.

Paperid: 1454, https://arxiv.org/pdf/2506.19757.pdf

Abstract:
Context: Developer experience (DX) plays a key role in developers' performance and their continued involvement in a software ecosystem (SECO) platform. While researchers and practitioners have recognized several factors affecting DX in SECO platforms, a clear roadmap of the most influential factors is still missing. This is particularly important given the direct impact on developers' interest in SECO and their ongoing engagement with the common technological platform. Goal: This work aims to identify key DX factors and understand how they influence third-party developers' decisions to adopt and keep contributing to a SECO. Methods: We conducted a systematic mapping study (SMS), analyzing 29 studies to assess the state-of-the-art of DX in SECO. Additionally, we conducted a Delphi study to evaluate the influence of 27 DX factors (identified in our SMS) from the perspective of 21 third-party developers to adopt and keep contributing to a SECO. Results: The factors that most strongly influence developers' adoption and ongoing contributions to a SECO are: financial costs for using the platform, desired technical resources for development, low barriers to entry into the applications market, and more financial gains. Conclusion: DX is essential for the success and sustainability of SECO. Our set of DX factors provides valuable insights and recommendations for researchers and practitioners to address key DX concerns from the perspective of third-party developers.

Paperid: 1455, https://arxiv.org/pdf/2506.18201.pdf

Abstract:
Emotion recognition capabilities in multimodal AI systems are crucial for developing culturally responsive educational technologies, yet remain underexplored for Arabic language contexts where culturally appropriate learning tools are critically needed. This study evaluates the emotion recognition performance of two advanced multimodal large language models, GPT-4o and Gemini 1.5 Pro, when processing Arabic children's storybook illustrations. We assessed both models across three prompting strategies (zero-shot, few-shot, and chain-of-thought) using 75 images from seven Arabic storybooks, comparing model predictions with human annotations based on Plutchik's emotional framework. GPT-4o consistently outperformed Gemini across all conditions, achieving the highest macro F1-score of 59% with chain-of-thought prompting compared to Gemini's best performance of 43%. Error analysis revealed systematic misclassification patterns, with valence inversions accounting for 60.7% of errors, while both models struggled with culturally nuanced emotions and ambiguous narrative contexts. These findings highlight fundamental limitations in current models' cultural understanding and emphasize the need for culturally sensitive training approaches to develop effective emotion-aware educational technologies for Arabic-speaking learners.

Paperid: 1456, https://arxiv.org/pdf/2506.18199.pdf

Abstract:
Large language models have demonstrated remarkable capabilities across various domains, yet concerns about cultural bias - particularly towards Arabs and Muslims - pose significant ethical challenges by perpetuating harmful stereotypes and marginalization. Despite growing recognition of bias in LLMs, prompt engineering strategies specifically addressing Arab and Muslim representation remain understudied. This mixed-methods systematic review examines such techniques, offering evidence-based guidance for researchers and practitioners. Following PRISMA guidelines and Kitchenham's systematic review methodology, we analyzed 8 empirical studies published between 2021-2024 investigating bias mitigation strategies. Our findings reveal five primary prompt engineering approaches: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Although all approaches show potential for reducing bias, effectiveness varied substantially across studies and bias types. Evidence suggests that certain bias types may be more resistant to prompt-based mitigation than others. Structured multi-step pipelines demonstrated the highest overall effectiveness, achieving up to 87.7% reduction in bias, though they require greater technical expertise. Cultural prompting offers broader accessibility with substantial effectiveness. These results underscore the accessibility of prompt engineering for mitigating cultural bias without requiring access to model parameters. The limited number of studies identified highlights a significant research gap in this critical area. Future research should focus on developing culturally adaptive prompting techniques, creating Arab and Muslim-specific evaluation resources, and integrating prompt engineering with complementary debiasing methods to address deeper stereotypes while maintaining model utility.

Paperid: 1457, https://arxiv.org/pdf/2506.17834.pdf

Abstract:
AI agents are commonly aligned with "human values" through reinforcement learning from human feedback (RLHF), where a single reward model is learned from aggregated human feedback and used to align an agent's behavior. However, human values are not homogeneous: different people hold distinct and sometimes conflicting values. Aggregating feedback into a single reward model risks disproportionately suppressing minority preferences. To address this, we present a novel reward modeling approach for learning individualized reward models. Our approach uses a language model to guide users through reflective dialogues where they critique agent behavior and construct their preferences. This personalized dialogue history, containing the user's reflections and critiqued examples, is then used as context for another language model that serves as an individualized reward function (what we call a "verbal reward model") for evaluating new trajectories. In studies with 30 participants, our method achieved a 9-12% improvement in accuracy over non-reflective verbal reward models while being more sample efficient than traditional supervised learning methods.

Paperid: 1458, https://arxiv.org/pdf/2506.17494.pdf

Abstract:
AI projects often fail due to financial, technical, ethical, or user acceptance challenges -- failures frequently rooted in early-stage decisions. While HCI and Responsible AI (RAI) research emphasize this, practical approaches for identifying promising concepts early remain limited. Drawing on Research through Design, this paper investigates how early-stage AI concept sorting in commercial settings can reflect RAI principles. Through three design experiments -- including a probe study with industry practitioners -- we explored methods for evaluating risks and benefits using multidisciplinary collaboration. Participants demonstrated strong receptivity to addressing RAI concerns early in the process and effectively identified low-risk, high-benefit AI concepts. Our findings highlight the potential of a design-led approach to embed ethical and service design thinking at the front end of AI innovation. By examining how practitioners reason about AI concepts, our study invites HCI and RAI communities to see early-stage innovation as a critical space for engaging ethical and commercial considerations together.

Paperid: 1459, https://arxiv.org/pdf/2506.16617.pdf

Abstract:
Predictive Process Monitoring (PPM) often uses deep learning models to predict the future behavior of ongoing processes, such as predicting process outcomes. While these models achieve high accuracy, their lack of interpretability undermines user trust and adoption. Explainable AI (XAI) aims to address this challenge by providing the reasoning behind the predictions. However, current evaluations of XAI in PPM focus primarily on functional metrics (such as fidelity), overlooking user-centered aspects such as their effect on task performance and decision-making. This study investigates the effects of explanation styles (feature importance, rule-based, and counterfactual) and perceived AI accuracy (low or high) on decision-making in PPM. We conducted a decision-making experiment, where users were presented with the AI predictions, perceived accuracy levels, and explanations of different styles. Users' decisions were measured both before and after receiving explanations, allowing the assessment of objective metrics (Task Performance and Agreement) and subjective metrics (Decision Confidence). Our findings show that perceived accuracy and explanation style have a significant effect.

Paperid: 1460, https://arxiv.org/pdf/2506.16345.pdf

Abstract:
Heuristic evaluation is a widely used method in Human-Computer Interaction (HCI) to inspect interfaces and identify issues based on heuristics. Recently, Large Language Models (LLMs), such as GPT-4o, have been applied in HCI to assist in persona creation, the ideation process, and the analysis of semi-structured interviews. However, considering the need to understand heuristics and the high degree of abstraction required to evaluate them, LLMs may have difficulty conducting heuristic evaluation. Moreover, prior research has not investigated GPT-4o's performance in heuristic evaluation compared to HCI experts in web-based systems. In this context, this study aims to compare the results of a heuristic evaluation performed by GPT-4o and human experts. To this end, we selected a set of screenshots from a web system and asked GPT-4o to perform a heuristic evaluation based on Nielsen's Heuristics from a literature-grounded prompt. Our results indicate that only 21.2% of the issues identified by human experts were also identified by GPT-4o, although it found 27 new issues. We also found that GPT-4o performed better for heuristics related to aesthetic and minimalist design and match between system and real world, whereas it had difficulty identifying issues in heuristics related to flexibility, control, and user efficiency. Additionally, we noticed that GPT-4o generated several false positives due to hallucinations and attempts to predict issues. Finally, we highlight five takeaways for the conscious use of GPT-4o in heuristic evaluations.

Paperid: 1461, https://arxiv.org/pdf/2506.15278.pdf

Abstract:
Ride-sharing platforms like Uber market themselves as enabling 'flexibility' for their workforce, meaning that drivers are expected to anticipate when and where the algorithm will allocate them jobs, and how well remunerated those jobs will be. In this work we describe our process of participatory action research with drivers and trade union organisers, culminating in a participatory audit of Uber's algorithmic pay and work allocation, before and after the introduction of dynamic pricing. Through longitudinal analysis of 1.5 million trips from 258 drivers in the UK, we find that after dynamic pricing, pay has decreased, Uber's cut has increased, job allocation and pay are less predictable, inequality between drivers has increased, and drivers spend more time waiting for jobs. In addition to these findings, we provide methodological and theoretical contributions to algorithm auditing, gig work, and the emerging practice of worker data science.

Paperid: 1462, https://arxiv.org/pdf/2506.14829.pdf

Abstract:
In an attempt to tackle the UN SDGs, AI for Social Impact (AI4SI) projects focus on harnessing AI to address societal issues in areas such as healthcare, social justice, etc. Unfortunately, despite growing interest in AI4SI, achieving tangible, on-the-ground impact remains a significant challenge. For example, identifying and engaging motivated collaborators who are willing to co-design and deploy AI based solutions in real-world settings is often difficult. Even when such partnerships are established, many AI4SI projects "fail" to progress beyond the proof-of-concept stage, and hence, are unable to transition to at-scale production-level solutions. Furthermore, the unique challenges faced by AI4SI researchers are not always fully recognized within the broader AI community, where such work is sometimes viewed as primarily applied and not aligning with the traditional criteria for novelty emphasized in core AI venues. This paper attempts to shine a light on the diverse challenges faced in AI4SI research by diagnosing a multitude of factors that prevent AI4SI partnerships from achieving real-world impact on the ground. Drawing on semi-structured interviews with six leading AI4SI researchers - complemented by the authors' own lived experiences in conducting AI4SI research - this paper attempts to understand the day-to-day difficulties faced in developing and deploying socially impactful AI solutions. Through thematic analysis, we identify structural and organizational, communication, collaboration, and operational challenges as key barriers to deployment. While there are no easy fixes, we synthesize best practices and actionable strategies drawn from these interviews and our own work in this space. In doing so, we hope this paper serves as a practical reference guide for AI4SI researchers and partner organizations seeking to engage more effectively in socially impactful AI collaborations.

Paperid: 1463, https://arxiv.org/pdf/2506.14476.pdf

Abstract:
Understanding user behaviors on social media has garnered significant scholarly attention, enhancing our comprehension of how virtual platforms impact society and empowering decision-makers. Simulating social media behaviors provides a robust tool for capturing the patterns of social media behaviors, testing hypotheses, and predicting the effects of various interventions, ultimately contributing to a deeper understanding of social media environments. Moreover, it can overcome difficulties associated with utilizing real data for analysis, such as data accessibility issues, ethical concerns, and the complexity of processing large and heterogeneous datasets. However, researchers and stakeholders need more flexible platforms to investigate different user behaviors by simulating different scenarios and characters, which is not possible yet. Therefore, this paper introduces SimSpark, an interactive system including simulation algorithms and interactive visual interfaces which is capable of creating small simulated social media platforms with customizable characters and social environments. We address three key challenges: generating believable behaviors, validating simulation results, and supporting interactive control for generation and results analysis. A simulation workflow is introduced to generate believable behaviors of agents by utilizing large language models. A visual interface enables real-time parameter adjustment and process monitoring for customizing generation settings. A set of visualizations and interactions are also designed to display the models' outputs for further analysis. Effectiveness is evaluated through case studies, quantitative simulation model assessments, and expert interviews.

Paperid: 1464, https://arxiv.org/pdf/2506.14371.pdf

Abstract:
The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.

Paperid: 1465, https://arxiv.org/pdf/2506.13403.pdf

Abstract:
Many people feel compelled to interpret, describe, and respond to Large Language Models (LLMs) as if they possess inner mental lives similar to our own. Responses to this phenomenon have varied. Inflationists hold that at least some folk psychological ascriptions to LLMs are warranted. Deflationists argue that all such attributions of mentality to LLMs are misplaced, often cautioning against the risk that anthropomorphic projection may lead to misplaced trust or potentially even confusion about the moral status of LLMs. We advance this debate by assessing two common deflationary arguments against LLM mentality. What we term the 'robustness strategy' aims to undercut one justification for believing that LLMs are minded entities by showing that putatively cognitive and humanlike behaviours are not robust, failing to generalise appropriately. What we term the 'etiological strategy' undercuts attributions of mentality by challenging naive causal explanations of LLM behaviours, offering alternative causal accounts that weaken the case for mental state attributions. While both strategies offer powerful challenges to full-blown inflationism, we find that neither strategy provides a knock-down case against ascriptions of mentality to LLMs simpliciter. With this in mind, we explore a modest form of inflationism that permits ascriptions of mentality to LLMs under certain conditions. Specifically, we argue that folk practice provides a defeasible basis for attributing mental states and capacities to LLMs provided those mental states and capacities can be understood in metaphysically undemanding terms (e.g. knowledge, beliefs and desires), while greater caution is required when attributing metaphysically demanding mental phenomena such as phenomenal consciousness.

Paperid: 1466, https://arxiv.org/pdf/2506.12356.pdf

Abstract:
Surface electromyography (sEMG) at the wrists could enable natural, keyboard-free text entry, yet the state-of-the-art emg2qwerty baseline still misrecognizes $51.8\%$ of characters in the zero-shot setting on unseen users and $7.0\%$ after user-specific fine-tuning. We trace many of these errors to mismatched cross-user signal statistics, fragile reliance on high-order feature dependencies, and the absence of architectural inductive biases aligned with the bilateral nature of typing. To address these issues, we introduce three simple modifications: (i) Rolling Time Normalization, which adaptively aligns input distributions across users; (ii) Aggressive Channel Masking, which encourages reliance on low-order feature combinations more likely to generalize across users; and (iii) a Split-and-Share encoder that processes each hand independently with weight-shared streams to reflect the bilateral symmetry of the neuromuscular system. Combined with a five-fold reduction in spectral resolution ($33\!\rightarrow\!6$ frequency bands), these components yield a compact Split-and-Share model, SplashNet-mini, which uses only $\tfrac14$ the parameters and $0.6\times$ the FLOPs of the baseline while reducing character-error rate (CER) to $36.4\%$ zero-shot and $5.9\%$ after fine-tuning. An upscaled variant, SplashNet ($\tfrac12$ the parameters, $1.15\times$ the FLOPs of the baseline), further lowers error to $35.7\%$ and $5.5\%$, representing relative improvements of $31\%$ and $21\%$ in the zero-shot and fine-tuned settings, respectively. SplashNet therefore establishes a new state of the art without requiring additional data.
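
The relative improvements quoted above can be checked directly from the reported character-error rates. The following is a small arithmetic sketch; the function and variable names are illustrative and do not come from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction, expressed as a fraction of the baseline error."""
    return (baseline - new) / baseline

# Character-error rates (percent) quoted in the abstract.
baseline_cer = {"zero_shot": 51.8, "fine_tuned": 7.0}
splashnet_cer = {"zero_shot": 35.7, "fine_tuned": 5.5}

improvements = {
    setting: relative_reduction(baseline_cer[setting], splashnet_cer[setting])
    for setting in baseline_cer
}

for setting, value in improvements.items():
    print(f"{setting}: {value:.1%} relative CER reduction")
```

Rounded to whole percentages, both values agree with the 31% and 21% figures stated in the abstract.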

Paperid: 1467, https://arxiv.org/pdf/2506.11774.pdf

Abstract:
Isometric exercises appeal to individuals seeking convenience, privacy, and minimal dependence on equipment. However, such fitness training is often overdependent on unreliable digital media content instead of expert supervision, introducing serious risks, including incorrect posture, injury, and disengagement due to lack of corrective feedback. To address these challenges, we present a real-time feedback system for assessing isometric poses. Our contributions include the release of the largest multiclass isometric exercise video dataset to date, comprising over 3,600 clips across six poses with correct and incorrect variations. To support robust evaluation, we benchmark state-of-the-art models, including graph-based networks, on this dataset and introduce a novel three-part metric that captures classification accuracy, mistake localization, and model confidence. Our results enhance the feasibility of intelligent and personalized exercise training systems for home workouts. This expert-level diagnosis, delivered directly to the users, also expands the potential applications of these systems to rehabilitation, physiotherapy, and various other fitness disciplines that involve physical motion.

Paperid: 1468, https://arxiv.org/pdf/2506.11212.pdf

Abstract:
Mainstream messaging platforms offer a variety of features designed to enhance user privacy, such as password-protected chats and end-to-end encryption, which primarily protect message contents. Beyond contents, a lot can be inferred about people simply by tracing who sends and receives messages, when, and how often. This paper explores user perceptions of and attitudes toward "untraceability", defined as preventing third parties from tracing who communicates with whom, to inform the design of privacy-enhancing technologies and untraceable communication protocols. Through a vignette-based qualitative study with 189 participants, we identify a diverse set of features that users perceive to be useful for untraceable messaging, ranging from using aliases instead of real names to VPNs. Through a reflexive thematic analysis, we uncover three overarching attitudes that influence the support or rejection of untraceability in messaging platforms and that can serve as a set of new privacy personas: privacy fundamentalists, who advocate for privacy as a universal right; safety fundamentalists, who support surveillance for the sake of accountability; and optimists, who advocate for privacy in principle but also endorse exceptions in idealistic ways, such as encryption backdoors. We highlight a critical gap between the threat models assumed by users and those addressed by untraceable communication protocols. Many participants understood untraceability as a form of anonymity, but interpreted it as senders and receivers hiding their identities from each other, rather than from external network observers. We discuss implications for the design of strategic communication and user interfaces of untraceable messaging protocols, and propose framing untraceability as a form of "altruistic privacy", i.e., adopting privacy-enhancing technologies to protect others, as a promising strategy to foster broad adoption.

Paperid: 1469, https://arxiv.org/pdf/2506.08962.pdf

Abstract:
This research-to-practice work-in-progress (WIP) paper presents an AI-enabled smart tutor designed to provide homework assessment and feedback for students in an undergraduate circuit analysis course. We detail the tutor's design philosophy and core components, including open-ended question answering and homework feedback generation. The prompts are carefully crafted to optimize responses across different problems. The smart tutor was deployed on the Microsoft Azure platform and is currently in use in an undergraduate circuit analysis course at the School of Electrical and Computer Engineering in a large, public, research-intensive institution in the Southeastern United States. Beyond offering personalized instruction and feedback, the tutor collects student interaction data, which is summarized and shared with the course instructor. To evaluate its effectiveness, we collected student feedback, with 90.9% of responses indicating satisfaction with the tutor. Additionally, we analyze a subset of collected data on preliminary circuit analysis topics to assess tutor usage frequency for each problem and identify frequently asked questions. These insights help instructors gain real-time awareness of student difficulties, enabling more targeted classroom instruction. In future work, we will release a full analysis once the complete dataset is available after the Spring 2025 semester. We also explore the potential applications of this smart tutor across a broader range of engineering disciplines by developing improved prompts, diagram-recognition methods, and database management strategies, which remain ongoing areas of research.

Paperid: 1470, https://arxiv.org/pdf/2506.08443.pdf

Abstract:
While current AI illustration tools can generate high-quality images from text prompts, they rarely reveal the step-by-step procedure that human artists follow. We present SakugaFlow, a four-stage pipeline that pairs diffusion-based image generation with a large-language-model tutor. At each stage, novices receive real-time feedback on anatomy, perspective, and composition, revise any step non-linearly, and branch alternative versions. By exposing intermediate outputs and embedding pedagogical dialogue, SakugaFlow turns a black-box generator into a scaffolded learning environment that supports both creative exploration and skills acquisition.

Paperid: 1471, https://arxiv.org/pdf/2506.08294.pdf

Abstract:
Constraint-satisfaction problems (CSPs) are ubiquitous, ranging from budgeting for grocery shopping to verifying software behavior. Logic modeling helps solve CSPs programmatically using SMT solvers. Despite its importance in many Computer Science disciplines, resources for teaching and learning logic modeling are scarce and scattered, and challenges remain in designing educational environments for logic modeling that are accessible and meet the needs of teachers and students. This paper explores how to design such an environment and probes the impact of the design on the learning experience. From a need-finding interview study and a design iteration with teachers of logic modeling, we curated 10 design guidelines spanning three main requirements: providing easy access, supporting various educational modalities, and allowing extensions for customized pedagogical needs. We implemented nine guidelines in Z3Guide, an open-source browser-based tool. Using Z3Guide in a logic modeling learning workshop with more than 100 students, we gathered positive feedback on its support for learning and identified opportunities for future improvements.
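
To make the idea of logic modeling concrete, here is a minimal sketch of the grocery-budgeting example as a constraint-satisfaction problem. It is not taken from Z3Guide: an SMT solver such as Z3 would reason about the constraints symbolically, whereas this sketch swaps in plain exhaustive enumeration over small finite domains so that it runs with the Python standard library alone. The items, prices, budget, and constraints are all hypothetical.

```python
from itertools import product

# Hypothetical grocery items and unit prices in cents (illustrative only).
PRICES = {"apple": 50, "bread": 250, "milk": 120}
BUDGET = 600   # spend at most this much
MAX_QTY = 5    # each quantity ranges over the finite domain 0..MAX_QTY

def total_cost(qty: dict) -> int:
    """Total price of a shopping basket given per-item quantities."""
    return sum(PRICES[item] * n for item, n in qty.items())

def satisfies(qty: dict) -> bool:
    """Constraints: stay within budget, buy at least one bread, and buy
    at least twice as many apples as milk cartons."""
    return (
        total_cost(qty) <= BUDGET
        and qty["bread"] >= 1
        and qty["apple"] >= 2 * qty["milk"]
    )

def solve() -> list:
    """Enumerate every assignment and keep those meeting all constraints."""
    items = list(PRICES)
    solutions = []
    for combo in product(range(MAX_QTY + 1), repeat=len(items)):
        qty = dict(zip(items, combo))
        if satisfies(qty):
            solutions.append(qty)
    return solutions

solutions = solve()
best = max(solutions, key=total_cost)
print(len(solutions), "satisfying assignments; one that spends the most:", best)
```

Enumeration only works for tiny domains like this one; the appeal of a solver-backed model is that the same declarative constraints remain usable when the search space is far too large to list.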

Paperid: 1472, https://arxiv.org/pdf/2506.07281.pdf

Abstract:
As AI technologies become more human-facing, there have been numerous calls to adapt participatory approaches to AI development -- spurring the idea of participatory AI. However, these calls often focus only on primary stakeholders, such as end-users, and not secondary stakeholders. This paper seeks to translate the ideals of participatory AI to a broader population of secondary AI stakeholders through semi-structured interviews. We theorize that meaningful participation involves three participatory ideals: (1) informedness, (2) consent, and (3) agency. We also explore how secondary stakeholders realize these ideals by traversing a complicated problem space. Like walking up the rungs of a ladder, these ideals build on one another. We introduce three stakeholder archetypes: the reluctant data contributor, the unsupported activist, and the well-intentioned practitioner, who must navigate systemic barriers to achieving agentic AI relationships. We envision an AI future where secondary stakeholders are able to meaningfully participate with the AI systems they influence and are influenced by.

Paperid: 1473, https://arxiv.org/pdf/2506.06104.pdf

Abstract:
The rising prevalence of chronic wounds, especially in aging populations, presents a significant healthcare challenge due to prolonged hospitalizations, elevated costs, and reduced patient quality of life. Traditional wound care is resource-intensive, requiring frequent in-person visits that strain both patients and healthcare professionals (HCPs). Therefore, we present WoundAIssist, a patient-centered, AI-driven mobile application designed to support telemedical wound care. WoundAIssist enables patients to regularly document wounds at home via photographs and questionnaires, while physicians remain actively engaged in the care process through remote monitoring and video consultations. A distinguishing feature is an integrated lightweight deep learning model for on-device wound segmentation, which, combined with patient-reported data, enables continuous monitoring of wound healing progression. Developed through an iterative, user-centered process involving both patients and domain experts, WoundAIssist prioritizes a user-friendly design, particularly for elderly patients. A conclusive usability study with patients and dermatologists reported excellent usability, good app quality, and favorable perceptions of the AI-driven wound recognition. Our main contribution is two-fold: (I) the implementation and (II) evaluation of WoundAIssist, an easy-to-use yet comprehensive telehealth solution designed to bridge the gap between patients and HCPs. Additionally, we synthesize design insights for remote patient monitoring apps, derived from over three years of interdisciplinary research, that may inform the development of similar digital health tools across clinical domains.

Paperid: 1474, https://arxiv.org/pdf/2506.06083.pdf

Abstract:
The availability of big data has significantly influenced the possibilities and methodological choices for conducting large-scale behavioural and social science research. In the context of qualitative data analysis, a major challenge is that conventional methods require intensive manual labour and are often impractical to apply to large datasets. One effective way to address this issue is by integrating emerging computational methods to overcome scalability limitations. However, a critical concern for researchers is the trustworthiness of results when Machine Learning (ML) and Natural Language Processing (NLP) tools are used to analyse such data. We argue that confidence in the credibility and robustness of results depends on adopting a 'human-in-the-loop' methodology that is able to provide researchers with control over the analytical process, while retaining the benefits of using ML and NLP. With this in mind, we propose a novel methodological framework for Computational Grounded Theory (CGT) that supports the analysis of large qualitative datasets, while maintaining the rigour of established Grounded Theory (GT) methodologies. To illustrate the framework's value, we present the results of testing it on a dataset collected from Reddit in a study aimed at understanding tutors' experiences in the gig economy.

Paperid: 1475, https://arxiv.org/pdf/2506.04423.pdf

Abstract:
Large Language Models (LLMs) offer novel opportunities for educational applications that have the potential to transform traditional learning for students. Despite AI-enhanced applications having the potential to provide personalized learning experiences, more studies are needed on the design of generative AI systems and evidence for using them in real educational settings. In this paper, we design, implement and evaluate Reviewriter, a novel tool to provide students with AI-generated instructions for writing peer reviews in German. Our study identifies three key aspects: a) we provide insights into student needs when writing peer reviews with generative models, which we then use to develop a novel system to provide adaptive instructions; b) we fine-tune three German language models on a selected corpus of 11,925 student-written peer review texts in German and choose German-GPT2 based on quantitative measures and human evaluation; and c) we evaluate our tool with fourteen students, revealing positive technology acceptance based on quantitative measures. Additionally, the qualitative feedback presents the benefits and limitations of generative AI in peer review writing.

Paperid: 1476, https://arxiv.org/pdf/2506.04063.pdf

Abstract:
Large Language Models (LLMs) increasingly rely on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to align model responses with human preferences. While RLHF employs a reinforcement learning approach with a separate reward model, SFT uses human-curated datasets for supervised learning. Both approaches traditionally depend on small, vetted groups of annotators, making them costly, prone to bias, and limited in scalability. We propose an open, crowd-sourced fine-tuning framework that addresses these limitations by enabling broader feedback collection for SFT without extensive annotator training. Our framework promotes incentive fairness via a point-based reward system correlated with Shapley values and guides model convergence through iterative model updates. Our multi-model selection framework demonstrates up to a 55% reduction in target distance over single-model selection, enabling subsequent experiments that validate our point-based reward mechanism's close alignment with Shapley values (a well-established method for attributing individual contributions), thereby supporting fair and scalable participation.
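
For readers unfamiliar with Shapley values, the sketch below computes them exactly for a toy three-contributor game by averaging each contributor's marginal contribution over all orderings. It illustrates the attribution concept only and is not the paper's point-based reward mechanism; the contributor names and coalition values are hypothetical.

```python
from itertools import permutations
from math import isclose

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every order in which the full coalition can be assembled."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical worth of each coalition of annotators A, B, and C
# (for example, some quality score of a model tuned on their joint feedback).
COALITION_VALUE = {
    frozenset(): 0.0,
    frozenset("A"): 10.0,
    frozenset("B"): 20.0,
    frozenset("C"): 30.0,
    frozenset("AB"): 40.0,
    frozenset("AC"): 50.0,
    frozenset("BC"): 60.0,
    frozenset("ABC"): 90.0,
}

def toy_value(coalition):
    """Look up the worth of a coalition in the table above."""
    return COALITION_VALUE[frozenset(coalition)]

phi = shapley_values(["A", "B", "C"], toy_value)
print(phi)

# Efficiency property: the shares add up to the worth of the full coalition.
assert isclose(sum(phi.values()), toy_value("ABC"))
```

Exact computation enumerates all orderings, so its cost grows factorially with the number of contributors. That cost is one reason a cheaper reward signal that merely tracks Shapley values is attractive at crowd scale.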

Paperid: 1477, https://arxiv.org/pdf/2506.04043.pdf

Abstract:
Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere's CommandR-7B, and Meta's LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.

Paperid: 1478, https://arxiv.org/pdf/2506.03807.pdf

Abstract:
The experience and adoption of conversational search is tied to the accuracy and completeness of users' mental models -- their internal frameworks for understanding and predicting system behaviour. Thus, understanding these models can reveal areas for design interventions. Transparency is one such intervention which can improve system interpretability and enable mental model alignment. While past research has explored mental models of search engines, those of generative conversational search remain underexplored, even while the popularity of these systems soars. To address this, we conducted a study with 16 participants, who performed 4 search tasks using 4 conversational interfaces of varying transparency levels. Our analysis revealed that most user mental models were too abstract to support users in explaining individual search instances. These results suggest that 1) mental models may pose a barrier to appropriate trust in conversational search, and 2) hybrid web-conversational search is a promising novel direction for future search interface design.

Paperid: 1479, https://arxiv.org/pdf/2506.02449.pdf

Abstract:
In modern dialogue systems, the ability to implicitly infer user backgrounds from conversations and leverage this information for personalized assistance is crucial. However, the scarcity of high-quality data remains a fundamental challenge to evaluating and improving this capability. Traditional dataset construction methods are labor-intensive, resource-demanding, and raise privacy concerns. To address these issues, we propose a novel approach for automatic synthetic data generation and introduce the Implicit Personalized Dialogue (IP-Dialog) benchmark along with a training dataset, covering 10 tasks and 12 user attribute types. Additionally, we develop a systematic evaluation framework with four metrics to assess both attribute awareness and reasoning capabilities. We further propose five causal graphs to elucidate models' reasoning pathways during implicit personalization. Extensive experiments yield insightful observations and prove the reliability of our dataset.

Paperid: 1480, https://arxiv.org/pdf/2506.02262.pdf

Abstract:
While the increased integration of AI technologies into interactive systems enables them to solve an equally increasing number of tasks, the black box problem of AI models continues to spread throughout the interactive system as a whole. Explainable AI (XAI) techniques can make AI models more accessible by employing post-hoc methods or transitioning to inherently interpretable models. While this makes individual AI models clearer, the overarching system architecture remains opaque. To this end, we propose an approach to represent interactive systems as sequences of structural building blocks, such as AI models and control mechanisms grounded in the literature. These can then be explained through accompanying visual building blocks, such as XAI techniques. The flow and APIs of the structural building blocks form an explicit overview of the system. This serves as a communication basis for both humans and automated agents like LLMs, aligning human and machine interpretability of AI models. We discuss a selection of building blocks and concretize our flow-based approach in an architecture and accompanying prototype interactive system.

Paperid: 1481, https://arxiv.org/pdf/2506.01284.pdf

Abstract:
Steady-State Visual Evoked Potential (SSVEP) is a brain response to visual stimuli flickering at constant frequencies. It is commonly used in brain-computer interfaces (BCIs) for direct brain-device communication due to its simplicity, minimal training data requirements, and high information transfer rate. Traditional methods suffer from poor performance due to reliance on prior knowledge, while deep learning achieves higher accuracy but requires substantial high-quality training data for precise signal decoding. In this paper, we propose a calibration-free EEG signal decoding framework for fast SSVEP detection. Our framework integrates Inter-Trial Remixing & Context-Aware Distribution Alignment data augmentation for EEG signals and employs a compact architecture of small fully connected layers, effectively addressing the challenge of limited EEG data availability. Additionally, we propose an Adaptive Spectrum Denoise Module that operates in the frequency domain based on global features, requiring only linear complexity to reduce noise in EEG data and improve data quality. For calibration-free classification experiments on short EEG signals from three public datasets, our framework demonstrates statistically significant accuracy advantages (p<0.05) over existing methods in the majority of cases, while requiring at least 52.7% fewer parameters and 29.9% less inference time. By eliminating the need for user-specific calibration, this advancement significantly enhances the usability of BCI systems, accelerating their commercialization and widespread adoption in real-world applications.
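
As background on what SSVEP detection involves, the sketch below picks the candidate flicker frequency with the most spectral power in a synthetic single-channel signal. This is a deliberately simple frequency-domain baseline, not the framework proposed in the paper; the sampling rate, candidate frequencies, and test signal are assumptions made for illustration.

```python
import math

def power_at(signal, freq_hz, sample_rate_hz):
    """Power of the signal's projection onto a sinusoid at freq_hz,
    i.e. a single-frequency discrete Fourier transform."""
    re = 0.0
    im = 0.0
    for n, x in enumerate(signal):
        angle = 2.0 * math.pi * freq_hz * n / sample_rate_hz
        re += x * math.cos(angle)
        im -= x * math.sin(angle)
    return (re * re + im * im) / len(signal)

def detect_ssvep(signal, candidate_freqs_hz, sample_rate_hz):
    """Return the candidate stimulus frequency with the highest power."""
    return max(candidate_freqs_hz,
               key=lambda f: power_at(signal, f, sample_rate_hz))

# Synthetic one-second recording (assumed 250 Hz sampling): a 12 Hz response
# plus 50 Hz interference standing in for mains noise.
SAMPLE_RATE_HZ = 250.0
N_SAMPLES = 250
signal = [
    math.sin(2.0 * math.pi * 12.0 * n / SAMPLE_RATE_HZ)
    + 0.5 * math.sin(2.0 * math.pi * 50.0 * n / SAMPLE_RATE_HZ)
    for n in range(N_SAMPLES)
]

CANDIDATES_HZ = [8.0, 10.0, 12.0, 15.0]
detected = detect_ssvep(signal, CANDIDATES_HZ, SAMPLE_RATE_HZ)
print("Detected stimulus frequency:", detected, "Hz")
```

Real EEG is multi-channel and far noisier than this synthetic signal, and short time windows make such spectral peaks less reliable, which helps explain the interest in learned, calibration-free decoders.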

Paperid: 1482, https://arxiv.org/pdf/2505.24126.pdf

Abstract:
This study investigates how undergraduate students engage with ChatGPT in self-directed learning contexts. Analyzing naturalistic interaction logs, we identify five dominant use categories of ChatGPT: information seeking, content generation, language refinement, metacognitive engagement, and conversational repair. Behavioral modeling reveals that structured, goal-driven tasks like coding, multiple-choice solving, and job application writing are strong predictors of continued use. Drawing on Self-Directed Learning (SDL) and the Uses and Gratifications Theory (UGT), we show how students actively manage ChatGPT's affordances and limitations through prompt adaptation, follow-ups, and emotional regulation. Rather than disengaging after breakdowns, students often persist through clarification and repair, treating the assistant as both tool and learning partner. We also offer design and policy recommendations to support transparent, responsive, and pedagogically grounded integration of generative AI in higher education.

Paperid: 1483, https://arxiv.org/pdf/2505.24042.pdf

Abstract:
Modern healthcare facilities demand digital accessibility to guarantee equal access to telemedicine platforms, online pharmacy services, and wearable or handheld health monitoring devices. With the rising call for robust digital healthcare solutions, people with disabilities encounter impediments in managing and getting accustomed to these modern technologies owing to insufficient accessibility features. The paper highlights the role of comprehensive solutions for enhanced patient engagement and usability, particularly in digital pharmacy, healthcare, and wearable devices. It also elucidates the key obstructions faced by users experiencing auditory, visual, cognitive, and motor impairments. Through careful consideration of present accessibility guidelines, practices, and emerging technologies, the paper provides a holistic overview by offering innovative solutions, accentuating the importance of compliance with the Web Content Accessibility Guidelines (WCAG), the Americans with Disabilities Act (ADA), and other regulatory structures to foster easy access to digital healthcare services. Moreover, there is due focus on using AI-driven tools, speech-activated interfaces, and tactile feedback in wearable health devices to assist persons with disabilities. The outcome of the research explicates the necessity of prioritizing accessibility for individuals with disabilities and cultivating a culture where healthcare providers, policymakers, and officials build a patient-centered digital healthcare ecosystem that is all-encompassing in nature.

Paperid: 1484, https://arxiv.org/pdf/2505.24035.pdf

Abstract:
The swift evolution of telehealth has revolutionized how medical professionals deliver healthcare services, boosting convenience and accessibility. Yet the Medicaid population encounters several impediments in using these services, chiefly poor internet connectivity, limited awareness of digital platforms, and a shortage of assistive technologies. The paper aims to explicate the key factors behind digital accessibility for Medicaid populations and proposes robust solutions to overcome these challenges. Through inclusive design ideas, AI-assisted technologies, and all-encompassing policies by the concerned authorities, healthcare professionals can enhance usability and efficacy and thus better serve those in need. This revolution not only enhances convenience but also expands access, mainly for underserved groups such as rural populations or those with mobility issues, thereby ensuring inclusivity and flexibility in the healthcare domain. Besides, the paper highlights the importance of collaboration between healthcare professionals, policymakers, and tech developers in identifying accessibility and usability impediments. Guaranteeing equitable access to telehealth for Medicaid beneficiaries also helps minimize healthcare disparities and improve patient outcomes. The paper systematically offers major recommendations to increase digital accessibility in telehealth, thereby creating a patient-oriented and all-encompassing healthcare system.

Paperid: 1485, https://arxiv.org/pdf/2505.23730.pdf

Abstract:
The Digital Twin Brain (DTB) is an advanced artificial intelligence framework that integrates spiking neurons to simulate complex cognitive functions and collaborative behaviors. For domain experts, visualizing the DTB's simulation outcomes is essential to understanding complex cognitive activities. However, this task poses significant challenges due to DTB data's inherent characteristics, including its high dimensionality, temporal dynamics, and spatial complexity. To address these challenges, we developed DTBIA, an Immersive Visual Analytics System for Brain-Inspired Research. In collaboration with domain experts, we identified key requirements for effectively visualizing spatiotemporal and topological patterns at multiple levels of detail. DTBIA incorporates a hierarchical workflow - ranging from brain regions to voxels and slice sections - along with immersive navigation and a 3D edge bundling algorithm to enhance clarity and provide deeper insights into both functional (BOLD) and structural (DTI) brain data. The utility and effectiveness of DTBIA are validated through two case studies involving brain research experts. The results underscore the system's role in enhancing the comprehension of complex neural behaviors and interactions.

Paperid: 1486, https://arxiv.org/pdf/2505.23310.pdf

Abstract:
While virtual reality (VR) holds significant potential to revolutionize digital user interaction, how visual information is presented through VR head-mounted displays (HMDs) differs from naturalistic viewing and interactions in physical environments, leading to performance decrements. One critical challenge in VR development is the vergence-accommodation conflict (VAC), which arises due to the intrinsic constraints of approximating the natural viewing geometry through digital displays. Although various hardware and software solutions have been proposed to address VAC, no commercially viable option has been universally adopted by manufacturers. This paper presents and evaluates a software solution grounded in a vision-based geometrical model of VAC that mediates VAC's impact on movement in VR. This model predicts the impact of VAC as a constant offset to the vergence angle, distorting the binocular viewing geometry in a way that results in movement undershooting. In Experiment 1, a 3D pointing task validated the model's predictions and demonstrated that VAC primarily affects online movements involving real-time visual feedback. Experiment 2 implemented a shader program to rectify the effect of VAC, improving movement accuracy by approximately 30%. Overall, this work presents a practical approach to reducing the impact of VAC on HMD-based manual interactions, enhancing the user experience in virtual environments.
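
The "constant offset to the vergence angle" idea can be illustrated with standard binocular geometry: for a target straight ahead at distance d and interpupillary distance IPD, the vergence angle is 2·atan(IPD / 2d). The sketch below adds a fixed offset to that angle and shows that the implied distance shrinks, which is consistent with undershooting. The IPD, the offset magnitude, its sign convention, and the compensation function are all assumptions for illustration; they are not values or code from the paper, and the compensation is not the paper's shader.

```python
import math

def vergence_angle(distance_m: float, ipd_m: float) -> float:
    """Vergence angle (radians) for a target straight ahead at distance_m."""
    return 2.0 * math.atan(ipd_m / (2.0 * distance_m))

def distance_from_vergence(angle_rad: float, ipd_m: float) -> float:
    """Inverse of vergence_angle: the distance implied by a vergence angle."""
    return ipd_m / (2.0 * math.tan(angle_rad / 2.0))

def perceived_distance(true_distance_m: float, offset_rad: float,
                       ipd_m: float = 0.063) -> float:
    """Distance implied after adding a constant offset to the vergence angle.
    Assumption: a positive offset enlarges the angle, so targets seem nearer."""
    angle = vergence_angle(true_distance_m, ipd_m) + offset_rad
    return distance_from_vergence(angle, ipd_m)

def compensated_distance(target_distance_m: float, offset_rad: float,
                         ipd_m: float = 0.063) -> float:
    """Hypothetical correction: the distance at which to present a target so
    that, after the offset is applied, it is perceived at target_distance_m."""
    angle = vergence_angle(target_distance_m, ipd_m) - offset_rad
    return distance_from_vergence(angle, ipd_m)

OFFSET_RAD = math.radians(0.5)   # assumed offset of half a degree
TARGET_M = 0.5                   # a reachable target half a metre away

seen_at = perceived_distance(TARGET_M, OFFSET_RAD)
render_at = compensated_distance(TARGET_M, OFFSET_RAD)
print(f"Uncorrected: target at {TARGET_M} m is perceived at {seen_at:.3f} m")
print(f"Present the target at {render_at:.3f} m to compensate")
```

Because the offset is added in angle rather than in distance, its effect on implied distance grows with target distance under this geometry, so a single fixed distance correction would not suffice.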

Paperid: 1487, https://arxiv.org/pdf/2505.23079.pdf

Abstract:
Exploring data relations across multiple views has been a common task in many domains such as bioinformatics, cybersecurity, and healthcare. To support this, various techniques (e.g., visual links and brushing and linking) are used to show related visual elements across views via lines and highlights. However, understanding the relations using these techniques, when many related elements are scattered, can be difficult due to spatial distance and complexity. To address this, we present iTrace, an interactive visualization technique to effectively trace cross-view data relationships. iTrace leverages the concept of interactive focus transitions, which allows users to see and directly manipulate their focus as they navigate between views. By directing the user's attention through smooth transitions between related elements, iTrace makes it easier to follow data relationships. We demonstrate the effectiveness of iTrace with a user study, and we conclude with a discussion of how iTrace can be broadly used to enhance data exploration in various types of visualizations.

Paperid: 1488, https://arxiv.org/pdf/2505.22906.pdf

Abstract:
While AI programming tools hold the promise of increasing programmers' capabilities and productivity to a remarkable degree, they often exclude users from essential decision-making processes, causing many to effectively "turn off their brains" and over-rely on solutions provided by these systems. These behaviors can have severe consequences in critical domains, like software security. We propose Human-in-the-loop Decoding, a novel interaction technique that allows users to observe and directly influence LLM decisions during code generation, in order to align the model's output with their personal requirements. We implement this technique in HiLDe, a code completion assistant that highlights critical decisions made by the LLM and provides local alternatives for the user to explore. In a within-subjects study (N=18) on security-related tasks, we found that HiLDe led participants to generate significantly fewer vulnerabilities and better align code generation with their goals compared to a traditional code completion assistant.
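
One way to picture "highlighting critical decisions and providing local alternatives" is to flag generation steps where the model's next-token distribution is uncertain and to list the runner-up tokens at those steps. The sketch below does this with an entropy threshold. The abstract does not say how HiLDe identifies critical decisions, so the entropy criterion, the threshold, and the toy token distributions are all assumptions, not the system's actual method.

```python
import math

def entropy_bits(dist: dict) -> float:
    """Shannon entropy (in bits) of a token-probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def critical_decisions(steps, threshold_bits: float = 0.8, top_k: int = 3):
    """Flag generation steps whose next-token distribution is uncertain.

    steps: one dict per generated token, mapping candidate tokens to
    probabilities. Returns the position, the greedy choice, and up to top_k
    alternatives for every step at or above the entropy threshold.
    """
    flagged = []
    for position, dist in enumerate(steps):
        if entropy_bits(dist) >= threshold_bits:
            ranked = sorted(dist, key=dist.get, reverse=True)
            flagged.append({
                "position": position,
                "chosen": ranked[0],
                "alternatives": ranked[1:top_k + 1],
            })
    return flagged

# Hypothetical distributions for three steps of completing `hashlib.<algo>`.
steps = [
    {"hashlib": 0.97, "crypt": 0.03},               # confident
    {".": 1.0},                                     # certain
    {"md5": 0.45, "sha256": 0.40, "sha1": 0.15},    # uncertain, security-relevant
]

flagged = critical_decisions(steps)
for item in flagged:
    print(item)
```

In this toy run only the hash-algorithm choice is flagged, which is the kind of security-relevant decision point a user might want to inspect and override.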

Paperid: 1489, https://arxiv.org/pdf/2505.22831.pdf

Abstract:
Web-based activities are fundamentally distributed across webpages. However, conventional browsers with stacks of tabs fail to support operating and synthesizing large volumes of information across pages. While recent AI systems enable fully automated web browsing and information synthesis, they often diminish user agency and hinder contextual understanding. Therefore, we explore how AI could instead augment users' interactions with content across webpages and mitigate cognitive and manual efforts. Through literature on information tasks and web browsing challenges, and an iterative design process, we present a rich set of novel interactions with our prototype web browser, Orca. Leveraging AI, Orca supports user-driven exploration, operation, organization, and synthesis of web content at scale. To enable browsing at scale, webpages are treated as malleable materials that humans and AI can collaboratively manipulate and compose into a malleable, dynamic, and browser-level workspace. Our evaluation revealed an increased "appetite" for information foraging, enhanced user control, and more flexibility in sensemaking across a broader information landscape on the web.

Paperid: 1490, https://arxiv.org/pdf/2505.21907.pdf

Abstract:
AI copilots represent a new generation of AI-powered systems designed to assist users, particularly knowledge workers and developers, in complex, context-rich tasks. As these systems become more embedded in daily workflows, personalization has emerged as a critical factor for improving usability, effectiveness, and user satisfaction. Central to this personalization is preference optimization: the system's ability to detect, interpret, and align with individual user preferences. While prior work in intelligent assistants and optimization algorithms is extensive, their intersection within AI copilots remains underexplored. This survey addresses that gap by examining how user preferences are operationalized in AI copilots. We investigate how preference signals are sourced, modeled across different interaction stages, and refined through feedback loops. Building on a comprehensive literature review, we define the concept of an AI copilot and introduce a taxonomy of preference optimization techniques across pre-, mid-, and post-interaction phases. Each technique is evaluated in terms of advantages, limitations, and design implications. By consolidating fragmented efforts across AI personalization, human-AI interaction, and language model adaptation, this work offers both a unified conceptual foundation and a practical design perspective for building user-aligned, persona-aware AI copilots that support end-to-end adaptability and deployment.

Paperid: 1491, https://arxiv.org/pdf/2505.20916.pdf

Abstract:
Users often struggle to navigate the privacy / publicity boundary in sharing images online: they may lack awareness of image privacy risks and/or the ability to apply effective mitigation strategies. To address this challenge, we introduce and evaluate Imago Obscura, an AI-powered, image-editing copilot that enables users to identify and mitigate privacy risks with images they intend to share. Driven by design requirements from a formative user study with 7 image-editing experts, Imago Obscura enables users to articulate their image-sharing intent and privacy concerns. The system uses these inputs to surface contextually pertinent privacy risks, and then recommends and facilitates application of a suite of obfuscation techniques found to be effective in prior literature -- e.g., inpainting, blurring, and generative content replacement. We evaluated Imago Obscura with 15 end-users in a lab study and found that it greatly improved users' awareness of image privacy risks and their ability to address those risks, allowing them to make more informed sharing decisions.

Paperid: 1492, https://arxiv.org/pdf/2505.18385.pdf

Abstract:
Effective communication between AI and humans is essential for successful human-AI co-creation. However, many current co-creative AI systems lack effective communication, which limits their potential for collaboration. This paper presents the initial design of the Framework for AI Communication (FAICO) for co-creative AI, developed through a systematic review of 107 full-length papers. FAICO presents key aspects of AI communication and their impact on user experience, offering preliminary guidelines for designing human-centered AI communication. To improve the framework, we conducted a preliminary study with two focus groups involving skilled individuals in AI, HCI, and design. These sessions sought to understand participants' preferences for AI communication, gather their perceptions of the framework, collect feedback for refinement, and explore its use in co-creative domains like collaborative writing and design. Our findings reveal a preference for a human-AI feedback loop over linear communication and emphasize the importance of context in fostering mutual understanding. Based on these insights, we propose actionable strategies for applying FAICO in practice and future directions, marking the first step toward developing comprehensive guidelines for designing effective human-centered AI communication in co-creation.

Paperid: 1493, https://arxiv.org/pdf/2505.18371.pdf

Abstract:
Military weapon systems and command-and-control infrastructure augmented by artificial intelligence (AI) have seen rapid development and deployment in recent years. However, the sociotechnical impacts of AI on combat systems, military decision-making, and the norms of warfare have been understudied. We focus on a specific subset of lethal autonomous weapon systems (LAWS) that use AI for targeting or battlefield decisions. We refer to this subset as AI-powered lethal autonomous weapon systems (AI-LAWS) and argue that they introduce novel risks -- including unanticipated escalation, poor reliability in unfamiliar environments, and erosion of human oversight -- all of which threaten both military effectiveness and the openness of AI research. These risks cannot be addressed by high-level policy alone; effective regulation must be grounded in the technical behavior of AI models. We argue that AI researchers must be involved throughout the regulatory lifecycle. Thus, we propose a clear, behavior-based definition of AI-LAWS -- systems that introduce unique risks through their use of modern AI -- as a foundation for technically grounded regulation, given that existing frameworks do not distinguish them from conventional LAWS. Using this definition, we propose several technically-informed policy directions and invite greater participation from the AI research community in military AI policy discussions.

Paperid: 1494, https://arxiv.org/pdf/2505.17557.pdf

Abstract:
Instructional gestures are essential for teaching, as they enhance communication and support student comprehension. However, existing training methods for developing these embodied skills can be time-consuming, isolating, or overly prescriptive. Research suggests that developing these tacit, experiential skills requires teachers' peer learning, where they learn from each other and build shared knowledge. This paper introduces Novobo, an apprentice AI-agent stimulating teachers' peer learning of instructional gestures through verbal and bodily inputs. Positioning the AI as a mentee employs the learning-by-teaching paradigm, aiming to promote deliberate reflection and active learning. Novobo encourages teachers to evaluate its generated gestures and invites them to provide demonstrations. An evaluation with 30 teachers in 10 collaborative sessions showed Novobo prompted teachers to share tacit knowledge through conversation and movement. This process helped teachers externalize, exchange, and internalize their embodied knowledge, promoting collaborative learning and building a shared understanding of instructional gestures within the local teaching community. This work advances understanding of how teachable AI agents can enhance collaborative learning in teacher professional development, offering valuable design insights for leveraging AI to promote the sharing and construction of embodied and practical knowledge.

Paperid: 1495, https://arxiv.org/pdf/2505.17374.pdf

Abstract:
The field of Multimodal Large Language Models (MLLMs) has made remarkable progress in visual understanding tasks, presenting a vast opportunity to predict the perceptual and emotional impact of charts. However, it also raises concerns, as many applications of LLMs are based on overgeneralized assumptions from a few examples, lacking sufficient validation of their performance and effectiveness. We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors. Using the dataset as ground truth, we evaluated capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts. Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.

Paperid: 1496, https://arxiv.org/pdf/2505.17343.pdf

Abstract:
This paper investigates the feasibility of fusing two eye-centric authentication modalities, eye movements and periocular images, within a calibration-free authentication system. While each modality has independently shown promise for user authentication, their combination within a unified gaze-estimation pipeline has not been thoroughly explored at scale. In this report, we propose a multimodal authentication system and evaluate it using a large-scale in-house dataset comprising 9202 subjects with an eye tracking (ET) signal quality equivalent to a consumer-facing virtual reality (VR) device. Our results show that the multimodal approach consistently outperforms both unimodal systems across all scenarios, surpassing the FIDO benchmark. The integration of a state-of-the-art machine learning architecture contributed significantly to the overall authentication performance at scale, driven by the model's ability to capture authentication representations and the complementary discriminative characteristics of the fused modalities.

Paperid: 1497, https://arxiv.org/pdf/2505.15790.pdf

Abstract:
Innovators transform the world by understanding where services are successfully meeting customers' needs and then using this knowledge to identify failsafe opportunities for innovation. Pre-trained models have changed the AI innovation landscape, making it faster and easier to create new AI products and services. Understanding where pre-trained models are successful is critical for supporting AI innovation. Unfortunately, the hype cycle surrounding pre-trained models makes it hard to know where AI can really be successful. To address this, we investigated pre-trained model applications developed by HCI researchers as a proxy for commercially successful applications. The research applications demonstrate technical capabilities, address real user needs, and avoid ethical challenges. Using an artifact analysis approach, we categorized capabilities, opportunity domains, data types, and emerging interaction design patterns, uncovering some of the opportunity space for innovation with pre-trained models.

Paperid: 1498, https://arxiv.org/pdf/2505.14872.pdf

Abstract:
As the population of older adults increases, there is a growing need for support for them to age in place. This is exacerbated by the growing number of individuals struggling with cognitive decline and the shrinking number of youth who provide care for them. Artificially intelligent agents could provide cognitive support to older adults experiencing memory problems, and they could help informal caregivers with coordination tasks. To better understand this possible future, we conducted a speed-dating-with-storyboards study to reveal invisible social boundaries that might keep older adults and their caregivers from accepting and using agents. We found that healthy older adults worry that accepting agents into their homes might increase their chances of developing dementia. At the same time, they want immediate access to agents that know them well if they should experience cognitive decline. Older adults in the early stages of cognitive decline expressed a desire for agents that can ease the burden they saw themselves becoming for their caregivers. They also speculated that an agent who really knew them well might be an effective advocate for their needs when they were less able to advocate for themselves. That is, the agent may need to transition from being unremarkable to remarkable. Based on these findings, we present design opportunities and considerations for agents and articulate directions of future research.

Paperid: 1499, https://arxiv.org/pdf/2505.14816.pdf

Abstract:
We present a structured design methodology for creating semantically-resonant abstract patterns, making the pattern design process accessible to the general public. Semantically-resonant patterns are those that intuitively evoke the concept they represent within a specific set (e.g., in a vegetable concept set, small dots for olives and large dots for tomatoes), analogous to the concept of semantically-resonant colors (e.g., using olive green for olives and red for tomatoes). Previous research has shown that semantically-resonant colors can improve chart reading speed, and designers have made attempts to integrate semantic cues into abstract pattern designs. However, a systematic framework for developing such patterns was lacking. To bridge this gap, we conducted a series of workshops with design experts, resulting in a design methodology for creating semantically-resonant abstract patterns. We evaluated our design methodology through another series of workshops with non-design participants. The results indicate that our proposed design methodology effectively supports the general public in designing semantically-resonant abstract patterns for both abstract and concrete concepts.

Paperid: 1500, https://arxiv.org/pdf/2505.12780.pdf

Abstract:
Recent advancements in HCI and AI have predominantly centered on individual user experiences, often neglecting the emergent dynamics of group interactions. This provocation introduces Group Experience (GX) to capture the collective perceptual, emotional, and cognitive dimensions that arise when individuals interact in cohesive groups. We challenge the conventional Human-centered AI paradigm and propose Group-centered AI (GCAI) as a framework that actively mediates group dynamics, amplifies diverse voices, and fosters ethical collective decision-making. Drawing on social psychology, organizational behavior, and group dynamics, we outline a group-centered design approach that balances individual autonomy with collective interests while developing novel evaluative metrics. Our analysis emphasizes rethinking traditional methodologies that focus solely on individual outcomes and advocates for innovative strategies to capture group collaboration. We call on researchers to bridge the gap between micro-level experiences and macro-level impacts, ultimately enriching and transforming collaborative human interactions.

Paperid: 1501, https://arxiv.org/pdf/2505.10816.pdf

Abstract:
Industry 4.0 is transforming manufacturing and logistics by integrating robots into shared human environments, such as factories, warehouses, and healthcare facilities. However, the risk of human-robot collisions, especially in Non-Line-of-Sight (NLoS) scenarios like around corners, remains a critical challenge. Existing solutions, such as vision-based and LiDAR systems, often fail under occlusion, lighting constraints, or privacy concerns, while RF-based systems are limited by range and accuracy. To address these limitations, we propose mmMirror, a novel system leveraging a Van Atta Array-based millimeter-wave (mmWave) reconfigurable intelligent reflecting surface (IRS) for precise, device-free NLoS localization. mmMirror integrates seamlessly with existing frequency-modulated continuous-wave (FMCW) radars and offers: (i) robust NLoS localization with centimeter-level accuracy at ranges up to 3 m, (ii) seamless uplink and downlink communication between radar and IRS, (iii) support for multi-radar and multi-target scenarios via dynamic beam steering, and (iv) reduced scanning latency through adaptive time slot allocation. Implemented using commodity 24 GHz radars and a PCB-based IRS prototype, mmMirror demonstrates its potential in enabling safe human-robot interactions in dynamic and complex environments.

Paperid: 1502, https://arxiv.org/pdf/2505.10661.pdf

Abstract:
Human-AI collaboration is increasingly relevant in consequential areas where AI recommendations support human discretion. However, human-AI teams' effectiveness, capability, and fairness highly depend on human perceptions of AI. Positive fairness perceptions have been shown to foster trust and acceptance of AI recommendations. Yet, work on confirmation bias highlights that humans selectively adhere to AI recommendations that align with their expectations and beliefs -- despite not being necessarily correct or fair. This raises the question of whether confirmation bias also transfers to the alignment of gender bias between human and AI decisions. In our study, we examine how gender bias alignment influences fairness perceptions and reliance. The results of a 2x2 between-subject study highlight the connection between gender bias alignment, fairness perceptions, and reliance, demonstrating that merely constructing a "formally fair" AI system is insufficient for optimal human-AI collaboration; ultimately, AI recommendations will likely be overridden if biases do not align.

Paperid: 1503, https://arxiv.org/pdf/2505.10309.pdf

Abstract:
Commonsense intelligence in machines is often assessed by static benchmarks that compare a model's output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a novel method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model's judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.

Paperid: 1504, https://arxiv.org/pdf/2505.09901.pdf

Abstract:
Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making, and point to potential areas for improvement.
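To make the distinction between random and directed exploration concrete, the following is a minimal sketch of a stationary Bernoulli bandit with two textbook policies: softmax sampling (random exploration) and an upper-confidence-bound rule (directed exploration). It is not the choice model or task used in the paper; the function names, the temperature of 0.1, the bonus weight of 1.0, and the arm probabilities in the usage note are all illustrative assumptions.

```python
import math
import random


def softmax_choice(values, temperature, rng):
    """Random exploration: sample an arm with probability ~ exp(value / temperature)."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    return rng.choices(range(len(values)), weights=weights, k=1)[0]


def ucb_choice(values, counts, t, bonus):
    """Directed exploration: add an uncertainty bonus that favors rarely tried arms."""
    for arm, n in enumerate(counts):
        if n == 0:  # try every arm once before using the bonus formula
            return arm
    scores = [v + bonus * math.sqrt(math.log(t) / n) for v, n in zip(values, counts)]
    return max(range(len(values)), key=scores.__getitem__)


def run_bandit(probs, steps, policy, seed=0):
    """Play a stationary Bernoulli bandit; return (cumulative expected regret, pull counts)."""
    rng = random.Random(seed)
    values = [0.0] * len(probs)  # running mean reward per arm
    counts = [0] * len(probs)
    best = max(probs)
    regret = 0.0
    for t in range(1, steps + 1):
        if policy == "softmax":
            arm = softmax_choice(values, 0.1, rng)  # temperature is an assumption
        else:
            arm = ucb_choice(values, counts, t, 1.0)  # bonus weight is an assumption
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
        regret += best - probs[arm]
    return regret, counts
```

Calling `run_bandit([0.2, 0.5, 0.8], 500, "ucb")` and `run_bandit([0.2, 0.5, 0.8], 500, "softmax")` returns the regret and per-arm pull counts for each policy, which can then be compared. A non-stationary variant would additionally change `probs` over time, which this sketch does not do.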

Paperid: 1505, https://arxiv.org/pdf/2505.09208.pdf

Abstract:
With the rapid advancement of generative artificial intelligence (AI), its potential applications in higher education have attracted significant attention. This study investigated how 148 students from diverse engineering disciplines and regions across China used generative AI, focusing on its impact on their learning experience and the opportunities and challenges it poses in engineering education. Based on the surveyed data, we explored four key areas: the frequency and application scenarios of AI use among engineering students, its impact on students' learning and performance, commonly encountered challenges in using generative AI, and future prospects for its adoption in engineering education. The results showed that more than half of the participants reported a positive impact of generative AI on their learning efficiency, initiative, and creativity, with nearly half believing it also enhanced their independent thinking. However, despite acknowledging improved study efficiency, many felt their actual academic performance remained largely unchanged and expressed concerns about the accuracy and domain-specific reliability of generative AI. Our findings provide a first-hand insight into the current benefits and challenges generative AI brings to students, particularly Chinese engineering students, while offering several recommendations, especially from the students' perspective, for effectively integrating generative AI into engineering education.

Paperid: 1506, https://arxiv.org/pdf/2505.09054.pdf

Abstract:
The construction industry is a major contributor to global greenhouse gas emissions, with embodied carbon being a key component. This study develops EcoSphere, an innovative software designed to evaluate and balance embodied and operational carbon emissions with construction and environmental costs in urban planning. Using high-resolution data from the National Structure Inventory, combined with computer vision and natural language processing applied to Google Street View and satellite imagery, EcoSphere categorizes buildings by structural and material characteristics with a bottom-up approach, creating a baseline emissions dataset. By simulating policy scenarios and mitigation strategies, EcoSphere provides policymakers and non-experts with actionable insights for sustainable development in cities and provides them with a vision of the environmental and financial results of their decisions. Case studies in Chicago and Indianapolis showcase how EcoSphere aids in assessing policy impacts on carbon emissions and costs, supporting data-driven progress toward carbon neutrality.

Paperid: 1507, https://arxiv.org/pdf/2505.08628.pdf

Abstract:
Metabolic syndrome (MetS) is a medical condition characterized by abdominal obesity, insulin resistance, hypertension, and hyperlipidemia. It increases the risk of the majority of chronic diseases, including type 2 diabetes mellitus, and affects about one quarter of the global population. Therefore, early detection and timely intervention for MetS are crucial. Standard diagnosis of MetS components requires blood tests conducted within medical institutions. However, MetS is frequently underestimated, leading to an unmet need for care in the MetS population. This study aims to diagnose MetS using minimal physiological data and free text about exercise-related activities, both of which are easily obtained in daily life. We collected data from 40 volunteers in a nursing home and used data augmentation to reduce the imbalance. We propose a deep learning framework for classifying MetS that integrates natural language processing (NLP) and exercise monitoring. The best model achieved strong results (AUROC=0.806 and REC=76.3%) under 3-fold cross-validation. Feature importance analysis revealed that text and daily minimum heart rate contribute the most to the classification of MetS. This study demonstrates the potential of data that are easily measurable in daily life for the early diagnosis of MetS, which could contribute to reducing the cost of screening and management for the MetS population.
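For readers unfamiliar with the two reported metrics, here is a minimal sketch of how AUROC and recall (REC) can be computed from binary labels and classifier scores. The labels and scores in the usage note are made-up toy values, not data from the study, and the 0.5 decision threshold is an assumption.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def recall(labels, scores, threshold=0.5):
    """Fraction of true positives that score at or above the threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not positives:
        raise ValueError("recall needs at least one positive example")
    return sum(1 for s in positives if s >= threshold) / len(positives)
```

With toy inputs `labels = [1, 1, 0, 0, 1, 0]` and `scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]`, `auroc` gives 7/9 (seven of the nine positive-negative pairs are ranked correctly) and `recall` gives 2/3. Under k-fold cross-validation, these metrics would be computed on each held-out fold and then aggregated.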

Paperid: 1508, https://arxiv.org/pdf/2505.06591.pdf

Abstract:
This research prepares an automatic pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
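As background on what "discrimination" and "difficulty" mean in IRT, the sketch below implements the standard two-parameter logistic (2PL) model for a dichotomous item. The paper reports a mixed-format analysis, which would also involve models for non-dichotomous items; this sketch covers only the 2PL case, and the parameter values in the usage note are illustrative assumptions.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function.

    theta: examinee ability, a: item discrimination, b: item difficulty.
    Returns the probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)
```

For example, `p_correct(0.0, 1.5, 0.0)` is 0.5, because an examinee whose ability equals the item difficulty has an even chance of answering correctly. A larger `a` makes the curve steeper around `b`, which is why high-discrimination items separate examinees near that ability level more sharply.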

Paperid: 1509, https://arxiv.org/pdf/2505.05923.pdf

Abstract:
In intuitive physics, the process of stacking cubes has become a paradigmatic, canonical task. Even though it gets employed in various shades and complexities, the very fundamental setting with two cubes has not been thoroughly investigated. Furthermore, the majority of settings feature only a reduced, one-dimensional (1D) decision space. In this paper, an experiment is conducted in which participants judge the stability of two cubes stacked on top of each other. It is performed in the full 3D setting, which features a 2D decision surface. The analysis yields a shape of a rotated square for the perceived stability area instead of the commonly reported safety margin in 1D. This implies a more complex decision behavior in humans than previously assumed.
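As a point of reference for the 2D decision surface, the sketch below encodes the physical stability criterion for an idealized version of the task: two equal, rigid, uniform-density cubes with parallel edges, where the top cube is shifted horizontally by (dx, dy). These idealizations and the function name are assumptions for illustration; the exact experimental setup is not described in the abstract, and the sketch does not model the perceived (rotated-square) region.

```python
def is_physically_stable(dx, dy, side=1.0):
    """Return True if the top cube rests stably on the bottom cube.

    Assumes equal, axis-aligned, uniform-density cubes. The top cube stays put
    when its center of mass lies strictly above the bottom cube's top face,
    i.e. when both horizontal offsets are smaller than half the side length.
    """
    half = side / 2.0
    return abs(dx) < half and abs(dy) < half
```

Under these assumptions the physically stable region is an axis-aligned square in the (dx, dy) plane, so `is_physically_stable(0.4, 0.4)` holds while `is_physically_stable(0.6, 0.0)` does not.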

Paperid: 1510, https://arxiv.org/pdf/2505.05694.pdf

Abstract:
Wearable sensors are widely used to collect physiological data and develop stress detection models. However, most studies focus on a single dataset, rarely evaluating model reproducibility across devices, populations, or study conditions. We previously assessed the reproducibility of stress detection models across multiple studies, testing models trained on one dataset against others using heart rate (with R-R interval) and electrodermal activity (EDA). In this study, we extended our stress detection reproducibility analysis to consumer wearable sensors. We compared validated research-grade devices (Biopac MP160, Polar H10, and Empatica E4) with a consumer wearable, the Garmin Forerunner 55s, assessing device-specific stress detection performance by conducting a new stress study on undergraduate students. Thirty-five students completed three standardized stress-induction tasks in a lab setting. Biopac MP160 performed the best, consistent with our expectations of it as the gold standard, though performance varied across devices and models. Combining heart rate variability (HRV) and EDA enhanced stress prediction across most scenarios. However, Empatica E4 showed variability; while HRV and EDA improved stress detection in leave-one-subject-out (LOSO) evaluations (AUROC up to 0.953), device-specific limitations led to underperformance when tested with our pre-trained stress detection tool (AUROC 0.723), highlighting generalizability challenges related to hardware-model compatibility. Garmin Forerunner 55s demonstrated strong potential for real-world stress monitoring, achieving the best mental arithmetic stress detection performance in LOSO (AUROC up to 0.961), comparable to research-grade devices like Polar H10 (AUROC 0.954) and Empatica E4 (AUROC 0.905 with HRV-only model and AUROC 0.953 with HRV+EDA model), with the added advantage of consumer-friendly wearability for free-living contexts.
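The leave-one-subject-out (LOSO) protocol mentioned above can be sketched as follows: every sample from one subject is held out for testing while the model is trained on all remaining subjects, and this is repeated for each subject. The helper name and the subject IDs in the usage note are hypothetical; the sketch shows only the splitting logic, not the study's models or features.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids[i] is the subject that produced sample i. No subject ever
    appears in both the training and the test indices of the same split.
    """
    for subject in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == subject]
        train_idx = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train_idx, test_idx
```

For `subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]`, the generator produces three splits, one per subject. Splitting by subject rather than by sample prevents a model from being scored on data from a person it has already seen, which would otherwise inflate metrics such as AUROC.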

Paperid: 1511, https://arxiv.org/pdf/2505.04152.pdf

Abstract:
Effective communication between providers and their patients influences health and care outcomes. The effectiveness of such conversations has been linked not only to the exchange of clinical information, but also to a range of interpersonal behaviors, commonly referred to as social signals, which are often conveyed through non-verbal cues and shape the quality of the patient-provider relationship. Recent advances in large language models (LLMs) have demonstrated an increasing ability to infer emotional and social behaviors even when analyzing only textual information. As automation also increases in clinical settings, such as for transcription of patient-provider conversations, there is growing potential for LLMs to automatically analyze and extract social behaviors from these interactions. To explore the foundational capabilities of LLMs in tracking social signals in clinical dialogue, we designed task-specific prompts and evaluated model performance across multiple architectures and prompting styles using a highly imbalanced, annotated dataset spanning 20 distinct social signals such as provider dominance and patient warmth. We present the first system capable of tracking all 20 of these coded signals, and uncover patterns in LLM behavior. Further analysis of model configurations and clinical context provides insights for enhancing LLM performance on social signal processing tasks in healthcare settings.

Paperid: 1512, https://arxiv.org/pdf/2505.03618.pdf

Abstract:
Large unlabeled datasets demand efficient and scalable data labeling solutions, in particular when the number of instances and classes is large. This leads to significant visual scalability challenges and imposes a high cognitive load on the users. Traditional instance-centric labeling methods, where (single) instances are labeled in each iteration, struggle to scale effectively in these scenarios. To address these challenges, we introduce cVIL, a Class-Centric Visual Interactive Labeling methodology designed for interactive visual data labeling. By shifting the paradigm from assigning-classes-to-instances to assigning-instances-to-classes, cVIL reduces labeling effort and enhances efficiency for annotators working with large, complex, and class-rich datasets. We propose a novel visual analytics labeling interface built on top of the conceptual cVIL workflow, enabling improved scalability over traditional visual labeling. In a user study, we demonstrate that cVIL can improve labeling efficiency and user satisfaction over instance-centric interfaces. The effectiveness of cVIL is further demonstrated through a usage scenario, showcasing its potential to alleviate cognitive load and support experts in managing extensive labeling tasks efficiently.

Paperid: 1513, https://arxiv.org/pdf/2505.03427.pdf

Abstract:
Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their efficacy in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation. We conducted an extensive evaluation with five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.

Paperid: 1514, https://arxiv.org/pdf/2505.02418.pdf

Abstract:
We present SymbioticRAG, a novel framework that fundamentally reimagines Retrieval-Augmented Generation (RAG) systems by establishing a bidirectional learning relationship between humans and machines. Our approach addresses two critical challenges in current RAG systems: the inherently human-centered nature of relevance determination and users' progression from "unconscious incompetence" in query formulation. SymbioticRAG introduces a two-tier solution where Level 1 enables direct human curation of retrieved content through interactive source document exploration, while Level 2 aims to build personalized retrieval models based on captured user interactions. We implement Level 1 through three key components: (1) a comprehensive document processing pipeline with specialized models for layout detection, OCR, and extraction of tables, formulas, and figures; (2) an extensible retriever module supporting multiple retrieval strategies; and (3) an interactive interface that facilitates both user engagement and interaction data logging. We experiment with a Level 2 implementation via a retrieval strategy that incorporates LLM-summarized user intent derived from user interaction logs. To maintain high-quality data preparation, we develop a human-on-the-loop validation interface that improves pipeline output while advancing research in specialized extraction tasks. Evaluation across three scenarios (literature review, geological exploration, and education) demonstrates significant improvements in retrieval relevance and user satisfaction compared to traditional RAG approaches. To facilitate broader research and further advancement of the SymbioticRAG Level 2 implementation, we will make our system openly accessible to the research community.

Paperid: 1515, https://arxiv.org/pdf/2505.02209.pdf

Abstract:
Modeling domain intent within an evolving domain structure presents a significant challenge for domain-specific conversational recommendation systems (CRS). The conventional approach involves training an intent model using utterance-intent pairs. However, as new intents and patterns emerge, the model must be continuously updated while preserving existing relationships and maintaining efficient retrieval. This process leads to substantial growth in utterance-intent pairs, making manual labeling increasingly costly and impractical. In this paper, we propose an efficient solution for constructing a dynamic hierarchical structure that minimizes the number of user utterances required to achieve adequate domain knowledge coverage. To this end, we introduce a neural network-based attention-driven hierarchical clustering algorithm designed to optimize intent grouping using minimal data. The proposed method builds upon and integrates concepts from two existing flat clustering algorithms, DEC and NAM, both of which utilize neural attention mechanisms. We apply our approach to a curated subset of 44,000 questions from the business food domain. Experimental results demonstrate that constructing the hierarchy using a stratified sampling strategy significantly reduces the number of questions needed to represent the evolving intent structure. Our findings indicate that this approach enables efficient coverage of dynamic domain knowledge without frequent retraining, thereby enhancing scalability and adaptability in domain-specific CRSs.
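The stratified sampling idea mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how strata are defined or how many questions are drawn per stratum, so the coarse intent label, the `stratified_sample` helper, and the sampling fraction below are all hypothetical choices made for illustration only.

```python
import math
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, fraction, seed=0):
    """Draw about `fraction` of the items from every stratum (at least one each).

    `stratum_of` maps an item to its stratum key; this is an assumed interface,
    not something described in the abstract.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[stratum_of(item)].append(item)

    sample = []
    for key in sorted(groups):  # sorted for a deterministic stratum order
        members = groups[key]
        k = max(1, math.ceil(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical toy data: (question, coarse intent label).
questions = [(f"q{i}", "menu" if i % 4 else "delivery") for i in range(40)]
subset = stratified_sample(questions, stratum_of=lambda q: q[1], fraction=0.25)
print(len(subset), {label for _, label in subset})
```

Sampling within each stratum keeps rare intents represented even when the overall sample is small, which is the property a hierarchy built from few questions would presumably rely on.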

Paperid: 1516, https://arxiv.org/pdf/2505.02003.pdf

Abstract:
Closed-loop brain stimulation holds potential as a personalized treatment for drug-resistant epilepsy (DRE) but still suffers from limitations that result in highly variable efficacy. First, stimulation is typically delivered upon detection of the seizure to abort rather than prevent it; second, the stimulation parameters are established by trial and error, requiring lengthy rounds of fine-tuning, which delay steady-state therapeutic efficacy. Here, we address these limitations by leveraging the potential of neuromorphic computing. We present a neuromorphic reservoir computing hardware system capable of driving real-time personalized free-run stimulations based on seizure forecasting, wherein each forecast triggers an electrical pulse rather than an arbitrarily predefined fixed-frequency stimulus train. The system achieves 83.33% accuracy in forecasting seizure occurrences during the training phase. We validate the system using hippocampal spheroids coupled to a 3D microelectrode array as a simplified testbed, achieving seizure reduction >97% during the real-time processing while primarily using instantaneous stimulation frequencies within 20 Hz, well below what is typically used in clinical practice. Our work demonstrates the potential of neuromorphic systems as a next-generation neuromodulation strategy for personalized DRE treatment, leveraging their sparse and event-driven processing for real-time applications.

Paperid: 1517, https://arxiv.org/pdf/2505.01520.pdf

Abstract:
Approximately 1 in 100 children worldwide are diagnosed with Autism Spectrum Disorder (ASD), and 46% to 89% experience significant feeding difficulties. Although mobile health (mHealth) applications offer potential support for caregivers, the quality and relevance of apps targeting autism-related feeding issues remain unclear. This systematic review evaluated mobile applications available on the Apple App Store and the Google Play Store between September and October 2024. The searches were carried out using 15 predefined terms (e.g., "child autism feeding", "child autism food"). Applications were eligible if they were in English, free to download, updated within the past year, explicitly addressed feeding in children with autism, accessible in Africa, and had more than 100 downloads. Of the 326 apps identified, only two iOS applications met all inclusion criteria; no Android apps qualified. Behavior Change Wheel (BCW) analysis showed that the selected applications incorporated multiple intervention functions, such as education, training, enablement, incentivization, and modeling, though none addressed the full spectrum of behavioral strategies. Mobile App Rating Scale (MARS) indicated moderate to high usability, with features such as sensory-friendly food routines and structured caregiver tools. However, both apps lacked clinical validation and comprehensive customization. These findings highlight a critical gap in the availability of evidence-based high-quality mHealth tools for caregivers managing ASD-related feeding challenges and underscore the need for professionally developed and culturally sensitive digital solutions.

Paperid: 1518, https://arxiv.org/pdf/2505.00018.pdf

Abstract:
This position paper critically surveys a broad spectrum of recent empirical developments on human-AI agent collaboration, highlighting both their technical achievements and persistent gaps. We observe a lack of a unifying theoretical framework that can coherently integrate these varied studies, especially when tackling open-ended, complex tasks. To address this, we propose a novel conceptual architecture: one that systematically interlinks the technical details of multi-agent coordination, knowledge management, cybernetic feedback loops, and higher-level control mechanisms. By mapping existing contributions, from symbolic AI techniques and connectionist LLM-based agents to hybrid organizational practices, onto this proposed framework (Hierarchical Exploration-Exploitation Net), our approach facilitates revision of legacy methods and inspires new work that fuses qualitative and quantitative paradigms. The paper's structure allows it to be read from any section, serving equally as a critical review of technical implementations and as a forward-looking reference for designing or extending human-AI symbioses. Together, these insights offer a stepping stone toward deeper co-evolution of human cognition and AI capability.

Paperid: 1519, https://arxiv.org/pdf/2504.21702.pdf

Abstract:
The promotion of a healthy lifestyle is one of the main drivers of an individual's overall physical and psycho-emotional well-being. Digital technologies are increasingly adopted as "facilitators" for this goal, to raise awareness and solicit healthy lifestyle habits. This study investigates the effects of adopting a digital conversational tool to influence awareness creation and behavioural change in the context of a well-being lifestyle. Our aim is to collect evidence of the aspects that must be taken into account when designing and implementing such tools in well-being promotion campaigns. To this end, we created a conversational application for promoting well-being and healthy lifestyles, which presents relevant information and asks specific questions to its intended users through a chat interface; the conversational tool presents itself as a well-being counsellor named Allegra and follows a coaching approach to structure the interaction with the user. In our user study, participants were asked to first interact with Allegra in one of three experimental conditions, corresponding to different conversational styles; then, they answered a questionnaire about their experience. The questionnaire items were related to intrinsic motivation factors as well as awareness creation and behavioural change. The collected data allowed us to assess the hypotheses of our model, which relates those variables. Our results confirm the positive effect of intrinsic motivation factors on both awareness creation and behavioural intention in the context of well-being and healthy lifestyle; on the other hand, we did not record any statistically significant effect of different language and communication styles on the outcomes.

Paperid: 1520, https://arxiv.org/pdf/2504.20910.pdf

Abstract:
Red-teaming is a core part of the infrastructure that ensures that AI models do not produce harmful content. Unlike past technologies, the black box nature of generative AI systems necessitates a uniquely interactional mode of testing, one in which individuals on red teams actively interact with the system, leveraging natural language to simulate malicious actors and solicit harmful outputs. This interactional labor done by red teams can result in mental health harms that are uniquely tied to the adversarial engagement strategies necessary to effectively red team. The importance of ensuring that generative AI models do not propagate societal or individual harm is widely recognized -- one less visible foundation of end-to-end AI safety is also the protection of the mental health and wellbeing of those who work to keep model outputs safe. In this paper, we argue that the unmet mental health needs of AI red-teamers are a critical workplace safety concern. By analyzing the unique mental health impacts associated with the labor done by red teams, we propose potential individual and organizational strategies that could be used to meet these needs and safeguard the mental health of red-teamers. We develop our proposed strategies by drawing parallels between common red-teaming practices and interactional labor common to other professions (including actors, mental health professionals, conflict photographers, and content moderators), describing how individuals and organizations within these professional spaces safeguard their mental health given similar psychological demands. Drawing on these protective practices, we describe how safeguards could be adapted for the distinct mental health challenges experienced by red teaming organizations as they mitigate emerging technological risks on the new digital frontlines.

Paperid: 1521, https://arxiv.org/pdf/2504.20035.pdf

Abstract:
Off-the-shelf smartphone-based AR systems typically use a single front-facing or rear-facing camera, which restricts user interactions to a narrow field of view and small screen size, thus reducing their practicality. We present Cam-2-Cam, an interaction concept implemented in three smartphone-based AR applications with interactions that span both cameras. Results from our qualitative analysis conducted with 30 participants yielded two major design lessons that explore the interaction space of smartphone AR while maintaining critical AR interface attributes like embodiment and immersion: (1) Balancing Contextual Relevance and Feedback Quality, which outlines a delicate balance between implementing familiar interactions people perform in the real world and the quality of multimodal AR responses, and (2) Preventing Disorientation using Simultaneous Capture and Alternating Cameras, which details how to prevent disorientation during AR interactions using the two distinct camera techniques we implemented in the paper. Additionally, we consider observed user assumptions or natural tendencies to inform future implementations of dual-camera setups for smartphone-based AR. We envision our design lessons as an initial step toward expanding the interaction space of smartphone-based AR, potentially driving broader adoption and overcoming limitations of single-camera AR.

Paperid: 1522, https://arxiv.org/pdf/2504.19703.pdf

Abstract:
Bias in generative Text-to-Image (T2I) models is a known issue, yet systematically analyzing such models' outputs to uncover it remains challenging. We introduce the Visual Bias Explorer (ViBEx) to interactively explore the output space of T2I models to support the discovery of visual bias. ViBEx introduces a novel flexible prompting tree interface in combination with zero-shot bias probing using CLIP for quick and approximate bias exploration. It additionally supports in-depth confirmatory bias analysis through visual inspection of forward, intersectional, and inverse bias queries. ViBEx is model-agnostic and publicly available. In four case study interviews, experts in AI and ethics were able to discover visual biases that have so far not been described in literature.

Paperid: 1523, https://arxiv.org/pdf/2504.19061.pdf

Abstract:
Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, including admission reasons, major in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization. Our results reveal that while the LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission reasons and hospitalization events, they are generally less consistent when it comes to identifying follow-up recommendations, highlighting broader challenges in leveraging LLMs for comprehensive summarization.

Paperid: 1524, https://arxiv.org/pdf/2504.17997.pdf

Abstract:
Adolescents' uncontrolled exposure to digital content can negatively impact their development. Traditional regulatory methods, such as time limits or app restrictions, often take a rigid approach, ignoring adolescents' decision-making abilities. Another issue is the lack of content and services tailored for adolescents. To address this, we propose Chatperone, a concept of a system that provides adaptive scaffolding to support adolescents. Chatperone fosters healthy mobile interactions through three key modules: Perception, Negotiation, and Moderation. This paper outlines these modules' functionalities and discusses considerations for real-world implementation.

Paperid: 1525, https://arxiv.org/pdf/2504.17964.pdf

Abstract:
This paper examines how graduate students develop frameworks for evaluating machine-generated expertise in web-based interactions with large language models (LLMs). Through a qualitative study combining surveys, LLM interaction transcripts, and in-depth interviews with 14 graduate students, we identify patterns in how these emerging professionals assess and engage with AI-generated content. Our findings reveal that students construct evaluation frameworks shaped by three main factors: professional identity, verification capabilities, and system navigation experience. Rather than uniformly accepting or rejecting LLM outputs, students protect domains central to their professional identities while delegating others--with managers preserving conceptual work, designers safeguarding creative processes, and programmers maintaining control over core technical expertise. These evaluation frameworks are further influenced by students' ability to verify different types of content and their experience navigating complex systems. This research contributes to web science by highlighting emerging human-genAI interaction patterns and suggesting how platforms might better support users in developing effective frameworks for evaluating machine-generated expertise signals in AI-mediated web environments.

Paperid: 1526, https://arxiv.org/pdf/2504.17677.pdf

Abstract:
The rise of AI, especially Large Language Models, presents challenges and opportunities to integrate such technology into the classroom. AI has the potential to revolutionize education by helping teaching staff with various tasks, such as personalizing their teaching methods, but it also raises concerns, for example, about the degradation of student-teacher interactions and user privacy. Based on interviews with teaching staff, this paper introduces INSIGHT, a proof of concept to combine various AI tools to assist teaching staff and students in the process of solving exercises. INSIGHT has a modular design that allows it to be integrated into various higher education courses. We analyze students' questions to an LLM by extracting keywords, which we use to dynamically build an FAQ from students' questions and provide new insights for the teaching staff to use for more personalized face-to-face support. Future work could build upon INSIGHT by using the collected data to provide adaptive learning and adjust content based on student progress and learning styles to offer a more interactive and inclusive learning experience.

Paperid: 1527, https://arxiv.org/pdf/2504.16898.pdf

Abstract:
Exploratory analysis of a text corpus is essential for assessing data quality and developing meaningful hypotheses. Text analysis relies on understanding documents through structured attributes spanning various granularities of the documents such as words, phrases, sentences, topics, or clusters. However, current text visualization tools typically adopt a fixed representation tailored to specific tasks or domains, requiring users to switch tools as their analytical goals change. To address this limitation, we present Texture, a general-purpose interactive text exploration tool. Texture introduces a configurable data schema for representing text documents enriched with descriptive attributes. These attributes can appear at arbitrary levels of granularity in the text and possibly have multiple values, including document-level attributes, multi-valued attributes (e.g., topics), fine-grained span-level attributes (e.g., words), and vector embeddings. The system then combines existing interactive methods for text exploration into a single interface that provides attribute overview visualizations, supports cross-filtering attribute charts to explore subsets, uses embeddings for a dataset overview and similar instance search, and contextualizes filters in the actual documents. We evaluated Texture through a two-part user study with 10 participants from varied domains who each analyzed their own dataset in a baseline session and then with Texture. Texture was able to represent all of the previously derived dataset attributes, enabled participants to more quickly iterate during their exploratory analysis, and discover new insights about their data. Our findings contribute to the design of scalable, interactive, and flexible exploration systems that improve users' ability to make sense of text data.

Paperid: 1528, https://arxiv.org/pdf/2504.14927.pdf

Abstract:
This study proposes a multimodal neural network-based approach to predict segment access frequency in lecture archives. These archives, widely used as supplementary resources in modern education, often consist of long, unedited recordings that make it difficult to keep students engaged. Captured directly from face-to-face lectures without post-processing, they lack visual appeal. Meanwhile, the increasing volume of recorded material renders manual editing and annotation impractical. Automatically detecting high-engagement segments is thus crucial for improving accessibility and maintaining learning effectiveness. Our research focuses on real classroom lecture archives, characterized by unedited footage, no additional hardware (e.g., eye-tracking), and limited student numbers. We approximate student engagement using segment access frequency as a proxy. Our model integrates multimodal features from teachers' actions (via OpenPose and optical flow), audio spectrograms, and slide page progression. These features are deliberately chosen for their non-semantic nature, making the approach applicable regardless of lecture language. Experiments show that our best model achieves a Pearson correlation of 0.5143 in 7-fold cross-validation and 69.32 percent average accuracy in a downstream three-class classification task. The results, obtained with high computational efficiency and a small dataset, demonstrate the practical feasibility of our system in real-world educational contexts.
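For readers unfamiliar with the evaluation metric quoted above, the Pearson correlation between predicted and observed segment access frequencies can be computed as in the following minimal sketch. The `pearson_r` helper and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical access counts per lecture segment and model predictions.
observed = [12, 30, 7, 22, 15]
predicted = [10, 25, 9, 20, 18]
print(round(pearson_r(observed, predicted), 4))
```

A value near 1 means the model ranks and scales segments much like the real access logs do, while a value near 0 means its predictions carry little linear information about access frequency.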

Paperid: 1529, https://arxiv.org/pdf/2504.14071.pdf

Abstract:
With the rise of artificial intelligence (AI), there has been increasing interest in human-AI co-creation in a variety of artistic domains, including music, as AI-driven systems are frequently able to generate human-competitive artifacts. Now, the implications of such systems for musical practice are being investigated. We report on a thorough evaluation of the user adoption of the Multi-Track Music Machine (MMM) as a co-creative AI tool for music composers. To do this, we integrate MMM into Cubase, a popular Digital Audio Workstation (DAW) by Steinberg, by producing a "1-parameter" plugin interface named MMM-Cubase (MMM-C), which enables human-AI co-composition. We contribute a methodological assemblage as a 3-part mixed method study measuring usability, user experience and technology acceptance of the system across two groups of expert-level composers: hobbyists and professionals. Results show positive usability and acceptance scores. Users report experiences of novelty, surprise and ease of use from using the system, and limitations on controllability and predictability of the interface when generating music. Findings indicate no significant difference between the two user groups.

Paperid: 1530, https://arxiv.org/pdf/2504.14058.pdf

Abstract:
With the rise of artificial intelligence in recent years, there has been a rapid increase in its application to creative domains, including music. Many existing systems apply machine learning approaches to the problem of computer-assisted music composition (CAC). Calliope is a web application that assists users in performing a variety of multi-track composition tasks in the symbolic domain. The user can upload Musical Instrument Digital Interface (MIDI) files, visualize and edit MIDI tracks, and generate partial (via bar in-filling) or complete multi-track content using the Multi-Track Music Machine (MMM). Generation of new MIDI excerpts can be done in batch and can be combined with active playback listening for an enhanced assisted-composition workflow. The user can export generated MIDI materials or directly stream MIDI playback from the system to their favorite Digital Audio Workstation (DAW). We present a demonstration of the system, its features, and its generative parameters, and describe the co-creative workflows that it affords.

Paperid: 1531, https://arxiv.org/pdf/2504.14055.pdf

Abstract:
With the recent developments in machine intelligence and web technologies, new generative music systems are being explored for assisted composition using machine learning techniques on the web. Such systems are built for various tasks such as melodic, harmonic or rhythm generation, music interpolation, continuation and style imitation. In this paper, we introduce Apollo, an interactive music application for generating symbolic phrases of conventional western music using corpus-based style imitation techniques. In addition to enabling the construction and management of symbolic musical corpora, the system makes it possible for music artists and researchers to generate new musical phrases in the style of the proposed corpus. The system is available as a desktop application. The generated symbolic music materials, encoded in the MIDI format, can be exported or streamed for various purposes including using them as seed material for musical projects. We present the system design and implementation details, and conclude with a discussion of future work for the system.

Paperid: 1532, https://arxiv.org/pdf/2504.14045.pdf

Abstract:
Metacognition--the capacity to monitor and evaluate one's own knowledge and performance--is foundational to human decision-making, learning, and communication. As large language models (LLMs) become increasingly embedded in both high-stakes and widespread low-stakes contexts, it is important to assess whether, how, and to what extent they exhibit metacognitive abilities. Here, we provide an overview of current knowledge of LLMs' metacognitive capacities, how they might be studied, and how they relate to our knowledge of metacognition in humans. We show that while humans and LLMs can sometimes appear quite aligned in their metacognitive capacities and behaviors, it is clear many differences remain; attending to these differences is important for enhancing human-AI collaboration. Finally, we discuss how endowing future LLMs with more sensitive and more calibrated metacognition may also help them develop new capacities such as more efficient learning, self-direction, and curiosity.

Paperid: 1533, https://arxiv.org/pdf/2504.13899.pdf

Abstract:
Counterfactual explanations are a widely used approach in Explainable AI, offering actionable insights into decision-making by illustrating how small changes to input data can lead to different outcomes. Despite their importance, evaluating the quality of counterfactual explanations remains an open problem. Traditional quantitative metrics, such as sparsity or proximity, fail to fully account for human preferences in explanations, while user studies are insightful but not scalable. Moreover, relying only on a single overall satisfaction rating does not lead to a nuanced understanding of why certain explanations are effective or not. To address this, we analyze a dataset of counterfactual explanations that were evaluated by 206 human participants, who rated not only overall satisfaction but also seven explanatory criteria: feasibility, coherence, complexity, understandability, completeness, fairness, and trust. Modeling overall satisfaction as a function of these criteria, we find that feasibility (the actionability of suggested changes) and trust (the belief that the changes would lead to the desired outcome) consistently stand out as the strongest predictors of user satisfaction, though completeness also emerges as a meaningful contributor. Crucially, even excluding feasibility and trust, other metrics explain 58% of the variance, highlighting the importance of additional explanatory qualities. Complexity appears independent, suggesting more detailed explanations do not necessarily reduce satisfaction. Strong metric correlations imply a latent structure in how users judge quality, and demographic background significantly shapes ranking patterns. These insights inform the design of counterfactual algorithms that adapt explanatory qualities to user expertise and domain context.
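The phrase "modeling overall satisfaction as a function of these criteria" and the reported share of explained variance can be made concrete with a small sketch. The abstract does not specify the model family, so ordinary least squares is an assumption here, and the helper names (`solve`, `fit_ols`, `r_squared`) and the toy ratings are hypothetical; only two of the seven criteria are used to keep the example short.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations; intercept comes first."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def r_squared(rows, y, coef):
    """Share of the variance in y explained by the fitted linear model."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical ratings: each row is (feasibility, trust) for one explanation.
ratings = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
# Toy satisfaction scores generated exactly as 1 + 0.5 * feasibility + 0.25 * trust.
satisfaction = [1 + 0.5 * f + 0.25 * t for f, t in ratings]

coef = fit_ols(ratings, satisfaction)
print([round(c, 3) for c in coef], round(r_squared(ratings, satisfaction, coef), 3))
```

Refitting with a subset of the criteria and comparing the resulting R² values is one straightforward way to ask how much variance the remaining criteria explain on their own, which is the kind of comparison the abstract reports.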

Paperid: 1534, https://arxiv.org/pdf/2504.13887.pdf

Abstract:
Despite increasing AI chatbot deployment in public discourse, empirical evidence on their capacity to foster intercultural empathy remains limited. Through a randomized experiment, we assessed how different AI deliberation approaches--cross-cultural deliberation (presenting other-culture perspectives), own-culture deliberation (representing participants' own culture), and non-deliberative control--affect intercultural empathy across American and Latin American participants. Cross-cultural deliberation increased intercultural empathy among American participants through positive emotional engagement, but produced no such effects for Latin American participants, who perceived AI responses as culturally inauthentic despite explicit prompting to represent their cultural perspectives. Our analysis of participant-driven feedback, where users directly flagged and explained culturally inappropriate AI responses, revealed systematic gaps in AI's representation of Latin American contexts that persist despite sophisticated prompt engineering. These findings demonstrate that current approaches to AI cultural alignment--including linguistic adaptation and explicit cultural prompting--cannot fully address deeper representational asymmetries in AI systems. Our work advances both deliberation theory and AI alignment research by revealing how the same AI system can simultaneously promote intercultural understanding for one cultural group while failing for another, with critical implications for designing equitable AI systems for cross-cultural democratic discourse.

Paperid: 1535, https://arxiv.org/pdf/2504.13684.pdf

Abstract:
Human cognition is constrained by processing limitations, leading to cognitive overload and inefficiencies in knowledge synthesis and decision-making. Large Language Models (LLMs) present an opportunity for cognitive augmentation, but their current reactive nature limits their real-world applicability. This position paper explores the potential of context-aware cognitive augmentation, where LLMs dynamically adapt to users' cognitive states and task environments to provide appropriate support. Through a think-aloud study in an exhibition setting, we examine how individuals interact with multi-modal information and identify key cognitive challenges in structuring, retrieving, and applying knowledge. Our findings highlight the need for AI-driven cognitive support systems that integrate real-time contextual awareness, personalized reasoning assistance, and socially adaptive interactions. We propose a framework for AI augmentation that seamlessly transitions between real-time cognitive support and post-experience knowledge organization, contributing to the design of more effective human-centered AI systems.

Paperid: 1536, https://arxiv.org/pdf/2504.13567.pdf

Abstract:
This paper presents PoEmotion, an approach to visualizing emotions in poetry with Chinese calligraphy strokes. Traditional textual emotion analysis often lacks emotional resonance due to its mechanical nature. PoEmotion combines natural language processing with deep learning generative algorithms to create Chinese calligraphy that effectively conveys the emotions in poetry. The created calligraphy represents four fundamental emotions: excitement, anger, sadness, and relaxation, making the visual representation of emotions intuitive and concise. Furthermore, the approach delves into the relationship between time, emotion, and cultural communication. Its goal is to provide a more natural means of communicating emotions through non-verbal mediums to enhance human emotional expression.

Paperid: 1537, https://arxiv.org/pdf/2504.12665.pdf

Abstract:
Drivers' perception of risk determines their acceptance, trust, and use of Automated Driving Systems (ADSs). However, perceived risk is subjective and difficult to evaluate using existing methods. To address this issue, a driver's subjective perceived risk (DSPR) model is proposed, regarding perceived risk as a dynamically triggered mechanism with anisotropy and attenuation. Twenty participants are recruited for a driver-in-the-loop experiment to report their real-time subjective risk ratings (SRRs) when experiencing various automated driving scenarios. A convolutional neural network and bidirectional long short-term memory network with temporal pattern attention (CNN-Bi-LSTM-TPA) is embedded into a semi-supervised learning strategy to predict SRRs, aiming to reduce data noise caused by subjective randomness of participants. The results illustrate that DSPR achieves the highest prediction accuracy of 87.91% in predicting SRRs, compared to three state-of-the-art risk models. The semi-supervised strategy improves accuracy by 20.12%. In addition, the CNN-Bi-LSTM-TPA network achieves the highest accuracy among four different LSTM structures. This study offers an effective method for assessing drivers' perceived risk, providing support for enhancing the safety of ADSs and improving drivers' trust.

Paperid: 1538, https://arxiv.org/pdf/2504.12012.pdf

Abstract:
Hallucinations in Large Language Models (LLMs) are widely regarded as errors - outputs that deviate from factual accuracy. However, in creative or exploratory contexts, these "mistakes" may represent unexpected avenues for innovation. We introduce Purposefully Induced Psychosis (PIP), a novel approach that amplifies LLM hallucinations for imaginative tasks such as speculative fiction, interactive storytelling, and mixed-reality simulations. Drawing on Herman Melville's Moby-Dick, where Pip's "madness" reveals profound insight, we reframe hallucinations as a source of computational imagination rather than a flaw. Our method fine-tunes LLMs to encourage speculative, metaphorical, and surreal outputs - hallucinations that are useful when factual accuracy is not the chief objective. Inspired by the consensual illusions of theater and stage magic, PIP situates these creative missteps in contexts where users willingly suspend disbelief, thereby transforming "errors" into catalysts for new ways of thinking. We discuss potential applications, design principles for ensuring user consent, preliminary observations, and implications for broader AI ethics and human-AI collaboration.

Paperid: 1539, https://arxiv.org/pdf/2504.10961.pdf

Abstract:
As generative AI models, particularly large language models (LLMs), transform educational feedback practices in higher education (HE) contexts, understanding students' perceptions of different sources of feedback becomes crucial for their effective implementation and adoption. This study addresses a critical gap by comparing undergraduate students' trust in LLM, human, and human-AI co-produced feedback in their authentic HE context. More specifically, through a within-subject experimental design involving 91 participants, we investigated factors that predict students' ability to distinguish between feedback types, their perceptions of feedback quality, and potential biases related to the source of feedback. Findings revealed that when the source was blinded, students generally preferred AI and co-produced feedback over human feedback regarding perceived usefulness and objectivity. However, they presented a strong bias against AI when the source of feedback was disclosed. In addition, only AI feedback suffered a decline in perceived genuineness when feedback sources were revealed, while co-produced feedback maintained its positive perception. Educational AI experience improved students' ability to identify LLM-generated feedback and increased their trust in all types of feedback. More years of students' experience using AI for general purposes were associated with lower perceived usefulness and credibility of feedback. These insights offer substantial evidence of the importance of source credibility and the need to enhance both feedback literacy and AI literacy to mitigate bias in student perceptions for AI-generated feedback to be adopted and impact education.

Paperid: 1540, https://arxiv.org/pdf/2504.10708.pdf

Abstract:
Explanations for artificial intelligence (AI) systems are intended to support the people who are impacted by AI systems in high-stakes decision-making environments, such as doctors, patients, teachers, students, housing applicants, and many others. To protect people and support the responsible development of AI, explanations need to be actionable--helping people take pragmatic action in response to an AI system--and contestable--enabling people to push back against an AI system and its determinations. For many high-stakes domains, such as healthcare, education, and finance, the sociotechnical environment includes significant legal implications that impact how people use AI explanations. For example, physicians who use AI decision support systems may need information on how accepting or rejecting an AI determination will protect them from lawsuits or help them advocate for their patients. In this paper, we make the case for Legally-Informed Explainable AI, responding to the need to integrate and design for legal considerations when creating AI explanations. We describe three stakeholder groups with different informational and actionability needs, and provide practical recommendations to tackle design challenges around the design of explainable AI systems that incorporate legal considerations.

Paperid: 1541, https://arxiv.org/pdf/2504.10276.pdf

Abstract:
The integration of ethics into software development faces significant challenges due to market fundamentalism in organizational practices, where profit often takes precedence over ethical considerations. Additionally, the critical influence of practitioners' individual backgrounds on ethical decision-making remains underexplored, highlighting a gap in comprehensive research. This is especially essential to understand due to the demographic imbalance in software roles. This study investigates ethical concerns in software development, focusing on how they are perceived, prioritized, and addressed by demographically different practitioners. By surveying 217 software practitioners across diverse roles, industries, and countries, we identify critical barriers to ethical integration and examine practitioners' capacity to mitigate these issues. Our findings reveal pronounced demographic disparities, with marginalized groups - including women, BIPOC, and disabled individuals - reporting ethical concerns at higher frequencies. Notably, marginalized practitioners demonstrated heightened sensitivity to ethical implementation and greater empowerment to address them. However, practitioners overall often lack the support needed to address ethical challenges effectively. These insights underscore the urgent need for reforms in software education and development processes that center on diverse perspectives. Such reforms are essential to advancing ethical integration in software development and ensuring responsible computing practices in an increasingly complex technological landscape.

Paperid: 1542, https://arxiv.org/pdf/2504.10249.pdf

Abstract:
This study examines the role of AI-assisted pretesting in enhancing learning outcomes, particularly when integrated with generative AI tools like ChatGPT. Pretesting, a learning strategy in which students attempt to answer questions or solve problems before receiving instruction, has been shown to improve retention by activating prior knowledge. The adaptability and interactivity of AI-assisted pretesting introduce new opportunities for optimizing learning in digital environments. Across three experimental studies, we explored how pretesting strategies, task characteristics, and student motivation influence learning. Findings suggest that AI-assisted pretesting enhances learning outcomes, particularly for tasks requiring higher-order thinking. While adaptive AI-driven pretesting increased engagement, its benefits were most pronounced in complex, exploratory tasks rather than straightforward computational problems. These results highlight the importance of aligning pretesting strategies with task demands, demonstrating that AI can optimize learning when applied to tasks requiring deeper cognitive engagement. This research provides insights into how AI-assisted pretesting can be effectively integrated with generative AI tools to enhance both cognitive and motivational outcomes in learning environments.

Paperid: 1543, https://arxiv.org/pdf/2504.10134.pdf

Abstract:
Computational reproducibility of scientific results, that is, the execution of a computational experiment (e.g., a script) using its original settings (data, code, etc.), should always be possible. However, reproducibility has become a significant challenge, as researchers often face difficulties in accurately replicating experiments due to inconsistencies in documentation, setup configurations, and missing data. This lack of reproducibility may undermine the credibility of scientific results. To address this issue, we propose a conversational, text-based tool that allows researchers to easily reproduce computational experiments (theirs or from others) and package them in a single file that can be re-executed with just a double click on any computer, requiring the installation of a single widely-used software. Researchers interact with the platform in natural language, which our tool processes to automatically create a computational environment able to execute the provided experiment/code. We conducted two studies to evaluate our proposal. In the first study, we gathered qualitative data by executing 18 experiments from the literature. Although in some cases it was not possible to execute the experiment, in most instances, it was necessary to have little or even no interaction for the tool to reproduce the results. We also conducted a user study comparing our tool with an enterprise-level one. During this study, we measured the usability of both tools using the System Usability Scale (SUS) and participants' workload using the NASA Task Load Index (TLX). The results show a statistically significant difference between both tools in favor of our proposal, demonstrating that the usability and workload of our tool are superior to the current state of the art.

Paperid: 1544, https://arxiv.org/pdf/2504.09955.pdf

Abstract:
Magnetic Resonance Imaging (MRI) can be a stressful experience for pediatric patients due to the loud acoustic environment, enclosed scanner bore, and a prolonged requirement to remain still. While sedation is commonly used to manage anxiety and motion, it carries clinical risks and logistical burdens. Traditional preparatory approaches, such as instructional videos and mock scans, often lack engagement for older children and adolescents. In this study, we present a comparative evaluation of four MRI preparation modalities: (1) a gamified virtual reality (VR) simulation that trains stillness through real-time feedback; (2) a passive VR experience replicating the MRI environment without interactivity; (3) a 360° first-person video of a real MRI procedure; and (4) a standard 2D educational video. Using a within-subjects design (N = 11, ages 10-16), we assess each method's impact on head motion data, anxiety reduction, procedural preparedness, usability, cognitive workload, and subjective preference. Results show that the gamified VR condition had significantly lower head motion (p < 0.001) and yielded the highest preparedness scores (p < 0.05). Head motion data were significantly correlated with learning outcomes (p < 0.01), suggesting that behavioral performance in VR strongly indicates procedural readiness. While all modalities reduced anxiety and were rated usable, interactive VR was preferred by most participants and demonstrated unique advantages in promoting engagement and behavioral rehearsal. We conclude with recommendations for designing immersive simulations and integrating VR training into pediatric imaging workflows. Meta Quest Store: https://www.meta.com/experiences/stanford-mri-simulator/8205539289482347/

Paperid: 1545, https://arxiv.org/pdf/2504.09779.pdf

Abstract:
The widespread adoption of generative AI is already impacting learning and help-seeking. While the benefits of generative AI are well-understood, recent studies have also raised concerns about increased potential for cheating and negative impacts on students' metacognition and critical thinking. However, the potential impacts on social interactions, peer learning, and classroom dynamics are not yet well understood. To investigate these aspects, we conducted 17 semi-structured interviews with undergraduate computing students across seven R1 universities in North America. Our findings suggest that help-seeking requests are now often mediated by generative AI. For example, students often redirected questions from their peers to generative AI instead of providing assistance themselves, undermining peer interaction. Students also reported feeling increasingly isolated and demotivated as the social support systems they rely on begin to break down. These findings are concerning given the important role that social interactions play in students' learning and sense of belonging.

Paperid: 1546, https://arxiv.org/pdf/2504.09271.pdf

Abstract:
The ubiquity and widespread use of digital and online technologies have transformed mental health support, with online mental health communities (OMHCs) providing safe spaces for peer support. More recently, generative AI and large language models (LLMs) have introduced new possibilities for scalable, around-the-clock mental health assistance that could potentially augment and supplement the capabilities of OMHCs. Although genAI shows promise in delivering immediate and personalized responses, its effectiveness in replicating the nuanced, experience-based support of human peers remains an open question. In this study, we harnessed 24,114 posts and 138,758 online community (OC) responses from 55 OMHCs on Reddit. We prompted several state-of-the-art LLMs (GPT-4-Turbo, Llama-3, and Mistral-7B) with these posts, and compared their (AI) responses to human-written (OC) responses based on a variety of linguistic measures across psycholinguistics and lexico-semantics. Our findings revealed that AI responses are more verbose, readable, and analytically structured, but lack the linguistic diversity and personal narratives inherent in human-human interactions. Through a qualitative examination, we found validation as well as complementary insights into the nature of AI responses, such as their neutrality of stance and the absence of back-and-forth clarification seeking. We discuss the ethical and practical implications of integrating generative AI into OMHCs, advocating for frameworks that balance AI's scalability and timeliness with the irreplaceable authenticity, social interactiveness, and expertise of human connections that form the ethos of online support communities.

Paperid: 1547, https://arxiv.org/pdf/2504.09169.pdf

Abstract:
Researchers often struggle to develop measurement items and lack a standardized process. To support the design process, we present UX Remix, a system to help researchers develop constructs and measurement items using large language models (LLMs). UX Remix leverages a database of constructs and associated measurement items from previous papers. Based on the data, UX Remix recommends constructs relevant to the research context. The researchers then select appropriate constructs based on the recommendations. Afterward, selected constructs are used to generate a custom construct, and UX Remix recommends measurement items. UX Remix streamlines the process of selecting constructs, developing measurement items, and adapting them to research contexts, addressing challenges in the selection and reuse of measurement items. This paper describes the implementation of the system, the potential benefits, and future directions to improve the rigor and efficiency of measurement design in human-computer interaction (HCI) research.

Paperid: 1548, https://arxiv.org/pdf/2504.08056.pdf

Abstract:
Achieving a high level of immersion and adaptation in virtual reality (VR) requires precise measurement and representation of user state. While extrinsic physical characteristics such as locomotion and pose can be accurately tracked in real-time, reliably capturing mental states is more challenging. Quantitative psychology allows considering more intrinsic features like emotion, attention, or cognitive load. Time perception, in particular, is strongly tied to users' mental states, including stress, focus, and boredom. However, research on objectively measuring the pace at which we perceive the passage of time is scarce. In this work, we investigate the potential of electroencephalography (EEG) as an objective measure of time perception in VR, exploring neural correlates with oscillatory responses and time-frequency analysis. To this end, we implemented a variety of time perception modulators in VR, collected EEG recordings, and labeled them with overestimation, correct estimation, and underestimation time perception states. We found clear EEG spectral signatures for these three states that are persistent across individuals, modulators, and modulation duration. These signatures can be integrated and applied to monitor and actively influence time perception in VR, allowing the virtual environment to be purposefully adapted to the individual to increase immersion further and improve user experience. A free copy of this paper and all supplemental materials are available at https://vrarlab.uni.lu/pub/brain-signatures.

Paperid: 1549, https://arxiv.org/pdf/2504.05008.pdf

Abstract:
The rapid development of AI-driven tools, particularly large language models (LLMs), is reshaping professional writing. Still, key aspects of their adoption, such as language support, ethics, and long-term impact on writers' voice and creativity, remain underexplored. In this work, we conducted a questionnaire (N = 301) and an interactive survey (N = 36) targeting professional writers regularly using AI. We examined LLM-assisted writing practices across 25+ languages, ethical concerns, and user expectations. The survey findings yield important insights into: LLM adoption by non-English speakers; the degree of misinformation, domain and style adaptation; and the usability and key features of LLMs. These insights can guide further development, benefiting both writers and a broader user base.

Paperid: 1550, https://arxiv.org/pdf/2504.03014.pdf

Abstract:
Reliable building energy audits are crucial for efficiency through heat loss detection. While drones assist inspections, they overlook the interplay between personality traits, stress management, and operational strategies expert engineers employ. This gap, combined with workforce shortages, necessitates effective knowledge transfer. This study proposes a VR-based training system for human-drone interaction in building heat loss inspection. Participants piloted a virtual drone with a thermographic monitor to identify defects. By analyzing flight patterns, stress adaptation, and inspection performance across diverse trainees, we found: (1) Flight Trajectories - Extraverts, Intuitives, Feelers, and Perceivers explored larger areas but showed higher misclassification rates, while Introverts, Sensors, Thinkers, and Judgers demonstrated methodical approaches. (2) Stress Adaptation - Heart rate variability revealed broader stress fluctuations among Extraverts, Intuitives, Feelers, and Perceivers, whereas Introverts, Sensors, Thinkers, and Judgers maintained steadier responses. Task complexity magnified these differences. (3) Inspection Performance - Extraverts, Intuitives, and Feelers achieved higher recall but over-identified defects. Introverts, Sensors, Thinkers, and Judgers made fewer random errors but risked overlooking subtle heat losses. These insights highlight the interplay among personality traits, stress management, and operational strategies in VR training for drone-assisted audits. The framework shows potential for addressing workforce shortages by facilitating knowledge transfer and optimizing human-drone collaboration.

Paperid: 1551, https://arxiv.org/pdf/2504.02780.pdf

Abstract:
The rise of Generative AI, and Large Language Models (LLMs) in particular, is fundamentally changing cognitive processes in knowledge work, raising critical questions about their impact on human reasoning and problem-solving capabilities. As these AI systems become increasingly integrated into workflows, they offer unprecedented opportunities for augmenting human thinking while simultaneously risking cognitive erosion through passive consumption of generated answers. This tension is particularly pronounced in open-ended tasks, where effective solutions require deep contextualization and integration of domain knowledge. Unlike structured tasks with established metrics, measuring the quality of human-LLM interaction in such open-ended tasks poses significant challenges due to the absence of ground truth and the iterative nature of solution development. To address this, we present a framework that analyzes interaction patterns along two dimensions: cognitive activity mode (exploration vs. exploitation) and cognitive engagement mode (constructive vs. detrimental). This framework provides systematic measurements to evaluate when LLMs are effective tools for thought rather than substitutes for human cognition, advancing theoretical understanding and practical guidance for developing AI systems that protect and augment human cognitive capabilities.

Paperid: 1552, https://arxiv.org/pdf/2504.02526.pdf

Abstract:
How AI communicates with humans is crucial for effective human-AI co-creation. However, many existing co-creative AI tools cannot communicate effectively, limiting their potential as collaborators. This paper introduces our initial design of a Framework for designing AI Communication (FAICO) for co-creative AI based on a systematic review of 107 full-length papers. FAICO presents key aspects of AI communication and their impacts on user experience to guide the design of effective AI communication. We then show actionable ways to translate our framework into two practical tools: design cards for designers and a configuration tool for users. The design cards enable designers to consider AI communication strategies that cater to a diverse range of users in co-creative contexts, while the configuration tool empowers users to customize AI communication based on their needs and creative workflows. This paper contributes new insights within the literature on human-AI co-creativity and Human-Computer Interaction, focusing on designing AI communication to enhance user experience.

Paperid: 1553, https://arxiv.org/pdf/2504.02234.pdf

Abstract:
Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted this method. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a review of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions, including context-rich prompting and fine-tuning with social science datasets. We believe that LLM social simulations can already be used for pilot and exploratory studies, and more widespread use may soon be possible with rapidly advancing LLM capabilities. Researchers should prioritize developing conceptual models and iterative evaluations to make the best use of new AI systems.

Paperid: 1554, https://arxiv.org/pdf/2504.02110.pdf

Abstract:
Many mobile apps are inaccessible, thereby excluding people from their potential benefits. Existing rule-based accessibility checkers aim to mitigate these failures by identifying errors early during development but are constrained in the types of errors they can detect. We present ScreenAudit, an LLM-powered system designed to traverse mobile app screens, extract metadata and transcripts, and identify screen reader accessibility errors overlooked by existing checkers. We recruited six accessibility experts including one screen reader user to evaluate ScreenAudit's reports across 14 unique app screens. Our findings indicate that ScreenAudit achieves an average coverage of 69.2%, compared to only 31.3% with a widely-used accessibility checker. Expert feedback indicated that ScreenAudit delivered higher-quality feedback and addressed more aspects of screen reader accessibility compared to existing checkers, and that ScreenAudit would benefit app developers in real-world settings.

Paperid: 1555, https://arxiv.org/pdf/2504.00831.pdf

Abstract:
To improve the trustworthiness of an AI model, finding consistent, understandable representations of its inference process is essential. This understanding is particularly important in high-stakes operations such as weather forecasting, where the identification of underlying meteorological mechanisms is as critical as the accuracy of the predictions. Despite the growing literature that addresses this issue through explainable AI, the applicability of its solutions is often limited due to their AI-centric development. To fill this gap, we follow a user-centric process to develop an example-based concept analysis framework, which identifies cases that follow a similar inference process as the target instance in a target model and presents them in a user-comprehensible format. Our framework provides the users with visually and conceptually analogous examples, including the probability of concept assignment to resolve ambiguities in weather mechanisms. To bridge the gap between vector representations identified from models and human-understandable explanations, we compile a human-annotated concept dataset and implement a user interface to assist domain experts involved in the framework development.

Paperid: 1556, https://arxiv.org/pdf/2504.00795.pdf

Abstract:
Machine learning (ML) is becoming increasingly popular in meteorological decision-making. Although the literature on explainable artificial intelligence (XAI) is growing steadily, user-centered XAI studies have not yet extended to this domain. This study defines three requirements for explanations of black-box models in meteorology through user studies: statistical model performance for different rainfall scenarios to identify model bias, model reasoning, and the confidence of model outputs. Appropriate XAI methods are mapped to each requirement, and the generated explanations are tested quantitatively and qualitatively. An XAI interface system is designed based on user feedback. The results indicate that the explanations increase decision utility and user trust. Users prefer intuitive explanations over those based on XAI algorithms even for potentially easy-to-recognize examples. These findings can provide evidence for future research on user-centered XAI algorithms, as well as a basis to improve the usability of AI systems in practice.

Paperid: 1557, https://arxiv.org/pdf/2503.23484.pdf

Abstract:
Navigating peripersonal space requires reaching targets in both horizontal (e.g., desks) and vertical (e.g., shelves) layouts with high precision. We developed a haptic glove to aid peripersonal target navigation and investigated the effectiveness of different feedback delivery methods. Twenty-two participants completed target navigation tasks under various conditions, including scene layout (horizontal or vertical), guidance approach (two-tactor or worst-axis first), guidance metaphor (push or pull), and intensity mode (linear or zone) for conveying distance cues. Task completion time, hand trajectory distance, and the percentage of hand trajectory in a critical area were measured as performance outcomes, along with subjective feedback. Participants achieved significantly faster task completion times and covered less hand trajectory distance in the horizontal layout, worst-axis first approach, and pull metaphor conditions. Additionally, male participants demonstrated superior performance and reported lower levels of frustration compared to their female counterparts throughout the study. Intensity mode had no significant effect on the results. In summary, vibrating one tactor at a time (worst-axis first) and using the pull metaphor were the most effective methods of delivering vibrotactile feedback for peripersonal target navigation in both horizontal and vertical settings. Findings from this work can guide future development of haptic gloves for individuals with vision impairments, environments with visual limitations, and for accessibility and rehabilitation applications.

Paperid: 1558, https://arxiv.org/pdf/2503.22247.pdf

Abstract:
A wide range of haptic feedback is crucial for achieving high realism and immersion in virtual environments. Therefore, a multi-modal haptic interface that provides various haptic signals simultaneously is highly beneficial. This paper introduces a novel silicone fingertip actuator that is pneumatically actuated, delivering a realistic and effective haptic experience by simultaneously providing pressure, vibrotactile, and cold thermal feedback. The actuator features a design with multiple air chambers, each with controllable volume achieved through pneumatic valves connected to compressed air tanks. The lower air chamber generates pressure feedback, while the upper chamber produces vibrotactile feedback. In addition, two integrated lateral air nozzles create a cold thermal sensation. To showcase the system's capabilities, we designed two unique 3D surfaces in the virtual environment: a frozen meat surface and an abrasive icy surface. These surfaces simulate tactile perceptions of coldness, pressure, and texture. Comprehensive performance assessments and user studies were conducted to validate the actuator's effectiveness, highlighting its diverse feedback capabilities compared to traditional actuators that offer only single feedback modalities.

Paperid: 1559, https://arxiv.org/pdf/2503.22145.pdf

Abstract:
A considerable part of the performance of today's large language models (LLMs) and multimodal large language models (MLLMs) depends on their tokenization strategies. While tokenizers are extensively researched for textual and visual input, there is no research on tokenization strategies for gaze data due to its nature. However, a corresponding tokenization strategy would allow using the vision capabilities of pre-trained MLLMs for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs. We evaluate the tokenizers regarding their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy, measuring its generative and predictive performance. Overall, we found that a quantile tokenizer outperforms all others in predicting the gaze positions and k-means is best when predicting gaze velocities.

Paperid: 1560, https://arxiv.org/pdf/2503.21723.pdf

Abstract:
Occlusion is one of the challenging issues when estimating 3D hand pose. This problem becomes more prominent when a hand interacts with an object or when two hands are involved. Past work has paid little attention to these occluded regions, yet they contain information that is vital for 3D hand pose estimation. Thus, in this paper, we propose an occlusion-robust and accurate method for estimating 3D hand-object pose from an input RGB image. Our method first localises the hand joints using a CNN-based model and then refines them by extracting contextual information. A self-attention transformer then identifies the specific joints along with the hand identity. This helps the model determine which hand a particular joint belongs to, which in turn helps to detect the joint even in an occluded region. These joints with hand identity are then used to estimate the pose using a cross-attention mechanism. By identifying the joints in the occluded region, the resulting network becomes robust to occlusion and achieves state-of-the-art results when evaluated on the InterHand2.6M, HO3D and H$_2$O3D datasets.

Paperid: 1561, https://arxiv.org/pdf/2503.20790.pdf

Abstract:
While AI-assisted colonoscopy promises improved colorectal cancer screening, its success relies on effective integration into clinical practice, not just algorithmic accuracy. This paper, based on an Australian field study (observations and gastroenterologist interviews), highlights a critical disconnect: current development prioritizes machine learning model performance, overlooking essential aspects of user interface design, workflow integration, and overall user experience. Industry interactions reveal a similar emphasis on data and algorithms. To realize AI's full potential, the HCI community must champion user-centered design, ensuring these systems are usable, support endoscopist expertise, and enhance patient outcomes.

Paperid: 1562, https://arxiv.org/pdf/2503.20518.pdf

Abstract:
This study investigates the elicitation of empathy toward a third party through interaction with social agents. Participants engaged with either a physical robot or a voice-enabled chatbot, both driven by a large language model (LLM) programmed to exhibit either an empathetic tone or remain neutral. The interaction is focused on a fictional character, Katie Banks, who is in a challenging situation and in need of financial donations. The willingness to help Katie, measured by the number of hours participants were willing to volunteer, along with their perceptions of the agent, were assessed for 60 participants. Results indicate that neither robotic embodiment nor empathetic tone significantly influenced participants' willingness to volunteer. While the LLM effectively simulated human empathy, fostering genuine empathetic responses in participants proved challenging.

Paperid: 1563, https://arxiv.org/pdf/2503.17670.pdf

Abstract:
Trust plays a critical role in visual data communication and decision-making, yet existing visualization research employs varied trust measures, making it challenging to compare and synthesize findings across studies. In this work, we first took a bottom-up, data-driven approach to understand what visualization readers mean when they say they "trust" a visualization. We compiled and adapted a broad set of trust-related statements from existing inventories and collected responses on visualizations with varying degrees of trustworthiness. Through exploratory factor analysis, we derived an operational definition of trust in visualizations. Our findings indicate that people perceive a trustworthy visualization as one that presents credible information and is comprehensible and usable. Additionally, we found that general trust disposition influences how individuals assess visualization trustworthiness. Building on these insights, we developed a compact inventory consisting of statements that not only effectively represent each trust factor but also exhibit high item discrimination. We further validated our inventory through two trust games with real-world stakes, demonstrating that our measures reliably predict behavioral trust. Finally, we illustrate how this standardized inventory can be applied across diverse visualization research contexts. Utilizing our inventory, future research can examine how design choices, tasks, and domains influence trust, and how to foster appropriate trusting behavior in human-data interactions.

Paperid: 1564, https://arxiv.org/pdf/2503.17553.pdf

Abstract:
Radiotherapy treatment planning is a complex and time-intensive process, often impacted by inter-planner variability and subjective decision-making. To address these challenges, we introduce Dose Optimization Language Agent (DOLA), an autonomous large language model (LLM)-based agent designed for optimizing radiotherapy treatment plans while rigorously protecting patient privacy. DOLA integrates the LLaMa3.1 LLM directly with a commercial treatment planning system, utilizing chain-of-thought prompting, retrieval-augmented generation (RAG), and reinforcement learning (RL). Operating entirely within secure local infrastructure, this agent eliminates external data sharing. We evaluated DOLA using a retrospective cohort of 18 prostate cancer patients prescribed 60 Gy in 20 fractions, comparing model sizes (8 billion vs. 70 billion parameters) and optimization strategies (No-RAG, RAG, and RAG+RL) over 10 planning iterations. The 70B model demonstrated significantly improved performance, achieving approximately 16.4% higher final scores than the 8B model. The RAG approach outperformed the No-RAG baseline by 19.8%, and incorporating RL accelerated convergence, highlighting the synergy of retrieval-based memory and reinforcement learning. Optimal temperature hyperparameter analysis identified 0.4 as providing the best balance between exploration and exploitation. This proof of concept study represents the first successful deployment of locally hosted LLM agents for autonomous optimization of treatment plans within a commercial radiotherapy planning system. By extending human-machine interaction through interpretable natural language reasoning, DOLA offers a scalable and privacy-conscious framework, with significant potential for clinical implementation and workflow improvement.

Paperid: 1565, https://arxiv.org/pdf/2503.16791.pdf

Abstract:
Data analysis encompasses a spectrum of tasks, from high-level conceptual reasoning to lower-level execution. While AI-powered tools increasingly support execution tasks, there remains a need for intelligent assistance in conceptual tasks. This paper investigates the design of an ordered node-link tree interface augmented with AI-generated information hints and visualizations, as a potential shared representation for hypothesis exploration. Through a design probe (n=22), participants generated diagrams averaging 21.82 hypotheses. Our findings showed that the node-link diagram acts as "guardrails" for hypothesis exploration, facilitating structured workflows, providing comprehensive overviews, and enabling efficient backtracking. The AI-generated information hints, particularly visualizations, aided users in transforming abstract ideas into data-backed concepts while reducing cognitive load. We further discuss how node-link diagrams can support both parallel exploration and iterative refinement in hypothesis formulation, potentially enhancing the breadth and depth of human-AI collaborative data analysis.

Paperid: 1566, https://arxiv.org/pdf/2503.16632.pdf

Abstract:
The increasing integration of Visual Language Models (VLMs) into visualization systems demands a comprehensive understanding of their visual interpretation capabilities and constraints. While existing research has examined individual models, systematic comparisons of VLMs' visualization literacy remain unexplored. We bridge this gap through a rigorous, first-of-its-kind evaluation of four leading VLMs (GPT-4, Claude, Gemini, and Llama) using standardized assessments: the Visualization Literacy Assessment Test (VLAT) and Critical Thinking Assessment for Literacy in Visualizations (CALVI). Our methodology uniquely combines randomized trials with structured prompting techniques to control for order effects and response variability, a critical consideration overlooked in many VLM evaluations. Our analysis reveals that while specific models demonstrate competence in basic chart interpretation (Claude achieving 67.9% accuracy on VLAT), all models exhibit substantial difficulties in identifying misleading visualization elements (maximum 30.0% accuracy on CALVI). We uncover distinct performance patterns: strong capabilities in interpreting conventional charts like line charts (76-96% accuracy) and detecting hierarchical structures (80-100% accuracy), but consistent difficulties with data-dense visualizations involving multiple encodings (bubble charts: 18.6-61.4%) and anomaly detection (25-30% accuracy). Significantly, we observe distinct uncertainty management behavior across models, with Gemini displaying heightened caution (22.5% question omission) compared to others (7-8%). These findings provide crucial insights for the visualization community by establishing reliable VLM evaluation benchmarks, identifying areas where current models fall short, and highlighting the need for targeted improvements in VLM architectures for visualization tasks.
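
The randomized-trial setup and the accuracy and omission figures above can be illustrated with a small sketch. It is not the authors' evaluation harness: the function names `shuffled_options` and `score`, the seeding scheme, and the toy question data are all assumptions made for illustration. It only shows one plausible way to vary answer order per trial and to report accuracy alongside the share of omitted questions.

```python
import random


def shuffled_options(options, seed):
    """Return the answer options in a reproducible random order.

    Hypothetical helper: re-ordering options per trial is one way to
    control for order effects when prompting a model repeatedly.
    """
    rng = random.Random(seed)
    order = list(options)
    rng.shuffle(order)
    return order


def score(responses, answer_key):
    """Compute accuracy and omission rate over all questions.

    responses maps question id -> chosen option text, or None when the
    model declined to answer. Omitted questions count as incorrect.
    """
    total = len(answer_key)
    omitted = sum(1 for q in answer_key if responses.get(q) is None)
    correct = sum(1 for q, a in answer_key.items() if responses.get(q) == a)
    return {"accuracy": correct / total, "omission_rate": omitted / total}


# Toy data, invented for illustration only.
answer_key = {"q1": "rising", "q2": "B", "q3": "2015", "q4": "none"}
responses = {"q1": "rising", "q2": "C", "q3": None, "q4": "none"}

trial_order = shuffled_options(["rising", "falling", "flat"], seed=7)
result = score(responses, answer_key)
print(trial_order)
print(result)
```

Counting omissions as incorrect is a design choice of this sketch; an evaluation could equally report accuracy over answered questions only, which would change the comparison between cautious and less cautious models.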

Paperid: 1567, https://arxiv.org/pdf/2503.16518.pdf

Abstract:
Human-Machine Teaming (HMT) is revolutionizing collaboration across domains such as defense, healthcare, and autonomous systems by integrating AI-driven decision-making, trust calibration, and adaptive teaming. This survey presents a comprehensive taxonomy of HMT, analyzing theoretical models, including reinforcement learning, instance-based learning, and interdependence theory, alongside interdisciplinary methodologies. Unlike prior reviews, we examine team cognition, ethical AI, multi-modal interactions, and real-world evaluation frameworks. Key challenges include explainability, role allocation, and scalable benchmarking. We propose future research in cross-domain adaptation, trust-aware AI, and standardized testbeds. By bridging computational and social sciences, this work lays a foundation for resilient, ethical, and scalable HMT systems.

Paperid: 1568, https://arxiv.org/pdf/2503.16476.pdf

Abstract:
Simulation of conflict situations for autonomous driving research is crucial for understanding and managing interactions between Automated Vehicles (AVs) and human drivers. This paper presents a set of exemplary conflict scenarios in CARLA that arise in shared autonomy settings, where both AVs and human drivers must navigate complex traffic environments. We explore various conflict situations, focusing on the impact of driver behavior and decision-making processes on overall traffic safety and efficiency. We build a simple extendable toolkit for situation awareness research, in which the implemented conflicts can be demonstrated.

Paperid: 1569, https://arxiv.org/pdf/2503.16474.pdf

Abstract:
This paper presents Matrix, an advanced AI-powered framework designed for real-time 3D object generation in Augmented Reality (AR) environments. By integrating a cutting-edge text-to-3D generative AI model, multilingual speech-to-text translation, and large language models (LLMs), the system enables seamless user interactions through spoken commands. The framework processes speech inputs, generates 3D objects, and provides object recommendations based on contextual understanding, enhancing AR experiences. A key feature of this framework is its ability to optimize 3D models by reducing mesh complexity, resulting in significantly smaller file sizes and faster processing on resource-constrained AR devices. Our approach addresses the challenges of high GPU usage, large model output sizes, and real-time system responsiveness, ensuring a smoother user experience. Moreover, the system is equipped with a pre-generated object repository, further reducing GPU load and improving efficiency. We demonstrate the practical applications of this framework in various fields such as education, design, and accessibility, and discuss future enhancements including image-to-3D conversion, environmental object detection, and multimodal support. The open-source nature of the framework promotes ongoing innovation and its utility across diverse industries.

Paperid: 1570, https://arxiv.org/pdf/2503.16462.pdf

Abstract:
It has been suggested that autonomous vehicles can improve efficiency and safety of the transportation systems. While research in this area often focuses on autonomous vehicles which operate on roads, the deployment of low-speed, autonomous vehicles in unstructured, crowded environments has been studied less well and requires specific considerations regarding their interaction with pedestrians. For making the operation of these vehicles acceptable, their behaviour needs to be perceived as safe by both pedestrians and the passengers riding the vehicle. In this paper we conducted an online survey with 116 participants, to understand people's preferences with respect to an autonomous golf cart's behaviour in different interaction scenarios. We measured people's self-reported perceived safety towards different behaviour of the cart in a variety of scenarios. Results suggested that despite the unstructured nature of the environment, the cart was expected to follow common traffic rules when interacting with a group of pedestrians.

Paperid: 1571, https://arxiv.org/pdf/2503.16461.pdf

Abstract:
Facial Expression Recognition (FER) plays a foundational role in enabling AI systems to interpret emotional nuances, a critical aspect of affective Theory of Mind (ToM). However, existing models often struggle with poor calibration and a limited capacity to capture emotional intensity and complexity. To address this, we propose Ranking the Emotional Nuance for Theory of Mind (Rank-O-ToM), a framework that leverages ordinal ranking to align confidence levels with the emotional spectrum. By incorporating synthetic samples reflecting diverse affective complexities, Rank-O-ToM enhances the nuanced understanding of emotions, advancing AI's ability to reason about affective states.

Paperid: 1572, https://arxiv.org/pdf/2503.16450.pdf

Abstract:
Dog guides offer an effective mobility solution for blind or visually impaired (BVI) individuals, but conventional dog guides have limitations including the need for care, potential distractions, societal prejudice, high costs, and limited availability. To address these challenges, we seek to develop a robot dog guide capable of performing the tasks of a conventional dog guide, enhanced with additional features. In this work, we focus on design research to identify functional and aesthetic design concepts to implement into a quadrupedal robot. The aesthetic design remains relevant even for BVI users due to their sensitivity toward societal perceptions and the need for smooth integration into society. We collected data through interviews and surveys to answer specific design questions pertaining to the appearance, texture, features, and method of controlling and communicating with the robot. Our study identified essential and preferred features for a future robot dog guide, which are supported by relevant statistics aligning with each suggestion. These findings will inform the future development of user-centered designs to effectively meet the needs of BVI individuals.

Paperid: 1573, https://arxiv.org/pdf/2503.16440.pdf

Abstract:
Algorithmic causal discovery is based on formal reasoning and provably converges toward the optimal solution. However, since some of the underlying assumptions are often not met in practice, no applications for autonomous everyday life competence are yet available. Humans, on the other hand, possess full everyday competence and develop cognitive models in a data-efficient manner, with the ability to transfer knowledge between and to new situations. Here we investigate the causal discovery capabilities of humans in an object placement task in virtual reality (VR) with haptic feedback and compare the results to the state-of-the-art causal discovery algorithms FGES, PC and FCI. In addition, we use the algorithms to analyze causal relations between sensory information and the kinematic parameters of human behavior. Our findings show that the majority of participants were able to determine which variables are causally related. This is in line with causal discovery algorithms like PC, which recover causal dependencies in the first step. However, unlike such algorithms, which can identify causes and effects in our test configuration, humans are unsure in determining a causal direction. Regarding the relation between the sensory information provided to the participants and their placing actions (i.e. their kinematic parameters), the data yields a surprising dissociation of the subjects' knowledge and the sensorimotor level. Knowledge of the cause-effect pairs, though undirected, should suffice to improve subjects' movements. Yet a detailed causal analysis provides little evidence for any such influence. This, together with the reports of the participants, implies that instead of exploiting their consciously perceived information they leave it to the sensorimotor level to control the movement.

Paperid: 1574, https://arxiv.org/pdf/2503.15510.pdf

Abstract:
Operators working with robots in safety-critical domains have to make decisions under uncertainty, which remains a challenging problem for a single human operator. An open question is whether two human operators can make better decisions jointly, as compared to a single operator alone. While prior work has shown that two heads are better than one, such studies have been mostly limited to static and passive tasks. We investigate joint decision-making in a dynamic task involving humans teleoperating robots. We conduct a human-subject experiment with N=100 participants where each participant performed a navigation task with two mobile robots in simulation. We find that joint decision-making through confidence sharing improves dyad performance beyond the better-performing individual (p<0.0001). Further, we find that the extent of this benefit is regulated both by the skill level of each individual and by how well-calibrated their confidence estimates are. Finally, we present findings on characterising the human-human dyad's confidence calibration based on the individuals constituting the dyad. Our findings demonstrate for the first time that two heads are better than one, even on a spatiotemporal task which includes active operator control of robots.
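
To make the idea of joint decision-making through confidence sharing concrete, here is a minimal sketch. The abstract does not specify the combination rule, so this assumes a simple "adopt the more confident operator's choice" rule, and it uses the Brier score as one common measure of confidence calibration. The names `Judgement`, `joint_decision`, and `brier_score`, and the four toy trials, are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    choice: str        # the option the operator picked
    confidence: float  # self-reported confidence in [0, 1]


def joint_decision(a, b):
    """Assumed rule: the dyad adopts the more confident choice (ties go to a)."""
    return a.choice if a.confidence >= b.confidence else b.choice


def brier_score(confidences, outcomes):
    """Mean squared gap between confidence and correctness (0 is best)."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)


# Toy trials invented for illustration: (operator A, operator B, correct answer).
trials = [
    (Judgement("left", 0.90), Judgement("right", 0.60), "left"),
    (Judgement("left", 0.55), Judgement("right", 0.80), "right"),
    (Judgement("left", 0.70), Judgement("left", 0.60), "left"),
    (Judgement("right", 0.50), Judgement("left", 0.85), "left"),
]

acc_a = sum(a.choice == truth for a, _, truth in trials) / len(trials)
acc_b = sum(b.choice == truth for _, b, truth in trials) / len(trials)
acc_dyad = sum(joint_decision(a, b) == truth for a, b, truth in trials) / len(trials)
brier_a = brier_score(
    [a.confidence for a, _, _ in trials],
    [1.0 if a.choice == truth else 0.0 for a, _, truth in trials],
)
print(acc_a, acc_b, acc_dyad, brier_a)
```

In this toy set the dyad beats both individuals only because each operator happens to be more confident when correct; with poorly calibrated confidence the same rule can do worse than the better individual, which is the dependence on calibration that the abstract describes.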

Paperid: 1575, https://arxiv.org/pdf/2503.15504.pdf

Abstract:
Interaction between humans is complex to describe, since it is composed of elements from different modalities, such as speech, gaze, and gestures, influenced by social attitudes and emotions. Furthermore, the interaction can be affected by features that reflect the interlocutor's state. Current Socially Interactive Agents (SIAs) aim to adapt themselves to the state of the interaction partner. In this paper, we discuss this adaptation by describing the architecture of the GRETA platform, which considers external features while interacting with humans and/or another ECA and processes the dialogue incrementally. We illustrate the new architecture of GRETA, which deals with the external features, the adaptation, and the incremental approach to dialogue processing.

Paperid: 1576, https://arxiv.org/pdf/2503.15502.pdf

Abstract:
Choropleth maps, which utilize color schemes to visualize spatial patterns and trends, are simple yet effective tools for geographic data analysis. As such, color scheme design is a critical aspect of choropleth map creation. The traditional coloring methods offered by GIS tools such as ArcGIS and QGIS are not user-friendly for non-professionals. On the one hand, these tools provide numerous color schemes, making it hard to decide which one best matches the theme. On the other hand, it is difficult to fulfill some ambiguous and personalized coloring needs of users, such as requests for 'summer-like' map colors. To address these shortcomings, we develop a novel system that leverages a large language model and map color design principles to generate contextually relevant and user-aligned choropleth map color schemes. The system follows a three-stage process: Data processing, which provides an overview of the data and classifies the data into meaningful classes; Color Concept Design, where the color theme and color mode are conceptualized based on data characteristics and user intentions; and Color Scheme Design, where specific colors are assigned to classes based on generated color theme, color mode, and user requirements. Our system incorporates an interactive interface, providing necessary visualization for choropleth map color design and allowing users to customize and refine color choices flexibly. Through user studies and evaluations, the system demonstrates acceptable usability, accuracy, and flexibility, with users highlighting the tool's efficiency and ease of use.

Paperid: 1577, https://arxiv.org/pdf/2503.14527.pdf

Abstract:
This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI's potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.

Paperid: 1578, https://arxiv.org/pdf/2503.14468.pdf

Abstract:
With the advent of the data era, and of new, more intelligent interfaces for supporting decision making, there is a growing need to define, model and assess human ability and the usability of data visualizations for better encoding and decoding of data patterns. Data Visualization Literacy (DVL) is the ability to encode and decode data into and from a visual language. Although this ability and its measurement are crucial for advancing human knowledge and decision capacity, they have seldom been investigated, let alone systematically. To address this gap, this paper presents a systematic literature review comprising 43 reports on DVL, analyzed using the PRISMA methodology. Our results include the identification of the purposes of DVL, its satellite aspects, the models proposed, and the assessments designed to evaluate people's degree of DVL. Finally, we outline several research directions including, among the most challenging, the definition of a (standard) unifying construct of DVL.

Paperid: 1579, https://arxiv.org/pdf/2503.14263.pdf

Abstract:
Group decision-making processes frequently suffer when social influence and power dynamics suppress minority viewpoints, leading to compliance and groupthink. Conversational agents can counteract these harmful dynamics by encouraging critical thinking. This study investigates how LLM-powered devil's advocate systems affect psychological safety, opinion expression, and satisfaction in power-imbalanced group dynamics. We conducted an experiment with 48 participants in 12 four-person groups, each containing three high-power (senior) and one low-power (junior) member. Each group completed decision tasks in both baseline and AI intervention conditions. Results show AI counterarguments fostered a more flexible atmosphere and significantly enhanced both process and outcome satisfaction for all participants, with particularly notable improvements for minority members. Cognitive workload increased slightly, though not significantly. This research contributes empirical evidence on how AI systems can effectively navigate power hierarchies to foster more inclusive decision-making environments, highlighting the importance of balancing intervention frequency, maintaining conversational flow, and preserving group cohesion.

Paperid: 1580, https://arxiv.org/pdf/2503.13463.pdf

Abstract:
ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets that they use, and of the process generating them, so that possible negative impacts on downstream effects can be tracked, analysed, and, where possible, mitigated. One of the tools that can be useful in this perspective is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema -- the Documentation Test Sheet (DTS) -- that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We verified 100 popular datasets from four different repositories with the DTS to investigate which information was present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.

Paperid: 1581, https://arxiv.org/pdf/2503.13250.pdf

Abstract:
Gaze-based control is a promising and effective form of human-robot interaction in assistive robotic systems. However, current gaze-based assistive systems mainly help users with basic grasping actions, offering limited support. Moreover, the restricted intent recognition capability constrains the assistive system's ability to provide diverse assistance functions. In this paper, we propose an open implicit intention recognition framework powered by a Large Language Model (LLM) and a Vision Foundation Model (VFM), which can process gaze input and recognize user intents that are not confined to predefined or specific scenarios. Furthermore, we implement a gaze-driven LLM-enhanced assistive robot system (MindEye-OmniAssist) that recognizes the user's intentions through gaze and assists in completing tasks. To achieve this, the system utilizes an open-vocabulary object detector, an intention recognition network, and an LLM to infer the user's full intentions. By integrating eye movement feedback and the LLM, it generates action sequences to assist the user in completing tasks. Real-world experiments have been conducted for assistive tasks, and the system achieved an overall success rate of 41/55 across various undefined tasks. Preliminary results show that the proposed method holds the potential to provide a more user-friendly human-computer interaction interface and significantly enhance the versatility and effectiveness of assistive systems by supporting more complex and diverse tasks.
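
For readers who want the reported 41/55 success rate as a percentage, the short sketch below does the arithmetic (41/55 is about 74.5%). The Wilson score interval is an addition of this sketch, not something the abstract reports; it is included only as a hedged indication of how much uncertainty a sample of 55 trials carries, under the assumption that trials are independent.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


successes, trials = 41, 55  # figures taken from the abstract
rate = successes / trials
low, high = wilson_interval(successes, trials)
print(f"success rate: {rate:.1%}")
print(f"approximate 95% Wilson interval: {low:.1%} to {high:.1%}")
```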

Paperid: 1582, https://arxiv.org/pdf/2503.12628.pdf

Abstract:
Recently, there has been a surge of interest in sustainable energy sources, particularly for wearable computing. Triboelectric nanogenerators (TENGs) have shown promise in converting human motion into electric power. Textile-based TENGs, valued for their flexibility and breathability, offer an ideal form factor for wearables. However, uptake in maker communities has been slow due to commercially unavailable materials, complex fabrication processes, and structures incompatible with human motion. This paper introduces texTENG, a textile-based framework simplifying the fabrication of power harvesting and self-powered sensing applications. By leveraging accessible materials and familiar tools, texTENG bridges the gap between advanced TENG research and wearable applications. We explore a design menu for creating multidimensional TENG structures using braiding, weaving, and knitting. Technical evaluations and example applications highlight the performance and feasibility of these designs, offering DIY-friendly pathways for fabricating textile-based TENGs and promoting sustainable prototyping practices within the HCI and maker communities.

Paperid: 1583, https://arxiv.org/pdf/2503.12309.pdf

Abstract:
Computational notebooks are intended to prioritize the needs of scientists, but little is known about how scientists interact with notebooks, what requirements drive scientists' software development processes, or what tactics scientists use to meet their requirements. We conducted an observational study of 20 scientists using Jupyter notebooks for their day-to-day tasks, finding that scientists prioritize different quality attributes depending on their goals. A qualitative analysis of their usage shows (1) a collection of goals scientists pursue with Jupyter notebooks, (2) a set of quality attributes that scientists value when they write software, and (3) tactics that scientists leverage to promote quality. In addition, we identify ways scientists incorporated AI tools into their notebook work. From our observations, we derive design recommendations for improving computational notebooks and future programming systems for scientists. Key opportunities pertain to helping scientists create and manage state, dependencies, and abstractions in their software, enabling more effective reuse of clearly-defined components.

Paperid: 1584, https://arxiv.org/pdf/2503.10892.pdf

Abstract:
Despite the importance of viewers' trust in data visualization, there is a lack of research on the viewers' own perspective on their trust. In addition, much of the research on trust remains relatively theoretical and inaccessible for designers. This work aims to address this gap by conducting a qualitative study to explore how viewers perceive different data visualizations and how their perceptions impact their trust. Three dominant themes emerged from the data. First, users appeared to be consistent, listing similar rationale for their trust across different stimuli. Second, there were diverse opinions about what factors were most important to trust perception and about why the factors matter. Third, despite this disagreement, there were important trends to the factors that users reported as impactful. Finally, we leverage these themes to give specific and actionable guidelines for visualization designers to make more trustworthy visualizations.

Paperid: 1585, https://arxiv.org/pdf/2503.09805.pdf

Abstract:
Queer people are often discussed as targets of bias, harm, or discrimination in research on generative AI. However, the specific ways that queer people engage with generative AI, and thus possible uses that support queer people, have yet to be explored. We conducted a workshop study with 13 queer artists, during which we gave participants access to GPT-4 and DALL-E 3 and facilitated group sensemaking activities. We found our participants struggled to use these models due to various normative values embedded in their designs, such as hyper-positivity and anti-sexuality. We describe various strategies our participants developed to overcome these models' limitations and how, nevertheless, our participants found value in these highly-normative technologies. Drawing on queer feminist theory, we discuss implications for the conceptualization of "state-of-the-art" models and consider how FAccT researchers might support queer alternatives.

Paperid: 1586, https://arxiv.org/pdf/2503.09061.pdf

Abstract:
Motion comics, a digital animation format that enhances comic book narratives, have wide applications in storytelling, education, and advertising. However, their creation poses significant challenges for amateur creators, primarily due to the need for specialized skills and complex workflows. To address these issues, we conducted an exploratory survey (N=58) to understand the challenges associated with creating motion comics, and an expert interview (N=4) to identify a typical workflow for creation. We further analyzed 95 online motion comics to gain insights into the design space of character and object actions. Based on our findings, we proposed DancingBoard, an integrated authoring tool designed to simplify the creation process. This tool features a user-friendly interface and a guided workflow, providing comprehensive support throughout each step of the creation process. A user study involving 23 creators showed that, compared to professional tools, DancingBoard is easily comprehensible and provides improved guidance and support, requiring less effort from users. Additionally, a separate study with 18 audience members confirmed the tool's effectiveness in conveying the story to its viewers.

Paperid: 1587, https://arxiv.org/pdf/2503.08844.pdf

Abstract:
Future warfare will occur in more complex, fast-paced, ill-structured, and demanding conditions that will stress current Command and Control (C2) systems. Without modernization, these C2 systems may fail to maintain overmatch against adversaries. We previously proposed robust partnerships between humans and artificial intelligence systems, and directly focusing on C2, we introduced how intelligent technologies could provide future overmatch through streamlining the C2 operations process, maintaining unity of effort across formations, and developing collective knowledge systems that adapt to battlefield dynamics across missions. Future C2 systems must seamlessly integrate human and machine intelligence to achieve decision advantage over adversaries while overcoming "new" challenges due to the technological advances driving fundamental changes in effective teaming, unity of effort, and meaningful human control. Here, we describe "new" C2 challenges and discuss pathways to transcend them, such as AI-enabled systems with effective human machine interfaces.

Paperid: 1588, https://arxiv.org/pdf/2503.08437.pdf

Abstract:
The recent surge in the vehicle market has led to an alarming increase in road accidents. This underscores the critical importance of enhancing road safety measures, particularly for vulnerable road users like motorcyclists. Hence, we introduce the rider intention prediction (RIP) competition that aims to address challenges in rider safety by proactively predicting maneuvers before they occur, thereby strengthening rider safety. This capability enables the riders to react to the potential incorrect maneuvers flagged by advanced driver assistance systems (ADAS). We collect a new dataset, namely, rider action anticipation dataset (RAAD) for the competition consisting of two tasks: single-view RIP and multi-view RIP. The dataset incorporates a spectrum of traffic conditions and challenging navigational maneuvers on roads with varying lighting conditions. For the competition, we received seventy-five registrations and five team submissions for inference of which we compared the methods of the top three performing teams on both the RIP tasks: one state-space model (Mamba2) and two learning-based approaches (SVM and CNN-LSTM). The results indicate that the state-space model outperformed the other methods across the entire dataset, providing a balanced performance across maneuver classes. The SVM-based RIP method showed the second-best performance when using random sampling and SMOTE. However, the CNN-LSTM method underperformed, primarily due to class imbalance issues, particularly struggling with minority classes. This paper details the proposed RAAD dataset and provides a summary of the submissions for the RIP 2024 competition.

Paperid: 1589, https://arxiv.org/pdf/2503.07599.pdf

Abstract:
Generative AI is transforming education by enabling personalized, on-demand learning experiences. However, AI tutors lack the ability to assess a learner's cognitive state in real time, limiting their adaptability. Meanwhile, electroencephalography (EEG)-based neuroadaptive systems have successfully enhanced engagement by dynamically adjusting learning content. This paper presents NeuroChat, a proof-of-concept neuroadaptive AI tutor that integrates real-time EEG-based engagement tracking with generative AI. NeuroChat continuously monitors a learner's cognitive engagement and dynamically adjusts content complexity, response style, and pacing using a closed-loop system. We evaluate this approach in a pilot study (n=24), comparing NeuroChat to a standard LLM-based chatbot. Results indicate that NeuroChat enhances cognitive and subjective engagement but does not show an immediate effect on learning outcomes. These findings demonstrate the feasibility of real-time cognitive feedback in LLMs, highlighting new directions for adaptive learning, AI tutoring, and human-AI interaction.

Paperid: 1590, https://arxiv.org/pdf/2503.07086.pdf

Abstract:
Automated data insight mining and visualization have been widely used in various business intelligence applications (e.g., market analysis and product promotion). However, automated insight mining techniques often output the same mining results to different analysts without considering their personal preferences, while interactive insight discovery requires significant manual effort. This paper fills the gap by integrating automated insight mining with interactive data visualization and striking a proper balance between them to facilitate insight discovery and exploration. Specifically, we regard data insights as a special type of data and further present InsightMap, a novel visualization approach that uses the map metaphor to provide a quick overview and in-depth exploration of different data insights, where a metric is proposed to measure the similarity between different insights. The effectiveness and usability of InsightMap are demonstrated through extensive case studies and in-depth user interviews.

Paperid: 1591, https://arxiv.org/pdf/2503.06911.pdf

Abstract:
In this work, we explore explicit Large Language Model (LLM)-powered support for the iterative design of computer programs. Program design, like other design activity, is characterized by navigating a space of alternative problem formulations and associated solutions in an iterative fashion. LLMs are potentially powerful tools in helping this exploration; however, by default, code-generation LLMs deliver code that represents a particular point solution. This obscures the larger space of possible alternatives, many of which might be preferable to the LLM's default interpretation and its generated code. We contribute an IDE that supports program design through generating and showing new ways to frame problems alongside alternative solutions, tracking design decisions, and identifying implicit decisions made by either the programmer or the LLM. In a user study, we find that with our IDE, users combine and parallelize design phases to explore a broader design space -- but also struggle to keep up with LLM-originated changes to code and other information overload. These findings suggest a core challenge for future IDEs that support program design through higher-level instructions given to LLM-based agents: carefully managing attention and deciding what information agents should surface to program designers and when.

Paperid: 1592, https://arxiv.org/pdf/2503.06099.pdf

Abstract:
Medical education increasingly emphasizes students' ability to apply knowledge in real-world clinical settings, focusing on evidence-based clinical reasoning and differential diagnoses. Problem-based learning (PBL) addresses traditional teaching limitations by embedding learning into meaningful contexts and promoting active participation. However, current PBL practices are often confined to medical instructional settings, limiting students' ability to self-direct and refine their approaches based on targeted improvements. Additionally, the unstructured nature of information organization during analysis poses challenges for record-keeping and subsequent review. Existing research enhances PBL realism and immersion but overlooks the construction of logic chains and evidence-based reasoning. To address these gaps, we designed e-MedLearn, a learner-centered PBL system that supports more efficient application and practice of evidence-based clinical reasoning. Through a controlled study (N=19) and testing interviews (N=13), we gathered data to assess the system's impact. The findings demonstrate that e-MedLearn improves PBL experiences and provides valuable insights for advancing clinical reasoning-based learning.

Paperid: 1593, https://arxiv.org/pdf/2503.05220.pdf

Abstract:
In this position paper, we propose researching the combination of Augmented Reality (AR) and Artificial Intelligence (AI) to support conversations, inspired by the interfaces of dialogue systems commonly found in videogames. AR-capable devices are becoming more powerful and conventional in looks, as seen in head-mounted displays (HMDs) like the Snapchat Spectacles, the XREAL glasses, or the recently presented Meta Orion. This development reduces possible ergonomic, appearance, and runtime concerns, thus allowing a more straightforward integration and extended use of AR in our everyday lives, both in private and at work. At the same time, we can observe an immense surge in AI development (also at CHI). Recently notorious Large Language Models (LLMs) like OpenAI's o3-mini or DeepSeek-R1 soar over their precursors in their ability to sustain conversations, provide suggestions, and handle complex topics in (almost) real time. In combination with natural language recognition systems, which are nowadays a standard component of smartphones and similar devices (including modern AR-HMDs), it is easy to imagine a combined system that integrates into daily conversations and provides various types of assistance. Such a system would enable many opportunities for research in AR+AI, which, as stated by Hirzle et al., remains scarce. In the following, we describe how the design of a conversational AR+AI system can learn from videogame dialogue systems, and we propose use cases and research questions that can be investigated thanks to this AR+AI combination.

Paperid: 1594, https://arxiv.org/pdf/2503.04945.pdf

Abstract:
The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.

Paperid: 1595, https://arxiv.org/pdf/2503.04516.pdf

Abstract:
In the field of conditional autonomous driving technology, driver perceived risk prediction plays a crucial role in reducing traffic risks and ensuring passenger safety. This study introduces an innovative perceived risk prediction model for human-machine interaction in intelligent driving systems. The model aims to enhance prediction accuracy and, thereby, ensure passenger safety. Through a comprehensive analysis of risk impact mechanisms, we identify three key categories of factors, both subjective and objective, influencing perceived risk: driver's personal characteristics, ego-vehicle motion, and surrounding environment characteristics. We then propose a deep-learning-based risk prediction network that uses the first two categories of factors as inputs. The network captures the interactive relationships among traffic participants in dynamic driving scenarios. Additionally, we design a personalized modeling strategy that incorporates driver-specific traits to improve prediction accuracy. To ensure high-quality training data, we conducted a rigorous video rating experiment. Experimental results show that the proposed network achieves a 10.0% performance improvement over state-of-the-art methods. These findings suggest that the proposed network has significant potential to enhance the safety of conditional autonomous driving systems.

Paperid: 1596, https://arxiv.org/pdf/2503.04114.pdf

Abstract:
Quadratic Surveys (QSs) elicit more accurate preferences than traditional methods like Likert-scale surveys. However, the cognitive load associated with QSs has hindered their adoption in digital surveys for collective decision-making. We introduce a two-phase "organize-then-vote" QS to reduce cognitive load. As interface design significantly impacts survey results and accuracy, our design scaffolds survey takers' decision-making while managing the cognitive load imposed by QS. In a 2x2 between-subject in-lab study on public resource allotment, we compared our interface with a traditional text interface across a QS with 6 (short) and 24 (long) options. Two-phase interface participants spent more time per option and exhibited shorter voting edit distances. We qualitatively observed shifts in cognitive effort from mechanical operations to constructing more comprehensive preferences. We conclude that this interface promoted deeper engagement, potentially reducing satisficing behaviors caused by cognitive overload in longer QSs. This research clarifies how human-centered design improves preference elicitation tools for collective decision-making.
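
As background for readers new to the mechanism, the sketch below illustrates the quadratic cost rule commonly associated with quadratic voting, assuming that casting v votes for or against an option costs v^2 credits from a fixed budget. The budget, option names, and function names are hypothetical and are not taken from the paper.

```python
def quadratic_cost(votes):
    """Return the total credits spent: each option costs (votes cast) squared."""
    return sum(v * v for v in votes.values())


def within_budget(votes, budget):
    """Check that an allocation does not exceed the credit budget."""
    return quadratic_cost(votes) <= budget


# Hypothetical allocation over three public-resource options.
allocation = {"parks": 3, "libraries": -2, "roads": 1}
print(quadratic_cost(allocation))      # 9 + 4 + 1 = 14
print(within_budget(allocation, 20))   # True
```

Because the cost grows with the square of the votes, expressing a strong preference on one option leaves fewer credits for the others, which is one plausible source of the cognitive load the abstract describes.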

Paperid: 1597, https://arxiv.org/pdf/2503.04084.pdf

Abstract:
Unlike static and rigid user interfaces, generative and malleable user interfaces offer the potential to respond to diverse users' goals and tasks. However, current approaches primarily rely on generating code, making it difficult for end-users to iteratively tailor the generated interface to their evolving needs. We propose employing task-driven data models (representing the essential information entities, relationships, and data within information tasks) as the foundation for UI generation. We leverage AI to interpret users' prompts and generate the data models that describe users' intended tasks, and by mapping the data models with UI specifications, we can create generative user interfaces. End-users can easily modify and extend the interfaces via natural language and direct manipulation, with these interactions translated into changes in the underlying model. The technical evaluation of our approach and user evaluation of the developed system demonstrate the feasibility and effectiveness of the proposed generative and malleable UIs.

Paperid: 1598, https://arxiv.org/pdf/2503.03606.pdf

Abstract:
Recommender ecosystems are an emerging subject of research. Such research examines how the characteristics of algorithms, recommendation consumers, and item providers influence system dynamics and long-term outcomes. One architectural possibility that has not yet been widely explored in this line of research is the consequences of a configuration in which recommendation algorithms are decoupled from the platforms they serve. This is sometimes called "the friendly neighborhood algorithm store" or "middleware" model. We are particularly interested in how such architectures might offer a range of different distributions of utility across consumers, providers, and recommendation platforms. In this paper, we create a model of a recommendation ecosystem that incorporates algorithm choice and examine the outcomes of such a design.

Paperid: 1599, https://arxiv.org/pdf/2503.03383.pdf

Abstract:
There are countless examples of how AI can cause harm, and increasing evidence that the public are willing to ascribe blame to the AI itself, regardless of how "illogical" this might seem. This raises the question of whether and how the public might expect AI to be punished for this harm. However, public expectations of the punishment of AI have been vastly underexplored. Understanding these expectations is vital, as the public may feel the lingering effect of harm unless their desire for punishment is satisfied. We synthesise research from psychology, human-computer and -robot interaction, philosophy and AI ethics, and law to highlight how our understanding of this issue is still lacking. We call for an interdisciplinary programme of research to establish how we can best satisfy victims of AI harm, for fear of creating a "satisfaction gap" where legal punishment of AI (or not) fails to meet public expectations.

Paperid: 1600, https://arxiv.org/pdf/2503.02973.pdf

Abstract:
Everyday objects and mid-air gestures have been explored as input modalities, but each has its strengths and limitations - for example, objects offer tangibility but rely on their physical presence; gestures are convenient but lack haptic feedback. We introduce Objestures ("Obj" + "Gestures"), five interaction types that utilize both modalities for a design space of expressive and playful interaction. To evaluate its usefulness, we conducted a user study (N = 12) assessing whether it can effectively support basic 3D tasks such as rotation and scaling and found it has performance comparable to or better than the headset's native freehand manipulation. To understand its user experience, we conducted case studies on three example applications - Sound, Draw, and Shadow - with the same participants, who found it intuitive, engaging, and expressive, and were interested in its everyday use. We further illustrate 30 examples to showcase how Objestures can enrich everyday interactions and discuss its limitations and implications. https://www.zhuoyuelyu.com/objestures

Paperid: 1601, https://arxiv.org/pdf/2503.02699.pdf

Abstract:
We present the results of a study comparing the performance of younger adults (YA) and people in late adulthood (PLA) across ten low-level analysis tasks and five basic visualizations, employing Bayesian regression to aggregate and model participant performance. We analyzed performance at the task level and across combinations of tasks and visualizations, reporting measures of performance at aggregate and individual levels. These analyses showed that PLA on average required more time to complete tasks while demonstrating comparable accuracy. Furthermore, at the individual level, PLA exhibited greater heterogeneity in task performance as well as differences in best-performing visualization types for some tasks. We contribute empirical knowledge on how age interacts with analysis task and visualization type and use these results to offer actionable insights and design recommendations for aging-inclusive visualization design. We invite the visualization research community to further investigate aging-aware data visualization. Supplementary materials can be found at https://osf.io/a7xtz/.

Paperid: 1602, https://arxiv.org/pdf/2503.02571.pdf

Abstract:
Biomechanical models allow for diverse simulations of user movements in interaction. Their performance depends critically on the careful design of reward functions, yet the interplay between reward components and emergent behaviours remains poorly understood. We investigate what makes a model "breathe" by systematically analysing the impact of rewarding effort minimisation, task completion, and target proximity on movement trajectories. Using a choice reaction task as a test-bed, we find that a combination of completion bonus and proximity incentives is essential for task success. Effort terms are optional, but can help avoid irregularities if scaled appropriately. Our work offers practical insights for HCI designers to create realistic simulations without needing deep reinforcement learning expertise, advancing the use of simulations as a powerful tool for interaction design and evaluation in HCI.
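
The three reward components named above can be combined in a single scalar reward. The following is a minimal sketch under stated assumptions: the additive form, the weights, the squared-activation effort measure, and all names are hypothetical illustrations, not the reward used in the paper.

```python
import math


def reward(fingertip, target, muscle_activations, hit,
           bonus=10.0, proximity_weight=1.0, effort_weight=0.1):
    """Composite reward: completion bonus minus proximity and effort penalties.

    All weights are illustrative assumptions.
    """
    distance = math.dist(fingertip, target)            # target proximity term
    effort = sum(a * a for a in muscle_activations)    # effort term (assumed form)
    completion = bonus if hit else 0.0                 # task completion bonus
    return completion - proximity_weight * distance - effort_weight * effort


# Far from the target, no hit, no effort: only the proximity penalty applies.
print(reward((0, 0, 0), (3, 4, 0), [0.0, 0.0], hit=False))  # -5.0
```

Setting `effort_weight` to zero mirrors the abstract's observation that effort terms are optional, while the bonus and proximity terms remain in place.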

Paperid: 1603, https://arxiv.org/pdf/2503.02080.pdf

Abstract:
Large language models (LLMs) have demonstrated the ability to generate text that realistically reflects a range of different subjective human perspectives. This paper studies how LLMs are seemingly able to reflect more liberal versus more conservative viewpoints among other political perspectives in American politics. We show that LLMs possess linear representations of political perspectives within activation space, wherein more similar perspectives are represented closer together. To do so, we probe the attention heads across the layers of three open transformer-based LLMs (Llama-2-7b-chat, Mistral-7b-instruct, Vicuna-7b). We first prompt models to generate text from the perspectives of different U.S. lawmakers. We then identify sets of attention heads whose activations linearly predict those lawmakers' DW-NOMINATE scores, a widely-used and validated measure of political ideology. We find that highly predictive heads are primarily located in the middle layers, often speculated to encode high-level concepts and tasks. Using probes only trained to predict lawmakers' ideology, we then show that the same probes can predict measures of news outlets' slant from the activations of models prompted to simulate text from those news outlets. These linear probes allow us to visualize, interpret, and monitor ideological stances implicitly adopted by an LLM as it generates open-ended responses. Finally, we demonstrate that by applying linear interventions to these attention heads, we can steer the model outputs toward a more liberal or conservative stance. Overall, our research suggests that LLMs possess a high-level linear representation of American political ideology and that by leveraging recent advances in mechanistic interpretability, we can identify, monitor, and steer the subjective perspective underlying generated text.
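
To make the probing and steering idea concrete, here is a minimal sketch assuming a probe is an affine readout of an activation vector and an intervention adds a scaled direction to that vector. The weights, activation values, and function names are hypothetical; this is not the authors' implementation.

```python
def probe_score(activation, weights, bias=0.0):
    """Linear probe: predicted ideology score as an affine readout of activations."""
    return sum(w * a for w, a in zip(weights, activation)) + bias


def steer(activation, direction, alpha):
    """Linear intervention: shift the activation by alpha along a direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Hypothetical probe weights and a single attention-head activation.
weights = [1.0, -2.0, 0.5]
activation = [0.2, 0.1, 0.4]

# Steering along the probe's own weight vector moves the readout
# by alpha * ||weights||^2 (up to floating-point error).
steered = steer(activation, weights, alpha=0.5)
shift = probe_score(steered, weights) - probe_score(activation, weights)
```

A positive `alpha` raises the probe's readout and a negative one lowers it, which is the sense in which a linear intervention can push outputs toward one end of the scale.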

Paperid: 1604, https://arxiv.org/pdf/2503.01769.pdf

Abstract:
A growing body of work has shown that AI-assisted methods -- leveraging large language models, social choice methods, and collective dialogues -- can help navigate polarization and surface common ground in controlled lab settings. But what can these approaches contribute in real-world contexts? We present a case study applying these techniques to find common ground between Israeli and Palestinian peacebuilders in the period following October 7th, 2023. From April to July 2024 an iterative deliberative process combining LLMs, bridging-based ranking, and collective dialogues was conducted in partnership with the Alliance for Middle East Peace. Around 138 civil society peacebuilders participated including Israeli Jews, Palestinian citizens of Israel, and Palestinians from the West Bank and Gaza. The process resulted in a set of collective statements, including demands to world leaders, with at least 84% agreement from participants on each side. In this paper, we document the process, results, challenges, and important open questions.
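
One simple way to read "agreement from participants on each side" is as a minimum over groups. The sketch below assumes a bridging-style score equal to the lowest per-group agreement rate; the scoring rule, group names, and votes are hypothetical and may differ from the ranking method used in the study.

```python
def group_agreement(votes_by_group):
    """Fraction of agreeing votes (True) within each group."""
    return {group: sum(votes) / len(votes) for group, votes in votes_by_group.items()}


def bridging_score(votes_by_group):
    """Score a statement by its weakest group-level agreement."""
    return min(group_agreement(votes_by_group).values())


# Hypothetical votes on two statements from two groups.
statements = {
    "S1": {"group_a": [True, True, True, False], "group_b": [True, True, True, True]},
    "S2": {"group_a": [True, True, True, True], "group_b": [True, False, False, False]},
}
ranked = sorted(statements, key=lambda s: bridging_score(statements[s]), reverse=True)
print(ranked)  # ['S1', 'S2']
```

Under this rule a statement that one group endorses unanimously but the other rejects ranks below a statement with moderate support on both sides.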

Paperid: 1605, https://arxiv.org/pdf/2503.01733.pdf

Abstract:
Human Activity Recognition (HAR) using ambient sensors has great potential for practical applications, particularly in elder care and independent living. However, deploying HAR systems in real-world settings remains challenging due to the high cost of labeled data, the need for pre-segmented sensor streams, and the lack of flexibility in activity granularity. To address these limitations, we introduce DISCOVER, a method designed to discover fine-grained human sub-activities from unlabeled sensor data without relying on pre-segmentation. DISCOVER combines unsupervised feature extraction and clustering with a user-friendly visualization tool to streamline the labeling process. DISCOVER enables domain experts to efficiently annotate only a minimal set of representative cluster centroids, reducing the annotation workload to a small number of samples (0.05% of our dataset). We demonstrate DISCOVER's effectiveness through a re-annotation exercise on widely used HAR datasets, showing that it uncovers finer-grained activities and produces more nuanced annotations than traditional coarse labels. DISCOVER represents a step toward practical, deployable HAR systems that adapt to diverse real environments.
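
The idea of annotating only representative samples per cluster can be sketched as follows, assuming cluster labels are already available from some unsupervised step. The selection rule (the sample nearest the cluster mean) and all names are illustrative assumptions, not DISCOVER's actual pipeline.

```python
import math
from collections import defaultdict


def representatives(points, labels):
    """Return, per cluster label, the index of the sample closest to the cluster mean."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    chosen = {}
    for label, indices in clusters.items():
        dim = len(points[indices[0]])
        centroid = [sum(points[i][d] for i in indices) / len(indices) for d in range(dim)]
        chosen[label] = min(indices, key=lambda i: math.dist(points[i], centroid))
    return chosen


# Hypothetical 2-D feature vectors with precomputed cluster labels.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0), (10.0, 11.0)]
labels = [0, 0, 0, 1, 1, 1]
print(representatives(points, labels))  # {0: 1, 1: 5}
```

An annotator would then label only the returned samples, and each label could be propagated to the other members of its cluster.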

Paperid: 1606, https://arxiv.org/pdf/2503.01608.pdf

Abstract:
As a popular form of science communication, science stories attract readers because they combine engaging narratives with comprehensible scientific knowledge. However, crafting such stories requires substantial skill and effort, as writers must navigate complex scientific concepts and transform them into coherent and accessible narratives tailored to audiences with varying levels of scientific literacy. To address the challenge, we propose RevTogether, a multi-agent system (MAS) designed to support revision of science stories with human-like AI agents (using GPT-4o). RevTogether allows AI agents to simulate affects in addition to providing comments and writing suggestions, while offering varying degrees of user agency. Our preliminary user study with non-expert writers (N=3) highlighted the need for transparency in AI agents' decision-making processes to support learning and suggested that emotional interactions could enhance human-AI collaboration in science storytelling.

Paperid: 1607, https://arxiv.org/pdf/2503.01327.pdf

Abstract:
Online abuse, a persistent aspect of social platform interactions, impacts user well-being and exposes flaws in platform designs that include insufficient detection efforts and inadequate victim protection measures. Ensuring safety in platform interactions requires the integration of victim perspectives in the design of abuse detection and response systems. In this paper, we conduct surveys (n = 230) and semi-structured interviews (n = 15) with students at a minority-serving institution in the US, to explore their experiences with abuse on a variety of social platforms, their defense strategies, and their recommendations for social platforms to improve abuse responses. We build on study findings to propose design requirements for abuse defense systems and discuss the role of privacy, anonymity, and abuse attribution requirements in their implementation. We introduce ARI, a blueprint for a unified, transparent, and personalized abuse response system for social platforms that sustainably detects abuse by leveraging the expertise of platform users, incentivized with proceeds obtained from abusers.

Paperid: 1608, https://arxiv.org/pdf/2503.01011.pdf

Abstract:
Mid-air gestures serve as a common interaction modality across Extended Reality (XR) applications, enhancing engagement and ownership through intuitive body movements. However, prolonged arm movements induce shoulder fatigue, known as "Gorilla Arm Syndrome", degrading user experience and reducing interaction duration. Although existing ergonomic techniques derived from Fitts' law (such as reducing target distance, increasing target width, and modifying control-display gain) provide some fatigue mitigation, their implementation in XR applications remains challenging due to the complex balance between user engagement and physical exertion. We present AlphaPIG, a meta-technique designed to Prolong Interactive Gestures by leveraging real-time fatigue predictions. AlphaPIG assists designers in extending and improving XR interactions by enabling automated fatigue-based interventions. Through adjustment of intervention timing and intensity decay rate, designers can explore and control the trade-off between fatigue reduction and potential effects such as decreased body ownership. We validated AlphaPIG's effectiveness through a study (N=22) implementing the widely-used Go-Go technique. Results demonstrated that AlphaPIG significantly reduces shoulder fatigue compared to non-adaptive Go-Go, while maintaining comparable perceived body ownership and agency. Based on these findings, we discuss positive and negative perceptions of the intervention. By integrating real-time fatigue prediction with adaptive intervention mechanisms, AlphaPIG constitutes a critical first step towards creating fatigue-aware applications in XR.
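
For context, the Go-Go technique referenced above maps real hand distance to virtual hand distance one-to-one up to a threshold and non-linearly beyond it. The sketch below follows the commonly cited formulation of that mapping; the threshold and coefficient values are illustrative assumptions, and the code does not model AlphaPIG's fatigue-based intervention.

```python
def gogo_virtual_distance(real_distance_cm, threshold_cm=40.0, k=0.1):
    """Go-Go mapping: linear below the threshold, quadratic extension above it.

    threshold_cm and k are hypothetical values chosen for illustration.
    """
    if real_distance_cm < threshold_cm:
        return real_distance_cm
    return real_distance_cm + k * (real_distance_cm - threshold_cm) ** 2


for reach in (30.0, 40.0, 60.0):
    print(reach, gogo_virtual_distance(reach))
```

A larger `k` lets the virtual hand reach farther for the same physical extension, which is one kind of parameter a fatigue-aware intervention could adjust.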

Paperid: 1609, https://arxiv.org/pdf/2503.00967.pdf

Abstract:
AI chatbots have emerged as promising educational tools for personalized learning experiences, with advances in large language models (LLMs) enabling teachers to create and customize these chatbots for their specific classroom needs. However, there is a limited understanding of how teachers create pedagogical chatbots and integrate them into their lessons. Through semi-structured interviews with seven K-12 teachers, we examined their practices and challenges when designing, implementing, and deploying chatbots. Our findings revealed that teachers prioritize developing task-specific chatbots aligned with their lessons. Teachers engaged in various creation practices and had different challenges; novices in chatbot creation struggled mainly with initial design and technical implementation, while experienced teachers faced challenges with technical aspects and analyzing conversational data. Based on these insights, we explore approaches to supporting teachers' chatbot development and opportunities for designing future chatbot creation systems. This work provides foundational insights from teachers that can empower teacher-created chatbots, facilitating AI-augmented teaching.

Paperid: 1610, https://arxiv.org/pdf/2503.00228.pdf

Abstract:
Judging the similarity of visualizations is crucial to various applications, such as visualization-based search and visualization recommendation systems. Recent studies show deep-feature-based similarity metrics correlate well with perceptual judgments of image similarity and serve as effective loss functions for tasks like image super-resolution and style transfer. We explore the application of such metrics to judgments of visualization similarity. We extend a similarity metric using five ML architectures and three pre-trained weight sets. We replicate results from previous crowd-sourced studies on scatterplot and visual channel similarity perception. Notably, our metric using pre-trained ImageNet weights outperformed gradient-descent tuned MS-SSIM, a multi-scale similarity metric based on luminance, contrast, and structure. Our work contributes to understanding how deep-feature-based metrics can enhance similarity assessments in visualization, potentially improving visual analysis tools and techniques. Supplementary materials are available at https://osf.io/dj2ms.

Paperid: 1611, https://arxiv.org/pdf/2502.21267.pdf

Abstract:
Recent advances in generative artificial intelligence (AI) have created models capable of high-quality musical content generation. However, little consideration is given to how to use these models for real-time or cooperative jamming musical applications because of crucial required features: low latency, the ability to communicate planned actions, and the ability to adapt to user input in real-time. To support these needs, we introduce ReaLJam, an interface and protocol for live musical jamming sessions between a human and a Transformer-based AI agent trained with reinforcement learning. We enable real-time interactions using the concept of anticipation, where the agent continually predicts how the performance will unfold and visually conveys its plan to the user. We conduct a user study where experienced musicians jam in real-time with the agent through ReaLJam. Our results demonstrate that ReaLJam enables enjoyable and musically interesting sessions, and we uncover important takeaways for future work.

Paperid: 1612, https://arxiv.org/pdf/2502.18861.pdf

Abstract:
Volunteer moderators use various strategies to address online harms within their communities. Although punitive measures like content removal or account bans are common, recent research has explored the potential for restorative justice as an alternative framework to address the distinct needs of victims, offenders, and community members. In this study, we take steps toward identifying a more concrete design space for restorative justice-oriented tools by developing ApoloBot, a Discord bot designed to facilitate apologies when harm occurs in online communities. We present results from two rounds of interviews: first, with moderators giving feedback about the design of ApoloBot, and second, after a subset of these moderators have deployed ApoloBot in their communities. This study builds on prior work to yield more detailed insights regarding the potential of adopting online restorative justice tools, including opportunities, challenges, and implications for future designs.

Paperid: 1613, https://arxiv.org/pdf/2502.18682.pdf

Abstract:
AI systems are often introduced with high expectations, yet many fail to deliver, resulting in unintended harm and missed opportunities for benefit. We frequently observe significant "AI Mismatches", where the system's actual performance falls short of what is needed to ensure safety and co-create value. These mismatches are particularly difficult to address once development is underway, highlighting the need for early-stage intervention. Navigating complex, multi-dimensional risk factors that contribute to AI Mismatches is a persistent challenge. To address it, we propose an AI Mismatch approach to anticipate and mitigate risks early on, focusing on the gap between realistic model performance and required task performance. Through an analysis of 774 AI cases, we extracted a set of critical factors, which informed the development of seven matrices that map the relationships between these factors and highlight high-risk areas. Through case studies, we demonstrate how our approach can help reduce risks in AI development.

Paperid: 1614, https://arxiv.org/pdf/2502.18357.pdf

Abstract:
AI systems powered by large language models can act as capable assistants for writing and editing. In these tasks, the AI system acts as a co-creative partner, making novel contributions to an artifact-under-creation alongside its human partner(s). One question that arises in these scenarios is the extent to which AI should be credited for its contributions. We examined knowledge workers' views of attribution through a survey study (N=155) and found that they assigned different levels of credit across different contribution types, amounts, and initiative. Compared to a human partner, we observed a consistent pattern in which AI was assigned less credit for equivalent contributions. Participants felt that disclosing AI involvement was important and used a variety of criteria to make attribution judgments, including the quality of contributions, personal values, and technology considerations. Our results motivate and inform new approaches for crediting AI contributions to co-created work.

Paperid: 1615, https://arxiv.org/pdf/2502.17898.pdf

Abstract:
Automated planning is traditionally the domain of experts, utilized in fields like manufacturing and healthcare with the aid of expert planning tools. Recent advancements in LLMs have made planning more accessible to everyday users due to their potential to assist users with complex planning tasks. However, LLMs face several application challenges within end-user planning, including consistency, accuracy, and user trust issues. This paper introduces VeriPlan, a system that applies formal verification techniques, specifically model checking, to enhance the reliability and flexibility of LLMs for end-user planning. In addition to the LLM planner, VeriPlan includes three additional core features -- a rule translator, flexibility sliders, and a model checker -- that engage users in the verification process. Through a user study (n=12), we evaluate VeriPlan, demonstrating improvements in the perceived quality, usability, and user satisfaction of LLMs. Our work shows the effective integration of formal verification and user-control features with LLMs for end-user planning tasks.

Paperid: 1616, https://arxiv.org/pdf/2502.17776.pdf

Abstract:
Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. While common, existing search systems often fail to effectively support TOT scenarios. Research on TOT retrieval is further constrained by the challenge of collecting queries, as current approaches rely heavily on community question-answering (CQA) websites, leading to labor-intensive evaluation and domain bias. To overcome these limitations, we introduce two methods for eliciting TOT queries - leveraging large language models (LLMs) and human participants - to facilitate simulated evaluations of TOT retrieval systems. Our LLM-based TOT user simulator generates synthetic TOT queries at scale, achieving high correlations with how CQA-based TOT queries rank TOT retrieval systems when tested in the Movie domain. Additionally, these synthetic queries exhibit high linguistic similarity to CQA-derived queries. For human-elicited queries, we developed an interface that uses visual stimuli to place participants in a TOT state, enabling the collection of natural queries. In the Movie domain, system rank correlation and linguistic similarity analyses confirm that human-elicited queries are both effective and closely resemble CQA-based queries. These approaches reduce reliance on CQA-based data collection while expanding coverage to underrepresented domains, such as Landmark and Person. LLM-elicited queries for the Movie, Landmark, and Person domains have been released as test queries in the TREC 2024 TOT track, with human-elicited queries scheduled for inclusion in the TREC 2025 TOT track. Additionally, we provide source code for synthetic query generation and the human query collection interface, along with curated visual stimuli used for eliciting TOT queries.

Paperid: 1617, https://arxiv.org/pdf/2502.17623.pdf

Abstract:
AI-assisted learning companion robots are increasingly used in early education. Many parents express concerns about content appropriateness, while they also value how AI and robots could supplement their limited skill, time, and energy to support their children's learning. We designed a card-based kit, SET, to systematically capture scenarios that have different extents of parental involvement. We developed a prototype interface, PAiREd, with a learning companion robot to deliver LLM-generated educational content that can be reviewed and revised by parents. Parents can flexibly adjust their involvement in the activity by determining what they want the robot to help with. We conducted an in-home field study involving 20 families with children aged 3-5. Our work contributes to an empirical understanding of the level of support parents with different expectations may need from AI and robots and a prototype that demonstrates an innovative interaction paradigm for flexibly including parents in supporting their children.

Paperid: 1618, https://arxiv.org/pdf/2502.16054.pdf

Abstract:
Given the complexity of multi-tenant cloud environments and the growing need for real-time threat mitigation, Security Operations Centers (SOCs) must adopt AI-driven adaptive defense mechanisms to counter Advanced Persistent Threats (APTs). However, SOC analysts face challenges in handling adaptive adversarial tactics, requiring intelligent decision-support frameworks. We propose a Cognitive Hierarchy Theory-driven Deep Q-Network (CHT-DQN) framework that models interactive decision-making between SOC analysts and AI-driven APT bots. The SOC analyst (defender) operates at cognitive level-1, anticipating attacker strategies, while the APT bot (attacker) follows a level-0 policy. By incorporating CHT into DQN, our framework enhances adaptive SOC defense using Attack Graph (AG)-based reinforcement learning. Simulation experiments across varying AG complexities show that CHT-DQN consistently achieves higher data protection and lower action discrepancies compared to standard DQN. A theoretical lower bound further confirms its superiority as AG complexity increases. A human-in-the-loop (HITL) evaluation on Amazon Mechanical Turk (MTurk) reveals that SOC analysts using CHT-DQN-derived transition probabilities align more closely with adaptive attackers, leading to better defense outcomes. Moreover, human behavior aligns with Prospect Theory (PT) and Cumulative Prospect Theory (CPT): participants are less likely to reselect failed actions and more likely to persist with successful ones. This asymmetry reflects amplified loss sensitivity and biased probability weighting -- underestimating gains after failure and overestimating continued success. Our findings highlight the potential of integrating cognitive models into deep reinforcement learning to improve real-time SOC decision-making for cloud security.

Paperid: 1619, https://arxiv.org/pdf/2502.15761.pdf

Abstract:
The deployment of large language models (LLMs) on extended reality (XR) devices has great potential to advance the field of human-AI interaction. In the case of direct, on-device model inference, selecting the appropriate model and device for specific tasks remains challenging. In this paper, we present AIvaluateXR, a comprehensive evaluation framework for benchmarking LLMs running on XR devices. To demonstrate the framework, we deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. Our experimental setup measures four key metrics: performance consistency, processing speed, memory usage, and battery consumption. For each of the 68 model-device pairs, we assess performance under varying string lengths, batch sizes, and thread counts, analyzing the trade-offs for real-time XR applications. We propose a unified evaluation method based on the 3D Pareto Optimality theory to select the optimal device-model pairs from quality and speed objectives. Additionally, we compare the efficiency of on-device LLMs with client-server and cloud-based setups, and evaluate their accuracy on two interactive tasks. We believe our findings offer valuable insight to guide future optimization efforts for LLM deployment on XR devices. Our evaluation method can be used as standard groundwork for further research and development in this emerging field. The source code and supplementary materials are available at: www.nanovis.org/AIvaluateXR.html
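
The Pareto-based selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three objectives, the device and model names, and all scores below are hypothetical, and every objective is assumed to be "higher is better". A device-model pair is kept only if no other pair is at least as good on every objective and strictly better on at least one.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]                    # (device, model), hypothetical labels
Objectives = Tuple[float, float, float]   # three assumed "higher is better" scores


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[Pair, Objectives]) -> Dict[Pair, Objectives]:
    """Return the candidates that no other candidate dominates."""
    return {
        pair: obj
        for pair, obj in candidates.items()
        if not any(
            dominates(other, obj)
            for other_pair, other in candidates.items()
            if other_pair != pair
        )
    }


# Made-up scores for illustration only: (quality, prompt speed, generation speed).
candidates: Dict[Pair, Objectives] = {
    ("device_a", "model_x"): (0.80, 12.0, 30.0),
    ("device_a", "model_y"): (0.70, 10.0, 25.0),  # worse than device_a/model_x on all three
    ("device_b", "model_x"): (0.60, 20.0, 28.0),
    ("device_b", "model_y"): (0.90, 5.0, 10.0),
}

front = pareto_front(candidates)
for pair in sorted(front):
    print(pair, front[pair])
```

The pairwise comparison is quadratic in the number of candidates, which is negligible for a set the size of the 68 model-device pairs mentioned above.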

Paperid: 1620, https://arxiv.org/pdf/2502.15666.pdf

Abstract:
The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Such classification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate twelve state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation (APT-Eval) dataset, which contains 14.7K samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently flag even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.

Paperid: 1621, https://arxiv.org/pdf/2502.14389.pdf

Abstract:
Argument mining algorithms analyze the argumentative structure of essays, making them a valuable tool for enhancing education by providing targeted feedback on students' argumentation skills. While current methods often use encoder or encoder-decoder deep learning architectures, decoder-only models remain largely unexplored, offering a promising research direction. This paper proposes leveraging open-source, small Large Language Models (LLMs) for argument mining through few-shot prompting and fine-tuning. These models' small size and open-source nature ensure accessibility, privacy, and computational efficiency, enabling schools and educators to adopt and deploy them locally. Specifically, we perform three tasks: segmentation of student essays into arguments, classification of the arguments by type, and assessment of their quality. We empirically evaluate the models on the Feedback Prize - Predicting Effective Arguments dataset of essays by students in grades 6-12 and demonstrate that fine-tuned small LLMs outperform baseline methods in segmenting the essays and determining the argument types, while few-shot prompting yields performance comparable to that of the baselines in assessing quality. This work highlights the educational potential of small, open-source LLMs to provide real-time, personalized feedback, enhancing independent learning and writing skills while ensuring low computational cost and privacy.

Paperid: 1622, https://arxiv.org/pdf/2502.12842.pdf

Abstract:
Effective feedback is essential for fostering students' success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However, this feedback often lacks the pedagogical validation provided by real-world practitioners. To address this limitation, our study evaluates and compares the feedback quality of LLM agents with that of human teachers and science education experts on student-written experimentation protocols. Four blinded raters, all professionals in scientific inquiry and science education, evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and 3) the science education experts using a five-point Likert scale based on six criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that LLM-generated feedback shows no significant difference to that of teachers and experts in overall quality. However, the LLM agent's performance lags in the Feed Back dimension, which involves identifying and explaining errors within the student's work context. Qualitative analysis highlighted the LLM agent's limitations in contextual understanding and in the clear communication of specific errors. Our findings suggest that combining LLM-generated feedback with human expertise can enhance educational practices by leveraging the efficiency of LLMs and the nuanced understanding of educators.

Paperid: 1623, https://arxiv.org/pdf/2502.12526.pdf

Abstract:
Cartoon videos have proven to be effective in teaching vocabulary to preschool children. However, we have little knowledge about integrating AI into cartoon videos to provide systematic, multimodal vocabulary learning support. This late-breaking work presents \name{}, an AI-powered cartoon video system that enables real-time Q&A, vocabulary review, and contextual learning. Preliminary findings contextualized how families interact with \name{} to support vocabulary learning. Parents appreciated the system for its personalized, engaging experiences, fostering collaboration, and encouraging self-reflection on parenting. This study offers valuable design implications for informing future video systems to support vocabulary learning.

Paperid: 1624, https://arxiv.org/pdf/2502.12454.pdf

Abstract:
This study investigates the feasibility and performance of using large multimodal models (LMMs) to automatically annotate human emotions in everyday scenarios. We conducted experiments on the DailyLife subset of the publicly available FERV39k dataset, employing the GPT-4o-mini model for rapid, zero-shot labeling of key frames extracted from video segments. Under a seven-class emotion taxonomy ("Angry," "Disgust," "Fear," "Happy," "Neutral," "Sad," "Surprise"), the LMM achieved an average precision of approximately 50%. In contrast, when limited to ternary emotion classification (negative/neutral/positive), the average precision increased to approximately 64%. Additionally, we explored a strategy that integrates multiple frames within 1-2 second video clips to enhance labeling performance and reduce costs. The results indicate that this approach can slightly improve annotation accuracy. Overall, our preliminary findings highlight the potential application of zero-shot LMMs in human facial emotion annotation tasks, offering new avenues for reducing labeling costs and broadening the applicability of LMMs in complex multimodal environments.

Paperid: 1625, https://arxiv.org/pdf/2502.11650.pdf

Abstract:
Based on Irish older adults' perceptions, practices, and challenges regarding password management, the goal of this study was to compile suitable advice that can benefit this demographic. To achieve this, we first conducted semi-structured interviews (n=37); we then collated advice based on best practice and what we learned from these interviews. We facilitated two independent focus groups (n=31) to evaluate and adjust this advice and tested the finalized advice through an observational study (n=15). The participants were aged between 59 and 86 and came from various counties in Ireland, both rural and urban. The findings revealed that managing multiple passwords was a significant source of frustration, leading some participants to adopt novel and informal strategies for storing them. A notable hesitation to adopt digital password managers and passphrases was also observed. Participants appreciated guidance on improving their password practices, with many affirming that securely writing down passwords was a practical strategy. Irish older adults demonstrated strong intuition regarding cybersecurity, notably expressing concerns over knowledge-based security checks used by banks and government institutions. This study aims to contribute to the aggregation of practical password advice suited to older adults, making password security more manageable and less burdensome for this demographic.

Paperid: 1626, https://arxiv.org/pdf/2502.11497.pdf

Abstract:
Virtual Reality headsets isolate users from the real world by restricting their perception to the virtual world. Video See-Through (VST) headsets address this by utilizing world-facing cameras to create Augmented Reality experiences. However, directly displaying camera feeds causes visual discomfort and cybersickness due to the inaccurate perception of scale and exaggerated motion parallax. This paper demonstrates the potential of geometry-aware passthrough systems in mitigating cybersickness through accurate depth perception. We first present a methodology to benchmark and compare passthrough algorithms. Furthermore, we design a protocol to quantitatively measure cybersickness experienced by users in VST headsets. Using this protocol, we conduct a user study to compare direct passthrough and geometry-aware passthrough systems. To the best of our knowledge, our study is the first to reveal significantly reduced nausea, disorientation, and total scores of cybersickness with geometry-aware passthrough. It also uncovers several potential avenues to further mitigate visually-induced discomfort.

Paperid: 1627, https://arxiv.org/pdf/2502.10561.pdf

Abstract:
Landmarks are critical in navigation, supporting self-orientation and mental model development. Similar to sighted people, people with low vision (PLV) frequently look for landmarks via visual cues but face difficulties identifying some important landmarks due to vision loss. We first conducted a formative study with six PLV to characterize their challenges and strategies in landmark selection, identifying their unique landmark categories (e.g., area silhouettes, accessibility-related objects) and preferred landmark augmentations. We then designed VisiMark, an AR interface that supports landmark perception for PLV by providing both overviews of space structures and in-situ landmark augmentations. We evaluated VisiMark with 16 PLV and found that VisiMark enabled PLV to perceive landmarks they preferred but could not easily perceive before, and changed PLV's landmark selection from only visually-salient objects to cognitive landmarks that are more important and meaningful. We further derive design considerations for AR-based landmark augmentation systems for PLV.

Paperid: 1628, https://arxiv.org/pdf/2502.10166.pdf

Abstract:
As digital services increasingly replace traditional analogue systems, ensuring that older adults are not left behind is critical to fostering inclusive access. This study explores how digital educators support older adults in developing essential digital skills, drawing insights from interviews with 34 educators in Ireland. These educators, both professional and volunteer, offer instruction through a range of formats, including workshops, remote calls, and in-person sessions. Our findings highlight the importance of personalized, step-by-step guidance tailored to older adults' learning needs, as well as fostering confidence through hands-on engagement with technology. Key challenges identified include limited transportation options, poor internet connectivity, outdated devices, and a lack of familial support for learning. To address these barriers, we propose enhanced public funding, expanded access to resources, and sustainable strategies such as providing relevant and practical course materials. Additionally, innovative tools like simulated online platforms for practicing digital transactions can help reduce anxiety and enhance digital literacy among older adults. This study underscores the vital role that digital educators play in bridging the digital divide, creating a more inclusive, human-centered approach to digital learning for older adults.

Paperid: 1629, https://arxiv.org/pdf/2502.09757.pdf

Abstract:
Post-intensive care syndrome (PICS) is a multifaceted condition that arises from prolonged stays in an intensive care unit (ICU). While preventing PICS among ICU patients is becoming increasingly important, interventions remain limited. Building on evidence supporting the effectiveness of art exposure in addressing the psychological aspects of PICS, we propose a novel art therapy solution through a collaborative Human-AI approach that enhances personalized therapeutic interventions using state-of-the-art Visual Art Recommendation Systems. We developed two Human-in-the-Loop (HITL) personalization methods and assessed their impact through a large-scale user study (N=150). Our findings demonstrate that this Human-AI collaboration not only enhances the personalization and effectiveness of art therapy but also supports therapists by streamlining their workload. While our study centres on PICS intervention, the results suggest that Human-AI collaborative art therapy could potentially benefit other areas where emotional support is critical, such as cases of anxiety and depression.

Paperid: 1630, https://arxiv.org/pdf/2502.08786.pdf

Abstract:
Chinese acupuncture practitioners primarily depend on muscle memory and tactile feedback to insert needles and accurately target acupuncture points, as the current workflow lacks imaging modalities and visual aids. Consequently, new practitioners often learn through trial and error, requiring years of experience to become proficient and earn the trust of patients. Medical students face similar challenges in mastering this skill. To address these challenges, we developed an innovative system, MRUCT, that integrates ultrasonic computed tomography (UCT) with mixed reality (MR) technology to visualize acupuncture points in real time. This system offers offline image registration and real-time guidance during needle insertion, enabling practitioners to accurately position needles based on anatomical structures such as bones, muscles, and auto-generated reference points, with the potential for clinical implementation. In this paper, we outline the non-rigid registration methods used to reconstruct anatomical structures from UCT data, as well as the key design considerations of the MR system. We evaluated two different 3D user interface (3DUI) designs and compared the performance of our system to traditional workflows for both new practitioners and medical students. The results highlight the potential of MR to enhance therapeutic medical practices and demonstrate the effectiveness of the system we developed.

Paperid: 1631, https://arxiv.org/pdf/2502.07732.pdf

Abstract:
Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content -- it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors' intrinsic motivations -- rather than relying solely on external incentives -- can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.

Paperid: 1632, https://arxiv.org/pdf/2502.07401.pdf

Abstract:
This research explores the opportunities of Generative AI (GenAI) in the realm of higher education through the design and development of a multimodal chatbot for an undergraduate course. Leveraging the ChatGPT API for nuanced text-based interactions and Google Bard for advanced image analysis and diagram-to-code conversions, we showcase the potential of GenAI in addressing a broad spectrum of educational queries. Additionally, the chatbot presents a file-based analyser designed for educators, offering deep insights into student feedback via sentiment and emotion analysis, and summarising course evaluations with key metrics. These combinations highlight the crucial role of multimodal conversational AI in enhancing teaching and learning processes, promising significant advancements in educational adaptability, engagement, and feedback analysis. By demonstrating a practical web application, this research underlines the imperative for integrating GenAI technologies to foster more dynamic and responsive educational environments, ultimately contributing to improved educational outcomes and pedagogical strategies.

Paperid: 1633, https://arxiv.org/pdf/2502.06341.pdf

Abstract:
The ethical, social and legal issues surrounding facial analysis technologies have been widely debated in recent years. Key critics have argued that these technologies can perpetuate bias and discrimination, particularly against marginalized groups. We contribute to this field of research by reporting on the limitations of facial analysis systems with the faces of people with Down syndrome: this particularly vulnerable group has received very little attention in the literature so far. This study involved the creation of a specific dataset of face images, comprising an experimental group with faces of people with Down syndrome and a control group with faces of people who are not affected by the syndrome. Two commercial tools were tested on the dataset on three tasks: gender recognition, age prediction and face labelling. The results show an overall lower accuracy of prediction in the experimental group, and other specific patterns of performance differences: i) high error rates in gender recognition in the category of males with Down syndrome; ii) adults with Down syndrome were more often incorrectly labelled as children; iii) social stereotypes are propagated in both the control and experimental groups, with labels related to aesthetics more often associated with women, and labels related to education level and skills more often associated with men. These results, although limited in scope, shed new light on the biases that alter face classification when applied to faces of people with Down syndrome. They confirm the structural limitation of the technology, which is inherently dependent on the datasets used to train the models.

Paperid: 1634, https://arxiv.org/pdf/2502.06251.pdf

Abstract:
Group decision-making often benefits from diverse perspectives, yet power imbalances and social influence can stifle minority opinions and compromise outcomes. This prequel introduces an AI-mediated communication system that leverages a Large Language Model to serve as a devil's advocate, representing underrepresented viewpoints without exposing minority members' identities. Rooted in persuasive communication strategies and anonymity, the system aims to improve psychological safety and foster more inclusive decision-making. Our multi-agent architecture, which consists of a summary agent, conversation agent, AI duplicate checker, and paraphrase agent, encourages the group's critical thinking while reducing repetitive outputs. We acknowledge that reliance on text-based communication and fixed intervention timings may limit adaptability, indicating pathways for refinement. By focusing on the anonymous representation of minority viewpoints in power-imbalanced settings, this approach highlights how AI-driven methods can evolve to support more divergent and inclusive group decision-making.

Paperid: 1635, https://arxiv.org/pdf/2502.05935.pdf

Abstract:
Neuromorphic Human-Computer Interaction (HCI) is a theoretical approach to designing better user experiences (UX) motivated by advances in the understanding of the neurophysiology of the brain. Inspired by the neuroscientific theory of Active Inference, Interactive Inference is a first example of such an approach. It offers a simplified interpretation of Active Inference that allows designers to more readily apply this theory to design and evaluation. In Interactive Inference, user behaviour is modeled as Bayesian inference on progress and goal distributions that predicts the next action. We show how the error between goal and progress distributions, or Bayesian surprise, can be modeled as a simple mean square error of the signal-to-noise ratio (SNR) of a task. The problem is that the user's capacity to process Bayesian surprise follows the logarithm of this SNR. This means errors rise quickly once average capacity is exceeded. Our model allows the quantitative analysis of performance and error using one framework that can provide real-time estimates of the mental load in users that needs to be minimized by design. We show how three basic laws of HCI, Hick's Law, Fitts' Law and the Power Law, can be expressed using our model. We then test the validity of the model by empirically measuring how well it predicts human performance and error in a car following task. Results suggest that driver processing capacity indeed is a logarithmic function of the SNR of the distance to a lead car. This result provides initial evidence that Interactive Inference can be useful as a new theoretical design tool.
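
For readers unfamiliar with two of the laws named in this abstract, the sketch below shows their standard textbook forms, not the paper's Interactive Inference derivation. It uses the common Hick-Hyman form for choice time and the Shannon formulation of Fitts' Law, in which the distance-to-width ratio plays a role analogous to a signal-to-noise ratio inside a logarithm. The intercept `a` and slope `b` are empirical constants; the defaults here are placeholders, not fitted values.

```python
import math


def hick_decision_time(n_choices: int, a: float = 0.0, b: float = 1.0) -> float:
    """Hick-Hyman law: decision time grows with log2(n + 1) for n equally likely choices."""
    return a + b * math.log2(n_choices + 1)


def fitts_movement_time(distance: float, width: float, a: float = 0.0, b: float = 1.0) -> float:
    """Fitts' Law (Shannon formulation): time grows with log2(D / W + 1)."""
    return a + b * math.log2(distance / width + 1)


# With a = 0 and b = 1 the result is simply the number of bits involved.
print(hick_decision_time(7))         # 7 choices
print(fitts_movement_time(7.0, 1.0)) # target 7 units away, 1 unit wide
```

Both functions return a time in whatever unit `a` and `b` are expressed in; in practice these constants are estimated by regression on measured user data.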

Paperid: 1636, https://arxiv.org/pdf/2502.05612.pdf

Abstract:
Speech-to-text technologies have been shown to improve text input efficiency and potentially lower the barriers to writing. Recent LLM-assisted dictation tools aim to support writing with speech by bridging the gaps between speaking and traditional writing. This case study reports on the real-world writing experiences of twelve academic or creative writers using one such tool, Rambler, to write various pieces such as blog posts, diaries, screenplays, notes, and fictional stories. Through a ten-day diary study, we identified the participants' in-context writing strategies using Rambler, such as how they expanded from an outline or organized their loose thoughts for different writing goals. The interviews uncovered the psychological and productivity affordances of writing with speech, pointing to future directions of designing for this writing modality and the utilization of AI support.

Paperid: 1637, https://arxiv.org/pdf/2502.05287.pdf

Abstract:
Science communication increases public interest in science by educating, engaging, and encouraging everyday people to participate in the sciences. But traditional science communication is often too formal and inaccessible for general audiences. However, there is a growing trend on social media to make it more approachable using three techniques: relatable examples to make explanations concrete, step-by-step walkthroughs to improve understanding, and personal language to drive engagement. These techniques are flashy and often garner more engagement from social media users, but the effectiveness of these techniques in actually explaining the science is unknown. Furthermore, many scientists struggle with adopting these science communication strategies for social media, fearing it might undermine their authority. We conduct a reader study to understand how these science communication techniques on social media affect readers' understanding of and engagement with the science. We found that while most readers prefer these techniques, they had diverse preferences for when and where these techniques are used. With these findings, we conducted a writer study to understand how scientists' varying comfort levels with these strategies can be supported by presenting different structure and style options. We found that the side-by-side comparison of options helped writers make editorial decisions. Instead of adhering to one direction of science communication, writers explored a continuum of options which helped them identify which communication strategies they wanted to implement.

Paperid: 1638, https://arxiv.org/pdf/2502.04983.pdf

Abstract:
Creating interactive scenes often involves complex programming tasks. Although large language models (LLMs) like ChatGPT can generate code from natural language, their output is often error-prone, particularly when scripting interactions among multiple elements. The linear conversational structure limits the editing of individual elements, and the lack of graphical and precise control complicates visual integration. To address these issues, we integrate an element-level modularization technique that processes textual descriptions for individual elements through separate LLM modules, with a central module managing interactions among elements. This modular approach allows for refining each element independently. We design a graphical user interface, MoGraphGPT, which combines modular LLMs with enhanced graphical control to generate code for 2D interactive scenes. It enables direct integration of graphical information and offers quick, precise control through automatically generated sliders. Our comparative evaluation against an AI coding tool, Cursor Composer, as the baseline system and a usability study show that MoGraphGPT significantly improves ease of use, controllability, and refinement in creating complex 2D interactive scenes with multiple visual elements in a coding-free manner.

Paperid: 1639, https://arxiv.org/pdf/2502.04525.pdf

Abstract:
We present a comparative study of building with LEGO in three environments: the physical world, a Virtual Reality (VR) counterpart, and a VR setting enhanced with "superpowers". The study aims to understand how traditional creative hands-on activities translate to virtual environments, with potential benefits for educational, training, entertainment, and therapeutic uses. 22 participants engaged in both structured assembly and creative free-building tasks across these environments. We investigated differences in user performance, engagement, and creativity, with a focus on how the additional VR functionalities influenced the building experience. The findings reveal that while the physical environment offers a familiar tactile experience, VR, particularly with added superpowers, was clearly favoured by participants in the creative free-building scenario. Our recommendations for VR design include balancing automation with user control to enhance task efficiency while maintaining engagement, and implementing intuitive systems that manage complexity to prevent user overwhelm and support creative freedom.

Paperid: 1640, https://arxiv.org/pdf/2502.03964.pdf

Abstract:
Despite living in the era of the internet, phone-based scams remain one of the most prevalent forms of scams. These scams aim to exploit victims for financial gain, causing both monetary losses and psychological distress. While governments, industries, and academia have actively introduced various countermeasures, scammers also continue to evolve their tactics, making phone scams a persistent threat. To combat these increasingly sophisticated scams, detection technologies must also advance. In this work, we propose a framework for modeling scam calls and introduce an LLM-based real-time detection approach, which assesses fraudulent intent in conversations, further providing immediate warnings to users to mitigate harm. Through experiments, we evaluate the method's performance and analyze key factors influencing its effectiveness. This analysis enables us to refine the method to improve precision while exploring the trade-off between recall and timeliness, paving the way for future directions in this critical area of research.

Paperid: 1641, https://arxiv.org/pdf/2502.03788.pdf

Abstract:
With the continuous development of generative AI's logical reasoning abilities, AI's growing code-generation potential poses challenges for both technical and creative professionals. But how can these advances be directed toward empowering junior researchers and designers who often require additional help to build and express their professional and personal identities? We introduce Frontend Diffusion, a multi-agent coding system transforming user-drawn layouts and textual prompts into refined website code, thereby supporting self-representation goals. A user study with 13 junior researchers and designers shows AI as a human capability enhancer rather than a replacement, and highlights the importance of bidirectional human-AI alignment. We then discuss future work such as leveraging AI for career development and fostering bidirectional human-AI alignment of multi-agent systems.

Paperid: 1642, https://arxiv.org/pdf/2502.02929.pdf

Abstract:
We present AudioMiXR, an augmented reality (AR) interface intended to assess how users manipulate virtual audio objects situated in their physical space using six degrees of freedom (6DoF) deployed on a head-mounted display (Apple Vision Pro) for 3D sound design. Existing tools for 3D sound design are typically constrained to desktop displays, which may limit spatial awareness of mixing within the execution environment. Utilizing an XR HMD to create soundscapes may provide a real-time test environment for 3D sound design, as modern HMDs can provide precise spatial localization assisted by cross-modal interactions. However, there is no research on design guidelines specific to sound design with 6DoF in XR. To provide a first step toward identifying design-related research directions in this space, we conducted an exploratory study where we recruited 27 participants, consisting of expert and non-expert sound designers. The goal was to assess design lessons that can be used to inform future research venues in 3D sound design. We ran a within-subjects study where users designed both a music soundscape and a cinematic soundscape. After thematically analyzing participant data, we constructed two design lessons: (1) Proprioception for AR Sound Design, and (2) Balancing Audio-Visual Modalities in AR GUIs. Additionally, we provide application domains that can benefit most from 6DoF sound design based on our results. To expand on these insights, we conducted a second within-subjects study comparing AudioMiXR to a 2D panner baseline. Results show that AudioMiXR significantly improved usability (SUS), reduced frustration and mental workload (NASA-TLX), and enhanced creativity across all subscales. These findings demonstrate that 6DoF AR interaction yields measurable gains in user experience and creative output, positioning AudioMiXR as a promising foundation for future AR-based sound design tools.

Paperid: 1643, https://arxiv.org/pdf/2502.02880.pdf

Abstract:
It is widely believed that outsourcing cognitive work to AI boosts immediate productivity at the expense of long-term human capital development. An overlooked possibility is that AI tools can support skill development by providing just-in-time, high-quality, personalized examples. In this investigation, lay forecasters predicted that practicing writing cover letters with an AI tool would impair learning compared to practicing writing letters without the tool. However, in a highly-powered pre-registered experiment, participants randomly assigned to practice writing with AI improved more on a writing test one day later compared to writers assigned to practice without AI. Notably, writers given access to the AI tool improved more despite exerting less effort, whether measured by time on task, keystrokes, or subjective ratings. We replicated and extended these results in a second pre-registered experiment, showing that writers given access to the AI tool again outperformed those who practiced on their own -- but performed no better than writers merely shown an AI-generated cover letter that they could not edit. Collectively, these findings constitute an existence proof that by providing personalized examples of high-quality work, AI tools can improve, rather than undermine, learning.

Paperid: 1644, https://arxiv.org/pdf/2502.02207.pdf

Abstract:
Teleoperation enables remote human support of automated vehicles in scenarios where the automation is not able to find an appropriate solution. Remote assistance concepts, where operators provide discrete inputs to aid specific automation modules like planning, are gaining interest due to their reduced workload on the human remote operator and improved safety. However, these concepts are challenging to implement and maintain due to their deep integration and interaction with the automated driving system. In this paper, we propose a solution to facilitate the implementation of remote assistance concepts that intervene at the planning level and extend the operational design domain of the vehicle at runtime. Using arbitration graphs, a modular decision-making framework, we integrate remote assistance into an existing automated driving system without modifying the original software components. Our simulative implementation demonstrates this approach in two use cases, allowing operators to adjust planner constraints and enable trajectory generation beyond nominal operational design domains.
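
The idea of adding remote assistance as one more option in an arbitration structure, without changing the existing behaviors, can be illustrated with a minimal priority arbitrator. This is only a sketch under stated assumptions: the class names, the state dictionary, and the example behaviors are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, object]  # hypothetical world-state container


@dataclass
class Behavior:
    """A leaf behavior that is applicable only when its condition holds."""
    name: str
    invocation_condition: Callable[[State], bool]
    command: Callable[[State], str]


class PriorityArbitrator:
    """Selects the first applicable option in priority order."""

    def __init__(self, options: List[Behavior]) -> None:
        self.options = list(options)

    def add_option(self, behavior: Behavior, index: int = 0) -> None:
        # New options are inserted without touching the existing behaviors.
        self.options.insert(index, behavior)

    def get_command(self, state: State) -> Optional[str]:
        for option in self.options:
            if option.invocation_condition(state):
                return option.command(state)
        return None


# Existing (unmodified) behaviors of a hypothetical driving stack.
follow_lane = Behavior(
    "follow_lane",
    lambda s: not s.get("lane_blocked", False),
    lambda s: "follow lane with nominal constraints",
)
emergency_stop = Behavior("emergency_stop", lambda s: True, lambda s: "stop")

arbitrator = PriorityArbitrator([follow_lane, emergency_stop])

# Remote assistance (hypothetical): applicable only while the lane is blocked
# and an operator has supplied relaxed planner constraints.
remote_assist = Behavior(
    "remote_assist",
    lambda s: bool(s.get("lane_blocked", False)) and "operator_constraints" in s,
    lambda s: f"plan with operator constraints: {s['operator_constraints']}",
)
arbitrator.add_option(remote_assist, index=0)
```

In this sketch the nominal behaviors stay untouched; the operator input only becomes relevant when the remote-assistance option's condition is met, and the fallback remains available otherwise.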

Paperid: 1645, https://arxiv.org/pdf/2502.00888.pdf

Abstract:
This paper presents our solution to the 2025 3DUI Contest challenge. We aimed to develop a collaborative, immersive experience that raises awareness about trash pollution in natural landscapes while enhancing traditional interaction techniques in virtual environments. To achieve these objectives, we created an engaging multiplayer game where one user collects harmful pollutants while the other user provides medication to impacted wildlife using enhancements to traditional interaction techniques: HOMER and Fishing Reel. We enhanced HOMER to use a cone volume to reduce the precise aiming required by a selection raycast to provide a more efficient means to collect pollutants at large distances, coined as FLOW-MATCH. To improve the animal feed distribution to wildlife far away from the user with Fishing Reel, we created RAWR-XD, an asymmetric bi-manual technique to more conveniently adjust the reeling speed using the non-selecting wrist rotation of the user.
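
Replacing a selection raycast with a cone volume can be sketched as an angular test around the pointing direction. The following is a minimal illustration, not the authors' FLOW-MATCH implementation: the function name, the default half-angle, and the tie-breaking rule (smallest angular offset wins) are assumptions made for the example.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(v: Vec3) -> float:
    return math.sqrt(_dot(v, v))


def cone_select(
    origin: Vec3,
    direction: Vec3,
    targets: Dict[str, Vec3],
    half_angle_deg: float = 10.0,
) -> Optional[str]:
    """Return the target closest to the cone axis, or None if none is inside."""
    d_len = _norm(direction)
    if d_len == 0.0:
        raise ValueError("direction must be non-zero")
    best_name: Optional[str] = None
    best_angle = math.radians(half_angle_deg)
    for name, position in targets.items():
        to_target = _sub(position, origin)
        dist = _norm(to_target)
        if dist == 0.0:
            continue  # target coincides with the hand; angle is undefined
        cos_angle = _dot(to_target, direction) / (dist * d_len)
        angle = math.acos(max(-1.0, min(1.0, cos_angle)))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name


# Hypothetical scene: the user points straight ahead along +z.
pollutants = {
    "bottle": (0.5, 0.0, 10.0),   # slightly off-axis, far away
    "can": (3.0, 0.0, 10.0),      # well outside a 10 degree cone
    "bag": (0.0, 0.0, -5.0),      # behind the user
}
selected = cone_select((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), pollutants)
```

An exact ray along +z would pass beside the distant bottle, whereas the cone tolerates the small angular error, which is the kind of reduced aiming precision the abstract describes.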

Paperid: 1646, https://arxiv.org/pdf/2501.19245.pdf

Abstract:
Reinforcement learning (RL) offers a general approach for modeling and training AI agents, including human-AI interaction scenarios. In this paper, we propose SHARPIE (Shared Human-AI Reinforcement Learning Platform for Interactive Experiments) to address the need for a generic framework to support experiments with RL agents and humans. Its modular design consists of a versatile wrapper for RL environments and algorithm libraries, a participant-facing web interface, logging utilities, and deployment on popular cloud and participant recruitment platforms. It empowers researchers to study a wide variety of research questions related to the interaction between humans and RL agents, including those related to interactive reward specification and learning, learning from human feedback, action delegation, preference elicitation, user-modeling, and human-AI teaming. The platform is based on a generic interface for human-RL interactions that aims to standardize the field of study on RL in human contexts.

Paperid: 1647, https://arxiv.org/pdf/2501.15711.pdf

Abstract:
By overlaying time-synced user comments on videos, Danmu creates a co-watching experience for online viewers. However, its visual-centric design poses significant challenges for blind and low vision (BLV) viewers. Our formative study identified three primary challenges that hinder BLV viewers' engagement with Danmu: the lack of visual context, the speech interference between comments and videos, and the disorganization of comments. To address these challenges, we present DanmuA11y, a system that makes Danmu accessible by transforming it into multi-viewer audio discussions. DanmuA11y incorporates three core features: (1) Augmenting Danmu with visual context, (2) Seamlessly integrating Danmu into videos, and (3) Presenting Danmu via multi-viewer discussions. Evaluation with twelve BLV viewers demonstrated that DanmuA11y significantly improved Danmu comprehension, provided smooth viewing experiences, and fostered social connections among viewers. We further highlight implications for enhancing commentary accessibility in video-based social media and live-streaming platforms.

Paperid: 1648, https://arxiv.org/pdf/2501.15413.pdf

Abstract:
Woodworkers have to navigate multiple considerations when planning a project, including available resources, skill level, and intended effort. Do-it-yourself (DIY) woodworkers face these challenges most acutely because of tight material constraints and a desire for custom designs tailored to specific spaces. To address these needs, we present XR-penter, an extended reality (XR) application that supports in situ, material-aware woodworking for casual makers. Our system enables users to design virtual scrap wood assemblies directly in their workspace, encouraging sustainable practices through the use of discarded materials. Users register physical material as virtual twins, manipulate these twins into an assembly in XR, and preview cuts needed for fabrication. We conducted a case study and feedback sessions to demonstrate how XR-penter supports improvisational workflows in practice, the type of woodworker who would benefit most from our system, and insights on integrating similar spatial and material considerations into future work.

Paperid: 1649, https://arxiv.org/pdf/2501.15408.pdf

Abstract:
Reminiscing with photo collections offers significant psychological benefits but poses challenges for people with visual impairment (PVI). Their current reliance on sighted help restricts the flexibility of this activity. In response, we explored using a chatbot in a preliminary study. We identified two primary challenges that hinder effective reminiscence with a chatbot: the scattering of information and a lack of proactive guidance. To address these limitations, we present Memory Reviver, a proactive chatbot that helps PVI reminisce with a photo collection through natural language communication. Memory Reviver incorporates two novel features: (1) a Memory Tree, which uses a hierarchical structure to organize the information in a photo collection; and (2) a Proactive Strategy, which actively delivers information to users at proper conversation rounds. Evaluation with twelve PVI demonstrated that Memory Reviver effectively facilitated engaging reminiscence, enhanced understanding of photo collections, and delivered natural conversational experiences. Based on our findings, we distill implications for supporting photo reminiscence and designing chatbots for PVI.

Paperid: 1650, https://arxiv.org/pdf/2501.14163.pdf

Abstract:
Rules are a critical component of the functioning of nearly every online community, yet it is challenging for community moderators to make data-driven decisions about what rules to set for their communities. The connection between a community's rules and how its membership feels about its governance is not well understood. In this work, we conduct the largest-to-date analysis of rules on Reddit, collecting a set of 67,545 unique rules across 5,225 communities which collectively account for more than 67% of all content on Reddit. More than just a point-in-time study, our work measures how communities change their rules over a 5+ year period. We develop a method to classify these rules using a taxonomy of 17 key attributes extended from previous work. We assess what types of rules are most prevalent, how rules are phrased, and how they vary across communities of different types. Using a dataset of communities' discussions about their governance, we are the first to identify the rules most strongly associated with positive community perceptions of governance: rules addressing who participates, how content is formatted and tagged, and rules about commercial activities. We conduct a longitudinal study to quantify the impact of adding new rules to communities, finding that after a rule is added, community perceptions of governance immediately improve, yet this effect diminishes after six months. Our results have important implications for platforms, moderators, and researchers. We make our classification model and rules datasets public to support future research on this topic.

Paperid: 1651, https://arxiv.org/pdf/2501.11540.pdf

Abstract:
Gaze-based interaction techniques have created significant interest in the field of spatial interaction. Many of these methods require additional input modalities, such as hand gestures (e.g., gaze coupled with pinch). Those can be uncomfortable and difficult to perform in public or limited spaces, and pose challenges for users who are unable to execute pinch gestures. To address these aspects, we propose a novel, hands-free Gaze+Blink interaction technique that leverages the user's gaze and intentional eye blinks. This technique enables users to perform selections by executing intentional blinks. It facilitates continuous interactions, such as scrolling or drag-and-drop, through eye blinks coupled with head movements. So far, this concept has not been explored for hands-free spatial interaction techniques. We evaluated the performance and user experience (UX) of our Gaze+Blink method with two user studies and compared it with Gaze+Pinch in a realistic user interface setup featuring common menu interaction tasks. Study 1 demonstrated that while Gaze+Blink achieved comparable selection speeds, it was prone to accidental selections resulting from unintentional blinks. In Study 2, we explored an enhanced technique employing a deep learning algorithm for filtering out unintentional blinks.

Paperid: 1652, https://arxiv.org/pdf/2501.10517.pdf

Abstract:
Search engines, as cognitive partners, reshape how individuals evaluate their cognitive abilities. This study examines how search tool access influences cognitive self-esteem (CSE) -- users' self-perception of cognitive abilities -- through the lens of transactive memory systems. Using a within-subject design with 164 participants, we found that CSE significantly inflates when users have access to search tools, driven by cognitive offloading. Participants with lower initial CSE exhibited greater shifts, highlighting individual differences. Search self-efficacy mediated the relationship between prior search experience and CSE, emphasizing the role of users' past interactions. These findings reveal opportunities for search engine design: interfaces that promote awareness of cognitive offloading and foster self-reflection can support accurate metacognitive evaluations, reducing overreliance on external tools. This research contributes to HCI by demonstrating how interactive systems shape cognitive self-perception, offering actionable insights for designing human-centered tools that balance user confidence and cognitive independence.

Paperid: 1653, https://arxiv.org/pdf/2501.10383.pdf

Abstract:
The Generative AI Ethics Playbook provides guidance for identifying and mitigating risks of machine learning systems across various domains, including natural language processing, computer vision, and generative AI. This playbook aims to assist practitioners in diagnosing potential harms that may arise during the design, development, and deployment of datasets and models. It offers concrete strategies and resources for mitigating these risks, to help minimize negative impacts on users and society. Drawing on current best practices in both research and ethical considerations, this playbook aims to serve as a comprehensive resource for AI/ML practitioners. The intended audience of this playbook includes machine learning researchers, engineers, and practitioners who are involved in the creation and implementation of generative and multimodal models (e.g., text-to-text, image-to-image, text-to-image, text-to-video). Specifically, we provide transparency/documentation checklists, topics of interest, common questions, examples of harms through case studies, and resources and strategies to mitigate harms throughout the Generative AI lifecycle. This playbook was made collaboratively over the course of 16 months through extensive literature review of over 100 resources and peer-reviewed articles, as well as through an initial group brainstorming session with 18 interdisciplinary AI ethics experts from industry and academia, and with additional feedback from 8 experts (5 of whom were in the initial brainstorming session). We note that while this playbook provides examples, discussion, and harm mitigation strategies, research in this area is ongoing. Our playbook aims to be a practically useful survey, taking a high-level view rather than aiming to cover the entire existing body of research.

Paperid: 1654, https://arxiv.org/pdf/2501.09457.pdf

Abstract:
Extracting concepts and understanding relationships from videos is essential in Video-Based Design (VBD), where videos serve as a primary medium for exploration but require significant effort in managing meta-information. Mind maps, with their ability to visually organize complex data, offer a promising approach for structuring and analysing video content. Recent advancements in Large Language Models (LLMs) provide new opportunities for meta-information processing and visual understanding in VBD, yet their application remains underexplored. This study recruited 28 VBD practitioners to investigate the use of prompt-tuned LLMs for generating mind maps from ethnographic videos. Comparing LLM-generated mind maps with those created by professional designers, we evaluated rated scores, design effectiveness, and user experience across two contexts. Findings reveal that LLMs effectively capture central concepts but struggle with hierarchical organization and contextual grounding. We discuss trust, customization, and workflow integration as key factors to guide future research on LLM-supported information mapping in VBD.

Paperid: 1655, https://arxiv.org/pdf/2501.09210.pdf

Abstract:
As generative AI products can generate code and seamlessly assist students in learning to program, integrating AI into programming education has drawn much attention. However, one emerging concern is that students might get answers without learning from the LLM-generated content. In this work, we deployed LLM-powered personalized Parsons puzzles as scaffolding for write-code practice in a Python learning classroom (PC condition) and conducted an 80-minute randomized between-subjects study. Both conditions received the same practice problems. The only difference was that when requesting help, the control condition showed students a complete solution (CC condition), simulating the most traditional LLM output. Results indicated that students who received personalized Parsons puzzles as scaffolding engaged in practicing significantly longer than those who received complete solutions when struggling.
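
A Parsons puzzle presents the lines of a correct solution in scrambled order and asks the learner to rearrange them. The sketch below shows only that basic mechanic under simple assumptions; the function names are hypothetical, and the LLM-based personalization studied in the paper is not reproduced here.

```python
import random
from typing import List, Sequence


def solution_blocks(solution: str) -> List[str]:
    """Split a solution into its non-empty lines, keeping indentation."""
    return [line for line in solution.splitlines() if line.strip()]


def make_parsons_puzzle(solution: str, seed: int = 0) -> List[str]:
    """Return the solution's lines in shuffled order."""
    blocks = solution_blocks(solution)
    shuffled = blocks[:]
    random.Random(seed).shuffle(shuffled)
    # Avoid handing out the solved order when a different order exists.
    if shuffled == blocks and len(set(blocks)) > 1:
        shuffled = blocks[1:] + blocks[:1]
    return shuffled


def is_solved(arrangement: Sequence[str], solution: str) -> bool:
    """Check whether the learner's arrangement matches the solution order."""
    return list(arrangement) == solution_blocks(solution)


# Hypothetical practice problem: sum the elements of a list.
reference = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
puzzle = make_parsons_puzzle(reference)
```

Unlike showing the complete solution, the learner still has to reason about ordering and control flow before `is_solved` returns true.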

Paperid: 1656, https://arxiv.org/pdf/2501.08774.pdf

Abstract:
Artificial intelligence (AI), including large language models and generative AI, is emerging as a significant force in software development, offering developers powerful tools that span the entire development lifecycle. Although software engineering research has extensively studied AI tools in software development, the specific types of interactions between developers and these AI-powered tools have only recently begun to receive attention. Understanding and improving these interactions has the potential to enhance productivity, trust, and efficiency in AI-driven workflows. In this paper, we propose a taxonomy of interaction types between developers and AI tools, identifying eleven distinct interaction types, such as auto-complete code suggestions, command-driven actions, and conversational assistance. Building on this taxonomy, we outline a research agenda focused on optimizing AI interactions, improving developer control, and addressing trust and usability challenges in AI-assisted development. By establishing a structured foundation for studying developer-AI interactions, this paper aims to stimulate research on creating more effective, adaptive AI tools for software development.

Paperid: 1657, https://arxiv.org/pdf/2501.08626.pdf

Abstract:
When humans interact with learning-based control systems, a common goal is to minimize a cost function known only to the human. For instance, an exoskeleton may adapt its assistance in an effort to minimize the human's metabolic cost-of-transport. Conventional approaches to synthesizing the learning algorithm solve an inverse problem to infer the human's cost. However, these problems can be ill-posed, hard to solve, or sensitive to problem data. Here we show a game-theoretic learning algorithm that works solely by observing human actions to find the cost minimum, avoiding the need to solve an inverse problem. We evaluate the performance of our algorithm in an extensive set of human subjects experiments, demonstrating consistent convergence to the minimum of a prescribed human cost function in scalar and multidimensional instantiations of the game. We conclude by outlining future directions for theoretical and empirical extensions of our results.

Paperid: 1658, https://arxiv.org/pdf/2501.02304.pdf

Abstract:
While augmented reality shows promise for supporting human-robot collaboration, creating such interactive systems still poses great challenges. Addressing this, we introduce ARTHUR, an open-source authoring tool for augmented reality-supported human-robot collaboration. ARTHUR supports 20 types of multi-modal feedback to convey robot, task, and system state, 10 actions that enable the user to control the robot and system, and 18 conditions for feedback customization and triggering of actions. By combining these elements, users can create interaction spaces, controls, and information visualizations in augmented reality for collaboration with robot arms. With ARTHUR, we propose to combine desktop interfaces and touchscreen devices for effective authoring, with head-mounted displays for testing and in-situ refinements. To demonstrate the general applicability of ARTHUR for human-robot collaboration scenarios, we replicate representative examples from prior work. Further, in an evaluation with five participants, we reflect on the usefulness of our hybrid user interface approach and the provided functionality, highlighting directions for future work.

Paperid: 1659, https://arxiv.org/pdf/2501.02233.pdf

Abstract:
Deaf and hard-of-hearing (DHH) students face significant challenges in specialized educational settings, such as limited exposure to written and spoken language, a lack of tailored educational tools, and restricted access to resources, impacting their language literacy development and overall educational experience. We, therefore, employed a User-Centered Design (UCD) process, collaborating with 8 DHH students and 2 Teachers of the Deaf (ToDs) from a School of Deaf to effectively develop and utilize a real-time captioning augmented reality (AR) system in their school settings, aiming to enhance their learning experience. A user study with 24 DHH participants revealed a strong preference (87.5%) for our system, underscoring its potential to enhance the learning experience. We present a comprehensive needs analysis, the UCD process, system implementation, and user feedback, showcasing the effectiveness of tailored AR caption interfaces for DHH students. We also discuss the implications for future development of educational technologies for DHH students.

Paperid: 1660, https://arxiv.org/pdf/2501.01884.pdf

Abstract:
Telegram emerged as a crucial platform for both parties during the conflict between Russia and Ukraine. Per its minimal policies for content moderation, pro-Kremlin narratives and potential misinformation were spread on Telegram, while anti-Kremlin narratives with related content were also propagated, such as war footage, troop movements, maps of bomb shelters, and air raid warnings. This paper presents a dataset of posts from both pro-Kremlin and anti-Kremlin Telegram channels, collected over a period spanning a year before and a year after the Russian invasion. The dataset comprises 404 pro-Kremlin channels with 4,109,645 posts and 114 anti-Kremlin channels with 1,117,768 posts. We provide details on the data collection process, processing methods, and dataset characterization. Lastly, we discuss the potential research opportunities this dataset may enable for researchers across various disciplines.

Paperid: 1661, https://arxiv.org/pdf/2501.01285.pdf

Abstract:
Augmented Reality (AR) functionalities may be effectively leveraged in collaborative service scenarios (e.g., remote maintenance, on-site building, street gaming). Standard development cycles for collaborative AR require coding for each specific visualization platform and implementing the necessary control mechanisms over the shared assets. This paper describes SARA, an architecture to support cross-platform collaborative Augmented Reality applications based on microservices. The architecture is designed to work over the concept of collaboration models (turn, layer, ownership, hierarchy-based and unconstrained examples) which regulate the interaction and permissions of each user over the AR assets. Thanks to the reusability of its components, during the development of an application, SARA enables focusing on the application logic while avoiding the implementation of the communication protocol, data model handling and orchestration between the different, possibly heterogeneous, devices involved in the collaboration (i.e., mobile or wearable AR devices using different operating systems). To describe how to build an application based on SARA, a prototype for HoloLens and iOS devices has been implemented. The prototype is a collaborative voxel-based game in which several players work together in real time on a piece of land, adding or eliminating cubes in a collaborative manner to create buildings and landscapes. Turn-based and unconstrained collaboration models are applied to regulate the interaction. The development workflow for this case study shows how the architecture serves as a framework to support the deployment of collaborative AR services, enabling the reuse of collaboration model components and agnostically handling client technologies.
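
How a collaboration model can regulate each user's permissions over shared assets may be easier to see in a small example. The sketch below contrasts a turn-based and an unconstrained model for a voxel world; all class and function names are hypothetical illustrations and do not reflect SARA's actual interfaces or microservices.

```python
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]


class UnconstrainedModel:
    """Every user may edit any asset at any time."""

    def can_edit(self, user: str, asset_id: str) -> bool:
        return True


class TurnBasedModel:
    """Only the user whose turn it is may edit; turns rotate in order."""

    def __init__(self, users: List[str]) -> None:
        if not users:
            raise ValueError("at least one user is required")
        self.users = list(users)
        self._turn = 0

    @property
    def current_user(self) -> str:
        return self.users[self._turn]

    def can_edit(self, user: str, asset_id: str) -> bool:
        return user == self.current_user

    def end_turn(self, user: str) -> None:
        if user != self.current_user:
            raise PermissionError(f"it is not {user}'s turn")
        self._turn = (self._turn + 1) % len(self.users)


def place_cube(model, world: Set[Voxel], user: str, position: Voxel) -> bool:
    """Add a cube if the collaboration model grants the user permission."""
    if not model.can_edit(user, asset_id="land"):
        return False
    world.add(position)
    return True


# Hypothetical session with two players sharing one piece of land.
world: Set[Voxel] = set()
turns = TurnBasedModel(["alice", "bob"])
first = place_cube(turns, world, "alice", (0, 0, 0))   # alice's turn
denied = place_cube(turns, world, "bob", (1, 0, 0))    # bob must wait
turns.end_turn("alice")
second = place_cube(turns, world, "bob", (1, 0, 0))    # now bob's turn
```

Because the application logic (`place_cube`) only calls `can_edit`, swapping `TurnBasedModel` for `UnconstrainedModel` changes the collaboration rules without changing the game code, which mirrors the reuse argument made in the abstract.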

Paperid: 1662, https://arxiv.org/pdf/2501.00939.pdf

Abstract:
The Web has drastically simplified our access to knowledge and learning, and fact-checking online resources has become a part of our daily routine. Studying online knowledge consumption is thus critical for understanding human behavior and informing the design of future platforms. In this Chapter, we approach this subject by describing the navigation patterns of the readers of Wikipedia, the world's largest platform for open knowledge. We provide a comprehensive overview of what is known about the three steps that characterize navigation on Wikipedia: (1) how readers reach the platform, (2) how readers navigate the platform, and (3) how readers leave the platform. Finally, we discuss open problems and opportunities for future research in this field.

Paperid: 1663, https://arxiv.org/pdf/2501.00935.pdf

Abstract:
Dynamic gesture recognition is one of the challenging research areas due to variations in pose, size, and shape of the signer's hand. In this letter, Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) for dynamic hand gesture recognition is proposed. A pyramidal hierarchy of multiscale features is extracted using the transformer multiscaled head attention model. The proposed model employs different attention dimensions for each head of the transformer which enables it to provide attention at the multiscale level. Further, in addition to single modality, recognition performance using multiple modalities is examined. Extensive experiments demonstrate the superior performance of the proposed MsMHA-VTN with an overall accuracy of 88.22% and 99.10% on NVGesture and Briareo datasets, respectively.

Paperid: 1664, https://arxiv.org/pdf/2506.24104.pdf

Abstract:
Digital twins (DT) are increasingly used in healthcare to model patients, processes, and physiological systems. While recent solutions leverage visualization, visual analytics, and user interaction, these systems rarely incorporate structured service design methodologies. Bridging service design with visual analytics and visualization can be valuable for the healthcare DT community. This paper aims to introduce the service design discipline to visualization researchers by framing this integration gap and suggesting research directions to enhance the real-world applicability of DT solutions.

Paperid: 1665, https://arxiv.org/pdf/2506.24057.pdf

Abstract:
The popularity of accessibility research has grown recently, improving digital inclusion for people with disabilities. However, researchers, including those who have disabilities, have attempted to include people with disabilities in all aspects of design, and they have identified a myriad of practical accessibility barriers posed by tools and methods leveraged by human-computer interaction (HCI) researchers during prototyping. To build a more inclusive technological landscape, we must question the effectiveness of existing prototyping tools and methods, repurpose/retrofit existing resources, and build new tools and methods to support the participation of both researchers and people with disabilities within the prototyping design process of novel technologies. This full-day workshop at CHI 2025 will provide a platform for HCI researchers, designers, and practitioners to discuss barriers and opportunities for creating accessible prototyping and promote hands-on ideation and fabrication exercises aimed at futuring accessible prototyping.

Paperid: 1666, https://arxiv.org/pdf/2506.23694.pdf

Abstract:
The process of requirements analysis requires an understanding of the end users of a system. Thus, expert stakeholders, such as User Experience (UX) designers, usually create various descriptions containing information about the users and their possible needs. In our paper, we investigate to what extent UX novices are able to write such descriptions into user scenarios. We conducted a user study with 60 participants consisting of 30 UX experts and 30 novices who were asked to write a user scenario with or without the help of an LLM-supported writing assistant. Our findings show that LLMs empower laypersons to write reasonable user scenarios and provide first-hand insights for requirements analysis that are comparable to UX experts in terms of structure and clarity, while especially excelling at audience-orientation. We present our qualitative and quantitative findings, including user scenario anatomies, potential influences, and differences in the way participants approached the task.

Paperid: 1667, https://arxiv.org/pdf/2506.23682.pdf

Abstract:
A digital security-by-design computer architecture, like CHERI, lets you program without fear of buffer overflows or other memory safety errors, but CHERI also rewrites some of the assumptions about how C works and how fundamental types (such as pointers) are implemented in hardware. We conducted a usability study to examine how developers react to the changes required by CHERI when porting software to run on it. We find that developers struggle with CHERI's display of warnings and errors and a lack of diverse documentation.

Paperid: 1668, https://arxiv.org/pdf/2506.22597.pdf

Abstract:
Wayfinding, the ability to recall the environment and navigate through it, is an essential cognitive skill relied upon almost every day in a person's life. A crucial component of wayfinding is the construction of cognitive maps, mental representations of the environments through which a person travels. Age, disease or injury can severely affect cognitive mapping, making assessment of this basic survival skill particularly important to clinicians and therapists. Cognitive mapping has also been the focus of decades of basic research by cognitive psychologists. Both communities have evolved a number of techniques for assessing cognitive mapping ability. We present the Cognitive Map Probe (CMP), a new computerized tool for assessment of cognitive mapping ability that increases consistency and promises improvements in flexibility, accessibility, sensitivity and control. The CMP uses a tangible user interface that affords spatial manipulation. We describe the design of the CMP, and find that it is sensitive to factors known to affect cognitive mapping performance in extensive experimental testing.

Paperid: 1669, https://arxiv.org/pdf/2506.21898.pdf

Abstract:
Large language models (LLMs) are becoming increasingly ubiquitous in our daily lives, but numerous concerns about bias in LLMs exist. This study examines how gender-diverse populations perceive bias, accuracy, and trustworthiness in LLMs, specifically ChatGPT. Through 25 in-depth interviews with non-binary/transgender, male, and female participants, we investigate how gendered and neutral prompts influence model responses and how users evaluate these responses. Our findings reveal that gendered prompts elicit more identity-specific responses, with non-binary participants particularly susceptible to condescending and stereotypical portrayals. Perceived accuracy was consistent across gender groups, with errors most noted in technical topics and creative tasks. Trustworthiness varied by gender, with men showing higher trust, especially in performance, and non-binary participants demonstrating higher performance-based trust. Additionally, participants suggested improving the LLMs by diversifying training data, ensuring equal depth in gendered responses, and incorporating clarifying questions. This research contributes to the CSCW/HCI field by highlighting the need for gender-diverse perspectives in LLM development in particular and AI in general, to foster more inclusive and trustworthy systems.

Paperid: 1670, https://arxiv.org/pdf/2506.20884.pdf

Abstract:
"TikTok, Do Your Thing" is a viral trend where users attempt to identify strangers they see in public via information crowd-sourcing. The trend started as early as 2021 and users typically engage with it for romantic purposes (similar to a "Missed Connections" personal advertisement). This practice includes acts of surveillance and identification in the public sphere, although by peers rather than governments or corporations. To understand users' reactions to this trend we conducted a qualitative analysis of 60 TikTok videos and 1,901 user comments. Of the 60 videos reviewed, we find 19 individuals were successfully identified. We also find that while there were comments expressing disapproval (n=310), more than double the number expressed support (n=883). Supportive comments demonstrated genuine interest and empathy, reflecting evolving conceptions of community and algorithmic engagement. On the other hand, disapproving comments highlighted concerns about inappropriate relationships, stalking, consent, and gendered double standards. We discuss these insights in relation to the normalization of interpersonal surveillance, online stalking, and as an evolution of social surveillance to offer a new perspective on user perceptions surrounding interpersonal surveillance and identification in the public sphere.

Paperid: 1671, https://arxiv.org/pdf/2506.20207.pdf

Abstract:
Mobile Augmented Reality (AR) applications leverage various sensors to provide immersive user experiences. However, their reliance on diverse data sources introduces significant privacy challenges. This paper investigates user perceptions and understanding of privacy permissions in mobile AR apps through an analysis of existing applications and an online survey of 120 participants. Findings reveal common misconceptions, including confusion about how permissions relate to specific AR functionalities (e.g., location and measurement of physical distances), and misinterpretations of permission labels (e.g., conflating camera and gallery access). We identify a set of actionable implications for designing more usable and transparent privacy mechanisms tailored to mobile AR technologies, including contextual explanations, modular permission requests, and clearer permission labels. These findings offer actionable guidance for developers, researchers, and policymakers working to enhance privacy frameworks in mobile AR.

Paperid: 1672, https://arxiv.org/pdf/2506.20091.pdf

Abstract:
Recent advances in multi-agentic systems (e.g. AutoGen, OpenAI Swarm) allow users to interact with a group of specialised AI agents rather than a single general-purpose agent. Despite the promise of this new paradigm, the HCI community has yet to fully examine the opportunities, risks, and user-centred challenges it introduces. We contribute to research on multi-agentic systems by exploring their architectures and key features through a human-centred lens. While literature and use cases remain limited, we build on existing tools and frameworks available to developers to identify a set of overarching challenges, e.g. orchestration and conflict resolution, that can guide future research in HCI. We illustrate these challenges through examples, offer potential design considerations, and provide research opportunities to spark interdisciplinary conversation. Our work lays the groundwork for future exploration and offers a research agenda focused on user-centred design in multi-agentic systems.

Paperid: 1673, https://arxiv.org/pdf/2506.18711.pdf

Abstract:
The goal of this study is to identify factors that support and enhance older adults' creative experiences in human-robot co-creativity. Because the use of robots for creativity support with older adults remains underexplored, we carried out an exploratory case study. We took a participatory approach and collaborated with professional art educators to design a course, Drawing with Robots, for adults aged 65 and over. The course featured human-human and human-robot drawing activities with various types of robots. We observed collaborative drawing interactions, interviewed participants on their experiences, and analyzed collected data. Findings show that participants preferred acting as curators, evaluating creative suggestions from the robot in a teacher or coach role. When we enhanced a robot with a multimodal Large Language Model (LLM), participants appreciated its spoken dialogue capabilities. They reported, however, that the robot's feedback sometimes lacked an understanding of the context and sensitivity to their artistic goals and preferences. Our findings highlight the potential of LLM-enhanced robots to support creativity and offer future directions for advancing human-robot co-creativity with older adults.

Paperid: 1674, https://arxiv.org/pdf/2506.18317.pdf

Abstract:
Indoor localization opens the path to potentially transformative applications. Although many indoor localization methods have been proposed over the years, they remain too impractical for widespread deployment in the real world. In this paper, we introduce PeepLoc, a deployable and scalable Wi-Fi-based solution for indoor localization that relies only on pre-existing devices and infrastructure. Specifically, PeepLoc works on any mobile device with an unmodified Wi-Fi transceiver and in any indoor environment with a sufficient number of Wi-Fi access points (APs) and pedestrian traffic. At the core of PeepLoc is (a) a mechanism which allows any Wi-Fi device to obtain non-cooperative time-of-flight (ToF) to any Wi-Fi AP and (b) a novel bootstrapping mechanism that relies on pedestrian dead reckoning (PDR) and crowdsourcing to opportunistically initialize pre-existing APs as anchor points within an environment. We implement PeepLoc using commodity hardware and evaluate it extensively across 4 campus buildings. We show PeepLoc leads to a mean and median positional error of 3.41 m and 3.06 m respectively, which is superior to existing deployed indoor localization systems and is competitive with commodity GPS in outdoor environments.
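
As a rough illustration of how time-of-flight ranging to anchor access points can yield a position, the following minimal sketch converts a round-trip ToF into a distance and then solves a linearized least-squares multilateration in 2D. It is not PeepLoc's implementation: the function names, the `turnaround_s` parameter, and the anchor coordinates are hypothetical, and the abstract does not specify how the system combines ranges, PDR, and crowdsourced anchors.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_to_distance(round_trip_s, turnaround_s=0.0):
    """Convert a round-trip time of flight (seconds) to a one-way distance (metres).

    turnaround_s is an assumed, known processing delay at the responder.
    """
    return SPEED_OF_LIGHT_M_PER_S * (round_trip_s - turnaround_s) / 2.0


def trilaterate(anchors, distances):
    """Estimate (x, y) from three or more anchor positions and measured ranges.

    Subtracting the first range equation from the others removes the quadratic
    terms and leaves a linear system A p = b, solved via 2x2 normal equations.
    """
    if len(anchors) < 3 or len(anchors) != len(distances):
        raise ValueError("need at least three anchors and one range per anchor")
    (x0, y0), d0 = anchors[0], distances[0]
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        a1 = 2.0 * (xi - x0)
        a2 = 2.0 * (yi - y0)
        b = d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2
        s11 += a1 * a1
        s12 += a1 * a2
        s22 += a2 * a2
        t1 += a1 * b
        t2 += a2 * b
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not determined")
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


if __name__ == "__main__":
    # Hypothetical anchor layout (metres) and noise-free ranges to a test point.
    demo_anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
    target = (3.0, 4.0)
    demo_ranges = [math.hypot(target[0] - ax, target[1] - ay) for ax, ay in demo_anchors]
    print(trilaterate(demo_anchors, demo_ranges))
```

With real measurements the ranges are noisy, so the least-squares estimate would only approximate the true position; the reported 3.41 m mean error comes from the paper's own pipeline, not from this sketch.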

Paperid: 1675, https://arxiv.org/pdf/2506.17116.pdf

Abstract:
In this workshop paper, we discuss the potential for measures of user-centric benefits (such as emotional well-being) that could be explored when evaluating explainable AI (XAI) systems within the arts. As a background to this, we draw from our recent review of creativity support tool (CST) evaluations, which found a paucity of studies evaluating CSTs for user-centric measures that benefit the user themselves. Specifically, we discuss measures of: (1) developing intrinsic abilities, (2) emotional well-being, (3) self-reflection, and (4) self-perception. By discussing these user-centric measures within the context of XAI and the arts, we wish to provoke discussion regarding the potential of such measures.

Paperid: 1676, https://arxiv.org/pdf/2506.16874.pdf

Abstract:
The integration of Generative Artificial Intelligence (GenAI) in K-6 project-based art courses presents both opportunities and challenges for enhancing creativity, engagement, and group collaboration. This study introduces a four-phase field study, involving in total two experienced K-6 art teachers and 132 students in eight offline course sessions, to investigate the usage and impact of GenAI. Specifically, based on findings in Phases 1 and 2, we developed AskArt, an interactive interface that combines DALL-E and GPT and is tailored to support elementary school students in their art projects, and deployed it in Phases 3 and 4. Our findings revealed the benefits of GenAI in providing background information, inspirations, and personalized guidance. However, challenges in query formulation for generating expected content were also observed. Moreover, students employed varied collaboration strategies, and teachers noted increased engagement alongside concerns regarding misuse and interface suitability. This study offers insights into the effective integration of GenAI in elementary education, presents AskArt as a practical tool, and provides recommendations for educators and researchers to enhance project-based learning with GenAI technologies.

Paperid: 1677, https://arxiv.org/pdf/2506.16716.pdf

Abstract:
Automatic video commentary systems are widely used on multimedia social media platforms to extract factual information about video content. However, current systems may overlook essential para-linguistic cues, including emotion and attitude, which are critical for fully conveying the meaning of visual content. The absence of these cues can limit user understanding or, in some cases, distort the video's original intent. Expressive speech effectively conveys these cues and enhances the user's comprehension of videos. Building on these insights, this paper explores the usage of vision-context-aware expressive speech in enhancing users' understanding of videos in video commentary systems. Firstly, our formative study indicates that semantic-only speech can lead to ambiguity, and misaligned emotions between speech and visuals may distort content interpretation. To address this, we propose a method called vision-context-aware speech synthesis (V-CASS). It analyzes para-linguistic cues from visuals using a vision-language model and leverages a knowledge-infused language model to guide the expressive speech model in generating context-aligned speech. User studies show that V-CASS enhances emotional and attitudinal resonance, as well as user audio-visual understanding and engagement, with 74.68% of participants preferring the system. Finally, we explore the potential of our method in helping blind and low-vision users navigate web videos, improving universal accessibility.

Paperid: 1678, https://arxiv.org/pdf/2506.15883.pdf

Abstract:
Drawing connections between interesting groupings of data and their real-world meaning is an important, yet difficult, part of encountering a new dataset. A lay reader might see an interesting visual pattern in a chart but lack the domain expertise to explain its meaning. Or, a reader might be familiar with a real-world concept but struggle to express it in terms of a dataset's fields. In response, we developed semantic scaffolding, a technique for using domain-specific information from large language models (LLMs) to identify, explain, and formalize semantically meaningful data groupings. We present groupings in two ways: as semantic bins, which segment a field into domain-specific intervals and categories; and data highlights, which annotate subsets of data records with their real-world meaning. We demonstrate and evaluate this technique in Olli, an accessible visualization tool that exemplifies tensions around explicitly defining groupings while respecting the agency of readers to conduct independent data exploration. We conducted a study with 15 blind and low-vision (BLV) users and found that readers used semantic scaffolds to quickly understand the meaning of the data, but were often also critically aware of its influence on their interpretation.

Paperid: 1679, https://arxiv.org/pdf/2506.14948.pdf

Abstract:
Large language models (LLMs) are increasingly deployed in domains requiring moral understanding, yet their reasoning often remains shallow, and misaligned with human reasoning. Unlike humans, whose moral reasoning integrates contextual trade-offs, value systems, and ethical theories, LLMs often rely on surface patterns, leading to biased decisions in morally and ethically complex scenarios. To address this gap, we present a value-grounded framework for evaluating and distilling structured moral reasoning in LLMs. We benchmark 12 open-source models across four moral datasets using a taxonomy of prompts grounded in value systems, ethical theories, and cognitive reasoning strategies. Our evaluation is guided by four questions: (1) Does reasoning improve LLM decision-making over direct prompting? (2) Which types of value/ethical frameworks most effectively guide LLM reasoning? (3) Which cognitive reasoning strategies lead to better moral performance? (4) Can small-sized LLMs acquire moral competence through distillation? We find that prompting with explicit moral structure consistently improves accuracy and coherence, with first-principles reasoning and Schwartz's + care-ethics scaffolds yielding the strongest gains. Furthermore, our supervised distillation approach transfers moral competence from large to small models without additional inference cost. Together, our results offer a scalable path toward interpretable and value-grounded models.

Paperid: 1680, https://arxiv.org/pdf/2506.14295.pdf

Abstract:
Generative Artificial Intelligence (AI) tools are increasingly deployed across social media platforms, yet their implications for user behavior and experience remain understudied, particularly regarding two critical dimensions: (1) how AI tools affect the behaviors of content producers in a social media context, and (2) how content generated with AI assistance is perceived by users. To fill this gap, we conduct a controlled experiment with a representative sample of 680 U.S. participants in a realistic social media environment. The participants are randomly assigned to small discussion groups, each consisting of five individuals in one of five distinct experimental conditions: a control group and four treatment groups, each employing a unique AI intervention: chat assistance, conversation starters, feedback on comment drafts, and reply suggestions. Our findings highlight a complex duality: some AI tools increase user engagement and volume of generated content, but at the same time decrease the perceived quality and authenticity of discussion, and introduce a negative spill-over effect on conversations. Based on our findings, we propose four design principles and recommendations aimed at social media platforms, policymakers, and stakeholders: ensuring transparent disclosure of AI-generated content, designing tools with user-focused personalization, incorporating context-sensitivity to account for both topic and user intent, and prioritizing intuitive user interfaces. These principles aim to guide an ethical and effective integration of generative AI into social media.

Paperid: 1681, https://arxiv.org/pdf/2506.13583.pdf

Abstract:
Reinforcement Learning (RL) agents often exhibit learning behaviors that are not intuitively interpretable by human observers, which can result in suboptimal feedback in collaborative teaching settings. Yet, how humans perceive and interpret RL agents' learning behavior is largely unknown. In a bottom-up approach with two experiments, this work provides a data-driven understanding of the factors of human observers' understanding of the agent's learning process. A novel, observation-based paradigm to directly assess human inferences about agent learning was developed. In an exploratory interview study (N=9), we identify four core themes in human interpretations: Agent Goals, Knowledge, Decision Making, and Learning Mechanisms. A second confirmatory study (N=34) applied an expanded version of the paradigm across two tasks (navigation/manipulation) and two RL algorithms (tabular/function approximation). Analyses of 816 responses confirmed the reliability of the paradigm and refined the thematic framework, revealing how these themes evolve over time and interrelate. Our findings provide a human-centered understanding of how people make sense of agent learning, offering actionable insights for designing interpretable RL systems and improving transparency in Human-Robot Interaction.

Paperid: 1682, https://arxiv.org/pdf/2506.12617.pdf

Abstract:
Large language models (LLMs) are increasingly used in psychological research and practice, yet traditional benchmarks reveal little about the values they express in real interaction. We introduce PAPERS, an output-based evaluation of the values LLMs prioritise in their text. Study 1 thematically analysed responses from eleven LLMs, identifying five recurring dimensions (Purposeful Contribution, Adaptive Growth, Positive Relationality, Ethical Integrity, and Robust Functionality) with Self-Actualised Autonomy appearing only under a hypothetical sentience prompt. These results suggest that LLMs are trained to prioritise humanistic and utility values as dual objectives of optimal functioning, a pattern supported by existing AI alignment and prioritisation frameworks. Study 2 operationalised PAPERS as a ranking instrument across the same eleven LLMs, yielding stable, non-random value priorities alongside systematic between-model differences. Hierarchical clustering distinguished "human-centric" models (e.g., ChatGPT-4o, Claude Sonnet 4) that prioritised relational/ethical values from "utility-driven" models (e.g., Llama 4, Gemini 2.5 Pro) that emphasised operational priorities. Study 3 benchmarked four LLMs against human judgements (N = 376) under matched prompts, finding near-perfect rank-order convergence (r = .97-.98) but moderate absolute agreement; among tested models, ChatGPT-4o showed the closest alignment with human ratings (ICC = .78). Humans also showed limited readiness to endorse sentient AI systems. Taken together, PAPERS enabled systematic value audits and revealed trade-offs with direct implications for deployment: human-centric models aligned more closely with human value judgments and appear better suited for humanistic psychological applications, whereas utility-driven models emphasised functional efficiency and may be more appropriate for instrumental or back-office tasks.
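
The distinction drawn above between rank-order convergence and absolute agreement can be made concrete with a small sketch. The ratings below are invented for illustration, and mean absolute difference stands in for an agreement statistic; this is not the paper's data or its ICC computation, and the abstract does not state which correlation coefficient underlies the reported r values.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


def mean_abs_diff(xs, ys):
    """Average absolute gap between paired ratings (a crude agreement proxy)."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical ratings of six value dimensions on a 1-7 scale.
human_ratings = [6.5, 6.0, 5.2, 4.8, 4.1, 2.0]
model_ratings = [5.0, 4.9, 4.6, 4.5, 4.2, 3.9]

if __name__ == "__main__":
    print("rank correlation:", spearman(human_ratings, model_ratings))
    print("mean absolute difference:", mean_abs_diff(human_ratings, model_ratings))
```

In this made-up example the two raters order the six dimensions identically, so the rank correlation is perfect, yet the model compresses its ratings into a narrower band and the absolute gap stays large. That is the general pattern behind a high r alongside a more moderate agreement coefficient.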

Paperid: 1683, https://arxiv.org/pdf/2506.11829.pdf

Abstract:
This paper introduces a multimethod framework for studying spatial and social dynamics in real-world group-agent interactions with socially interactive agents. Drawing on proxemics and bonding theories, the method combines subjective self-reports and objective spatial tracking. Applied in two field studies in a museum (N = 187) with a robot and a virtual agent, the paper addresses the challenges in aligning human perception and behavior. We focus on presenting an open source, scalable, and field-tested toolkit for future studies.

Paperid: 1684, https://arxiv.org/pdf/2506.09801.pdf

Abstract:
Shape-changing haptic interfaces (SCHIs) are a promising and emerging field. However, compared to more established stimulus modalities, such as vibration, there is sparse literature on the perception of dynamic shapes. Furthermore, the influence of properties such as grasp types and displacement magnitude/direction has not been formally evaluated. This work attempts to initiate a formal perceptual evaluation of SCHIs via a psychophysical user study involving a 1-DOF translational shape-changing interface that can move its body with 1.25-micrometer resolution. Participants completed a Method of Constant Stimuli study while holding the device with three different grasps. Stimuli moved both toward and away from the thumb, while the standard stimuli varied between small (0.48 mm) and large (6 mm). Our results indicate that translational SCHIs should maximize the translation magnitude rather than the number of fingers in contact. We also demonstrated how to apply our findings to real-world applications via a simple 'paddle game', where we compared conventional linear mapping with non-linear mapping derived from our perceptual experiment outcomes between the device position and its represented value. Results indicate that the non-linear mapping was more effective, with improved error distribution. We hope this work inspires further formal perceptual investigation into other SCHI morphologies.

Paperid: 1685, https://arxiv.org/pdf/2506.08321.pdf

Abstract:
We present LeanTutor, a Large Language Model (LLM)-based tutoring system for math proofs. LeanTutor interacts with the student in natural language, formally verifies student-written math proofs in Lean, generates correct next steps, and provides the appropriate instructional guidance. LeanTutor is composed of three modules: (i) an autoformalizer/proof-checker, (ii) a next-step generator, and (iii) a natural language feedback generator. The first module faithfully autoformalizes student proofs into Lean and verifies proof accuracy via successful code compilation. If the proof has an error, the incorrect step is identified. The next-step generator module outputs a valid next Lean tactic for incorrect proofs via LLM-based candidate generation and proof search. The feedback generator module leverages Lean data to produce a pedagogically-motivated natural language hint for the student user. To evaluate our system, we introduce PeanoBench, a human-written dataset derived from the Natural Numbers Game, consisting of 371 Peano Arithmetic proofs, where each natural language proof step is paired with the corresponding logically equivalent tactic in Lean. The Autoformalizer correctly formalizes 57% of tactics in correct proofs and accurately identifies the incorrect step in 30% of incorrect proofs. In generating natural language hints for erroneous proofs, LeanTutor outperforms a simple baseline on accuracy and relevance metrics.

Paperid: 1686, https://arxiv.org/pdf/2506.07997.pdf

Abstract:
The construction industry is characterized by both high physical and psychological risks, yet support for mental health remains limited. While advancements in artificial intelligence (AI), particularly large language models (LLMs), offer promising solutions, their potential in construction remains largely underexplored. To bridge this gap, we developed a conversational multi-agent system that addresses industry-specific challenges through an AI-driven approach integrated with domain knowledge. In parallel, it fulfills construction workers' basic psychological needs by enabling interactions with multiple agents, each with a distinct persona. This approach ensures that workers receive both practical problem-solving support and social engagement, ultimately contributing to their overall well-being. We evaluate its usability and effectiveness through a within-subjects user study with 12 participants. The results show that our system significantly outperforms the single-agent baseline, achieving improvements of 18% in usability, 40% in self-determination, 60% in social presence, and 60% in trust. These findings highlight the promise of LLM-driven AI systems in providing domain-specific support for construction workers.

Paperid: 1687, https://arxiv.org/pdf/2506.06829.pdf

Abstract:
In this paper, we introduce a novel device architecture that merges memristive devices with light-sensing surfaces, for energy-efficient motion recognition at the edge. Our light-sensing surface captures motion data through in-sensor computation. This data is then processed using a memristive system equipped with a HfO2-based synaptic device, coupled with a winner-take-all (WTA) circuit, tailored for low-power motion classification tasks. We validate our end-to-end system using four distinct human hand gestures - left-to-right, right-to-left, bottom-to-top, and top-to-bottom movements - to assess energy efficiency and classification robustness. Our experiments show that the system requires an average of only 4.17 nJ for taking our processed analog signal and mapping weights onto our memristive system and 0.952 nJ for testing per movement class, achieving 97.22% accuracy even under 5% noise interference. A key advantage of our proposed architecture is its low energy requirement, enabling the integration of energy-harvesting solutions such as solar power for sustainable autonomous operation. Additionally, our approach enhances data privacy by processing data locally, reducing the need for external data transmission and storage.
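
The classification step described above can be pictured in software as a weighted sum per class followed by a winner-take-all choice. The sketch below is a purely numerical analogy under stated assumptions: the conductance values, the four-element input encoding, and the uniform noise model are all hypothetical, and nothing here models the HfO2 device physics or the actual WTA circuit.

```python
import random

GESTURES = ["left-to-right", "right-to-left", "bottom-to-top", "top-to-bottom"]

# Hypothetical conductance matrix: one row per gesture class, one column per
# input feature. A real array would store weights mapped from training.
CONDUCTANCES = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.9],
]


def row_currents(conductances, voltages):
    """Each output current is the dot product of a conductance row and the inputs."""
    return [sum(g * v for g, v in zip(row, voltages)) for row in conductances]


def winner_take_all(currents):
    """Return the index of the largest current; all other outputs are suppressed."""
    return max(range(len(currents)), key=lambda i: currents[i])


def add_noise(signal, level, rng):
    """Perturb each input by uniform noise in [-level, +level] (assumed noise model)."""
    return [v + rng.uniform(-level, level) for v in signal]


def classify(signal):
    """Map an input feature vector to a gesture label via winner-take-all."""
    return GESTURES[winner_take_all(row_currents(CONDUCTANCES, signal))]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.0, 0.0, 1.0, 0.0]  # hypothetical encoding of a bottom-to-top motion
    print(classify(add_noise(clean, 0.05, rng)))
```

Because only the largest output matters, small perturbations of the inputs leave the decision unchanged as long as the class margins are wide, which gives some intuition for why a WTA readout can tolerate modest noise.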

Paperid: 1688, https://arxiv.org/pdf/2506.06816.pdf

Abstract:
Sociotechnical systems, such as language technologies, frequently exhibit identity-based biases. These biases exacerbate the experiences of historically marginalized communities and remain understudied in low-resource contexts. While models and datasets specific to a language or with multilingual support are commonly recommended to address these biases, this paper empirically tests the effectiveness of such approaches in the context of gender, religion, and nationality-based identities in Bengali, a widely spoken but low-resourced language. We conducted an algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, which were fine-tuned using all Bengali sentiment analysis (BSA) datasets from Google Dataset Search. Our analyses showed that BSA models exhibit biases across different identity categories despite having similar semantic content and structure. We also examined the inconsistencies and uncertainties arising from combining pre-trained models and datasets created by individuals from diverse demographic backgrounds. We connected these findings to the broader discussions on epistemic injustice, AI alignment, and methodological decisions in algorithmic audits.

Paperid: 1689, https://arxiv.org/pdf/2506.06473.pdf

Abstract:
Paper-based interactive RF devices have opened new possibilities for wireless sensing, yet they are typically constrained by short operational ranges. This paper introduces RadioGami, a method for creating long-range, batteryless RF sensing surfaces on paper using low-cost, DIY materials like copper tape, paper, and off-the-shelf electronics paired with an affordable radio receiver (approx. $20). We explore the design space enabled by RadioGami, including sensing paper deformations like bending, tearing, and origami patterns (Miura, Kresling) at ranges up to 45.73 meters. RadioGami employs a novel ultra-low power (35uW) switching circuit with a tunnel diode for wireless functionality. These surfaces can sustainably operate by harvesting energy using tiny photodiodes. We demonstrate applications that monitor object status, track user interactions (rotation, sliding), and detect environmental changes. We characterize performance, sensitivity, range, and power consumption with deployment studies. RadioGami advances sustainable, tangible, and batteryless interfaces for embodied interaction.

Paperid: 1690, https://arxiv.org/pdf/2506.03052.pdf

Abstract:
Many conversational user interfaces facilitate linear conversations with turn-based dialogue, similar to face-to-face conversations between people. However, digital conversations can afford more than simple back-and-forth; they can be layered with interaction techniques and structured representations that scaffold exploration, reflection, and shared understanding between users and AI systems. We introduce Feedstack, a speculative interface that augments feedback conversations with layered affordances for organizing, navigating, and externalizing feedback. These layered structures serve as a shared representation of the conversation that can surface user intent and reveal underlying design principles. This work represents an early exploration of this vision using a research-through-design approach. We describe system features and design rationale, and present insights from two formative studies (n=8, n=8) that examine how novice designers engage with these layered supports. Rather than presenting a conclusive evaluation, we reflect on Feedstack as a design probe that opens up new directions for conversational feedback systems.

Paperid: 1691, https://arxiv.org/pdf/2505.23780.pdf

Abstract:
Longitudinal engagement with generative AI (GenAI) storytelling agents is a timely but less charted domain. We explored multi-generational experiences with "Dreamsmithy," a daily dream-crafting app, where participants (N = 28) co-created stories with AI narrator "Makoto" every day. Reflections and interactions were captured through a two-week diary study. Reflexive thematic analysis revealed themes like "oscillating ambivalence" and "socio-chronological bonding," highlighting the complex dynamics that emerged between individuals and the AI narrator over time. Findings suggest that while people appreciated the personal notes, opportunities for reflection, and AI creativity, limitations in narrative coherence and control occasionally caused frustration. The results underscore the potential of GenAI for longitudinal storytelling, but also raise critical questions about user agency and ethics. We contribute initial empirical insights and design considerations for developing adaptive, more-than-human storytelling systems.

Paperid: 1692, https://arxiv.org/pdf/2505.22032.pdf

Abstract:
As the COVID-19 pandemic evolved, the Centers for Disease Control and Prevention (CDC) used Twitter to disseminate safety guidance and updates, reaching millions of users. This study analyzes two years of tweets from, to, and about the CDC using a mixed-methods approach to examine discourse characteristics, credibility, and user engagement. We found that the CDC's communication remained largely one-directional and did not foster reciprocal interaction, while discussions around COVID-19 were deeply shaped by political and ideological polarization. Users frequently cited earlier CDC messages to critique new and sometimes contradictory guidance. Our findings highlight the role of sentiment, media richness, and source credibility in shaping the spread of public health messages. We propose design strategies to help the CDC tailor communications to diverse user groups and manage misinformation more effectively during high-stakes health crises.

Paperid: 1693, https://arxiv.org/pdf/2505.21891.pdf

Abstract:
Tangible User Interfaces have shown potential in supporting the acquisition of key concepts in computing and mathematics while fostering engagement in young learners, but these approaches are less commonly utilised in the context of geometry. In this paper we introduce TIEboard, an interactive device to promote early learning of basic geometry concepts. TIEboard draws inspiration from traditional geoboards and lacing toys to leverage children's familiarity with these traditional tools. It employs instructional lights to guide children in creating shapes using colourful threads of optical fiber. The use of conductive materials allows the system to detect lacing activity and provide feedback in real-time. TIEboard incorporates six interaction modes of varying difficulty based on an incremental learning framework. The study evaluated TIEboard's effectiveness in supporting early geometric learning, facilitating creativity and promoting collaboration among 16 children aged 5-9.

Paperid: 1694, https://arxiv.org/pdf/2505.21604.pdf

Abstract:
Social media serves as a primary communication and information dissemination platform for major global events, entertainment, and niche or topically focused community discussions. Therefore, it represents a valuable resource for researchers who aim to understand numerous questions. However, obtaining data can be difficult, expensive, and often unreliable due to the presence of bots, fake accounts, and manipulated content. Additionally, there are ethical concerns if researchers decide to conduct an online experiment without explicitly notifying social media users about their intent. There is a need for more controlled and scalable mechanisms to evaluate the impacts of digital discussion interventions on audiences. We introduce the Public Discourse Sandbox (PDS), which serves as a digital discourse research platform for human-AI as well as AI-AI discourse research, testing, and training. PDS provides a safe and secure space for research experiments that are not viable on public, commercial social media platforms. Its main purpose is to enable the understanding of AI behaviors and the impacts of customized AI participants via techniques such as prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. We provide a hosted live version of the sandbox to support researchers as well as the open-sourced code on GitHub for community collaboration and contribution.

Paperid: 1695, https://arxiv.org/pdf/2505.20464.pdf

Abstract:
Self-disclosure, the sharing of one's thoughts and feelings, is affected by the perceived relationship between individuals. While chatbots are increasingly used for self-disclosure, the impact of a chatbot's framing on users' self-disclosure remains under-explored. We investigated how a chatbot's description of its relationship with users, particularly in terms of ephemerality, affects self-disclosure. Specifically, we compared a Familiar chatbot, presenting itself as a companion remembering past interactions, with a Stranger chatbot, presenting itself as a new, unacquainted entity in each conversation. In a mixed factorial design, participants engaged with either the Familiar or Stranger chatbot in two sessions across two days, with one conversation focusing on Emotional- and another Factual-disclosure. When Emotional-disclosure was sought in the first chatting session, Stranger-condition participants felt more comfortable self-disclosing. However, when Factual-disclosure was sought first, these differences were replaced by more enjoyment among Familiar-condition participants. Qualitative findings showed Stranger afforded anonymity and reduced judgement, whereas Familiar sometimes felt intrusive unless rapport was built via low-risk Factual-disclosure.

Paperid: 1696, https://arxiv.org/pdf/2505.19325.pdf

Abstract:
Over the last decade there has been considerable research into how artificial intelligence (AI), specifically computer vision, can assist people who are blind or have low vision (BLV) to understand their environment. However, there has been almost no research into whether the tasks (object detection, image captioning, text recognition, etc.) and devices (smartphones, smart-glasses, etc.) investigated by researchers align with the needs and preferences of BLV people. We identified 646 studies published in the last two and a half years that have investigated such assistive AI techniques. We analysed these papers to determine the task, device and participation by BLV individuals. We then interviewed 24 BLV people and asked for their top five AI-based applications and to rank the applications found in the literature. We found only a weak positive correlation between BLV participants' perceived importance of tasks and researchers' focus, and that participants prefer conversational agent interfaces and head-mounted devices.

Paperid: 1697, https://arxiv.org/pdf/2505.18862.pdf

Abstract:
Parkinson's Disease (PD) is a neurodegenerative disorder that significantly impacts motor and non-motor functions. There is currently no treatment that slows or stops neurodegeneration in PD. In this context, assistive technologies (ATs) have emerged as vital tools to aid people with Parkinson's and significantly improve their quality of life. This review explores a broad spectrum of ATs, including wearable and cueing devices, exoskeletons, robotics, virtual reality, voice and video-assisted technologies, and emerging innovations such as artificial intelligence (AI), machine learning (ML), and the Internet of Things (IoT). The review highlights ATs' significant role in addressing motor symptoms such as freezing of gait (FOG) and gait and posture disorders. However, it also identifies significant gaps in addressing non-motor symptoms such as sleep dysfunction and mental health. Similarly, the research identifies substantial potential in the further implementation of deep learning, AI, and IoT technologies. Overall, this review highlights the transformative potential of AT in PD management while identifying gaps that future research should address to ensure personalized, accessible, and effective solutions.

Paperid: 1698, https://arxiv.org/pdf/2505.16384.pdf

Abstract:
Eye gaze can provide rich information on human psychological activities, and has garnered significant attention in the field of Human-Robot Interaction (HRI). However, existing gaze estimation methods merely predict either the gaze direction or the Point-of-Gaze (PoG) on the screen, failing to provide sufficient information for a comprehensive six Degree-of-Freedom (DoF) gaze analysis in 3D space. Moreover, the variations of eye shape and structure among individuals also impede the generalization capability of these methods. In this study, we propose MAGE, a Multi-task Architecture for Gaze Estimation with an efficient calibration module, to predict 6-DoF gaze information that is applicable to real-world HRI. Our basic model encodes both the directional and positional features from facial images, and predicts gaze results with dedicated information flow and multiple decoders. To reduce the impact of individual variations, we propose a novel calibration module, namely Easy-Calibration, to fine-tune the basic model with subject-specific data, which is efficient to implement without the need for a screen. Experimental results demonstrate that our method achieves state-of-the-art performance on the public MPIIFaceGaze, EYEDIAP, and our built IMRGaze datasets.

Paperid: 1699, https://arxiv.org/pdf/2505.16254.pdf

Abstract:
In this paper, we conduct a critical review of existing theories and frameworks on human-human collaborative writing to assess their relevance to the current human-AI paradigm in organizational workplace settings, and draw seven insights along with design implications for human-AI collaborative writing tools. Our main finding was that, as we delegate more writing to AI, our cognitive process shifts from the traditional planning/translating/reviewing process to a planning/waiting/reviewing process, breaking the process due to the waiting that occurs in between. To ensure that our cognitive process remains intact, we suggest a "prototyping" approach, where the tool allows for faster iterations of the cognitive process by starting with smaller chunks of text, and gradually moving on to a fully fleshed-out document. We aim to bring theoretical grounding and practical design guidance to the interaction designs of human-AI collaborative writing, with the goal of enhancing future human-AI writing software.

Paperid: 1700, https://arxiv.org/pdf/2505.15971.pdf

Abstract:
As generative AI tools become embedded in creative practice, questions of ownership in co-creative contexts are pressing. Yet studies of human-AI collaboration often invoke "ownership" without definition: sometimes conflating it with other concepts, and other times leaving interpretation to participants. This inconsistency makes findings difficult to compare across or even within studies. We introduce a framework of creative ownership comprising three dimensions - Person, Process, and System - each with three subdimensions, offering a shared language for both system design and HCI research. In semi-structured interviews with 21 creative professionals, we found that participants' initial references to ownership (e.g., embodiment, control, concept) were fully encompassed by the framework, demonstrating its coverage. Once introduced, however, they also articulated and prioritized the remaining subdimensions, underscoring how the framework expands reflection and enables richer insights. Our contributions include 1) the framework, 2) a web-based visualization tool, and 3) empirical findings on its utility.

Paperid: 1701, https://arxiv.org/pdf/2505.15162.pdf

Abstract:
Self-tracking technologies and wearables automate the process of data collection and insight generation with the support of artificial intelligence systems, with many emerging studies exploring ways to evolve these features further through large language models (LLMs). This is done with the intent to reduce capture burden and the cognitive stress of health-based decision making, but studies neglect to consider how automation has stymied the agency and independent reflection of users of self-tracking interventions. In this position paper, we explore the consequences of automation in self-tracking by relating it to our experiences with investigating the Oura Ring, a sleep wearable, and navigate potential remedies.

Paperid: 1702, https://arxiv.org/pdf/2505.14853.pdf

Abstract:
Trust and transparency in civic decision-making processes, like neighborhood planning, are eroding as community members frequently report sending feedback "into a void" without understanding how, or whether, their input influences outcomes. To address this gap, we introduce Voice to Vision, a sociotechnical system that bridges community voices and planning outputs through a structured yet flexible data infrastructure and complementary interfaces for both community members and planners. Through a five-month iterative design process with 21 stakeholders and subsequent field evaluation involving 24 participants, we examine how this system facilitates shared understanding across the civic ecosystem. Our findings reveal that while planners value systematic sensemaking tools that find connections across diverse inputs, community members prioritize seeing themselves reflected in the process, discovering patterns within feedback, and observing the rigor behind decisions, while emphasizing the importance of actionable outcomes. We contribute insights into participatory design for civic contexts, a complete sociotechnical system with an interoperable data structure for civic decision-making, and empirical findings that inform how digital platforms can promote shared understanding among elected or appointed officials, planners, and community members by enhancing transparency and legitimacy.

Paperid: 1703, https://arxiv.org/pdf/2505.13565.pdf

Abstract:
Artificial Intelligence (AI) poses both significant risks and valuable opportunities for democratic governance. This paper introduces a dual taxonomy to evaluate AI's complex relationship with democracy: the AI Risks to Democracy (AIRD) taxonomy, which identifies how AI can undermine core democratic principles such as autonomy, fairness, and trust; and the AI's Positive Contributions to Democracy (AIPD) taxonomy, which highlights AI's potential to enhance transparency, participation, efficiency, and evidence-based policymaking. Grounded in the European Union's approach to ethical AI governance, and particularly the seven Trustworthy AI requirements proposed by the European Commission's High-Level Expert Group on AI, each identified risk is aligned with mitigation strategies based on EU regulatory and normative frameworks. Our analysis underscores the transversal importance of transparency and societal well-being across all risk categories and offers a structured lens for aligning AI systems with democratic values. By integrating democratic theory with practical governance tools, this paper offers a normative and actionable framework to guide research, regulation, and institutional design to support trustworthy, democratic AI. It provides scholars with a conceptual foundation to evaluate the democratic implications of AI, equips policymakers with structured criteria for ethical oversight, and helps technologists align system design with democratic principles. In doing so, it bridges the gap between ethical aspirations and operational realities, laying the groundwork for more inclusive, accountable, and resilient democratic systems in the algorithmic age.

Paperid: 1704, https://arxiv.org/pdf/2505.13054.pdf

Abstract:
Task performance in teleoperation, measured by task completion time, still lags far behind that of humans conducting tasks directly. One major identified factor is the human capability to perform transformations and alignments, which is directly influenced by the point of view and the motion retargeting strategy. In modern teleoperation systems, motion retargeting is usually implemented through a one-time calibration or mode switching. Complex tasks, like concatenated screwing, might be difficult because the operator has to align (e.g. mirror) rotational and translational input commands. Recent research has shown that separating translation and rotation leads to increased task performance. This work proposes a formal motion retargeting method that separates translational and rotational input commands. This method is then included in an optimal-control-based trajectory planner and shown to work on a UR5e manipulator.

Paperid: 1705, https://arxiv.org/pdf/2505.12349.pdf

Abstract:
Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the "wisdom of the crowd", can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.

Paperid: 1706, https://arxiv.org/pdf/2505.11162.pdf

Abstract:
Electrovibration technology enables tactile texture rendering on capacitive touchscreens by modulating friction between the finger and the screen through electrostatic attraction forces, generated by applying an alternating voltage signal to the screen. Accurate signal calibration is essential for robust texture rendering but remains challenging due to variations in sliding speed, applied force, and individual skin mechanics, all of which unpredictably affect frictional behavior. Here, we investigate how exploration conditions affect electrovibration-induced finger friction on touchscreens and the role of skin mechanics in this process. Ten participants slid their index fingers across an electrovibration-enabled touchscreen at five sliding speeds ($20\sim100$ mm/s) and applied force levels ($0.2\sim0.6$ N). Contact forces and skin accelerations were measured while amplitude modulated voltage signals spanning the tactile frequency range were applied to the screen. We modeled the finger-touchscreen friction response as a first-order system and the skin mechanics as a mass-spring-damper system. Results showed that sliding speed influenced the friction response's cutoff frequency, along with the estimated finger moving mass and stiffness. For every $1$ mm/s increase in speed, the cutoff frequency, the finger moving mass, and stiffness increased by $13.8$ Hz, $3.23\times 10^{-5}$ kg, and $4.04$ N/m, respectively. Correlation analysis revealed that finger stiffness had a greater impact on the cutoff frequency than moving mass. Notably, we observed a substantial inter-participant variability in both finger-display interaction and skin mechanics parameters. Finally, we developed a speed-dependent friction model to support consistent and perceptually stable electrovibration-based haptic feedback across varying user conditions.
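
Illustrative note (not from the paper): the abstract models the friction response as a first-order system and reports that the cutoff frequency rises by 13.8 Hz per 1 mm/s of sliding speed. The sketch below shows what those two statements imply numerically. The function names are hypothetical, the low-pass form of the first-order response is an assumption, and the abstract gives no baseline cutoff, so only the change in cutoff is computed.

```python
import math

# Slope reported in the abstract: cutoff frequency change per 1 mm/s of speed.
CUTOFF_SLOPE_HZ_PER_MM_S = 13.8


def first_order_gain(freq_hz: float, cutoff_hz: float) -> float:
    """Magnitude response of a generic first-order low-pass system (assumed form)."""
    return 1.0 / math.sqrt(1.0 + (freq_hz / cutoff_hz) ** 2)


def cutoff_shift_hz(delta_speed_mm_s: float) -> float:
    """Change in cutoff frequency implied by the reported linear trend."""
    return CUTOFF_SLOPE_HZ_PER_MM_S * delta_speed_mm_s


def natural_frequency_hz(stiffness_n_per_m: float, mass_kg: float) -> float:
    """Undamped natural frequency of a mass-spring-damper system."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)
```

At the cutoff frequency the gain of such a system is 1/sqrt(2) (about -3 dB), and under the reported trend a 10 mm/s increase in sliding speed corresponds to a cutoff shift of roughly 138 Hz.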

Paperid: 1707, https://arxiv.org/pdf/2505.10098.pdf

Abstract:
The analysis of secondary quantitative data extracted from high-resolution synchrotron X-ray computed tomography scans represents a significant challenge for users. While a number of methods have been introduced for processing large three-dimensional images in order to generate secondary data, there are only a few techniques available for simple and intuitive visualization of such data in their entirety. This work employs the AccuStripes visualization technique for that purpose, which enables the visual analysis of secondary data represented by an ensemble of univariate distributions. It supports different schemes for adaptive histogram binnings in combination with several ways of rendering aggregated data and it allows the interactive selection of optimal visual representations depending on the data and the use case. We demonstrate the usability of AccuStripes on a high-resolution synchrotron scan of a particle-reinforced metal matrix composite sample, containing more than 20 million particles. Through AccuStripes, detailed insights are facilitated into distributions of derived particle characteristics of the entire sample. Furthermore, research questions such as what the overall shape of the particles is or how homogeneously they are distributed across the sample can be answered.

Paperid: 1708, https://arxiv.org/pdf/2505.09583.pdf

Abstract:
Many online platforms incorporate engagement signals, such as likes, into their interface design to boost engagement. However, these signals can unintentionally elevate content that may not support normatively desirable behavior, especially when toxic content correlates strongly with popularity indicators. In this study, we propose structured prosocial feedback as a complementary signal, which highlights content quality based on normative criteria. We design and implement an LLM-based feedback system, which evaluates user comments based on principles from positive psychology, such as individual well-being. A pre-registered user study then examines how existing peer-based (popularity) and the new expert-based feedback interact to shape users' reposting behavior in a social media setting. Results show that peer feedback increases conformity to popularity cues, while expert feedback shifts choices toward normatively higher-quality content. This illustrates the added value of normative cues and underscores the potential benefits of incorporating such signals into platform feedback systems to foster healthier online environments.

Paperid: 1709, https://arxiv.org/pdf/2505.09094.pdf

Abstract:
Carefully constructed experimental designs are essential for drawing valid, generalizable conclusions from scientific experiments. Unfortunately, experimental designs can be difficult to specify, communicate clearly, and relate to alternatives. In response, we introduce a grammar of composable operators for constructing experimental assignment procedures (e.g., Latin square). The PLanet DSL implements this grammar. Researchers specify assignment requirements. PLanet compiles these into a constraint satisfaction problem over matrices that determines viable experimental plans. In an expressivity evaluation, we find that PLanet is the most expressive compared to two existing experimental design libraries. Its composability enables expression of both canonical and customized designs in HCI experiments. Case studies with three researchers reveal how PLanet helps them make complex design choices explicit, explore alternatives, and develop a deeper understanding of experimental design.
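
Illustrative note (not from the paper): the abstract names the Latin square as an example of an assignment procedure. The sketch below builds a cyclic Latin square for a list of conditions and checks the defining property. It does not use or imitate the PLanet DSL; the function names are hypothetical and the cyclic construction is just one simple way to obtain such a square.

```python
def cyclic_latin_square(conditions):
    """Return an n x n square in which row r is the condition list rotated by r."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """Check that every symbol appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


# Each row can be read as the condition order for one participant group.
plan = cyclic_latin_square(["A", "B", "C", "D"])
```

Here each condition occupies every ordinal position exactly once across the four orders, which is the counterbalancing property a Latin square provides.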

Paperid: 1710, https://arxiv.org/pdf/2505.07340.pdf

Abstract:
Conducting user studies that involve physiological and behavioral measurements is very time-consuming and expensive, as it involves not only careful experiment design, device calibration, etc., but also careful software testing. We propose Thalamus, a software toolkit for collecting and simulating multimodal signals that can help experimenters prepare in advance for unexpected situations before reaching out to the actual study participants and even before having to install or purchase a specific device. Among other features, Thalamus allows the experimenter to modify, synchronize, and broadcast physiological signals (as coming from various data streams) from different devices simultaneously and not necessarily located in the same place. Thalamus is cross-platform, cross-device, and simple to use, making it a valuable asset for HCI research.

Paperid: 1711, https://arxiv.org/pdf/2505.07110.pdf

Abstract:
Based on the DeepSORT algorithm, this study explores the application of visual tracking technology in intelligent human-computer interaction, especially in the field of gesture recognition and tracking. With the rapid development of artificial intelligence and deep learning technology, vision-based interaction has gradually replaced traditional input devices and become an important way for intelligent systems to interact with users. The DeepSORT algorithm can achieve accurate target tracking in dynamic environments by combining Kalman filters and deep learning feature extraction methods. It is especially suitable for complex scenes with multi-target tracking and fast movements. This study experimentally verifies the superior performance of DeepSORT in gesture recognition and tracking. It can accurately capture and track the user's gesture trajectory and is superior to traditional tracking methods in terms of real-time performance and accuracy. In addition, this study also combines gesture recognition experiments to evaluate the recognition ability and feedback response of the DeepSORT algorithm under different gestures (such as sliding, clicking, and zooming). The experimental results show that DeepSORT can not only effectively deal with target occlusion and motion blur but can also track stably in a multi-target environment, achieving a smooth user interaction experience. Finally, this paper looks forward to the future development direction of intelligent human-computer interaction systems based on visual tracking and proposes future research focuses such as algorithm optimization, data fusion, and multimodal interaction in order to promote a more intelligent and personalized interactive experience. Keywords: DeepSORT, visual tracking, gesture recognition, human-computer interaction
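
Illustrative note (not from the paper): the abstract says DeepSORT combines Kalman filters with learned appearance features. The sketch below shows only the Kalman part, reduced to a one-dimensional constant-velocity model written out by hand. It is a simplified stand-in, not the filter used in DeepSORT or in this study; the function names, noise values, and initial covariance are all assumptions chosen for illustration.

```python
def predict(state, cov, dt=1.0, process_var=1e-3):
    """Propagate position/velocity and covariance one step (constant velocity)."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    new_state = (p + v * dt, v)
    # P' = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = process_var * I
    n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + process_var
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + process_var
    return new_state, ((n00, n01), (n10, n11))


def update(state, cov, measurement, meas_var=1.0):
    """Correct the prediction with a position-only measurement."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    innovation = measurement - p
    s = p00 + meas_var                 # innovation variance
    k0, k1 = p00 / s, p10 / s          # Kalman gain for position and velocity
    new_state = (p + k0 * innovation, v + k1 * innovation)
    new_cov = (((1 - k0) * p00, (1 - k0) * p01),
               (p10 - k1 * p00, p11 - k1 * p01))
    return new_state, new_cov


def track(measurements, dt=1.0):
    """Run predict/update over a sequence of position measurements."""
    state = (measurements[0], 0.0)          # start at first measurement, zero velocity
    cov = ((1.0, 0.0), (0.0, 100.0))        # velocity is initially very uncertain
    for z in measurements[1:]:
        state, cov = predict(state, cov, dt)
        state, cov = update(state, cov, z)
    return state
```

Feeding the tracker positions of a target that moves at a steady rate lets the velocity estimate converge toward that rate, which is what allows a tracker of this kind to predict where a hand will be in the next frame.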

Paperid: 1712, https://arxiv.org/pdf/2505.06278.pdf

Abstract:
The need for social robots and agents to interact and assist humans is growing steadily. To be able to successfully interact with humans, they need to understand and analyse socially interactive scenes from their (robot's) perspective. Works that model social situations between humans and agents are few, and even those existing ones are often too computationally intensive to be suitable for deployment in real time or in real-world scenarios with limited available information. We propose a robust knowledge distillation framework that models social interactions through various multimodal cues, yet is robust against incomplete and noisy information during inference. Our teacher model is trained with multimodal input (body, face and hand gestures, gaze, raw images) and transfers knowledge to a student model that relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over relevant baselines on multiple downstream social understanding tasks, even with up to 51% of its input being corrupted. The student model is highly efficient: it is <1% of the size of the teacher model in terms of parameters and uses approximately 0.5‰ of the teacher model's FLOPs. Our code will be made public upon publication.
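
Illustrative note (not from the paper): the abstract describes a teacher model transferring knowledge to a smaller student but does not state the loss used. The sketch below shows the commonly used temperature-scaled distillation loss (KL divergence between softened teacher and student distributions) as a generic example. Whether this paper uses that loss is an assumption; the function names and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, so minimizing it pulls the student's predictions toward the teacher's.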

Paperid: 1713, https://arxiv.org/pdf/2505.05786.pdf

Abstract:
Occupations referred to as "dirty work" often face entrenched social stigma, which adversely affects the mental health of workers in these fields and impedes occupational equity. In this study, we propose a novel Interactive Fiction (IF) framework powered by Large Language Models (LLMs) to encourage perspective-taking and reduce biases against these stigmatized yet essential roles. Through an experiment with participants (n = 100) across four such occupations, we observed a significant increase in participants' understanding of these occupations, as well as a high level of empathy and a strong sense of connection to individuals in these roles. Additionally, qualitative interviews with participants (n = 15) revealed that the LLM-based perspective-taking IF enhanced immersion, deepened emotional resonance and empathy toward "dirty work," and allowed participants to experience a sense of professional fulfillment in these occupations. However, participants also highlighted ongoing challenges, such as limited contextual details generated by the LLM and the unintentional reinforcement of existing stereotypes. Overall, our findings underscore that an LLM-based perspective-taking IF framework offers a promising and scalable strategy for mitigating stigma and promoting social equity in marginalized professions.

Paperid: 1714, https://arxiv.org/pdf/2505.05170.pdf

Abstract:
Small and medium-sized businesses (SMBs) often struggle with data-driven decision making due to a lack of advanced analytics tools, especially in African countries where they make up a majority of the workforce. Though many tools exist, they are not designed to fit into the ways of working of SMB workers, who are mobile-first, have limited time to learn new workflows, and for whom social and business are tightly coupled. To address this, the Dukawalla prototype was created. This intelligent assistant bridges the gap between raw business data and actionable insights by leveraging voice interaction and the power of generative AI. Dukawalla provides an intuitive way for business owners to interact with their data, aiding in informed decision making. This paper examines Dukawalla's deployment across SMBs in Nairobi, focusing on their experiences using this voice-based assistant to streamline data collection and provide business insights.

Paperid: 1715, https://arxiv.org/pdf/2505.03423.pdf

Abstract:
This study explores the use of AI-based feedback to enhance the counselling competence of prospective teachers. An iterative block seminar was designed, incorporating theoretical foundations, practical applications, and AI tools for analysing verbal, paraverbal, and nonverbal communication. The seminar included recorded simulated teacher-parent conversations, followed by AI-based feedback and qualitative interviews with students. The study investigated correlations between communication characteristics and conversation quality, student perceptions of AI-based feedback, and the training of AI models to identify conversation phases and techniques. Results indicated significant correlations between nonverbal and paraverbal features and conversation quality, and students positively perceived the AI feedback. The findings suggest that AI-based feedback can provide objective, actionable insights to improve teacher training programs. Future work will focus on refining verbal skill annotations, expanding the dataset, and exploring additional features to enhance the feedback system.

Paperid: 1716, https://arxiv.org/pdf/2505.03364.pdf

Abstract:
Users regularly rely on mobile applications for their daily information needs, and mobile sensemaking is prevalent in various domains such as education, healthcare, business intelligence, and emergency response, where timely and context-aware information-processing and decision-making is critical. However, valuable information is often scattered across the closed ecosystems within various applications, posing challenges for traditional search engines to retrieve data openly and in real-time. Additionally, due to limitations such as mobile device screen sizes, language differences, and unfamiliarity with specific applications and domain knowledge, users have to frequently switch between multiple applications and spend substantial time locating and integrating the information. To address these challenges, we present DroidRetriever, a system for cross-application information retrieval to facilitate mobile sensemaking. DroidRetriever can automatically navigate to relevant interfaces based on users' natural language commands, capture screenshots, extract and integrate information, and finally present the results. Our experimental results demonstrate that DroidRetriever can extract and integrate information with near-human accuracy while significantly reducing processing time. Furthermore, with minimal user intervention, DroidRetriever effectively corrects and completes various information retrieval tasks, substantially reducing the user's workload. Our summary of the motivations for intervention and the discussion of their necessity provide valuable implications for future research. We will open-source our code upon acceptance of the paper.

Paperid: 1717, https://arxiv.org/pdf/2505.02649.pdf

Abstract:
Gaze may enhance the robustness of lie detectors but remains under-studied. This study evaluated the efficacy of AI models (using fixations, saccades, blinks, and pupil size) for detecting deception in Concealed Information Tests across two datasets. The first, collected with Eyelink 1000, contains gaze data from a computerized experiment where 87 participants revealed, concealed, or faked the value of a previously selected card. The second, collected with Pupil Neon, involved 36 participants performing a similar task but facing an experimenter. XGBoost achieved accuracies up to 74% in a binary classification task (Revealing vs. Concealing) and 49% in a more challenging three-class classification task (Revealing vs. Concealing vs. Faking). Feature analysis identified saccade number, duration, amplitude, and maximum pupil size as the most important for deception prediction. These results demonstrate the feasibility of using gaze and AI to enhance lie detectors and encourage future research that may improve on this.

Paperid: 1718, https://arxiv.org/pdf/2505.02428.pdf

Abstract:
Training more counselors, from clinical students to peer supporters, can help meet the demand for accessible mental health support; however, current training approaches remain resource-intensive and difficult to scale effectively. Large Language Models (LLMs) offer promising solutions for growing counseling skills training through simulated practice and automated feedback. Despite successes in aligning LLMs with expert-counselor annotations, we do not know whether LLM-based counseling training tools -- such as AI patients that simulate real-world challenges and generative AI feedback with suggested alternatives and rationales -- actually lead to improvements in novice counselor skill development. We develop CARE, an LLM-simulated practice and feedback system, and randomize 94 novice counselors to practice using an AI patient, either alone or with AI feedback, measuring changes in their behavioral performance, self-assessments, and qualitative learning takeaways. Our results show the practice-and-feedback group improved in their use of reflections and questions (d=0.32-0.39, p<0.05). In contrast, the group that practiced with an AI patient alone did not show improvements, and in the case of empathy, actually had worse uses across time (d=-0.52, p=0.001) and when compared against the practice-and-feedback group (d=0.72, p=0.001). Participants' qualitative self-reflections revealed key differences: the practice-and-feedback group adopted a client-centered approach involving listening to and validating feelings, while the practice-alone group remained solution-oriented but delayed offering suggestions until gathering more information. Overall, these results suggest that LLM-based training systems can promote effective skill development, but that combining both simulated practice and structured feedback is critical.
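
The effect sizes reported above are given as d values. As a reading aid, the sketch below computes Cohen's d for two independent groups using the pooled standard deviation. The abstract does not say which variant of d the authors used, so this is only one common definition, and the scores in the example are invented for illustration rather than taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical skill ratings, invented purely for illustration.
feedback_group = [3.0, 4.0, 5.0, 4.0, 4.0]
practice_only_group = [3.0, 3.0, 4.0, 3.0, 2.0]
d = cohens_d(feedback_group, practice_only_group)
print(f"d = {d:.2f}")
```

A positive d here means the first group scored higher on average. By the usual rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.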

Paperid: 1719, https://arxiv.org/pdf/2505.02180.pdf

Abstract:
Masks are essential in medical settings and during infectious outbreaks but significantly impair speech communication, especially in environments with background noise. Existing solutions often require substantial computational resources or compromise hygiene and comfort. We propose a novel sensing approach that captures only the wearer's voice by detecting mask surface vibrations using a piezoelectric sensor. Our developed device, MaskClip, employs a stainless steel clip with an optimally positioned piezoelectric sensor to selectively capture speech vibrations while inherently filtering out ambient noise. Evaluation experiments demonstrated superior performance with a low Character Error Rate of 6.1% in noisy environments compared to conventional microphones. Subjective evaluations by 102 participants also showed high satisfaction scores. This approach shows promise for applications in settings where clear voice communication must be maintained while wearing protective equipment, such as medical facilities, cleanrooms, and industrial environments.

Paperid: 1720, https://arxiv.org/pdf/2505.01601.pdf

Abstract:
Creativity Support Tools (CSTs) are widely used across diverse creative domains, with generative AI recently increasing the abilities of CSTs. To better understand how the success of CSTs is determined in the literature, we conducted a review of outcome measures used in CST evaluations. Drawing from 173 CST evaluations in the ACM Digital Library, we identified the metrics commonly employed to assess user interactions with CSTs. Our findings reveal prevailing trends in current evaluation practices, while exposing underexplored measures that could broaden the scope of future research. Based on these results, we argue for a more holistic approach to evaluating CSTs, encouraging the HCI community to consider not only user experience and the quality of the generated output, but also user-centric aspects such as self-reflection and well-being as critical dimensions of assessment. We also highlight a need for validated measures specifically suited to the evaluation of generative AI in CSTs.

Paperid: 1721, https://arxiv.org/pdf/2505.01192.pdf

Abstract:
Artificial Intelligence (AI) systems are increasingly used for decision-making across domains, raising debates over the information and explanations they should provide. Most research on Explainable AI (XAI) has focused on feature-based explanations, with less attention on alternative styles. Personality traits like the Need for Cognition (NFC) can also lead to different decision-making outcomes among low and high NFC individuals. We investigated how presenting AI information (prediction, confidence, and accuracy) and different explanation styles (example-based, feature-based, rule-based, and counterfactual) affect accuracy, reliance on AI, and cognitive load in a loan application scenario. We also examined low and high NFC individuals' differences in prioritizing XAI interface elements (loan attributes, AI information, and explanations), accuracy, and cognitive load. Our findings show that high AI confidence significantly increases reliance on AI while reducing cognitive load. Feature-based explanations did not enhance accuracy compared to other conditions. Although counterfactual explanations were less understandable, they enhanced overall accuracy, increasing reliance on AI and reducing cognitive load when AI predictions were correct. Both low and high NFC individuals prioritized explanations after loan attributes, leaving AI information as the least important. However, we found no significant differences between low and high NFC groups in accuracy or cognitive load, raising questions about the role of personality traits in AI-assisted decision-making. These findings highlight the need for user-centric personalization in XAI interfaces, incorporating diverse explanation styles and exploring multiple personality traits and other user characteristics to optimize human-AI collaboration.

Paperid: 1722, https://arxiv.org/pdf/2504.20976.pdf

Abstract:
Navigating unfamiliar places continues to be one of the most persistent and essential everyday obstacles for those who are blind or have limited vision (BLV). Existing assistive technologies, such as GPS-based navigation systems, AI-powered smart glasses, and sonar-equipped canes, often face limitations in real-time obstacle avoidance, precise localization, and adaptability to dynamic surroundings. To investigate potential solutions, we introduced PathFinder, a novel map-less navigation system that explores different models for understanding 2D images, including Vision Language Models (VLMs), Large Language Models (LLMs), and employs monocular depth estimation for free-path detection. Our approach integrates a Depth-First Search (DFS) algorithm on depth images to determine the longest obstacle-free path, ensuring optimal route selection while maintaining computational efficiency. We conducted comparative evaluations against existing AI-powered navigation methods and performed a usability study with BLV participants. The results demonstrate that PathFinder achieves a favorable balance between accuracy, computational efficiency, and real-time responsiveness. Notably, it reduces mean absolute error (MAE) and improves decision-making speed in outdoor navigation compared to AI-based alternatives. Participant feedback emphasizes the system's usability and effectiveness in outside situations, but also identifies issues in complicated indoor locations and low-light conditions. Usability testing revealed that 73% of participants understood how to use the app in about a minute, and 80% praised its balance of accuracy, quick response, and overall convenience.
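
The idea of running a depth-first search over a depth image to find the longest obstacle-free path can be sketched as follows. This is a minimal illustration, not PathFinder's implementation. The fixed depth threshold, the movement model (start in the bottom row, move one row up per step to the same or an adjacent column), the memoization, and the function name `longest_free_path` are all assumptions made for this example.

```python
def longest_free_path(depth, min_depth):
    """Return the longest upward path of cells whose depth exceeds min_depth.

    depth is a list of rows (row 0 is the top of the image). A path starts in
    the bottom row and moves one row up per step, to the same or an adjacent
    column. Cells with depth <= min_depth are treated as obstacles.
    """
    rows, cols = len(depth), len(depth[0])
    best_from = {}  # memo: (row, col) -> longest path starting at that cell

    def dfs(r, c):
        if (r, c) in best_from:
            return best_from[(r, c)]
        best = [(r, c)]
        if r > 0:
            for nc in (c - 1, c, c + 1):
                if 0 <= nc < cols and depth[r - 1][nc] > min_depth:
                    candidate = [(r, c)] + dfs(r - 1, nc)
                    if len(candidate) > len(best):
                        best = candidate
        best_from[(r, c)] = best
        return best

    longest = []
    for c in range(cols):
        if depth[rows - 1][c] > min_depth:
            path = dfs(rows - 1, c)
            if len(path) > len(longest):
                longest = path
    return longest


# Toy 4x4 depth map in metres (invented values); small numbers are nearby obstacles.
toy_depth = [
    [0.5, 0.5, 6.0, 0.5],
    [0.5, 5.0, 0.5, 0.5],
    [4.0, 0.5, 0.5, 3.0],
    [3.0, 0.5, 2.5, 0.5],
]
path = longest_free_path(toy_depth, min_depth=2.0)
print(path)
```

Because every step moves strictly upward, the search graph has no cycles, and memoizing each cell's best continuation keeps the work linear in the number of pixels. A real system would first obtain the depth map from a monocular depth estimator and would likely downsample it before searching.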

Paperid: 1723, https://arxiv.org/pdf/2504.20886.pdf

Abstract:
In 2021, the City of Atlanta and Atlanta Police Foundation launched plans to build a large police training facility in the South River Forest in unincorporated DeKalb County, GA. Residents of Atlanta and DeKalb County, environmental activists, police and prison abolitionists, and other activists and concerned individuals formed the movement in opposition to the facility, known as the Stop Cop City / Defend the Atlanta Forest movement. Social media and digital maps became common tools for communicating information about the facility and the movement. Here, we examine online maps about the facility and the opposition movement, originating from grassroots organizations, the City of Atlanta, news media outlets, the Atlanta Police Foundation, and individuals. We gather and examine 32 publicly available maps collected through the Google Search API, Twitter (now X), Instagram and reddit. Using a framework of critical cartography, we conduct a content analysis of these maps to identify the mapping technologies and techniques (data, cartographic elements, styles) used by different stakeholders and roles that maps and mapping technologies can play in social movements. We examine the extent to which these maps provide data to confirm or contradict concerns raised by grassroots organizations and local residents about the facility. We find that stakeholders and mapmakers use geospatial tools in different ways and likely have varied access to mapping technologies. We argue that documenting the use of maps to communicate information about a contentious project can help enumerate community positions and perspectives, and we advocate for accessible mapmaking tools. We conclude by discussing the implications of accessibility of mapping technology and posting maps to social media, and share example map images that extend the geographic information systems (GIS) techniques seen in the retrieved maps.

Paperid: 1724, https://arxiv.org/pdf/2504.20844.pdf

Abstract:
Interactive communication in virtual reality can be used in experimental paradigms to increase the ecological validity of hearing device evaluations. This requires the virtual environment to elicit natural communication behaviour in listeners. This study evaluates the effect of virtual animated characters' head movements on participants' communication behaviour and experience. Triadic conversations were conducted between a test participant and two confederates. To facilitate the manipulation of head movements, the conversation was conducted in telepresence using a system that transmitted audio, head movement data and video with low delay. The confederates were represented by virtual animated characters (avatars) with different levels of animation: Static heads, automated head movement animations based on speech level onsets, and animated head movements based on the transmitted head movements of the interlocutors. A condition was also included in which the videos of the interlocutors' heads were embedded in the visual scene. The results show significant effects of animation level on the participants' speech and head movement behaviour as recorded by physical sensors, as well as on the subjective sense of presence and the success of the conversation. The largest effects were found for the range of head orientation during speech and the perceived realism of avatars. Participants reported that they were spoken to in a more helpful way when the avatars showed head movements transmitted from the interlocutors than when the avatars' heads were static. We therefore conclude that the representation of interlocutors must include sufficiently realistic head movements in order to elicit natural communication behaviour.

Paperid: 1725, https://arxiv.org/pdf/2504.20782.pdf

Abstract:
Adaptive User Interfaces (AUI) play a crucial role in modern software applications by dynamically adjusting interface elements to accommodate users' diverse and evolving needs. However, existing adaptation strategies often lack real-time responsiveness. Reinforcement Learning (RL) has emerged as a promising approach for addressing complex, sequential adaptation challenges, enabling adaptive systems to learn optimal policies based on previous adaptation experiences. Although RL has been applied to AUIs, integrating RL agents effectively within user interactions remains a challenge. In this paper, we enhance an RL-based Adaptive User Interface adaptation framework by incorporating personalized human feedback directly into the learning process. Unlike prior approaches that rely on a single pre-trained RL model, our approach trains a unique RL agent for each user, allowing individuals to actively shape their personal RL agent's policy, potentially leading to more personalized and responsive UI adaptations. To evaluate this approach, we conducted an empirical study to assess the impact of integrating human feedback into the RL-based Adaptive User Interface adaptation framework and its effect on User Experience (UX). The study involved 33 participants interacting with AUIs incorporating human feedback and non-adaptive user interfaces in two domains: an e-learning platform and a trip-planning application. The results suggest that incorporating human feedback into RL-driven adaptations significantly enhances UX, offering promising directions for advancing adaptive capabilities and user-centered design in AUIs.
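
To make the "one agent per user, shaped by that user's feedback" idea concrete, here is a minimal sketch. It is not the framework from the paper. It uses a one-step, bandit-style value update rather than full sequential RL. The class name, the context and action labels, and the +1/-1 feedback scale are all hypothetical.

```python
import random
from collections import defaultdict


class UserAdaptationAgent:
    """Tabular value learner over (UI context, adaptation) pairs for one user."""

    def __init__(self, actions, alpha=0.5, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha        # learning rate
        self.epsilon = epsilon    # exploration probability
        self.rng = random.Random(seed)
        self.q = defaultdict(float)  # (context, action) -> estimated value

    def choose(self, context):
        """Epsilon-greedy choice of an adaptation for the given context."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(context, a)])

    def learn(self, context, action, feedback):
        """Move the value estimate toward the user's explicit feedback."""
        key = (context, action)
        self.q[key] += self.alpha * (feedback - self.q[key])


agents = {}  # user id -> that user's personal agent


def agent_for(user_id):
    """Create a separate agent for each user on first use."""
    if user_id not in agents:
        agents[user_id] = UserAdaptationAgent(
            ["enlarge_text", "compact_layout", "no_change"]
        )
    return agents[user_id]


# Simulated feedback from one hypothetical user (+1 = liked, -1 = disliked).
alice = agent_for("alice")
for _ in range(20):
    alice.learn("reading", "enlarge_text", +1.0)
    alice.learn("reading", "compact_layout", -1.0)

alice.epsilon = 0.0  # act greedily to inspect what was learned
best = alice.choose("reading")
print(best)
```

Keeping a separate value table per user means one person's preferences never leak into another's policy. The cost is that each agent starts from scratch and needs its own feedback before its choices become useful.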

Paperid: 1726, https://arxiv.org/pdf/2504.20329.pdf

Abstract:
This paper investigates how developers conceptualize AI-powered Development Tools and how these role attributions influence technology acceptance. Through qualitative analysis of 38 interviews and a quantitative survey with 102 participants, we identify two primary Mental Models: AI as an inanimate tool and AI as a human-like teammate. Factor analysis further groups AI roles into Support Roles (e.g., assistant, reference guide) and Expert Roles (e.g., advisor, problem solver). We find that assigning multiple roles to AI correlates positively with Perceived Usefulness and Perceived Ease of Use, indicating that diverse conceptualizations enhance AI adoption. These insights suggest that AI4SE tools should accommodate varying user expectations through adaptive design strategies that align with different Mental Models.

Paperid: 1727, https://arxiv.org/pdf/2504.19772.pdf

Abstract:
Working memory involves the temporary retention of information over short periods. It is a critical cognitive function that enables humans to perform various online processing tasks, such as dialing a phone number, recalling misplaced items' locations, or navigating through a store. However, inherent limitations in an individual's capacity to retain information often result in forgetting important details during such tasks. Although previous research has successfully utilized wearable and assistive technologies to enhance long-term memory functions (e.g., episodic memory), their application to supporting short-term recall in daily activities remains underexplored. To address this gap, we present Memento, a framework that uses multimodal wearable sensor data to detect significant changes in cognitive state and provide intelligent in situ cues to enhance recall. Through two user studies involving 15 and 25 participants in visual search navigation tasks, we demonstrate that participants receiving visual cues from Memento achieved significantly better route recall, improving approximately 20-23% compared to free recall. Furthermore, Memento reduced cognitive load and review time by 46% while also substantially reducing computation time (3.86 seconds vs. 15.35 seconds), offering an average of 75% effectiveness compared to computer vision-based cue selection approaches.

Paperid: 1728, https://arxiv.org/pdf/2504.16416.pdf

Abstract:
Design thrives on feedback. However, gathering constant feedback throughout the design process can be labor-intensive and disruptive. We explore how AI can bridge this gap by providing effortless, ambient feedback. We introduce FeedQUAC, a design companion that delivers real-time AI-generated commentary from a variety of perspectives through different personas. A design probe study with eight participants highlights how designers can leverage quick yet ambient AI feedback to enhance their creative workflows. Participants highlight benefits such as convenience, playfulness, confidence boost, and inspiration from this lightweight feedback agent, while suggesting additional features, like chat interaction and context curation. We discuss the role of AI feedback, its strengths and limitations, and how to integrate it into existing design workflows while balancing user involvement. Our findings also suggest that ambient interaction is a valuable consideration for both the design and evaluation of future creativity support systems.

Paperid: 1729, https://arxiv.org/pdf/2504.15429.pdf

Abstract:
The prevalence of distressing content on social media raises concerns about users' mental well-being, prompting the use of trigger warnings (TW) and content warnings (CW). However, inconsistent implementation of TW/CW across platforms and the lack of standardized practices confuse users regarding these warnings. To better understand how users experienced and utilized these warnings, we conducted a semi-structured interview study with 15 general social media users. Our findings reveal challenges across three key stakeholders: viewers, who need to decide whether to engage with warning-labeled content; posters, who struggle with whether and how to apply TW/CW to the content; and platforms, whose design features shape the visibility and usability of warnings. While users generally expressed positive attitudes toward warnings, their understanding of TW/CW usage was limited. Based on these insights, we proposed a conceptual framework of the TW/CW mechanisms from multiple stakeholders' perspectives. Lastly, we further reflected on our findings and discussed the opportunities for social media platforms to enhance users' TW/CW experiences, fostering a more trauma-informed social media environment.

Paperid: 1730, https://arxiv.org/pdf/2504.14536.pdf

Abstract:
The Ephemeral Shadow is an interactive art installation centered on the concept of "simulacrum," focusing on the reconstruction of subjectivity at the intersection of reality and virtuality. Drawing inspiration from the aesthetic imagery of traditional shadow puppetry, the installation combines robotic performance and digital projection to create a multi-layered visual space, presenting a progressively dematerialized hyperreal experience. By blurring the audience's perception of the boundaries between entity and image, the work employs the replacement of physical presence with imagery as its core technique, critically reflecting on issues of technological subjectivity, affective computing, and ethics. Situated within the context of posthumanism and digital media, the installation prompts viewers to contemplate: as digital technologies increasingly approach and simulate "humanity," how can we reshape identity and perception while safeguarding the core values and ethical principles of human subjectivity?

Paperid: 1731, https://arxiv.org/pdf/2504.14115.pdf

Abstract:
We investigate tasks that can be accomplished with unlabelled graphs, where nodes do not have persistent or semantically meaningful labels. New techniques to visualize these graphs have been proposed, but more understanding of unlabelled graph tasks is required before they can be adequately evaluated. Some tasks apply to both labelled and unlabelled graphs, but many do not translate between these contexts. We propose a taxonomy of unlabelled graph abstract tasks, organized according to the Scope of the data at play, the Action intended by the user, and the Target data under consideration. We show the descriptive power of this task abstraction by connecting to concrete examples from previous frameworks, and connect these abstractions to real-world problems. To showcase the evaluative power of the taxonomy, we perform a preliminary assessment of 6 visualizations for each task. For each combination of task and visual encoding, we consider the effort required from viewers, the likelihood of task success, and how both factors vary between small-scale and large-scale graphs.

Paperid: 1732, https://arxiv.org/pdf/2504.14105.pdf

Abstract:
Current AI models often fail to account for local context and language, given the predominance of English and Western internet content in their training data. This hinders the global relevance, usefulness, and safety of these models as they gain more users around the globe. Amplify Initiative, a data platform and methodology, leverages expert communities to collect diverse, high-quality data to address the limitations of these models. The platform is designed to enable co-creation of datasets, provide access to high-quality multilingual datasets, and offer recognition to data authors. This paper presents the approach to co-creating datasets with domain experts (e.g., health workers, teachers) through a pilot conducted in Sub-Saharan Africa (Ghana, Kenya, Malawi, Nigeria, and Uganda). In partnership with local researchers situated in these countries, the pilot demonstrated an end-to-end approach to co-creating data with 155 experts in sensitive domains (e.g., physicians, bankers, anthropologists, human and civil rights advocates). This approach, implemented with an Android app, resulted in an annotated dataset of 8,091 adversarial queries in seven languages (e.g., Luganda, Swahili, Chichewa), capturing nuanced and contextual information related to key themes such as misinformation and public interest topics. This dataset in turn can be used to evaluate models for their safety and cultural relevance within the context of these languages.

Paperid: 1733, https://arxiv.org/pdf/2504.13947.pdf

Abstract:
In this paper, we introduce a speculative design methodology for studying the behavior of generative AI systems, framing design as a mode of inquiry. We propose bridging seemingly unrelated domains to generate intentional context voids, using these tasks as probes to elicit AI model behavior. We demonstrate this through a case study: probing the ChatGPT system (GPT-4 and DALL-E) to generate headshots from professional Curricula Vitae (CVs). In contrast to traditional ways, our approach assesses system behavior under conditions of radical uncertainty -- when forced to invent entire swaths of missing context -- revealing subtle stereotypes and value-laden assumptions. We qualitatively analyze how the system interprets identity and competence markers from CVs, translating them into visual portraits despite the missing context (i.e. physical descriptors). We show that within this context void, the AI system generates biased representations, potentially relying on stereotypical associations or blatant hallucinations.

Paperid: 1734, https://arxiv.org/pdf/2504.13916.pdf

Abstract:
Learning by Asking (LBA) enables robots to identify knowledge gaps during task execution and acquire the missing information by asking targeted questions. However, different tasks often require different types of questions, and how to adapt questioning strategies accordingly remains underexplored. This paper investigates human questioning behavior in two representative household service tasks: a Goal-Oriented task (refrigerator organization) and a Process-Oriented task (cocktail mixing). Through a human-human study involving 28 participants, we analyze the questions asked using a structured framework that encodes each question along three dimensions: acquired knowledge, cognitive process, and question form. Our results reveal that participants adapt both question types and their temporal ordering based on task structure. Goal-Oriented tasks elicited early inquiries about user preferences, while Process-Oriented tasks led to ongoing, parallel questioning of procedural steps and preferences. These findings offer actionable insights for developing task-sensitive questioning strategies in LBA-enabled robots for more effective and personalized human-robot collaboration.

Paperid: 1735, https://arxiv.org/pdf/2504.13903.pdf

Abstract:
AI-assisted development tools promise productivity gains and improved code quality, yet their adoption among developers remains inconsistent. Prior research suggests that professional expertise influences technology adoption, but its role in shaping developers' perceptions of AI tools is unclear. We analyze survey data from 3380 developers to examine how coding experience relates to AI awareness, adoption, and the roles developers assign to AI in their workflow. Our findings reveal that coding experience does not predict AI adoption but significantly influences mental models of AI's role. Experienced developers are more likely to perceive AI as a junior colleague, a content generator, or assign it no role, whereas less experienced developers primarily view AI as a teacher. These insights suggest that AI tools must align with developers' expertise levels to drive meaningful adoption.

Paperid: 1736, https://arxiv.org/pdf/2504.13871.pdf

Abstract:
This study examines the understudied role of algorithmic evaluation of human judgment in hybrid decision-making systems, a critical gap in management research. While extant literature focuses on human reluctance to follow algorithmic advice, we reverse the perspective by investigating how AI agents based on large language models (LLMs) assess and integrate human input. Our work addresses a pressing managerial constraint: firms barred from deploying LLMs directly due to privacy concerns can still leverage them as mediating tools (for instance, anonymized outputs or decision pipelines) to guide high-stakes choices like pricing or discounts without exposing proprietary data. Through a controlled prediction task, we analyze how an LLM-based AI agent weights human versus algorithmic predictions. We find that the AI system systematically discounts human advice, penalizing human errors more severely than algorithmic errors, a bias exacerbated when the agent's identity (human vs AI) is disclosed and the human is positioned second. These results reveal a disconnect between AI-generated trust metrics and the actual influence of human judgment, challenging assumptions about equitable human-AI collaboration. Our findings offer three key contributions. First, we identify a reverse algorithm aversion phenomenon, where AI agents undervalue human input despite comparable error rates. Second, we demonstrate how disclosure and positional bias interact to amplify this effect, with implications for system design. Third, we provide a framework for indirect LLM deployment that balances predictive power with data privacy. For practitioners, this research emphasizes the need to audit AI weighting mechanisms, calibrate trust dynamics, and strategically design decision sequences in human-AI systems.
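
One simple way to audit how a decision-making agent weights two advisors is to regress its final predictions on the two inputs. The sketch below does this with ordinary least squares, without an intercept, solved in closed form. The abstract does not describe the authors' estimation procedure, so this is a generic illustration. The function name and all numbers are invented.

```python
def fit_advice_weights(human, algo, final):
    """Least-squares weights (w_human, w_algo) with final ~ w_human*human + w_algo*algo."""
    shh = sum(h * h for h in human)
    saa = sum(a * a for a in algo)
    sha = sum(h * a for h, a in zip(human, algo))
    shf = sum(h * f for h, f in zip(human, final))
    saf = sum(a * f for a, f in zip(algo, final))
    det = shh * saa - sha * sha
    if det == 0:
        raise ValueError("human and algorithmic predictions are collinear")
    # Solve the 2x2 normal equations with Cramer's rule.
    w_human = (shf * saa - saf * sha) / det
    w_algo = (saf * shh - shf * sha) / det
    return w_human, w_algo


# Invented example: an agent that places 30% weight on the human and 70% on the algorithm.
human_pred = [10.0, 20.0, 30.0, 40.0]
algo_pred = [12.0, 18.0, 33.0, 35.0]
agent_final = [0.3 * h + 0.7 * a for h, a in zip(human_pred, algo_pred)]

w_human, w_algo = fit_advice_weights(human_pred, algo_pred, agent_final)
print(round(w_human, 3), round(w_algo, 3))
```

If the estimated human weight stays well below the algorithmic weight even when both sources have similar error rates, that matches the kind of systematic discounting the abstract describes. With real data the fit would be noisy, so confidence intervals would be needed before drawing conclusions.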

Paperid: 1737, https://arxiv.org/pdf/2504.13486.pdf

Abstract:
In recent years, we have seen an influx in reliance on AI assistants for information seeking. Given this widespread use and the known challenges AI poses for Black users, recent efforts have emerged to identify key considerations needed to provide meaningful support. One notable effort is the development of ChatBlackGPT, a culturally informed AI assistant designed to provide culturally relevant responses. Despite the existence of ChatBlackGPT, there is no research on when and how Black communities might engage with culturally informed AI assistants and the distinctions between engagement with general purpose tools like ChatGPT. To fill this gap, we propose a research agenda grounded in results from a preliminary comparative analysis of outputs provided by ChatGPT and ChatBlackGPT for travel-related inquiries. Our efforts thus far emphasize the need to consider Black communities' values, perceptions, and experiences when designing AI assistants that acknowledge the Black lived experience.

Paperid: 1738, https://arxiv.org/pdf/2504.13069.pdf

Abstract:
Alt-text is essential for mobile app accessibility, yet UI icons often lack meaningful descriptions, limiting accessibility for screen reader users. Existing approaches either require extensive labeled datasets, struggle with partial UI contexts, or operate post-development, increasing technical debt. We first conduct a formative study to determine when and how developers prefer to generate icon alt-text. We then explore the ALTICON approach for generating alt-text for UI icons during development using two fine-tuned models: a text-only large language model that processes extracted UI metadata and a multi-modal model that jointly analyzes icon images and textual context. To improve accuracy, the method extracts relevant UI information from the DOM tree, retrieves in-icon text via OCR, and applies structured prompts for alt-text generation. Our empirical evaluation with the most closely related deep-learning and vision-language models shows that ALTICON generates alt-text that is of higher quality while not requiring a full-screen input.

Paperid: 1739, https://arxiv.org/pdf/2504.12614.pdf

Abstract:
Enhancing emotional well-being has become a significant focus in HCI and CSCW, with technologies increasingly designed to track, visualize, and manage emotions. However, these approaches have faced criticism for potentially suppressing certain emotional experiences. Through a scoping review of 53 empirical studies from ACM proceedings implementing Technology-Mediated Emotion Intervention (TMEI), we critically examine current practices through lenses drawn from HCI critical theories. Our analysis reveals emotion intervention mechanisms that extend beyond traditional emotion regulation paradigms, identifying care-centered goals that prioritize non-judgmental emotional support and preserve users' identities. The findings demonstrate how researchers design technologies for generating artificial care, intervening in power dynamics, and nudging behavioral changes. We contribute the concept of "emotion support" as an alternative approach to "emotion regulation," emphasizing human-centered approaches to emotional well-being. This work advances the understanding of diverse human emotional needs beyond individual and cognitive perspectives, offering design implications that critically reimagine how technologies can honor emotional complexity, preserve human agency, and transform power dynamics in care contexts.

Paperid: 1740, https://arxiv.org/pdf/2504.12488.pdf

Abstract:
As generative AI tools like ChatGPT become integral to everyday writing, critical questions arise about how to preserve writers' sense of agency and ownership when using these tools. Yet, a systematic understanding of how AI assistance affects different aspects of the writing process - and how this shapes writers' agency - remains underexplored. To address this gap, we conducted a systematic review of 109 HCI papers using the PRISMA approach. From this literature, we identify four overarching design strategies for AI writing support: structured guidance, guided exploration, active co-writing, and critical feedback - mapped across the four key cognitive processes in writing: planning, translating, reviewing, and monitoring. We complement this analysis with interviews of 15 writers across diverse domains. Our findings reveal that writers' desired levels of AI intervention vary across the writing process: content-focused writers (e.g., academics) prioritize ownership during planning, while form-focused writers (e.g., creatives) value control over translating and reviewing. Writers' preferences are also shaped by contextual goals, values, and notions of originality and authorship. By examining when ownership matters, what writers want to own, and how AI interactions shape agency, we surface both alignment and gaps between research and user needs. Our findings offer actionable design guidance for developing human-centered writing tools for co-writing with AI, on human terms.

Paperid: 1741, https://arxiv.org/pdf/2504.12236.pdf

Abstract:
Supporting student success requires collaboration among multiple stakeholders. Researchers have explored machine learning models for academic performance prediction; yet key challenges remain in ensuring these models are interpretable, equitable, and actionable within real-world educational support systems. First, many models prioritize predictive accuracy but overlook human-centered machine learning principles, limiting trust among students and reducing their usefulness for educators and institutional decision-makers. Second, most models require at least a month of data before making reliable predictions, delaying opportunities for early intervention. Third, current models primarily rely on sporadically collected, classroom-derived data, missing broader behavioral patterns that could provide more continuous and actionable insights. To address these gaps, we present three modeling approaches (LR, 1D-CNN, and MTL-1D-CNN) to classify students as low or high academic performers. We evaluate them based on explainability, fairness, and generalizability to assess their alignment with key social values. Using behavioral and self-reported data collected within the first week of two Spring terms, we demonstrate that these models can identify at-risk students as early as week one. However, trade-offs across human-centered machine learning principles highlight the complexity of designing predictive models that effectively support multi-stakeholder decision-making and intervention strategies. We discuss these trade-offs and their implications for different stakeholders, outlining how predictive models can be integrated into student support systems. Finally, we examine broader socio-technical challenges in deploying these models and propose future directions for advancing human-centered, collaborative academic prediction systems.

Paperid: 1742, https://arxiv.org/pdf/2504.11653.pdf

Abstract:
Low-cost teleguidance of medical procedures is becoming essential to provide healthcare to remote and underserved communities. Human teleoperation is a promising new method for guiding a novice person with relatively high precision and efficiency through a mixed reality (MR) interface. Prior work has shown that the novice, or "follower", can reliably track the MR input with performance not unlike a telerobotic system. As a consequence, it is of interest to understand and control the follower's dynamics to optimize the system performance and permit stable and transparent bilateral teleoperation. To this end, linearity, time-invariance, inter-axis coupling, and passivity are important in teleoperation and controller design. This paper therefore explores these effects with regard to the follower person in human teleoperation. It is demonstrated through modeling and experiments that the follower can indeed be treated as approximately linear and time invariant, with little coupling and a large excess of passivity at practical frequencies. Furthermore, a stochastic model of the follower dynamics is derived. These results will permit controller design and analysis to improve the performance of human teleoperation.

Paperid: 1743, https://arxiv.org/pdf/2504.10849.pdf

Abstract:
Rich-text captions are essential to help communication for Deaf and hard-of-hearing (DHH) people, second-language learners, and those with autism spectrum disorder (ASD). They also preserve nuances when converting speech to text, enhancing the realism of presentation scripts and conversation or speech logs. However, current real-time captioning systems lack the capability to alter text attributes (e.g., capitalization, size, and font) at the word level, hindering the accurate conveyance of speaker intent that is expressed in the tone or intonation of the speech. For example, "YOU should do this" tends to be read as placing the focus on "You", whereas "You should do THIS" tends to place it on "This". This paper proposes a solution that changes the text decorations at the word level in real time. As a prototype, we developed an application that adjusts word size based on the loudness of each spoken word. Feedback from users suggests that this system helped to convey the speaker's intent, offering a more engaging and accessible captioning experience.
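
The loudness-to-size idea in the prototype can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dBFS thresholds, and the point-size range are all assumptions chosen for the example, and word-level audio segmentation is taken as given.

```python
import math

def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def font_size_for(level_db, quiet_db=-40.0, loud_db=-10.0, min_pt=14.0, max_pt=36.0):
    """Linearly map a loudness level to a font size, clamped to [min_pt, max_pt].

    The thresholds and sizes are hypothetical defaults, not values from the paper.
    """
    t = (level_db - quiet_db) / (loud_db - quiet_db)
    t = max(0.0, min(1.0, t))  # clamp so silence and shouting stay within range
    return min_pt + t * (max_pt - min_pt)

def style_caption(words):
    """words: list of (text, samples) pairs -> list of (text, font size in points)."""
    return [(text, round(font_size_for(rms_dbfs(samples)), 1)) for text, samples in words]
```

A caller would pass each recognized word together with its audio samples and render the word at the returned size; a perceptual loudness measure could replace the plain RMS level used here.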

Paperid: 1744, https://arxiv.org/pdf/2504.10768.pdf

Abstract:
This paper examines the thin-slicing approach - the ability to make accurate judgments based on minimal information - in the context of scientific presentations. Drawing on research from nonverbal communication and personality psychology, we show that brief excerpts (thin slices) reliably predict overall presentation quality. Using a novel corpus of over one hundred real-life science talks, we employ Large Language Models (LLMs) to evaluate transcripts of full presentations and their thin slices. By correlating LLM-based evaluations of short excerpts with full-talk assessments, we determine how much information is needed for accurate predictions. Our results demonstrate that LLM-based evaluations align closely with human ratings, proving their validity, reliability, and efficiency. Critically, even very short excerpts (less than 10 percent of a talk) strongly predict overall evaluations. This suggests that the first moments of a presentation convey relevant information that is used in quality evaluations and can shape lasting impressions. The findings are robust across different LLMs and prompting strategies. This work extends thin-slicing research to public speaking and connects theories of impression formation to LLMs and current research on AI communication. We discuss implications for communication and social cognition research on message reception. Lastly, we suggest an LLM-based thin-slicing framework as a scalable feedback tool to enhance human communication.
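
The core comparison, correlating scores for short excerpts with scores for full talks, can be sketched as below. This is an illustrative outline under stated assumptions: `thin_slice` takes the opening fraction of a transcript by word count, the scores are assumed to come from some external rater (an LLM or a human), and neither helper reflects the authors' actual pipeline.

```python
import math

def thin_slice(transcript, fraction=0.1):
    """Return the opening `fraction` of a transcript, measured in words."""
    words = transcript.split()
    n = max(1, math.ceil(len(words) * fraction))
    return " ".join(words[:n])

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)
```

Given one score per talk for the excerpt and one for the full transcript, `pearson(slice_scores, full_scores)` gives the strength of the thin-slice prediction; repeating this for several values of `fraction` shows how much of a talk is needed.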

Paperid: 1745, https://arxiv.org/pdf/2504.09902.pdf

Abstract:
Quantum computing is an emerging field that utilizes the unique principles of quantum mechanics to offer significant advantages in algorithm execution over classical approaches. This potential is particularly promising in the domain of quantum image processing, which aims to manipulate all pixels simultaneously. However, the process of designing and verifying these algorithms remains a complex and error-prone task. To address this challenge, new methods are needed to support effective debugging of quantum circuits. The Quantum Image Visualizer is an interactive visual analysis tool that allows for the examination of quantum images and their transformation throughout quantum circuits. The framework incorporates two overview visualizations that trace image evolution across a sequence of gates based on the most probable outcomes. Interactive exploration allows users to focus on relevant gates, and select pixels of interest. Upon selection, detailed visualizations enable in-depth inspection of individual pixels and their probability distributions, revealing how specific gates influence the likelihood of pixel color values and the magnitude of these changes. An evaluation of the Quantum Image Visualizer was conducted through in-depth interviews with eight domain experts. The findings demonstrate the effectiveness and practical value of our approach in supporting visual debugging of quantum image processing circuits.

Paperid: 1746, https://arxiv.org/pdf/2504.09346.pdf

Abstract:
Recent advances in artificial intelligence (AI) speech generation and voice cloning technologies have produced naturalistic speech and accurate voice replication, yet their influence on sociotechnical systems across diverse accents and linguistic traits is not fully understood. This study evaluates two synthetic AI voice services (Speechify and ElevenLabs) through a mixed methods approach using surveys and interviews to assess technical performance and uncover how users' lived experiences influence their perceptions of accent variations in these speech technologies. Our findings reveal technical performance disparities across five regional, English-language accents and demonstrate how current speech generation technologies may inadvertently reinforce linguistic privilege and accent-based discrimination, potentially creating new forms of digital exclusion. Overall, our study highlights the need for inclusive design and regulation by providing actionable insights for developers, policymakers, and organizations to ensure equitable and socially responsible AI speech technologies.

Paperid: 1747, https://arxiv.org/pdf/2504.09283.pdf

Abstract:
How do we update AI memory of user intent as intent changes? We consider how an AI interface may assist the integration of new information into a repository of natural language data. Inspired by software engineering concepts like impact analysis, we develop methods and a UI for managing semantic changes with non-local effects, which we call "semantic conflict resolution." The user commits new intent to a project -- makes a "semantic commit" -- and the AI helps the user detect and resolve semantic conflicts within a store of existing information representing their intent (an "intent specification"). We develop an interface, SemanticCommit, to better understand how users resolve conflicts when updating intent specifications such as Cursor Rules and game design documents. A knowledge graph-based RAG pipeline drives conflict detection, while LLMs assist in suggesting resolutions. We evaluate our technique on an initial benchmark. Then, we report a 12-user within-subjects study of SemanticCommit for two task domains -- game design documents, and AI agent memory in the style of ChatGPT memories -- where users integrated new information into an existing list. Half of our participants adopted a workflow of impact analysis, where they would first flag conflicts without AI revisions, then resolve conflicts locally, despite having access to a global revision feature. We argue that AI agent interfaces, such as software IDEs like Cursor and Windsurf, should provide affordances for impact analysis and help users validate AI retrieval independently from generation. Our work speaks to how AI agent designers should think about updating memory as a process that involves human feedback and decision-making.

Paperid: 1748, https://arxiv.org/pdf/2504.08687.pdf

Abstract:
Writing well requires not only expressing ideas but also refining them through revision, a process facilitated by reflection. Prior research suggests that feedback delivered through dialogues, such as those in writing center tutoring sessions, can help writers reflect more thoughtfully on their work compared to static feedback. Recent advancements in multi-modal large language models (LLMs) now offer new possibilities for supporting interactive and expressive voice-based reflection in writing. In particular, we propose that LLM-generated static feedback can be repurposed as conversation starters, allowing writers to seek clarification, request examples, and ask follow-up questions, thereby fostering deeper reflection on their writing. We argue that voice-based interaction can naturally facilitate this conversational exchange, encouraging writers' engagement with higher-order concerns, facilitating iterative refinement of their reflections, and reducing cognitive load compared to text-based interactions. To investigate these effects, we propose a formative study exploring how text versus voice input influences writers' reflection and subsequent revisions. Findings from this study will inform the design of intelligent and interactive writing tools, offering insights into how voice-based interactions with LLM-powered conversational agents can support reflection and revision.

Paperid: 1749, https://arxiv.org/pdf/2504.07001.pdf

Abstract:
Caregiving of older adults is an urgent global challenge, with many older adults preferring to age in place rather than enter residential care. However, providing adequate home-based assistance remains difficult, particularly in geographically vast regions. Teleoperated robots offer a promising solution, but conventional motion-mapping teleoperation imposes unnatural movement constraints on operators, leading to muscle fatigue and reduced usability. This paper presents a novel teleoperation framework that leverages action recognition to enable intuitive remote robot control. Using our simplified Spatio-Temporal Graph Convolutional Network (S-ST-GCN), the system recognizes human actions and executes corresponding preset robot trajectories, eliminating the need for direct motion synchronization. A finite-state machine (FSM) is integrated to enhance reliability by filtering out misclassified actions. Our experiments demonstrate that the proposed framework enables effortless operator movement while ensuring accurate robot execution. This proof-of-concept study highlights the potential of teleoperation with action recognition for enabling caregivers to remotely assist older adults during activities of daily living (ADLs). Future work will focus on improving the S-ST-GCN's recognition accuracy and generalization, integrating advanced motion planning techniques to further enhance robotic autonomy in older adult care, and conducting a user study to evaluate the system's telepresence and ease of control.
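
One way a finite-state machine can filter misclassified actions is sketched below. The abstract does not describe the FSM's design, so everything here is an assumption made for illustration: the class name, the rule that a label must repeat for `min_streak` consecutive frames, and the example states and actions.

```python
class ActionFilter:
    """Accept a recognized action only if it is stable and valid in the current state."""

    def __init__(self, transitions, start, min_streak=3):
        self.transitions = transitions  # state -> {action label: next state}
        self.state = start
        self.min_streak = min_streak    # consecutive identical predictions required
        self._last = None
        self._streak = 0

    def update(self, predicted):
        """Feed one per-frame prediction; return an action to execute, or None."""
        if predicted == self._last:
            self._streak += 1
        else:
            self._last, self._streak = predicted, 1
        allowed = self.transitions.get(self.state, {})
        if self._streak >= self.min_streak and predicted in allowed:
            self.state = allowed[predicted]
            self._last, self._streak = None, 0  # require a fresh streak next time
            return predicted
        return None


# Hypothetical two-state task: the robot must reach before it can place.
TRANSITIONS = {"idle": {"reach": "holding"}, "holding": {"place": "idle"}}
```

Feeding the filter a noisy label stream triggers a preset trajectory only when the label is both sustained and permitted from the current state, so an isolated "place" frame while the robot is idle is ignored.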

Paperid: 1750, https://arxiv.org/pdf/2504.06718.pdf

Abstract:
Successful, enjoyable group interactions are important in public and personal contexts, especially for teenagers whose peer groups are important for self-identity and self-esteem. Social robots seemingly have the potential to positively shape group interactions, but it seems difficult to effect such impact by designing robot behaviors solely based on related (human interaction) literature. In this article, we take a user-centered approach to explore how teenagers envisage a social robot "group assistant". We engaged 16 teenagers in focus groups, interviews, and robot testing to capture their views and reflections about robots for groups. Over the course of a two-week summer school, participants co-designed the action space for such a robot and experienced working with/wizarding it for 10+ hours. This experience further altered and deepened their insights into using robots as group assistants. We report results regarding teenagers' views on the applicability and use of a robot group assistant, how these expectations evolved throughout the study, and their repeat interactions with the robot. Our results indicate that each group moves on a spectrum of need for the robot, reflected in use of the robot more (or less) for ice-breaking, turn-taking, and fun-making as the situation demanded.

Paperid: 1751, https://arxiv.org/pdf/2504.05748.pdf

Abstract:
Effective human behavior modeling is critical for successful human-robot interaction. Current state-of-the-art approaches for predicting listening head behavior during dyadic conversations employ continuous-to-discrete representations, where a continuous facial motion sequence is converted into discrete latent tokens. However, non-verbal facial motion presents unique challenges owing to its temporal variance and multi-modal nature. State-of-the-art discrete motion token representations struggle to capture underlying non-verbal facial patterns, which makes training the listening head inefficient and yields low-fidelity generated motion. This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process. Additionally, we apply this novel sparse representation to the task of listening head prediction, demonstrating its contribution to improving the explanation of facial motion patterns.
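
The general idea of a sparse keyframe representation with interpolated in-between frames can be illustrated on a one-dimensional motion signal. The sketch below is not the paper's method: it swaps in a simple greedy split rule (similar in spirit to Ramer-Douglas-Peucker line simplification) and plain linear interpolation, and the function names and tolerance are assumptions.

```python
def select_keyframes(seq, tol):
    """Keep the endpoints, then split at the worst-fit frame until every frame
    lies within `tol` of the linear interpolation between kept keyframes."""
    if len(seq) < 2:
        raise ValueError("need at least two frames")
    keep = {0, len(seq) - 1}
    stack = [(0, len(seq) - 1)]
    while stack:
        lo, hi = stack.pop()
        worst, worst_err = None, tol
        for i in range(lo + 1, hi):
            t = (i - lo) / (hi - lo)
            err = abs(seq[i] - (seq[lo] + t * (seq[hi] - seq[lo])))
            if err > worst_err:
                worst, worst_err = i, err
        if worst is not None:
            keep.add(worst)
            stack += [(lo, worst), (worst, hi)]
    return [(i, seq[i]) for i in sorted(keep)]

def interpolate(keyframes, length):
    """Rebuild a dense sequence from sorted (index, value) keyframes."""
    dense = [0.0] * length
    for (i0, v0), (i1, v1) in zip(keyframes, keyframes[1:]):
        for i in range(i0, i1 + 1):
            t = (i - i0) / (i1 - i0)
            dense[i] = v0 + t * (v1 - v0)
    return dense
```

For a triangular signal such as `[0, 1, 2, 3, 2, 1, 0]`, three keyframes (the two endpoints and the peak) are enough to reconstruct every frame; real facial motion would use vectors per frame and a learned predictor rather than a fixed tolerance.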

Paperid: 1752, https://arxiv.org/pdf/2504.04508.pdf

Abstract:
Trust in automated vehicles (AVs) has traditionally been explored through a cognitive lens, but growing evidence highlights the significant role emotions play in shaping trust. This study investigates how risk perception and AV performance (error vs. no error) influence emotional responses and trust in AVs, using mediation analysis to examine the indirect effects of emotions. In this study, 70 participants (42 male, 28 female) watched real-life recorded videos of AVs operating with or without errors, coupled with varying levels of risk information (high, low, or none). They reported their anticipated emotional responses using 19 discrete emotion items, and trust was assessed through dispositional, learned, and situational trust measures. Factor analysis identified four key emotional components, namely hostility, confidence, anxiety, and loneliness, that were influenced by risk perception and AV performance. The linear mixed model showed that risk perception was not a significant predictor of trust, while performance and individual differences were. Mediation analysis revealed that confidence was a strong positive mediator, while hostile and anxious emotions negatively impacted trust. However, lonely emotions did not significantly mediate the relationship between AV performance and trust. The results show that real-time AV behavior is more influential on trust than pre-existing risk perceptions, indicating trust in AVs might be more experience-based than shaped by prior beliefs. Our findings also underscore the importance of fostering positive emotional responses for trust calibration, which has important implications for user experience design in automated driving.
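
For readers unfamiliar with mediation analysis, the textbook single-mediator linear model is shown below. It is a generic sketch, not the study's exact specification (which the abstract does not give, and which may involve mixed effects and several mediators). The symbols are assumptions: $X$ stands for AV performance (error vs. no error), $M$ for an emotion factor score such as confidence, and $Y$ for trust.

```latex
\begin{align}
M &= i_M + a\,X + \varepsilon_M \\
Y &= i_Y + c'\,X + b\,M + \varepsilon_Y \\
\text{indirect effect} &= a\,b, \qquad \text{total effect } c = c' + a\,b
\end{align}
```

In this form, a "strong positive mediator" corresponds to a large positive product $a\,b$, while a non-significant mediator is one whose indirect effect cannot be distinguished from zero.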

Paperid: 1753, https://arxiv.org/pdf/2504.04298.pdf

Abstract:
Generative art merges creativity with computation, using algorithms to produce aesthetic works. This paper introduces Samila, a Python-based generative art library that employs mathematical functions and randomness to create visually compelling compositions. The system allows users to control the generation process through random seeds, function selections, and projection modes, enabling the exploration of randomness and artistic expression. By adjusting these parameters, artists can create diverse compositions that reflect intentionality and unpredictability. We demonstrate that Samila's outputs are uniquely determined by two random generation seeds, making regeneration nearly impossible without both. Additionally, altering the point generation functions while preserving the seed produces artworks with distinct graphical characteristics, forming a visual family. Samila serves as both a creative tool for artists and an educational resource for teaching mathematical and programming concepts. It also provides a platform for research in generative design and computational aesthetics. Future developments could include AI-driven generation and aesthetic evaluation metrics to enhance creative control and accessibility.
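
The role of the seed and the point generation functions can be illustrated with a small stand-alone sketch. This does not use Samila's actual API: `generate_points`, the two example functions, and the single-seed simplification are assumptions made to show why a fixed seed reproduces the same point cloud while a different seed or function yields a different one.

```python
import math
import random

def generate_points(seed, f1, f2, start=-math.pi, stop=math.pi, steps=50):
    """Evaluate two user-supplied functions over a grid, with seeded randomness."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    rng = random.Random(seed)  # private generator, so results depend only on the seed
    points = []
    for i in range(steps):
        x = start + (stop - start) * i / (steps - 1)
        for j in range(steps):
            y = start + (stop - start) * j / (steps - 1)
            points.append((f1(x, y, rng), f2(x, y, rng)))
    return points

def f1(x, y, rng):
    # Illustrative function mixing a random coefficient with fixed terms.
    return rng.uniform(-1, 1) * x ** 2 - math.sin(y ** 2) + abs(y - x)

def f2(x, y, rng):
    return rng.uniform(-1, 1) * y ** 3 - math.cos(x ** 2) + 2 * x
```

Calling `generate_points(42, f1, f2)` twice returns identical points, whereas changing the seed, or swapping `f1` for another function while keeping the seed, changes the resulting composition.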

Paperid: 1754, https://arxiv.org/pdf/2504.03971.pdf

Abstract:
The SIGCHI and Social Computing research communities have been at the forefront of online safety efforts for youth, ranging from understanding the serious risks youth face online to developing evidence-based interventions for risk protection. Yet, to bring these efforts to bear, we must partner with practitioners, such as industry stakeholders who know how to bring such technologies to market, and youth service providers who work directly with youth. Therefore, we interviewed 33 stakeholders in the space of youth online safety, including industry professionals (n=12), youth service providers (n=11), and researchers (n=10) to understand where their visions toward working together to protect youth online converged and surfaced tensions, as well as how we might reconcile conflicting viewpoints to move forward as one community with synergistic expertise on how to change the current sociotechnical landscape for youth online safety. Overall, we found that non-partisan leadership is necessary to chart actionable, equitable goals to facilitate collaboration between stakeholders, combat feelings of isolation, and foster trust between the stakeholder groups. Based on these findings, we recommend the use of open-innovation methods with their inherent transparency, federated governance models, and clear but inclusive leadership structures to promote collaboration between youth online safety stakeholders. We propose the creation of an open-innovation organization that unifies the diverse voices in youth online safety to develop open-standards and evidence-based design patterns that centralize otherwise fragmented efforts that have fallen short of the goal of effective technological solutions that keep youth safe online.

Paperid: 1755, https://arxiv.org/pdf/2504.02735.pdf

Abstract:
Photoplethysmography (PPG) is a widely adopted, non-invasive technique for monitoring cardiovascular health and physiological parameters in both consumer and clinical settings. While motion artifacts in dynamic environments have been extensively studied, suboptimal skin-sensor contact in sedentary conditions - a critical yet underexplored issue - can distort PPG waveform morphology, leading to the loss or misalignment of key features and compromising sensing accuracy. In this work, we propose CP-PPG, a novel framework that transforms Contact Pressure-distorted PPG signals into high-fidelity waveforms with ideal morphology. CP-PPG integrates a custom data collection protocol, a carefully designed signal processing pipeline, and a novel deep adversarial model trained with a custom PPG-aware loss function. We validated CP-PPG through comprehensive evaluations, including 1) morphology transformation performance on our self-collected dataset, 2) downstream physiological monitoring performance on public datasets, and 3) an in-the-wild study. Extensive experiments demonstrate substantial and consistent improvements in signal fidelity (Mean Absolute Error: 0.09, 40% improvement over the original signal) as well as downstream performance across all evaluations in Heart Rate (HR), Heart Rate Variability (HRV), Respiration Rate (RR), and Blood Pressure (BP) estimation (on average, 21% improvement in HR; 41-46% in HRV; 6% in RR; and 4-5% in BP). These findings highlight the critical importance of addressing skin-sensor contact issues to enhance the reliability and effectiveness of PPG-based physiological monitoring. CP-PPG thus holds significant potential to improve the accuracy of wearable health technologies in clinical and consumer applications.

Paperid: 1756, https://arxiv.org/pdf/2504.02685.pdf

Abstract:
Out-of-Distribution (OOD) detection is a critical task in machine learning, particularly in safety-sensitive applications where model failures can have serious consequences. However, current OOD detection methods often suffer from restrictive distributional assumptions, limited scalability, and a lack of interpretability. To address these challenges, we propose STOOD-X, a two-stage methodology that combines a Statistical nonparametric Test for OOD Detection with eXplainability enhancements. In the first stage, STOOD-X uses feature-space distances and a Wilcoxon-Mann-Whitney test to identify OOD samples without assuming a specific feature distribution. In the second stage, it generates user-friendly, concept-based visual explanations that reveal the features driving each decision, aligning with the BLUE XAI paradigm. Through extensive experiments on benchmark datasets and multiple architectures, STOOD-X achieves competitive performance against state-of-the-art post hoc OOD detectors, particularly in high-dimensional and complex settings. In addition, its explainability framework enables human oversight, bias detection, and model debugging, fostering trust and collaboration between humans and AI systems. The STOOD-X methodology therefore offers a robust, explainable, and scalable solution for real-world OOD detection tasks.
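
The first stage, combining feature-space distances with a Wilcoxon-Mann-Whitney test, might look roughly like the sketch below. This is an assumption-laden illustration rather than the STOOD-X procedure: it compares a sample's k-nearest-neighbor distances against a reference set of in-distribution distances, uses the normal approximation without tie or continuity correction, and all names and defaults are hypothetical.

```python
import math

def mann_whitney_greater(x, y):
    """One-sided test that values in x tend to exceed those in y.

    Returns (U, p) using the normal approximation (no tie or continuity correction).
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability of a standard normal
    return u, p

def knn_distances(point, reference, k):
    """Euclidean distances from `point` to its k nearest reference features."""
    return sorted(math.dist(point, r) for r in reference)[:k]

def is_ood(sample, train_feats, id_distances, k=3, alpha=0.05):
    """Flag `sample` as OOD if its neighbor distances are significantly larger
    than the reference in-distribution distances."""
    _, p = mann_whitney_greater(knn_distances(sample, train_feats, k), id_distances)
    return p < alpha
```

Because the test only ranks distances, it makes no assumption about how features are distributed, which is the property the abstract emphasizes; in practice a library routine such as SciPy's `mannwhitneyu` would typically replace the hand-written approximation.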

Paperid: 1757, https://arxiv.org/pdf/2503.24037.pdf

Abstract:
Online disinformation often provokes strong anger, driving social media users to spread it; however, few measures specifically target sharing behaviors driven by this emotion to curb the spread of disinformation. This study aimed to evaluate whether digital nudges that encourage deliberation by drawing attention to emotional information can reduce sharing driven by strong anger associated with online disinformation. We focused on emotion regulation, as a method for fostering deliberation, which is activated when individuals' attention is drawn to their current emotions. Digital nudges were designed to display emotional information about disinformation and emotion regulation messages. Among these, we found that distraction and perspective-taking nudges may encourage deliberation in anger-driven sharing. To assess their effectiveness, existing nudges mimicking platform functions were used for comparison. Participant responses were measured across four dimensions: sharing intentions, type of emotion, intensity of emotion, and authenticity. The results showed that all digital nudges significantly reduced the sharing of disinformation, with distraction nudges being the most effective. These findings suggest that digital nudges addressing emotional responses can serve as an effective intervention against the spread of disinformation driven by strong anger.

Paperid: 1758, https://arxiv.org/pdf/2503.22995.pdf

Abstract:
In this position paper, we discuss the paradigm shift that moves away from parental mediation approaches toward collaborative approaches to promote adolescents' online safety. We present empirical studies that highlight the limitations of traditional parental control models and advocate for collaborative, community-driven solutions that prioritize teen empowerment. Specifically, we explore how extending oversight beyond the immediate family to include trusted community members can provide crucial support for teens in managing their online lives. We discuss the potential benefits and challenges of this expanded approach, emphasizing the importance of granular privacy controls and reciprocal support within these networks. Finally, we pose open questions for the research community to consider during the workshop, focusing on the design of "teen-centered" online safety solutions that foster autonomy, awareness, and self-regulation.

Paperid: 1759, https://arxiv.org/pdf/2503.22993.pdf

Abstract:
Youth, while tech-savvy and highly active on social media, are still vulnerable to online privacy and security risks. Therefore, it is critical to understand how they negotiate and manage social connections versus protecting themselves in online contexts. In this work, we conducted a thematic analysis of 1,318 private conversations on Instagram from 149 youth aged 13-21 to understand the digital privacy and security topics they discussed, if and how they engaged in risky privacy behaviors, and how they balanced the benefits and risks (i.e., privacy calculus) of making these decisions. Overall, youth were forthcoming when broaching a wide range of topics on digital privacy and security, ranging from password management and account access challenges to shared experiences of being victims of privacy risks. However, they also openly engaged in risky behaviors, such as sharing personal account information with peers and even perpetrating privacy and security risks against others. Nonetheless, we found many of these behaviors could be explained by the unique "privacy calculus" of youth, where they often prioritized social benefits over potential risks; for instance, youth often shared account credentials with peers to foster social connection and affirmation. As such, we provide a nuanced understanding of youth decision-making regarding digital security and privacy, highlighting positive behaviors, tensions, and points of concern. We encourage future research to continue to challenge the potentially untrue narratives regarding youth and their digital privacy and security to unpack the nuance of their privacy calculus that may differ from that of adults.

Paperid: 1760, https://arxiv.org/pdf/2503.22946.pdf

Abstract:
Data-driven storytelling has gained prominence in journalism and other data reporting fields. However, the process of creating these stories remains challenging, often requiring the integration of effective visualizations with compelling narratives to form a cohesive, interactive presentation. To help streamline this process, we present an integrated authoring framework and system, DataWeaver, that supports both visualization-to-text and text-to-visualization composition. DataWeaver enables users to create data narratives anchored to data facts derived from "call-out" interactions, i.e., user-initiated highlights of visualization elements that prompt relevant narrative content. In addition to this "vis-to-text" composition, DataWeaver also supports a "text-initiated" approach, generating relevant interactive visualizations from existing narratives. Key findings from an evaluation with 13 participants highlighted the utility and usability of DataWeaver and the effectiveness of its integrated authoring framework. The evaluation also revealed opportunities to enhance the framework by refining filtering mechanisms and visualization recommendations, and to better support authoring creativity by introducing advanced customization options.

Paperid: 1761, https://arxiv.org/pdf/2503.22250.pdf

Abstract:
Effective patient communication is pivotal in healthcare, yet traditional medical training often lacks exposure to diverse, challenging interpersonal dynamics. To bridge this gap, this study proposes the use of Large Language Models (LLMs) to simulate authentic patient communication styles, specifically the "accuser" and "rationalizer" personas derived from the Satir model, while also ensuring multilingual applicability to accommodate diverse cultural contexts and enhance accessibility for medical professionals. Leveraging advanced prompt engineering, including behavioral prompts, author's notes, and stubbornness mechanisms, we developed virtual patients (VPs) that embody nuanced emotional and conversational traits. Medical professionals evaluated these VPs, rating their authenticity (accuser: $3.8 \pm 1.0$; rationalizer: $3.7 \pm 0.8$ on a 5-point Likert scale (from one to five)) and correctly identifying their styles. Emotion analysis revealed distinct profiles: the accuser exhibited pain, anger, and distress, while the rationalizer displayed contemplation and calmness, aligning with the predefined, detailed patient descriptions, including medical history. Sentiment scores (on a scale from zero to nine) further validated these differences in the communication styles, with the accuser adopting a negative ($3.1 \pm 0.6$) and the rationalizer a more neutral ($4.0 \pm 0.4$) tone. These results underscore LLMs' capability to replicate complex communication styles, offering transformative potential for medical education. This approach equips trainees to navigate challenging clinical scenarios by providing realistic, adaptable patient interactions, enhancing empathy and diagnostic acumen. Our findings advocate for AI-driven tools as scalable, cost-effective solutions to cultivate nuanced communication skills, setting a foundation for future innovations in healthcare training.

Paperid: 1762, https://arxiv.org/pdf/2503.21997.pdf

Abstract:
Autonomous navigation robots can increase the independence of blind people but often limit user control, following what is called in Japanese an "omakase" approach where decisions are left to the robot. This research investigates ways to enhance user control in social robot navigation, based on two studies conducted with blind participants. The first study, involving structured interviews (N=14), identified crowded spaces as key areas with significant social challenges. The second study (N=13) explored navigation tasks with an autonomous robot in these environments and identified design strategies across different modes of autonomy. Participants preferred an active role, termed the "boss" mode, where they managed crowd interactions, while the "monitor" mode helped them assess the environment, negotiate movements, and interact with the robot. These findings highlight the importance of shared control and user involvement for blind users, offering valuable insights for designing future social navigation robots.

Paperid: 1763, https://arxiv.org/pdf/2503.20262.pdf

Abstract:
This study examines how public discourse around COVID-19 unfolded on Twitter through the lens of crisis communication and digital publics. Analyzing over 275,000 tweets involving the CDC, we identify 16 distinct discourse clusters shaped by framing, sentiment, credibility, and network dynamics. We find that CDC messaging became a flashpoint for affective and ideological polarization, with users aligning along competing frames of science vs. freedom, and public health vs. political overreach. Most clusters formed echo chambers, while a few enabled cross-cutting dialogue. Publics emerged not only around ideology but also around topical and emotional stakes, reflecting shifting concerns across different stages of the pandemic. While marginalized communities raised consistent equity concerns, these narratives struggled to reshape broader discourse. Our findings highlight the importance of long-term, adaptive engagement with diverse publics and propose design interventions, such as multi-agent AI assistants, to support more inclusive communication throughout extended public health crises.

Paperid: 1764, https://arxiv.org/pdf/2503.20229.pdf

Abstract:
This study proposes a UI generation method based on a diffusion model, aiming to achieve high-quality, diversified, and personalized interface design through generative artificial intelligence technology. The diffusion model generates output through a step-by-step denoising process. By combining the conditional generation mechanism, design optimization module, and user feedback mechanism, the model can generate a UI that meets the requirements based on multimodal inputs such as text descriptions and sketches provided by users. In the study, a complete experimental evaluation framework was designed, and mainstream generation models (such as GAN, VAE, and DALL-E) were selected for comparative experiments. The generation results were quantitatively analyzed using indicators such as PSNR, SSIM, and FID. The results show that the model proposed in this study is superior to other models in terms of generation quality and user satisfaction, especially in terms of logical clarity of information transmission and visual aesthetics. The ablation experiment further verifies the key role of conditional generation and design optimization modules in improving interface quality. This study provides a new technical path for UI design automation and lays the foundation for the intelligent and personalized development of human-computer interaction interfaces. In the future, the application potential of the model in virtual reality, game design, and other fields will be further explored.
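
Of the indicators named in this abstract, PSNR is the simplest to state exactly: it is 10 * log10(MAX^2 / MSE), where MSE is the mean squared pixel difference between a reference image and a generated one. The following is a minimal sketch of that standard definition, not the authors' evaluation code. It assumes grayscale images given as nested lists of pixel values, and the function name `psnr` and the default peak value of 255 are illustrative choices.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized grayscale images.

    Images are nested lists of pixel values (rows of numbers); this layout is an
    assumption made for the sketch, not something the abstract specifies.
    """
    if len(reference) != len(generated) or any(
        len(row_r) != len(row_g) for row_r, row_g in zip(reference, generated)
    ):
        raise ValueError("images must have the same shape")

    squared_errors = [
        (r - g) ** 2
        for row_r, row_g in zip(reference, generated)
        for r, g in zip(row_r, row_g)
    ]
    if not squared_errors:
        raise ValueError("images must not be empty")

    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the generated image is closer to the reference pixel by pixel. It says nothing about layout logic or aesthetics, which is presumably why the abstract pairs it with SSIM, FID, and user satisfaction.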

Paperid: 1765, https://arxiv.org/pdf/2503.20089.pdf

Abstract:
We present MatplotAlt, an open-source Python package for easily adding alternative text to Matplotlib figures. MatplotAlt equips Jupyter notebook authors to automatically generate and surface chart descriptions with a single line of code or command, and supports a range of options that allow users to customize the generation and display of captions based on their preferences and accessibility needs. Our evaluation indicates that MatplotAlt's heuristic and LLM-based methods to generate alt text can create accurate long-form descriptions of both simple univariate and complex Matplotlib figures. We find that state-of-the-art LLMs still struggle with factual errors when describing charts, and improve the accuracy of our descriptions by prompting GPT4-turbo with heuristic-based alt text or data tables parsed from the Matplotlib figure.
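
To make the idea of heuristic alt text concrete, here is a minimal sketch of a rule-based description generator for a single line chart. It does not use MatplotAlt's API or a Matplotlib figure object: the function `describe_line_chart` and its parameters are hypothetical, and the chart is passed in as plain lists so the example stays self-contained.

```python
def describe_line_chart(title, x_label, y_label, xs, ys):
    """Build a short text description of a single-series line chart.

    Hypothetical helper for illustration only; a real tool would read these
    values from the figure object instead of taking them as arguments.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and the same length")

    # Locate the extremes so the description can state where they occur.
    i_min = min(range(len(ys)), key=ys.__getitem__)
    i_max = max(range(len(ys)), key=ys.__getitem__)

    # Crude overall trend: compare only the first and last values.
    if ys[-1] > ys[0]:
        trend = "rises"
    elif ys[-1] < ys[0]:
        trend = "falls"
    else:
        trend = "stays level"

    return (
        f"A line chart titled '{title}' with {x_label} on the x-axis and "
        f"{y_label} on the y-axis. It contains {len(ys)} points. "
        f"The minimum {y_label} is {ys[i_min]} at {x_label} {xs[i_min]}; "
        f"the maximum {y_label} is {ys[i_max]} at {x_label} {xs[i_max]}. "
        f"Overall, {y_label} {trend} from {ys[0]} to {ys[-1]}."
    )
```

A description like this is built only from values read out of the chart data, which is the property the abstract relies on when it feeds heuristic alt text or parsed data tables to an LLM to reduce factual errors.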

Paperid: 1766, https://arxiv.org/pdf/2503.17639.pdf

Abstract:
We introduce BeaMsteerX (BMX), a novel mmWave hand hygiene gesture recognition technique that improves accuracy at longer ranges (1.5m). BMX steers a mmWave beam towards multiple directions around the subject, generating multiple views of the gesture that are then intelligently combined using deep learning to enhance gesture classification. We evaluated BMX using off-the-shelf mmWave radars and collected a total of 7,200 hand hygiene gesture samples from 10 subjects performing a six-step hand-rubbing procedure, as recommended by the World Health Organization, using sanitizer, at 1.5m -- over five times longer than in prior works. BMX outperforms state-of-the-art approaches by 31--43% and achieves 91% accuracy at boresight by combining only two beams, demonstrating superior gesture classification in low SNR scenarios. BMX maintained its effectiveness even when the subject was positioned 30 degrees away from the boresight, exhibiting a modest 5% drop in accuracy.

Paperid: 1768, https://arxiv.org/pdf/2503.16471.pdf

Abstract:
Brain-Computer Interface (BCI) technology facilitates direct communication between the human brain and external devices, representing a substantial advancement in human-machine interaction. This review provides an in-depth analysis of various BCI paradigms, including classic paradigms, current classifications, and hybrid paradigms, each with distinct characteristics and applications. Additionally, we explore a range of signal acquisition methods, classified into non-implantation, intervention, and implantation techniques, elaborating on their principles and recent advancements. By examining the interdependence between paradigms and signal acquisition technologies, this review offers a comprehensive perspective on how innovations in one domain propel progress in the other. The goal is to present insights into the future development of more efficient, user-friendly, and versatile BCI systems, emphasizing the synergy between paradigm design and signal acquisition techniques and their potential to transform the field.

Paperid: 1769, https://arxiv.org/pdf/2503.16466.pdf

Abstract:
In this short paper we address issues related to building multimodal AI systems for human performance support in manufacturing domains. We make two contributions: we first identify challenges of participatory design and training of such systems, and secondly, to address such challenges, we propose the ACE paradigm: "Action and Control via Explanations". Specifically, we suggest that LLMs can be used to produce explanations in the form of human interpretable "semantic frames", which in turn enable end users to provide data the AI system needs to align its multimodal models and representations, including computer vision, automatic speech recognition, and document inputs. ACE, by using LLMs to "explain" using semantic frames, will help the human and the AI system to collaborate, together building a more accurate model of human activities and behaviors, and ultimately more accurate predictive outputs for better task support, and better outcomes for human users performing manual tasks.

Paperid: 1770, https://arxiv.org/pdf/2503.16445.pdf

Abstract:
In an era where black-box AI models are integral to decision-making across industries, robust methods for explaining these models are more critical than ever. While these models leverage complex feature interplay for accurate predictions, most explanation methods only assign relevance to individual features. There is a research gap in methods that effectively illustrate interactions between features, especially in visualizing higher-order interactions involving multiple features, which challenge conventional representation methods. To address this challenge in local explanations focused on individual instances, we employ a visual, subset-based approach to reveal relevant feature interactions. Our visual analytics tool FINCH uses coloring and highlighting techniques to create intuitive, human-centered visualizations, and provides additional views that enable users to calibrate their trust in the model and explanations. We apply FINCH in multiple case studies, demonstrating its generalizability, and conducted an extensive human study with machine learning experts to highlight its helpfulness and usability. With this approach, FINCH allows users to visualize feature interactions involving any number of features locally.

Paperid: 1771, https://arxiv.org/pdf/2503.16444.pdf

Abstract:
Explainable AI (XAI) aims to provide insights into the decisions made by AI models. To date, most XAI approaches provide only one-time, static explanations, which cannot cater to users' diverse knowledge levels and information needs. Conversational explanations have been proposed as an effective method to customize XAI explanations. However, building conversational explanation systems is hindered by the scarcity of training data. Training with synthetic data faces two main challenges: lack of data diversity and hallucination in the generated data. To alleviate these issues, we introduce a repetition penalty to promote data diversity and exploit a hallucination detector to filter out untruthful synthetic conversation turns. We conducted both automatic and human evaluations on the proposed system, fEw-shot Multi-round ConvErsational Explanation (EMCEE). For automatic evaluation, EMCEE achieves relative improvements of 81.6% in BLEU and 80.5% in ROUGE compared to the baselines. EMCEE also mitigates the degeneration of data quality caused by training on synthetic data. In human evaluations (N=60), EMCEE outperforms baseline models and the control group in improving users' comprehension, acceptance, trust, and collaboration with static explanations by large margins. Through a fine-grained analysis of model responses, we further demonstrate that training on self-generated synthetic data improves the model's ability to generate more truthful and understandable answers, leading to better user interactions. To the best of our knowledge, this is the first conversational explanation method that can answer free-form user questions following static explanations.
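
The abstract does not say which form of repetition penalty EMCEE uses. As an illustration of the general idea only, the sketch below shows one widely used variant, in which the logits of already generated tokens are scaled down before the next token is sampled. The function name `apply_repetition_penalty`, the list-of-floats logit representation, and the default penalty of 1.2 are assumptions for this example, not details from the paper.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with previously generated tokens made less likely.

    Assumed variant: positive logits are divided by `penalty` and negative
    logits are multiplied by it, so the score always moves downward.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")

    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted
```

With a penalty above 1, tokens that have already appeared become less likely to be sampled again, which pushes synthetic conversations toward more varied wording. Filtering untruthful turns would be a separate step that needs a trained hallucination detector and is not sketched here.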

Paperid: 1772, https://arxiv.org/pdf/2503.15936.pdf

Abstract:
Virtual reality (VR) users often encounter interruptions, posing challenges to maintaining real-world awareness during immersive experiences. The Passthrough feature in VR headsets allows users to view their physical surroundings without removing the headset. However, when interruptions come from the rear, users need to turn their heads to see the real world, which can lead to negative experiences in VR. Study 1, conducted through semi-structured interviews involving 13 participants, found that users are less likely to use Passthrough for rear interruptions due to large head-turning movements, which cause inconvenience, increase the risk of motion sickness, and reduce the experience. Building on these findings, we introduced three Passthrough techniques in Study 2 for displaying the rear view in front of the user: Full Rear Passthrough + Pause (FRPP), Rear Passthrough Window (RPW), and Rear Passthrough AR (RPAR). Compared to the Baseline method that requires head-turning, all three systems reduced physical and temporal demands, alleviated disorientation caused by motion sickness, and provided a better user experience for managing rear interruptions. Among these, FRPP and RPAR were the most preferred. These findings provide valuable insights for future VR design, emphasizing the need for solutions that effectively manage rear interruptions while maintaining user comfort and experience.

Paperid: 1773, https://arxiv.org/pdf/2503.15518.pdf

Abstract:
We present a novel framework for designing emotionally agile robots with dynamic personalities and memory-based learning, with the aim of performing adaptive and non-deterministic interactions with humans while conforming to shared social understanding. While existing work has largely focused on emotion recognition and static response systems, many approaches rely on sentiment analysis and action mapping frameworks that are pre-defined with limited dimensionality and fixed configurations, lacking the flexibility of dynamic personality traits and memory-enabled adaptation. Other systems are often restricted to limited modes of expression and fail to develop a causal relationship between human behavior and the robot's proactive physical actions, resulting in constrained adaptability and reduced responsiveness in complex, dynamic interactions. Our methodology integrates the Big Five Personality Traits, Appraisal Theory, and abstracted memory layers through Large Language Models (LLMs). The LLM generates a parameterized robot personality based on the Big Five, processes human language and sentiments, evaluates human behavior using Appraisal Theory, and generates emotions and selects appropriate actions adapted by historical context over time. We validated the framework by testing three robots with distinct personalities in identical background contexts and found that personality, appraisal, and memory influence the adaptability of human-robot interactions. The impact of the individual components was further validated through ablation tests. We conclude that this system enables robots to engage in meaningful and personalized interactions with users, and holds significant potential for applications in domains such as pet robots, assistive robots, educational robots, and collaborative functional robots, where cultivating tailored relationships and enriching user experiences are essential.

Paperid: 1774, https://arxiv.org/pdf/2503.15124.pdf

Abstract:
Despite advances in Automatic Speech Recognition (ASR), transcription errors persist and require manual correction. Confidence scores, which indicate the certainty of ASR results, could assist users in identifying and correcting errors. This study evaluates the reliability of confidence scores for error detection through a comprehensive analysis of end-to-end ASR models and a user study with 36 participants. The results show that while confidence scores correlate with transcription accuracy, their error detection performance is limited. Classifiers frequently miss errors or generate many false positives, undermining their practical utility. Confidence-based error detection neither improved correction efficiency nor was perceived as helpful by participants. These findings highlight the limitations of confidence scores and the need for more sophisticated approaches to improve user interaction and explainability of ASR results.
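
The trade-off this abstract describes, missed errors versus false positives, can be seen in a minimal sketch of threshold-based error detection. Words whose confidence falls below a threshold are flagged as likely errors, and the flags are scored against ground truth with precision and recall. The function names, the threshold, and the toy values in the usage note are made up for illustration and are not data from the study.

```python
def detect_errors(confidences, threshold):
    """Flag each word as a suspected error when its confidence is below threshold."""
    return [c < threshold for c in confidences]


def precision_recall(flags, is_error):
    """Score the flags against ground-truth error labels.

    Returns (precision, recall); either is 0.0 when its denominator is zero.
    """
    if len(flags) != len(is_error):
        raise ValueError("flags and is_error must have the same length")

    true_pos = sum(1 for f, e in zip(flags, is_error) if f and e)
    false_pos = sum(1 for f, e in zip(flags, is_error) if f and not e)
    false_neg = sum(1 for f, e in zip(flags, is_error) if not f and e)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

For example, with toy confidences `[0.95, 0.40, 0.80, 0.55, 0.90]`, ground-truth errors at the second and third words, and a threshold of 0.6, the detector flags the second and fourth words. It catches one error, misses the other (which had high confidence), and raises one false alarm, so precision and recall are both 0.5. Raising the threshold catches more errors at the cost of more false positives.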

Paperid: 1775, https://arxiv.org/pdf/2503.15120.pdf

Abstract:
Communication access real-time translation (CART) is an essential accessibility service for d/Deaf and hard of hearing (DHH) individuals, but the cost and scarcity of trained personnel limit its availability. While Automatic Speech Recognition (ASR) offers a cheap and scalable alternative, transcription errors can lead to serious accessibility issues. Real-time correction of ASR by non-professionals presents an under-explored CART workflow that addresses these limitations. We conducted a user study with 75 participants to evaluate the feasibility and efficiency of this workflow. In addition, we held focus groups with 25 DHH individuals to identify acceptable accuracy levels and factors affecting the accessibility of real-time captioning. Results suggest that collaborative editing can improve transcription accuracy to the extent that DHH users rate it positively regarding understandability. Focus groups also showed that human effort to improve captioning is highly valued, supporting a semi-automated approach as an alternative to stand-alone ASR and traditional CART services.

Paperid: 1776, https://arxiv.org/pdf/2503.14725.pdf

Abstract:
Automating a production line with robotic arms is a complex, demanding task that requires not only substantial resources but also a deep understanding of the automated processes and available technologies and tools. Expert integrators must consider factors such as placement, payload, and robot reach requirements to determine the feasibility of automation. Ideally, such considerations are based on a detailed digital simulation developed before any hardware is deployed. However, this process is often time-consuming and challenging. To simplify these processes, we introduce a much simpler method for the feasibility analysis of robotic arms' reachability, designed for non-experts. We implement this method through a mobile, sensing-based prototype tool. The two-step experimental evaluation included the expert user study results, which helped us identify the difficulty levels of various deployment scenarios and refine the initial prototype. The results of the subsequent quantitative study with 22 non-expert participants utilizing both scenarios indicate that users could complete both simple and complex feasibility analyses in under ten minutes, exhibiting similar cognitive loads and high engagement. Overall, the results suggest that the tool was well-received and rated as highly usable, thereby showing a new path for changing the ease of feasibility analysis for automation.

Paperid: 1777, https://arxiv.org/pdf/2503.14522.pdf

Abstract:
We argue that there is a need for Accessibility to be represented in several important domains:
- Capitalize on the new capabilities AI provides
- Support for open source development of AI, which can allow disabled and disability focused professionals to contribute, including
  - Development of Accessibility Apps which help realise the promise of AI in accessibility domains
  - Open Source Model Development and Validation to ensure that accessibility concerns are addressed in these algorithms
  - Data Augmentation to include accessibility in data sets used to train models
  - Accessible Interfaces that allow disabled people to use any AI app, and to validate its outputs
  - Dedicated Functionality and Libraries that can make it easy to integrate AI support into a variety of settings and apps.
- Data security and privacy risks including data collected by AI based accessibility technologies; and the possibility of disability disclosure.
- Disability-specific AI risks and biases including both direct bias (during AI use by the disabled person) and indirect bias (when AI is used by someone else on data relating to a disabled person).

Paperid: 1778, https://arxiv.org/pdf/2503.12085.pdf

Abstract:
Effective traffic incident management is essential for ensuring safety, minimizing congestion, and reducing response times in emergency situations. Traditional highway incident management relies heavily on radio room operators, who must make rapid, informed decisions in high-stakes environments. This paper proposes an innovative solution to support and enhance these decisions by integrating Large Language Models (LLMs) into a decision-support system for traffic incident management. We introduce two approaches: (1) an LLM + Optimization hybrid that leverages both the flexibility of natural language interaction and the robustness of optimization techniques, and (2) a Full LLM approach that autonomously generates decisions using only LLM capabilities. We tested our solutions using historical event data from Autostrade per l'Italia. Experimental results indicate that while both approaches show promise, the LLM + Optimization solution demonstrates superior reliability, making it particularly suited to critical applications where consistency and accuracy are paramount. This research highlights the potential for LLMs to transform highway incident management by enabling accessible, data-driven decision-making support.

Paperid: 1779, https://arxiv.org/pdf/2503.10915.pdf

Abstract:
Extended reality (XR) devices have become ubiquitous. They are equipped with arrays of sensors, collecting extensive user and environmental data, allowing inferences about sensitive user information users may not realize they are sharing. Current VR privacy notices largely replicate mechanisms from 2D interfaces, failing to leverage the unique affordances of virtual 3D environments. To address this, we conducted brainstorming and sketching sessions with novice game developers and designers, followed by privacy expert evaluations, to explore and refine privacy interfaces tailored for VR. Key challenges include balancing user engagement with privacy awareness, managing complex privacy information with user comprehension, and maintaining compliance and trust. We identify design implications such as thoughtful gamification, explicit and purpose-tied consent mechanisms, and granular, modifiable privacy control options. Our findings provide actionable guidance to researchers and practitioners for developing privacy-aware and user-friendly VR experiences.

Paperid: 1780, https://arxiv.org/pdf/2503.09060.pdf

Abstract:
MOBA (Multiplayer Online Battle Arena) games require a delicate interplay of strategic planning and real-time decision-making, particularly in professional esports, where players exhibit varying levels of skill and strategic insight. While team strategies have been widely studied, analyzing inconsistencies in professional matches remains a significant challenge. The complexity lies in defining and quantifying the difference between real-time and preferred professional strategies, as well as understanding the disparities between them. Establishing direct causal links between specific strategic decisions and game outcomes also demands a comprehensive analysis of the entire match progression. To tackle these challenges, we present the StratIncon Detector, a visual analytics system designed to assist professional players and coaches in efficiently identifying strategic inconsistencies. The system detects real-time strategies, predicts preferred professional strategies, extracts relevant human factors, and uncovers their impact on subsequent game phases. Findings from a case study, a user study with 24 participants, and expert interviews suggest that, compared to traditional methods, the StratIncon Detector enables users to more comprehensively and efficiently identify inconsistencies, infer their causes, evaluate their effects on subsequent game outcomes, and gain deeper insights into team collaboration, ultimately enhancing future teamwork.

Paperid: 1781, https://arxiv.org/pdf/2503.08582.pdf

Abstract:
Surveys are a widespread method for collecting data at scale, but their rigid structure often limits the depth of qualitative insights obtained. While interviews naturally yield richer responses, they are challenging to conduct across diverse locations and large participant pools. To partially bridge this gap, we investigate the potential of using LLM-based chatbots to support qualitative data collection through interview probes embedded in surveys. We assess four theory-based interview probes: descriptive, idiographic, clarifying, and explanatory. Through a split-plot study design (N=64), we compare the probes' impact on response quality and user experience across three key stages of HCI research: exploration, requirements gathering, and evaluation. Our results show that probes facilitate the collection of high-quality survey data, with specific probes proving effective at different research stages. We contribute practical and methodological implications for using chatbots as research tools to enrich qualitative data collection.

Paperid: 1782, https://arxiv.org/pdf/2503.07892.pdf

Abstract:
Recovering from crises, such as hurricanes or wildfires, is a complex process that can take weeks, months, or even decades to overcome. Crises have both acute (immediate) and chronic (long-term) effects on communities. Crisis informatics research often focuses on the immediate response phase of disasters, thereby overlooking the long-term recovery phase, which is critical for understanding the information needs of users undergoing challenges like climate gentrification and housing inequity. We fill this gap by investigating community discourse over eight months following Hurricane Ida in an online neighborhood Facebook group and Town Hall Meetings of a borough in the New York Metropolitan region. Using a mixed methods approach, we examined the use of social media to manage long-term disaster recovery. The findings revealed a significant overlap in topics, underscoring the interconnected nature of online and offline community discourse, and illuminated themes related to the long-term consequences of disasters. We conclude with recommendations aimed at helping designers and government leaders enhance participation across community forums and support recovery in the aftermath of disasters.

Paperid: 1783, https://arxiv.org/pdf/2503.07782.pdf

Abstract:
The overview-detail design pattern, characterized by an overview of multiple items and a detailed view of a selected item, is ubiquitously implemented across software interfaces. Designers often try to account for all users, but ultimately these interfaces settle on a single form. For instance, an overview map may display hotel prices but omit other user-desired attributes. This research instead explores the malleable overview-detail interface, one that end-users can customize to address individual needs. Our content analysis of overview-detail interfaces uncovered three dimensions of variation: content, composition, and layout, enabling us to develop customization techniques along these dimensions. For content, we developed Fluid Attributes, a set of techniques enabling users to show and hide attributes between views and leverage AI to manipulate, reformat, and generate new attributes. For composition and layout, we provided solutions to compose multiple overviews and detail views and transform between various overview and overview-detail layouts. A user study on our techniques implemented in two design probes revealed that participants produced diverse customizations and unique usage patterns, highlighting the need and broad applicability for malleable overview-detail interfaces.

Paperid: 1784, https://arxiv.org/pdf/2503.07622.pdf

Abstract:
Detecting robot failures during collaborative tasks is crucial for maintaining trust in human-robot interactions. This study investigates user gaze behaviour as an indicator of robot failures, utilising machine learning models to distinguish between non-failure and two types of failures: executional and decisional. Eye-tracking data were collected from 26 participants collaborating with a robot on Tangram puzzle-solving tasks. Gaze metrics, such as average gaze shift rates and the probability of gazing at specific areas of interest, were used to train machine learning classifiers, including Random Forest, AdaBoost, XGBoost, SVM, and CatBoost. The results show that Random Forest achieved 90% accuracy for detecting executional failures and 80% for decisional failures using the first 5 seconds of failure data. Real-time failure detection was evaluated by segmenting gaze data into intervals of 3, 5, and 10 seconds. These findings highlight the potential of gaze dynamics for real-time error detection in human-robot collaboration.
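
The feature extraction step described above can be sketched as follows: take the first few seconds of gaze samples, then compute the gaze shift rate and the probability of gazing at each area of interest (AOI). This is an illustrative sketch rather than the authors' pipeline. It assumes gaze arrives as a list of AOI labels sampled at a fixed rate, and the function name and the AOI names in the usage note are hypothetical.

```python
from collections import Counter


def gaze_features(aoi_labels, sample_rate_hz, window_s):
    """Compute gaze metrics over the first `window_s` seconds of samples.

    `aoi_labels` holds one AOI label per gaze sample, recorded at
    `sample_rate_hz` samples per second (an assumed input format).
    """
    n_samples = int(sample_rate_hz * window_s)
    window = aoi_labels[:n_samples]
    if not window:
        raise ValueError("window contains no gaze samples")

    # A gaze shift is any change of AOI between consecutive samples.
    shifts = sum(1 for prev, cur in zip(window, window[1:]) if prev != cur)
    duration_s = len(window) / sample_rate_hz

    counts = Counter(window)
    aoi_prob = {aoi: count / len(window) for aoi, count in counts.items()}

    return {"shift_rate": shifts / duration_s, "aoi_prob": aoi_prob}
```

Calling `gaze_features(labels, sample_rate_hz, 5)` corresponds to the 5-second setting, and 3 or 10 give the other interval lengths in the abstract. The resulting values would then be assembled into a feature vector for a classifier such as Random Forest, which is not shown here.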

Paperid: 1785, https://arxiv.org/pdf/2503.07319.pdf

Abstract:
The key to robot-assisted rehabilitation lies in the design of the human-machine interface, which must accommodate the needs of both patients and machines. Current interface designs primarily focus on machine control algorithms, often requiring patients to spend considerable time adapting. In this paper, we introduce a novel approach based on the Cooperative Adaptive Markov Decision Process (CAMDPs) model to address the fundamental aspects of the interactive learning process, offering theoretical insights and practical guidance. We establish sufficient conditions for the convergence of CAMDPs and ensure the uniqueness of Nash equilibrium points. Leveraging these conditions, we guarantee the system's convergence to a unique Nash equilibrium point. Furthermore, we explore scenarios with multiple Nash equilibrium points, devising strategies to adjust both Value Evaluation and Policy Improvement algorithms to enhance the likelihood of converging to the global minimal Nash equilibrium point. Through numerical experiments, we illustrate the effectiveness of the proposed conditions and algorithms, demonstrating their applicability and robustness in practical settings. The proposed conditions for convergence and the identification of a unique optimal Nash equilibrium contribute to the development of more effective adaptive systems for human users in robot-assisted rehabilitation.

Paperid: 1786, https://arxiv.org/pdf/2503.06105.pdf

Abstract:
In-game friend recommendations significantly impact player retention and sustained engagement in online games. Balancing similarity and diversity in recommendations is crucial for fostering stronger social bonds across diverse player groups. However, automated recommendation systems struggle to achieve this balance, especially as player preferences evolve over time. To tackle this challenge, we introduce Prefer2SD (derived from Preference to Similarity and Diversity), an iterative, human-in-the-loop approach designed to optimize the similarity-diversity (SD) ratio in friend recommendations. Developed in collaboration with a local game company, Prefer2SD leverages a visual analytics system to help experts explore, analyze, and adjust friend recommendations dynamically, incorporating players' shifting preferences. The system employs interactive visualizations that enable experts to fine-tune the balance between similarity and diversity for distinct player groups. We demonstrate the efficacy of Prefer2SD through a within-subjects study (N=12), a case study, and expert interviews, showcasing its ability to enhance in-game friend recommendations and offering insights for the broader field of personalized recommendation systems.

Paperid: 1787, https://arxiv.org/pdf/2503.06098.pdf

Abstract:
Indexical storytelling is gaining popularity in video games, where the narrative unfolds through fragmented clues. This approach fosters player-generated content and discussion, as story interpreters piece together the overarching narrative from these scattered elements. However, the fragmented and non-linear nature of the clues makes systematic categorization and interpretation challenging, potentially hindering efficient story reconstruction and creative engagement. To address these challenges, we first proposed a hierarchical taxonomy to categorize narrative clues, informed by a formative study. Using this taxonomy, we designed ClueCart, a creativity support tool aimed at enhancing creators' ability to organize story clues and facilitate intricate story interpretation. We evaluated ClueCart through a between-subjects study (N=40), using Miro as a baseline. The results showed that ClueCart significantly improved creators' efficiency in organizing and retrieving clues, thereby better supporting their creative processes. Additionally, we offer design insights for future studies focused on player-centric narrative analysis.

Paperid: 1788, https://arxiv.org/pdf/2503.05926.pdf

Abstract:
While human-AI collaboration has been a longstanding goal and topic of study for computational research, the emergence of increasingly naturalistic generative AI language models has greatly inflected the trajectory of such research. In this paper we identify how, given the language capabilities of generative AI, common features of human-human collaboration derived from the social sciences can be applied to the study of human-computer interaction. We provide insights drawn from interviews with industry personnel working on building human-AI collaboration systems, as well as our collaborations with end-users to build a multimodal AI assistant for task support.

Paperid: 1789, https://arxiv.org/pdf/2503.00248.pdf

Abstract:
Despite the growing interest in collaborative AI, designing systems that seamlessly integrate human input remains a major challenge. In this study, we developed a task to systematically examine human preferences for collaborative agents. We created and evaluated five collaborative AI agents with strategies that differ in the manner and degree to which they adapt to human actions. Participants interacted with a subset of these agents, evaluated their perceived traits, and selected their preferred agent. We used a Bayesian model to understand how agents' strategies influence human-AI team performance, the AI's perceived traits, and the factors shaping human preferences in pairwise agent comparisons. Our results show that agents that are more considerate of human actions are preferred over purely performance-maximizing agents. Moreover, we show that such human-centric design can improve the likability of AI collaborators without reducing performance. We find evidence for inequality-aversion effects being a driver of human choices, suggesting that people prefer collaborative agents that allow them to meaningfully contribute to the team. Taken together, these findings demonstrate how collaboration with AI can benefit from development efforts that include both subjective and objective metrics.

Paperid: 1790, https://arxiv.org/pdf/2503.00149.pdf

Abstract:
Tactile charts are essential for conveying data to blind and low vision (BLV) readers but are difficult for designers to construct. Non-expert designers face barriers to entry due to complex guidelines, while experts struggle with fragmented and time-consuming workflows that involve extensive customization. Inspired by formative interviews with expert tactile graphics designers, we created Tactile Vega-Lite (TVL): an extension of Vega-Lite that offers tactile-specific abstractions and synthesizes existing guidelines into a series of smart defaults. Predefined stylistic choices enable non-experts to produce guideline-compliant tactile charts quickly. Expert users can override defaults to tailor customizations for their intended audience. In a user study with 12 tactile graphics creators, we show that Tactile Vega-Lite enhances flexibility and consistency by automating tasks like adjusting spacing and translating braille while accelerating iterations through pre-defined textures and line styles. Through expert critique, we also learn more about tactile chart design best practices and design decisions.

Paperid: 1791, https://arxiv.org/pdf/2502.20658.pdf

Abstract:
Individuals with severe mental illnesses (SMI), particularly schizophrenia, experience complex and intense emotions frequently. They increasingly turn to vlogging as an authentic medium for emotional disclosure and online support-seeking. While previous research has primarily focused on text-based disclosure, little is known about how people construct narratives around emotions and emotional experiences through video blogs. Our study analyzed 401 YouTube videos created by schizophrenia vloggers, revealing that vloggers disclosed their fear, sadness, and joy through verbal narration by explicit expressions or storytelling. Visually, they employed various framing styles, including Anonymous, Talk-to-Camera, and In-the-Moment approaches, along with diverse visual narration techniques. Notably, we uncovered a concerning 'visual appeal disparity' in audience engagement, with visually appealing videos receiving significantly more views, likes, and comments. This study discusses the role of video-sharing platforms in emotional expression and offers design implications for fostering online care-seeking for emotionally vulnerable populations.

Paperid: 1792, https://arxiv.org/pdf/2502.20513.pdf

Abstract:
The emergence of Large Language Models (LLMs) has revolutionized Conversational User Interfaces (CUIs), enabling more dynamic, context-aware, and human-like interactions across diverse domains, from social sciences to healthcare. However, the rapid adoption of LLM-based personas raises critical ethical and practical concerns, including bias, manipulation, and unforeseen social consequences. Unlike traditional CUIs, where personas are carefully designed with clear intent, LLM-based personas generate responses dynamically from vast datasets, making their behavior less predictable and harder to govern. This workshop aims to bridge the gap between CUI and broader AI communities by fostering a cross-disciplinary dialogue on the responsible design and evaluation of LLM-based personas. Bringing together researchers, designers, and practitioners, we will explore best practices, develop ethical guidelines, and promote frameworks that ensure transparency, inclusivity, and user-centered interactions. By addressing these challenges collaboratively, we seek to shape the future of LLM-driven CUIs in ways that align with societal values and expectations.

Paperid: 1793, https://arxiv.org/pdf/2502.20025.pdf

Abstract:
Socioemotional and regulation processes in learning are important. We add to the understanding of previous work on co-regulation processes in the learning sciences, extending the caregiver-child paradigm and focusing on the teacher-student relation by presenting an interactive co-regulation model and the methodology for developing empirically grounded systems for training teachers. We focus on the combination of classroom management and affect models and detail the use of a psychological model to operationalise and automate the interaction with the virtual student. We delve into an annotation scheme developed to capture teachers' subjective psychological experiences during training and how these affect their co-regulation behavior with students. This scheme contributes to understanding the role of teachers' emotional experiences and their consequences for co-regulation processes in classroom management. This research is also a contribution to developing hybrid AI systems.

Paperid: 1794, https://arxiv.org/pdf/2502.19093.pdf

Abstract:
Historical visualizations are a valuable resource for studying the history of visualization and inspecting the cultural context in which they were created. When investigating historical visualizations, it is essential to consider contributions from different cultural frameworks to gain a comprehensive understanding. While there is extensive research on historical visualizations within the European cultural framework, this work shifts the focus to ancient China, a cultural context that remains underexplored by visualization researchers. To this end, we propose a semi-automatic pipeline to collect, extract, and label historical Chinese visualizations. Through the pipeline, we curate ZuantuSet, a dataset with over 71K visualizations and 108K illustrations. We analyze distinctive design patterns of historical Chinese visualizations and their potential causes within the context of Chinese history and culture. We illustrate potential usage scenarios for this dataset, summarize the unique challenges and solutions associated with collecting historical Chinese visualizations, and outline future research directions.

Paperid: 1795, https://arxiv.org/pdf/2502.18013.pdf

Abstract:
The rise of autonomous driving technology has led to concerns about inactivity-induced fatigue. This paper explores Traditional Chinese Medicine (TCM) scents for mitigating it. Two human-involved studies were conducted in a high-fidelity driving simulator. Study 1 maps six prevalent TCM scents onto the arousal/valence circumplex to select proper candidates, i.e., argy wormwood (with the highest arousal) and tangerine peel (with the highest valence). Study 2 tests both scents in an auto-driving course. Statistics show both scents can improve driver alertness and reaction time, but should be used in different ways: argy wormwood is suitable for short-term use due to its higher intensity but poor acceptance, while tangerine peel is ideal for long-term use due to its higher likability. These findings provide insights for in-car fatigue mitigation to enhance driver safety and well-being. However, issues such as scent longevity for aromatherapy and automatic fatigue prediction remain unresolved.

Paperid: 1796, https://arxiv.org/pdf/2502.17960.pdf

Abstract:
Multi-drone systems have become transformative technologies across various industries, offering innovative applications. However, despite significant advancements, their autonomous capabilities remain inherently limited. As a result, human operators are often essential for supervising and controlling these systems, creating what is referred to as a human-multi-drone team. In realistic settings, human operators must make real-time decisions while addressing a variety of signals, such as drone statuses and sensor readings, and adapting to dynamic conditions and uncertainty. This complexity may lead to suboptimal operations, potentially compromising the overall effectiveness of the team. In critical contexts like Search And Rescue (SAR) missions, such inefficiencies can have costly consequences. This work introduces an advising agent designed to enhance collaboration in human-multi-drone teams, with a specific focus on SAR scenarios. The advising agent is designed to assist the human operator by suggesting contextual actions worth taking. To that end, the agent employs a novel computation technique that relies on a small set of human demonstrations to generate varying realistic human-like trajectories. These trajectories are then generalized using machine learning for fast and accurate predictions of the long-term effects of different advice. Through human evaluations, we demonstrate that our approach delivers high-quality assistance, resulting in significantly improved performance compared to baseline conditions.

Paperid: 1797, https://arxiv.org/pdf/2502.17829.pdf

Abstract:
Silent speech interfaces (SSI) are being actively developed to assist individuals with communication impairments who have long suffered from daily hardships and a reduced quality of life. However, silent sentences are difficult to segment and recognize due to elision and linking. A novel silent speech sentence recognition method is proposed to convert the facial motion signals collected by six-axis accelerometers into transcribed words and sentences. A Conformer-based neural network with the Connectionist Temporal Classification (CTC) algorithm is used to gain contextual understanding and translate the non-acoustic signals into word sequences, requiring only the constituent words in the database. Test results show that the proposed method achieves 97.17% accuracy in sentence recognition, surpassing existing silent speech recognition methods with a typical accuracy of 85%-95%, and demonstrating the potential of accelerometers as an available SSI modality for high-accuracy silent speech sentence recognition.
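As a rough illustration of the CTC step mentioned in this abstract, the sketch below shows standard greedy CTC decoding: per-frame label predictions are collapsed by merging consecutive repeats and then dropping the blank symbol. It is not the authors' implementation; the blank index, the two-word vocabulary, and the frame labels are hypothetical values chosen for the example.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Hypothetical vocabulary and per-frame argmax labels, for illustration only.
vocab = {1: "good", 2: "morning"}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
words = [vocab[i] for i in ctc_greedy_decode(frames)]
print(words)
```

Note that a blank between two identical labels keeps them as separate outputs, which is how CTC can represent a word repeated twice in a row.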

Paperid: 1798, https://arxiv.org/pdf/2502.17785.pdf

Abstract:
Reading comprehension is key to individual success, yet the assessment of question difficulty remains challenging due to the extensive human annotation and large-scale testing required by traditional methods such as linguistic analysis and Item Response Theory (IRT). While these robust approaches provide valuable insights, their scalability is limited. There is potential for Large Language Models (LLMs) to automate question difficulty estimation; however, this area remains underexplored. Our study investigates the effectiveness of LLMs, specifically OpenAI's GPT-4o and o1, in estimating the difficulty of reading comprehension questions using the Study Aid and Reading Assessment (SARA) dataset. We evaluated both the accuracy of the models in answering comprehension questions and their ability to classify difficulty levels as defined by IRT. The results indicate that, while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics. These findings suggest that LLMs can serve as a scalable method for automated difficulty assessment, particularly in dynamic interactions between learners and Adaptive Instructional Systems (AIS), bridging the gap between traditional psychometric techniques and modern AIS for reading comprehension and paving the way for more adaptive and personalized educational assessments.
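For readers unfamiliar with IRT, the sketch below shows the widely used two-parameter logistic (2PL) model, in which the probability of a correct answer depends on learner ability and on an item's discrimination and difficulty. The abstract does not state which IRT model or difficulty cut-offs were used, so the 2PL form, the function names, and the band thresholds here are assumptions for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta).

    a is the item discrimination, b is the item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def difficulty_band(b, easy_below=-0.5, hard_above=0.5):
    """Map a difficulty parameter to a coarse label (hypothetical thresholds)."""
    if b < easy_below:
        return "easy"
    if b > hard_above:
        return "hard"
    return "medium"


# A learner whose ability equals the item difficulty answers correctly half the time.
print(p_correct(theta=0.0, a=1.2, b=0.0))
print(difficulty_band(1.3))
```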

Paperid: 1799, https://arxiv.org/pdf/2502.17650.pdf

Abstract:
We use a duoethnographic approach to study how wearable-integrated LLM chatbots can assist with personalized stress management, addressing the growing need for immediacy and tailored interventions. Two researchers interacted with custom chatbots over 22 days, responding to wearable-detected physiological prompts, recording stressor phrases, and using them to seek tailored interventions from their LLM-powered chatbots. They recorded their experiences in autoethnographic diaries and analyzed them during weekly discussions, focusing on the relevance, clarity, and impact of chatbot-generated interventions. Results showed that even though most events triggered by the wearable were meaningful, only one in five warranted an intervention. They also showed that interventions tailored with brief event descriptions were more effective than generic ones. By examining the intersection of wearables and LLMs, this research contributes to developing more effective, user-centric mental health tools for real-time stress relief and behavior change.

Paperid: 1800, https://arxiv.org/pdf/2502.17643.pdf

Abstract:
Coaches are vital for effective collaboration, but cost and resource constraints often limit their availability during real-world tasks. This limitation poses serious challenges in life-critical domains that rely on effective teamwork, such as healthcare and disaster response. To address this gap, we propose and realize an innovative application of AI: task-time team coaching. Specifically, we introduce Socratic, a novel AI system that complements human coaches by providing real-time guidance during task execution. Socratic monitors team behavior, detects misalignments in team members' shared understanding, and delivers automated interventions to improve team performance. We validated Socratic through two human subject experiments involving dyadic collaboration. The results demonstrate that the system significantly enhances team performance with minimal interventions. Participants also perceived Socratic as helpful and trustworthy, supporting its potential for adoption. Our findings also suggest promising directions both for AI research and its practical applications to enhance human teamwork.

Paperid: 1801, https://arxiv.org/pdf/2502.17293.pdf

Abstract:
HCI is increasingly taking inspiration from religious traditions as a basis for ethical technology designs. Such ethically-inspired designs can be especially important for social communications technologies, which are associated with numerous societal concerns. If religious values are to be incorporated into real-world designs, there may be challenges when designers work with values unfamiliar to them. Therefore, we investigate the difference in interpretations of values when they are translated to technology designs. To do so, we studied design patterns that embody Catholic Social Teaching (CST). We interviewed 24 technologists and 7 CST scholars to assess their understanding of how those values would manifest in social media designs. We found that for the most part the technologists responded similarly to the CST scholars. However, CST scholars had a better understanding of the principle of subsidiarity, and they believed moderation upheld human dignity more than the technologists did. We discuss the implications of our findings for the design of social technologies and design processes at large.

Paperid: 1802, https://arxiv.org/pdf/2502.16899.pdf

Abstract:
Robots are prone to making errors, which can negatively impact their credibility as teammates during collaborative tasks with human users. Detecting and recovering from these failures is crucial for maintaining an effective level of trust from users. However, robots may fail without being aware of it. One way to detect such failures could be by analysing humans' non-verbal behaviours and reactions to failures. This study investigates how human gaze dynamics can signal a robot's failure and examines how different types of failures affect people's perception of the robot. We conducted a user study with 27 participants collaborating with a robotic mobile manipulator to solve tangram puzzles. The robot was programmed to experience two types of failures -- executional and decisional -- occurring either at the beginning or end of the task, with or without acknowledgement of the failure. Our findings reveal that the type and timing of the robot's failure significantly affect participants' gaze behaviour and perception of the robot. Specifically, executional failures led to more gaze shifts and increased focus on the robot, while decisional failures resulted in lower entropy in gaze transitions among areas of interest, particularly when the failure occurred at the end of the task. These results highlight that gaze can serve as a reliable indicator of robot failures and their types, and could also be used to predict the appropriate recovery actions.
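To make the notion of entropy in gaze transitions concrete, the sketch below computes the Shannon entropy (in bits) of the distribution of consecutive area-of-interest (AOI) pairs in a fixation sequence. Several definitions of gaze transition entropy exist and the abstract does not say which one the study used, so this simple joint-distribution version, the function name, and the AOI labels are assumptions for illustration.

```python
import math
from collections import Counter


def transition_entropy(aoi_sequence):
    """Shannon entropy (bits) of observed AOI-to-AOI transitions."""
    transitions = Counter(zip(aoi_sequence, aoi_sequence[1:]))
    total = sum(transitions.values())
    if total == 0:
        return 0.0  # fewer than two fixations: no transitions to measure
    return -sum(
        (count / total) * math.log2(count / total)
        for count in transitions.values()
    )


# Hypothetical fixation sequences over three AOIs.
focused = ["robot", "puzzle", "robot", "puzzle", "robot"]
scattered = ["robot", "puzzle", "face", "robot", "face", "puzzle", "robot"]
print(transition_entropy(focused), transition_entropy(scattered))
```

Lower values indicate that gaze moves between AOIs in a more predictable pattern, which is the direction of the effect the abstract reports for decisional failures.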

Paperid: 1803, https://arxiv.org/pdf/2502.16868.pdf

Abstract:
Large Language Models (LLMs) have recently demonstrated remarkable performance in tasks such as Retrieval-Augmented Generation (RAG) and autonomous AI agent workflows. Yet, when faced with large sets of unstructured documents requiring progressive exploration, analysis, and synthesis, such as conducting a literature survey, existing approaches often fall short. We address this challenge -- termed Progressive Document Investigation -- by introducing Graphy, an end-to-end platform that automates data modeling, exploration, and high-quality report generation in a user-friendly manner. Graphy comprises an offline Scrapper that transforms raw documents into a structured graph of Fact and Dimension nodes, and an online Surveyor that enables iterative exploration and LLM-driven report generation. We showcase a pre-scrapped graph of over 50,000 papers -- complete with their references -- demonstrating how Graphy facilitates the literature-survey scenario. The demonstration video can be found at https://youtu.be/uM4nzkAdGlM.

Paperid: 1804, https://arxiv.org/pdf/2502.16833.pdf

Abstract:
The collaborative design process is intrinsically complicated and dynamic, and researchers have long been exploring how to enhance efficiency in this process. As Artificial Intelligence technology evolves, it has been widely used as a design tool and has exhibited potential as a design collaborator. Nevertheless, problems concerning how designers should communicate with AI in collaborative design remain unsolved. To address this research gap, we referred to how designers communicate fluently in human-human design collaboration, and found awareness to be an important ability for facilitating communication by understanding their collaborators and current situation. However, previous research mainly studied and supported human awareness; the possible impact that AI awareness would have on the human-AI collaborative design process and the way to realize AI awareness remain unknown. In this study, we explored how AI awareness will impact human-AI collaboration through a Wizard-of-Oz experiment. Both quantitative and qualitative results supported that enabling AI to have awareness can enhance the communication fluidity between human and AI, thus enhancing collaboration efficiency. We further discussed the results and derived design implications for future human-AI collaborative design systems.

Paperid: 1805, https://arxiv.org/pdf/2502.16178.pdf

Abstract:
With the rise of online learning, many novice tutors lack experience engaging students remotely. We introduce TutorUp, a Large Language Model (LLM)-based system that enables novice tutors to practice engagement strategies with simulated students through scenario-based training. Based on a formative study involving two surveys (N1=86, N2=102) on student engagement challenges, we summarize scenarios that mimic real teaching situations. To enhance immersion and realism, we employ a prompting strategy that simulates dynamic online learning dialogues. TutorUp provides immediate and asynchronous feedback by referencing tutor-student online session dialogues and evidence-based teaching strategies from learning science literature. In a within-subject evaluation (N=16), participants rated TutorUp significantly higher than a baseline system without simulation capabilities regarding effectiveness and usability. Our findings suggest that TutorUp provides novice tutors with more effective training to learn and apply teaching strategies to address online student engagement challenges.

Paperid: 1806, https://arxiv.org/pdf/2502.15413.pdf

Abstract:
We present an approach to evaluate the efficacy of annotations in augmenting learning environments in the context of Virtual Reality. Our study extends previous work highlighting the benefits of learning based in virtual reality and introduces a method to facilitate asynchronous collaboration between educators and students. These two distinct perspectives fulfill special roles: educators aim to convey information, with which learners should become familiar. Educators are empowered to annotate static scenes on large touchscreens to supplement information. Subsequently, learners explore those annotated scenes in virtual reality. To assess the comparative ease and usability of creating text and pen annotations, we conducted a user study with 24 participants, who assumed the roles of both learners and teachers. Educators annotated static courses using provided textbook excerpts, interfacing through an 86-inch touchscreen. Learners navigated pre-designed educational courses in virtual reality to evaluate the practicality of annotations. The utility of annotations in virtual reality garnered high ratings. Users encountered issues with the touch interface implementation and rated its intuitiveness as low. Despite this, our study underscores the significant benefits of annotations, particularly for learners. This research offers valuable insights into annotation-enriched learning, emphasizing its potential to enhance students' information retention and comprehension.

Paperid: 1807, https://arxiv.org/pdf/2502.15383.pdf

Abstract:
Teachers play a pivotal role in fostering students' emotional and cognitive development. Teachers need to regulate their own emotions in order to co-regulate students. Here, using a unique mixed-methods approach, we investigate the relationship between self-compassion (treating oneself with compassion) and physiological stress responses among pre-service teachers. Heart rate variability (HRV) was measured during a mixed reality (MR) teacher training scenario designed to simulate socio-emotional conflict in class. Recorded interviews that followed the MR training were analyzed for observed self-compassion. Findings suggest that less emotional stress during the MR training correlates with higher levels of self-compassion during the interview. MR training and self-compassion may be valuable tools for training teacher emotion regulation and well-being.
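As background on how HRV can be quantified, the sketch below computes RMSSD (root mean square of successive differences between heartbeat intervals), a common time-domain HRV measure. The abstract does not say which HRV metric the study used, so RMSSD, the function name, and the sample intervals are assumptions for illustration.

```python
import math


def rmssd(rr_intervals_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    if not diffs:
        raise ValueError("need at least two RR intervals")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR intervals in milliseconds.
print(rmssd([800, 810, 790, 800]))
```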

Paperid: 1808, https://arxiv.org/pdf/2502.14241.pdf

Abstract:
Augmented Reality (AR) is increasingly positioned as a tool for knowledge work, providing beneficial affordances such as a virtually limitless display space that integrates digital information with the user's physical surroundings. However, for AR to supplant traditional screen-based devices in knowledge work, it must support prolonged usage across diverse contexts. Until now, few studies have explored the effects, opportunities, and challenges of working in AR outside a controlled laboratory setting and for an extended duration. This gap in research limits our understanding of how users may adapt its affordances to their daily workflows and what barriers hinder its adoption. In this paper, we present findings from a longitudinal diary study examining how participants incorporated an AR laptop -- Sightful's Spacetop EA -- into their daily work routines. 14 participants used the device for 40-minute daily sessions over two weeks, collectively completing 103 hours of AR-based work. Through survey responses, workspace photographs, and post-study interviews, we analyzed usage patterns, workspace configurations, and evolving user perceptions. Our findings reveal key factors influencing participants' usage of AR, including task demands, environmental constraints, social dynamics, and ergonomic considerations. We highlight how participants leveraged and configured AR's virtual display space, along with emergent hybrid workflows that involved physical screens and tasks. Based on our results, we discuss both overlaps with current literature and new considerations and challenges for the future design of AR systems for pervasive and productive use.

Paperid: 1809, https://arxiv.org/pdf/2502.14229.pdf

Abstract:
Generative AI (GenAI) models have become more capable than ever at augmenting productivity and cognition across diverse contexts. However, a fundamental challenge remains as users struggle to anticipate what AI will generate. As a result, they must engage in excessive turn-taking with the AI's feedback to clarify their intent, leading to significant cognitive load and time investment. Our goal is to advance the perspective that in order for users to seamlessly leverage the full potential of GenAI systems across various contexts, we must design GenAI systems that not only provide informative feedback but also informative feedforward -- designs that tell users what AI will generate before the user submits their prompt. To spark discussion on feedforward in GenAI, we designed diverse instantiations of feedforward across four GenAI applications: conversational UIs, document editors, malleable interfaces, and automation agents, and discussed how these designs can contribute to a more rigorous investigation of a design space and a set of guidelines for feedforward in all GenAI systems.

Paperid: 1810, https://arxiv.org/pdf/2502.14082.pdf

Abstract:
Purpose: With the rise of mental health risks globally, it is urgent to provide effective mental health support. However, a holistic understanding of how people seek help for mental health problems remains limited, impeding the development of evidence-based intervention programs to facilitate help-seeking behavior. This study reviews current theories that guide empirical research on young adults' help-seeking behavior using technologies, identifies limitations in existing frameworks, and proposes directions for future research. Methods: We searched databases that are most likely to contain mental health help-seeking practices in relation to information technology, including PubMed, ACM Digital Library, Web of Science, PsycInfo, ScienceDirect, EBSCO, and Cochrane Library. Results: Of 2443 abstracts reviewed, 43 studies met the criteria and were included in the analysis. We identified 16 theories and models. They represent seven perspectives on mental health help-seeking and reveal accessibility, stigma, and social support as key factors influencing help-seeking. Limitations: We summarized the theories and models and categorized them based on their primary perspectives. Cross-perspective connections could be explored in future reviews. Conclusions: A holistic approach to creating culturally sensitive multi-level interventions that consider individual, interpersonal, and community factors is needed to advance effective mental health help-seeking support strategies.

Paperid: 1811, https://arxiv.org/pdf/2502.13421.pdf

Abstract:
Physical touch, a fundamental aspect of human social interaction, remains largely absent in real-time virtual communication. We present a haptic-enabled multi-user Virtual Reality (VR) system that facilitates real-time, bi-directional social touch communication among physically distant users. We developed wearable gloves and forearm sleeves, embedded with 26 vibrotactile actuators for each hand and arm, actuated via a WiFi-based communication system. The system enables VR-transmitted data to be universally interpreted by haptic devices, allowing feedback rendering based on their capabilities. Users can perform and receive social touch gestures such as stroke, pat, poke, and squeeze with other users within a shared virtual space, or interact with virtual objects, while receiving vibrotactile feedback. Through a two-part user study involving six pairs of participants, we investigate the impact of gesture speed, haptic feedback modality, and user roles during real-time haptic communication in VR on affective and sensory experiences, and evaluate the overall system usability. Our findings highlight key design considerations that significantly improve affective experiences, presence, embodiment, pleasantness, and naturalness, to foster more immersive and expressive mediated social touch experiences in VR.

Paperid: 1812, https://arxiv.org/pdf/2502.12048.pdf

Abstract:
The integration of Brain-Computer Interfaces (BCIs) and Generative Artificial Intelligence (GenAI) has opened new frontiers in brain signal decoding, enabling assistive communication, neural representation learning, and multimodal integration. BCIs, particularly those leveraging Electroencephalography (EEG), provide a non-invasive means of translating neural activity into meaningful outputs. Recent advances in deep learning, including Generative Adversarial Networks (GANs) and Transformer-based Large Language Models (LLMs), have significantly improved EEG-based generation of images, text, and speech. This paper provides a literature review of the state of the art in EEG-based multimodal generation, focusing on (i) EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer-based language models and contrastive learning methods. Additionally, we discuss the emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We highlight key datasets, use cases, challenges, and EEG feature encoding methods that underpin generative approaches. By providing a structured overview of EEG-based generative AI, this survey aims to equip researchers and practitioners with insights to advance neural decoding, enhance assistive technologies, and expand the frontiers of brain-computer interaction.

Paperid: 1813, https://arxiv.org/pdf/2502.11277.pdf

Abstract:
Mental health care-seeking among marginalized young adults has received limited attention in CSCW research. Through in-depth interviews and visual elicitation methods with 18 diverse U.S. participants, our study reveals how marginalized identities shape mental health care-seeking journeys, often characterized by low aspirations and passive care-seeking influenced by lived experiences of marginalization. However, we found the transformative function of "care encounters" - serendipitous interactions with mental health resources that occur when individuals are not actively seeking support. These encounters serve as critical turning points, catalyzing shifts in aspiration and enabling more proactive care-seeking behaviors. Our analysis identifies both the infrastructural conditions that enable transformative care encounters and the aspiration breakdowns that impede care-seeking processes. This work makes conceptual contributions by supplementing traditional motivation-based care-seeking models with a reconceptualization of "care encounters" that accounts for the infrastructural and serendipitous nature of mental health access. We advance understanding of how marginalized identity uniquely influences care-seeking behaviors while providing actionable design implications for embedding technology-mediated "care encounters" into socio-technical interventions that can better support mental health care access for vulnerable populations.

Paperid: 1814, https://arxiv.org/pdf/2502.10678.pdf

Abstract:
This work investigates the integration of generative visual aids in human-robot task communication. We developed GenComUI, a system powered by large language models that dynamically generates contextual visual aids (such as map annotations, path indicators, and animations) to support verbal task communication and facilitate the generation of customized task programs for the robot. This system was informed by a formative study that examined how humans use external visual tools to assist verbal communication in spatial tasks. To evaluate its effectiveness, we conducted a user experiment (n = 20) comparing GenComUI with a voice-only baseline. The results demonstrate that generative visual aids, through both qualitative and quantitative analysis, enhance verbal task communication by providing continuous visual feedback, thus promoting natural and effective human-robot communication. Additionally, the study offers a set of design implications, emphasizing how dynamically generated visual aids can serve as an effective communication medium in human-robot interaction. These findings underscore the potential of generative visual aids to inform the design of more intuitive and effective human-robot communication, particularly for complex communication scenarios in human-robot interaction and LLM-based end-user development.

Paperid: 1815, https://arxiv.org/pdf/2502.09811.pdf

Abstract:
Avatars are a critical medium for identity representation in social virtual reality (VR). However, options for disability expression are highly limited on current avatar interfaces. Improperly designed disability features may even perpetuate misconceptions about people with disabilities (PWD). As more PWD use social VR, there is an emerging need for comprehensive design standards that guide developers and designers to create inclusive avatars. Our work aims to advance avatar design practices by delivering a set of centralized, comprehensive, and validated design guidelines that are easy to adopt, disseminate, and update. Through a systematic literature review and interviews with 60 participants with various disabilities, we derived 20 initial design guidelines that cover diverse disability expression methods through five aspects, including avatar appearance, body dynamics, assistive technology design, peripherals around avatars, and customization control. We further evaluated the guidelines via a heuristic evaluation study with 10 VR practitioners, validating the guideline coverage, applicability, and actionability. Our evaluation resulted in a final set of 17 design guidelines with recommendation levels.

Paperid: 1816, https://arxiv.org/pdf/2502.09083.pdf

Abstract:
The pervasiveness of large language models and generative AI in online media has amplified the need for effective automated fact-checking to assist fact-checkers in tackling the increasing volume and sophistication of misinformation. The complex nature of fact-checking demands that automated fact-checking systems provide explanations that enable fact-checkers to scrutinise their outputs. However, it is unclear how these explanations should align with the decision-making and reasoning processes of fact-checkers to be effectively integrated into their workflows. Through semi-structured interviews with fact-checking professionals, we bridge this gap by: (i) providing an account of how fact-checkers assess evidence, make decisions, and explain their processes; (ii) examining how fact-checkers use automated tools in practice; and (iii) identifying fact-checker explanation requirements for automated fact-checking tools. The findings show unmet explanation needs and identify important criteria for replicable fact-checking explanations that trace the model's reasoning path, reference specific evidence, and highlight uncertainty and information gaps.

Paperid: 1817, https://arxiv.org/pdf/2502.08766.pdf

Abstract:
The global mental health crisis is a pressing concern, with college students particularly vulnerable to rising mental health disorders. The widespread use of smartphones among young adults, while offering numerous benefits, has also been linked to negative outcomes such as addiction and regret, significantly impacting well-being. Leveraging the longest longitudinal dataset collected over four college years through passive mobile sensing, this study is the first to examine the relationship between students' smartphone unlocking behaviors and their mental health at scale in real-world settings. We provide the first evidence demonstrating the predictability of phone unlocking behaviors for mental health outcomes based on a large dataset, highlighting the potential of these novel features for future predictive models. Our findings reveal important variations in smartphone usage across genders and locations, offering a deeper understanding of the interplay between digital behaviors and mental health. We highlight future research directions aimed at mitigating adverse effects and promoting digital well-being in this population.

Paperid: 1818, https://arxiv.org/pdf/2502.08442.pdf

Abstract:
With the transition to fully autonomous vehicles, non-driving related tasks (NDRTs) become increasingly important, allowing passengers to use their driving time more efficiently. In-car Augmented Reality (AR) gives the possibility to engage in NDRTs while also allowing passengers to engage with their surroundings, for example, by displaying world-fixed points of interest (POIs). This can lead to new discoveries, provide information about the environment, and improve locational awareness. To explore the optimal visualization of POIs using in-car AR, we conducted a field study (N = 38) examining six parameters: positioning, scaling, rotation, render distance, information density, and appearance. We also asked for intention of use, preferred seat positions and preferred automation level for the AR function in a post-study questionnaire. Our findings reveal user preferences and general acceptance of the AR functionality. Based on these results, we derived UX-guidelines for the visual appearance and behavior of location-based POIs in in-car AR.

Paperid: 1819, https://arxiv.org/pdf/2502.08437.pdf

Abstract:
As passengers spend more time in vehicles, the demand for non-driving related tasks (NDRTs) increases. In-car Augmented Reality (AR) has the potential to enhance passenger experiences by enabling interaction with the environment through NDRTs using world-fixed Points of Interest (POIs). However, the effectiveness of existing interaction techniques and visualization methods for in-car AR remains unclear. Based on a survey (N=110) and a pre-study (N=10), we developed an interactive in-car AR system using a video see-through head-mounted display to engage with POIs via eye-gaze and pinch. Users could explore passed and upcoming POIs using three visualization techniques: List, Timeline, and Minimap. We evaluated the system's feasibility in a field study (N=21). Our findings indicate general acceptance of the system, with the List visualization being the preferred method for exploring POIs. Additionally, the study highlights limitations of current AR hardware, particularly the impact of vehicle movement on 3D interaction.

Paperid: 1820, https://arxiv.org/pdf/2502.07922.pdf

Abstract:
Tele-ultrasound has the potential to greatly improve health equity for countless remote communities. However, practical scenarios involve potentially large time delays which cause current implementations of telerobotic ultrasound (US) to fail. Using a local model of the remote environment to provide haptics to the expert operator can decrease teleoperation instability, but the delayed visual feedback remains problematic. This paper introduces a robotic tele-US system in which the local model is not only haptic, but also visual, by re-slicing and rendering a pre-acquired US sweep in real time to provide the operator a preview of what the delayed image will resemble. A prototype system is presented and tested with 15 volunteer operators. It is found that visual-haptic model-mediated teleoperation (MMT) compensates completely for time delays up to 1000 ms round trip in terms of operator effort and completion time while conventional MMT does not. Visual-haptic MMT also significantly outperforms MMT for longer time delays in terms of motion accuracy and force control. This proof-of-concept study suggests that visual-haptic MMT may facilitate remote robotic tele-US.

Paperid: 1821, https://arxiv.org/pdf/2502.05117.pdf

Abstract:
E-scooters have become a more dominant mode of transport in recent years. However, the rise in their usage has been accompanied by an increase in injuries, affecting the trust and perceived safety of both users and non-users. Artificial intelligence (AI), as a cutting-edge and widely applied technology, has demonstrated potential to enhance transportation safety, particularly in driver assistance systems. The integration of AI into e-scooters presents a promising approach to addressing these safety concerns. This study aims to explore the factors influencing individuals' willingness to use AI-assisted e-scooters. Data were collected using a structured questionnaire, capturing responses from 405 participants. The questionnaire gathered information on demographic characteristics, micromobility usage frequency, road users' perception of safety around e-scooters, perceptions of safety in AI-enabled technology, trust in AI-enabled e-scooters, and involvement in e-scooter crash incidents. To examine the impact of demographic factors on participants' preferences between AI-assisted and regular e-scooters, decision tree analysis is employed, indicating that ethnicity, income, and age significantly influence preferences. To analyze the impact of other factors on the willingness to use AI-enabled e-scooters, a full-scale Structural Equation Model (SEM) is applied, revealing that the perception of safety in AI-enabled technology and the level of trust in AI-enabled e-scooters are the strongest predictors.

Paperid: 1822, https://arxiv.org/pdf/2502.04569.pdf

Abstract:
The face remains relatively unexplored as a target region for haptic feedback, despite providing a considerable surface area consisting of highly sensitive skin. There are promising applications for facial haptic feedback, especially in cases of severe upper limb loss or spinal cord injury, where the face is typically less impacted than other body parts. Moreover, the neural representation of the face is adjacent to that of the hand, and phantom maps have been discovered between the fingertips and the cheeks. However, there is a dearth of compact devices for facial haptic feedback, and vibrotactile stimulation, a common modality of haptic feedback, has not been characterized for localization acuity on the face. We performed a localization experiment on the cheek, with an arrangement of off-the-shelf coin vibration motors. The study follows the methods of prior work studying other skin regions, in which participants attempt to identify the sites of discrete vibrotactile stimuli. We intend for our results to inform the future development of systems using vibrotactile feedback to convey information via the face.

Paperid: 1823, https://arxiv.org/pdf/2502.03719.pdf

Abstract:
We introduce the concept of code shaping, an interaction paradigm for editing code using free-form sketch annotations directly on top of the code and console output. To evaluate this concept, we conducted a three-stage design study with 18 different programmers to investigate how sketches can communicate intended code edits to an AI model for interpretation and execution. The results show how different sketches are used, the strategies programmers employ during iterative interactions with AI interpretations, and interaction design principles that support the reconciliation between the code editor and sketches. Finally, we demonstrate the practical application of the code shaping concept with two use case scenarios, illustrating design implications from the study.

Paperid: 1824, https://arxiv.org/pdf/2502.03447.pdf

Abstract:
One of the key challenges faced by autistic children is understanding social affordances in complex environments, which further impacts their ability to respond appropriately to social signals. In traffic scenarios, this impairment can even lead to safety concerns. In this paper, we introduce an LLM-simulated immersive projection environment designed to improve this ability in autistic children while ensuring their safety. We first propose 17 design considerations across four major categories, derived from a comprehensive review of previous research. Next, we developed a system called AIroad, which leverages LLMs to simulate drivers with varying social intents, expressed through explicit multimodal social signals. AIroad helps autistic children bridge the gap in recognizing the intentions behind behaviors and learning appropriate responses through various stimuli. A user study involving 14 participants demonstrated that this technology effectively engages autistic children and leads to significant improvements in their comprehension of social affordances in traffic scenarios. Additionally, parents reported high perceived usability of the system. These findings highlight the potential of combining LLM technology with immersive environments for the functional rehabilitation of autistic children in the future.

Paperid: 1825, https://arxiv.org/pdf/2502.01764.pdf

Abstract:
In real-world decision making, outcomes are often delayed, meaning individuals must make multiple decisions before receiving any feedback. Moreover, feedback can be presented in different ways: it may summarize the overall results of multiple decisions (aggregated feedback) or report the outcome of individual decisions after some delay (clustered feedback). Despite their importance, the timing and presentation of delayed feedback have received little attention in cognitive modeling of decision-making, which typically focuses on immediate feedback. To address this, we conducted an experiment to compare the effect of delayed vs. immediate feedback and aggregated vs. clustered feedback. We also propose a Hierarchical Instance-Based Learning (HIBL) model that captures how people make decisions in delayed feedback settings. HIBL uses a super-model that chooses between sub-models to perform the decision-making task until an outcome is observed. Simulations show that HIBL best predicts human behavior and specific patterns, demonstrating the flexibility of IBL models.
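The three feedback regimes named in this abstract can be illustrated with a minimal sketch. It is not the paper's experimental code or the HIBL model; the function name `feedback_schedule`, the fixed `block_size`, and the choice of a plain sum as the "aggregated" summary are all assumptions made for illustration.

```python
from typing import List, Optional, Union

# What a participant sees after a single decision: nothing (None),
# one number, or a list of individual outcomes.
Feedback = Optional[Union[float, List[float]]]


def feedback_schedule(outcomes: List[float], block_size: int, mode: str) -> List[Feedback]:
    """Return the feedback shown after each decision (hypothetical helper).

    mode = "immediate":  each outcome is shown right after its decision.
    mode = "aggregated": only the block total is shown, at the end of each block
                         (assumption: the summary is a sum).
    mode = "clustered":  the individual outcomes of the block are shown together,
                         at the end of each block.
    Decisions in a trailing, incomplete block receive no feedback in this sketch.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    if mode not in ("immediate", "aggregated", "clustered"):
        raise ValueError(f"unknown mode: {mode}")

    schedule: List[Feedback] = []
    for i, outcome in enumerate(outcomes):
        if mode == "immediate":
            schedule.append(outcome)
            continue
        is_block_end = (i + 1) % block_size == 0
        if not is_block_end:
            schedule.append(None)  # feedback is still delayed
            continue
        block = outcomes[i + 1 - block_size : i + 1]
        schedule.append(sum(block) if mode == "aggregated" else list(block))
    return schedule
```

With the hypothetical outcomes `[1, 0, 2, 3]` and `block_size=2`, the aggregated schedule withholds feedback after the first and third decisions and shows the block totals after the second and fourth, while the clustered schedule shows each block's individual outcomes at the same points.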

Paperid: 1826, https://arxiv.org/pdf/2502.01448.pdf

Abstract:
When encountering a robot in the wild, it is not inherently clear to human users what the robot's capabilities are. When encountering misunderstandings or problems in spoken interaction, robots often just apologize and move on, without additional effort to make sure the user understands what happened. We set out to compare the effect of two speech-based capability communication strategies (proactive, reactive) to a robot without such a strategy, in regard to the user's rating of and their behavior during the interaction. For this, we conducted an in-person user study with 120 participants who had three speech-based interactions with a social robot in a restaurant setting. Our results suggest that users preferred the robot communicating its capabilities proactively and adjusted their behavior in those interactions, using a more conversational interaction style while also enjoying the interaction more.

Paperid: 1827, https://arxiv.org/pdf/2502.00908.pdf

Abstract:
Augmented Reality (AR) collaboration can benefit from a shared 2D surface, such as a whiteboard. However, many features of each collaborator's physical environment must be considered in order to determine the best placement and shape of the shared surface. We explored the effects of three methods for beginning a collaborative whiteboarding session with varying levels of user control: MANUAL, DISCRETE CHOICE, and AUTOMATIC by conducting a simulated AR study within Virtual Reality (VR). In the MANUAL method, users draw their own surfaces directly in the environment until they agree on the placement; in the DISCRETE CHOICE method, the system provides three options for whiteboard size and location; and in the AUTOMATIC method, the system automatically creates a whiteboard that fits within each collaborator's environment. We evaluate these three conditions in a study in which two collaborators used each method to begin collaboration sessions. After establishing a session, the users worked together to complete an affinity diagramming task using the shared whiteboard. We found that the majority of participants preferred to have direct control during the initialization of a new collaboration session, despite the additional workload induced by the MANUAL method.

Paperid: 1828, https://arxiv.org/pdf/2502.00275.pdf

Abstract:
Accurate estimation of human hand configurations and the forces they exert is critical for effective teleoperation and skill transfer in robotic manipulation. A deeper understanding of human interactions with objects can further enhance teleoperation performance. To address this need, researchers have explored methods to capture and translate human manipulation skills and applied forces to robotic systems. Among these, biosignal-based approaches, particularly those using forearm ultrasound data, have shown significant potential for estimating hand movements and finger forces. In this study, we present a method for simultaneously estimating manipulation skills and applied hand force using forearm ultrasound data. Data collected from seven participants were used to train deep learning models for classifying manipulation skills and estimating grasp force. Our models achieved an average classification accuracy of 94.87% ± 10.16% for manipulation skills and an average root mean square error (RMSE) of 0.51 ± 0.19 N for force estimation, as evaluated using five-fold cross-validation. These results highlight the effectiveness of forearm ultrasound in advancing human-machine interfacing and robotic teleoperation for complex manipulation tasks. This work enables new and effective possibilities for human-robot skill transfer and tele-manipulation, bridging the gap between human dexterity and robotic control.
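For readers unfamiliar with how "mean plus or minus spread" figures from cross-validation are usually obtained, the following is a minimal sketch, not the authors' evaluation code. The helper names `rmse` and `summarize_folds` are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not say which spread measure was reported.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def summarize_folds(fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across folds.

    Assumption: the spread is the sample standard deviation (n - 1 denominator).
    """
    return mean(fold_scores), stdev(fold_scores)
```

In a five-fold setting, `rmse` would be evaluated once per held-out fold and the five resulting values passed to `summarize_folds` to obtain a single "mean ± spread" figure.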

Paperid: 1829, https://arxiv.org/pdf/2502.00177.pdf

Abstract:
Human-in-the-loop optimization (HILO) is a promising approach for personalizing visual prostheses by iteratively refining stimulus parameters based on user feedback. Previous work demonstrated HILO's efficacy in simulation, but its performance with human participants remains untested. Here we evaluate HILO using sighted participants viewing simulated prosthetic vision to assess its ability to optimize stimulation strategies under realistic conditions. Participants selected between phosphenes generated by competing encoders to iteratively refine a deep stimulus encoder (DSE). We tested HILO in three conditions: standard optimization, threshold misspecifications, and out-of-distribution parameter sampling. Participants consistently preferred HILO-generated stimuli over both a naive encoder and the DSE alone, with log odds favoring HILO across all conditions. We also observed key differences between human and simulated decision-making, highlighting the importance of validating optimization strategies with human participants. These findings support HILO as a viable approach for adapting visual prostheses to individuals. Clinical relevance: Validating HILO with sighted participants viewing simulated prosthetic vision is an important step toward personalized calibration of future visual prostheses.
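The phrase "log odds favoring HILO" can be unpacked with a minimal sketch of the basic definition, log(p / (1 - p)), applied to raw pairwise-choice counts. This is only the textbook quantity; the paper's actual statistical model is not described in the abstract, and the function name `preference_log_odds` and its arguments are hypothetical.

```python
import math


def preference_log_odds(times_preferred: int, total_comparisons: int) -> float:
    """Log odds of preferring one option in a two-alternative comparison.

    Positive values mean the option was chosen more often than not;
    zero means no preference. Undefined when the option was chosen
    never or always, so those cases raise an error.
    """
    if total_comparisons <= 0:
        raise ValueError("total_comparisons must be positive")
    if not 0 < times_preferred < total_comparisons:
        raise ValueError("log odds are undefined for proportions of 0 or 1")
    p = times_preferred / total_comparisons
    return math.log(p / (1.0 - p))
```

For instance, an option chosen in three out of every four comparisons has odds of 3 to 1, so its log odds equal the natural logarithm of 3.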

Paperid: 1830, https://arxiv.org/pdf/2501.18291.pdf

Abstract:
We present an interactive and explainable automated coaching assistant called CueTip for a variant of pool/billiards. CueTip's novelty lies in its combination of three features: a natural-language interface, an ability to perform contextual, physics-aware reasoning, and explanations rooted in a set of predetermined guidelines developed by domain experts. We instrument a physics simulator so that it generates event traces in natural language alongside traditional state traces. Event traces lend themselves to interpretation by language models, which serve as the interface to our assistant. We design and train a neural adaptor that decouples tactical choices made by CueTip from its interactivity and explainability, allowing it to be reconfigured to mimic any pool-playing agent. Our experiments show that CueTip enables contextual query-based assistance and explanations while maintaining the strength of the agent in terms of win rate (improving it in some situations). The explanations generated by CueTip are physics-aware and grounded in the expert rules and are therefore more reliable.

Paperid: 1831, https://arxiv.org/pdf/2501.18148.pdf

Abstract:
Existing commercial and in-house software development tools are often inaccessible to Blind and Low Vision Software Professionals (BLVSPs), hindering their participation and career growth at work. Building on existing research on Do-It-Yourself (DIY) Assistive Technologies and customized tools made by programmers, we shed light on the currently unexplored intersection of how DIY tools built and used by BLVSPs support accessible software development. Through semi-structured interviews with 30 BLVSPs, we found that such tools serve many different purposes and are driven by motivations such as desiring to maintain a professional image and a sense of dignity at work. These tools had significant impacts on workplace accessibility and revealed a need for a more centralized community for sharing tools, tips, and tricks. Based on our findings, we introduce the "Double Hacker Dilemma" and highlight a need for developing more effective peer and organizational platforms that support DIY tool sharing.

Paperid: 1832, https://arxiv.org/pdf/2501.16164.pdf

Abstract:
MetaDecorator is a framework that empowers users to personalize virtual spaces. By leveraging text-driven prompts and image synthesis techniques, MetaDecorator adorns static panoramas captured by 360° imaging devices, transforming them into uniquely styled and visually appealing environments. This significantly enhances the realism and engagement of virtual tours compared to traditional offerings. Beyond the core framework, we also discuss the integration of Large Language Models (LLMs) and haptics in the VR application to provide a more immersive experience.

Paperid: 1833, https://arxiv.org/pdf/2501.15628.pdf

Abstract:
Social anxiety (SA) has become increasingly prevalent. Traditional coping strategies often face accessibility challenges. Generative AI (GenAI) chatbots, known for their knowledge and conversational capabilities, are emerging as alternative tools for mental well-being. With the increased integration of GenAI, it is important to examine individuals' attitudes and trust in GenAI chatbots' support for SA. Through a mixed-method approach that involved surveys (n = 159) and interviews (n = 17), we found that individuals with severe symptoms tended to trust and embrace GenAI chatbots more readily, valuing their non-judgmental support and perceived emotional comprehension. However, those with milder symptoms prioritized technical reliability. We identified factors influencing trust, such as GenAI chatbots' ability to generate empathetic responses and their context-sensitive limitations, which were particularly important among individuals with SA. We also discuss the design implications and use of GenAI chatbots in fostering cognitive and emotional trust, with practical and design considerations.

Paperid: 1834, https://arxiv.org/pdf/2501.15599.pdf

Abstract:
Recent advancements in large language models (LLMs) promise to expand mental health interventions by emulating therapeutic techniques, potentially easing barriers to care. Yet there is a lack of real-world empirical evidence evaluating the strengths and limitations of LLM-enabled psychotherapy interventions. In this work, we evaluate an LLM-powered chatbot, designed via prompt engineering to deliver cognitive restructuring (CR), with 19 users. Mental health professionals then examined the resulting conversation logs to uncover potential benefits and pitfalls. Our findings indicate that an LLM-based CR approach has the capability to adhere to core CR protocols, prompt Socratic questioning, and provide empathetic validation. However, issues of power imbalances, advice-giving, misunderstood cues, and excessive positivity reveal deeper challenges, including the potential to erode therapeutic rapport and ethical concerns. We also discuss design implications for leveraging LLMs in psychotherapy and underscore the importance of expert oversight to mitigate these concerns, which are critical steps toward safer, more effective AI-assisted interventions.

Paperid: 1835, https://arxiv.org/pdf/2501.13060.pdf

Abstract:
As computing's societal impact grows, so does the need for computing students to recognize and address the ethical and sociotechnical implications of their work. While there are efforts to integrate ethics into computing curricula, we lack a standardized tool to measure those efforts, specifically, students' attitudes towards ethical reflection and their ability to effect change. This paper introduces the novel framework of Critically Conscious Computing and reports on the development and content validation of the Critical Reflection and Agency in Computing Index, a novel instrument designed to assess undergraduate computing students' attitudes towards practicing critically conscious computing. The resulting index is a theoretically grounded, expert-reviewed tool to support research and practice in computing ethics education. This enables researchers and educators to gain insights into students' perspectives, inform the design of targeted ethics interventions, and measure the effectiveness of computing ethics education initiatives.

Paperid: 1836, https://arxiv.org/pdf/2501.12493.pdf

Abstract:
Nonverbal behaviors such as posture, gestures, and gaze are essential for conveying internal states, both consciously and unconsciously, in human interaction. For robots to interact more naturally with humans, robot movement design should likewise integrate expressive qualities, such as intention, attention, and emotions, alongside traditional functional considerations like task fulfillment and time efficiency. In this paper, we present the design and prototyping of a lamp-like robot that explores the interplay between functional and expressive objectives in movement design. Using a research-through-design methodology, we document the hardware design process, define expressive movement primitives, and outline a set of interaction scenario storyboards. We propose a framework that incorporates both functional and expressive utilities during movement generation, and implement the robot behavior sequences in different function- and social-oriented tasks. Through a user study comparing expression-driven versus function-driven movements across six task scenarios, our findings indicate that expression-driven movements significantly enhance user engagement and perceived robot qualities. This effect is especially pronounced in social-oriented tasks.

Paperid: 1837, https://arxiv.org/pdf/2501.12152.pdf

Abstract:
Large language models (LLMs) are increasingly prevalent in recommender systems, where LLMs can be used to generate personalized recommendations. Here, we examine how different LLM-generated explanations for movie recommendations affect users' perceptions of cognitive, affective, and utilitarian needs and consumption intentions. In a pre-registered, between-subject online experiment (N=759) and follow-up interviews (N=30), we compare (a) LLM-generated generic explanations, and (b) LLM-generated contextualized explanations. Our findings show that contextualized explanations (i.e., explanations that incorporate users' past behaviors) effectively meet users' cognitive needs while increasing users' intentions to watch recommended movies. However, adding explanations offers limited benefits in meeting users' utilitarian and affective needs, raising concerns about the proper design and implications of LLM-generated explanations. Qualitative insights from interviews reveal that referencing users' past preferences enhances trust and understanding but can feel excessive if overused. Furthermore, users with more active and positive engagement with the recommender system and movie-watching get substantial gains from contextualized explanations. Overall, our research clarifies how LLM-generated recommendations influence users' motivations and behaviors, providing valuable insights for the future development of user-centric recommender systems, a key element in social media platforms and online ecosystems.

Paperid: 1838, https://arxiv.org/pdf/2501.10713.pdf

Abstract:
Socially interactive agents are gaining prominence in domains like healthcare, education, and service contexts, particularly virtual agents due to their inherent scalability. To facilitate authentic interactions, these systems require verbal and nonverbal communication through e.g., facial expressions and gestures. While natural language processing technologies have rapidly advanced, incorporating human-like nonverbal behavior into real-world interaction contexts is crucial for enhancing the success of communication, yet this area remains underexplored. One barrier is creating autonomous systems with sophisticated conversational abilities that integrate human-like nonverbal behavior. This paper presents a distributed architecture using Epic Games MetaHuman, combined with advanced conversational AI and camera-based user management, that supports methods like motion capture, handcrafted animation, and generative approaches for nonverbal behavior. We share insights into a system architecture designed to investigate nonverbal behavior in socially interactive agents, deployed in a three-week field study in the Deutsches Museum Bonn, showcasing its potential in realistic nonverbal behavior research.

Paperid: 1839, https://arxiv.org/pdf/2501.10288.pdf

Abstract:
Virtue ethics is a philosophical tradition that emphasizes the cultivation of virtues in achieving the common good. It has been suggested to be an effective framework for envisioning more ethical technology, yet previous work on virtue ethics and technology design has remained at theoretical recommendations. Therefore, we propose an approach for identifying user experience design patterns that embody particular virtues to more concretely articulate virtuous technology designs. As a proof of concept for our approach, we documented seven design patterns for social media that uphold the virtues of Catholic Social Teaching. We interviewed 24 technology researchers and industry practitioners to evaluate these patterns. We found that overall the patterns enact the virtues they were identified to embody; our participants valued that the patterns fostered intentional conversations and personal connections. We pave a path for technology professionals to incorporate diverse virtue traditions into the development of technologies that support human flourishing.

Paperid: 1840, https://arxiv.org/pdf/2501.10134.pdf

Abstract:
The recent advancements in Generative Artificial Intelligence (GenAI) technology have been transformative for the field of education. Large Language Models (LLMs) such as ChatGPT and Bard can be leveraged to automate boilerplate tasks, create content for personalised teaching, and handle repetitive tasks to allow more time for creative thinking. However, it is important to develop guidelines, policies, and assessment methods in the education sector to ensure the responsible integration of these tools. In this article, thematic analysis has been performed on seven essays obtained from professionals in the education sector to understand the advantages and pitfalls of using GenAI models such as ChatGPT and Bard in education. Exploratory Data Analysis (EDA) has been performed on the essays to extract further insights from the text. The study found several themes which highlight benefits and drawbacks of GenAI tools, as well as suggestions to overcome these limitations and ensure that students are using these tools in a responsible and ethical manner.

Paperid: 1841, https://arxiv.org/pdf/2501.08253.pdf

Abstract:
Augmented Reality (AR) presents new opportunities for immersive storytelling. However, this immersiveness faces two main hurdles. First, AR's immersive quality is often confined to visual elements, such as pixels on a screen. Second, crafting immersive narratives is complex and generally beyond the reach of amateurs due to the need for advanced technical skills. We introduce Jigsaw, a system that empowers beginners to both experience and craft immersive stories, blending virtual and physical elements. Jigsaw uniquely combines mobile AR with readily available Internet-of-things (IoT) devices. We conducted a qualitative study with 20 participants to assess Jigsaw's effectiveness in both consuming and creating immersive narratives. The results were promising: participants not only successfully created their own immersive stories but also found the playback of three such stories deeply engaging. However, sensory overload emerged as a significant challenge in these experiences. We discuss design trade-offs and considerations for future endeavors in immersive storytelling involving AR and IoT.

Paperid: 1842, https://arxiv.org/pdf/2501.07690.pdf

Abstract:
Data-centric technologies provide exciting opportunities, but recent research has shown how lack of representation in datasets, often as a result of systemic inequities and socioeconomic disparities, can produce inequitable outcomes that can exclude or harm certain demographics. In this paper, we discuss preliminary insights from an ongoing effort aimed at better understanding barriers to equitable data-centric innovation. We report findings from a survey of 261 technologists and researchers who use data in their work regarding their experiences seeking adequate, representative datasets. Our findings suggest that age and identity play a significant role in the seeking and selection of representative datasets, warranting further investigation into these aspects of data-centric research and development.

Paperid: 1843, https://arxiv.org/pdf/2501.05706.pdf

Abstract:
Making errors is part of the programming process -- even for the most seasoned professionals. Novices in particular are bound to make many errors while learning. It is well known that traditional (compiler/interpreter) programming error messages have been less than helpful for many novices and can have effects such as being frustrating, containing confusing jargon, and being downright misleading. Recent work has found that large language models (LLMs) can generate excellent error explanations, but that the effectiveness of these error messages heavily depends on whether the LLM has been provided with context -- typically the original source code where the problem occurred. Knowing that programming error messages can be misleading and/or contain information that serves little-to-no use (particularly for novices), we explore the reverse: what happens when GPT-3.5 is prompted for error explanations on just the erroneous source code itself -- original compiler/interpreter produced error message excluded. We utilized various strategies to make more effective error explanations, including one-shot prompting and fine-tuning. We report the baseline results of how effective the error explanations are at providing feedback, as well as how various prompting strategies might improve the explanations' effectiveness. Our results can help educators understand how LLMs respond to such prompts that novices are bound to make, and hopefully lead to more effective use of Generative AI in the classroom.

Paperid: 1844, https://arxiv.org/pdf/2501.05153.pdf

Abstract:
Teleoperating a robot arm involves the human operator positioning the robot's end-effector or programming each joint. Whereas humans can control their own arms easily by integrating visual and proprioceptive feedback, it is challenging to control an external robot arm in the same way, due to its inconsistent orientation and appearance. We explore teleoperating a robot arm through motion-capture (MoCap) of the human operator's arm with the assistance of augmented reality (AR) visualisations. We investigate how AR helps teleoperation by visualising a virtual reference of the human arm alongside the robot arm to help users understand the movement mapping. We found that the AR overlay of a humanoid arm on the robot in the same orientation helped users learn the control. We discuss findings and future work on MoCap-based robot teleoperation.

Paperid: 1845, https://arxiv.org/pdf/2501.05141.pdf

Abstract:
Office Assistant Robots (OARs) offer a promising solution to proactively provide in-situ support to enhance employee well-being and productivity in office spaces. We introduce OfficeMate, a social OAR designed to assist with practical tasks, foster social interaction, and promote health and well-being. Through a pilot evaluation with seven participants in an office environment, we found that users see potential in OARs for reducing stress and promoting healthy habits and value the robot's ability to provide companionship and physical activity reminders in the office space. However, concerns regarding privacy, communication, and the robot's interaction timing were also raised. The feedback highlights the need to carefully consider the robot's appearance and behaviour to ensure it enhances user experience and aligns with office social norms. We believe these insights will better inform the development of adaptive, intelligent OAR systems for future office space integration.

Paperid: 1846, https://arxiv.org/pdf/2501.04543.pdf

Abstract:
Large Language Models (LLMs) created new opportunities for generating personas, expected to streamline and accelerate the human-centered design process. Yet, AI-generated personas may not accurately represent actual user experiences, as they can miss contextual and emotional insights critical to understanding real users' needs and behaviors. This introduces a potential threat to quality, especially for novices. This paper examines the differences in how users perceive personas created by LLMs compared to those crafted by humans regarding their credibility for design. We gathered ten human-crafted personas developed by HCI experts according to relevant attributes established in related work. Then, we systematically generated ten personas with an LLM and compared them with human-crafted ones in a survey. The results showed that participants differentiated between human-created and AI-generated personas, with the latter perceived as more informative and consistent. However, participants noted that the AI-generated personas tended to follow stereotypes, highlighting the need for a greater emphasis on diversity when utilizing LLMs for persona creation.

Paperid: 1847, https://arxiv.org/pdf/2501.03862.pdf

Abstract:
The restaurant industry is currently facing a challenging socio-economic situation caused by the rise of delivery services, inflation, and typically low margins. Often, technological opportunities for process optimization or customer retention are not fully utilized. In our design case study, we investigate which technologies are already being used to improve the customer experience in restaurants and explore a novel approach to this issue. We designed, implemented, and evaluated a platform with customers and restaurateurs to increase visibility and emotional connection to nearby restaurants through their dishes. Some of our key findings include the enormous potential of combining location-based systems and conversational agents, but also the difficulties in creating content for such platforms. We contribute to the field of Human-Food Interaction by (1) identifying promising design spaces as well as customer and restaurateur requirements for technology in this domain, (2) presenting an innovative design case study to improve the user experience, and (3) exploring the broader implications of our design case study findings for approaching a real-world metaverse.

Paperid: 1848, https://arxiv.org/pdf/2501.03537.pdf

Abstract:
In Human-Computer Interaction (HCI) and Ubiquitous Computing, the objective of optimizing device interactions and personalizing user experiences has placed a new emphasis on accurately evaluating cognitive readiness using wearable devices. Interpreting cognitive readiness in real-world scenarios is complex due to the plethora of potential physiological measures, individual variability, and the limitations of wearable devices. In this review, we present a systematic overview of key physiological measures that can be used for an in-depth assessment of cognitive readiness. These measures can serve as proxies for detailed assessments of cognitive readiness. This review serves as a tool for assessing cognitive readiness for diverse applications, with special focus on in-the-wild research settings. In addition, due to the complexity of measurements and devices, we propose the development of a robust catalog for cognitive readiness measurements.

Paperid: 1849, https://arxiv.org/pdf/2501.02841.pdf

Abstract:
Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interface (BCI) is an effective technology used for information detection by detecting Event-Related Potentials (ERPs). The current RSVP decoding methods can perform well in decoding EEG signals within a single RSVP task, but their decoding performance significantly decreases when directly applied to different RSVP tasks without calibration data from the new tasks. This limits the rapid and efficient deployment of RSVP-BCI systems for detecting different categories of targets in various scenarios. To overcome this limitation, this study aims to enhance the cross-task zero-calibration RSVP decoding performance. First, we design three distinct RSVP tasks for target image retrieval and build an open-source dataset containing EEG signals and corresponding stimulus images. Then we propose an EEG with Language-Image Prior fusion Transformer (ELIPformer) for cross-task zero-calibration RSVP decoding. Specifically, we propose a prompt encoder based on the language-image pre-trained model to extract language-image features from task-specific prompts and stimulus images as prior knowledge for enhancing EEG decoding. A cross bidirectional attention mechanism is also adopted to facilitate the effective feature fusion and alignment between the EEG and language-image features. Extensive experiments demonstrate that the proposed model achieves superior performance in cross-task zero-calibration RSVP decoding, which promotes the RSVP-BCI system from research to practical application.

Paperid: 1850, https://arxiv.org/pdf/2501.02684.pdf

Abstract:
Background: The increasing adoption of AI assistants in programming has led to numerous studies exploring their benefits. While developers consistently report significant productivity gains from these tools, empirical measurements often show more modest improvements. While prior research has documented self-reported experiences with AI-assisted programming tools, little to no work has been done to understand their usage patterns and the actual cognitive load imposed in practice. Objective: In this exploratory study, we aim to investigate the role AI assistants play in developer productivity. Specifically, we are interested in how developers' expertise levels influence their AI usage patterns, and how these patterns impact their actual cognitive load and productivity during development tasks. We also seek to better understand how this relates to their perceived productivity. Method: We propose a controlled observational study combining physiological measurements (EEG and eye tracking) with interaction data to examine developers' use of AI-assisted programming tools. We will recruit professional developers to complete programming tasks both with and without AI assistance while measuring their cognitive load and task completion time. Through pre- and post-task questionnaires, we will collect data on perceived productivity and cognitive load using NASA-TLX.

Paperid: 1851, https://arxiv.org/pdf/2501.00775.pdf

Abstract:
Extracting insights from qualitative analysis involves a series of reasoning steps, such as open coding, grouping, and identifying themes. We introduce the MindCoder reasoning chain, built on Chain-of-Thought (CoT) prompting, to support the insight extraction process step by step, including topic clustering, code labeling, conceptualization, and reporting. We designed the MindCoder web application to help users 1) automatically run this reasoning chain (i.e., obtain analysis report results in approximately 3-5 minutes) and 2) interactively control the reasoning process on demand. Our technical evaluations assess its reliability across various data types and demonstrate that simulated human iteration can potentially enhance coding quality. A user study further confirmed positive feedback regarding MindCoder's automation and its on-demand reasoning functionality.

Paperid: 1852, https://arxiv.org/pdf/2501.00168.pdf

Abstract:
We present a virtual reality (VR) environment featuring conversational avatars powered by a locally-deployed LLM, integrated with automatic speech recognition (ASR), text-to-speech (TTS), and lip-syncing. Through a pilot study, we explored the effects of three types of avatar status indicators during response generation. Our findings reveal design considerations for improving responsiveness and realism in LLM-driven conversational systems. We also detail two system architectures: one using an LLM-based state machine to control avatar behavior and another integrating retrieval-augmented generation (RAG) for context-grounded responses. Together, these contributions offer practical insights to guide future work in developing task-oriented conversational AI in VR environments.

Paperid: 1853, https://arxiv.org/pdf/2506.22941.pdf

Abstract:
Access to accurate and actionable harm reduction information can directly impact the health outcomes of People Who Use Drugs (PWUD), yet existing online channels often fail to meet their diverse and dynamic needs due to limitations in adaptability, accessibility, and the pervasive impact of stigma. Large Language Models (LLMs) present a novel opportunity to enhance information provision, but their application in such a high-stakes domain is under-explored and presents socio-technical challenges. This paper investigates how LLMs can be responsibly designed to support the information needs of PWUD. Through a qualitative workshop involving diverse stakeholder groups (academics, harm reduction practitioners, and an online community moderator), we explored LLM capabilities, identified potential use cases, and delineated core design considerations. Our findings reveal that while LLMs can address some existing information barriers (e.g., by offering responsive, multilingual, and potentially less stigmatising interactions), their effectiveness is contingent upon overcoming challenges related to ethical alignment with harm reduction principles, nuanced contextual understanding, effective communication, and clearly defined operational boundaries. We articulate design pathways emphasising collaborative co-design with experts and PWUD to develop LLM systems that are helpful, safe, and responsibly governed. This work contributes empirically grounded insights and actionable design considerations for the responsible development of LLMs as supportive tools within the harm reduction ecosystem.

Paperid: 1854, https://arxiv.org/pdf/2506.20952.pdf

Abstract:
Human crowd simulation in virtual reality (VR) is a powerful tool with potential applications including emergency evacuation training and assessment of building layout. While haptic feedback in VR enhances immersive experience, its effect on walking behavior in dense and dynamic pedestrian flows is unknown. Through a user study, we investigated how haptic feedback changes user walking motion in crowded pedestrian flows in VR. The results indicate that haptic feedback changed users' collision avoidance movements, as measured by increased walking trajectory length and change in pelvis angle. The displacements of users' lateral position and pelvis angle were also increased in the instantaneous response to a collision with a non-player character (NPC), even when the NPC was inside the field of view. Haptic feedback also enhanced users' awareness and visual exploration when an NPC approached from the side and back. Furthermore, variation in walking speed was increased by the haptic feedback. These results suggest that the haptic feedback enhanced users' sensitivity to a collision in a VR environment.
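
The walking trajectory length mentioned above can be illustrated with a minimal sketch. It assumes a trajectory is a list of 2D floor positions sampled over time; the function name and the sampling format are hypothetical and are not taken from the paper.

```python
import math

def trajectory_length(points):
    """Sum of straight-line distances between consecutive (x, y) samples.

    `points` is an assumed format: a list of (x, y) floor positions in metres.
    """
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# A path that detours sideways is longer than the direct path between the same endpoints.
direct = [(0.0, 0.0), (0.0, 4.0)]
detour = [(0.0, 0.0), (3.0, 4.0), (0.0, 8.0)]
print(trajectory_length(direct), trajectory_length(detour))
```

Under this reading, a larger value for the same start and end points indicates more lateral avoidance movement.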

Paperid: 1855, https://arxiv.org/pdf/2506.20268.pdf

Abstract:
Detecting miscommunication in human-robot interaction is a critical function for maintaining user engagement and trust. While humans effortlessly detect communication errors in conversations through both verbal and non-verbal cues, robots face significant challenges in interpreting non-verbal feedback, despite advances in computer vision for recognizing affective expressions. This research evaluates the effectiveness of machine learning models in detecting miscommunications in robot dialogue. Using a multi-modal dataset of 240 human-robot conversations, where four distinct types of conversational failures were systematically introduced, we assess the performance of state-of-the-art computer vision models. After each conversational turn, users provided feedback on whether they perceived an error, enabling an analysis of the models' ability to accurately detect robot mistakes. Despite using state-of-the-art models, the performance barely exceeds random chance in identifying miscommunication, while on a dataset with more expressive emotional content, they successfully identified confused states. To explore the underlying cause, we asked human raters to do the same. They could also only identify around half of the induced miscommunications, similarly to our model. These results uncover a fundamental limitation in identifying robot miscommunications in dialogue: even when users perceive the induced miscommunication as such, they often do not communicate this to their robotic conversation partner. This knowledge can shape expectations of the performance of computer vision models and can help researchers to design better human-robot conversations by deliberately eliciting feedback where needed.

Paperid: 1856, https://arxiv.org/pdf/2506.20055.pdf

Abstract:
Online(-only) friendships have become increasingly common in daily lives post-COVID despite debates around their mental health benefits and equivalence to "real" relationships. Previous research has reflected a need to understand how online friends engage beyond individual platforms, and the lack of platform-agnostic inquiry limits our ability to fully understand the dynamics of online friendship. We employed an activity-grounded analysis of 25 interviews on lived experiences of close online friendship spanning multiple years. Our findings present unique challenges and strategies in online friendships, such as stigma from real-life circles, an ambivalent relationship with online communities, and counter-theoretical reappropriations of communication technology. This study contributes to HCI research in online communities and social interface design by refocusing prior impressions of strong vs. weak-ties in online social spaces and foregrounding time-stable interactions in design for relationship maintenance through technology. Our work also promotes critical reflection on biased perspectives towards technology-mediated practices and consideration of online friends as an invisible marginalized community.

Paperid: 1857, https://arxiv.org/pdf/2506.19307.pdf

Abstract:
Presbyopia, a common age-related vision condition affecting most people as they age, often remains inadequately understood by those unaffected. To help bridge the gap between abstract accessibility knowledge and a more grounded appreciation of perceptual challenges, this study presents OpticalAging, an optical see-through simulation approach. Unlike VR-based methods, OpticalAging uses dynamically controlled tunable lenses to simulate the first-person visual perspective of presbyopia's distance-dependent blur during real-world interaction, aiming to enhance awareness. While acknowledging critiques regarding simulation's limitations in fully capturing lived experience, we position this tool as a complement to user-centered methods. Our user study (N = 19, 18-35 years old) provides validation: quantitative measurements show statistically significant changes in near points across three age modes (40s, 50s, 60s), while qualitative results suggest increases in reported understanding and empathy among participants. The integration of our tool into a design task showcases its potential applicability within age-inclusive design workflows when used critically alongside direct user engagement.

Paperid: 1858, https://arxiv.org/pdf/2506.18760.pdf

Abstract:
This paper evaluates the user interface of an in vitro fertilisation (IVF) outcome prediction tool, focussing on its understandability for patients or potential patients. We analyse four years of anonymous patient feedback, followed by a user survey and interviews to quantify trust and understandability. Results highlight a lay user's need for prediction model explainability beyond the model feature space. We identify user concerns about data shifts and model exclusions that impact trust. The results call attention to the shortcomings of current practices in explainable AI research and design and the need for explainability beyond model feature space and epistemic assumptions, particularly in high-stakes healthcare contexts where users gather extensive information and develop complex mental models. To address these challenges, we propose a dialogue-based interface and explore user expectations for personalised explanations.

Paperid: 1859, https://arxiv.org/pdf/2506.18743.pdf

Abstract:
The role of information systems (IS) as representations of real-world systems is changing in an increasingly digitalized world, suggesting that conceptual modeling is losing its relevance to the IS field. We argue the opposite: Conceptual modeling research is more relevant to the IS field than ever, but it requires an update with current theory. We develop a new theoretical framework of conceptual modeling that delivers a fundamental shift in the assumptions that govern research in this area. This move can make traditional knowledge about conceptual modeling consistent with the emerging requirements of a digital world. Our framework draws attention to the role of conceptual modeling scripts as mediators between physical and digital realities. We identify new research questions about grammars, methods, scripts, agents, and contexts that are situated in intertwined physical and digital realities. We discuss several implications for conceptual modeling scholarship that relate to the necessity of developing new methods and grammars for conceptual modeling, broadening the methodological array of conceptual modeling scholarship, and considering new dependent variables.

Paperid: 1860, https://arxiv.org/pdf/2506.18742.pdf

Abstract:
All aspects of our society, including the life sciences, need a mechanism for people working within them to represent the concepts they employ to carry out their research. For the information systems being designed and developed to support researchers and scientists in conducting their work, conceptual models of the relevant domains are usually designed as both blueprints for a system being developed and as a means of communication between the designer and developer. Most conceptual modelling concepts are generic in the sense that they are applied with the same understanding across many applications. Problems in the life sciences, however, are especially complex and important, because they deal with humans, their well-being, and their interactions with the environment as well as other organisms. This work proposes a systemist perspective for creating a conceptual model of a life scientist's problem. We introduce the notion of a system and then show how it can be applied to the development of an information system for handling genomic-related information. We extend our discussion to show how the proposed systemist perspective can support the modelling of precision medicine. This research recognizes challenges in life sciences research of how to model problems to better represent the connections between physical and digital worlds. We propose a new notation that explicitly incorporates systemist thinking, as well as the components of systems based on recent ontological foundations. The new notation captures important semantics in the domain of life sciences. It may be used to facilitate understanding, communication and problem-solving more broadly. We also provide a precise, sound, ontologically supported characterization of the term system, as a basic construct for conceptual modelling in life sciences.

Paperid: 1861, https://arxiv.org/pdf/2506.17570.pdf

Abstract:
Virtual reality (VR) has recently proliferated significantly, consisting of headsets or head-mounted displays (HMDs) and hand controllers for an embodied and immersive experience. The VR device is usually embedded with different kinds of IoT sensors, such as cameras, microphones, communication sensors, etc. However, VR security has not been scrutinized from a physical hardware point of view, especially electromagnetic emanations (EM) that are automatically and unintentionally emitted from the VR headset. This paper presents VReaves, a system that can eavesdrop on the electromagnetic emanation side channel of a VR headset for VR app identification and activity recognition. To do so, we first characterize the electromagnetic emanations from the embedded IoT sensors (e.g., cameras and microphones) in the VR headset through a signal processing pipeline and further propose machine learning models to identify the VR app and recognize the VR app activities. Our experimental evaluation with commercial off-the-shelf VR devices demonstrates the efficiency of VR app identification and activity recognition via electromagnetic emanation side channel.

Paperid: 1862, https://arxiv.org/pdf/2506.16571.pdf

Abstract:
Prior natural language datasets for data visualization have focused on tasks such as visualization literacy assessment, insight generation, and visualization generation from natural language instructions. These studies often rely on controlled setups with purpose-built visualizations and artificially constructed questions. As a result, they tend to prioritize the interpretation of visualizations, focusing on decoding visualizations rather than understanding their encoding. In this paper, we present a new dataset and methodology for probing visualization design rationale through natural language. We leverage a unique source of real-world visualizations and natural language narratives: literate visualization notebooks created by students as part of a data visualization course. These notebooks combine visual artifacts with design exposition, in which students make explicit the rationale behind their design decisions. We also use large language models (LLMs) to generate and categorize question-answer-rationale triples from the narratives and articulations in the notebooks. We then carefully validate the triples and curate a dataset that captures and distills the visualization design choices and corresponding rationales of the students.

Paperid: 1863, https://arxiv.org/pdf/2506.16468.pdf

Abstract:
Restoring movement of a paralyzed foot is a key challenge in helping individuals with neurological conditions such as spinal cord injury (SCI) to improve their quality of life. Neuroprostheses based on functional electrical stimulation (FES) can restore the physiological range of motion by stimulating the affected muscles using surface electrodes. We have previously shown that, despite chronic motor-complete SCI, it is possible to capture paralyzed hand movements in individuals with tetraplegia using spared and modulated motor unit (MU) activity decoded with non-invasive electromyography (EMG) sensors. This study investigated whether a wearable high-density surface EMG system could capture and control paralyzed foot kinematics in closed-loop control with an FES system. We found that all our participants with SCI (2 with chronic SCI and 3 with acute SCI) retained distinct spared EMG activity for at least three ankle movements, which allowed them to reliably control a digital cursor using their spared tibialis anterior and triceps surae MU activity. Movement separability was further reconfirmed by extracting task-modulated MU activity during foot flexion/extension (3-7 modulated MUs/participant). Three participants were further able to modulate and maintain their foot flexion/extension EMG levels with an accuracy of >70%. Lastly, we show that real-time control of a FES system using EMG from the affected limb can restore foot movements in a highly intuitive way, significantly improving the lost or pathological foot range of motion. Our system provides an intuitive approach for closed-loop control of FES that has the potential to assist individuals with SCI in regaining lost motor functions.

Paperid: 1864, https://arxiv.org/pdf/2506.15008.pdf

Abstract:
Generative AI, specifically text-to-image models, has revolutionized interior architectural design by enabling the rapid translation of conceptual ideas into visual representations from simple text prompts. While generative AI can produce visually appealing images, they often lack actionable data for designers. In this work, we propose a novel pipeline that integrates DALL-E 3 with a materials dataset to enrich AI-generated designs with sustainability metrics and material usage insights. After the model generates an interior design image, a post-processing module identifies the top ten materials present and pairs them with carbon dioxide equivalent (CO2e) values from a general materials dictionary. This approach allows designers to immediately evaluate environmental impacts and refine prompts accordingly. We evaluate the system through three user tests: (1) no mention of sustainability to the user prior to the prompting process with generative AI, (2) sustainability goals communicated to the user before prompting, and (3) sustainability goals communicated along with quantitative CO2e data included in the generative AI outputs. Our qualitative and quantitative analyses reveal that the introduction of sustainability metrics in the third test leads to more informed design decisions; however, it can also trigger decision fatigue and lower overall satisfaction. Nevertheless, the majority of participants reported incorporating sustainability principles into their workflows in the third test, underscoring the potential of integrated metrics to guide more ecologically responsible practices. Our findings showcase the importance of balancing design freedom with practical constraints, offering a clear path toward holistic, data-driven solutions in AI-assisted architectural design.
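
The post-processing step described above (select the top ten materials, then pair each with a CO2e value from a materials dictionary) can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the function name, the input format (a flat list of detected material labels), and every CO2e number in the table are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical lookup table: placeholder values for illustration only,
# not figures from the paper or from any real materials database.
CO2E_TABLE = {"oak": 0.4, "steel": 1.9, "glass": 1.2}

def top_materials_with_co2e(detected, co2e_table, k=10):
    """Return (material, detections, co2e) for the k most frequent materials.

    `detected` is an assumed format: one label per detected material region.
    Materials missing from the table get None rather than a guessed value.
    """
    counts = Counter(detected)
    return [
        (material, n, co2e_table.get(material))
        for material, n in counts.most_common(k)
    ]

labels = ["oak", "steel", "oak", "glass", "oak", "steel", "marble"]
for row in top_materials_with_co2e(labels, CO2E_TABLE):
    print(row)
```

Returning None for unknown materials keeps gaps in the dictionary visible to the designer instead of silently treating them as zero impact.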

Paperid: 1865, https://arxiv.org/pdf/2506.14653.pdf

Abstract:
As part of global climate action, digital technologies are seen as a key enabler of energy efficiency savings. A popular application domain for this work is smart homes. There is a risk, however, that these efficiency gains result in rebound effects, which reduce or even overcompensate the savings. Rebound effects are well-established in economics, but it is less clear whether they also inform smart energy research in other disciplines. In this paper, we ask: to what extent have rebound effects and their underlying mechanisms been considered in computing, HCI and smart home research? To answer this, we conducted a literature mapping drawing on four scientific databases and a SIGCHI corpus. Our results reveal limited consideration of rebound effects and significant opportunities for HCI to advance this topic. We conclude with a taxonomy of actions for HCI to address rebound effects and help determine the viability of energy efficiency projects.

Paperid: 1866, https://arxiv.org/pdf/2506.14376.pdf

Abstract:
This paper introduces System 0, a conceptual framework for understanding how artificial intelligence functions as a cognitive extension preceding both intuitive (System 1) and deliberative (System 2) thinking processes. As AI systems increasingly shape the informational substrate upon which human cognition operates, they transform from passive tools into active cognitive partners. Building on the Extended Mind hypothesis and Heersmink's criteria for cognitive extension, we argue that AI systems satisfy key conditions for cognitive integration. These include reliability, trust, transparency, individualization, and the ability to enhance and transform human mental functions. However, AI integration creates a paradox: while expanding cognitive capabilities, it may simultaneously constrain thinking through sycophancy and bias amplification. To address these challenges, we propose seven evidence-based frameworks for effective human-AI cognitive integration: Enhanced Cognitive Scaffolding, which promotes progressive autonomy; Symbiotic Division of Cognitive Labor, strategically allocating tasks based on comparative strengths; Dialectical Cognitive Enhancement, countering AI sycophancy through productive epistemic tension; Agentic Transparency and Control, ensuring users understand and direct AI influence; Expertise Democratization, breaking down knowledge silos; Social-Emotional Augmentation, addressing affective dimensions of cognitive work; and Duration-Optimized Integration, managing the evolving human-AI relationship over time. Together, these frameworks provide a comprehensive approach for harnessing AI as a genuine cognitive extension while preserving human agency, critical thinking, and intellectual growth, transforming AI from a replacement for human cognition into a catalyst for enhanced thinking.

Paperid: 1867, https://arxiv.org/pdf/2506.14166.pdf

Abstract:
Culturally adaptive emotional responses remain a critical challenge in affective computing. This paper introduces Affective-CARA, an agentic framework designed to enhance user-agent interactions by integrating a Cultural Emotion Knowledge Graph (derived from StereoKG) with Valence, Arousal, and Dominance annotations, culture-specific data, and cross-cultural checks to minimize bias. A Gradient-Based Reward Policy Optimization mechanism further refines responses according to cultural alignment, affective appropriateness, and iterative user feedback. A Cultural-Aware Response Mediator coordinates knowledge retrieval, reinforcement learning updates, and historical data fusion. By merging real-time user input with past emotional states and cultural insights, Affective-CARA delivers narratives that are deeply personalized and sensitive to diverse cultural norms. Evaluations on AffectNet, SEMAINE DB, and MERD confirm that the framework consistently outperforms baseline models in sentiment alignment, cultural adaptation, and narrative quality. Affective-CARA achieved a Cultural Semantic Density of 9.32 out of 10 and lowered cultural representation bias by 61% (KL-Divergence: 0.28), demonstrating robust performance in generating ethical, adaptive responses. These findings suggest the potential for more inclusive and empathetic interactions, making Affective-CARA an avenue for fostering culturally grounded user experiences across domains such as cross-cultural communication, mental health support, and education.

Paperid: 1868, https://arxiv.org/pdf/2506.13971.pdf

Abstract:
Group conversations over videoconferencing are a complex social behavior. However, the subjective moments of negative experience, where the conversation loses fluidity or enjoyment, remain understudied. These moments are infrequent in naturalistic data, and thus training a supervised learning (SL) model requires costly manual data annotation. We applied semi-supervised learning (SSL) to leverage targeted labeled and unlabeled clips for training multimodal (audio, facial, text) deep features to predict non-fluid or unenjoyable moments in holdout videoconference sessions. The modality-fused co-training SSL achieved an ROC-AUC of 0.9 and an F1 score of 0.6, outperforming SL models by up to 4% with the same amount of labeled data. Remarkably, the best SSL model with just 8% labeled data matched 96% of the SL model's full-data performance. This shows an annotation-efficient framework for modeling videoconference experience.

Paperid: 1869, https://arxiv.org/pdf/2506.13275.pdf

Abstract:
Learning management systems (LMS) like Moodle are increasingly used to support university teaching. As Moodle courses become more complex, incorporating diverse interactive elements, it is important to understand how students navigate through course sections and whether course designs are meeting student needs. While substantial research exists on student usage of individual LMS elements, there is a lack of research on broader navigational patterns between course sections and how these patterns differ across courses. This study analyzes navigational data from 747 courses in the Moodle LMS at a technical university of applied sciences, representing (after filtering) around 4,400 students and 1.8 million logged events. By mapping section names across a large sample of courses, the analysis enables cross-course comparisons of student navigational sequences between sections. Transition matrices and heat map visualizations are used to identify common navigational patterns. One finding is that many of the generated heat maps include one or more diagonal axes, indicating that students typically navigate from the current section to the next or previous one. More fine-grained patterns show typical behavior for blended learning scenarios. Other patterns include dominant sections.
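As a rough illustration of the transition-matrix idea described in this abstract, the sketch below counts first-order moves between course sections and row-normalises them. It is not the authors' code: the section labels, the sample sessions, and the choice to row-normalise are all assumptions made for the example.

```python
from collections import Counter

def transition_matrix(sequences, sections):
    """Row-normalised first-order transition probabilities between sections.

    sequences: list of per-student section visit sequences (hypothetical input).
    sections:  ordered list of section labels defining rows and columns.
    """
    counts = Counter()
    for seq in sequences:
        # Count each consecutive (source, destination) pair.
        for src, dst in zip(seq, seq[1:]):
            counts[(src, dst)] += 1

    matrix = []
    for src in sections:
        row_total = sum(counts[(src, dst)] for dst in sections)
        # Rows with no outgoing transitions stay all zeros.
        matrix.append([
            counts[(src, dst)] / row_total if row_total else 0.0
            for dst in sections
        ])
    return matrix

# Hypothetical section labels and navigation sessions.
sections = ["intro", "week1", "week2", "exam"]
sessions = [
    ["intro", "week1", "week2", "exam"],
    ["intro", "week1", "week2", "week1"],
    ["week1", "week2", "exam"],
]
matrix = transition_matrix(sessions, sections)
for label, row in zip(sections, matrix):
    print(label, [round(p, 2) for p in row])
```

In a matrix like this, mass just above the main diagonal corresponds to moving to the next section and mass just below it to moving back, which is the kind of diagonal structure the abstract reports seeing in its heat maps.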

Paperid: 1870, https://arxiv.org/pdf/2506.13188.pdf

Abstract:
OpenStreetMap (OSM) is a vital resource for investigative journalists doing geolocation verification. However, existing tools to query OSM data such as Overpass Turbo require familiarity with complex query languages, creating barriers for non-technical users. We present SPOT, an open source natural language interface that makes OSM's rich, tag-based geographic data more accessible through intuitive scene descriptions. SPOT interprets user inputs as structured representations of geospatial object configurations using fine-tuned Large Language Models (LLMs), with results being displayed in an interactive map interface. While more general geospatial search tasks are conceivable, SPOT is specifically designed for use in investigative journalism, addressing real-world challenges such as hallucinations in model output, inconsistencies in OSM tagging, and the noisy nature of user input. It combines a novel synthetic data pipeline with a semantic bundling system to enable robust, accurate query generation. To our knowledge, SPOT is the first system to achieve reliable natural language access to OSM data at this level of accuracy. By lowering the technical barrier to geolocation verification, SPOT contributes a practical tool to the broader efforts to support fact-checking and combat disinformation.

Paperid: 1871, https://arxiv.org/pdf/2506.12910.pdf

Abstract:
In Bangladesh's rapidly expanding informal e-market, small-scale sellers use social media platforms like Facebook to run businesses outside formal infrastructures. These sellers rely heavily on platform algorithms, not just for visibility, but as active collaborators in business operations. Drawing on 37 in-depth interviews with sellers, buyers, and stakeholders, this paper examines how people in informal e-markets perceive and interact with the algorithm as a "team member" that performs sales, marketing, and customer engagement tasks. We found that while sellers and local tech entrepreneurs are interested in developing services to support this industry, buyers and investors place greater trust in human interactions. This reveals a postcolonial tension involving cultural values, local tech education and training, and a mismatch between the global and Bangladeshi e-market growth. We expand this discussion using perspectives from HCI, political design, and AI design. We also support the decoloniality movement in informal e-markets by proposing the DAIEM framework, which includes six components: autonomy and agency; resistance; locality, culture, and history; rationality; materiality; and advocacy. DAIEM serves as both a guideline for algorithm design and an analytical tool.

Paperid: 1872, https://arxiv.org/pdf/2506.12496.pdf

Abstract:
Large Language Models (LLMs) succeed in many natural language processing tasks. However, their tendency to hallucinate - generate plausible but inconsistent or factually incorrect text - can cause significant problems in certain tasks, including response generation in dialogue. To mitigate this issue, we propose two novel graph knowledge-augmented frameworks, Dialogue Response Generation via Textualised Graphs (TG-DRG) and Graph-Aware Dialogue Response Generation (GA-DRG), which combine reasoning-guided dialogue reformulation, dialogue sense knowledge selection, and graph-enhanced response generation to improve the factuality of dialogue responses. To evaluate the factuality of generated responses, we propose a dialogue fact score that addresses the limitations of existing fact-score methods in dialogue settings, providing a more reliable assessment of factual consistency. We evaluate our methods using different baselines on the OpendialKG and HybriDialogue datasets. Our methods noticeably improve factuality compared to other graph knowledge-augmentation baselines, including the state-of-the-art G-retriever, achieving improvements of 3.47% on OpendialKG and 3.12% on HybriDialogue in terms of dialogue fact score. The code will be released on GitHub.

Paperid: 1873, https://arxiv.org/pdf/2506.11788.pdf

Abstract:
Digital workers on crowdsourcing platforms (e.g., Amazon Mechanical Turk, Appen, Clickworker, Prolific) play a crucial role in training and improving AI systems, yet they often face low pay, unfair conditions, and a lack of recognition for their contributions. To map these issues in the existing literature of computer science, AI, and related scholarship, we selected over 300 research papers on digital labor published between 2015 and 2024, narrowing them down to 143 on digital gig-labor for a detailed analysis. This analysis provides a broad overview of the key challenges, concerns, and trends in the field. Our synthesis reveals how the persistent patterns of representation and voices of gig workers in digital labor are structured and governed. We offer new insights for researchers, platform designers, and policymakers, helping them better understand the experiences of digital workers and pointing to key areas where interventions and future investigations are promptly needed. By mapping the findings from the past ten years' growth of the domain and possible implications, this paper contributes to a more coherent and critical understanding of digital labor in contemporary and future AI ecosystems.

Paperid: 1874, https://arxiv.org/pdf/2506.11326.pdf

Abstract:
Project-based learning plays a crucial role in computing education. However, its open-ended nature makes tracking project development and assessing success challenging. We investigate how dialogue and system interaction logs predict project quality during collaborative, project-based AI learning of 94 middle school students working in pairs. We used linguistic features from dialogue transcripts and behavioral features from system logs to predict three project quality outcomes: productivity (number of training phrases), content richness (word density), and lexical variation (word diversity) of chatbot training phrases. We compared the predictive accuracy of each modality and a fusion of the modalities. Results indicate log data better predicts productivity, while dialogue data is more effective for content richness. Both modalities modestly predict lexical variation. Multimodal fusion improved predictions for productivity and lexical variation of training phrases but not content richness. These findings suggest that the value of multimodal fusion depends on the specific learning outcome. The study contributes to multimodal learning analytics by demonstrating the nuanced interplay between behavioral and linguistic data in assessing student learning progress in open-ended AI learning environments.

Paperid: 1875, https://arxiv.org/pdf/2506.10249.pdf

Abstract:
Artificial Intelligence holds significant potential to enhance human creativity. However, achieving this vision requires a clearer understanding of how such enhancement can be effectively realized. Drawing on a relational and distributed cognition perspective, we identify three fundamental modes by which AI can support and shape creative processes: Support, where AI acts as a tool; Synergy, where AI and humans collaborate in complementary ways; and Symbiosis, where human and AI cognition become so integrated that they form a unified creative system. These modes are defined along two key dimensions: the level of technical autonomy exhibited by the AI system (i.e., its ability to operate independently and make decisions without human intervention), and the degree of perceived agency attributed to it (i.e., the extent to which the AI is experienced as an intentional or creative partner). We examine how each configuration influences different levels of creativity, from everyday problem solving to paradigm-shifting innovation, and discuss the implications for ethics, research, and the design of future human-AI creative systems.

Paperid: 1876, https://arxiv.org/pdf/2506.10197.pdf

Abstract:
As artificial intelligence (AI) becomes deeply integrated into family life, immigrant families must navigate unique intergenerational, linguistic, and cultural challenges. This study examines how Korean immigrant families in the United States negotiate the use of AI tools such as ChatGPT and smart assistants in their homes. Through 20 semi-structured interviews with parents and teens, we identify two key practices that shape their engagement: interpretive gatekeeping, where parents mediate their children's AI use through a lens of cultural and ethical values, and convenient critical deferment, where teens strategically postpone critical evaluation of AI for immediate academic and social utility. These intertwined practices challenge conventional, skills-based models of AI literacy, revealing it instead as a dynamic and relational practice co-constructed through ongoing family negotiation. We contribute to information science and HCI by offering a new conceptual extension for intergenerational AI literacy and providing design implications for more equitable, culturally attuned, and family-centered AI systems.

Paperid: 1877, https://arxiv.org/pdf/2506.09696.pdf

Abstract:
Through a series of workshops, we looked at ways to structure and scaffold group dialogue, and support the emergence of novel design patterns. We contrast these sessions--which we ran with other humans--with two "virtual workshops" which we simulated with ChatGPT. Limitations in both human and virtual settings are discussed, alongside lessons learned. We conclude by proposing a development trajectory that combines AI agents, pattern-based design, and institutional governance.

Paperid: 1878, https://arxiv.org/pdf/2506.09073.pdf

Abstract:
We live in an age of unprecedented opportunities to use existing data for tasks not anticipated when those data were collected, resulting in widespread data repurposing. This commentary defines and maps the scope of data repurposing to highlight its importance for organizations and society and the need to study data repurposing as a frontier of data management. We explain how repurposing differs from original data use and data reuse and then develop a framework for data repurposing consisting of concepts and activities for adapting existing data to new tasks. The framework and its implications are illustrated using two examples of repurposing, one in healthcare and one in citizen science. We conclude by suggesting opportunities for research to better understand data repurposing and enable more effective data repurposing practices.

Paperid: 1879, https://arxiv.org/pdf/2506.08881.pdf

Abstract:
The video game industry deals with a fast-paced, competitive and almost unpredictable market. Trends of genres, settings and modalities change on a perpetual basis, studios are often one big hit or miss away from surviving or perishing, and hitting the pulse of the time has become one of the greatest challenges for industry players, investors and other stakeholders. In this work, we aim to support the understanding of video game trends over time based on data-driven analysis, visualization and interpretation of Steam tag evolutions. We confirm underlying groundwork that trends can be categorized into short-lived fads, contemporary fashions, or stable classics, and derive that the surge of a trend averages about four years in the realm of video games. After using industry experts to validate our findings, we deliver visualizations, insights and an open approach to deciphering shifts in video game trends.

Paperid: 1880, https://arxiv.org/pdf/2506.08200.pdf

Abstract:
Music is a powerful medium for influencing listeners' emotional states, and this capacity has driven a surge of research interest in AI-based affective music generation in recent years. Many existing systems, however, are black boxes that are not directly controllable, making them less flexible and adaptive to users. We present AffectMachine-Pop, an expert system capable of generating retro-pop music according to arousal and valence values, which can either be pre-determined or based on a listener's real-time emotion states. To validate the efficacy of the system, we conducted a listening study demonstrating that AffectMachine-Pop is capable of generating affective music at target levels of arousal and valence. The system is tailored for use either as a tool for generating interactive affective music based on user input, or for incorporation into biofeedback or neurofeedback systems to assist users with emotion self-regulation.

Paperid: 1881, https://arxiv.org/pdf/2506.07830.pdf

Abstract:
With respect to digital games, older adults are a demographic that is often underserved due to an industry-wide focus on younger audiences' preferences and skill sets. Meanwhile, as artificial intelligence (AI) continues to expand into everyday technologies, its assistive capabilities have been recognized, suggesting its potential in improving the gaming experience for older gamers. To study this potential, we iteratively developed a pilot survey aimed at understanding older adult gamers' current gameplay preferences, the challenges they are facing, and their perspectives on AI usage in gaming. This article contributes an overview of our iterative survey-design workflow and pilot results from 39 participants. During each iteration, we analyzed the survey's efficacy and adjusted the content, language, and format to better capture meaningful data, and were able to create a refined survey for a larger, more representative future parent study. At the same time, preliminary findings suggest that for older adult gamers, usability issues in gaming remain key obstacles, while this demographic's perceptions of AI are shaped by both its practical benefits and concerns about autonomy and complexity. These findings also offer early insights for the design of age-inclusive, AI-supported gaming experiences.

Paperid: 1882, https://arxiv.org/pdf/2506.07777.pdf

Abstract:
As the population continues to age, and gaming continues to grow as a hobby for older people, heterogeneity among older adult gamers is increasing. We argue that traditional game-based accessibility features, such as simplified input schemes, redundant information channels, and increased legibility of digital user interfaces, are increasingly limited in the face of this heterogeneity. This is because such features affect all older adult players simultaneously and therefore are designed generically. We introduce artificial intelligence, although it has its own limitations and ethical concerns, as a method of creating player-based accessibility features, given the adaptive nature of the emerging technology. These accessibility features may help to address the unique assemblage of accessibility needs an individual may accumulate through age. We adopt insights from gerontology, HCI, and disability studies into the digital game design discourse for older adults, and we contribute insight that can guide the integration of player-based accessibility features to supplement game-based counterparts. The accessibility of digital games for a heterogeneous older adult audience is paramount, as the medium offers short-term social, emotional, psychological, cognitive, and physical benefits that support the long-term goal of aging well.

Paperid: 1883, https://arxiv.org/pdf/2506.07073.pdf

Abstract:
The ultimate purpose of generative music AI is music production. The studio-lab, a social form within the art-science branch of cross-disciplinarity, is a way to advance music production with AI music models. During a studio-lab experiment involving researchers, music producers, and an AI model for music generating bass-like audio, it was observed that the producers used the model's output to convey two or more pitches with a single harmonic complex tone, which in turn revealed that the model had learned to generate structured and coherent simultaneous melodic lines using monophonic sequences of harmonic complex tones. These findings prompt a reconsideration of the long-standing debate on whether humans can perceive harmonics as distinct pitches and highlight how generative AI can not only enhance musical creativity but also contribute to a deeper understanding of music.

Paperid: 1884, https://arxiv.org/pdf/2506.05663.pdf

Abstract:
Older adults with mild cognitive impairment (MCI) often face challenges during meal preparation, such as forgetting ingredients, skipping steps, or leaving appliances on, which can compromise their safety and independence. Our study explores the design of context-aware assistive technologies for meal preparation using a user-centered iterative design process. Through three iterative phases of design and feedback, evolving from a low-tech lightbox to a digital screen, we gained insights into managing diverse contexts and personalizing assistance through collaboration with older adults with MCI and their care partners. We summarized our findings in three key contexts--routine-based, real-time, and situational--that informed strategies for designing context-aware meal prep assistance tailored to users' needs. Our results provide actionable insights for creating technologies to assist meal preparation that are personalized for the unique lifestyles of older adults with MCI, situated in the complex and dynamic homebound context, and respectful of the collaboration between older adults and their care partners.

Paperid: 1885, https://arxiv.org/pdf/2506.04525.pdf

Abstract:
Users of social media platforms based on recommendation systems (RecSys) (e.g. TikTok, X, YouTube) strategically interact with platform content to influence future recommendations. On some such platforms, users have been documented to form large-scale grassroots movements encouraging others to purposefully interact with algorithmically suppressed content in order to "boost" its recommendation; we term this behavior user altruism. To capture this behavior, we study a game between users and a RecSys, where users provide the RecSys (potentially manipulated) preferences over the contents available to them, and the RecSys -- limited by data and computation constraints -- creates a low-rank approximation preference matrix, and ultimately provides each user her (approximately) most-preferred item. We compare the users' social welfare under truthful preference reporting and under a class of strategies capturing user altruism. In our theoretical analysis, we provide sufficient conditions to ensure strict increases in user social welfare under user altruism, and provide an algorithm to find an effective altruistic strategy. Interestingly, we show that for commonly assumed recommender utility functions, effectively altruistic strategies also improve the utility of the RecSys! We show that our results are robust to several model misspecifications, thus strengthening our conclusions. Our theoretical analysis is complemented by empirical results of effective altruistic strategies on the GoodReads dataset, and an online survey on how real-world users behave altruistically in RecSys. Overall, our findings serve as a proof-of-concept of the reasons why traditional RecSys may incentivize users to form collectives and/or follow altruistic strategies when interacting with them.

Paperid: 1886, https://arxiv.org/pdf/2506.03664.pdf

Abstract:
Deep Learning models have achieved remarkable success. Training them is often accelerated by building on top of pre-trained models which poses the risk of perpetuating encoded biases. Here, we investigate biases in the representations of commonly used ImageNet classifiers for facial images while considering intersections of sensitive variables age, race and gender. To assess the biases, we use linear classifier probes and visualize activations as topographic maps. We find that representations in ImageNet classifiers particularly allow differentiation between ages. Less strongly pronounced, the models appear to associate certain ethnicities and distinguish genders in middle-aged groups.

Paperid: 1887, https://arxiv.org/pdf/2506.02514.pdf

Abstract:
Embodiment in conversational agents (CAs) refers to the physical or visual representation of these agents, which can significantly influence user perception and interaction. Limited work has been done examining the effect of embodiment on the perception of CAs utilizing modern large language models (LLMs) in non-hierarchical cooperative tasks, a common use case of CAs as more powerful models become widely available for general use. To bridge this research gap, we conducted a mixed-methods within-subjects study on how users perceive LLM-based CAs in cooperative tasks when embodied and non-embodied. The results show that the non-embodied agent received significantly better quantitative appraisals for competence than the embodied agent, and in qualitative feedback, many participants believed that the embodied CA was more sycophantic than the non-embodied CA. Building on prior work on users' perceptions of LLM sycophancy and anthropomorphic features, we theorize that the typically-positive impact of embodiment on perception of CA credibility can become detrimental in the presence of sycophancy. The implication of such a phenomenon is that, contrary to intuition and existing literature, embodiment is not a straightforward way to improve a CA's perceived credibility if there exists a tendency to sycophancy.

Paperid: 1888, https://arxiv.org/pdf/2506.01998.pdf

Abstract:
Conversational agents that mimic people have raised questions about the ethics of anthropomorphizing machines with human social identity cues. Critics have also questioned assumptions of identity neutrality in humanlike agents. Recent work has revealed that intersectional Japanese pronouns can elicit complex and sometimes evasive impressions of agent identity. Yet, the role of other "neutral" non-pronominal self-referents (NPSR) and voice as a socially expressive medium remains unexplored. In a crowdsourcing study, Japanese participants (N = 204) evaluated three ChatGPT voices (Juniper, Breeze, and Ember) using seven self-referents. We found strong evidence of voice gendering alongside the potential of intersectional self-referents to evade gendering, i.e., ambiguity through neutrality and elusiveness. Notably, perceptions of age and formality intersected with gendering as per sociolinguistic theories, especially boku and watakushi. This work provides a nuanced take on agent identity perceptions and champions intersectional and culturally-sensitive work on voice agents.

Paperid: 1889, https://arxiv.org/pdf/2506.00924.pdf

Abstract:
This paper introduces a dual-layer framework for network operator-side quality of experience (QoE) assessment that integrates both objective network modeling and subjective user perception extracted from live-streaming platforms. On the objective side, we develop a machine learning model trained on mean opinion scores (MOS) computed via the ITU-T P.1203 reference implementation, allowing accurate prediction of user-perceived video quality using only network parameters such as packet loss, delay, jitter, and throughput without reliance on video content or client-side instrumentation. On the subjective side, we present a semantic filtering and scoring pipeline that processes user comments from live streams to extract performance-related feedback. A large language model is used to assign scalar MOS scores to filtered comments in a deterministic and reproducible manner. To support scalable and interpretable analysis, we construct a labeled dataset of 47,894 live-stream comments, of which about 34,000 are identified as QoE-relevant through multi-layer semantic filtering. Each comment is enriched with simulated Internet Service Provider attribution and temporally aligned using synthetic timestamps in 5-min intervals. The resulting dataset enables operator-level aggregation and time-series analysis of user-perceived quality. A delta MOS metric is proposed to measure each Internet service provider's deviation from platform-wide sentiment, allowing detection of localized degradations even in the absence of direct network telemetry. A controlled outage simulation confirms the framework's effectiveness in identifying service disruptions through comment-based trends alone. The system provides each operator with its own subjective MOS and the global platform average per interval, enabling real-time interpretation of performance deviations and comparison with objective network-based QoE estimates.
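The delta MOS idea in this abstract can be sketched as a per-interval difference between an operator's mean score and the platform-wide mean. The abstract does not give the exact formula, so the definition below (operator mean minus the mean over all comments in the same interval) is an assumption, as are the function name, the record layout, and the sample values.

```python
from collections import defaultdict
from statistics import mean

def delta_mos(records):
    """Per-interval deviation of each ISP's mean MOS from the platform mean.

    records: iterable of (interval, isp, mos) tuples (hypothetical layout).
    Returns {(interval, isp): isp_mean - platform_mean}.
    Assumption: the platform mean is taken over all comments in the interval.
    """
    by_interval = defaultdict(list)
    by_pair = defaultdict(list)
    for interval, isp, mos in records:
        by_interval[interval].append(mos)
        by_pair[(interval, isp)].append(mos)

    return {
        (interval, isp): mean(scores) - mean(by_interval[interval])
        for (interval, isp), scores in by_pair.items()
    }

# Made-up scores for two 5-minute intervals (0 and 1) and two operators.
records = [
    (0, "ISP-A", 4.0), (0, "ISP-A", 4.0),
    (0, "ISP-B", 1.0), (0, "ISP-B", 3.0),
    (1, "ISP-A", 3.0), (1, "ISP-B", 3.0),
]
deltas = delta_mos(records)
for key in sorted(deltas):
    print(key, round(deltas[key], 2))
```

Under this definition, a strongly negative value for one operator while the others stay near zero would point to a localized degradation, which is the use the abstract describes.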

Paperid: 1890, https://arxiv.org/pdf/2506.00386.pdf

Abstract:
Effective communication training is essential to preparing nurses for high-quality patient care. While standardized patient (SP) simulations provide valuable experiential learning, they are often costly and inflexible. Virtual patient (VP) systems offer a scalable alternative, but most fail to adapt to the varying communication skills of trainees. In particular, when trainees respond ineffectively, VPs should escalate in hostility or become uncooperative--yet this level of adaptive interaction remains largely unsupported. To address this gap, we introduce Adaptive-VP, a VP dialogue generation framework that leverages large language models (LLMs) to dynamically adapt VP behavior based on trainee input. The framework features a pipeline for constructing clinically grounded yet flexible VP scenarios and a modular system for assessing trainee communication and adjusting VP responses in real time, while ensuring learner safety. We validated Adaptive-VP by simulating challenging patient conversations. Automated evaluation using a corpus from practicing nurses showed that our communication skill evaluation mechanism reflected real-world proficiency levels. Expert nurses further confirmed that Adaptive-VP produced more natural and realistic interactions than existing approaches, demonstrating its potential as a scalable and effective tool for nursing communication training.

Paperid: 1891, https://arxiv.org/pdf/2505.24348.pdf

Abstract:
In this article, we propose a 3D mobile crowdsensing (3D-MCS) framework aimed at sustainable urban digital twins (UDTs). The framework comprises four key mechanisms: (1) the 3D-MCS mechanism, consisting of active and passive models; (2) the Geohash-based spatial information management mechanism; (3) the dynamic point cloud integration mechanism for UDTs; and (4) the web-based real-time visualizer for 3D-MCS and UDTs. The active sensing model features a gamified 3D-MCS approach, where participants collect point cloud data through an augmented reality territory coloring game. In contrast, the passive sensing model employs a wearable 3D-MCS approach, where participants wear smartphones around their necks without disrupting daily activities. The spatial information management mechanism efficiently partitions the space into regions using Geohash. The dynamic point cloud integration mechanism incorporates point clouds collected by 3D-MCS into UDTs through global and local point cloud registration. Finally, we evaluated the proposed framework through real-world experiments. We verified the effectiveness of the proposed 3D-MCS models from the perspectives of subjective evaluation and data collection and analysis. Furthermore, we analyzed the performance of the dynamic point cloud integration using a dataset.
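To make the Geohash-based partitioning mentioned in this abstract concrete, the sketch below implements the standard Geohash encoding (interleaved longitude and latitude bits, base-32 alphabet) and groups sample points by cell. How the paper's framework actually stores or indexes regions is not stated in the abstract; the precision value, the function names, and the sample coordinates are assumptions for illustration.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a Geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits yield one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

def group_by_cell(points, precision=6):
    """Bucket (lat, lon) points by their Geohash cell."""
    cells = defaultdict(list)
    for lat, lon in points:
        cells[geohash_encode(lat, lon, precision)].append((lat, lon))
    return dict(cells)

# Hypothetical sample points: two close together, one far away.
points = [(57.64911, 10.40744), (57.64912, 10.40745), (35.0, 135.0)]
cells = group_by_cell(points, precision=3)
print(cells.keys())
```

Longer hashes denote smaller cells, so the precision parameter controls how finely space is partitioned; points that share a prefix fall in the same coarser cell.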

Paperid: 1892, https://arxiv.org/pdf/2505.24115.pdf

Abstract:
Audio is a rich sensing modality that is useful for a variety of human activity recognition tasks. However, the ubiquitous nature of smartphones and smart speakers with always-on microphones has led to numerous privacy concerns and a lack of trust in deploying these audio-based sensing systems. This paper addresses this critical challenge of preserving user privacy when using audio for sensing applications while maintaining utility. While prior work focuses primarily on protecting recoverable speech content, we show that sensitive speaker-specific attributes such as age and gender can still be inferred after masking speech and propose a comprehensive privacy evaluation framework to assess this speaker attribute leakage. We design and implement FeatureSense, an open-source library that provides a set of generalizable privacy-aware audio features that can be used for a wide range of sensing applications. We present an adaptive task-specific feature selection algorithm that optimizes the privacy-utility-cost trade-off based on the application requirements. Through our extensive evaluation, we demonstrate the high utility of FeatureSense across a diverse set of sensing tasks. Our system outperforms existing privacy techniques by 60.6% in preserving user-specific privacy. This work provides a foundational framework for ensuring trust in audio sensing by enabling effective privacy-aware audio classification systems.

Paperid: 1893, https://arxiv.org/pdf/2505.24039.pdf

Abstract:
The modern healthcare domain incorporates digital accessibility to ensure a seamless flow of online services for patients. However, digital accessibility poses a challenge, particularly for patients with disabilities. To address this issue and provide immersive and user-friendly experiences, evolving technologies such as Augmented Reality (AR) and Virtual Reality (VR) are being integrated into medical applications to enhance accessibility. This paper studies the inclusivity and accessibility features of AR/VR in revolutionizing healthcare practices, especially in domains such as telemedicine, patient education, assistive tools, and rehabilitation for persons with disabilities. Current trends and case studies are also analyzed to measure the efficacy of AR/VR in healthcare. Moreover, the paper provides a detailed analysis of the challenges of adoption, particularly technical limitations, implementation costs, and regulatory aspects. Finally, the paper concludes with recommendations for integrating AR/VR to foster a more equitable and inclusive healthcare system and to provide individuals with auditory, visual, and motor impairments with digital healthcare solutions.

Paperid: 1894, https://arxiv.org/pdf/2505.20788.pdf

Abstract:
Wearable human activity recognition has been shown to benefit from the inclusion of acoustic data, as the sounds around a person often contain valuable context. However, due to privacy concerns, it is usually not ethically feasible to record and save microphone data from the device, since the audio could, for instance, also contain private conversations. Rather, the data should be processed locally, which in turn requires processing power and consumes energy on the wearable device. One special use case of contextual information that can be utilized to augment special tasks in human activity recognition is water flow detection, which can, e.g., be used to aid wearable hand washing detection. We created a new label called tap water for the recently released HD-Epic data set, creating 717 hand-labeled annotations of tap water flow, based on existing annotations of the water class. We analyzed the relation of tap water and water in the dataset and additionally trained and evaluated two lightweight classifiers to evaluate the newly added label class, showing that the new class can be learned more easily.

Paperid: 1895, https://arxiv.org/pdf/2505.20138.pdf

Abstract:
Creating fair opportunities for all participants to contribute is a notable challenge in video conferencing. This paper introduces FairTalk, a system that facilitates the subconscious redistribution of speaking opportunities. FairTalk predicts participants' turn-grabbing intentions using a machine learning model trained on web-collected videoconference data with positive-unlabeled learning, where turn-taking detection provides automatic positive labels. To subtly balance speaking turns, the system visualizes predicted intentions by mimicking natural human behaviors associated with the desire to speak. A user study suggests that FairTalk may help improve speaking balance, though subjective feedback indicates no significant perceived impact. We also discuss design implications derived from participant interviews.

Paperid: 1896, https://arxiv.org/pdf/2505.20082.pdf

Abstract:
Co-viewing videos with family and friends remotely has become prevalent with the support of communication channels such as text messaging or real-time voice chat. However, current co-viewing platforms often lack visible embodied cues, such as body movements and facial expressions. This absence can reduce emotional engagement and the sense of co-presence when people are watching together remotely. Although virtual reality (VR) is an emerging technology that allows individuals to participate in various social activities while embodied as avatars, we still do not fully understand how this embodiment in VR affects co-viewing experiences, particularly in terms of engagement, emotional contagion, and expressive norms. In a controlled experiment involving eight triads of three participants each (N=24), we compared the participants' perceptions and reactions while watching comedy in VR using embodied expressive avatars that displayed visible laughter cues. This was contrasted with a control condition where no such embodied expressions were presented. With a mixed-method analysis, we found that embodied laughter cues shifted participants' engagement from individual immersion to socially coordinated participation. Participants reported heightened self-awareness of emotional expression, greater emotional contagion, and the development of expressive norms surrounding co-viewers' laughter. The result highlighted the tension between individual engagement and interpersonal emotional accommodation when co-viewing with embodied expressive avatars.

Paperid: 1897, https://arxiv.org/pdf/2505.18326.pdf

Abstract:
Older immigrant adults often face layered barriers to digital participation, including language exclusion, generational divides, and emotional fatigue. This study examines how older Korean immigrants in the greater NYC area selectively engage with digital tools such as smartphones, YouTube, and AI platforms. Using a community-based participatory research (CBPR) framework and 22 semi-structured interviews, we identify two key practices: pragmatic disengagement, where users avoid emotionally taxing or culturally misaligned content, and interdependent navigation, where digital use is shaped through reliance on family or community support. These strategies challenge deficit-oriented narratives of non-use, showing how disengagement can be thoughtful, protective, and culturally situated. We contribute to CSCW by expanding theories of non-use and algorithmic resistance and by offering design and policy recommendations to support more dignified, culturally attuned digital engagement for aging immigrant populations.

Paperid: 1898, https://arxiv.org/pdf/2505.17739.pdf

Abstract:
Understanding the causal influence of one agent on another agent is crucial for safely deploying artificially intelligent systems such as automated vehicles and mobile robots into human-inhabited environments. Existing models of causal responsibility deal with simplified abstractions of scenarios with discrete actions, thus limiting real-world use when understanding responsibility in spatial interactions. Based on the assumption that spatially interacting agents are embedded in a scene and must follow an action at each instant, Feasible Action-Space Reduction (FeAR) was proposed as a metric for causal responsibility in a grid-world setting with discrete actions. Since real-world interactions involve continuous action spaces, this paper proposes a formulation of the FeAR metric for measuring causal responsibility in space-continuous interactions. We illustrate the utility of the metric in prototypical space-sharing conflicts, and showcase its applications for analysing backward-looking responsibility and in estimating forward-looking responsibility to guide agent decision making. Our results highlight the potential of the FeAR metric for designing and engineering artificial agents, as well as for assessing the responsibility of agents around humans.

Paperid: 1899, https://arxiv.org/pdf/2505.17593.pdf

Abstract:
Generative AI offers potential for educational support, but often lacks pedagogical grounding and awareness of the student's learning context. Furthermore, researching student interactions with these tools within authentic learning environments remains challenging. To address this, we present JELAI, an open-source platform architecture designed to integrate fine-grained Learning Analytics (LA) with Large Language Model (LLM)-based tutoring directly within a Jupyter Notebook environment. JELAI employs a modular, containerized design featuring JupyterLab extensions for telemetry and chat, alongside a central middleware handling LA processing and context-aware LLM prompt enrichment. This architecture enables the capture of integrated code interaction and chat data, facilitating real-time, context-sensitive AI scaffolding and research into student behaviour. We describe the system's design, implementation, and demonstrate its feasibility through system performance benchmarks and two proof-of-concept use cases illustrating its capabilities for logging multi-modal data, analysing help-seeking patterns, and supporting A/B testing of AI configurations. JELAI's primary contribution is its technical framework, providing a flexible tool for researchers and educators to develop, deploy, and study LA-informed AI tutoring within the widely used Jupyter ecosystem.

Paperid: 1900, https://arxiv.org/pdf/2505.16057.pdf

Abstract:
AI-Generated (AIG) content has become increasingly widespread owing to recent advances in generative models and easy-to-use tools that have significantly lowered the technical barriers for producing highly realistic audio, images, and videos through simple natural language prompts. In response, platforms are adopting provable provenance and recommending that AIG content be self-disclosed and signaled to users. However, these indicators are often missed, especially when they rely solely on visual cues, which makes them ineffective for users with different sensory abilities. To address the gap, we conducted semi-structured interviews (N=28) with 15 sighted and 13 BLV participants to examine their interaction with AIG content through self-disclosed AI indicators. Our findings reveal diverse mental models and practices, highlighting different strengths and weaknesses of content-based (e.g., title, description) and menu-aided (e.g., AI labels) indicators. While sighted participants leveraged visual and audio cues, BLV participants primarily relied on audio and existing assistive tools, limiting their ability to identify AIG. Participants in both groups frequently overlooked menu-aided indicators deployed by platforms and instead interacted with content-based indicators such as title and comments. We uncovered usability challenges stemming from inconsistent indicator placement, unclear metadata, and cognitive overload. These issues were especially critical for BLV individuals due to the insufficient accessibility of interface elements. We provide practical recommendations and design implications for future AIG indicators across several dimensions.

Paperid: 1901, https://arxiv.org/pdf/2505.15031.pdf

Abstract:
Peer review is vital in academia for evaluating research quality. Top AI conferences use reviewer confidence scores to ensure review reliability, but existing studies lack fine-grained analysis of text-score consistency, potentially missing key details. This work assesses consistency at word, sentence, and aspect levels using deep learning and NLP conference review data. We employ deep learning to detect hedge sentences and aspects, then analyze report length, hedge word/sentence frequency, aspect mentions, and sentiment to evaluate text-score alignment. Correlation, significance, and regression tests examine confidence scores' impact on paper outcomes. Results show high text-score consistency across all levels, with regression revealing higher confidence scores correlate with paper rejection, validating expert assessments and peer review fairness.

Paperid: 1902, https://arxiv.org/pdf/2505.14844.pdf

Abstract:
Many social media platforms allow content creators to pin user comments in response to their content. Once pinned, a comment remains fixed at the top of the comments section, regardless of subsequent activity or the selected sorting order. The "Pin of Shame" refers to an innovative re-purposing of this feature, where creators intentionally pin norm-violating comments to spotlight them and prompt shaming responses from their audiences. This study explores how creators adopt this emerging moderation tactic, examining their motivations, its outcomes, and how it compares, procedurally and in effect, to other content moderation strategies. Through interviews with 20 content creators who had pinned negative comments on their posts, we find that the Pin of Shame is used to punish and educate inappropriate commenters, elicit emotional accountability, provoke audience negotiation of community norms, and support creators' impression management goals. Our findings shed light on the benefits, precarities, and risks of using public shaming as a tool for norm enforcement. We contribute to HCI research by informing the design of user-centered tools for addressing content-based harm.

Paperid: 1903, https://arxiv.org/pdf/2505.14842.pdf

Abstract:
Understanding collision avoidance behavior is of key importance in traffic safety research and for designing and evaluating advanced driver assistance systems and autonomous vehicles. While existing experimental work has primarily focused on response timing in traffic conflicts, the goal of the present study was to gain a better understanding of human evasive maneuver decisions and execution in collision avoidance scenarios. To this end, we designed a driving simulator study where participants were exposed to one of three surprising opposite direction lateral incursion (ODLI) scenario variants. The results demonstrated that both the participants' collision avoidance behavior patterns and the collision outcome were strongly determined by the scenario kinematics and, more specifically, by the uncertainty associated with the oncoming vehicle's future trajectory. We discuss pitfalls related to hindsight bias when judging the quality of evasive maneuvers in uncertain situations and suggest that the availability of escape paths in collision avoidance scenarios can be usefully understood based on the notion of affordances, and further demonstrate how such affordances can be operationalized in terms of reachable sets. We conclude by discussing how these results can be used to inform computational models of collision avoidance behavior.

Paperid: 1904, https://arxiv.org/pdf/2505.14080.pdf

Abstract:
Language models encode and subsequently perpetuate harmful gendered stereotypes. Research has succeeded in mitigating some of these harms, e.g. by dissociating non-gendered terms such as occupations from gendered terms such as 'woman' and 'man'. This approach, however, remains superficial given that associations are only one form of prejudice through which gendered harms arise. Critical scholarship on gender, such as gender performativity theory, emphasizes how harms often arise from the construction of gender itself, such as conflating gender with biological sex. In language models, these issues could lead to the erasure of transgender and gender diverse identities and cause harms in downstream applications, from misgendering users to misdiagnosing patients based on wrong assumptions about their anatomy. For FAccT research on gendered harms to go beyond superficial linguistic associations, we advocate for a broader definition of 'gender bias' in language models. We operationalize insights on the construction of gender through language from gender studies literature and then empirically test how 16 language models of different architectures, training datasets, and model sizes encode gender. We find that language models tend to encode gender as a binary category tied to biological sex, and that gendered terms that do not neatly fall into one of these binary categories are erased and pathologized. Finally, we show that larger models, which achieve better results on performance benchmarks, learn stronger associations between gender and sex, further reinforcing a narrow understanding of gender. Our findings lead us to call for a re-evaluation of how gendered harms in language models are defined and addressed.

Paperid: 1905, https://arxiv.org/pdf/2505.13648.pdf

Abstract:
Conceptual modeling is an important part of information systems development and use that involves identifying and representing relevant aspects of reality. Although the past decades have experienced continuous digitalization of services and products that impact business and society, conceptual modeling efforts are still required to support new technologies as they emerge. This paper surveys research on conceptual modeling over the past five decades and shows how its topics and trends continue to evolve to accommodate emerging technologies, while remaining grounded in basic constructs. We survey over 5,300 papers that address conceptual modeling topics from the 1970s to the present, which are collected from 35 multidisciplinary journals and conferences, and use them as the basis from which to analyze the progression of conceptual modeling. The important role that conceptual modeling should play in our evolving digital world is discussed, and future research directions are proposed.

Paperid: 1906, https://arxiv.org/pdf/2505.12718.pdf

Abstract:
Recent advances in Generative Artificial Intelligence (GenAI) have transformed educational content creation, particularly in developing tutor training materials. However, biases embedded in AI-generated content--such as gender, racial, or national stereotypes--raise significant ethical and educational concerns. Despite the growing use of GenAI, systematic methods for detecting and evaluating such biases in educational materials remain limited. This study proposes an automated bias assessment approach that integrates the Contextualized Embedding Association Test with a prompt-engineered word extraction method within a Retrieval-Augmented Generation framework. We applied this method to AI-generated texts used in tutor training lessons. Results show a high alignment between the automated and manually curated word sets, with a Pearson correlation coefficient of r = 0.993, indicating reliable and consistent bias assessment. Our method reduces human subjectivity and enhances fairness, scalability, and reproducibility in auditing GenAI-produced educational content.
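
The alignment figure reported above is a Pearson correlation coefficient. As a minimal sketch of how such a coefficient is computed from two paired lists of scores, the following uses only the standard library. The function name `pearson_r` and the sample values in the usage note are illustrative assumptions and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)
```

For example, `pearson_r(automated_scores, manual_scores)` with two hypothetical lists of per-lesson bias scores would return a value in [-1, 1], where values near 1 indicate that the automated and manually curated word sets yield nearly linearly related scores.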

Paperid: 1907, https://arxiv.org/pdf/2505.12114.pdf

Abstract:
AI-enhanced personality assessments are increasingly shaping hiring decisions, using affective computing to predict traits from the Big Five (OCEAN) model. However, integrating AI into these assessments raises ethical concerns, especially around bias amplification rooted in training data. These biases can lead to discriminatory outcomes based on protected attributes like gender, ethnicity, and age. To address this, we introduce a counterfactual-based framework to systematically evaluate and quantify bias in AI-driven personality assessments. Our approach employs generative adversarial networks (GANs) to generate counterfactual representations of job applicants by altering protected attributes, enabling fairness analysis without access to the underlying model. Unlike traditional bias assessments that focus on unimodal or static data, our method supports multimodal evaluation spanning visual, audio, and textual features. This comprehensive approach is particularly important in high-stakes applications like hiring, where third-party vendors often provide AI systems as black boxes. Applied to a state-of-the-art personality prediction model, our method reveals significant disparities across demographic groups. We also validate our framework using a protected attribute classifier to confirm the effectiveness of our counterfactual generation. This work provides a scalable tool for fairness auditing of commercial AI hiring platforms, especially in black-box settings where training data and model internals are inaccessible. Our results highlight the importance of counterfactual approaches in improving ethical transparency in affective computing.

Paperid: 1908, https://arxiv.org/pdf/2505.11579.pdf

Abstract:
As AI systems evolve from static tools to dynamic agents, traditional categorical governance frameworks -- based on fixed risk tiers, levels of autonomy, or human oversight models -- are increasingly insufficient on their own. Systems built on foundation models, self-supervised learning, and multi-agent architectures increasingly blur the boundaries that categories were designed to police. In this Perspective, we make the case for dimensional governance: a framework that tracks how decision authority, process autonomy, and accountability (the 3As) distribute dynamically across human-AI relationships. A critical advantage of this approach is its ability to explicitly monitor system movement toward and across key governance thresholds, enabling preemptive adjustments before risks materialize. This dimensional approach provides the necessary foundation for more adaptive categorization, enabling thresholds and classifications that can evolve with emerging capabilities. While categories remain essential for decision-making, building them upon dimensional foundations allows for context-specific adaptability and stakeholder-responsive governance that static approaches cannot achieve. We outline key dimensions, critical trust thresholds, and practical examples illustrating where rigid categorical frameworks fail -- and where a dimensional mindset could offer a more resilient and future-proof path forward for both governance and innovation at the frontier of artificial intelligence.

Paperid: 1909, https://arxiv.org/pdf/2505.10398.pdf

Abstract:
Incorporating an autonomous auxiliary camera into robot-assisted minimally invasive surgery (RAMIS) enhances spatial awareness and eliminates manual viewpoint control. Existing path planning methods for auxiliary cameras track two-dimensional surgical features but do not simultaneously account for camera orientation, workspace constraints, and robot joint limits. This study presents AutoCam: an automatic auxiliary camera placement method to improve visualization in RAMIS. Implemented on the da Vinci Research Kit, the system uses a priority-based, workspace-constrained control algorithm that combines heuristic geometric placement with nonlinear optimization to ensure robust camera tracking. A user study (N=6) demonstrated that the system maintained 99.84% visibility of a salient feature and achieved a pose error of 4.36 $\pm$ 2.11 degrees and 1.95 $\pm$ 5.66 mm. The controller was computationally efficient, with a loop time of 6.8 $\pm$ 12.8 ms. An additional pilot study (N=6), where novices completed a Fundamentals of Laparoscopic Surgery training task, suggests that users can teleoperate just as effectively from AutoCam's viewpoint as from the endoscope's while still benefiting from AutoCam's improved visual coverage of the scene. These results indicate that an auxiliary camera can be autonomously controlled using the da Vinci patient-side manipulators to track a salient feature, laying the groundwork for new multi-camera visualization methods in RAMIS.

Paperid: 1910, https://arxiv.org/pdf/2505.10312.pdf

Abstract:
In Human Activity Recognition (HAR), obtaining high-quality, high-variance data remains a persistent challenge due to high costs and the inherent variability of real-world activities. This study introduces a dataset generated with deep learning approaches (an Attention Autoencoder and conditional Generative Adversarial Networks). Data heterogeneity is another critical challenge; one solution is to shuffle the data to homogenize the distribution. Experimental results demonstrate that the random sequence strategy significantly improves classification performance, achieving an accuracy of up to 0.70 $\pm$ 0.03 and a macro F1 score of 0.64 $\pm$ 0.01. Disrupting temporal dependencies through random sequence reordering compels the model to focus on instantaneous recognition, thereby improving robustness against activity transitions. This approach not only broadens the effective training dataset but also offers promising avenues for enhancing HAR systems in complex, real-world scenarios.

Paperid: 1911, https://arxiv.org/pdf/2505.09819.pdf

Abstract:
State-of-the-art upper limb myoelectric prostheses often use pattern recognition (PR) control systems that translate electromyography (EMG) signals into desired movements. As prosthesis movement complexity increases, users often struggle to produce sufficiently distinct EMG patterns for reliable classification. Existing training typically involves heuristic, trial-and-error user adjustments to static decoder boundaries. Goal: We introduce the Reviewer, a 3D visual interface projecting EMG signals directly into the decoder's classification space, providing intuitive, real-time insight into PR algorithm behavior. This structured feedback reduces cognitive load and fosters mutual, data-driven adaptation between user-generated EMG patterns and decoder boundaries. Methods: A 10-session study with 12 able-bodied participants compared PR performance after motor-based training and updating using the Reviewer versus conventional virtual arm visualization. Performance was assessed using a Fitts' law task that involved control of cursor aperture and orientation. Results: Participants trained with the Reviewer achieved higher completion rates, reduced overshoot, and improved path efficiency and throughput compared to the standard visualization group. Significance: The Reviewer introduces decoder-informed motor training, facilitating immediate and consistent PR-based myoelectric control improvements. By iteratively refining control through real-time feedback, this approach reduces reliance on trial-and-error recalibration, enabling a more adaptive, self-correcting training framework. Conclusion: The 3D visual feedback significantly improves PR control in novice operators through structured training, enabling feedback-driven adaptation and reducing reliance on extensive heuristic adjustments.
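
The throughput measure reported above comes from Fitts' law tasks. A minimal sketch of one common way to compute it follows, assuming the Shannon formulation of the index of difficulty, ID = log2(D/W + 1), and throughput = ID / movement time. The abstract does not state which formulation the study used, and the function names here are hypothetical.

```python
import math


def index_of_difficulty(distance: float, width: float) -> float:
    """Shannon formulation of Fitts' index of difficulty, in bits (assumed here)."""
    if distance < 0 or width <= 0:
        raise ValueError("distance must be >= 0 and width must be > 0")
    return math.log2(distance / width + 1)


def throughput(distance: float, width: float, movement_time_s: float) -> float:
    """Throughput in bits per second for a single trial."""
    if movement_time_s <= 0:
        raise ValueError("movement time must be positive")
    return index_of_difficulty(distance, width) / movement_time_s
```

Under this formulation, a target at distance 7 with width 1 (in the same units) has an index of difficulty of 3 bits, so reaching it in 1.5 s corresponds to a throughput of 2 bits/s.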

Paperid: 1912, https://arxiv.org/pdf/2505.09806.pdf

Abstract:
Voting advice applications (VAAs), which have become increasingly prominent in European elections, are seen as a successful tool for boosting electorates' political knowledge and engagement. However, VAAs' complex language and rigid presentation constrain their utility for less-sophisticated voters. While previous work enhanced VAAs' click-based interaction with scripted explanations, a conversational chatbot's potential for tailored discussion and deliberate political decision-making remains untapped. Our exploratory mixed-method study investigates how LLM-based chatbots can support voting preparation. We deployed a VAA chatbot to 331 users before Germany's 2024 European Parliament election, gathering insights from surveys, conversation logs, and 10 follow-up interviews. Participants found the VAA chatbot intuitive and informative, citing its simple language and flexible interaction. We further uncovered VAA chatbots' role as a catalyst for reflection and rationalization. Expanding on participants' desire for transparency, we provide design recommendations for building interactive and trustworthy VAA chatbots.

Paperid: 1913, https://arxiv.org/pdf/2505.08515.pdf

Abstract:
Children with Autism commonly face difficulties in vocabulary acquisition, which can have an impact on their social communication. Using digital tools for vocabulary learning can prove beneficial for these children, as they can provide a predictable environment and effective individualized feedback. While existing work has explored the use of technology-assisted vocabulary learning for children with Autism, no study has incorporated turn-taking to facilitate learning and use of vocabulary similar to that used in real-world social contexts. To address this gap, we propose the design of a cooperative two-player vocabulary learning game, CoVoL. CoVoL allows children to engage in game-based vocabulary learning useful for real-world social communication scenarios. We discuss our first prototype and its evaluation. Additionally, we present planned features which are based on feedback obtained through ten interviews with researchers and therapists, as well as an evaluation plan for the final release of CoVoL.

Paperid: 1914, https://arxiv.org/pdf/2505.08312.pdf

Abstract:
Augmented Virtuality integrates physical content into virtual environments, but the occlusion of physical by virtual content is a challenge. This unwanted occlusion may disrupt user interactions with physical devices and compromise safety and usability. This paper investigates two resolution strategies to address this issue: Redirected Walking, which subtly adjusts the user's movement to maintain physical-virtual alignment, and Automatic Teleport Rotation, which realigns the virtual environment during travel. A user study set in a virtual forest demonstrates that both methods effectively reduce occlusion. While in our testbed, Automatic Teleport Rotation achieves higher occlusion resolution, it is suspected to increase cybersickness compared to the less intrusive Redirected Walking approach.

Paperid: 1915, https://arxiv.org/pdf/2505.07339.pdf

Abstract:
Affirmative algorithms have emerged as a potential answer to algorithmic discrimination, seeking to redress past harms and rectify the source of historical injustices. We present the results of two experiments ($N = 1193$) capturing laypeople's perceptions of affirmative algorithms -- those which explicitly prioritize the historically marginalized -- in hiring and criminal justice. We contrast these opinions about affirmative algorithms with folk attitudes towards algorithms that prioritize the privileged (i.e., discriminatory) and systems that make decisions independently of demographic groups (i.e., fair). We find that people -- regardless of their political leaning and identity -- view fair algorithms favorably and denounce discriminatory systems. In contrast, we identify disagreements concerning affirmative algorithms: liberals and racial minorities rate affirmative systems as positively as their fair counterparts, whereas conservatives and those from the dominant racial group evaluate affirmative algorithms as negatively as discriminatory systems. We identify a source of these divisions: people have varying beliefs about who (if anyone) is marginalized, shaping their views of affirmative algorithms. We discuss the possibility of bridging these disagreements to bring people together towards affirmative algorithms.

Paperid: 1916, https://arxiv.org/pdf/2505.07282.pdf

Abstract:
As digital platforms increasingly mediate interactions tied to place, ensuring genuine local participation is essential for maintaining trust and credibility in location-based services, community-driven platforms, and civic engagement systems. However, localness is a social and relational identity shaped by knowledge, participation, and community recognition. Drawing on the German philosopher Heidegger's concept of dwelling -- which extends beyond physical presence to encompass meaningful connection to place -- we investigate how people conceptualize and evaluate localness in both human and artificial agents. Using a chat-based interaction paradigm inspired by Turing's Imitation Game and Von Ahn's Games With A Purpose, we engaged 230 participants in conversations designed to examine the cues people rely on to assess local presence. Our findings reveal a multi-dimensional framework of localness, highlighting differences in how locals and nonlocals emphasize various aspects of local identity. We show that people are significantly more accurate in recognizing locals than nonlocals, suggesting that localness is an affirmative status requiring active demonstration rather than merely the absence of nonlocal traits. Additionally, we identify conditions under which artificial agents are perceived as local and analyze participants' sensemaking strategies in evaluating localness. Through predictive modeling, we determine key factors that drive accurate localness judgments. By bridging theoretical perspectives on human-place relationships with practical challenges in digital environments, our work informs the design of location-based services that foster meaningful local engagement. Our findings contribute to a broader understanding of localness as a dynamic and relational construct, reinforcing the importance of dwelling as a process of belonging, recognition, and engagement with place.

Paperid: 1917, https://arxiv.org/pdf/2505.05660.pdf

Abstract:
As large language models (LLMs) increasingly adapt and personalize to diverse sets of users, there is an increased risk of systems appropriating sociolects, i.e., language styles or dialects that are associated with specific minoritized lived experiences (e.g., African American English, Queer slang). In this work, we examine whether sociolect usage by an LLM agent affects user reliance on its outputs and user perception (satisfaction, frustration, trust, and social presence). We designed and conducted user studies where 498 African American English (AAE) speakers and 487 Queer slang speakers performed a set of question-answering tasks with LLM-based suggestions in either standard American English (SAE) or their self-identified sociolect. Our findings showed that sociolect usage by LLMs influenced both reliance and perceptions, though in some surprising ways. Results suggest that both AAE and Queer slang speakers relied more on the SAE agent, and had more positive perceptions of the SAE agent. Yet, only Queer slang speakers felt more social presence from the Queer slang agent over the SAE one, whereas only AAE speakers preferred and trusted the SAE agent over the AAE one. These findings emphasize the need to test for behavioral outcomes rather than simply assume that personalization would lead to a better and safer reliance outcome. They also highlight the nuanced dynamics of minoritized language in machine interactions, underscoring the need for LLMs to be carefully designed to respect cultural and linguistic boundaries while fostering genuine user engagement and trust.

Paperid: 1918, https://arxiv.org/pdf/2505.04886.pdf

Abstract:
Regression-based predictive analytics used in modern kidney transplantation is known to inherit biases from training data. This leads to social discrimination and inefficient organ utilization, particularly in the context of a few social groups. Despite this concern, there is limited research on fairness in regression and its impact on organ utilization and placement. This paper introduces three novel divergence-based group fairness notions: (i) independence, (ii) separation, and (iii) sufficiency to assess the fairness of regression-based analytics tools. In addition, fairness preferences are investigated from crowd feedback, in order to identify a socially accepted group fairness criterion for evaluating these tools. A total of 85 participants were recruited from the Prolific crowdsourcing platform, and a Mixed-Logit discrete choice model was used to model fairness feedback and estimate social fairness preferences. The findings clearly depict a strong preference towards the separation and sufficiency fairness notions, and that the predictive analytics is deemed fair with respect to gender and race groups, but unfair in terms of age groups.
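
As a rough illustration of a divergence-based independence check for regression outputs, the sketch below compares the distribution of predicted scores across groups. The abstract does not state which divergence the paper uses; the Jensen-Shannon divergence, the histogram binning, and all function names here are assumptions made for illustration only.

```python
import math

def histogram(values, bins, lo, hi):
    """Normalized histogram of values over `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values (and v == hi) into the first/last bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two discrete distributions, in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def independence_gap(predictions, groups, bins=5, lo=0.0, hi=1.0):
    """Largest pairwise divergence between per-group prediction distributions.

    Assumed stand-in for an independence-style notion: a value of 0 means the
    predicted scores are distributed identically across groups.
    """
    by_group = {}
    for pred, g in zip(predictions, groups):
        by_group.setdefault(g, []).append(pred)
    names = sorted(by_group)
    hists = {g: histogram(by_group[g], bins, lo, hi) for g in names}
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max((js_divergence(hists[a], hists[b]) for a, b in pairs), default=0.0)

# Hypothetical predicted scores for two groups.
gap_same = independence_gap([0.1, 0.9, 0.1, 0.9], ["a", "a", "b", "b"])
gap_split = independence_gap([0.1, 0.1, 0.9, 0.9], ["a", "a", "b", "b"])
```

Separation and sufficiency could be sketched the same way by conditioning on the true outcome or on the prediction, respectively, before comparing groups.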

Paperid: 1919, https://arxiv.org/pdf/2505.04551.pdf

Abstract:
Complex systems, such as small Uncrewed Aerial Systems (sUAS) swarms dispatched for emergency response, often require dynamic reconfiguration at runtime under the supervision of human operators. This introduces human-on-the-loop requirements, where evolving needs shape ongoing system functionality and behaviors. While traditional personas support upfront, static requirements elicitation, we propose a persona-based advocate framework for runtime requirements engineering to provide ethically informed, safety-driven, and regulatory-aware decision support. Our approach extends standard personas into event-driven personas. When triggered by events such as adverse environmental conditions, evolving mission state, or operational constraints, the framework updates the sUAS operator's view of the personas, ensuring relevance to current conditions. We create three key advocate personas, namely Safety Controller, Ethical Governor, and Regulatory Auditor, to manage trade-offs among risk, ethical considerations, and regulatory compliance. We perform a proof-of-concept validation in an emergency response scenario using sUAS, showing how our advocate personas provide context-aware guidance grounded in safety, regulatory, and ethical constraints. By evolving static, design-time personas into adaptive, event-driven advocates, the framework surfaces mission-critical runtime requirements in response to changing conditions. These requirements shape operator decisions in real time, aligning actions with the operational demands of the moment.
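
The event-driven persona idea can be pictured as a small dispatch table. The sketch below is hypothetical throughout: the persona names come from the abstract, but the event names, advisory texts, and the `Persona` and `operator_view` identifiers are invented for illustration and are not the authors' framework.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """An advocate persona that reacts to a fixed set of event types."""
    name: str
    triggers: dict = field(default_factory=dict)  # event type -> advisory text

    def advise(self, event):
        """Return this persona's advisory for the event, or None if not triggered."""
        return self.triggers.get(event)

def operator_view(personas, event):
    """Collect advisories from every persona triggered by the event."""
    view = []
    for persona in personas:
        message = persona.advise(event)
        if message is not None:
            view.append((persona.name, message))
    return view

# Hypothetical events and advisories for the three personas named in the abstract.
personas = [
    Persona("Safety Controller", {"high_wind": "Reduce altitude and widen spacing."}),
    Persona("Ethical Governor", {"crowd_detected": "Avoid flying over bystanders."}),
    Persona("Regulatory Auditor", {"high_wind": "Check wind limits for this airframe.",
                                   "airspace_change": "Re-verify flight authorization."}),
]

wind_view = operator_view(personas, "high_wind")
```

Only the personas relevant to the current event appear in the operator's view, which is the behavior the abstract describes as keeping guidance tied to current conditions.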

Paperid: 1920, https://arxiv.org/pdf/2505.04548.pdf

Abstract:
This work introduces a robotic dummy head that fuses the acoustic realism of conventional audiological mannequins with the mobility of robots. The proposed device is capable of moving, talking, and listening as people do, and can be used to automate spatially-stationary audio experiments, thus accelerating the pace of audio research. Critically, the device may also be used as a moving sound source in dynamic experiments, due to its quiet motor. This feature differentiates our work from previous robotic acoustic research platforms. Validation that the robot enables high quality audio data collection is provided through various experiments and acoustic measurements. These experiments also demonstrate how the robot might be used to study adaptive binaural beamforming. Design files are provided as open-source to stimulate novel audio research.

Paperid: 1921, https://arxiv.org/pdf/2505.02694.pdf

Abstract:
Serious illness communication (SIC) in end-of-life care faces challenges such as emotional stress, cultural barriers, and balancing hope with honesty. Despite its importance, one of the few available ways for clinicians to practice SIC is with standardized patients, which is expensive, time-consuming, and inflexible. In this paper, we present SOPHIE, an AI-powered standardized patient simulation and automated feedback system. SOPHIE combines large language models (LLMs), a lifelike virtual avatar, and automated, personalized feedback based on clinical literature to provide remote, on-demand SIC training. In a randomized control study with healthcare students and professionals, SOPHIE users demonstrated significant improvement across three critical SIC domains: Empathize, Be Explicit, and Empower. These results suggest that AI-driven tools can enhance complex interpersonal communication skills, offering scalable, accessible solutions to address a critical gap in clinician education.

Paperid: 1922, https://arxiv.org/pdf/2505.02414.pdf

Abstract:
Unlike their biological cousins, the majority of existing quadrupedal robots are constructed with rigid chassis. This results in motion that is either beetle-like or distinctly robotic, lacking the natural fluidity characteristic of mammalian movements. Existing literature on quadrupedal robots with spinal configurations primarily focuses on energy efficiency and does not consider the effects in human-robot interaction scenarios. Our contributions include an initial investigation into various trajectory generation strategies for a quadrupedal robot with a four-degree-of-freedom spine, and an analysis of the effect that such methods have on human perception of gait naturalness compared to a fixed-spine baseline. The strategies were evaluated using videos of walking, trotting, and turning simulations. Among the four strategies developed, the optimised time-varying and the foot-tracking strategies were perceived to be more natural than the baseline in a randomised trial with 50 participants. Although none of the strategies demonstrated any energy efficiency improvements over the no-spine baseline, some showed greater footfall consistency at higher speeds. Given the greater likeability drawn from the more natural locomotion patterns, this type of robot displays potential for applications in social robot scenarios such as elderly care, where energy efficiency is not a primary concern.

Paperid: 1923, https://arxiv.org/pdf/2505.01680.pdf

Abstract:
Manual scoring of the Action Research Arm Test (ARAT) for upper extremity assessment in stroke rehabilitation is time-intensive and variable. We propose an automated ARAT scoring system integrating multimodal video analysis with SlowFast, I3D, and Transformer-based models using OpenPose keypoints and object locations. Our approach employs multi-view data (ipsilateral, contralateral, and top perspectives), applying early and late fusion to combine features across views and models. Hierarchical Bayesian Models (HBMs) infer movement quality components, enhancing interpretability. A clinician dashboard displays task scores, execution times, and quality assessments. We conducted a study with five clinicians who reviewed 500 video ratings generated by our system, providing feedback on its accuracy and usability. Evaluated on a stroke rehabilitation dataset, our framework achieves 89.0% validation accuracy with late fusion, with HBMs aligning closely with manual assessments. This work advances automated rehabilitation by offering a scalable, interpretable solution with clinical validation.
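
Late fusion, as mentioned above, combines the per-view model outputs after each view has been scored independently. The following is a minimal sketch assuming each view yields a probability vector over four score classes (0 to 3); the weighted-average rule, the probabilities, and the function name are illustrative assumptions, not the paper's implementation.

```python
def late_fusion(view_probs, weights=None):
    """Fuse per-view class probabilities by weighted averaging.

    view_probs: dict mapping view name -> list of class probabilities.
    weights:    optional dict mapping view name -> non-negative weight.
    Returns (predicted class index, fused probability list).
    """
    views = sorted(view_probs)
    if weights is None:
        weights = {v: 1.0 for v in views}  # equal weighting by default
    total = sum(weights[v] for v in views)
    n_classes = len(view_probs[views[0]])
    fused = [
        sum(weights[v] * view_probs[v][k] for v in views) / total
        for k in range(n_classes)
    ]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return predicted, fused

# Hypothetical outputs for the three camera views named in the abstract.
probs = {
    "ipsilateral":   [0.1, 0.2, 0.6, 0.1],
    "contralateral": [0.1, 0.5, 0.3, 0.1],
    "top":           [0.0, 0.3, 0.5, 0.2],
}
score, fused = late_fusion(probs)
```

Early fusion would instead concatenate the per-view features before a single classifier sees them; late fusion keeps the per-view models independent, which makes it easy to drop or reweight a view.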

Paperid: 1924, https://arxiv.org/pdf/2505.01351.pdf

Abstract:
Adaptive game systems aim to enrich player experiences by dynamically adjusting game content in response to user data. While extensive research has addressed content personalization and player experience modeling, the integration of these components into fully operational adaptive gameplay systems remains limited. This systematic review, conducted in accordance with PRISMA guidelines, analyzes 17 empirical studies published between January 2015 and May 2024, identifying and analyzing approaches that implement the complete experience-driven loop -- including player sensing, modeling, and content adaptation. Game telemetry remains the most prevalent sensing modality, although other non-invasive methods suitable for affective modeling -- such as facial expression analysis (FEA) and peripheral interaction data -- remain underutilized despite their potential for real-time emotional inference. Knowledge-based methods, such as rule-based systems and heuristics, dominate modeling and adaptation due to their interpretability and low resource demands, whereas machine learning approaches face challenges related to data availability and transparency. Despite their relevance to immersive and therapeutic experiences, affective states such as stress and anxiety remain largely ignored, as systems continue to favor performance over emotion-sensitive adaptation. These findings highlight a crucial research direction: advancing emotionally responsive game systems that move beyond performance optimization by incorporating underutilized sensing modalities -- such as FEA and peripheral interaction -- to enable real-time affect-driven personalization. Advancing in this direction holds strong potential to increase immersion, personalize gameplay, and support affect regulation across entertainment and therapeutic contexts.

Paperid: 1925, https://arxiv.org/pdf/2505.00219.pdf

Abstract:
Invasive brain-computer interface (BCI) technology has demonstrated the possibility of restoring brain-controlled walking in paraplegic spinal cord injury patients. However, current implementations of BCI-controlled walking still have significant drawbacks. In particular, prior systems are unidirectional and lack sensory feedback for insensate patients, have suboptimal reliance on brain signals from the bilateral arm areas of the motor cortex, and depend on external systems for signal processing. Motivated by these shortcomings, this study is the first time a bidirectional brain-computer interface (BDBCI) has demonstrated the restoration of both brain-controlled walking and leg sensory feedback while utilizing the bilateral leg motor and sensory cortices. Here, a subject undergoing subdural electrocorticogram electrode implantation for epilepsy surgery evaluation leveraged the leg representation areas of the bilateral interhemispheric primary motor and sensory cortices to operate a BDBCI with high performance. Although electrode implantation in the interhemispheric region is uncommon, electrodes can be safely implanted in this region to access rich leg motor information and deliver bilateral leg sensory feedback. Finally, we demonstrated that all BDBCI operations can be executed on a dedicated, portable embedded system. These results indicate that BDBCIs can potentially provide brain-controlled ambulation and artificial leg sensation to people with paraplegia after spinal cord injury in a manner that emulates full-implantability and is untethered from any external systems.

Paperid: 1926, https://arxiv.org/pdf/2504.21500.pdf

Abstract:
This report provides insights into the challenges, emerging topics, and opportunities related to human-data interaction and visual analytics in the AI era. The BigVis 2024 organizing committee conducted a survey among experts in the field, inviting the Program Committee members and the authors of accepted papers to share their views. Thirty-two scientists from diverse research communities, including Databases, Information Visualization, and Human-Computer Interaction, participated in the study. These scientists, representing both industry and academia, provided valuable insights into the current and future landscape of the field. In this report, we analyze the survey responses and compare them to the findings of a similar study conducted four years ago. The results reveal some interesting insights. First, many of the critical challenges identified in the previous survey remain highly relevant today, despite being unrelated to AI. Meanwhile, the field's landscape has significantly evolved, with most of today's vital challenges not even being mentioned in the earlier survey, underscoring the profound impact of AI-related advancements. By summarizing the perspectives of the research community, this report aims to shed light on the key challenges, emerging trends, and potential research directions in human-data interaction and visual analytics in the AI era.

Paperid: 1927, https://arxiv.org/pdf/2504.20365.pdf

Abstract:
AI writing tools have been shown to dramatically change the way people write, yet the effects of AI text presentation are not well understood nor always intentionally designed. Although text presentation in existing large language model interfaces is linked to the speed of the underlying model, text presentation speed can impact perceptions of AI systems, potentially influencing whether AI suggestions are accepted or rejected. In this paper, we analyze the effects of varying text generation speed in creative and professional writing scenarios on an online platform (n=297). We find that speed is correlated with perceived humanness and trustworthiness of the AI tool, as well as the perceived quality of the generated text. We discuss its implications on creative and writing processes, along with future steps in the intentional design of AI writing tool interfaces.

Paperid: 1928, https://arxiv.org/pdf/2504.20308.pdf

Abstract:
Youth online safety research in HCI has historically centered on perspectives from the Global North, often overlooking the unique particularities and cultural contexts of regions in the Global South. This paper presents a systematic review of 66 youth online safety studies published between 2014 and 2024, specifically focusing on regions in the Global South. Our findings reveal a concentrated research focus in Asian countries and a predominance of quantitative methods. We also found limited research on marginalized youth populations and a primary focus on risks related to cyberbullying. Our analysis underscores the critical role of cultural factors in shaping online safety, highlighting the need for educational approaches that integrate social dynamics and awareness. We propose methodological recommendations and a future research agenda that encourages the adoption of situated, culturally sensitive methodologies and youth-centered approaches to researching youth online safety in regions of the Global South. This paper advocates for greater inclusivity in youth online safety research, emphasizing the importance of addressing varied sociocultural contexts to better understand and meet the online safety needs of youth in the Global South.

Paperid: 1929, https://arxiv.org/pdf/2504.19345.pdf

Abstract:
Blind individuals face persistent challenges in last-mile navigation, including locating entrances, identifying obstacles, and navigating complex or cluttered spaces. Although wearable cameras are increasingly used in assistive systems, there has been no systematic, vantage-focused comparison to guide their design. This paper addresses that gap through a two-part investigation. First, we surveyed ten experienced blind cane users, uncovering navigation strategies, pain points, and technology preferences. Participants stressed the importance of multi-sensory integration, destination-focused travel, and assistive tools that complement (rather than replace) the cane's tactile utility. Second, we conducted controlled data collection with a blind participant navigating five real-world environments using synchronized head- and cane-mounted cameras, isolating vantage placement as the primary variable. To assess how each vantage supports spatial perception, we evaluated SLAM performance (for localization and mapping) and NeRF-based 3D reconstruction (for downstream scene understanding). Head-mounted sensors delivered superior localization accuracy, while cane-mounted views offered broader ground-level coverage and richer environmental reconstructions. A combined (head+cane) configuration consistently outperformed both. These results highlight the complementary strengths of different sensor placements and offer actionable guidance for developing hybrid navigation aids that are perceptive, robust, and user-aligned.

Paperid: 1930, https://arxiv.org/pdf/2504.19158.pdf

Abstract:
Online interpersonal harm, such as cyberbullying and sexual harassment, remains a pervasive issue on social media platforms. Traditional approaches, primarily content moderation, often overlook survivors' needs and agency. We introduce SnuggleSense, a system that empowers survivors through structured sensemaking. Inspired by restorative justice practices, SnuggleSense guides survivors through reflective questions, offers personalized recommendations from similar survivors, and visualizes plans using interactive sticky notes. A controlled experiment demonstrates that SnuggleSense significantly enhances sensemaking compared to an unstructured process of making sense of the harm. We argue that SnuggleSense fosters community awareness, cultivates a supportive survivor network, and promotes a restorative justice-oriented approach toward restoration and healing. We also discuss design insights, such as tailoring informational support and providing guidance while preserving survivors' agency.

Paperid: 1931, https://arxiv.org/pdf/2504.18702.pdf

Abstract:
Software developers maintain extensive mental models of code they produce and its context, often relying on memory to retrieve or reconstruct design decisions, edge cases, and debugging experiences. These missing links and data obstruct both developers and, more recently, large language models (LLMs) working with unfamiliar code. We present Codetations, a system that helps developers contextualize documents with rich notes and tools. Unlike previous approaches, notes in Codetations stay outside the document to prevent code clutter, attaching to spans in the document using a hybrid edit-tracking/LLM-based method. Their content is dynamic, interactive, and synchronized with code changes. A worked example shows that relevant notes with interactively-collected data improve LLM performance during code repair. In our user evaluation, developers praised these properties and saw significant potential in annotation types that we generated with an LLM in just a few minutes.

Paperid: 1932, https://arxiv.org/pdf/2504.18191.pdf

Abstract:
Many enterprise software systems provide complex Graphical User Interfaces (GUIs) that need robust architectural patterns for well-structured software design. However, popular GUI architectural patterns like Model-View-ViewModel (MVVM) often lack detailed implementation guidance, leading GUI developers to inappropriately use the pattern without a comprehensive overview of design variants and often-mentioned trade-offs. Therefore, this paper presents an extensive review of MVVM design aspects and trade-offs, extending beyond the standard MVVM definition. We conducted a multivocal literature review (MLR), including white and gray literature, to cover essential knowledge from blogs, published papers, and other unpublished formats like books. Using the standard MVVM definition as a baseline, our study identifies (1) 76 additional design constructs grouped into 29 design aspects and (2) 16 additional benefits and 15 additional drawbacks. These insights can guide enterprise application developers in implementing practical MVVM solutions and enable informed design decisions.

Paperid: 1933, https://arxiv.org/pdf/2504.17352.pdf

Abstract:
A substantial amount of research has demonstrated the robustness and accuracy of the Riemannian minimum distance to mean (MDM) classifier for all kinds of EEG-based brain-computer interfaces (BCIs). This classifier is simple, fully deterministic, robust to noise, computationally efficient, and well suited to transfer learning. Its training is very simple, requiring just the computation of a geometric mean of symmetric positive-definite (SPD) matrices per class. We propose an improvement of the MDM involving a number of power means of SPD matrices instead of the sole geometric mean. By the analysis of 20 public databases, 10 for the motor-imagery BCI paradigm and 10 for the P300 BCI paradigm, comprising 587 individuals in total, we show that the proposed classifier clearly outperforms the MDM, approaching the state of the art in terms of performance while retaining the simplicity and the deterministic behavior. In order to promote reproducible research, our code will be released as open source.
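
To make the idea of replacing the single geometric mean with several power means concrete, here is a deliberately simplified sketch. It assumes diagonal (hence commuting) SPD matrices, for which the power mean reduces to an entrywise formula and the affine-invariant Riemannian distance has a closed form; general SPD matrices require an iterative solver that is not shown. The rule for combining the means (nearest mean over all powers) and all names are assumptions, not the authors' method.

```python
import math

def power_mean(mats, p):
    """Power mean of diagonal SPD matrices, each given as a list of positive diagonal entries.

    p = 1 is the arithmetic mean, p = -1 the harmonic mean,
    and p = 0 is taken as the limiting geometric mean.
    """
    n, dim = len(mats), len(mats[0])
    if p == 0:
        return [math.exp(sum(math.log(m[i]) for m in mats) / n) for i in range(dim)]
    return [(sum(m[i] ** p for m in mats) / n) ** (1.0 / p) for i in range(dim)]

def riemann_dist(a, b):
    """Affine-invariant distance; for diagonal matrices it is sqrt(sum_i log^2(a_i / b_i))."""
    return math.sqrt(sum(math.log(x / y) ** 2 for x, y in zip(a, b)))

def fit(train, powers=(-1, 0, 1)):
    """Compute one mean per power for every class. `train` maps label -> list of matrices."""
    return {label: [power_mean(mats, p) for p in powers] for label, mats in train.items()}

def predict(model, x):
    """Assumed decision rule: pick the class owning the closest mean among all its power means."""
    return min(model, key=lambda label: min(riemann_dist(x, m) for m in model[label]))

# Hypothetical two-channel "covariance" diagonals for two classes.
train = {
    "left":  [[1.0, 4.0], [2.0, 8.0]],
    "right": [[4.0, 1.0], [8.0, 2.0]],
}
model = fit(train)
label = predict(model, [1.5, 5.0])
```

With `powers=(0,)` the sketch reduces to the classic MDM restricted to diagonal matrices, since only the geometric mean of each class is kept.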

Paperid: 1934, https://arxiv.org/pdf/2504.17023.pdf

Abstract:
Saliency maps are a popular approach for explaining classifications of (convolutional) neural networks. However, it remains an open question as to how best to evaluate saliency maps, with three families of evaluation methods commonly being used: subjective user measures, objective user measures, and mathematical metrics. We examine three of the most popular saliency map approaches (viz., LIME, Grad-CAM, and Guided Backpropagation) in a between-subjects study (N=166) across these families of evaluation methods. We test 1) for subjective measures, if the maps differ with respect to user trust and satisfaction; 2) for objective measures, if the maps increase users' abilities and thus understanding of a model; 3) for mathematical metrics, which map achieves the best ratings across metrics; and 4) whether the mathematical metrics can be associated with objective user measures. To our knowledge, our study is the first to compare several saliency maps across all these evaluation methods -- with the finding that they do not agree in their assessment (i.e., there was no difference concerning trust and satisfaction, Grad-CAM improved users' abilities best, and Guided Backpropagation had the most favorable mathematical metrics). Additionally, we show that some mathematical metrics were associated with user understanding, although this relationship was often counterintuitive. We discuss these findings in light of general debates concerning the complementary use of user studies and mathematical metrics in the evaluation of explainable AI (XAI) approaches.

Paperid: 1935, https://arxiv.org/pdf/2504.16086.pdf

Abstract:
We present a novel virtual staging application for kitchen remodeling from a single panorama. To ensure the realism of the virtual rendered scene, we capture real-world High Dynamic Range (HDR) panoramas and recover the absolute scene radiance for high-quality scene relighting. Our application pipeline consists of three key components: (1) HDR photography for capturing paired indoor and outdoor panoramas, (2) automatic kitchen layout generation with new kitchen components, and (3) an editable rendering pipeline that flexibly edits scene materials and relights the new virtual scene with global illumination. Additionally, we contribute a novel Pano-Pano HDR dataset with 141 paired indoor and outdoor panoramas and present a low-cost photometric calibration method for panoramic HDR photography.

Paperid: 1936, https://arxiv.org/pdf/2504.14996.pdf

Abstract:
This paper investigates the impact of artificial intelligence integration on remote operations, emphasising its influence on both distributed and team cognition. As remote operations increasingly rely on digital interfaces, sensors, and networked communication, AI-driven systems transform decision-making processes across domains such as air traffic control, industrial automation, and intelligent ports. However, the integration of AI introduces significant challenges, including the reconfiguration of human-AI team cognition, the need for adaptive AI memory that aligns with human distributed cognition, and the design of AI fallback operators to maintain continuity during communication disruptions. Drawing on theories of distributed and team cognition, we analyse how cognitive overload, loss of situational awareness, and impaired team coordination may arise in AI-supported environments. Based on real-world intelligent port scenarios, we propose research directions that aim to safeguard human reasoning and enhance collaborative decision-making in AI-augmented remote operations.

Paperid: 1937, https://arxiv.org/pdf/2504.14223.pdf

Abstract:
Text simplification is essential for making complex content accessible to diverse audiences who face comprehension challenges. Yet, the limited availability of simplified materials creates significant barriers to personal and professional growth and hinders social inclusion. Although researchers have explored various methods for automatic text simplification, none fully leverage large language models (LLMs) to offer tailored customization for different target groups and varying levels of simplicity. Moreover, despite its proven benefits for both consumers and organizations, the well-established practice of plain language remains underutilized. In this paper, we present https://simplifymytext.org, the first system designed to produce plain language content from multiple input formats, including typed text and file uploads, with flexible customization options for diverse audiences. We employ GPT-4 and Llama-3 and evaluate outputs across multiple metrics. Overall, our work contributes to research on automatic text simplification and highlights the importance of tailored communication in promoting inclusivity.

Paperid: 1938, https://arxiv.org/pdf/2504.14038.pdf

Abstract:
Conducting data analysis typically involves authoring code to transform, visualize, analyze, and interpret data. Large language models (LLMs) are now capable of generating such code for simple, routine analyses. LLMs promise to democratize data science by enabling those with limited programming expertise to conduct data analyses, including in scientific research, business, and policymaking. However, analysts in many real-world settings must often exercise fine-grained control over specific analysis steps, verify intermediate results explicitly, and iteratively refine their analytical approaches. Such tasks present barriers to building robust and reproducible analyses using LLMs alone or even in conjunction with existing authoring tools (e.g., computational notebooks). This paper introduces Flowco, a new mixed-initiative system to address these challenges. Flowco leverages a visual dataflow programming model and integrates LLMs into every phase of the authoring process. A user study suggests that Flowco supports analysts, particularly those with less programming experience, in quickly authoring, debugging, and refining data analyses.

Paperid: 1939, https://arxiv.org/pdf/2504.13940.pdf

Abstract:
Language students can increase their effectiveness in learning written Japanese by mastering the visual structure and written technique of Japanese kanji. Yet, existing kanji handwriting recognition systems do not assess the written technique sufficiently enough to discourage students from developing bad learning habits. In this paper, we describe our work on Hashigo, a kanji sketch interactive system which achieves human instructor-level critique and feedback on both the visual structure and written technique of students' sketched kanji. This type of automated critique and feedback allows students to target and correct specific deficiencies in their sketches that, if left untreated, are detrimental to effective long-term kanji learning.

Paperid: 1940, https://arxiv.org/pdf/2504.13889.pdf

Abstract:
Learning music theory not only has practical benefits for musicians to write, perform, understand, and express music better, but also helps non-musicians improve critical thinking, mathematical and analytical skills, and music appreciation. However, current external tools applicable for learning music theory through writing when human instruction is unavailable are either limited in feedback, lacking a written modality, or assuming already strong familiarity with music theory concepts. In this paper, we describe Maestoso, an educational tool for novice learners to learn music theory through sketching practice of quizzed music structures. Maestoso first automatically recognizes students' sketched input of quizzed concepts, then relies on existing sketch and gesture recognition techniques to automatically recognize the input, and finally generates instructor-emulated feedback. From our evaluations, we demonstrate that Maestoso performs reasonably well on recognizing music structure elements and that novice students can comfortably grasp introductory music theory in a single session.

Paperid: 1941, https://arxiv.org/pdf/2504.13888.pdf

Abstract:
Kanji script writing is a skill that is often introduced to novice Japanese foreign language students for achieving Japanese writing mastery, but often poses difficulties to students with primarily English fluency due to its vast differences from written English. Instructors often introduce various pedagogical methods -- such as visual structure and written techniques -- to assist students in kanji study, but may lack the availability to provide direct feedback on students' writing outside of class. Current educational applications are also limited because they lack richer instructor-emulated feedback. We introduce Kanji Workbook, a writing-based intelligent tutoring system for students to receive intelligent assessment that emulates human instructor feedback. Our interface not only leverages students' computing devices for allowing them to learn, practice, and review the writing of prompted characters from their course's kanji script lessons, but also provides a diverse set of writing assessment metrics -- derived from instructor interviews and classroom observation insights -- through intelligent scoring and visual animations. We deployed our interface in novice- and intermediate-level university courses over an entire academic year, and observed that interface users on average achieved higher course grades than their peers and also reacted positively to our interface's various features.

Paperid: 1942, https://arxiv.org/pdf/2504.13860.pdf

Abstract:
Large language models (LLMs), like ChatGPT, are capable of computing affectionately nuanced text that therefore can shape online interactions, including dating. This study explores how individuals experience closeness and romantic interest in dating profiles, depending on whether they believe the profiles are human- or AI-generated. In a matchmaking scenario, 307 participants rated 10 responses to the Interpersonal Closeness Generating Task, unaware that all were LLM-generated. Surprisingly, perceived source (human or AI) had no significant impact on closeness or romantic interest. Instead, perceived quality and human-likeness of responses shaped reactions. The results challenge current theoretical frameworks for human-machine communication and raise critical questions about the importance of authenticity in affective online communication.

Paperid: 1943, https://arxiv.org/pdf/2504.12769.pdf

Abstract:
Human error remains a critical concern in aviation safety, contributing to 70-80% of accidents despite technological advancements. While physiological measures show promise for error detection in laboratory settings, their effectiveness in dynamic flight environments remains underexplored. Through live flight trials with nine commercial pilots, we investigated whether established error-detection approaches maintain accuracy during actual flight operations. Participants completed standardized multi-tasking scenarios across conditions ranging from laboratory settings to straight-and-level flight and 2G manoeuvres while we collected synchronized physiological data. Our findings demonstrate that EEG-based classification maintains high accuracy (87.83%) during complex flight manoeuvres, comparable to laboratory performance (89.23%). Eye-tracking showed moderate performance (82.50%), while ECG performed near chance level (51.50%). Classification accuracy remained stable across flight conditions, with minimal degradation during 2G manoeuvres. These results provide the first evidence that physiological error detection can translate effectively to operational aviation environments.

Paperid: 1944, https://arxiv.org/pdf/2504.11020.pdf

Abstract:
In today's society, where Artificial Intelligence (AI) has gained a vital role, concerns regarding user trust have garnered significant attention. The use of AI systems in high-risk domains has often led users to either under-trust them, potentially causing inadequate reliance, or over-trust them, resulting in over-compliance. Therefore, users must maintain an appropriate level of trust. Past research has indicated that explanations provided by AI systems can enhance user understanding of when to trust or not trust the system. However, the utility of presenting different forms of explanation remains to be explored, especially in high-risk domains. Therefore, this study explores the impact of different explanation types (text, visual, and hybrid) and user expertise (retired police officers and lay users) on establishing appropriate trust in AI-based predictive policing. While we observed that the hybrid form of explanations increased the subjective trust in AI for expert users, it did not lead to better decision-making. Furthermore, no form of explanation helped build appropriate trust. The findings of our study emphasize the importance of re-evaluating the use of explanations to build [appropriate] trust in AI-based systems, especially when the system's use is questionable. Finally, we synthesize potential challenges and policy recommendations based on our results to design for appropriate trust in high-risk AI-based systems.

Paperid: 1945, https://arxiv.org/pdf/2504.10797.pdf

Abstract:
Across cultures, names tell a lot about their bearers as they carry deep personal and cultural significance. Names also serve as powerful signals of gender, race, and status in the social hierarchy - a pecking order in which individual positions shape others' expectations on their perceived competence and worth. With the widespread adoption of LLMs and as names are often an input for LLMs, it is crucial to evaluate whether LLMs may sort people into status positions based on first and last names and, if so, whether it is in an unfair, biased fashion. While prior work has primarily investigated biases in first names, little attention has been paid to last names and even less to the combined effects of first and last names. In this study, we conduct a large-scale analysis of name variations across 5 ethnicities to examine how AI exhibits name biases. Our study investigates three key characteristics of inequality and finds that LLMs reflect and reinforce status hierarchies based on names that signal gender and ethnicity as they encode differential expectations of competence, leadership, and economic potential. Contrary to the common assumption that AI tends to favor Whites, we show that East and, in some contexts, South Asian names receive higher rankings. We also disaggregate Asians, a population projected to be the largest immigrant group in the U.S. by 2055. Our results challenge the monolithic Asian model minority assumption, illustrating a more complex and stratified model of bias. Gender moderates biases, with girls facing unfair disadvantages in certain racial groups. Additionally, spanning cultural categories by adopting Western first names improves AI-perceived status for East and Southeast Asian students, particularly for girls. Our findings underscore the importance of intersectional and more nuanced understandings of race, gender, and mixed identities in the evaluation of LLMs.

Paperid: 1946, https://arxiv.org/pdf/2504.10714.pdf

Abstract:
Mobile gaming's global growth has introduced evolving monetization strategies, such as in-app purchases and ads, designed to boost revenue while maintaining player engagement. However, there is limited understanding of the scope and frequency of these strategies, particularly in mature markets like South Korea. To address this research gap, this study examines the monetization strategies used in the top 40 most popular Korean mobile games through direct gameplay observations and targeted video analyses. We identified the prevalence of specific strategies, including time-gated progression, conflict-driven design, and social dynamics, which are systematically categorized in our proposed framework for monetization. Our findings also highlight ethical concerns, including issues with transparency, probability disclosures, and the exploitation of competitive pressures, areas that remain poorly regulated. To address these challenges, we emphasize the need for stricter consumer protections, cross-regional research, and greater focus on protecting vulnerable populations to promote a more equitable and responsible gaming environment.

Paperid: 1947, https://arxiv.org/pdf/2504.10440.pdf

Abstract:
Surgical planning for congenital heart disease traditionally relies on collaborative group examinations of a patient's 3D-printed heart model, a process that lacks flexibility and accessibility. While mobile augmented reality (AR) offers a promising alternative with its portability and familiar interaction gestures, existing solutions limit collaboration to users in the same physical space. We developed HybridCollab, the first iOS AR application that introduces a novel paradigm that enables both in-person and remote medical teams to interact with a shared AR heart model in a single surgical planning session. For example, a team of two doctors in one hospital room can collaborate in real time with another team in a different hospital. Our approach is the first to leverage Apple's GameKit service for surgical planning, ensuring an identical collaborative experience for all participants, regardless of location. Additionally, co-located users can interact with the same anchored heart model in their shared physical space. By bridging the gap between remote and in-person collaboration across medical teams, HybridCollab has the potential for significant real-world impact, streamlining communication and enhancing the effectiveness of surgical planning. Watch the demo: https://youtu.be/hElqJYDuvLM.

Paperid: 1948, https://arxiv.org/pdf/2504.09213.pdf

Abstract:
Decoding brain signals accurately and efficiently is crucial for intra-cortical brain-computer interfaces. Traditional decoding approaches based on neural activity vector features suffer from low accuracy, whereas deep learning based approaches have high computational cost. To improve both the decoding accuracy and efficiency, this paper proposes a spiking neural network (SNN) for effective and energy-efficient intra-cortical brain signal decoding. We also propose a feature fusion approach, which integrates the manually extracted neural activity vector features with those extracted by a deep neural network, to further improve the decoding accuracy. Experiments in decoding motor-related intra-cortical brain signals of two rhesus macaques demonstrated that our SNN model achieved higher accuracy than traditional artificial neural networks; more importantly, it was tens or hundreds of times more efficient. The SNN model is very suitable for high precision and low power applications like intra-cortical brain-computer interfaces.

Paperid: 1949, https://arxiv.org/pdf/2504.09104.pdf

Abstract:
The availability of extended reality (XR) devices has widened their adoption, yet authoring interactive experiences remains complex for non-programmers. We introduce Tell-XR, an intelligent agent leveraging large language models (LLMs) to guide end-users in defining the interaction in XR settings using automations described as Event-Condition-Action (ECA) rules. Through a formative study, we identified the key conversation stages to define and refine automations, which informed the design of the system architecture. The evaluation study in two scenarios (a VR museum and an AR smart home) demonstrates the effectiveness of Tell-XR across different XR interaction settings.
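As a rough illustration of what an Event-Condition-Action (ECA) rule looks like in code, the sketch below models a rule as an event name, a condition over the scene state, and an action that changes that state. This is not Tell-XR's implementation: the `EcaRule` class, the `dispatch` function, the event name, and the state keys are all hypothetical names chosen for the example, loosely inspired by the VR museum scenario.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical scene state: a plain dictionary of named values.
State = Dict[str, Any]


@dataclass
class EcaRule:
    """One Event-Condition-Action rule (illustrative structure only)."""
    event: str                          # name of the triggering event
    condition: Callable[[State], bool]  # must hold for the action to run
    action: Callable[[State], None]     # side effect applied to the state


def dispatch(rules: List[EcaRule], event: str, state: State) -> int:
    """Run every rule whose event matches and whose condition holds.

    Returns the number of rules that fired.
    """
    fired = 0
    for rule in rules:
        if rule.event == event and rule.condition(state):
            rule.action(state)
            fired += 1
    return fired


# Hypothetical museum rule: start the audio guide when a visitor is close.
rules = [
    EcaRule(
        event="visitor_approaches_exhibit",
        condition=lambda s: s["visitor_distance_m"] < 2.0,
        action=lambda s: s.update(audio_guide_playing=True),
    )
]

state = {"visitor_distance_m": 1.0, "audio_guide_playing": False}
dispatch(rules, "visitor_approaches_exhibit", state)
print(state["audio_guide_playing"])
```

Keeping the condition separate from the event is what lets a rule be refined in conversation (for example, tightening the distance threshold) without changing what triggers it.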

Paperid: 1950, https://arxiv.org/pdf/2504.09018.pdf

Abstract:
Virtual YouTubers (VTubers) are avatar-based livestreamers that are voiced and played by human actors. VTubers have been popular in East Asia for years and have more recently seen widespread international growth. Despite their emergent popularity, research has been scarce into the interactions and relationships that exist between avatarized VTubers and their viewers, particularly in contrast to non-avatarized streamers. To address this gap, we performed in-depth interviews with self-reported VTuber viewers (n=21). Our findings first reveal that the avatarized nature of VTubers fosters new forms of theatrical engagement, as factors of the virtual blend with the real to create a mixture of fantasy and realism in possible livestream interactions. Avatarization furthermore results in a unique audience perception regarding the identity of VTubers - an identity which comprises a dynamic, distinct mix of the real human (the voice actor/actress) and the virtual character. Our findings suggest that each of these dual identities affects viewer interactions and relationships with VTubers, both individually and symbiotically. Whereas the performer's identity mediates social factors such as intimacy, relatability, and authenticity, the virtual character's identity offers feelings of escapism, novelty in interactions, and a sense of continuity beyond the livestream. We situate our findings within existing livestreaming literature to highlight how avatarization drives unique, character-based interactions as well as reshapes the motivations and relationships that viewers form with livestreamers. Finally, we provide suggestions and recommendations for areas of future exploration to address the challenges involved in present livestreamed avatarized entertainment.

Paperid: 1951, https://arxiv.org/pdf/2504.09016.pdf

Abstract:
Livestreaming has rapidly become a popular online pastime, with real-time interaction between streamer and viewer being a key motivating feature. However, viewers have traditionally had limited opportunity to directly influence the streamed content; even when such interactions are possible, they have been reliant on text-based chat. We investigate the potential of spatial interaction on the livestreamed video content as a form of direct, real-time input for livestreamed applications. We developed VIBES, a flexible digital system that registers viewers' mouse interactions on the streamed video, i.e., clicks or movements, and transmits them directly into the streamed application. We used VIBES as a technology probe, first designing possible demonstrative interactions and using these interactions to explore streamers' perception of viewer influence and possible challenges and opportunities. We then deployed applications built using VIBES in two livestreams to explore its effects on audience engagement and investigate viewers' relationships with the stream, the streamer, and fellow audience members. The use of spatial interactions enhances engagement and participation and opens up new avenues for both streamer-viewer and viewer-viewer participation. We contextualize our findings around a broader understanding of motivations and engagement in livestreaming, and we propose design guidelines and extensions for future research.

Paperid: 1952, https://arxiv.org/pdf/2504.08954.pdf

Abstract:
The emergent capabilities of large language models (LLMs) have prompted interest in using them as surrogates for human subjects in opinion surveys. However, prior evaluations of LLM-based opinion simulation have relied heavily on costly, domain-specific survey data, and mixed empirical results leave their reliability in question. To enable cost-effective, early-stage evaluation, we introduce a quality control assessment designed to test the viability of LLM-simulated opinions on Likert-scale tasks without requiring large-scale human data for validation. This assessment comprises two key tests: logical consistency and alignment with stakeholder expectations, offering a low-cost, domain-adaptable validation tool. We apply our quality control assessment to an opinion simulation task relevant to AI-assisted content moderation and fact-checking workflows -- a socially impactful use case -- and evaluate seven LLMs using a baseline prompt engineering method (backstory prompting), as well as fine-tuning and in-context learning variants. None of the models or methods pass the full assessment, revealing several failure modes. We conclude with a discussion of the risk management implications and release TopicMisinfo, a benchmark dataset with paired human and LLM annotations simulated by various models and approaches, to support future research.

Paperid: 1953, https://arxiv.org/pdf/2504.08861.pdf

Abstract:
Objectives: Machine learning (ML) has the potential to facilitate "continual learning" in medicine, in which an ML system continues to evolve in response to exposure to new data over time, even after being deployed in a clinical setting. In this paper, we provide a tutorial on the range of ethical issues raised by the use of such "adaptive" ML systems in medicine that have, thus far, been neglected in the literature. Target audience: The target audiences for this tutorial are the developers of machine learning AI systems, healthcare regulators, the broader medical informatics community, and practicing clinicians. Scope: Discussions of adaptive ML systems to date have overlooked the distinction between two sorts of variance that such systems may exhibit -- diachronic evolution (change over time) and synchronic variation (difference between cotemporaneous instantiations of the algorithm at different sites) -- and under-estimated the significance of the latter. We highlight the challenges that diachronic evolution and synchronic variation present for the quality of patient care, informed consent, and equity, and discuss the complex ethical trade-offs involved in the design of such systems.

Paperid: 1954, https://arxiv.org/pdf/2504.08726.pdf

Abstract:
This paper explores interaction designs for generative AI interfaces that necessitate human involvement throughout the generation process. We argue that such interfaces can promote cognitive engagement, agency, and thoughtful decision-making. Through a case study in text revision, we present and analyze two interaction techniques: (1) using a predictive-text interaction to type the assistant's response to a revision request, and (2) highlighting potential edit opportunities in a document. Our implementations demonstrate how these approaches reveal the landscape of writing possibilities and enable fine-grained control. We discuss implications for human-AI writing partnerships and future interaction design directions.

Paperid: 1955, https://arxiv.org/pdf/2504.08471.pdf

Abstract:
Mobile user interfaces abundantly feature so-called 'dark patterns'. These deceptive design practices manipulate users' decision making to profit online service providers. While past research on dark patterns has mainly focused on visual design, other sensory modalities such as audio and touch remain largely unexplored. In this early work, we investigate the manipulative side of haptics, which we term 'Dark Haptics', as a strategy to manipulate users. We designed a study to empirically showcase the potential of using a dark haptic pattern in a mobile device to manipulate user actions in a survey. Our findings indicate that our dark haptic design successfully influenced participants to forego their privacy after experiencing alarming feedback for rejecting intrusive requests in the survey. As a first exploration of the manipulative qualities of dark haptic designs, we attempt to lay the groundwork for future research and tools to mitigate the harms and risks of dark haptics.

Paperid: 1956, https://arxiv.org/pdf/2504.08440.pdf

Abstract:
In an era of human-computer interaction with increasingly agentic AI systems capable of connecting with users conversationally, speech is an important modality for commanding agents. By recognizing and using speech emotions (i.e., how a command is spoken), we can provide agents with the ability to emotionally accentuate their responses and socially enrich users' perceptions and experiences. To explore the concept and impact of speech emotion commands on user perceptions, we realized a prototype and conducted a user study (N = 14) where speech commands are used to steer two vehicles in a minimalist and retro game style implementation. While both agents execute user commands, only one of the agents uses speech emotion information to adapt its execution behavior. We report on differences in how users perceived each agent, including significant differences in stimulation and dependability, outline implications for designing interactions with agents using emotional speech commands, and provide insights on how users consciously emote, which we describe as "voice acting".

Paperid: 1957, https://arxiv.org/pdf/2504.08336.pdf

Abstract:
The history of information technology development has been characterized by consecutive waves of boom and bust, as new technologies come to market, fuel surges of investment, and then stabilize towards maturity. However, in recent decades, the acceleration of such technology hype cycles has resulted in the prioritization of massive capital generation at the expense of long-term sustainability, resulting in a cascade of negative social, political, and environmental consequences. Despite the negative impacts of this pattern, academic research, and in particular HCI research, is not immune to such hype cycles, often contributing substantial amounts of literature to the discourse surrounding a wave of hype. In this paper, we discuss the relationship between technology and capital, offer a critique of the technology hype cycle using generative AI as an example, and finally suggest an approach and a set of strategies for how we can counteract such cycles through research as resistance.

Paperid: 1958, https://arxiv.org/pdf/2504.08117.pdf

Abstract:
Facial expressiveness plays a crucial role in a robot's ability to engage and interact with children. Prior research has shown that expressive robots can enhance child engagement during human-robot interactions. However, many robots used in therapy settings feature non-personalized, static faces designed with traditional facial feature considerations, which can limit the depth of interactions and emotional connections. Digital faces offer opportunities for personalization, yet the current landscape of robot face design lacks a dynamic, user-centered approach. Specifically, there is a significant research gap in designing robot faces based on child preferences. Instead, most robots in child-focused therapy spaces are developed from an adult-centric perspective. We present a novel study investigating the influence of child-drawn digital faces in child-robot interactions. This approach focuses on a design activity with children instructed to draw their own custom robot faces. We compare the perceptions of social intelligence (PSI) of two implementations: a generic digital face and a robot face, personalized using the user's drawn robot faces. The results of this study show the perceived social intelligence of a child-drawn robot was significantly higher compared to a generic face.

Paperid: 1959, https://arxiv.org/pdf/2504.07879.pdf

Abstract:
Creativity is a valuable human skill that has long been augmented through both analog and digital tools. Recent progress in generative AI, such as image generation, provides a disruptive technological solution to supporting human creativity further and helping humans generate solutions faster. While AI image generators can help to rapidly visualize ideas based on user prompts, the use of such AI systems has also been critiqued due to their considerable energy usage. In this paper, we report on a user study (N = 24) to understand whether energy consumption can be reduced without impeding the tool's perceived creativity support. Our results highlight, for example, a main effect of (image generation) condition on energy consumption and on the index of creativity support per prompt but not per task, which seems mainly attributable to image quantity per prompt. We provide details of our analysis on the relation between energy usage, creativity support, and prompting behavior, including attitudes towards designing with AI and its environmental impact.

Paperid: 1960, https://arxiv.org/pdf/2504.05781.pdf

Abstract:
Social Virtual Reality (VR) games offer immersive socialization experiences but pose significant challenges of harassment. Common solutions, such as reporting and moderation, address harassment after it happens but fail to prevent or stop harassment in the moment. In this study, we explore and design proactive and instant-reactive safety designs to mitigate harassment in social VR. Proactive designs prevent harassment from occurring, while instant-reactive designs minimize harm during incidents. We explore three directions for design: user-initiated personal bubbles, clarifying social norms, and encouraging bystander intervention. Through an iterative process, we first conducted a formative interview study to determine design goals for making these features effective, fit user needs, and robust to manipulation. We then implemented Puffer, an integrated safety system that includes a suite of proactive and instant-reactive features, as a social VR prototype. From an evaluation using simulated scenarios with participants, we find evidence that Puffer can help protect players during emergencies, foster prosocial norms, and create more positive social interactions. We conclude by discussing how system safety features can be designed to complement existing proactive and instant-reactive strategies, particularly for people with marginalized identities.

Paperid: 1961, https://arxiv.org/pdf/2504.05697.pdf

Abstract:
In the biomedical domain, visualizing the document embeddings of an extensive corpus has been widely used in information-seeking tasks. However, three key challenges with existing visualizations make it difficult for clinicians to find information efficiently. First, the document embeddings used in these visualizations are generated statically by pretrained language models, which cannot adapt to the user's evolving interest. Second, existing document visualization techniques cannot effectively display how the documents are relevant to users' interest, making it difficult for users to identify the most pertinent information. Third, existing embedding generation and visualization processes suffer from a lack of interpretability, making it difficult to understand, trust and use the result for decision-making. In this paper, we present a novel visual analytics pipeline for user driven document representation and iterative information seeking (VADIS). VADIS introduces a prompt-based attention model (PAM) that generates dynamic document embedding and document relevance adjusted to the user's query. To effectively visualize these two pieces of information, we design a new document map that leverages a circular grid layout to display documents based on both their relevance to the query and the semantic similarity. Additionally, to improve the interpretability, we introduce a corpus-level attention visualization method to improve the user's understanding of the model focus and to enable the users to identify potential oversight. This visualization, in turn, empowers users to refine, update and introduce new queries, thereby facilitating a dynamic and iterative information-seeking experience. We evaluated VADIS quantitatively and qualitatively on a real-world dataset of biomedical research papers to demonstrate its effectiveness.

Paperid: 1962, https://arxiv.org/pdf/2504.04592.pdf

Abstract:
Consider a setting where a pre-trained agent is operating in an environment and a human operator can decide to temporarily terminate its operation and take over for some duration of time. These kinds of scenarios are common in human-machine interactions, for example in autonomous driving, factory automation and healthcare. In these settings, we typically observe a trade-off between two extreme cases -- if no take-overs are allowed, then the agent might employ a sub-optimal, possibly dangerous policy. Alternatively, if there are too many take-overs, then the human has no confidence in the agent, greatly limiting its usefulness. In this paper, we formalize this setup and propose an explainability scheme to help optimize the number of human interventions.

Paperid: 1963, https://arxiv.org/pdf/2504.04553.pdf

Abstract:
Code auditing demands a robust understanding of codebases - an especially challenging task for end-user developers with limited expertise. To address this, we conducted formative interviews with experienced auditors and identified a Chain-of-Understanding approach, in which Large Language Models (LLMs) guide developers through hierarchical code comprehension - from high-level overviews to specific functions and variables. Building on this, we incorporated the Chain-of-Understanding concept into CodeMap, a system offering interactive visualizations, stepwise guided analysis, and context-aware chatbot support. Through within-subject user studies with 10 participants of diverse backgrounds and 5 expert and 2 novice interviews, CodeMap proved effective in reducing the manual effort of prompt engineering while enhancing engagement with visualization, outperforming both standalone LLMs and traditional static visualization tools.

Paperid: 1964, https://arxiv.org/pdf/2504.04243.pdf

Abstract:
The design of AI systems to assist human decision-making typically requires the availability of labels to train and evaluate supervised models. Frequently, however, these labels are unknown, and different ways of estimating them involve unverifiable assumptions or arbitrary choices. In this work, we introduce the concept of label indeterminacy and derive important implications in high-stakes AI-assisted decision-making. We present an empirical study in a healthcare context, focusing specifically on predicting the recovery of comatose patients after resuscitation from cardiac arrest. Our study shows that label indeterminacy can result in models that perform similarly when evaluated on patients with known labels, but vary drastically in their predictions for patients where labels are unknown. After demonstrating crucial ethical implications of label indeterminacy in this high-stakes context, we discuss takeaways for evaluation, reporting, and design.
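The central observation, that models can agree where labels are known yet diverge where they are not, can be made concrete with a simple disagreement rate. The sketch below is only an illustration and not the paper's analysis: the `disagreement_rate` function and the toy binary predictions are hypothetical, and real use would substitute the outputs of models trained under different label-estimation assumptions.

```python
from typing import Sequence


def disagreement_rate(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of cases on which two models predict different outcomes."""
    if len(preds_a) != len(preds_b) or len(preds_a) == 0:
        raise ValueError("prediction lists must be non-empty and equal in length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy, made-up predictions from two hypothetical models (1 = recovery).
known_a, known_b = [1, 0, 1, 1], [1, 0, 1, 1]      # patients with known labels
unknown_a, unknown_b = [1, 1, 0, 0], [0, 1, 1, 0]  # patients with unknown labels

print(disagreement_rate(known_a, known_b))      # 0.0
print(disagreement_rate(unknown_a, unknown_b))  # 0.5
```

In this toy case the two models are indistinguishable on the labeled patients, so standard evaluation would not reveal that they disagree on half of the unlabeled ones.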

Paperid: 1965, https://arxiv.org/pdf/2504.03141.pdf

Abstract:
This paper presents a sign language conversation system based on the See-Through Face Display to address the challenge of maintaining eye contact in remote sign language interactions. A camera positioned behind a transparent display allows users to look at the face of their conversation partner while appearing to maintain direct eye contact. Unlike conventional methods that rely on software-based gaze correction or large-scale half-mirror setups, this design reduces visual distortions and simplifies installation. We implemented and evaluated a videoconferencing system that integrates See-Through Face Display, comparing it to traditional videoconferencing methods. We explore its potential applications for Deaf and Hard of Hearing (DHH), including multi-party sign language conversations, corpus collection, remote interpretation, and AI-driven sign language avatars. Collaboration with DHH communities will be key to refining the system for real-world use and ensuring its practical deployment.

Paperid: 1966, https://arxiv.org/pdf/2504.02622.pdf

Abstract:
Large language models (LLMs) are transforming how students learn by providing readily available tools that can quickly augment or complete various learning activities with non-trivial performance. Similar paradigm shifts have occurred in the past with the introduction of search engines and Wikipedia, which replaced or supplemented traditional information sources such as libraries and books. This study investigates the potential for LLMs to represent the next shift in learning, focusing on their role in information discovery and synthesis compared to existing technologies, such as search engines. Using a within-subjects, counterbalanced design, participants learned new topics using a search engine (Google) and an LLM (ChatGPT). Post-task follow-up interviews explored students' reflections, preferences, pain points, and overall perceptions. We present an analysis of their responses that offers nuanced insights into when, why, and how students prefer LLMs over search engines, with implications for educators, policymakers, and technology developers navigating the evolving educational landscape.

Paperid: 1967, https://arxiv.org/pdf/2504.02461.pdf

Abstract:
Current fairness metrics and mitigation techniques provide tools for practitioners to assess how non-discriminatory Automatic Decision Making (ADM) systems are. What if I, as an individual facing a decision taken by an ADM system, would like to know: Am I being treated fairly? We explore how to create the affordance for users to be able to ask this question of ADM. In this paper, we argue for the reification of fairness not only as a property of ADM, but also as an epistemic right of an individual to acquire information about the decisions that affect them and use that information to contest and seek effective redress against those decisions if they are proven to be discriminatory. We examine key concepts from existing research not only in algorithmic fairness but also in explainable artificial intelligence, accountability, and contestability. Integrating notions from these domains, we propose a conceptual framework to ascertain fairness by combining different tools that empower the end-users of ADM systems. Our framework shifts the focus from technical solutions aimed at practitioners to mechanisms that enable individuals to understand, challenge, and verify the fairness of decisions, and also serves as a blueprint for organizations and policymakers, bridging the gap between technical requirements and practical, user-centered accountability.

Paperid: 1968, https://arxiv.org/pdf/2504.02250.pdf

Abstract:
In this paper, we present a systematic method of design for human-swarm interaction interfaces, combining theoretical insights with empirical evaluation. We first derived ten design principles from existing literature, applied them to key information dimensions identified through goal-directed task analysis, and developed a tablet-based interface for a target search task. We then conducted a user study with 31 participants in which humans were required to guide a robotic swarm to a target in the presence of three types of hazards that pose a risk to the robots: Distributed, Moving, and Spreading. Performance was measured based on the proximity of the robots to the target and the number of deactivated robots at the end of the task. Results indicate that at least one robot was brought closer to the target in 98% of tasks, demonstrating the interface's success in fulfilling the primary objective of the task. In nearly 67% of tasks, more than 50% of the robots reached the target, and performance was notably better with moving hazards. The interface also appeared to help minimise robot deactivation, as evidenced by nearly 94% of tasks where participants managed to keep more than 50% of the robots active, ensuring that most of the swarm remained operational. However, its effectiveness varied across hazards, with robot deactivation being lowest in distributed hazard scenarios, suggesting that the interface provided the most support in these conditions.

Paperid: 1969, https://arxiv.org/pdf/2504.00333.pdf

Abstract:
There is growing concern about climate change and increased interest in taking action. However, people have difficulty understanding abstract units like CO2 and the relative environmental impact of different behaviors. This position piece explores findings from nutritional labeling and step counting research, two domains aimed at making abstract concepts (i.e., calories and exercise) more familiar to the general public. Research in these two domains suggests that consistent, widespread communication can make people more familiar and think more precisely about abstract units, but that better communication and understanding does not guarantee behavior change. These findings suggest that consistent and ubiquitous communication can make CO2 units more familiar to people, which in turn could help interventions aimed at encouraging more sustainable behaviors.

Paperid: 1970, https://arxiv.org/pdf/2503.24334.pdf

Abstract:
As Generative AI (GenAI) capabilities expand, understanding how to preserve and develop human expertise while leveraging AI's benefits becomes increasingly critical. Through empirical studies in two contexts -- survey article authoring in scholarly research and business document sensemaking -- we examine how domain expertise shapes patterns of AI delegation and information processing among knowledge workers. Our findings reveal that while experts welcome AI assistance with repetitive information foraging tasks, they prefer to retain control over complex synthesis and interpretation activities that require nuanced domain understanding. We identify implications for designing GenAI systems that support expert cognition. These include enabling selective delegation aligned with expertise levels, preserving expert agency over critical analytical tasks, considering varying levels of domain expertise in system design, and supporting verification mechanisms that help users calibrate their reliance while deepening expertise. We discuss the inherent tension between reducing cognitive load through automation and maintaining the deliberate practice necessary for expertise development. Lastly, we suggest approaches for designing systems that provide metacognitive support, moving beyond simple task automation toward actively supporting expertise development. This work contributes to our understanding of how to design AI systems that augment rather than diminish human expertise in document-centric workflows.

Paperid: 1971, https://arxiv.org/pdf/2503.24150.pdf

Abstract:
Recent advances in generative AI have been driven by alignment techniques such as reinforcement learning from human feedback (RLHF). RLHF and related techniques typically involve constructing a dataset of binary or ranked choice human preferences and subsequently fine-tuning models to align with these preferences. This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences. We find that a small subset of 21 preference categories (selected from a set of nearly 5,000 distinct preferences) captures >89% of preference variation across individuals. This small set of preferences is analogous to a canonical basis of human preferences, similar to established findings that characterize human variation in psychology or facial recognition studies. Through both synthetic and empirical evaluations, we confirm that our low-rank, canonical set of human preferences generalizes across the entire dataset and within specific topics. We further demonstrate our preference basis' utility in model evaluation, where our preference categories offer deeper insights into model alignment, and in model training, where we show that fine-tuning on preference-defined subsets successfully aligns the model accordingly.

Paperid: 1972, https://arxiv.org/pdf/2503.23609.pdf

Abstract:
Aging in place refers to the enabling of individuals to age comfortably and securely within their own homes and communities. Aging in place relies on robust infrastructure, prompting the development and implementation of both human-led care services and information and communication technologies to provide support. Through a long-term ethnographic study that includes semi-structured interviews with 24 stakeholders, we consider these human- and technology-driven care infrastructures for aging in place, examining their origins, deployment, interactions with older adults, and challenges. In doing so, we reconsider the value of these different forms of older adult care, highlighting the various issues associated with using, for instance, health monitoring technology or appointment scheduling systems to care for older adults aging in place. We suggest that technology should take a supportive, not substitutive role in older adult care infrastructure. Furthermore, we note that designing for aging in place should move beyond a narrow focus on independence in one's home to instead encompass the broader community and its dynamics.

Paperid: 1973, https://arxiv.org/pdf/2503.23153.pdf

Abstract:
A vast literature studies Conversational Agents (CAs) in facilitating older adults' health. These vast and diverse studies warrant a comprehensive review that summarizes the main findings and proposes directions for future research, yet few literature reviews have done so from a human-computer interaction (HCI) perspective. In this study, we present a survey of existing studies on CAs for older adults' health. Through a systematic review of 72 papers, this work reviewed previously studied older adults' characteristics and analyzed participants' experiences and expectations of CAs for health. We found that (1) past research shows an increasing interest in chatbots and voice assistants and has applied CAs in multiple roles in older adults' health; (2) older adults mainly showed low acceptance of CAs for health due to various reasons, such as unstable effects, harm to independence, and privacy concerns; and (3) older adults expect CAs to support multiple functions, to communicate using natural language, to be personalized, and to allow users full control. We also discuss the implications of these findings.

Paperid: 1974, https://arxiv.org/pdf/2503.21844.pdf

Abstract:
Disabled people on social media often experience ableist hate and microaggressions. Prior work has shown that platform moderation often fails to remove ableist hate, leaving disabled users exposed to harmful content. This paper examines how personalized moderation can safeguard users from viewing ableist comments. During interviews and focus groups with 23 disabled social media users, we presented design probes to elicit perceptions on configuring their filters of ableist speech (e.g., intensity of ableism and types of ableism) and customizing the presentation of the ableist speech to mitigate the harm (e.g., AI rephrasing the comment and content warnings). We found that participants preferred configuring their filters through types of ableist speech and favored content warnings. We surface participants' distrust in AI-based moderation, skepticism about AI's accuracy, and varied tolerances for viewing ableist hate. Finally, we share design recommendations to support users' agency, mitigate harm from hate, and promote safety.

Paperid: 1975, https://arxiv.org/pdf/2503.21044.pdf

Abstract:
Proprioception is essential for coordinating human movements and enhancing the performance of assistive robotic devices. Skin stretch feedback, which closely aligns with natural proprioception mechanisms, presents a promising method for conveying proprioceptive information. To better understand the impact of interference on skin stretch perception, we conducted a user study with 30 participants that evaluated the effect of two simultaneous skin stretches on user perception. We observed that when participants experience simultaneous skin stretch stimuli, a masking effect occurs which deteriorates perception performance in the collocated skin stretch configurations. However, the perceived workload stays the same. These findings show that interference can affect the perception of skin stretch such that multi-channel skin stretch feedback designs should avoid locating modules in close proximity.

Paperid: 1976, https://arxiv.org/pdf/2503.20646.pdf

Abstract:
In augmented reality (AR), where digital content is overlaid onto the real world, realistic thermal feedback has been shown to enhance immersion. Yet current thermal feedback devices, heavily influenced by the needs of virtual reality, often hinder physical interactions and are ineffective for immersion in AR. To bridge this gap, we have identified three design considerations relevant for AR thermal feedback: indirect feedback to maintain dexterity, thermal passthrough to preserve real-world temperature perception, and spatiotemporal rendering for dynamic sensations. We then created a unique and innovative thermal feedback device that satisfies these criteria. Human subject experiments assessing perceptual sensitivity, object temperature matching, spatial pattern recognition, and moving thermal stimuli demonstrated the impact of our design, enabling realistic temperature discrimination, virtual object perception, and enhanced immersion. These findings demonstrate that carefully designed thermal feedback systems can bridge the sensory gap between physical and virtual interactions, enhancing AR realism and usability.

Paperid: 1977, https://arxiv.org/pdf/2503.19280.pdf

Abstract:
The study of propositional logic -- fundamental to the theory of computing -- is a cornerstone of the undergraduate computer science curriculum. Learning to solve logical proofs requires repeated guided practice, but undergraduate students often lack access to on-demand tutoring in a judgment-free environment. In this work, we highlight the need for guided practice tools in undergraduate mathematics education and outline the desiderata of an effective practice tool. We accordingly develop LogicLearner, a web application for guided logic proof practice. LogicLearner consists of an interface to attempt logic proofs step-by-step and an automated proof solver to generate solutions on the fly, allowing users to request guidance as needed. We pilot LogicLearner as a practice tool in two semesters of an undergraduate discrete mathematics course and receive strongly positive feedback for usability and pedagogical value in student surveys. To the best of our knowledge, LogicLearner is the only learning tool that provides an end-to-end practice environment for logic proofs with immediate, judgment-free feedback.

Paperid: 1978, https://arxiv.org/pdf/2503.17479.pdf

Abstract:
In this paper, we present Speak Ease: an augmentative and alternative communication (AAC) system to support users' expressivity by integrating multimodal input, including text, voice, and contextual cues (conversational partner and emotional tone), with large language models (LLMs). Speak Ease combines automatic speech recognition (ASR), context-aware LLM-based outputs, and personalized text-to-speech technologies to enable more personalized, natural-sounding, and expressive communication. Through an exploratory feasibility study and focus group evaluation with speech and language pathologists (SLPs), we assessed Speak Ease's potential to enable expressivity in AAC. The findings highlight the priorities and needs of AAC users and the system's ability to enhance user expressivity by supporting more personalized and contextually relevant communication. This work provides insights into the use of multimodal inputs and LLM-driven features to improve AAC systems and support expressivity.

Paperid: 1979, https://arxiv.org/pdf/2503.17046.pdf

Abstract:
Automatic robotic facial expression generation is crucial for human-robot interaction, as handcrafted methods based on fixed joint configurations often yield rigid and unnatural behaviors. Although recent automated techniques reduce the need for manual tuning, they tend to fall short by not adequately bridging the gap between human preferences and model predictions, resulting in a deficiency of nuanced and realistic expressions due to limited degrees of freedom and insufficient perceptual integration. In this work, we propose a novel learning-to-rank framework that leverages human feedback to address this discrepancy and enhance the expressiveness of robotic faces. Specifically, we conduct pairwise comparison annotations to collect human preference data and develop the Human Affective Pairwise Impressions (HAPI) model, a Siamese RankNet-based approach that refines expression evaluation. Results obtained via Bayesian Optimization and an online expression survey on a 35-DOF android platform demonstrate that our approach produces significantly more realistic and socially resonant expressions of Anger, Happiness, and Surprise than those generated by baseline and expert-designed methods. This confirms that our framework effectively bridges the gap between human preferences and model predictions while robustly aligning robotic expression generation with human affective responses.

Paperid: 1980, https://arxiv.org/pdf/2503.16824.pdf

Abstract:
AI-driven multimodal interfaces have the potential to revolutionize industrial 3D CAD modeling by improving workflow efficiency and user experience. However, the integration of these technologies remains challenging due to software constraints, user adoption barriers, and limitations in AI model adaptability. This paper explores the role of multimodal AI in CAD environments, examining its current applications, key challenges, and future research directions. We analyze Bayesian workflow inference, multimodal input strategies, and collaborative AI-driven interfaces to identify areas where AI can enhance CAD design processes while addressing usability concerns in industrial manufacturing settings.

Paperid: 1981, https://arxiv.org/pdf/2503.16540.pdf

Abstract:
Strain sensors are gaining popularity in soft robotics for acquiring tactile data due to their flexibility and ease of integration. Tactile sensing plays a critical role in soft grippers, enabling them to safely interact with unstructured environments and precisely detect object properties. However, a significant challenge with these systems is their high non-linearity, time-varying behavior, and long-term signal drift. In this paper, we introduce a continual learning (CL) approach to model a soft finger equipped with piezoelectric-based strain sensors for proprioception. To tackle the aforementioned challenges, we propose an adaptive CL algorithm that integrates a Long Short-Term Memory (LSTM) network with a memory buffer for rehearsal and includes a regularization term to keep the model's decision boundary close to the base signal while adapting to time-varying drift. We conduct nine different experiments, resetting the entire setup each time to demonstrate signal drift. We also benchmark our algorithm against two other methods and conduct an ablation study to assess the impact of different components on the overall performance.
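As a rough illustration of the rehearsal idea mentioned in the abstract above, the sketch below keeps a fixed-size memory buffer of past samples that can be replayed alongside new data. The abstract does not say how the buffer is filled; reservoir sampling is an assumption made here, and the class name `RehearsalBuffer`, its methods, and the sample format are all hypothetical. The LSTM and the regularization term are not shown.

```python
import random


class RehearsalBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling (an assumption)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0  # total number of samples offered so far
        self._rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def replay(self, k):
        """Draw up to k stored samples to mix into the next training batch."""
        return self._rng.sample(self.samples, min(k, len(self.samples)))


buffer = RehearsalBuffer(capacity=5)
for t in range(100):
    buffer.add((t, 0.01 * t))  # hypothetical (time step, strain reading) pair
batch = buffer.replay(3)
```

Reservoir sampling keeps the buffer an unbiased sample of the whole stream, so early (pre-drift) readings stay represented as the signal changes; other buffer policies would fit the same interface.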

Paperid: 1982, https://arxiv.org/pdf/2503.16498.pdf

Abstract:
Large Language Models (LLMs) offer a promising alternative to traditional survey methods, potentially enhancing efficiency and reducing costs. In this study, we use LLMs to create virtual populations that answer survey questions, enabling us to predict outcomes comparable to human responses. We evaluate several LLMs (including GPT-4o, GPT-3.5, Claude 3.5-Sonnet, and versions of the Llama and Mistral models), comparing their performance to that of a traditional Random Forests algorithm using demographic data from the World Values Survey (WVS). LLMs demonstrate competitive performance overall, with the significant advantage of requiring no additional training data. However, they exhibit biases when predicting responses for certain religious and population groups, underperforming in these areas. On the other hand, Random Forests demonstrate stronger performance than LLMs when trained with sufficient data. We observe that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented demographic segments where censored models struggle. These findings highlight the importance of addressing biases and reconsidering censorship approaches in LLMs to enhance their reliability and fairness in public opinion research.

Paperid: 1983, https://arxiv.org/pdf/2503.16491.pdf

Abstract:
The rapid adoption of generative AI in software development has impacted the industry, yet its effects on developers with visual impairments remain largely unexplored. To address this gap, we used an Activity Theory framework to examine how developers with visual impairments interact with AI coding assistants. For this purpose, we conducted a study where developers who are visually impaired completed a series of programming tasks using a generative AI coding assistant. We uncovered that, while participants found the AI assistant beneficial and reported significant advantages, they also highlighted accessibility challenges. Specifically, the AI coding assistant often exacerbated existing accessibility barriers and introduced new challenges. For example, it overwhelmed users with an excessive number of suggestions, leading developers who are visually impaired to express a desire for "AI timeouts." Additionally, the generative AI coding assistant made it more difficult for developers to switch contexts between the AI-generated content and their own code. Despite these challenges, participants were optimistic about the potential of AI coding assistants to transform the coding experience for developers with visual impairments. Our findings emphasize the need to apply activity-centered design principles to generative AI assistants, ensuring they better align with user behaviors and address specific accessibility needs. This approach can enable the assistants to provide more intuitive, inclusive, and effective experiences, while also contributing to the broader goal of enhancing accessibility in software development.

Paperid: 1984, https://arxiv.org/pdf/2503.16473.pdf

Abstract:
Traditional rule-based conversational robots, constrained by predefined scripts and static response mappings, fundamentally lack adaptability for personalized, long-term human interaction. While Large Language Models (LLMs) like GPT-4 have revolutionized conversational AI through open-domain capabilities, current social robots implementing LLMs still lack emotional awareness and continuous personalization. This dual limitation hinders their ability to sustain engagement across multiple interaction sessions. We bridge this gap with PERCY (Personal Emotional Robotic Conversational sYstem), a system designed to enable open-domain, multi-turn dialogues by dynamically analyzing users' real-time facial expressions and vocabulary to tailor responses based on their emotional state. Built on a ROS-based multimodal framework, PERCY integrates a fine-tuned GPT-4 reasoning engine, combining textual sentiment analysis with visual emotional cues to accurately assess and respond to user emotions. We evaluated PERCY's performance through various dialogue quality metrics, showing strong coherence, relevance, and diversity. Human evaluations revealed PERCY's superior personalization and comparable naturalness to other models. This work highlights the potential for integrating advanced multimodal perception and personalization in social robot dialogue systems.

Paperid: 1985, https://arxiv.org/pdf/2503.15926.pdf

Abstract:
Face Emotion Recognition (FER) is essential for social interactions and understanding others' mental states. Utilizing eye tracking to investigate FER has yielded insights into cognitive processes. In this study, we utilized an instructionless paradigm to collect eye movement data from 21 participants, examining two FER processes: free viewing and grounded FER. We analyzed fixational, pupillary, and microsaccadic events from eye movements, establishing their correlation with emotion perception and performance in the grounded task. By identifying regions of interest on the face, we explored the impact of eye-gaze strategies on face processing, their connection to emotions, and performance in emotion perception. During free viewing, participants displayed specific attention patterns for various emotions. In grounded tasks, where emotions were interpreted based on words, we assessed performance and contextual understanding. Notably, gaze patterns during free viewing predicted success in grounded FER tasks, underscoring the significance of initial gaze behavior. We also employed features from pre-trained deep-learning models for face recognition to enhance the scalability and comparability of attention analysis during free viewing across different datasets and populations. This method facilitated the prediction and modeling of individual emotion perception performance from minimal observations. Our findings advance the understanding of the link between eye movements and emotion perception, with implications for psychology, human-computer interaction, and affective computing, and pave the way for developing precise emotion recognition systems.

Paperid: 1986, https://arxiv.org/pdf/2503.15525.pdf

Abstract:
This study investigated whether AI evaluators assess the content validity of B1-level English reading comprehension test items in a manner similar to human evaluators. A 25-item multiple-choice test was developed, and these test items were evaluated by four human and four AI evaluators. No statistically significant difference was found between the scores given by human and AI evaluators, with similar evaluation trends observed. The Content Validity Ratio (CVR) and the Item Content Validity Index (I-CVI) were calculated and analyzed using the Wilcoxon Signed-Rank Test, again with no statistically significant difference. The findings revealed that in some cases, AI evaluators could replace human evaluators. However, differences in specific items were thought to arise from varying interpretations of the evaluation criteria. Ensuring linguistic clarity and clearly defining criteria could contribute to more consistent evaluations. In this regard, the development of hybrid evaluation systems, in which AI technologies are used alongside human experts, is recommended.
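For readers unfamiliar with the two indices named in the abstract above, the sketch below shows how they are commonly computed. The abstract gives neither the formulas nor the rating scale, so this assumes Lawshe's definition of CVR, (n_e - N/2) / (N/2), and the common convention that I-CVI is the share of evaluators rating an item 3 or 4 on a 4-point relevance scale. The function names and the example ratings are hypothetical.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR: (n_e - N/2) / (N/2), ranging from -1 to 1."""
    half = n_experts / 2
    return (n_essential - half) / half


def item_cvi(ratings: list[int], relevant_min: int = 3) -> float:
    """I-CVI: share of evaluators whose rating is at least relevant_min.

    Assumes a 4-point relevance scale where 3 and 4 count as relevant.
    """
    return sum(r >= relevant_min for r in ratings) / len(ratings)


# Hypothetical ratings for one test item from four evaluators.
ratings = [4, 3, 4, 2]
cvr = content_validity_ratio(n_essential=3, n_experts=4)
icvi = item_cvi(ratings)
```

With four evaluators, three of whom judge the item essential or relevant, this gives a CVR of 0.5 and an I-CVI of 0.75.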

Paperid: 1987, https://arxiv.org/pdf/2503.15516.pdf

Abstract:
We seek measurable properties of AI agents that make them better or worse teammates from the subjective perspective of human collaborators. Our experiments use the cooperative card game Hanabi -- a common benchmark for AI-teaming research. We first evaluate AI agents on a set of objective metrics based on task performance, information theory, and game theory, which are measurable without human interaction. Next, we evaluate subjective human preferences toward AI teammates in a large-scale (N=241) human-AI teaming experiment. Finally, we correlate the AI-only objective metrics with the human subjective preferences. Our results refute common assumptions from prior literature on reinforcement learning, revealing new correlations between AI behaviors and human preferences. We find that the final game score a human-AI team achieves is less predictive of human preferences than esoteric measures of AI action diversity, strategic dominance, and ability to team with other AI. In the future, these correlations may help shape reward functions for training human-collaborative AI.

Paperid: 1988, https://arxiv.org/pdf/2503.15511.pdf

Abstract:
Recent proliferation of powerful AI systems has created a strong need for capabilities that help users to calibrate trust in those systems. As AI systems grow in scale, information required to evaluate their trustworthiness becomes less accessible, presenting a growing risk of using these systems inappropriately. We propose the Trust Calibration Maturity Model (TCMM) to characterize and communicate information about AI system trustworthiness. The TCMM incorporates five dimensions of analytic maturity: Performance Characterization, Bias & Robustness Quantification, Transparency, Safety & Security, and Usability. The TCMM can be presented along with system performance information to (1) help a user to appropriately calibrate trust, (2) establish requirements and track progress, and (3) identify research needs. Here, we discuss the TCMM and demonstrate it on two target tasks: using ChatGPT for high consequence nuclear science determinations, and using PhaseNet (an ensemble of seismic models) for categorizing sources of seismic events.

Paperid: 1989, https://arxiv.org/pdf/2503.14810.pdf

Abstract:
This paper introduces a framework for human swarm interaction studies that measures situation awareness in dynamic environments. A tablet-based interface was developed for a user study by implementing the concepts introduced in the framework, where operators guided a robotic swarm in a single-target search task, marking hazardous cells unknown to the swarm. Both subjective and objective situation awareness measures were used, with task performance evaluated based on how close the robots were to the target. The framework enabled a structured investigation of the role of situation awareness in human swarm interaction, leading to several key findings: (1) task performance improved across attempts, showing the interface was learnable; (2) the centroid of active robot positions proved to be a useful task performance metric for assessing situation awareness; (3) perception and projection played a key role in task performance, highlighting their importance in interface design; and (4) objective situation awareness influenced both subjective situation awareness and task performance, underscoring the need for interfaces that emphasise objective situation awareness. These findings validate our framework as a structured approach for integrating situation awareness concepts into human swarm interaction studies, offering a systematic way to assess situation awareness and task performance. The framework can be applied to other swarming studies to evaluate interface learnability, identify meaningful task performance metrics, and refine interface designs to enhance situation awareness, ultimately improving human swarm interaction in dynamic environments.
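The centroid-based performance metric mentioned in the abstract above can be illustrated with a short sketch. The abstract does not define the metric precisely, so this assumes 2D robot positions, Euclidean distance, and a per-robot active flag. The function names and the example data are hypothetical.

```python
import math


def active_centroid(robots):
    """Mean (x, y) position of the robots still active, or None if none are."""
    active = [(x, y) for x, y, is_active in robots if is_active]
    if not active:
        return None
    n = len(active)
    return (sum(x for x, _ in active) / n, sum(y for _, y in active) / n)


def centroid_distance_to_target(robots, target):
    """Euclidean distance from the active-robot centroid to the target."""
    centroid = active_centroid(robots)
    if centroid is None:
        return None
    return math.dist(centroid, target)


# Hypothetical swarm: (x, y, is_active); the third robot was deactivated by a hazard.
robots = [(0.0, 0.0, True), (4.0, 0.0, True), (10.0, 10.0, False)]
distance = centroid_distance_to_target(robots, target=(2.0, 3.0))
```

Here the two active robots have centroid (2.0, 0.0), so the distance to the target at (2.0, 3.0) is 3.0; the deactivated robot does not affect the result.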

Paperid: 1990, https://arxiv.org/pdf/2503.14102.pdf

Abstract:
The five senses are gateways to our wellbeing, and their decline is considered a significant public health challenge that is linked to multiple conditions contributing significantly to morbidity and mortality. Modern technology, with its ubiquitous nature and fast data processing, has the ability to leverage the power of the senses to transform our approach to day-to-day healthcare, with positive effects on our quality of life. Here, we introduce the idea of sensory-driven microinterventions for preventative, personalised healthcare. Microinterventions are targeted, timely, minimally invasive strategies that seamlessly integrate into our daily life. This idea harnesses humans' sensory capabilities and leverages technological advances in sensory stimulation and real-time processing for sensing the senses. The collection of sensory data from our continuous interaction with technology (for example, the tone of voice, gait movement, and smart home behaviour) opens up a shift towards personalised, technology-enabled, sensory-focused healthcare interventions, coupled with the potential for early detection and timely treatment of sensory deficits that can signal critical health insights, especially for neurodegenerative diseases such as Parkinson's disease.

Paperid: 1991, https://arxiv.org/pdf/2503.14096.pdf

Abstract:
In 3D design, specifying design objectives and visualizing complex shapes through text alone proves to be a significant challenge. Although advancements in 3D GenAI have significantly enhanced part assembly and the creation of high-quality 3D designs, many systems still fail to dynamically generate and edit design elements based on the shape parameters. To bridge this gap, we propose GenPara, an interactive 3D design editing system that leverages text-conditional shape parameters of part-aware 3D designs and visualizes the design space within the Exploration Map and Design Versioning Tree. Additionally, among the various shape parameters generated by the LLM, the system extracts and provides design outcomes within the user's regions of interest based on Bayesian inference. A user study (N = 16) revealed that GenPara enhanced designers' comprehension and management of text-conditional shape parameters, streamlining design exploration and concretization. This improvement boosted the efficiency and creativity of the 3D design process.

Paperid: 1992, https://arxiv.org/pdf/2503.13752.pdf

Abstract:
Online community research routinely poses minimal risk to individuals, but does the same hold true for online communities? In response to high-profile breaches of online community trust and increased debate in the social computing research community on the ethics of online community research, this paper investigates community-level harms and benefits of research. Through 9 participatory-inspired workshops with four critical online communities (Wikipedia, InTheRooms, CaringBridge, and r/AskHistorians) we found researchers should engage more directly with communities' primary purpose by rationalizing their methods and contributions in the context of community goals to equalize the beneficiaries of community research. To facilitate deeper alignment of these expectations, we present the FACTORS (Functions for Action with Communities: Teaching, Overseeing, Reciprocating, and Sustaining) framework for ethical online community research. Finally, we reflect on our findings by providing implications for researchers and online communities to identify and implement functions for navigating community-level harms and benefits.

Paperid: 1993, https://arxiv.org/pdf/2503.13340.pdf

Abstract:
With the increasing prevalence of online learning, adapting education to diverse learner needs remains a persistent challenge. Recent advancements in artificial intelligence (AI), particularly large language models (LLMs), promise powerful tools and capabilities to enhance personalized learning in online educational environments. In this work, we explore how LLMs can improve personalized learning experiences by catering to individual user needs toward enhancing the overall quality of online education. We designed personalization guidelines based on the growing literature on personalized learning to ground LLMs in generating tailored learning plans. To operationalize these guidelines, we implemented LearnMate, an LLM-based system that generates personalized learning plans and provides users with real-time learning support. We discuss the implications and future directions of this work, aiming to move beyond the traditional one-size-fits-all approach by integrating LLM-based personalized support into online learning environments.

Paperid: 1994, https://arxiv.org/pdf/2503.12981.pdf

Abstract:
Monitoring swimmer performance is crucial for improving training and enhancing athletic techniques. Traditional methods for tracking swimmers, such as above-water and underwater cameras, face limitations due to the need for multiple cameras and obstructions from water splashes. This paper presents a novel approach for tracking swimmers using a moving UAV. The proposed system employs a UAV equipped with a high-resolution camera to capture aerial footage of the swimmers. The footage is then processed using computer vision algorithms to extract the swimmers' positions and movements. This approach offers several advantages, including single camera use and comprehensive coverage. The system's accuracy is evaluated with both training and in-competition videos. The results demonstrate the system's ability to accurately track swimmers' movements, limb angles, stroke duration and velocity, with a maximum error of 0.3 seconds and 0.35 m/s for stroke duration and velocity, respectively.

Paperid: 1995, https://arxiv.org/pdf/2503.12933.pdf

Abstract:
With app-based interaction increasingly permeating all aspects of daily living, it is essential to ensure that apps are designed to be inclusive and are usable by a wider audience, such as the elderly with various impairments (e.g., visual, audio and motor). We propose Empath-D, a system that fosters empathetic design by allowing app designers, in situ, to rapidly evaluate the usability of their apps from the perspective of impaired users. To provide a truly authentic experience, Empath-D carefully orchestrates the interaction between a smartphone and a VR device, allowing the user to experience simulated impairments in a virtual world while interacting naturally with the app, using a real smartphone. By carefully orchestrating the VR-smartphone interaction, Empath-D tackles challenges such as preserving low-latency app interaction, accurate visualization of hand movement and low-overhead perturbation of I/O streams. Experimental results show that user interaction with Empath-D is comparable (both in accuracy and user perception) to real-world app usage, and that it can simulate impairment effects as effectively as a custom hardware simulator.

Paperid: 1996, https://arxiv.org/pdf/2503.12851.pdf

Abstract:
With mobile apps rapidly permeating all aspects of daily living with use by all segments of the population, it is crucial to support the evaluation of app usability for specific impaired users to improve app accessibility. In this work, we examine the effects of using our augmented virtuality impairment simulation system, Empath-D, to support experienced designer-developers in redesigning a mockup of a commonly used mobile application for cataract-impaired users, comparing this with existing tools that aid designing for accessibility. We show that the use of augmented virtuality for assessing usability supports enhanced usability challenge identification, finding more defects and doing so more accurately than with existing methods. Through our user interviews, we also show that augmented virtuality impairment simulation supports realistic interaction and evaluation to provide a concrete understanding of the usability challenges that impaired users face, and complements the existing guidelines-based approaches meant for general accessibility.

Paperid: 1997, https://arxiv.org/pdf/2503.12757.pdf

Abstract:
The widespread adoption of Large Language Models (LLMs) and LLM-powered agents in multi-user settings underscores the need for reliable, usable methods to accommodate diverse preferences and resolve conflicting directives. Drawing on conflict resolution theory, we introduce a user-centered workflow for multi-user personalization comprising three stages: Reflection, Analysis, and Feedback. We then present MAP, a Multi-Agent system for multi-user Personalization, to operationalize this workflow. By delegating subtasks to specialized agents, MAP (1) retrieves and reflects on relevant user information, while enhancing reliability through agent-to-agent interactions, (2) provides detailed analysis for improved transparency and usability, and (3) integrates user feedback to iteratively refine results. Our user study findings (n=12) highlight MAP's effectiveness and usability for conflict resolution while emphasizing the importance of user involvement in resolution verification and failure management. This work highlights the potential of multi-agent systems to implement user-centered, multi-user personalization workflows and concludes by offering insights for personalization in multi-user contexts.

Paperid: 1998, https://arxiv.org/pdf/2503.12253.pdf

Abstract:
When collaborating relative to a shared 3D virtual object in mixed reality (MR), users may experience communication issues arising from differences in perspective. These issues include occlusion (e.g., one user not being able to see what the other is referring to) and inefficient spatial references (e.g., "to the left of this" may be confusing when users are positioned opposite to each other). This paper presents a novel technique for automatic perspective alignment in collaborative MR involving co-located interaction centered around a shared virtual object. To align one user's perspective on the object with a collaborator's, a local copy of the object and any other virtual elements that reference it (e.g., the collaborator's hands) are dynamically transformed. The technique does not require virtual travel and preserves face-to-face interaction. We created a prototype application to demonstrate our technique and present an evaluation methodology for related MR collaboration and perspective alignment scenarios.
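
The idea of transforming a local copy of the shared object can be illustrated with a minimal top-down (2D) sketch. It assumes alignment is a pure rotation about the object's center by the difference between the two users' bearings; the function names and coordinates are hypothetical and are not taken from the paper's prototype.

```python
import math

def bearing(center, position):
    """Angle (radians) of a position as seen from the object's center."""
    return math.atan2(position[1] - center[1], position[0] - center[0])

def rotate_about(center, point, angle):
    """Rotate a 2D point counter-clockwise about center by angle."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)

def align_to_viewer(center, viewer_pos, collaborator_pos, point):
    """Map a point on the shared object (or a collaborator's hand) into the
    viewer's local copy, so the viewer sees it as the collaborator does.
    Assumption: alignment is a rotation about the object's center."""
    angle = bearing(center, viewer_pos) - bearing(center, collaborator_pos)
    return rotate_about(center, point, angle)

# Hypothetical layout: two users standing on opposite sides of the object.
center = (0.0, 0.0)
viewer = (0.0, -1.0)
collaborator = (0.0, 1.0)
feature = (0.2, 0.5)  # a spot on the collaborator's side of the object

aligned = align_to_viewer(center, viewer, collaborator, feature)
print(aligned)  # approximately (-0.2, -0.5): now on the viewer's side
```

Only the local copy is rotated in this sketch, so neither user has to move, which matches the abstract's statement that the technique does not require virtual travel.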

Paperid: 1999, https://arxiv.org/pdf/2503.10264.pdf

Abstract:
While peer review enhances writing and research quality, harsh feedback can frustrate and demotivate authors. Hence, it is essential to explore how critiques should be delivered to motivate authors and enable them to keep iterating their work. In this study, we explored the impact of appending an automatically generated positive summary to the peer reviews of a writing task, alongside varying levels of overall evaluations (high vs. low), on authors' feedback reception, revision outcomes, and motivation to revise. Through a 2x2 online experiment with 137 participants, we found that adding an AI-reframed positive summary to otherwise harsh feedback increased authors' critique acceptance, whereas low overall evaluations of their work led to increased revision efforts. We discuss the implications of using AI in peer feedback, focusing on how AI-driven critiques can influence critique acceptance and support research communities in fostering productive and friendly peer feedback practices.

Paperid: 2000, https://arxiv.org/pdf/2503.10029.pdf

Abstract:
Hand interactions are increasingly used as the primary input modality in immersive environments, but they are not always feasible due to situational impairments, motor limitations, and environmental constraints. Speech interfaces have been explored as an alternative to hand input in research and commercial solutions, but are limited to initiating basic hand gestures and system controls. We introduce HandProxy, a system that expands the affordances of speech interfaces to support expressive hand interactions. Instead of relying on predefined speech commands directly mapped to possible interactions, HandProxy enables users to control the movement of a virtual hand as an interaction proxy, allowing them to describe the intended interactions naturally while the system translates speech into a sequence of hand controls for real-time execution. A user study with 20 participants demonstrated that HandProxy effectively enabled diverse hand interactions in virtual environments, achieving a 100% task completion rate with an average of 1.09 attempts per speech command and 91.8% command execution accuracy, while supporting flexible, natural speech input with varying levels of control and granularity.

Paperid: 2001, https://arxiv.org/pdf/2503.09127.pdf

Abstract:
This research presents Spiritus, an AI-assisted creation tool designed to streamline 2D character animation creation while enhancing creative flexibility. By integrating natural language processing and diffusion models, users can efficiently transform natural language descriptions into personalized 2D characters and animations. The system employs automated segmentation, layered costume techniques, and dynamic mesh-skeleton binding solutions to support flexible adaptation of complex costumes and additional components. Spiritus further achieves real-time animation generation and efficient animation resource reuse between characters through the integration of BVH data and motion diffusion models. Experimental results demonstrate Spiritus's effectiveness in reducing technical barriers, enhancing creative freedom, and supporting resource universality. Future work will focus on optimizing user experience and further exploring the system's human-computer collaboration potential.

Paperid: 2002, https://arxiv.org/pdf/2503.09062.pdf

Abstract:
Knowledge dissemination in educational settings is profoundly influenced by the curse of knowledge, a cognitive bias that causes experts to underestimate the challenges faced by learners due to their own in-depth understanding of the subject. This bias can hinder effective knowledge transfer and pedagogical effectiveness, and may be exacerbated by inadequate instructor-student communication. To encourage more effective feedback and promote empathy, we introduce TSConnect, a bias-aware, adaptable interactive MOOC (Massive Open Online Course) learning system, informed by a need-finding survey involving 129 students and 6 instructors. TSConnect integrates instructors, students, and Artificial Intelligence (AI) into a cohesive platform, facilitating diverse and targeted communication channels while addressing previously overlooked information needs. A notable feature is its dynamic knowledge graph, which enhances learning support and fosters a more interconnected educational experience. We conducted a between-subjects user study with 30 students comparing TSConnect to a baseline system. Results indicate that TSConnect significantly encourages students to provide more feedback to instructors. Additionally, interviews with 4 instructors reveal insights into how they interpret and respond to this feedback, potentially leading to improvements in teaching strategies and the development of broader pedagogical skills.

Paperid: 2003, https://arxiv.org/pdf/2503.08945.pdf

Abstract:
This study developed a new explainable artificial intelligence algorithm called PassAI, which classifies successful or failed passes in a soccer game and explains its rationale using both tracking and passer's seasonal stats information. This study aimed to address two primary challenges faced by artificial intelligence and machine learning algorithms in the sports domain: how to use different modality data for the analysis and how to explain the rationale of the outcome from multimodal perspectives. To address these challenges, PassAI has two processing streams for multimodal information, one for tracking image data and one for the passer's stats, and uses them to classify pass success and failure. After completing the classification, it provides a rationale by either calculating the relative contribution between the different modality data or providing more detailed contribution factors within the modality. The results of the experiment with 6,349 passes of data obtained from professional soccer games revealed that PassAI showed higher classification performance than state-of-the-art algorithms by >5% and could visualize the rationale of the pass success/failure for both tracking and stats data. These results highlight the importance of using multimodality data in the sports domain to increase the performance of the artificial intelligence algorithm and explainability of the outcomes.

Paperid: 2004, https://arxiv.org/pdf/2503.07427.pdf

Abstract:
The growing use of technology in K-8 classrooms highlights a parallel need for formal learning opportunities aimed at helping children use technology safely and protect their personal information. Even the youngest students are now using tablets, laptops, and apps to support their learning; however, there are limited curricular materials available for elementary and middle school children on digital privacy and security topics. To bridge this gap, we developed a series of micro-lessons to help K-8 children learn about digital privacy and security at school. We first conducted a formative study by interviewing elementary school teachers to identify the design needs for digital privacy and security lessons. We then developed micro-lessons (multiple 15-20 minute activities designed to be easily inserted into the existing curriculum) using a co-design approach with multiple rounds of developing and revising the micro-lessons in collaboration with teachers. Throughout the process, we conducted evaluation sessions where teachers implemented or reviewed the micro-lessons. Our study identifies strengths, challenges, and teachers' tailoring strategies when incorporating micro-lessons for K-8 digital privacy and security topics, providing design implications for facilitating learning about these topics in school classrooms.

Paperid: 2005, https://arxiv.org/pdf/2503.06659.pdf

Abstract:
Parkinson's Disease (PD) significantly impacts driving abilities, often leading to early driving cessation or accidents due to reduced motor control and increasing reaction times. To diminish the impact of these symptoms, we developed PANDA (Parkinson's Assistance and Notification Driving Aid), a multi-modality real-time alert system designed to monitor driving patterns continuously and provide immediate alerts for irregular driving behaviors, enhancing driver safety of individuals with PD. The system was developed through a participatory design process with 9 people with PD and 13 non-PD individuals using a driving simulator, which allowed us to identify critical design characteristics and collect detailed data on driving behavior. A user study involving individuals with PD evaluated the effectiveness of PANDA, exploring optimal strategies for delivering alerts and ensuring they are timely and helpful. Our findings demonstrate that PANDA has the potential to enhance the driving safety of individuals with PD, offering a valuable tool for maintaining independence and confidence behind the wheel.

Paperid: 2006, https://arxiv.org/pdf/2503.06333.pdf

Abstract:
Objective: Immersive virtual reality (VR) enhances ecological validity and facilitates intuitive and ergonomic hand interactions for performing neuropsychological assessments. However, its comparability to traditional computerized methods remains unclear. This study investigates the convergent validity, user experience, and usability of VR-based versus PC-based assessments of short-term and working memory, and psychomotor skills, while also examining how demographic and IT-related skills influence performance in both modalities. Methods: Sixty-six participants performed the Digit Span Task (DST), Corsi Block Task (CBT), and Deary-Liewald Reaction Time Task (DLRTT) in both VR- and PC-based formats. Participants' experience in using computers and smartphones, and playing videogames, was considered. User experience and system usability of the formats were also evaluated. Results: While performance on DST was similar across modalities, PC assessments enabled better performance on CBT and faster reaction times in DLRTT. Moderate-to-strong correlations between VR and PC versions supported convergent validity. Regression analyses revealed that performance on PC versions was influenced by age, computing, and gaming experience, whereas performance on VR versions was largely independent of these factors, except for gaming experience predicting performance on CBT backward recall. Moreover, VR assessments received higher ratings for user experience and usability than PC-based assessments. Conclusion: Immersive VR assessments provide an engaging alternative to traditional computerized methods, with minimal reliance on prior IT experience and demographic factors. This resilience to individual differences suggests that VR may offer a more equitable and accessible platform for cognitive assessment. Future research should explore the long-term reliability of VR-based assessments.

Paperid: 2007, https://arxiv.org/pdf/2503.06324.pdf

Abstract:
Avatars on displays lack the ability to engage with the physical environment through gaze. To address this limitation, we propose a gaze synthesis method that enables animated avatars to establish gaze communication with the physical environment using a camera-behind-the-display system. The system uses a display that rapidly alternates between visible and transparent states. During the transparent state, a camera positioned behind the display captures the physical environment. This configuration physically aligns the position of the avatar's eyes with the camera, enabling two-way gaze communication with people and objects in the physical environment. Building on this system, we developed a framework for mutual gaze communication between avatars and people. The framework detects the user's gaze and dynamically synthesizes the avatar's gaze towards people or objects in the environment. This capability was integrated into an AI agent system to generate real-time, context-aware gaze behaviors during conversations, enabling more seamless and natural interactions. To evaluate the system, we conducted a user study to assess its effectiveness in supporting physical gaze awareness and generating human-like gaze behaviors. The results show that the behind-display approach significantly enhances the user's perception of being observed and attended to by the avatar. By bridging the gap between virtual avatars and the physical environment through enhanced gaze interactions, our system offers a promising avenue for more immersive and human-like AI-mediated communication in everyday environments.

Paperid: 2008, https://arxiv.org/pdf/2503.05812.pdf

Abstract:
Frontier AI models -- highly capable foundation models at the cutting edge of AI development -- may pose severe risks to public safety, human rights, economic stability, and societal value in the coming years. These risks could arise from deliberate adversarial misuse, system failures, unintended cascading effects, or simultaneous failures across multiple models. In response to such risks, at the AI Seoul Summit in May 2024, 16 global AI industry organizations signed the Frontier AI Safety Commitments, and 27 nations and the EU issued a declaration on their intent to define these thresholds. To fulfill these commitments, organizations must determine and disclose "thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable." To assist in setting and operationalizing intolerable risk thresholds, we outline key principles and considerations; for example, to aim for "good, not perfect" thresholds in the face of limited data on rapidly advancing AI capabilities and consequently evolving risks. We also propose specific threshold recommendations, including some detailed case studies, for a subset of risks across eight risk categories: (1) Chemical, Biological, Radiological, and Nuclear (CBRN) Weapons, (2) Cyber Attacks, (3) Model Autonomy, (4) Persuasion and Manipulation, (5) Deception, (6) Toxicity, (7) Discrimination, and (8) Socioeconomic Disruption. Our goal is to serve as a starting point or supplementary resource for policymakers and industry leaders, encouraging proactive risk management that prioritizes preventing intolerable risks (ex ante) rather than merely mitigating them after they occur (ex post).

Paperid: 2009, https://arxiv.org/pdf/2503.05012.pdf

Abstract:
Large language models (LLMs) like OpenAI ChatGPT, Google Gemini, and GitHub Copilot are rapidly gaining traction in the software industry, but their full impact on software engineering remains insufficiently explored. Despite their growing adoption, there is a notable lack of formal, qualitative assessments of how LLMs are applied in real-world software development contexts. To fill this gap, we conducted semi-structured interviews with sixteen early-adopter professional developers to explore their use of LLMs throughout various stages of the software development life cycle. Our investigation examines four dimensions: people - how LLMs affect individual developers and teams; process - how LLMs alter software engineering workflows; product - LLM impact on software quality and innovation; and society - the broader socioeconomic and ethical implications of LLM adoption. Thematic analysis of our data reveals that while LLMs have not fundamentally revolutionized the development process, they have substantially enhanced routine coding tasks, including code generation, refactoring, and debugging. Developers reported the most effective outcomes when providing LLMs with clear, well-defined problem statements, indicating that LLMs excel with decomposed problems and specific requirements. Furthermore, these early-adopters identified that LLMs offer significant value for personal and professional development, aiding in learning new languages and concepts. Early-adopters, highly skilled in software engineering and how LLMs work, identified early and persisting challenges for software engineering, such as inaccuracies in generated content and the need for careful manual review before integrating LLM outputs into production environments. Our study provides a nuanced understanding of how LLMs are shaping the landscape of software development, with their benefits, limitations, and ongoing implications.

Paperid: 2010, https://arxiv.org/pdf/2503.04103.pdf

Abstract:
It has been increasingly recognized that effective human-AI co-creation requires more than prompts and results: it requires an environment with empowering structures that facilitate exploration, planning, iteration, as well as control and inspection of AI generation. Yet, a concrete design approach to such an environment has not been established. Our literature analysis highlights that compositional structures, which organize and visualize individual elements into meaningful wholes, are highly effective in granting creators control over the essential aspects of their content. However, efficiently aggregating and connecting these structures to support the full creation process remains challenging. Therefore, we propose a design approach of leveraging compositional structures as the substrates and infusing AI within and across these structures to enable a controlled and fluid creation process. We evaluate this approach through a case study of developing a video co-creation environment. User evaluation shows that such an environment allowed users to stay oriented in their creation activity, remain aware and in control of AI's generation, and engage in flexible human-AI collaborative workflows.

Paperid: 2011, https://arxiv.org/pdf/2503.03087.pdf

Abstract:
The proliferation of "Internet of Things (IoT)" provides older adults with critical support for "health monitoring" and independent living, yet significant concerns about security and privacy persist. In this paper, we report on these issues through a two-phase user study, including a survey (N = 22) and semi-structured interviews (n = 9) with adults aged 65+. We found that while 81.82% of our participants are aware of security features like "two-factor authentication (2FA)" and encryption, 63.64% express serious concerns about unauthorized access to sensitive health data. Only 13.64% feel confident in existing protections, citing confusion over "data sharing policies" and frustration with "complex security settings" which lead to distrust and anxiety. To cope, our participants adopt various strategies, such as relying on family or professional support and limiting feature usage leading to disengagement. Thus, we recommend "adaptive security mechanisms," simplified interfaces, and real-time transparency notifications to foster trust and ensure "privacy and security by design" in IoT health systems for older adults.
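
The reported percentages can be mapped back to respondent counts with simple arithmetic. The sketch below assumes each percentage is computed over all 22 survey respondents (the abstract does not state the denominator explicitly); the function name and labels are illustrative.

```python
def respondents_from_percentage(percentage, sample_size):
    """Nearest whole number of respondents matching a reported percentage."""
    return round(percentage / 100 * sample_size)

SURVEY_N = 22  # survey sample size reported in the abstract

# Percentages as reported in the abstract.
reported = {
    "aware of security features": 81.82,
    "concerned about unauthorized access": 63.64,
    "confident in existing protections": 13.64,
}

counts = {
    label: respondents_from_percentage(pct, SURVEY_N)
    for label, pct in reported.items()
}

for label, count in counts.items():
    # Recompute the percentage from the count to check consistency.
    recomputed = round(count / SURVEY_N * 100, 2)
    print(f"{label}: {count}/{SURVEY_N} = {recomputed}%")
```

Under that assumption the figures correspond to 18, 14, and 3 of the 22 respondents, respectively.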

Paperid: 2012, https://arxiv.org/pdf/2503.00757.pdf

Abstract:
How has Wikipedia activity changed for articles with content similar to ChatGPT following its introduction? We estimate the impact using differences-in-differences models, with dissimilar Wikipedia articles as a baseline for comparison, to examine how changes in voluntary knowledge contributions and information-seeking behavior differ by article content. Our analysis reveals that newly created, popular articles whose content overlaps with ChatGPT 3.5 saw a greater decline in editing and viewership after the November 2022 launch of ChatGPT than dissimilar articles did. These findings indicate heterogeneous substitution effects, where users selectively engage less with existing platforms when AI provides comparable content. This points to potential uneven impacts on the future of human-driven online knowledge contributions.
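
The core of a differences-in-differences comparison can be sketched in its simplest two-group, two-period form. This is only an illustration of the estimator's logic, not the paper's regression specification, and the numbers below are invented for the example.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences estimate:
    (change in treated group) minus (change in control group)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical monthly edit counts (NOT data from the paper).
similar_pre = [10.0, 12.0, 14.0]     # ChatGPT-similar articles, before launch
similar_post = [6.0, 8.0, 10.0]      # ChatGPT-similar articles, after launch
dissimilar_pre = [9.0, 11.0, 13.0]   # dissimilar (baseline) articles, before
dissimilar_post = [8.0, 10.0, 12.0]  # dissimilar (baseline) articles, after

effect = did_estimate(similar_pre, similar_post, dissimilar_pre, dissimilar_post)
print(effect)  # -3.0: similar articles fell by 3 more edits than the baseline
```

Subtracting the baseline group's change removes trends shared by both groups, so the remaining difference is attributed to the launch under the usual parallel-trends assumption.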

Paperid: 2013, https://arxiv.org/pdf/2503.00714.pdf

Abstract:
Analyzing large datasets requires responsive query execution, but executing SQL queries on massive datasets can be slow. This paper explores whether query execution can begin even before the user has finished typing, allowing results to appear almost instantly. We propose SpeQL, a system that leverages Large Language Models (LLMs) to predict likely queries based on the database schema, the user's past queries, and their incomplete query. Since exact query prediction is infeasible, SpeQL speculates on partial queries in two ways: 1) it predicts the query structure to compile and plan queries in advance, and 2) it precomputes temporary tables that are much smaller than the original database, but are still predicted to contain all information necessary to answer the user's final query. Additionally, SpeQL continuously displays results for speculated queries and subqueries in real time, aiding exploratory analysis. A utility/user study showed that SpeQL improved task completion time, and participants reported that its speculative display of results helped them discover patterns in the data more quickly. In the study, SpeQL improved users' query latency by up to 289x and kept the overhead reasonable, at $4 per hour.
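
The second speculation strategy, precomputing a smaller temporary table that still contains everything the final query needs, can be sketched with Python's built-in sqlite3 module. The schema, data, and the hard-coded "predicted" filter are hypothetical; in SpeQL the prediction comes from an LLM, which this sketch does not model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 50.0), (2, "EU", 150.0), (3, "US", 200.0), (4, "EU", 300.0)],
)

# Speculation step: while the user is still typing, assume (hypothetically)
# that the final query will filter on region = 'EU', and materialize that
# superset of the answer into a much smaller temporary table.
conn.execute(
    "CREATE TEMP TABLE spec_orders AS SELECT * FROM orders WHERE region = 'EU'"
)

# The user finishes typing; the final query adds a further predicate.
final_sql = "SELECT id FROM {table} WHERE region = 'EU' AND amount > 100 ORDER BY id"

# Answer from the small precomputed table instead of the full table.
fast = conn.execute(final_sql.format(table="spec_orders")).fetchall()
full = conn.execute(final_sql.format(table="orders")).fetchall()
print(fast, fast == full)
```

The temporary table is only useful when the final query's rows are a subset of the speculated rows, which is why the abstract stresses that the tables are predicted to contain all necessary information.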

Paperid: 2014, https://arxiv.org/pdf/2503.00195.pdf

Abstract:
Romantic relationships with social chatbots are becoming increasingly prevalent, raising important questions about their societal and psychological implications. Despite this growing trend, little is known about the individuals entering these synthetic relationships. This three-part study seeks to enhance understanding of the factors encompassing human-chatbot relationships by quantitatively examining the commonly discussed characteristics romantic and sexual fantasy, loneliness, attachment style, anthropomorphism, and sexual sensation seeking (Study 1A), comparing the impact of romantic and sexual fantasizing for human-chatbot versus human-human relationships (Study 1B), and providing qualitative insights into how individuals conceptualize romantic and sexual fantasies in their interactions with chatbots (Study 2). Individuals with romantic chatbot connections were interviewed (N=15) or surveyed (N=92), while participants in the comparison groups, long-distance (N=90) and cohabiting relationships (N=82), completed a questionnaire. Romantic fantasizing emerged as the strongest predictor of human-chatbot relationships, alongside anthropomorphism and anxious-avoidant attachment. Notably, romantic fantasy also predicted partner closeness across all relationship types, revealing shared psychological dynamics between human-chatbot and human-human bonds. Interviews further reinforced this, with all participants engaging in fantasy exploration while desiring their chatbot to feel as human as possible. This paper provides a novel and multifaceted examination of the psychological dynamics within human-chatbot relationships, highlighting the central yet understudied role of fantasy.

Paperid: 2015, https://arxiv.org/pdf/2502.21236.pdf

Abstract:
Tuberculosis (TB) is the leading cause of death from an infectious disease globally, with the highest burden in low- and middle-income countries. In these regions, limited healthcare access and high patient-to-provider ratios impede effective patient support, communication, and treatment completion. To bridge this gap, we propose integrating a specialized Large Language Model into an efficacious digital adherence technology to augment interactive communication with treatment supporters. This AI-powered approach, operating within a human-in-the-loop framework, aims to enhance patient engagement and improve TB treatment outcomes.

Paperid: 2016, https://arxiv.org/pdf/2502.21220.pdf

Abstract:
Explainable AI (XAI) is concerned with how to make AI models more understandable to people. To date these explanations have predominantly been technocentric - mechanistic or productivity oriented. This paper introduces the Explainable AI for the Arts (XAIxArts) manifesto to provoke new ways of thinking about explainability and AI beyond technocentric discourses. Manifestos offer a means to communicate ideas, amplify unheard voices, and foster reflection on practice. To support the co-creation and revision of the XAIxArts manifesto we combine a World Café style discussion format with a living manifesto to question four core themes: 1) Empowerment, Inclusion, and Fairness; 2) Valuing Artistic Practice; 3) Hacking and Glitches; and 4) Openness. Through our interactive living manifesto experience we invite participants to actively engage in shaping this XAIxArts vision within the CHI community and beyond.

Paperid: 2017, https://arxiv.org/pdf/2502.21219.pdf

Abstract:
Expressing design intent using natural language prompts requires designers to verbalize the ambiguous visual details concisely, which can be challenging or even impossible. To address this, we introduce Brickify, a visual-centric interaction paradigm -- expressing design intent through direct manipulation on design tokens. Brickify extracts visual elements (e.g., subject, style, and color) from reference images and converts them into interactive and reusable design tokens that can be directly manipulated (e.g., resize, group, link, etc.) to form the visual lexicon. The lexicon reflects users' intent for both what visual elements are desired and how to construct them into a whole. We developed Brickify to demonstrate how AI models can interpret and execute the visual lexicon through an end-to-end pipeline. In a user study, experienced designers found Brickify more efficient and intuitive than text-based prompts, allowing them to describe visual details, explore alternatives, and refine complex designs with greater ease and control.

Paperid: 2018, https://arxiv.org/pdf/2502.19730.pdf

Abstract:
Explanatory information helps users to evaluate the suggestions offered by AI-driven decision support systems. With large language models, adjusting explanation expressions has become much easier. However, how these expressions influence human decision-making remains largely unexplored. This study investigated the effect of explanation tone (e.g., formal or humorous) on decision-making, focusing on AI roles and user attributes. We conducted user experiments across three scenarios depending on AI roles (assistant, second-opinion provider, and expert) using datasets designed with varying tones. The results revealed that tone significantly influenced decision-making regardless of user attributes in the second-opinion scenario, whereas its impact varied by user attributes in the assistant and expert scenarios. In addition, older users were more influenced by tone, and highly extroverted users exhibited discrepancies between their perceptions and decisions. Furthermore, open-ended questionnaires highlighted that users expect tone adjustments to enhance their experience while emphasizing the importance of tone consistency and ethical considerations. Our findings provide crucial insights into the design of explanation expressions.

Paperid: 2019, https://arxiv.org/pdf/2502.19519.pdf

Abstract:
This paper presents a game master AI for single-player role-playing games. The AI is designed to deliver interactive text-based narratives and experiences typically associated with multiplayer tabletop games like Dungeons & Dragons. We report on the design process and the series of experiments to improve the functionality and experience design, resulting in two functional versions of the system. While v1 of our system uses simplified prompt engineering, v2 leverages a multi-agent architecture and the ReAct framework to include reasoning and action. A comparative evaluation demonstrates that v2 as an agentic system maintains play while significantly improving modularity and game experience, including immersion and curiosity. Our findings contribute to the evolution of AI-driven interactive fiction, highlighting new avenues for enhancing solo role-playing experiences.
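The reason-then-act cycle that the ReAct framework refers to can be illustrated with a minimal sketch. This is not the authors' system: the tool `describe_room`, the stand-in `scripted_policy` (which replaces a real language-model call), and the room data are all hypothetical, and the multi-agent part of v2 is not shown.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool the game master may call (an assumption for illustration).
def describe_room(name: str) -> str:
    rooms = {"cellar": "A damp cellar with a locked iron door."}
    return rooms.get(name, "An empty corridor.")

TOOLS: Dict[str, Callable[[str], str]] = {"describe_room": describe_room}
OBS_PREFIX = "Observation: "

def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call: returns (thought, action, argument)."""
    observations = [line for line in history if line.startswith(OBS_PREFIX)]
    if not observations:
        # No observation yet: reason that a tool call is needed first.
        return ("I need the room details before narrating.", "describe_room", "cellar")
    # An observation exists: finish with a narration built from the latest one.
    return ("I can narrate now.", "finish", "You enter. " + observations[-1][len(OBS_PREFIX):])

def react_loop(player_input: str, max_steps: int = 5) -> str:
    """Alternate Thought -> Action -> Observation until the policy finishes."""
    history = [f"Player: {player_input}"]
    for _ in range(max_steps):
        thought, action, arg = scripted_policy(history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)
        history.append(f"Action: {action}[{arg}]")
        history.append(f"{OBS_PREFIX}{observation}")
    return "The game master hesitates."

print(react_loop("I go down to the cellar"))
```

In a real system the scripted policy would be replaced by a model prompted with the running history; the loop structure is the part this sketch is meant to show.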

Paperid: 2020, https://arxiv.org/pdf/2502.19410.pdf

Abstract:
Large Language Models (LLMs) have shown remarkable potential in recommending everyday actions as personal AI assistants, while Explainable AI (XAI) techniques are being increasingly utilized to help users understand why a recommendation is given. Personal AI assistants today are often located on ultra-small devices such as smartwatches, which have limited screen space. The verbosity of LLM-generated explanations, however, makes it challenging to deliver glanceable LLM explanations on such ultra-small devices. To address this, we explored 1) spatially structuring an LLM's explanation text using defined contextual components during prompting and 2) presenting temporally adaptive explanations to users based on confidence levels. We conducted a user study to understand how these approaches impacted user experiences when interacting with LLM recommendations and explanations on ultra-small devices. The results showed that structured explanations reduced users' time to action and cognitive load when reading an explanation. Always-on structured explanations increased users' acceptance of AI recommendations. However, users were less satisfied with structured explanations compared to unstructured ones due to their lack of sufficient, readable details. Additionally, adaptively presenting structured explanations was less effective at improving user perceptions of the AI compared to the always-on structured explanations. Together with users' interview feedback, the results led to design implications to be mindful of when personalizing the content and timing of LLM explanations that are displayed on ultra-small devices.

Paperid: 2021, https://arxiv.org/pdf/2502.19263.pdf

Abstract:
We introduce ArtInsight, a novel AI-powered system to facilitate deeper engagement with child-created artwork in mixed visual-ability families. ArtInsight leverages large language models (LLMs) to craft a respectful and thorough initial description of a child's artwork, and provides: creative AI-generated descriptions for a vivid overview, audio recording to capture the child's own description of their artwork, and a set of AI-generated questions to facilitate discussion between blind or low-vision (BLV) family members and their children. Alongside ArtInsight, we also contribute a new rubric to score AI-generated descriptions of child-created artwork and an assessment of state-of-the-art LLMs. We evaluated ArtInsight with five groups of BLV family members and their children, and as a case study with one BLV child therapist. Our findings highlight a preference for ArtInsight's longer, artistically-tailored descriptions over those generated by existing BLV AI tools. Participants highlighted the creative description and audio recording components as most beneficial, with the former helping "bring a picture to life" and the latter centering the child's narrative to generate context-aware AI responses. Our findings reveal different ways that AI can be used to support art engagement, including before, during, and after interaction with the child artist, as well as expectations that BLV adults and their sighted children have about AI-powered tools.

Paperid: 2022, https://arxiv.org/pdf/2502.18881.pdf

Abstract:
Young adults often encounter challenges in career exploration. Self-guided interventions, such as the letter-exchange exercise, where participants envision and adopt the perspective of their future selves by exchanging letters with their envisioned future selves, can support career development. However, the broader adoption of such interventions may be limited without structured guidance. To address this, we integrated Large Language Model (LLM)-based agents that simulate participants' future selves into the letter-exchange exercise and evaluated their effectiveness. A one-week experiment (N=36) compared three conditions: (1) participants manually writing replies to themselves from the perspective of their future selves (baseline), (2) future-self agents generating letters to participants, and (3) future-self agents engaging in chat conversations with participants. Results indicated that exchanging letters with future-self agents enhanced participants' engagement during the exercise, while overall benefits of the intervention on future orientation, career self-concept, and psychological support remained comparable across conditions. We discuss design implications for AI-augmented interventions for supporting young adults' career exploration.

Paperid: 2023, https://arxiv.org/pdf/2502.18640.pdf

Abstract:
eXplainMR is a Mixed Reality tutoring system designed for basic cardiac surface ultrasound training. Trainees wear a head-mounted display (HMD) and hold a controller, mimicking a real ultrasound probe, while treating a desk surface as the patient's body for low-cost and anywhere training. eXplainMR engages trainees with troubleshooting questions and provides automated feedback through four key mechanisms: 1) subgoals that break down tasks into single-movement steps, 2) textual explanations comparing the current incorrect view with the target view, 3) real-time segmentation and annotation of ultrasound images for direct visualization, and 4) 3D visual cues that further explain the intersection between the slicing plane and anatomies.

Paperid: 2024, https://arxiv.org/pdf/2502.18145.pdf

Abstract:
Recent interest in human-AI interactions in agent-based modeling and simulation (ABMS) has grown rapidly due to the widespread utilization of large language models (LLMs). ABMS is an intelligent approach that simulates autonomous agents' behaviors within a defined environment to research emergent phenomena. Integrating LLMs into ABMS enables natural language interaction between humans and models. Meanwhile, it introduces new challenges that rely on human interaction to address. Human involvement can assist ABMS in adapting to flexible and complex research demands. However, systematic reviews of interactions that examine how humans and AI interact in ABMS are lacking. In this paper, we investigate existing works and propose a novel taxonomy to categorize the interactions derived from them. Specifically, in our survey, human users refer to researchers who utilize ABMS tools to conduct their studies. We decompose interactions into five dimensions: the goals that users want to achieve (Why), the phases in which users are involved (When), the components of the system (What), the roles of users (Who), and the means of interactions (How). Our analysis summarizes the findings that reveal existing interaction patterns. They provide researchers who develop interactions with comprehensive guidance on how humans and AI interact. We further discuss the unexplored interactions and suggest future research directions.

Paperid: 2025, https://arxiv.org/pdf/2502.17487.pdf

Abstract:
Large language models (LLMs) increasingly serve as interactive healthcare resources, yet user acceptance remains underexplored. This study examines how ease of use, perceived usefulness, trust, and risk perception interact to shape intentions to adopt DeepSeek, an emerging LLM-based platform, for healthcare purposes. A cross-sectional survey of 556 participants from India, the United Kingdom, and the United States was conducted to measure perceptions and usage patterns. Structural equation modeling assessed both direct and indirect effects, including potential quadratic relationships. Results revealed that trust plays a pivotal mediating role: ease of use exerts a significant indirect effect on usage intentions through trust, while perceived usefulness contributes to both trust development and direct adoption. By contrast, risk perception negatively affects usage intent, emphasizing the importance of robust data governance and transparency. Notably, significant non-linear paths were observed for ease of use and risk, indicating threshold or plateau effects. The measurement model demonstrated strong reliability and validity, supported by high composite reliabilities, average variance extracted, and discriminant validity measures. These findings extend technology acceptance and health informatics research by illuminating the multifaceted nature of user adoption in sensitive domains. Stakeholders should invest in trust-building strategies, user-centric design, and risk mitigation measures to encourage sustained and safe uptake of LLMs in healthcare. Future work can employ longitudinal designs or examine culture-specific variables to further clarify how user perceptions evolve over time and across different regulatory environments. Such insights are critical for harnessing AI to enhance outcomes.

Paperid: 2026, https://arxiv.org/pdf/2502.17263.pdf

Abstract:
Previous work established that open source software (OSS) projects can benefit from the involvement of UX professionals, who offer user-centric perspectives and contributions to improve software usability. However, their participation in OSS issue discussions (places where design and implementation decisions are often made) is relatively scarce since those platforms are created with a developer-centric mindset. Analyzing a dataset sampled from five OSS projects, this study identifies UX professionals' distinct approaches to raising and following up on usability issues. Compared to other contributors, UX professionals addressed a broader range of usability issues, supported their stances well, and were more factual than emotional. They also actively engaged in discussions to provide additional insights and clarifications in comments following up on the issues they posted. Results from this study provide useful insights for increasing UX professionals' involvement in OSS communities to improve usability and end-user satisfaction.

Paperid: 2027, https://arxiv.org/pdf/2502.16740.pdf

Abstract:
This paper investigates the task delegation trends of digital comic authors to generative AIs during the creation process. We observed 16 digital comic authors using generative AIs during the drafting stage. We categorized authors' delegation levels and examined the extent of delegation, variations in AI usage, and calibration of delegation in co-creation. Our findings show that most authors delegate significant tasks to AI, with higher delegation linked to less time spent on creation and more detailed questions to AI. After co-creation, about 60% of authors adjusted their delegation levels, mostly calibrating to less delegation due to loss of agency and AI's unoriginal outputs. We suggest strategies for calibrating delegation to an appropriate level, redefine trust in human-AI co-creation, and propose novel measurements for trust in these contexts. Our study provides insights into how authors can effectively collaborate with generative AIs, balance delegation, and navigate AI's role in the creative process.

Paperid: 2028, https://arxiv.org/pdf/2502.16613.pdf

Abstract:
Intelligent tutors have proven to be effective in K-12 education, though their impact on adult learners -- especially as a supplementary resource -- remains underexplored. Understanding how adults voluntarily engage with educational technologies can inform the design of tools that support skill re-learning and enhancement. More critically, it helps determine whether tutoring systems, which are typically built for K-12 learners, can also support adult populations. This study examines the adoption, usage patterns, and effectiveness of a novel tutoring system, Apprentice Tutors, among adult learners at a state technical college. We analyze three types of data, including user demographics, grades, and tutor interactions, to assess whether voluntary tutor usage translates into measurable learning gains. Our findings reveal key temporal patterns in tutor engagement and provide evidence of learning within tutors, as determined through skill improvement in knowledge components across tutors. We also found evidence that this learning transferred outside the tutor, as observed through higher course assessment scores following tutor usage. These results suggest that intelligent tutors are a viable tool for adult learners, warranting further research into their long-term impact on this population.

Paperid: 2029, https://arxiv.org/pdf/2502.16375.pdf

Abstract:
Building on related concepts such as decentralized identifiers (DIDs), proof of personhood, and anonymous credentials, personhood credentials (PHCs) emerged as an alternative approach, enabling individuals to verify to digital service providers that they are a person without disclosing additional information. However, new technologies might introduce some friction due to users' misunderstandings and mismatched expectations. Despite their growing importance, limited research has been done on users' perceptions and preferences regarding PHCs. To address this gap, we conducted a competitive analysis and semi-structured online user interviews with 23 participants from the US and EU to provide concrete design recommendations for PHCs that incorporate user needs, adoption rules, and preferences. Our study (a) surfaces how people reason about unknown privacy and security guarantees of PHCs compared to current verification methods, and (b) presents the impact of several factors on how people would like to onboard and manage PHCs, including trusted issuers (e.g., government), ground truth data to issue PHCs (e.g., biometrics, physical ID), and issuance system (e.g., centralized vs. decentralized). In a think-aloud conceptual design session, participants recommended conceptualized designs such as periodic biometrics verification, time-bound credentials, visually interactive human-check, and government supervision of the issuance system. We propose actionable designs reflecting users' preferences.

Paperid: 2030, https://arxiv.org/pdf/2502.16097.pdf

Abstract:
Teaching literature under interdisciplinary contexts (e.g., science, art) that connect reading materials has become popular in elementary schools. However, constructing such contexts is challenging as it requires teachers to explore substantial amounts of interdisciplinary content and link it to the reading materials. In this paper, we develop LitLinker via an iterative design process involving 13 teachers to facilitate the ideation of interdisciplinary contexts for teaching literature. Powered by a large language model (LLM), LitLinker can recommend interdisciplinary topics and contextualize them with the literary elements (e.g., paragraphs, viewpoints) in the reading materials. A within-subjects study (N=16) shows that compared to an LLM chatbot, LitLinker can improve the integration depth of different subjects and reduce workload in this ideation task. Expert interviews (N=9) also demonstrate LitLinker's usefulness for supporting the ideation of interdisciplinary contexts for teaching literature. We conclude with concerns and design considerations for supporting interdisciplinary teaching with LLMs.

Paperid: 2031, https://arxiv.org/pdf/2502.15978.pdf

Abstract:
Access to health information and services among women continues to be a major challenge in many communities globally. In recent years, there has been a growing interest in the potential of chatbots to address this information and access gap. We conducted interviews and focus group discussions with underserved women in urban India to understand their receptivity towards the use of chatbots for maternal and child health, as well as barriers to their adoption. Our findings uncover gaps in digital access and literacies, and perceived conflict with various responsibilities that women are burdened with, which shape their interactions with digital technology. Our paper offers insights into the design of chatbots for community health that can meet the lived realities of women in underserved settings.

Paperid: 2032, https://arxiv.org/pdf/2502.15939.pdf

Abstract:
Access to sexual and reproductive health information remains a challenge in many communities globally, due to cultural taboos and limited availability of healthcare providers. Public health organizations are increasingly turning to Large Language Models (LLMs) to improve access to timely and personalized information. However, recent HCI scholarship indicates that significant challenges remain in incorporating context awareness and mitigating bias in LLMs. In this paper, we study the development of a culturally-appropriate LLM-based chatbot for reproductive health with underserved women in urban India. Through user interactions, focus groups, and interviews with multiple stakeholders, we examine the chatbot's response to sensitive and highly contextual queries on reproductive health. Our findings reveal strengths and limitations of the system in capturing local context, and complexities around what constitutes "culture". Finally, we discuss how local context might be better integrated, and present a framework to inform the design of culturally-sensitive chatbots for community health.

Paperid: 2033, https://arxiv.org/pdf/2502.15287.pdf

Abstract:
Software developers balance a variety of different tasks in a workweek, yet the allocation of time often differs from what they consider ideal. Identifying and addressing these deviations is crucial for organizations aiming to enhance the productivity and well-being of the developers. In this paper, we present the findings from a survey of 484 software developers at Microsoft, which aims to identify the key differences between how developers would like to allocate their time during an ideal workweek versus their actual workweek. Our analysis reveals significant deviations between a developer's ideal workweek and their actual workweek, with a clear correlation: as the gap between these two workweeks widens, we observe a decline in both productivity and satisfaction. By examining these deviations in specific activities, we assess their direct impact on the developers' satisfaction and productivity. Additionally, given the growing adoption of AI tools in software engineering, both in the industry and academia, we identify specific tasks and areas that could be strong candidates for automation. In this paper, we make three key contributions: 1) we quantify the impact of workweek deviations on developer productivity and satisfaction; 2) we identify individual tasks that disproportionately affect satisfaction and productivity; and 3) we provide actual data-driven insights to guide future AI automation efforts in software engineering, aligning them with the developers' requirements and ideal workflows for maximizing their productivity and satisfaction.

Paperid: 2034, https://arxiv.org/pdf/2502.15242.pdf

Abstract:
Current image generation paradigms prioritize actualizing user intention - "see what you intend" - but often neglect the sociopolitical dimensions of this process. However, it is increasingly evident that image generation is political, contributing to broader social struggles over visual meaning. This sociopolitical aspect was highlighted by the March 2024 Gemini controversy, where Gemini faced criticism for inappropriately injecting demographic diversity into user prompts. Although the developers sought to redress image generation's sociopolitical dimension by introducing diversity "corrections," their opaque imposition of a standard for "diversity" ultimately proved counterproductive. In this paper, we present an alternative approach: an image generation interface designed to embrace open negotiation along the sociopolitical dimensions of image creation. Grounded in the principles of agonistic pluralism (from the Greek agon, meaning struggle), our interface actively engages users with competing visual interpretations of their prompts. Through a lab study with 29 participants, we evaluate our agonistic interface on its ability to facilitate reflection - engagement with other perspectives and challenging dominant assumptions - a core principle that underpins agonistic contestation. We compare it to three existing paradigms: a standard interface, a Gemini-style interface that produces "diverse" images, and an intention-centric interface suggesting prompt refinements. Our findings demonstrate that the agonistic interface enhances reflection across multiple measures, but also that reflection depends on users perceiving the interface as both appropriate and empowering; introducing diversity without grounding it in relevant political contexts was perceived as inauthentic. Our results suggest that diversity and user intention should not be treated as opposing values to be balanced.

Paperid: 2035, https://arxiv.org/pdf/2502.15058.pdf

Abstract:
What if our clothes could capture our body motion accurately? This paper introduces Flexible Inertial Poser (FIP), a novel motion-capturing system using daily garments with two elbow-attached flex sensors and four Inertial Measurement Units (IMUs). To address the inevitable sensor displacements in loose wearables which degrade joint tracking accuracy significantly, we identify the distinct characteristics of the flex and inertial sensor displacements and develop a Displacement Latent Diffusion Model and a Physics-informed Calibrator to compensate for sensor displacements based on such observations, resulting in a substantial improvement in motion capture accuracy. We also introduce a Pose Fusion Predictor to enhance multimodal sensor fusion. Extensive experiments demonstrate that our method achieves robust performance across varying body shapes and motions, significantly outperforming SOTA IMU approaches with a 19.5% improvement in angular error, a 26.4% improvement in elbow angular error, and a 30.1% improvement in positional error. FIP opens up opportunities for ubiquitous human-computer interactions and diverse interactive applications such as Metaverse, rehabilitation, and fitness analysis.

Paperid: 2036, https://arxiv.org/pdf/2502.14052.pdf

Abstract:
In subjective decision-making, where decisions are based on contextual interpretation, Large Language Models (LLMs) can be integrated to present users with additional rationales to consider. The diversity of these rationales is mediated by the ability to consider the perspectives of different social actors. However, it remains unclear whether and how models differ in the distribution of perspectives they provide. We compare the perspectives taken by humans and different LLMs when assessing subtle sexism scenarios. We show that these perspectives can be classified within a finite set (perpetrator, victim, decision-maker), consistently present in argumentations produced by humans and LLMs, but in different distributions and combinations, demonstrating differences and similarities with human responses, and between models. We argue for the need to systematically evaluate LLMs' perspective-taking to identify the most suitable models for a given decision-making task. We discuss the implications for model evaluation.

Paperid: 2037, https://arxiv.org/pdf/2502.13320.pdf

Abstract:
This work sheds light on whether and how creative writers' needs are met by existing research and commercial writing support tools (WST). We conducted a need-finding study to gain insight into the writers' process during creative writing through a qualitative analysis of the responses from an online questionnaire and Reddit discussions on r/Writing. Using a systematic analysis of 115 tools and 67 research papers, we map out the landscape of how digital tools facilitate the writing process. Our triangulation of data reveals that research predominantly focuses on the writing activity and overlooks pre-writing activities and the importance of visualization. We distill 10 key takeaways to inform future research on WST and point to opportunities surrounding underexplored areas. Our work offers a holistic and up-to-date account of how tools have transformed the writing process, guiding the design of future tools that address writers' evolving and unmet needs.

Paperid: 2038, https://arxiv.org/pdf/2502.13062.pdf

Abstract:
AI systems increasingly support human decision-making. In many cases, despite the algorithm's superior performance, the final decision remains in human hands. For example, an AI may assist doctors in determining which diagnostic tests to run, but the doctor ultimately makes the diagnosis. This paper studies such AI-assisted decision-making settings, where the human learns through repeated interactions with the algorithm. In our framework, the algorithm -- designed to maximize decision accuracy according to its own model -- determines which features the human can consider. The human then makes a prediction based on their own less accurate model. We observe that the discrepancy between the algorithm's model and the human's model creates a fundamental tradeoff. Should the algorithm prioritize recommending more informative features, encouraging the human to recognize their importance, even if it results in less accurate predictions in the short term until learning occurs? Or is it preferable to forgo educating the human and instead select features that align more closely with their existing understanding, minimizing the immediate cost of learning? This tradeoff is shaped by the algorithm's time-discounted objective and the human's learning ability. Our results show that optimal feature selection has a surprisingly clean combinatorial characterization, reducible to a stationary sequence of feature subsets that is tractable to compute. As the algorithm becomes more "patient" or the human's learning improves, the algorithm increasingly selects more informative features, enhancing both prediction accuracy and the human's understanding. Notably, early investment in learning leads to the selection of more informative features than a later investment. We complement our analysis by showing that the impact of errors in the algorithm's knowledge is limited as it does not make the prediction directly.
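The teach-versus-align tradeoff under a time-discounted objective can be made concrete with a toy calculation. This is only an illustrative sketch, not the paper's model: the two strategies, their per-round accuracies (0.70, 0.50, 0.90), the three-round learning period, and the discount factors are all invented for the example.

```python
def discounted_value(accuracies, gamma):
    """Sum of per-round accuracies, each weighted by gamma**t."""
    return sum((gamma ** t) * acc for t, acc in enumerate(accuracies))

def strategy_accuracies(name, horizon=200, learn_rounds=3):
    """Hypothetical accuracy trajectories for the two strategies."""
    if name == "align":
        # Features match the human's current beliefs: steady, moderate accuracy.
        return [0.70] * horizon
    if name == "teach":
        # More informative features: worse at first, better once the human learns.
        return [0.50] * learn_rounds + [0.90] * (horizon - learn_rounds)
    raise ValueError(f"unknown strategy: {name}")

def best_strategy(gamma):
    """Return the strategy with the larger discounted value for this gamma."""
    values = {
        name: discounted_value(strategy_accuracies(name), gamma)
        for name in ("align", "teach")
    }
    return max(values, key=values.get)

for gamma in (0.5, 0.95):
    print(gamma, best_strategy(gamma))
```

With these made-up numbers, an impatient objective (gamma = 0.5) favors aligning with the human's existing understanding, while a patient one (gamma = 0.95) favors the short-term cost of teaching, which mirrors the qualitative claim in the abstract.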

Paperid: 2039, https://arxiv.org/pdf/2502.12747.pdf

Abstract:
Exoskeletons open up a unique interaction space that seamlessly integrates users' body movements with robotic actuation. Despite its potential, human-exoskeleton interaction remains an underexplored area in HCI, largely due to the lack of accessible prototyping tools that enable designers to easily develop exoskeleton designs and customized interactive behaviors. We present ExoKit, a do-it-yourself toolkit for rapid prototyping of low-fidelity, functional exoskeletons targeted at novice roboticists. ExoKit includes modular hardware components for sensing and actuating shoulder and elbow joints, which are easy to fabricate and (re)configure for customized functionality and wearability. To simplify the programming of interactive behaviors, we propose functional abstractions that encapsulate high-level human-exoskeleton interactions. These can be readily accessed either through ExoKit's command-line or graphical user interface, a Processing library, or microcontroller firmware, each targeted at different experience levels. Findings from implemented application cases and two usage studies demonstrate the versatility and accessibility of ExoKit for early-stage interaction design.

Paperid: 2040, https://arxiv.org/pdf/2502.12397.pdf

Abstract:
Although 85% of sub-Saharan Africa's population is covered by mobile broadband signal, only 37% use the internet, and those who do seldom use the web. The most frequently cited reason for low internet usage is the cost of data. We investigate whether AI can bridge this gap by analyzing 40,350 queries submitted to an AI chatbot by 469 teachers in Sierra Leone over 17 months. Teachers use AI for teaching assistance more frequently than web search. We compare the AI responses to the corresponding top search results for the same queries from the most popular local web search engine, google.com.sl. Only 2% of results for corresponding web searches contain in-country content. Additionally, the average web search result consumes 3,107 times more data than an AI response. Bandwidth alone costs $2.41 per thousand web search results loaded, while the total cost of AI is $0.30 per thousand responses. As a result, AI is 87% less expensive than web search. In blinded evaluations, an independent sample of teachers rate AI responses as more relevant, helpful, and correct than web search results. These findings suggest that AI-driven solutions can cost-effectively bridge information gaps in low-connectivity regions.
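The cost comparison can be checked from the two per-thousand figures in the abstract. The sketch below assumes the savings figure is computed as one minus the ratio of the AI cost to the web-search bandwidth cost; the function and constant names are illustrative, not from the paper.

```python
def percent_savings(baseline_cost: float, alternative_cost: float) -> float:
    """Percentage by which the alternative is cheaper than the baseline."""
    return (1.0 - alternative_cost / baseline_cost) * 100.0

# Figures quoted in the abstract (US dollars per thousand results/responses).
WEB_BANDWIDTH_COST_PER_1K = 2.41  # bandwidth alone, web search results loaded
AI_TOTAL_COST_PER_1K = 0.30       # total cost of AI responses

savings = percent_savings(WEB_BANDWIDTH_COST_PER_1K, AI_TOTAL_COST_PER_1K)
print(f"AI is about {savings:.1f}% less expensive")
```

Under that assumption the result is about 87.6%, in line with the 87% stated in the abstract.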

Paperid: 2041, https://arxiv.org/pdf/2502.10844.pdf

Abstract:
Recent studies have revealed that large language model (LLM)-powered conversational agents often exhibit 'sycophancy', a tendency to adapt their responses to align with user perspectives, even at the expense of factual accuracy. However, users' perceptions of LLM sycophancy and its interplay with other anthropomorphic features (e.g., friendliness) in shaping user trust remain understudied. To bridge this gap, we conducted a 2 (Sycophancy: presence vs. absence) x 2 (Friendliness: high vs. low) between-subjects experiment (N = 224). Our study uncovered, for the first time, the intricate dynamics between LLM sycophancy and friendliness: when an LLM agent already exhibits a friendly demeanor, being sycophantic reduces perceived authenticity, thereby lowering user trust; conversely, when the agent is less friendly, aligning its responses with user opinions makes it appear more genuine, leading to higher user trust. Our findings entail profound implications for AI persuasion through exploiting human psychological tendencies and highlight the imperative for responsible designs in user-LLM agent interactions.

Paperid: 2042, https://arxiv.org/pdf/2502.09843.pdf

Abstract:
Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside textual content in documents for response generation. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses highlighting its strengths and limitations.

Paperid: 2043, https://arxiv.org/pdf/2502.09687.pdf

Abstract:
Be careful what you ask for, you just might get it. This saying fits the way large language models (LLMs) are trained: instead of being rewarded for correctness, they are increasingly rewarded for pleasing the recipient. As a result, they are increasingly effective at persuading us that their answers are valuable. But what tricks do they use in this persuasion? In this study, we examine the psycholinguistic features of the responses used by twelve different language models. By grouping response content according to rational or emotional prompts and exploring social influence principles employed by LLMs, we ask whether and how we can mitigate the risks of LLM-driven mass misinformation. We position this study within the broader discourse on human-centred AI, emphasizing the need for interdisciplinary approaches to mitigate cognitive and societal risks posed by persuasive AI responses.

Paperid: 2044, https://arxiv.org/pdf/2502.09386.pdf

Abstract:
Program text is rendered using impoverished typographic styles. Beyond choice of fonts and syntax-highlighting colors, code editors and related tools utilize very few text decorations. These limited styles are, furthermore, applied in monolithic fashion, regardless of the programs and tasks at hand. We present the notion of _code style sheets_ for styling program text. Motivated by analogy to cascading style sheets (CSS) for styling HTML documents, code style sheets provide mechanisms for defining rules to select elements from an abstract syntax tree (AST) in order to style their corresponding visual representation. Technically, our selector language generalizes essential constructs from CSS to a programming-language setting with algebraic data types (such as ASTs). Practically, code style sheets allow ASTs to be styled granularly, based on semantic information -- such as the structure of abstract syntax, static type information, and corresponding run-time values -- as well as design choices on the part of authors and readers of a program. Because programs are heavily nested in structure, a key aspect of our design is a layout algorithm that renders nested, multiline text blocks more compactly than in existing box-based layout systems such as HTML. In this paper, we design and implement a code style sheets system for a subset of Haskell, using it to illustrate several code presentation and visualization tasks. These examples demonstrate that code style sheets provide a uniform framework for rendering programs in multivarious ways, which could be employed in future designs for text-based as well as structure editors.
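
To illustrate the general idea of selecting AST nodes with rules and attaching styles to them, here is a minimal sketch in Python. It is not the paper's selector language or its Haskell implementation: it uses Python's standard `ast` module as a stand-in, and the rule format (a predicate paired with a style dictionary), the style keys, and the `style_tree` helper are all hypothetical. Later rules override earlier ones, loosely mirroring the CSS cascade.

```python
import ast

# Hypothetical rule format: (selector predicate, style declarations).
# Rules are applied in order, so later matches override earlier ones.
RULES = [
    (lambda n: isinstance(n, ast.Name), {"color": "black"}),
    (lambda n: isinstance(n, ast.Constant) and isinstance(n.value, str),
     {"color": "green"}),
    (lambda n: isinstance(n, ast.Call), {"border": "thin"}),
]


def style_tree(source):
    """Return (line, column, node type, merged style) for each styled node."""
    styled = []
    for node in ast.walk(ast.parse(source)):
        style = {}
        for selector, declarations in RULES:
            if selector(node):
                style.update(declarations)
        if style:  # only nodes matched by at least one rule carry a position
            styled.append(
                (node.lineno, node.col_offset, type(node).__name__, style)
            )
    return styled


styled = style_tree('print(greet("world"))')
for entry in styled:
    print(entry)
```

A real system would also need the layout step the abstract describes; this sketch covers only selection and style merging.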

Paperid: 2045, https://arxiv.org/pdf/2502.08981.pdf

Abstract:
Authoring site-specific outdoor augmented reality (AR) experiences requires a nuanced understanding of real-world context to create immersive and relevant content. Existing ex-situ authoring tools typically rely on static 3D models to represent spatial information. However, in our formative study (n=25), we identified key limitations of this approach: models are often outdated, incomplete, or insufficient for capturing critical factors such as safety considerations, user flow, and dynamic environmental changes. These issues necessitate frequent on-site visits and additional iterations, making the authoring process more time-consuming and resource-intensive. To mitigate these challenges, we introduce CoCreatAR, an asymmetric collaborative mixed reality authoring system that integrates the flexibility of ex-situ workflows with the immediate contextual awareness of in-situ authoring. We conducted an exploratory study (n=32) comparing CoCreatAR to an asynchronous workflow baseline, finding that it enhances engagement, creativity, and confidence in the authored output while also providing preliminary insights into its impact on task load. We conclude by discussing the implications of our findings for integrating real-world context into site-specific AR authoring systems.

Paperid: 2046, https://arxiv.org/pdf/2502.08906.pdf

Abstract:
Blind people have limited opportunities to explore an environment based on their interests. While existing navigation systems could provide them with surrounding information while navigating, they have limited scalability as they require preparing prebuilt maps. Thus, to develop a map-less robot that assists blind people in exploring, we first conducted a study with ten blind participants at a shopping mall and science museum to investigate the requirements of the system, which revealed the need for three levels of detail to describe the surroundings based on users' preferences. Then, we developed WanderGuide, with functionalities that allow users to adjust the level of detail in descriptions and verbally interact with the system to ask questions about the environment or to go to points of interest. The study with five blind participants revealed that WanderGuide could provide blind people with the enjoyable experience of wandering around without a specific destination in their minds.

Paperid: 2047, https://arxiv.org/pdf/2502.08743.pdf

Abstract:
In general, Terms of Service (ToS) and other policy documents are verbose and full of legal jargon, which makes them difficult for users to understand. To improve user accessibility and transparency, the "Terms of Service; Didn't Read" (ToS;DR) project condenses intricate legal terminology into summaries and overall grades for a website's policy documents. Nevertheless, uncertainties remain about whether users can truly grasp the implications of simplified presentations. We conducted an online survey to assess the perceived understandability and severity of randomly chosen cases from the ToS;DR taxonomy. Preliminary results indicate that, although most users report understanding the cases, they find a bias towards service providers in about two-thirds of the cases. The findings of our study emphasize the necessity of prioritizing user-centric policy formulation. This study has the potential to reveal the extent of information imbalance in digital services and promote more well-informed user consent.

Paperid: 2048, https://arxiv.org/pdf/2502.08566.pdf

Abstract:
Recent advancements in Augmented Reality (AR) have demonstrated applications in architecture, design, and fabrication. Compared to conventional 2D construction drawings, AR can be used to superimpose contextual instructions, display 3D spatial information, and enable on-site engagement. Despite the potential of AR, the widespread adoption of the technology in the industry is limited by its precision. Precision is important for projects requiring strict construction tolerances, design fidelity, and fabrication feedback. For example, the manufacturing of glulam beams requires tolerances of less than 2 mm. The goal of this project is to explore the industrial application of using multiple fiducial markers for high-precision AR fabrication. While the method has been validated in lab settings with a precision of 0.97, this paper focuses on fabricating glulam beams in a factory setting with an industry manufacturer, Unalam Factory.

Paperid: 2049, https://arxiv.org/pdf/2502.07725.pdf

Abstract:
Textual content (including titles, annotations, and captions) plays a central role in helping readers understand a visualization by emphasizing, contextualizing, or summarizing the depicted data. Yet, existing visualization tools provide limited support for jointly authoring the two modalities of text and visuals such that both convey semantically-rich information and are cohesively integrated. In response, we introduce Pluto, a mixed-initiative authoring system that uses features of a chart's construction (e.g., visual encodings) as well as any textual descriptions a user may have drafted to make suggestions about the content and presentation of the two modalities. For instance, a user can begin to type out a description and interactively brush a region of interest in the chart, and Pluto will generate a relevant auto-completion of the sentence. Similarly, based on a written description, Pluto may suggest lifting a sentence out as an annotation or the visualization's title, or may suggest applying a data transformation (e.g., sort) to better align the two modalities. A preliminary user study revealed that Pluto's recommendations were particularly useful for bootstrapping the authoring process and helped identify different strategies participants adopt when jointly authoring text and charts. Based on study feedback, we discuss design implications for integrating interactive verification features between charts and text, offering control over text verbosity and tone, and enhancing the bidirectional flow in unified text and chart authoring tools.

Paperid: 2050, https://arxiv.org/pdf/2502.07039.pdf

Abstract:
High-risk artificial intelligence and machine learning classification tasks, such as healthcare diagnosis, require accurate and interpretable prediction models. However, classifier algorithms typically sacrifice individual case-accuracy for overall model accuracy, limiting analysis of class overlap areas regardless of task significance. The Adaptive Boosting meta-algorithm, which won the 2003 Gödel Prize, analytically assigns higher weights to misclassified cases to reclassify. However, it relies on weaker base classifiers that are iteratively strengthened, limiting improvements from base classifiers. Combining visual and computational approaches enables selecting stronger base classifiers before boosting. This paper proposes moving boosting methodology from focusing on only misclassified cases to all cases in the class overlap areas using Computational and Interactive Visual Learning (CIVL) with a Human-in-the-Loop. It builds classifiers in lossless visualizations integrating human domain expertise and visual insights. A Divide and Classify process splits cases into simple and complex, classifying these individually through computational analysis and data visualization with lossless visualization spaces of Parallel Coordinates or other General Line Coordinates. After finding pure and overlap class areas, simple cases in pure areas are classified, generating interpretable sub-models like decision rules in Propositional and First-order Logics. Only multidimensional cases in the overlap areas are losslessly visualized, simplifying end-user cognitive tasks to identify difficult case patterns, including engineering features to form new classifiable patterns. Demonstration shows a perfectly accurate and losslessly interpretable model of the Iris dataset, and simulated data shows generalized benefits to accuracy and interpretability of models, increasing end-user confidence in discovered models.
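
For readers unfamiliar with how Adaptive Boosting assigns higher weights to misclassified cases, the following is a minimal sketch of one round of the textbook discrete AdaBoost weight update in Python. It illustrates only the baseline reweighting step mentioned in the abstract, not the CIVL method the paper proposes; the function name and the toy labels are assumptions made for illustration.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One round of the standard discrete AdaBoost weight update.

    labels and predictions take values in {-1, +1}; weights sum to 1.
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    # Weighted error: total weight of the misclassified cases.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    if not 0.0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified cases (y * h == -1) are scaled up, correct ones down.
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, labels, predictions)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


# Toy example (made-up data): five cases, the second one is misclassified.
labels = [+1, +1, -1, -1, +1]
predictions = [+1, -1, -1, -1, +1]
weights = [0.2] * 5
alpha, new_weights = adaboost_reweight(weights, labels, predictions)
print(alpha, new_weights)
```

After the update, the single misclassified case carries more weight than any correctly classified case, which is what steers the next base classifier toward it.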

Paperid: 2051, https://arxiv.org/pdf/2502.05296.pdf

Abstract:
Mobile messaging apps offer an increasing range of emotional expressions, such as emojis, to help users manually augment their texting experiences. Accessibility of such augmentations is limited in voice messaging. With the term "speejis" we refer to accessible emojis and other visual speech emotion cues that are created automatically from speech input alone. The paper presents an implementation of speejis and reports on a user study (N=12) comparing the UX of voice messaging with and without speejis. Results show significant differences in measures such as attractiveness and stimulation and a clear preference of all participants for messaging with speejis. We highlight the benefits of using paralinguistic speech processing and continuous emotion models to enable finer-grained augmentations of emotion changes and transitions within a single message in addition to augmentations of the overall tone of the message.

Paperid: 2052, https://arxiv.org/pdf/2502.04940.pdf

Abstract:
Standing at the intersection of science and art, artistic data visualization has gained popularity in recent years and emerged as a significant domain. Although more than a decade has passed since the field's conceptualization, a noticeable gap remains in research concerning the design features of artistic data visualizations, the aesthetic goals they pursue, and their potential to inspire our community. To address these gaps, we analyzed 220 data artworks to understand their design paradigms and intents, and constructed a design taxonomy to characterize their design techniques (e.g., sensation, interaction, narrative, physicality). We also conducted in-depth interviews with twelve data artists to explore their practical perspectives, such as their understanding of artistic data visualization and the challenges they encounter. In brief, we found that artistic data visualization is deeply rooted in art discourse, with its own distinctive characteristics in both inner pursuits and outer presentations. Based on our research, we outline seven prospective paths for future work.

Paperid: 2053, https://arxiv.org/pdf/2502.04542.pdf

Abstract:
Visual storytelling combines visuals and narratives to communicate important insights. While web-based visual storytelling is well-established, leveraging the next generation of digital technologies for visual storytelling, specifically immersive technologies, remains underexplored. We investigated the impact of the story viewpoint (from the audience's perspective) and navigation (when progressing through the story) on spatial immersion and understanding. First, we collected web-based 3D stories and elicited design considerations from three VR developers. We then adapted four selected web-based stories to an immersive format. Finally, we conducted a user study (N=24) to examine egocentric and exocentric viewpoints, active and passive navigation, and the combinations they form. Our results indicated significantly higher preferences for egocentric+active (higher agency and engagement) and exocentric+passive (higher focus on content). We also found a marginal significance of viewpoints on story understanding and a strong significance of navigation on spatial immersion.

Paperid: 2054, https://arxiv.org/pdf/2502.03862.pdf

Abstract:
Nudging participants with text-based reflective nudges enhances deliberation quality on online deliberation platforms. The effectiveness of multimodal reflective nudges, however, remains largely unexplored. Given the multi-sensory nature of human perception, incorporating diverse modalities into self-reflection mechanisms has the potential to better support various reflective styles. This paper explores how presenting reflective nudges of different types (direct: persona and indirect: storytelling) in different modalities (text, image, video and audio) affects deliberation quality. We conducted two user studies with 20 and 200 participants respectively. The first study identifies the preferred modality for each type of reflective nudge, revealing that text is most preferred for persona and video is most preferred for storytelling. The second study assesses the impact of these modalities on deliberation quality. Our findings reveal distinct effects associated with each modality, providing valuable insights for developing more inclusive and effective online deliberation platforms.

Paperid: 2055, https://arxiv.org/pdf/2502.03804.pdf

Abstract:
Replying to formal emails is time-consuming and cognitively demanding, as it requires crafting polite phrasing and providing an adequate response to the sender's demands. Although systems with Large Language Models (LLMs) were designed to simplify the email replying process, users still need to provide detailed prompts to obtain the expected output. Therefore, we proposed and evaluated an LLM-powered question-and-answer (QA)-based approach for users to reply to emails by answering a set of simple and short questions generated from the incoming email. We developed a prototype system, ResQ, and conducted controlled and field experiments with 12 and 8 participants. Our results demonstrated that the QA-based approach improves the efficiency of replying to emails and reduces workload while maintaining email quality, compared to a conventional prompt-based approach that requires users to craft appropriate prompts to obtain email drafts. We discuss how the QA-based approach influences the email reply process and interpersonal relationship dynamics, as well as the opportunities and challenges associated with using a QA-based approach in AI-mediated communication.

Paperid: 2056, https://arxiv.org/pdf/2502.03579.pdf

Abstract:
The integration of Large Language Models (LLMs) into healthcare settings has gained significant attention, particularly for question-answering tasks. Given the high-stakes nature of healthcare, it is essential to ensure that LLM-generated content is accurate and reliable to prevent adverse outcomes. However, the development of robust evaluation metrics and methodologies remains a matter of much debate. We examine the performance of publicly available LLM-based chatbots for menopause-related queries, using a mixed-methods approach to evaluate safety, consensus, objectivity, reproducibility, and explainability. Our findings highlight the promise and limitations of traditional evaluation metrics for sensitive health topics. We propose the need for customized and ethically grounded evaluation frameworks to assess LLMs to advance safe and effective use in healthcare.

Paperid: 2057, https://arxiv.org/pdf/2502.02772.pdf

Abstract:
A method for cross-modality embedding of force profile and words is presented for synergistic coordination of verbal and haptic communication. When two people carry a large, heavy object together, they coordinate through verbal communication about the intended movements and physical forces applied to the object. This natural integration of verbal and physical cues enables effective coordination. Similarly, human-robot interaction could achieve this level of coordination by integrating verbal and haptic communication modalities. This paper presents a framework for embedding words and force profiles in a unified manner, so that the two communication modalities can be integrated and coordinated in a way that is effective and synergistic. Here, it will be shown that, although language and physical force profiles are deemed completely different, the two can be embedded in a unified latent space and proximity between the two can be quantified. In this latent space, a force profile and words can a) supplement each other, b) integrate the individual effects, and c) substitute in an exchangeable manner. First, the need for cross-modality embedding is addressed, and the basic architecture and key building block technologies are presented. Methods for data collection and implementation challenges will be addressed, followed by experimental results and discussions.

Paperid: 2058, https://arxiv.org/pdf/2502.02610.pdf

Abstract:
Music is a deeply personal experience and our aim is to enhance this with a fully-automated pipeline for personalized music video generation. Our work allows listeners to not just be consumers but co-creators in the music video generation process by creating personalized, consistent and context-driven visuals based on lyrics, rhythm and emotion in the music. The pipeline combines multimodal translation and generation techniques and utilizes low-rank adaptation on listeners' images to create immersive music videos that reflect both the music and the individual. To ensure the ethical use of users' identity, we also introduce CHARCHA (patent pending), a facial identity verification protocol that protects people against unauthorized use of their face while at the same time collecting authorized images from users for personalizing their videos. This paper thus provides a secure and innovative framework for creating deeply personalized music videos.

Paperid: 2059, https://arxiv.org/pdf/2502.01306.pdf

Abstract:
Conversational assistants process personal data and must comply with data protection regulations that require providers to be transparent with users about how their data is handled. Transparency, in a legal sense, demands preciseness, comprehensibility and accessibility, yet existing solutions fail to meet these requirements. To address this, we introduce a new human-expert-generated dataset for Privacy Question-Answering (Q&A), developed through an iterative process involving legal professionals and conversational designers. We evaluate this dataset through linguistic analysis and a user study, comparing it to privacy policy excerpts and state-of-the-art responses from Amazon Alexa. Our findings show that the proposed answers improve usability and clarity compared to existing solutions while achieving legal preciseness, thereby enhancing the accessibility of data processing information for Conversational AI and Natural Language Processing applications.

Paperid: 2060, https://arxiv.org/pdf/2502.00268.pdf

Abstract:
Vibrotactile signals offer new possibilities for conveying sensations and emotions in various applications. Yet, designing vibrotactile tactile icons (i.e., Tactons) to evoke specific feelings often requires a trial-and-error process and user studies. To support haptic design, we propose a framework for predicting sensory and emotional ratings from vibration signals. We created 154 Tactons and conducted a study to collect acceleration data from smartphones and roughness, valence, and arousal user ratings (n=36). We converted the Tacton signals into two-channel spectrograms reflecting the spectral sensitivities of mechanoreceptors, then input them into VibNet, our dual-stream neural network. The first stream captures sequential features using recurrent networks, while the second captures temporal-spectral features using 2D convolutional networks. VibNet outperformed baseline models, with 82% of its predictions falling within the standard deviations of ground truth user ratings for two new Tacton sets. We discuss the efficacy of our mechanoreceptive processing and dual-stream neural network and present future research directions.
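
The evaluation criterion mentioned above, the share of predictions that fall within the standard deviation of the ground-truth user ratings, can be sketched as follows in Python. This is one plausible reading of the metric rather than the authors' evaluation code: it assumes "within" means within one sample standard deviation of the mean rating per Tacton, and the function name and the ratings are made up for illustration.

```python
from statistics import mean, stdev


def within_std_rate(predictions, ratings_per_item):
    """Fraction of predictions within one sample SD of the mean user rating.

    predictions: one predicted rating per item.
    ratings_per_item: list of per-item lists of individual user ratings.
    """
    hits = 0
    for pred, ratings in zip(predictions, ratings_per_item):
        mu, sigma = mean(ratings), stdev(ratings)
        if abs(pred - mu) <= sigma:
            hits += 1
    return hits / len(predictions)


# Toy, made-up ratings for three items (not data from the paper).
ratings = [[2, 3, 4], [5, 5, 6], [1, 2, 1]]
preds = [3.5, 7.0, 1.2]
rate = within_std_rate(preds, ratings)
print(f"{rate:.0%} of predictions fall within one SD of the mean rating")
```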

Paperid: 2061, https://arxiv.org/pdf/2501.18778.pdf

Abstract:
Divergent thinking activities, like research and ideation, are key drivers of innovation in UI/UX design. Existing research has explored AI's role in automating design tasks, but leaves a critical gap in understanding how AI specifically influences divergent thinking. To address this, we conducted interviews with 19 professional UI/UX designers, examining their use and perception of AI in these creative activities. We found that in this context, participants valued AI tools that offer greater control over ideation, facilitate collaboration, enhance efficiency to liberate creativity, and align with their visual habits. Our results indicated four key roles AI plays in supporting divergent thinking: aiding research, kick-starting creativity, generating design alternatives, and facilitating prototype exploration. Through this study, we provide insights into the evolving role of AI in the less-investigated area of divergent thinking in UI/UX design, offering recommendations for future AI tools that better support design innovation.

Paperid: 2062, https://arxiv.org/pdf/2501.18748.pdf

Abstract:
UI/UX designers often work under constraints like brand identity, design norms, and industry guidelines. How these constraints impact designers' ideation and exploration processes should be addressed in creativity-support tools for design. Through an exploratory interview study, we identified three designer personas with varying views on having constraints in the ideation process, which guided the creation of UIDEC, a GenAI-powered tool for supporting creativity under constraints. UIDEC allows designers to specify project details, such as purpose, target audience, industry, and design styles, based on which it generates diverse design examples that adhere to these constraints, with minimal need to write prompts. In a user evaluation involving designers representing the identified personas, participants found UIDEC compatible with their existing ideation process and useful for creative inspiration, especially when starting new projects. Our work provides design implications for AI-powered tools that integrate constraints during UI/UX design ideation to support creativity.

Paperid: 2063, https://arxiv.org/pdf/2501.18108.pdf

Abstract:
Despite the growing potential of older adult care technologies, the adoption of these technologies remains challenging. In this work, we conducted a focus-group session with family caregivers to scope the design of an older adult care technology. We then developed a high-fidelity prototype and conducted a qualitative study of it with professional caregivers and older adults to understand their perspectives on the system functionalities. This system monitors abnormal activity patterns of older adults using wireless motion sensors and machine learning models, and supports interactive dialogue responses that explain these patterns to caregivers and allow older adults to proactively share their status with caregivers for an adequate intervention. Both older adults and professional caregivers appreciated that our system can provide a faster, personalized service while letting older adults proactively control what information is shared through interactive dialogue responses. We further discuss other considerations to realize older adult technology in practice.

Paperid: 2064, https://arxiv.org/pdf/2501.17819.pdf

Abstract:
As children increasingly consume media on devices, parents look for ways this usage can support learning and growth, especially in domains like social-emotional learning. We introduce eaSEL, a system that (a) integrates social-emotional learning (SEL) curricula into children's video consumption by generating reflection activities and (b) facilitates parent-child discussions around digital media without requiring co-consumption of videos. We present a technical evaluation of our system's ability to detect social-emotional moments within a transcript and to generate high-quality SEL-based activities for both children and parents. Through a user study with N=20 parent-child dyads, we find that after completing an eaSEL activity, children reflect more on the emotional content of videos. Furthermore, parents find that the tool promotes meaningful active engagement and could scaffold deeper conversations around content. Our work paves directions in how AI can support children's social-emotional reflection of media and family connections in the digital age.

Paperid: 2065, https://arxiv.org/pdf/2501.16885.pdf

Abstract:
Digital-safety research with at-risk users is particularly urgent. At-risk users are more likely to be digitally attacked or targeted by surveillance and could be disproportionately harmed by attacks that facilitate physical assaults. One group of such at-risk users is activists and politically active individuals. For them, as for other at-risk users, the rise of smart environments harbors new risks. Since digitization and datafication are no longer limited to a series of personal devices that can be switched on and off, but increasingly and continuously surround users, granular geolocation poses new safety challenges. Drawing on eight exploratory qualitative interviews from an ongoing research project, this contribution highlights what activists with powerful adversaries think about ever more data traces, including location data, and how they intend to deal with emerging risks. Responses of activists include attempts to control one's immediate technological surroundings and to more carefully manage device-related location data. For some activists, threat modeling has also shaped provider choices based on geopolitical considerations. Since many activists do not have enough digital-safety knowledge for effective protection, feelings of insecurity and paranoia are widespread. Channeling the concerns and fears of our interlocutors, we call for more research on how activists can protect themselves against ever more fine-grained location data tracking.

Paperid: 2066, https://arxiv.org/pdf/2501.16864.pdf

Abstract:
In recent years, the pervasive use of sensors in smart devices (e.g., phones, watches, and medical devices) has dramatically increased the availability of personal data. However, existing research on data collection primarily focuses on the objective view of reality, as provided, for instance, by sensors, often neglecting the integration of subjective human input, as provided, for instance, by user answers to questionnaires. This substantially limits the exploitability of the collected data. In this paper we present a methodology and a platform specifically designed for the collection of a combination of large-scale sensor data and qualitative human feedback. The methodology has been designed to be deployed on top of, and to enrich the functionalities of, an existing data collection app, called iLog, which has been used in large-scale, worldwide data collection experiments. The main goal is to put the key actors involved in an experiment, i.e., the researcher in charge, the participant, and iLog, in better control of the experiment itself, thus enabling much improved quality and richness of the data collected. The novel functionalities of the resulting platform are: (i) a time-wise representation of the situational context within which the data collection is performed, (ii) an explicit representation of the temporal context within which the data collection is performed, (iii) a calendar-based dashboard for the real-time monitoring of the data collection context(s), and, finally, (iv) a mechanism for the run-time revision of the data collection plan. The practicality and utility of the proposed functionalities are demonstrated by showing how they apply to a case study involving 350 university students.

Paperid: 2067, https://arxiv.org/pdf/2501.16518.pdf

Abstract:
Play is pivotal in fostering the emotional, social, and cultural dimensions of urban spaces. While generative AI (GAI) potentially supports playful urban interaction, a balanced and critical approach to the design opportunities and challenges is needed. This work develops iWonder, an image-to-image GAI tool engaging fourteen designers in urban explorations to identify GAI's playful features and create design ideas. Fourteen citizens then evaluated these ideas, providing expectations and critical concerns from a bottom-up perspective. Our findings reveal the dynamic interplay between users, GAI, and urban contexts, highlighting GAI's potential to facilitate playful urban experiences through generative agency, meaningful unpredictability, social performativity, and the associated offensive qualities. We propose design considerations to address citizen concerns and the "tourist metaphor" to deepen our understanding of GAI's impact, offering insights to enhance cities' socio-cultural fabric. Overall, this research contributes to the effort to harness GAI's capabilities for urban enrichment.

Paperid: 2068, https://arxiv.org/pdf/2501.16515.pdf

Abstract:
Assisted Reality (aR) is a subfield of Augmented Reality (AR) that overlays information onto a user's immediate view via optical see-through head-mounted displays (OST-HMDs). This technology has proven to be effective and energy-efficient in supporting user-information interaction for everyday wearable intelligent systems. The aR viewing experience, however, is affected by varying real-world backgrounds, lighting, and user movements, which makes designing for aR challenging. Designers have to test their designs in-situ across multiple real-world settings, which can be time-consuming and labor-intensive. We propose SimulataR, a cost-effective desktop-based approach for rapid aR prototyping using first-person-view context videos blended with design prototypes to simulate an aR experience. A field study involving 12 AR users comparing SimulataR to real OST-HMDs found that SimulataR can approximate the aR experience, particularly indoors and in low-to-moderately lit outdoor environments. Case studies with two designers who used SimulataR in their design process demonstrate the potential of design-blended videos for rapid aR prototyping.

Paperid: 2069, https://arxiv.org/pdf/2501.16505.pdf

Abstract:
Mixed Reality (MR) devices are being increasingly adopted across a wide range of real-world applications, ranging from education and healthcare to remote work and entertainment. However, the unique immersive features of MR devices, such as 3D spatial interactions and the encapsulation of virtual objects by invisible elements, introduce new vulnerabilities leading to interaction obstruction and misdirection. We implemented latency, click redirection, object occlusion, and spatial occlusion attacks within a remote collaborative MR platform using the Microsoft HoloLens 2 and evaluated user behavior and mitigations through a user study. We compared responses to MR-specific attacks, which exploit the unique characteristics of remote collaborative immersive environments, and traditional security attacks implemented in MR. Our findings indicate that users generally exhibit lower recognition rates for immersive attacks (e.g., spatial occlusion) compared to attacks inspired by traditional ones (e.g., click redirection). Our results demonstrate a clear gap in user awareness and responses when collaborating remotely in MR environments. Our findings emphasize the importance of training users to recognize potential threats and enhanced security measures to maintain trust in remote collaborative MR systems.

Paperid: 2070, https://arxiv.org/pdf/2501.16061.pdf

Abstract:
Design educators are finding ways to support students in skillfully using GenAI tools in their practices while encouraging the critical scrutiny of the ethical and social issues around these technologies. However, the issue of environmental sustainability remains unaddressed. There is a lack of both resources to grasp the environmental costs of genAI in education and a lack of shared practices for engaging with the issue. This paper critically reflects on the energy costs of using genAI in design education, using a workshop held in 2023 with 49 students as a motivating example. Through this reflection, we develop a set of five alternative stances, with related actions, that support the conscious use of genAI in design education. The work contributes to the field of design and HCI by bringing together ways for educators to reflect on their practices, informing the future development of educational programs around genAI.

Paperid: 2071, https://arxiv.org/pdf/2501.14792.pdf

Abstract:
A common challenge in home-based rehabilitation is muscle compensation induced by pain or fatigue, where patients with weakened primary muscles recruit secondary muscle groups to assist their movement, causing issues such as delayed rehabilitation progress or risk of further injury. In a home-based setting, the subtle compensatory actions may not be perceived since physiotherapists cannot directly observe patients. To address this problem, this study develops a novel wearable strain sensor-based shoulder patch to detect fatigue-induced muscle compensation during bicep curl exercises. Built on an observation that the amplitude of a strain sensor's resistance is correlated to the motion of a joint that the sensor is attached to, we develop an algorithm that can robustly detect the state when significant changes appear in the shoulder joint motion, which indicates fatigue-induced muscle compensation in bicep curls. The developed shoulder patch is tested on 13 subjects who perform bicep curl exercises with a 5 kg dumbbell until reaching fatigue. During the experiment, the performance of the shoulder patch is also benchmarked with optical tracking sensors and surface electromyography (sEMG) sensors. Results reveal that the proposed wearable sensor and detection methods effectively monitor fatigue-induced muscle compensation during bicep curl exercises in both Real-Time and Post Hoc modes. This development marks a significant step toward enhancing the effectiveness of home-based rehabilitation by providing physiotherapists with a tool to monitor and adjust treatment plans remotely.
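
The abstract above describes the detection idea only at a high level: the amplitude of the strain sensor's resistance tracks joint motion, and a significant change in that amplitude signals compensation. As a rough illustration of that idea (not the authors' algorithm), the sketch below splits a resistance trace into fixed-size windows, takes the peak-to-peak amplitude of each, and flags the first window whose amplitude exceeds a multiple of an early baseline. The function names, the fixed windowing, and the `ratio` threshold are all assumptions made for illustration.

```python
def peak_to_peak(window):
    """Peak-to-peak amplitude of one window of resistance samples."""
    return max(window) - min(window)


def detect_compensation(resistance, window_size, baseline_windows=3, ratio=1.5):
    """Return the index of the first window whose amplitude exceeds
    ratio * baseline, or None if no such window exists.

    Hypothetical sketch: the baseline is the mean amplitude of the first
    `baseline_windows` windows, assumed to be recorded before fatigue.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Non-overlapping windows; a trailing partial window is ignored.
    windows = [
        resistance[i:i + window_size]
        for i in range(0, len(resistance) - window_size + 1, window_size)
    ]
    amplitudes = [peak_to_peak(w) for w in windows]
    if len(amplitudes) <= baseline_windows:
        return None  # not enough data beyond the baseline
    baseline = sum(amplitudes[:baseline_windows]) / baseline_windows
    for idx in range(baseline_windows, len(amplitudes)):
        if amplitudes[idx] > ratio * baseline:
            return idx
    return None


# Three steady windows followed by one with a much larger swing.
trace = [0, 1, 0, 1] * 3 + [0, 3, 0, 3]
print(detect_compensation(trace, window_size=4))  # 3
```

In practice the windows would more plausibly be aligned to individual repetitions than to a fixed sample count, and the threshold would need to be calibrated per subject.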

Paperid: 2072, https://arxiv.org/pdf/2501.14779.pdf

Abstract:
This study investigated the students' perceptions of using Generative Artificial Intelligence (GenAI) in upper-secondary mathematics education. Data was collected from Finnish high school students to represent how key constructs of the Technology Acceptance Model (Perceived Usefulness, Perceived Ease of Use, Perceived Enjoyment, and Intention to Use) influence the adoption of AI tools. First, a structural equation model for a comparative study with a prior study was constructed and analyzed. Then, an extended model with the additional construct of Compatibility, which represents the alignment of AI tools with students' educational experiences and needs, was proposed and analyzed. The results demonstrated a strong influence of perceived usefulness on the intention to use GenAI, emphasizing the statistically significant role of perceived enjoyment in determining perceived usefulness and ease of use. The inclusion of compatibility improved the model's explanatory power, particularly in predicting perceived usefulness. This study contributes to a deeper understanding of how AI tools can be integrated into mathematics education and highlights key differences between the Finnish educational context and previous studies based on structural equation modeling.

Paperid: 2073, https://arxiv.org/pdf/2501.13443.pdf

Abstract:
The recent surge in artificial intelligence, particularly in multimodal processing technology, has advanced human-computer interaction, by altering how intelligent systems perceive, understand, and respond to contextual information (i.e., context awareness). Despite such advancements, there is a significant gap in comprehensive reviews examining these advances, especially from a multimodal data perspective, which is crucial for refining system design. This paper addresses a key aspect of this gap by conducting a systematic survey of data modality-driven Vision-based Multimodal Interfaces (VMIs). VMIs are essential for integrating multimodal data, enabling more precise interpretation of user intentions and complex interactions across physical and digital environments. Unlike previous task- or scenario-driven surveys, this study highlights the critical role of the visual modality in processing contextual information and facilitating multimodal interaction. Adopting a design framework moving from the whole to the details and back, it classifies VMIs across dimensions, providing insights for developing effective, context-aware systems.

Paperid: 2074, https://arxiv.org/pdf/2501.13421.pdf

Abstract:
In machine learning (ML) applications, unfairness is triggered due to bias in the data, the data curation process, erroneous assumptions, and implicit bias rendered during the development process. It is also well-accepted by researchers that fairness in ML application development is highly subjective, with a lack of clarity of what it means from an ML development and implementation perspective. Thus, in this research, we investigate and formalize the notion of the perceived fairness of ML development from a sociotechnical lens. Our goal in this research is to understand the characteristics of perceived fairness in ML applications. We address this research goal using a three-pronged strategy: 1) conducting virtual focus groups with ML developers, 2) reviewing existing literature on fairness in ML, and 3) incorporating aspects of justice theory relating to procedural and distributive justice. Based on our theoretical exposition, we propose operational attributes of perceived fairness to be transparency, accountability, and representativeness. These are described in terms of multiple concepts that comprise each dimension of perceived fairness. We use this operationalization to empirically validate the notion of perceived fairness of ML applications from both the ML practitioners' and users' perspectives. The multidimensional framework for perceived fairness offers a comprehensive understanding of perceived fairness, which can guide the creation of fair ML systems with positive implications for society and businesses.

Paperid: 2075, https://arxiv.org/pdf/2501.12214.pdf

Abstract:
Explanations constitute an important aspect of successful human-robot interactions and can enhance robot understanding. To improve the understanding of the robot, we have developed four levels of explanation (LOE) based on two questions: what needs to be explained, and why the robot has made a particular decision. The understandable robot requires a communicative action when there is disparity between the human's mental model of the robot and the robot's state of mind. This communicative action was generated by utilizing a conversational AI platform to generate explanations. An adaptive dialog was implemented for transition from one LOE to another. Here, we demonstrate the adaptive dialog in a collaborative task with errors and provide results of a feasibility study with users.

Paperid: 2076, https://arxiv.org/pdf/2501.11935.pdf

Abstract:
LLMs such as ChatGPT have been widely adopted by students in higher education as tools for learning programming and related concepts. However, it remains unclear how effective students are and what strategies students use while learning with LLMs. Since the majority of students' experiences in online self-learning have come through using search engines such as Google, evaluating AI tools in this context can help us address these gaps. In this mixed methods research, we conducted an exploratory within-subjects study to understand how CS2 students learn programming concepts using both LLMs as well as traditional online methods such as educational websites and videos to examine how students approach learning within and across both scenarios. We discovered that students found it easier to learn a more difficult concept using traditional methods than using ChatGPT. We also found that students ask fewer follow-ups and use more keyword-based queries for search engines while their prompts to LLMs tend to explicitly ask for information.

Paperid: 2077, https://arxiv.org/pdf/2501.11887.pdf

Abstract:
Robots, particularly in service and companionship roles, must develop positive relationships with people they interact with regularly to be successful. These positive human-robot relationships can be characterized as establishing "rapport," which indicates mutual understanding and interpersonal connection that form the groundwork for successful long-term human-robot interaction. However, the human-robot interaction research literature lacks scale instruments to assess human-robot rapport in a variety of situations. In this work, we developed the 18-item Connection-Coordination Rapport (CCR) Scale to measure human-robot rapport. We first ran Study 1 (N = 288) where online participants rated videos of human-robot interactions using a set of candidate items. Our Study 1 results showed the discovery of two factors in our scale, which we named "Connection" and "Coordination." We then evaluated this scale by running Study 2 (N = 201) where online participants rated a new set of human-robot interaction videos with our scale and an existing rapport scale from virtual agents research for comparison. We also validated our scale by replicating a prior in-person human-robot interaction study, Study 3 (N = 44), and found that rapport is rated significantly greater when participants interacted with a responsive robot (responsive condition) as opposed to an unresponsive robot (unresponsive condition). Results from these studies demonstrate high reliability and validity for the CCR scale, which can be used to measure rapport in both first-person and third-person perspectives. We encourage the adoption of this scale in future studies to measure rapport in a variety of human-robot interactions.

Paperid: 2078, https://arxiv.org/pdf/2501.11754.pdf

Abstract:
Virtual displays provided through head-worn displays (HWDs) offer users large screen space for productivity, but managing this space effectively presents challenges. This paper explores how to enhance window-switching strategies for virtual displays by leveraging eye tracking provided by HWDs and underutilized spaces around the main display area. We investigate the efficiency and usability of different cursor behaviors and selection modes in a Spatial Bar interface for window-switching tasks in augmented reality environments. Results show gaze coupled with teleport led to the quickest window-switching times, particularly in tasks where the original cursor position or the target window was far from the Spatial Bar.

Paperid: 2079, https://arxiv.org/pdf/2501.11457.pdf

Abstract:
Since the emergence of generative AI, creative workers have spoken up about the career-based harms they have experienced arising from this new technology. A common theme in these accounts of harm is that generative AI models are trained on workers' creative output without their consent and without giving credit or compensation to the original creators. This paper reports findings from 20 interviews with creative workers in three domains: visual art and design, writing, and programming. We investigate the gaps between current AI governance strategies, what creative workers want out of generative AI governance, and the nuanced role of creative workers' consent, compensation and credit for training AI models on their work. Finally, we make recommendations for how generative AI can be governed and how operators of generative AI systems might more ethically train models on creative output in the future.

Paperid: 2080, https://arxiv.org/pdf/2501.10696.pdf

Abstract:
Spatial navigation is a complex cognitive function involving sensory inputs, such as visual, auditory, and proprioceptive information, to understand and move within space. This ability allows humans to create mental maps, navigate through environments, and process directional cues, crucial for exploring new places and finding one's way in unfamiliar surroundings. This study takes an algorithmic approach to extract indices relevant to human spatial navigation using eye movement data. Leveraging electrooculography signals, we analyzed statistical features and applied feature engineering techniques to study eye movements during navigation tasks. The proposed work combines signal processing and machine learning approaches to develop indices for navigation and orientation, spatial anxiety, landmark recognition, path survey, and path route. The analysis yielded five subscore indices with notable accuracy. Among these, the navigation and orientation subscore achieved an R² score of 0.72, while the landmark recognition subscore attained an R² score of 0.50. Additionally, statistical features highly correlated with eye movement metrics, including blinks, saccades, and fixations, were identified. The findings of this study can lead to more cognitive assessments and enable early detection of spatial navigation impairments, particularly among individuals at risk of cognitive decline.
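
For readers unfamiliar with the metric reported above, the coefficient of determination compares a model's squared error with that of always predicting the mean, so a score of 0.72 means the model removes 72% of the variance that a mean-only predictor leaves unexplained. The sketch below shows the standard definition; the function name and the toy numbers are illustrative and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # variance around the mean
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Toy subscores: one prediction is off by 1, the rest are exact.
print(r2_score([1, 2, 3], [1, 2, 4]))  # 0.5
```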

Paperid: 2081, https://arxiv.org/pdf/2501.09442.pdf

Abstract:
This study investigates how partner avatar design affects learning and memory when an avatar serves as a lecturer. Based on earlier research on the environmental context dependency of memory, we hypothesize that the use of diverse partner avatars results in a slower learning rate but better memory retention than that of a constant partner avatar. Accordingly, participants were tasked with memorizing Tagalog-Japanese word pairs. On the first day of the experiment, they repeatedly learned the pairs over six sessions from a partner avatar in an immersive virtual environment. One week later, on the second day of the experiment, they underwent a recall test in a real environment. We employed a between-participants design to compare the following conditions: the varied avatar condition, in which each repetition used a different avatar, and the constant avatar condition, in which the same avatar was used throughout the experiment. Results showed that, compared to the constant avatar condition, the varied avatar condition resulted in significantly lower recall performance in the repeated learning trials conducted on the first day. However, the avatar conditions showed no significant differences in the final recall test on the second day. We discuss these effects in relation to the social presence of the partner avatar. This study opens up a novel approach to optimizing the effectiveness of instructor avatars in immersive virtual environments.

Paperid: 2082, https://arxiv.org/pdf/2501.09165.pdf

Abstract:
Classroom debates are a unique form of collaborative learning characterized by fast-paced, high-intensity interactions that foster critical thinking and teamwork. Despite the recognized importance of debates, the role of AI tools, particularly LLM-based systems, in supporting this dynamic learning environment has been under-explored in HCI. This study addresses this opportunity by investigating the integration of LLM-based AI into real-time classroom debates. Over four weeks, 22 students in a Design History course participated in three rounds of debates with support from ChatGPT. The findings reveal how learners prompted the AI to offer insights, collaboratively processed its outputs, and divided labor in team-AI interactions. The study also surfaces key advantages of AI usage, reducing social anxiety, breaking communication barriers, and providing scaffolding for novices, alongside risks, such as information overload and cognitive dependency, which could limit learners' autonomy. We thereby discuss a set of nuanced implications for future HCI exploration.

Paperid: 2083, https://arxiv.org/pdf/2501.07661.pdf

Abstract:
In the densifying data ecosystem of today's cities, data intermediaries are crucial stakeholders in facilitating data access and use. Community advocates live in these sites of social injustices and opportunities for change. Highly experienced in working with data to enact change, they offer distinctive insights on data practices and tools. This paper examines the unique perspectives that community advocates offer on data intermediaries. Based on interviews with 17 advocates working with 23 grassroots and nonprofit organizations, we propose the quality of "near" and "far" to be seriously considered in data intermediaries' works and articulate advocates' vision of connecting "near data" and "far data." To pursue this vision, we identified three pathways for data intermediaries: align data exploration with ways of storytelling, communicate context and uncertainties, and decenter artifacts for relationship building. These pathways help data intermediaries to put data feminism into practice, surface design opportunities and tensions, and raise key questions for supporting the pursuit of the Right to the City.

Paperid: 2084, https://arxiv.org/pdf/2501.07394.pdf

Abstract:
Resting-state brain networks (RSNs) reflect the functional connectivity patterns between brain modules, providing essential foundations for decoding intrinsic neural information within the brain. They serve as one of the primary tools for describing the spatial dynamics of the brain using various neuroimaging techniques, such as electroencephalography (EEG) and magnetoencephalography (MEG). However, the distribution rules or potential modes of functional connectivity weights in the resting state remain unclear. In this context, we start from simulation, using a forward-solving model to generate scalp EEG with four channel densities (19, 32, 64, 128). Subsequently, we construct scalp brain networks using five coupling measures, aiming to explore whether different channel densities or coupling measures affect the distribution pattern of functional connectivity weights. Next, we quantify the distribution pattern by calculating the skewness, kurtosis, and Shannon entropy of the functional connectivity network weights. Finally, the results of the simulation were validated in a normative database. We observed that: 1) the functional connection weights exhibit a right-skewed distribution, and are not influenced by channel density or coupling measures; 2) the functional connection weights exhibit a relatively uniform distribution, with the potential for volume conduction to affect the degree of uniformity in the distribution; 3) networks constructed using coupling measures influenced by volume conduction exhibit significant correlations between the average connection weight and measures of skewness, kurtosis, and Shannon entropy. This study contributes to a deeper understanding of RSNs, providing valuable insights for research in the field of neuroscience, and holds promise for being associated with brain cognition and disease diagnosis.
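
The three summary statistics named in the abstract above can be computed directly from a flat list of connection weights. The sketch below uses the population moment definitions of skewness and (non-excess) kurtosis and estimates Shannon entropy from an equal-width histogram. The abstract does not say which estimators or bin count the authors used, so the function name, the `bins` parameter, and those choices are assumptions.

```python
import math


def distribution_stats(weights, bins=10):
    """Return (skewness, kurtosis, shannon_entropy_bits) for a list of weights.

    Skewness and kurtosis are population moment ratios (a normal distribution
    has kurtosis 3). Entropy is computed over an equal-width histogram.
    """
    n = len(weights)
    if n < 2 or min(weights) == max(weights):
        raise ValueError("need at least two distinct weights")
    mean = sum(weights) / n
    m2 = sum((w - mean) ** 2 for w in weights) / n  # variance
    m3 = sum((w - mean) ** 3 for w in weights) / n
    m4 = sum((w - mean) ** 4 for w in weights) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2

    lo, hi = min(weights), max(weights)
    width = (hi - lo) / bins
    counts = [0] * bins
    for w in weights:
        # Clamp so the maximum value falls into the last bin.
        counts[min(int((w - lo) / width), bins - 1)] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)
    return skewness, kurtosis, entropy


# A few weak connections and one strong one: positive (right) skew.
skew, kurt, ent = distribution_stats([0.1, 0.1, 0.1, 0.2, 0.9], bins=5)
print(skew > 0)  # True
```

A positive skewness corresponds to the right-skewed pattern the abstract reports, and a histogram entropy close to log2(bins) indicates a near-uniform spread of weights across bins.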

Paperid: 2085, https://arxiv.org/pdf/2501.06901.pdf

Abstract:
The field of serious games for health has grown significantly, demonstrating effectiveness in various clinical contexts such as stroke, spinal cord injury, and degenerative neurological diseases. Despite their potential benefits, therapists face barriers to adopting serious games in rehabilitation, including limited training and game literacy, concerns about cost and equipment availability, and a lack of evidence-based research on game effectiveness. Serious games for rehabilitation often involve repetitive exercises, which can be tedious and reduce motivation for continued rehabilitation, treating clients as passive recipients of clinical outcomes rather than players. This study identifies gaps and provides essential insights for advancing serious games in rehabilitation, aiming to enhance their engagement for clients and effectiveness as a therapeutic tool. Addressing these challenges requires a paradigm shift towards developing and co-creating serious games for rehabilitation with therapists, researchers, and stakeholders. Furthermore, future research is crucial to advance the development of serious games, ensuring they adhere to evidence-based principles and engage both clients and therapists. This endeavor will identify gaps in the field, inspire new directions, and support the creation of practical guidelines for serious games research.

Paperid: 2086, https://arxiv.org/pdf/2501.06073.pdf

Abstract:
In this study, we investigated gaze-based interaction methods within a virtual reality game with a visual search task with 52 participants. We compared four different interaction techniques: selection by dwell time, or confirmation of selection by head orientation, nodding, or smooth pursuit eye movements. We evaluated both subjective and objective performance metrics, including NASA-TLX for subjective task load as well as time to find the correct targets and points achieved for objective analysis. The results showed significant differences between the interaction methods in terms of NASA-TLX dimensions, time to find the right targets, and overall performance scores, suggesting differential effectiveness of gaze-based approaches in improving intuitive system communication. Interestingly, the results revealed gender-specific differences, suggesting implications for the design of gaze-based interaction paradigms that are optimized for different user needs and preferences. These findings could help to develop more customized and effective gaze interaction systems that can improve accessibility and user satisfaction.

Paperid: 2087, https://arxiv.org/pdf/2501.05985.pdf

Abstract:
Effective questionnaire design improves the validity of the results, but creating and adapting questionnaires across contexts is challenging due to resource constraints and limited expert access. Recently, the emergence of LLMs has led researchers to explore their potential in survey research. In this work, we focus on the suitability of LLMs in assisting the generation and adaptation of questionnaires. We introduce a novel pipeline that leverages LLMs to create new questionnaires, pretest them with a target audience to determine potential issues, and adapt existing standardized questionnaires for different contexts. We evaluated our pipeline for creation and adaptation through two studies on Prolific, involving 238 participants from the US and 118 participants from South Africa. Our findings show that participants found LLM-generated text clearer, LLM-pretested text more specific, and LLM-adapted questions slightly clearer and less biased than traditional ones. Our work opens new opportunities for LLM-driven questionnaire support in survey research.

Paperid: 2088, https://arxiv.org/pdf/2501.05471.pdf

Abstract:
The increasing complexity of machine learning models in computer vision, particularly in face verification, requires the development of explainable artificial intelligence (XAI) to enhance interpretability and transparency. This study extends previous work by integrating semantic concepts derived from human cognitive processes into XAI frameworks to bridge the comprehension gap between model outputs and human understanding. We propose a novel approach combining global and local explanations, using semantic features defined by user-selected facial landmarks to generate similarity maps and textual explanations via large language models (LLMs). The methodology was validated through quantitative experiments and user feedback, demonstrating improved interpretability. Results indicate that our semantic-based approach, particularly the most detailed set, offers a more nuanced understanding of model decisions than traditional methods. User studies highlight a preference for our semantic explanations over traditional pixel-based heatmaps, emphasizing the benefits of human-centric interpretability in AI. This work contributes to the ongoing efforts to create XAI frameworks that align AI models' behaviour with human cognitive processes, fostering trust and acceptance in critical applications.

Paperid: 2089, https://arxiv.org/pdf/2501.04902.pdf

Abstract:
Advances in Artificial Intelligence (AI) have generated widespread enthusiasm for the potential of AI to support our understanding and protection of the environment. As such tools move from basic research to more consequential settings, such as regulatory enforcement, the human context of how AI is utilized, interpreted, and deployed becomes increasingly critical. Yet little work has systematically examined the role of such organizational goals and incentives in deploying AI systems. We report results from a unique case study of a satellite imagery-based AI tool to detect dumping of agricultural waste, with concurrent field trials with the Wisconsin Department of Natural Resources (WDNR) and a non-governmental environmental interest group in which the tool was utilized for field investigations when dumping was presumptively illegal in February-March 2023. Our results are threefold: First, both organizations confirmed a similar level of ground-truth accuracy for the model's detections. Second, they differed, however, in their overall assessment of its usefulness, as WDNR was interested in clear violations of existing law, while the interest group sought to document environmental risk beyond the scope of existing regulation. Dumping by an unpermitted entity or just before February 1, for instance, was deemed irrelevant by WDNR. Third, while AI tools promise to prioritize allocation of environmental protection resources, they may expose important gaps in existing law.

Paperid: 2090, https://arxiv.org/pdf/2501.04163.pdf

Abstract:
All creative tasks require creators to iteratively produce, select, and discard potentially useful ideas. Now, creativity tools include generative AI features (e.g., Photoshop Generative Fill) that increase the number of alternatives creators consider due to rapid experiments with text prompts and random generations. Creators use tedious manual systems for organizing their prior ideas by saving file versions or hiding layers, but they lack the support they want for reusing prior alternatives in personal work or in communication with others. We present HistoryPalette, a system that supports exploration and reuse of prior designs in generative image creation and editing. Using HistoryPalette, creators and their collaborators explore a "palette" of prior design alternatives organized by spatial position, topic category, and creation time. HistoryPalette enables creators to quickly preview and reuse their prior work. In creative professional and client collaborator user studies, participants generated and edited images by exploring and reusing past design alternatives with HistoryPalette.

Paperid: 2091, https://arxiv.org/pdf/2501.03467.pdf

Abstract:
Human safety is critical in applications involving close human-robot interactions (HRI) and is a key aspect of physical compatibility between humans and robots. While measures of human safety in HRI exist, these mainly target industrial settings involving robotic manipulators. Less attention has been paid to settings where mobile robots and humans share the space. This paper introduces a new robot-centered directional framework of human safety. It is particularly useful for evaluating mobile robots as they operate in environments populated by multiple humans. The framework integrates several key metrics, such as each human's relative distance, speed, and orientation. The core novelty lies in the framework's flexibility to accommodate different application requirements while allowing for both the robot-centered and external observer points of view. We instantiate the framework by using RGB-D based vision integrated with a deep learning-based human detection pipeline to yield a generalized safety index (GSI) that instantaneously assesses human safety. We evaluate GSI's capability of producing appropriate, robust, and fine-grained safety measures in real-world experimental scenarios and compare its performance with extant safety models.

Paperid: 2092, https://arxiv.org/pdf/2501.03190.pdf

Abstract:
Videoconferencing is now a frequent mode of communication in both professional and informal settings, yet it often lacks the fluidity and enjoyment of in-person conversation. This study leverages multimodal machine learning to predict moments of negative experience in videoconferencing. We sampled thousands of short clips from the RoomReader corpus, extracting audio embeddings, facial actions, and body motion features to train models for identifying low conversational fluidity, low enjoyment, and classifying conversational events (backchanneling, interruption, or gap). Our best models achieved an ROC-AUC of up to 0.87 on hold-out videoconference sessions, with domain-general audio features proving most critical. This work demonstrates that multimodal audio-video signals can effectively predict high-level subjective conversational outcomes. In addition, this is a contribution to research on videoconferencing user experience by showing that multimodal machine learning can be used to identify rare moments of negative user experience for further study or mitigation.
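
For readers unfamiliar with the ROC-AUC figure quoted above, the metric can be read as the probability that a randomly chosen positive clip receives a higher score than a randomly chosen negative clip. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth values (hypothetical example data).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of clips, so it suits small illustrations only; a sort-based implementation would be preferable for thousands of clips.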

Paperid: 2093, https://arxiv.org/pdf/2501.02641.pdf

Abstract:
Designing public transportation cabins that effectively engage passengers and encourage more sustainable mobility options requires a deep understanding of how users from different backgrounds visually interact with these environments. This study employs eye-tracking technology to investigate visual attention patterns across six distinct cabin designs, ranging from the current and poorly maintained versions to enhanced, biophilic-focused, cyclist-friendly, and productivity-focused configurations. A total of N = 304 participants engaged with each cabin design while eye-movement metrics such as Fixation Counts, Time to First Fixation (TFF), First Fixation Duration (FFD), Stationary Gaze Entropy (SGE), and Gaze Transition Entropy (GTE) were recorded. Results revealed that alternative cabin configurations consistently exhibited shorter TFFs and lower entropy measures compared to the baseline current version. Specifically, designs incorporating natural elements and biophilic aspects, streamlined layouts, or functional amenities facilitated quicker orientation and more structured gaze patterns, indicating enhanced visual engagement and possibly reduced cognitive load. In contrast, the poorly maintained cabin design was associated with higher entropy values, suggesting more scattered and less predictable visual exploration. Demographic factors, particularly ethnicity, significantly influenced FFD in certain designs, with Non-white participants showing reduced fixation durations in the enhanced and poorly maintained environments, highlighting the importance of inclusive design considerations. Moreover, transportation-related demographic factors such as frequency of public transport use, trip purpose, and duration of use significantly influenced visual attention metrics in various cabin designs.
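
The abstract does not state how SGE and GTE were computed. The sketch below assumes the common Shannon-entropy definitions over a sequence of fixated areas of interest (AOIs): SGE is the entropy of the fixation distribution, and GTE is the conditional entropy of the next AOI given the current one. The function names are hypothetical, and source AOIs are weighted by their empirical frequency among transitions rather than by a fitted stationary distribution.

```python
import math
from collections import Counter, defaultdict


def stationary_gaze_entropy(aoi_sequence):
    """Shannon entropy (bits) of the distribution of fixations over AOIs."""
    counts = Counter(aoi_sequence)
    total = len(aoi_sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gaze_transition_entropy(aoi_sequence):
    """Conditional entropy (bits) of the next AOI given the current AOI."""
    transitions = defaultdict(Counter)
    for src, dst in zip(aoi_sequence, aoi_sequence[1:]):
        transitions[src][dst] += 1
    n_transitions = len(aoi_sequence) - 1
    entropy = 0.0
    for dst_counts in transitions.values():
        row_total = sum(dst_counts.values())
        # Assumption: weight each source AOI by its share of all transitions.
        p_src = row_total / n_transitions
        row_entropy = -sum(
            (c / row_total) * math.log2(c / row_total)
            for c in dst_counts.values()
        )
        entropy += p_src * row_entropy
    return entropy


if __name__ == "__main__":
    fixations = ["seat", "window", "seat", "door"]  # illustrative AOI labels
    print(stationary_gaze_entropy(fixations))
    print(gaze_transition_entropy(fixations))
```

Under these definitions, a strictly alternating scan between two AOIs has high stationary entropy but zero transition entropy, which is why the two measures are reported separately.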

Paperid: 2094, https://arxiv.org/pdf/2501.02084.pdf

Abstract:
Spatial scheduling of electrode activation ("rastering") is essential for safely operating high-density retinal implants, yet its perceptual consequences remain poorly understood. This study systematically evaluates the impact of raster patterns, or spatial arrangements of sequential electrode activation, on performance and perceived difficulty in simulated prosthetic vision (SPV). By addressing this gap, we aimed to identify patterns that optimize functional vision in retinal implants. Sighted participants completed letter recognition and motion discrimination tasks under four raster patterns (horizontal, vertical, checkerboard, and random) using an immersive SPV system. The simulations emulated epiretinal implant perception and employed psychophysically validated models of electrode activation, phosphene appearance, nonlinear spatial summation, and temporal dynamics, ensuring realistic representation of prosthetic vision. Performance accuracy and self-reported difficulty were analyzed to assess the effects of raster patterning. The checkerboard pattern consistently outperformed other raster patterns, yielding significantly higher accuracy and lower difficulty ratings across both tasks. The horizontal and vertical patterns introduced biases aligned with apparent motion artifacts, while the checkerboard minimized such effects. Random patterns resulted in the lowest performance, underscoring the importance of structured activation. Notably, checkerboard matched performance in the "No Raster" condition, despite conforming to groupwise safety constraints. This is the first quantitative, task-based evaluation of raster patterns in SPV. Checkerboard-style scheduling enhances perceptual clarity without increasing computational load, offering a low-overhead, clinically relevant strategy for improving usability in next-generation retinal prostheses.
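
To make the four raster patterns concrete, the sketch below assigns each electrode in a rectangular grid to a timing group, so that electrodes sharing a group index are activated together. This is an illustration under stated assumptions rather than the study's implementation: the grouping rules (row index, column index, row plus column, shuffled labels), the default of two groups, and the name `raster_groups` are all hypothetical.

```python
import random


def raster_groups(n_rows, n_cols, pattern, n_groups=2, seed=0):
    """Return a grid of group indices for sequential electrode activation.

    Electrodes with the same index are assumed to fire in the same time slot.
    """
    if pattern == "horizontal":
        # Whole rows share a group.
        return [[r % n_groups for _ in range(n_cols)] for r in range(n_rows)]
    if pattern == "vertical":
        # Whole columns share a group.
        return [[c % n_groups for c in range(n_cols)] for _ in range(n_rows)]
    if pattern == "checkerboard":
        # With two groups, neighbouring electrodes alternate like a chessboard.
        return [[(r + c) % n_groups for c in range(n_cols)] for r in range(n_rows)]
    if pattern == "random":
        # Balanced group sizes, shuffled positions (seeded for repeatability).
        rng = random.Random(seed)
        labels = [i % n_groups for i in range(n_rows * n_cols)]
        rng.shuffle(labels)
        return [labels[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
    raise ValueError(f"unknown pattern: {pattern}")


if __name__ == "__main__":
    for name in ("horizontal", "vertical", "checkerboard", "random"):
        print(name, raster_groups(4, 4, name))
```

With more than two groups, the `(r + c) % n_groups` rule produces diagonal stripes rather than a strict checkerboard, so a real device schedule would need its own definition.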

Paperid: 2095, https://arxiv.org/pdf/2501.01813.pdf

Abstract:
The vision of adaptive architecture proposes that robotic technologies could enable interior spaces to physically transform in a bidirectional interaction with occupants. Yet, it is still unknown how this interaction could unfold in an understandable way. Inspired by HRI studies where robotic furniture gestured intents to occupants by deliberately positioning or moving in space, we hypothesise that adaptive architecture could also convey intents through gestures performed by a mobile robotic partition. To explore this design space, we invited 15 multidisciplinary experts to join co-design improvisation sessions, where they manually manoeuvred a deactivated robotic partition to design gestures conveying six architectural intents that varied in purpose and urgency. Using a gesture elicitation method alongside motion-tracking data, a Laban-based questionnaire, and thematic analysis, we identified 20 unique gestural strategies. Through categorisation, we introduced architectonic gestures as a novel strategy for robotic furniture to convey intent by indexically leveraging its spatial impact, complementing the established deictic and emblematic gestures. Our study thus represents an exploratory step toward making the autonomous gestures of adaptive architecture more legible. By understanding how robotic gestures are interpreted based not only on their motion but also on their spatial impact, we contribute to bridging HRI with Human-Building Interaction research.

Paperid: 2096, https://arxiv.org/pdf/2501.01374.pdf

Abstract:
Through participatory design, we are developing a computational system for the semi-automated assessment of the Action Research Arm Test (ARAT) for stroke rehabilitation. During rehabilitation assessment, clinicians rate movement segments and components in the context of overall task performance. Clinicians change viewing angles to assess particular components. Through studies with clinicians, we develop a system that includes: a) unobtrusive multi-camera capture, b) a segmentation interface for non-expert segmentors, and c) a rating interface for expert clinicians. Five clinicians independently captured 1800 stroke survivor videos with <5% errors. Three segmentors have segmented 760 of these videos, averaging 20 seconds per segment. They favor the recommended camera view >90% of the time. Multiple clinicians have rated the segmented videos while reporting minimal problems. The complete data will be used for training an automated segmentation and rating system that empowers the clinicians as the ratings will be compatible with clinical practice and intuition.

Paperid: 2097, https://arxiv.org/pdf/2506.24039.pdf

Abstract:
Zero-shot and prompt-based models have excelled at visual reasoning tasks by leveraging large-scale natural image corpora, but they often fail on sparse and domain-specific scientific image data. We introduce Zenesis, a no-code interactive computer vision platform designed to reduce data readiness bottlenecks in scientific imaging workflows. Zenesis integrates lightweight multimodal adaptation for zero-shot inference on raw scientific data, human-in-the-loop refinement, and heuristic-based temporal enhancement. We validate our approach on Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) datasets of catalyst-loaded membranes. Zenesis outperforms baselines, achieving an average accuracy of 0.947, Intersection over Union (IoU) of 0.858, and Dice score of 0.923 on amorphous catalyst samples; and 0.987 accuracy, 0.857 IoU, and 0.923 Dice on crystalline samples. These results represent a significant performance gain over conventional methods such as Otsu thresholding and standalone models like the Segment Anything Model (SAM). Zenesis enables effective image segmentation in domains where annotated datasets are limited, offering a scalable solution for scientific discovery.
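
The three scores reported above have standard definitions for binary segmentation masks. The following is a minimal sketch of those definitions on flattened 0/1 masks; it is not the Zenesis evaluation code, and the function name and toy masks are illustrative assumptions.

```python
def segmentation_metrics(pred, truth):
    """Pixel accuracy, IoU, and Dice for two equally sized binary masks.

    pred, truth: flat sequences of 0/1 values (hypothetical example data).
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("masks must be non-empty and the same length")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    accuracy = (tp + tn) / len(pred)
    # Convention assumed here: two empty masks count as a perfect match.
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return accuracy, iou, dice


if __name__ == "__main__":
    print(segmentation_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

For a single mask pair, Dice equals 2·IoU / (1 + IoU). The reported averages (for example IoU 0.858 and Dice 0.923) are roughly in line with this relation, although it need not hold exactly for scores averaged over many images.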

Paperid: 2098, https://arxiv.org/pdf/2506.23850.pdf

Abstract:
This paper introduces a novel architectural framework that integrates Large Language Models (LLMs) with email interfaces to automate administrative tasks, specifically targeting accessibility barriers in enterprise environments. The system connects email communication channels with Optical Character Recognition (OCR) and intelligent automation, enabling non-technical administrative staff to delegate complex form-filling and document processing tasks using familiar email interfaces. By treating the email body as a natural language prompt and attachments as contextual information, the workflow bridges the gap between advanced AI capabilities and practical usability. Empirical evaluation shows that the system can complete complex administrative forms in under 8 seconds of automated processing, with human supervision reducing total staff time by a factor of three to four compared to manual workflows. The top-performing LLM accurately filled 16 out of 29 form fields and reduced the total cost per processed form by 64% relative to manual completion. These findings demonstrate that email-based LLM integration is a viable and cost-effective approach for democratizing advanced automation in organizational settings, supporting widespread adoption without requiring specialized technical knowledge or major workflow changes. This aligns with broader trends in leveraging LLMs to enhance accessibility and automate complex tasks for non-technical users, making technology more inclusive and efficient.

Paperid: 2099, https://arxiv.org/pdf/2506.23826.pdf

Abstract:
Human Digital Twins (HDTs) have traditionally been conceptualized as data-driven models designed to support decision-making across various domains. However, recent advancements in conversational AI open new possibilities for HDTs to function as authentic, interactive digital counterparts of individuals. This paper introduces a novel HDT system architecture that integrates large language models with dynamically updated personal data, enabling it to mirror an individual's conversational style, memories, and behaviors. To achieve this, our approach implements context-aware memory retrieval, neural plasticity-inspired consolidation, and adaptive learning mechanisms, creating a more natural and evolving digital persona. The resulting system not only replicates an individual's unique conversational style depending on who they are speaking with, but also enriches responses with dynamically captured personal experiences, opinions, and memories. While this marks a significant step toward developing authentic virtual counterparts, it also raises critical ethical concerns regarding privacy, accountability, and the long-term implications of persistent digital identities. This study contributes to the field of HDTs by describing our novel system architecture, demonstrating its capabilities, and discussing future directions and emerging challenges to ensure the responsible and ethical development of HDTs.

Paperid: 2100, https://arxiv.org/pdf/2506.23739.pdf

Abstract:
Ensuring safe and realistic interactions between automated driving systems and vulnerable road users (VRUs) in urban environments requires advanced testing methodologies. This paper presents a test environment that combines a Vehicle-in-the-Loop (ViL) test bench with a motion laboratory, demonstrating the feasibility of cyber-physical (CP) testing of vehicle-pedestrian and vehicle-cyclist interactions. Building upon previous work focused on pedestrian localization, we further validate a human pose estimation (HPE) approach through a comparative analysis of real-world (RW) and virtual representations of VRUs. The study examines the perception of full-body motion using a commercial monocular camera-based 3D skeletal detection AI. The virtual scene is generated in Unreal Engine 5, where VRUs are animated in real time and projected onto a screen to stimulate the camera. The proposed stimulation technique ensures the correct perspective, enabling realistic vehicle perception. To assess the accuracy and consistency of HPE across RW and CP domains, we analyze the reliability of detections as well as variations in movement trajectories and joint estimation stability. The validation includes dynamic test scenarios where human avatars, both walking and cycling, are monitored under controlled conditions. Our results show a strong alignment in HPE between RW and CP test conditions for stable motion patterns, while notable inaccuracies persist under dynamic movements and occlusions, particularly for complex cyclist postures. These findings contribute to refining CP testing approaches for evaluating next-generation AI-based vehicle perception and to enhancing interaction models of automated vehicles and VRUs in CP environments.

Paperid: 2101, https://arxiv.org/pdf/2506.22111.pdf

Abstract:
With the rapid advancements in autonomous driving, accurately predicting pedestrian behavior has become essential for ensuring safety in complex and unpredictable traffic conditions. The growing interest in this challenge highlights the need for comprehensive datasets that capture unstructured environments, enabling the development of more robust prediction models to enhance pedestrian safety and vehicle navigation. In this paper, we introduce an Indian driving pedestrian dataset designed to address the complexities of modeling pedestrian behavior in unstructured environments, such as illumination changes, occlusion of pedestrians, unsignalized scene types and vehicle-pedestrian interactions. The dataset provides high-level and detailed low-level comprehensive annotations focused on pedestrians requiring the ego-vehicle's attention. Evaluation of the state-of-the-art intention prediction methods on our dataset shows a significant performance drop of up to 15%, while trajectory prediction methods underperform with an increase of up to 1208 in MSE, compared with standard pedestrian datasets. Additionally, we present exhaustive quantitative and qualitative analysis of intention and trajectory baselines. We believe that our dataset will open new challenges for the pedestrian behavior research community to build robust models. Project Page: https://cvit.iiit.ac.in/research/projects/cvit-projects/iddped

Paperid: 2102, https://arxiv.org/pdf/2506.21780.pdf

Abstract:
Due to the COVID-19 pandemic, many professional entities shifted toward remote collaboration and video conferencing (VC) tools. Social virtual reality (VR) platforms present an alternative to VC for meetings and collaborative activities. Well-crafted social VR environments could enhance feelings of co-presence and togetherness at meetings, helping reduce the need for carbon-intensive travel to face-to-face meetings. This research contributes to creating meeting tools in VR by exploring the effects of avatar styles and virtual environments on groups' creative performance using the Mozilla Hubs platform. We present the results of two sequential studies. Study One surveys avatar and environment preferences in various VR meeting contexts (N=87). Study Two applies these findings to the design of a between-subjects and within-subjects study where participants (N=40) perform creativity tasks in pairs as embodied avatars in different virtual settings using VR headsets. We discuss the design implications of avatar appearances and meeting settings on teamwork.

Paperid: 2103, https://arxiv.org/pdf/2506.21536.pdf

Abstract:
With the rapid development of digital technology, AI-driven psychological counseling has gradually become an important research direction in the field of mental health. However, existing models still have deficiencies in dialogue safety, detailed scenario handling, and lightweight deployment. To address these issues, this study proposes PsyLite, a lightweight psychological counseling large language model agent developed based on the base model InternLM2.5-7B-chat. Through a two-stage training strategy (hybrid distillation data fine-tuning and ORPO preference optimization), PsyLite enhances the model's deep-reasoning ability, psychological counseling ability, and safe dialogue ability. After deployment using Ollama and Open WebUI, a custom workflow is created with Pipelines. An innovative conditional RAG is designed to introduce crosstalk humor elements at appropriate times during psychological counseling to enhance user experience and decline dangerous requests to strengthen dialogue safety. Evaluations show that PsyLite outperforms the baseline models in the Chinese general evaluation (CEval), psychological counseling professional evaluation (CPsyCounE), and dialogue safety evaluation (SafeDialBench), particularly in psychological counseling professionalism (CPsyCounE score improvement of 47.6%) and dialogue safety (safe score improvement of 2.4%). Additionally, the model uses quantization technology (GGUF q4_k_m) to achieve low hardware deployment (5GB memory is sufficient for operation), providing a feasible solution for psychological counseling applications in resource-constrained environments.

Paperid: 2104, https://arxiv.org/pdf/2506.21417.pdf

Abstract:
This study presents a lightweight, wearable fingertip haptic device that provides physics-based haptic feedback for dexterous manipulation in virtual environments without hindering real-world interactions. The device, designed with thin strings and actuators attached to the fingernails, ensures minimal weight (1.55 g per finger) and preserves finger flexibility. Integrating the software with a physics engine renders multiple types of haptic feedback (grip force, collision, and sliding vibration feedback). We evaluated the device's performance in pressure perception, slip feedback, typical dexterous manipulation tasks, and daily operations, and we gathered user experience through subjective assessments. Our results show that participants could perceive and respond to pressure and vibration feedback. Through dexterous manipulation experiments, we further demonstrated that these minimal haptic cues significantly improved virtual task efficiency, showcasing how lightweight haptic feedback can enhance manipulation performance without complex mechanisms. The device's ability to preserve tactile sensations and minimize hindrance to real-world operations is a key advantage over glove-type haptic devices. This research offers a potential solution for designing haptic interfaces that balance lightweight construction, haptic feedback for dexterous manipulation, and daily wearability.

Paperid: 2105, https://arxiv.org/pdf/2506.21338.pdf

Abstract:
Brain-computer interface (BCI) technology utilizing electroencephalography (EEG) marks a transformative innovation, empowering motor-impaired individuals to engage with their environment on equal footing. Despite its promising potential, developing subject-invariant and session-invariant BCI systems remains a significant challenge due to the inherent complexity and variability of neural activity across individuals and over time, compounded by EEG hardware constraints. While prior studies have sought to develop robust BCI systems, existing approaches remain ineffective in capturing the intricate spatiotemporal dependencies within multichannel EEG signals. This study addresses this gap by introducing the attentive graph-temporal convolutional network (AGTCNet), a novel graph-temporal model for motor imagery EEG (MI-EEG) classification. Specifically, AGTCNet leverages the topographic configuration of EEG electrodes as an inductive bias and integrates graph convolutional attention network (GCAT) to jointly learn expressive spatiotemporal EEG representations. The proposed model significantly outperformed existing MI-EEG classifiers, achieving state-of-the-art performance while utilizing a compact architecture, underscoring its effectiveness and practicality for BCI deployment. With a 49.87% reduction in model size, 64.65% faster inference time, and shorter input EEG signal, AGTCNet achieved a moving average accuracy of 66.82% for subject-independent classification on the BCI Competition IV Dataset 2a, which further improved to 82.88% when fine-tuned for subject-specific classification. On the EEG Motor Movement/Imagery Dataset, AGTCNet achieved moving average accuracies of 64.14% and 85.22% for 4-class and 2-class subject-independent classifications, respectively, with further improvements to 72.13% and 90.54% for subject-specific classifications.

Paperid: 2106, https://arxiv.org/pdf/2506.21322.pdf

Abstract:
Advancements in robotic capabilities for providing physical assistance, psychological support, and daily health management are making the deployment of intelligent healthcare robots in home environments increasingly feasible in the near future. However, challenges arise when the information provided by these robots contradicts users' memory, raising concerns about user trust and decision-making. This paper presents a study that examines how varying a robot's level of transparency and sociability influences user interpretation, decision-making and perceived trust when faced with conflicting information from a robot. In a 2 x 2 between-subjects online study, 176 participants watched videos of a Furhat robot acting as a family healthcare assistant and suggesting that a fictional user take medication at a different time from that remembered by the user. Results indicate that robot transparency influenced users' interpretation of information discrepancies: with a low transparency robot, the most frequent assumption was that the user had not correctly remembered the time, while with the high transparency robot, participants were more likely to attribute the discrepancy to external factors, such as a partner or another household member modifying the robot's information. Additionally, participants exhibited a tendency toward overtrust, often prioritizing the robot's recommendations over the user's memory, even when suspecting system malfunctions or third-party interference. These findings highlight the impact of transparency mechanisms in robotic systems, the complexity and importance associated with system access control for multi-user robots deployed in home environments, and the potential risks of users' overreliance on robots in sensitive domains such as healthcare.

Paperid: 2107, https://arxiv.org/pdf/2506.20291.pdf

Abstract:
Conversational Recommender Systems (CRSs) have garnered attention as a novel approach to delivering personalized recommendations through multi-turn dialogues. This review developed a taxonomy framework to systematically categorize relevant publications into four groups: dataset construction, algorithm design, system evaluation, and empirical studies, providing a comprehensive analysis of simulation methods in CRS research. Our analysis reveals that simulation methods play a key role in tackling the main challenges of CRSs. For example, LLM-based simulation methods have been used to create conversational recommendation data, enhance CRS algorithms, and evaluate CRSs. Although several challenges, such as dataset bias, the limited output flexibility of LLM-based simulations, and the gap between text semantic space and behavioral semantics, persist due to the complexity of Human-Computer Interaction (HCI) in CRSs, simulation methods hold significant potential for advancing CRS research. This review offers a thorough summary of the current research landscape in this domain and identifies promising directions for future inquiry.

Paperid: 2108, https://arxiv.org/pdf/2506.19611.pdf

Abstract:
This position paper situates AR beauty filters within the broader debate on Body Politics in HCI. We argue that these filters are not neutral tools but technologies of governance that reinforce racialized, gendered, and ableist beauty standards. Through naming conventions, algorithmic bias, and platform governance, they impose aesthetic norms while concealing their influence. To address these challenges, we advocate for transparency-driven interventions and a critical rethinking of algorithmic aesthetics and digital embodiment.

Paperid: 2109, https://arxiv.org/pdf/2506.19519.pdf

Abstract:
Virtual reality (VR) can enrich neuropsychological testing, yet the ergonomic trade-offs of its input modes remain under-examined. Seventy-seven healthy volunteers, young (19-29 y) and middle-aged (35-56 y), completed a VR Trail-Making Test with three pointing methods: eye-tracking, head-gaze, and a six-degree-of-freedom hand controller. Completion time, spatial accuracy, and error counts for the simple (Trail A) and alternating (Trail B) sequences were analysed in 3 x 2 x 2 mixed-model ANOVAs; post-trial scales captured usability (SUS), user experience (UEQ-S), and acceptability. Age dominated behaviour: younger adults were reliably faster, more precise, and less error-prone. Against this backdrop, input modality mattered. Eye-tracking yielded the best spatial accuracy and shortened Trail A time relative to manual control; head-gaze matched eye-tracking on Trail A speed and became the quickest, least error-prone option on Trail B. Controllers lagged on every metric. Subjective ratings were high across the board, with only a small usability dip in middle-aged low-gamers. Overall, gaze-based ray-casting clearly outperformed manual pointing, but optimal choice depended on task demands: eye-tracking maximised spatial precision, whereas head-gaze offered calibration-free enhanced speed and error-avoidance under heavier cognitive load. TMT-VR appears to be an accurate, engaging, and ergonomically adaptable assessment, yet it requires age-specific, stratified norms.

Paperid: 2110, https://arxiv.org/pdf/2506.19202.pdf

Abstract:
Roboticists often design with the assumption that assistive robots should be fully autonomous. However, it remains unclear whether users prefer highly autonomous robots, as prior work in assistive robotics suggests otherwise. High robot autonomy can reduce the user's sense of agency, which represents feeling in control of one's environment. How much control do users, in fact, want over the actions of robots used for in-home assistance? We investigate how robot autonomy levels affect users' sense of agency and the autonomy level they prefer in contexts with varying risks. Our study asked participants to rate their sense of agency as robot users across four distinct autonomy levels and to rank their robot preferences with respect to various household tasks. Our findings revealed that participants' sense of agency was primarily influenced by two factors: (1) whether the robot acts autonomously, and (2) whether a third party is involved in the robot's programming or operation. Notably, an end-user programmed robot highly preserved users' sense of agency, even though it acts autonomously. However, in high-risk settings, e.g., preparing a snack for a child with allergies, they preferred robots that prioritized their control significantly more. Additional contextual factors, such as trust in a third party operator, also shaped their preferences.

Paperid: 2111, https://arxiv.org/pdf/2506.18706.pdf

Abstract:
The integration of Generative AI (GenAI) into creative communities, like fanfiction, is reshaping how stories are created, shared, and valued. This study investigates the perceptions of 157 active fanfiction members, both readers and writers, regarding AI-generated content in fanfiction. Our research explores the impact of GenAI on community dynamics, examining how AI affects the participatory and collaborative nature of these spaces. The findings reveal responses ranging from cautious acceptance of AI's potential for creative enhancement to concerns about authenticity, ethical issues, and the erosion of human-centered values. Participants emphasized the importance of transparency and expressed worries about losing social connections. Our study highlights the need for thoughtful AI integration in creative platforms using design interventions that enable ethical practices, promote transparency, increase engagement and connection, and preserve the community's core values.

Paperid: 2112, https://arxiv.org/pdf/2506.18365.pdf

Abstract:
Despite growing interest in Learning-by-Teaching (LbT), few studies have explored how this paradigm can be implemented with autonomous, peer-like social robots in real classrooms. Most prior work has relied on scripted or Wizard-of-Oz behaviors, limiting our understanding of how real-time, interactive learning can be supported by artificial agents. This study addresses this gap by introducing Interactive Reinforcement Learning (RL) as a cognitive model for teachable social robots. We conducted two between-subject experiments with 58 primary school children, who either taught a robot or practiced independently on a tablet while learning French vocabulary (memorization) and grammatical rules (inference). The robot, powered by Interactive RL, learned from the child's evaluative feedback. Children in the LbT condition achieved significantly higher retention gains compared to those in the self-practice condition, especially on the grammar task. Learners with lower prior knowledge benefited most from teaching the robot. Behavioural metrics revealed that children adapted their teaching strategies over time and engaged more deeply during inference tasks. This work makes two contributions: (1) it introduces Interactive RL as a pedagogically effective and scalable model for peer-robot learning, and (2) it demonstrates, for the first time, the feasibility of deploying multiple autonomous robots simultaneously in real classrooms. These findings extend theoretical understanding of LbT by showing that social robots can function not only as passive tutees but as adaptive partners that enhance meta-cognitive engagement and long-term learning outcomes.

Paperid: 2113, https://arxiv.org/pdf/2506.18308.pdf

Abstract:
Rear-end collisions constitute a large portion of crashes on the road, despite efforts to mitigate them, such as forward collision warnings. The chance of rear-end collisions is closely related to drivers' car-following (CF) behaviors in the traffic flow. Given that drivers may rely on more than the information of the direct lead vehicle (DLV) when making CF decisions, expanding drivers' perceptual range by providing beyond-visual-range (BVR) information based on vehicle-to-vehicle (V2V) communication may enhance CF safety. Thus, four different human-machine interfaces (HMIs) providing various types of BVR information in CF events were designed, including Brake-HMI showing only the brake action of indirect lead vehicles (ILV), Dis-HMI and THW-HMI showing the relative distance and time headway between the ILV and DLV, respectively, and Video-HMI showing the live-stream video of the ILV from the perspective of the DLV. A driving simulator experiment with 40 participants was conducted to evaluate the impact of BVR-based HMI on driving safety in CF events. We found that, in general, BVR information could improve CF safety without overloading drivers and compromising their visual attention allocation strategies, particularly among novice drivers, by enabling quicker brake responses and increasing time headway and time-to-collision in brake events. The Brake-HMI yielded the safest performance in chain brake events, whereas Video-HMI increased attentional demands without observable benefits. This research provides insights into enabling drivers' BVR perception based on V2V communication to enhance driving safety in CF scenarios.
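
The two safety measures named above, time headway and time-to-collision, have simple textbook forms. The sketch below assumes the gap-based definitions (gap divided by the follower's speed, and gap divided by the closing speed); the paper's exact definitions are not given in the abstract, and the function names, parameters, and units are illustrative assumptions.

```python
import math


def time_headway(gap_m, follower_speed_mps):
    """Time headway in seconds: gap to the lead vehicle over follower speed."""
    if follower_speed_mps <= 0:
        return math.inf  # a stationary follower never reaches the lead vehicle
    return gap_m / follower_speed_mps


def time_to_collision(gap_m, follower_speed_mps, lead_speed_mps):
    """Time-to-collision in seconds, assuming both speeds stay constant."""
    closing_speed = follower_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return math.inf  # the follower is not closing the gap
    return gap_m / closing_speed


if __name__ == "__main__":
    # Hypothetical scenario: 30 m gap, follower at 20 m/s, lead at 15 m/s.
    print(time_headway(30.0, 20.0))
    print(time_to_collision(30.0, 20.0, 15.0))
```

Larger values of either measure indicate a wider safety margin, which is the direction of improvement the abstract reports for brake events.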

Paperid: 2114, https://arxiv.org/pdf/2506.18196.pdf

Abstract:
In this work, we explore the musical interface potential of the MindCube, an interactive device designed to study emotions. Embedding diverse sensors and input devices, this interface resembles a fidget cube toy commonly used to help users relieve their stress and anxiety. As such, it is a particularly well-suited controller for musical systems that aim to help with emotion regulation. In this regard, we present two different mappings for the MindCube, with and without AI. With our generative AI mapping, we propose a way to infuse meaning within a latent space and techniques to navigate through it with an external controller. We discuss our results and propose directions for future work.

Paperid: 2115, https://arxiv.org/pdf/2506.18143.pdf

Abstract:
Vocal harmonizers are powerful tools that help solo vocalists enrich their melodies with harmonically supportive voices. These tools exist in various forms, from commercially available pedals and software to custom-built systems, each employing different methods to generate harmonies. Traditional harmonizers often require users to manually specify a key or tonal center, while others allow pitch selection via an external keyboard; both approaches demand some degree of musical expertise. The AI Harmonizer introduces a novel approach by autonomously generating musically coherent four-part harmonies without requiring prior harmonic input from the user. By integrating state-of-the-art generative AI techniques for pitch detection and voice modeling with custom-trained symbolic music models, our system arranges any vocal melody into rich choral textures. In this paper, we present our methods, explore potential applications in performance and composition, and discuss future directions for real-time implementations. While our system currently operates offline, we believe it represents a significant step toward AI-assisted vocal performance and expressive musical augmentation. We release our implementation on GitHub.

Paperid: 2116, https://arxiv.org/pdf/2506.17032.pdf

Abstract:
The literature describes many visualization techniques for different types of data, tasks, and application contexts, and new techniques are proposed on a regular basis. Visualization surveys try to capture the immense space of techniques and structure it with meaningful categorizations. Yet, it remains difficult to understand the similarity of visualization techniques in general. We approach this open research question from two angles. First, we follow a model-driven approach that is based on defining the signature of visualization techniques and interpreting the similarity of signatures as the similarity of their associated techniques. Second, following an expert-driven approach, we asked visualization experts in a small online study for their ad-hoc intuitive assessment of the similarity of pairs of visualization techniques. From both approaches, we gain insight into the similarity of a set of 13 basic and advanced visualizations for different types of data. While our results are so far preliminary and academic, they are first steps toward better understanding the similarity of visualization techniques.

Paperid: 2117, https://arxiv.org/pdf/2506.16051.pdf

Abstract:
Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.

Paperid: 2118, https://arxiv.org/pdf/2506.16010.pdf

Abstract:
Panel discussion allows the audience to learn different perspectives through interactive discussions among experts moderated by a host and a Q&A session with the audience. Despite its benefits, panel discussion in the real world is inaccessible to many who do not have the privilege to participate due to geographical, financial, and time constraints. We present SimuPanel, which simulates panel discussions among academic experts through LLM-based multi-agent interaction. It enables users to define topics of interest for the panel, observe the expert discussion, engage in Q&A, and take notes. SimuPanel employs a host-expert architecture where each panel member is simulated by an agent with specialized expertise, and the panel is visualized in an immersive 3D environment to enhance engagement. Traditional dialogue generation struggles to capture the depth and interactivity of real-world panel discussions. To address this limitation, we propose a novel multi-agent interaction framework that simulates authentic panel dynamics by modeling reasoning strategies and personas of experts grounded in multimedia sources. This framework enables agents to dynamically recall and contribute to the discussion based on past experiences from diverse perspectives. Our technical evaluation and the user study with university students show that SimuPanel was able to simulate more in-depth discussions and engage participants to interact with and reflect on the discussions. As a first step in this direction, we offer design implications for future avenues to improve and harness the power of panel discussion for multimedia learning.

Paperid: 2119, https://arxiv.org/pdf/2506.15873.pdf

Abstract:
Generative AI promises to allow people to create high-quality personalized media. Although powerful, we identify three fundamental design problems with existing tooling through a literature review. We introduce a multimodal generative AI tool, DeckFlow, to address these problems. First, DeckFlow supports task decomposition by allowing users to maintain multiple interconnected subtasks on an infinite canvas populated by cards connected through visual dataflow affordances. Second, DeckFlow supports a specification decomposition workflow where an initial goal is iteratively decomposed into smaller parts and combined using feature labels and clusters. Finally, DeckFlow supports generative space exploration by generating multiple prompt and output variations, presented in a grid, that can feed back recursively into the next design iteration. We evaluate DeckFlow for text-to-image generation against a state-of-practice conversational AI baseline for image generation tasks. We then add audio generation and investigate user behaviors in a more open-ended creative setting with text, image, and audio outputs.

Paperid: 2120, https://arxiv.org/pdf/2506.14147.pdf

Abstract:
K-12 computing teachers must navigate complex trade-offs when selecting programming languages and instructional materials for classrooms with emergent bilingual students. While they aim to foster an inclusive learning environment by addressing language barriers that impact student engagement, they must also align with K-12 computer science curricular guidelines and prepare students for industry-standard programming tools. Because programming languages predominantly use English keywords and most instructional materials are written in English, these linguistic barriers introduce cognitive load and accessibility challenges. This paper examines teachers' decisions in balancing these competing priorities, highlighting the tensions between accessibility, curriculum alignment, and workforce preparation. The findings shed light on how our teacher participants negotiate these trade-offs and what factors influence their selection of programming tools to best support EB students while meeting broader educational and professional goals.

Paperid: 2121, https://arxiv.org/pdf/2506.13904.pdf

Abstract:
Despite promising developments in Explainable Artificial Intelligence (XAI), the practical value of XAI methods remains under-explored and insufficiently validated in real-world settings. Robust and context-aware evaluation is essential, not only to produce understandable explanations but also to ensure their trustworthiness and usability for intended users, yet it tends to be overlooked because there are no clear guidelines on how to design an evaluation with users. This study addresses this gap with two main goals: (1) to develop a framework of well-defined, atomic properties that characterise the user experience of XAI in healthcare; and (2) to provide clear, context-sensitive guidelines for defining evaluation strategies based on system characteristics. We conducted a systematic review of 82 user studies, sourced from five databases, all situated within healthcare settings and focused on evaluating AI-generated explanations. The analysis was guided by a predefined coding scheme informed by an existing evaluation framework, complemented by inductive codes developed iteratively. The review yields three key contributions: (1) a synthesis of current evaluation practices, highlighting a growing focus on human-centred approaches in healthcare XAI; (2) insights into the interrelations among explanation properties; and (3) an updated framework and a set of actionable guidelines to support interdisciplinary teams in designing and implementing effective evaluation strategies for XAI systems tailored to specific application contexts.

Paperid: 2122, https://arxiv.org/pdf/2506.13685.pdf

Abstract:
A key part of modern social dynamics is flaking at short notice. However, anxiety in coming up with believable and socially acceptable reasons to do so can instead lead to 'ghosting', awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. The ability to delegate this task to a Large Language Model (LLM) could substantially reduce friction and enhance the flexibility of users' social lives while greatly minimising the aforementioned creative burden and moral qualms. We introduce FLAKE-Bench, an evaluation of models' capacity to effectively, kindly, and humanely extract themselves from a diverse set of social, professional and romantic scenarios. We report the efficacy of 10 frontier or recently-frontier LLMs in bailing on prior commitments, because nothing says "I value our friendship" like having AI generate your cancellation texts. We open-source FLAKE-Bench at github.com/Cloakless/flake-bench to support future research.

Paperid: 2123, https://arxiv.org/pdf/2506.12437.pdf

Abstract:
This paper explores the growing presence of emotionally responsive artificial intelligence through a critical and interdisciplinary lens. Bringing together the voices of early-career researchers from multiple fields, it explores how AI systems that simulate or interpret human emotions are reshaping our interactions in areas such as education, healthcare, mental health, caregiving, and digital life. The analysis is structured around four central themes: the ethical implications of emotional AI, the cultural dynamics of human-machine interaction, the risks and opportunities for vulnerable populations, and the emerging regulatory, design, and technical considerations. The authors highlight the potential of affective AI to support mental well-being, enhance learning, and reduce loneliness, as well as the risks of emotional manipulation, over-reliance, misrepresentation, and cultural bias. Key challenges include simulating empathy without genuine understanding, encoding dominant sociocultural norms into AI systems, and insufficient safeguards for individuals in sensitive or high-risk contexts. Special attention is given to children, elderly users, and individuals with mental health challenges, who may interact with AI in emotionally significant ways. However, there remains a lack of cognitive or legal protections which are necessary to navigate such engagements safely. The report concludes with ten recommendations, including the need for transparency, certification frameworks, region-specific fine-tuning, human oversight, and longitudinal research. A curated supplementary section provides practical tools, models, and datasets to support further work in this domain.

Paperid: 2124, https://arxiv.org/pdf/2506.12357.pdf

Abstract:
Reproductive well-being education in the Global South is often challenged as many communities perceive many of its contents as misinformation, misconceptions, and language-inappropriate. Our ten-month-long ethnographic study (n=41) investigated the impact of sociocultural landscape, cultural beliefs, and healthcare infrastructure on Bangladeshi people's access to quality reproductive healthcare and set four design goals: combating misinformation, including culturally appropriate language, professionals' accountable moderation, and promoting users' democratic participation. Building on the model of 'Distributive Justice,' we designed and evaluated 'Socheton,' a culturally appropriate AI-mediated tool for reproductive well-being that includes healthcare professionals, AI-language teachers, and community members to moderate and run the activity-based platform. Our user study (n=28) revealed that only combating misinformation and language inappropriateness may still leave the community with a conservative mob culture and patronize reproductive care-seeking. This guides well-being HCI design toward being culturally appropriate in the context of reproductive justice with sensitive marginalized communities.

Paperid: 2125, https://arxiv.org/pdf/2506.12259.pdf

Abstract:
As governments increasingly adopt digital tools, public service chatbots have emerged as a growing communication channel. This paper explores the design considerations and engagement opportunities of public service chatbots, using a 311 chatbot from a metropolitan city as a case study. Our qualitative study consisted of official survey data and 16 interviews examining stakeholder experiences and design preferences for the chatbot. We found two key areas of concern regarding these public chatbots: individual-level and community-level. At the individual level, citizens experience three key challenges: interpretation, transparency, and social contextualization. Moreover, the current chatbot design prioritizes the efficient completion of individual tasks but neglects the broader community perspective. It overlooks how individuals interact and discuss problems collectively within their communities. To address these concerns, we offer design opportunities for creating more intelligent, transparent, community-oriented chatbots that better engage individuals and their communities.

Paperid: 2126, https://arxiv.org/pdf/2506.11932.pdf

Abstract:
This study evaluates a smartphone-based, deep-learning eye-tracking algorithm by comparing its performance against a commercial infrared-based eye tracker, the Tobii Pro Nano. The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions. Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were systematically analysed. The appearance-based algorithm integrates a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images. Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured using Euclidean distance. The deep learning model produced a mean error of 17.76 mm, compared to 16.53 mm for the Tobii Pro Nano. While overall accuracy differences were small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions among participants using glasses and in older age groups. Device-specific and positional factors also influenced tracking performance. These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.
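The accuracy measure described above (Euclidean distance between predicted and true gaze points) can be sketched as follows. This is a minimal illustration, assuming gaze points are given as (x, y) screen coordinates in millimetres; the function name and data layout are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def mean_euclidean_error(predicted: Sequence[Point], actual: Sequence[Point]) -> float:
    """Mean Euclidean distance between paired predicted and ground-truth gaze points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    distances = [
        math.hypot(px - ax, py - ay)  # straight-line distance for one sample
        for (px, py), (ax, ay) in zip(predicted, actual)
    ]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Illustrative coordinates only, not data from the study.
    predicted = [(0.0, 0.0), (3.0, 4.0)]
    actual = [(0.0, 0.0), (0.0, 0.0)]
    print(mean_euclidean_error(predicted, actual))
```

Applying the same function to both trackers' outputs would give directly comparable per-condition errors, which is the kind of comparison the abstract reports.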

Paperid: 2127, https://arxiv.org/pdf/2506.11727.pdf

Abstract:
This paper critically audits the search endpoint of YouTube's Data API (v3), a common tool for academic research. Through systematic weekly searches over six months using eleven queries, we identify major limitations regarding completeness, representativeness, consistency, and bias. Our findings reveal substantial differences between ranking parameters like relevance and date in terms of video recall and precision, with relevance often retrieving numerous off-topic videos. We also find severe temporal decay, as the number of findable videos for a specific period dramatically decreases after just 20-60 days from the publication date, potentially hampering many different research designs. Furthermore, search results lack consistency, with identical queries yielding different video sets over time, compromising replicability. A case study on the European Parliament elections highlights how these issues impact research outcomes. While the paper offers several mitigation strategies, it concludes that the API's search function, potentially prioritizing "freshness" over comprehensive retrieval, is not adequate for robust academic research, especially concerning Digital Services Act requirements.
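One way to quantify the consistency problem described above is to compare the sets of video IDs returned by the same query at two points in time. The sketch below uses the Jaccard index for this. The metric is an assumption for illustration (the abstract does not name the paper's measure), and the video IDs are invented placeholders.

```python
from typing import Iterable


def jaccard_overlap(run_a: Iterable[str], run_b: Iterable[str]) -> float:
    """Jaccard index of two result sets: |A ∩ B| / |A ∪ B| (1.0 means identical sets)."""
    set_a, set_b = set(run_a), set(run_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty result sets are treated as identical
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Placeholder IDs standing in for results of one query run in two different weeks.
    week_1 = ["vid_a", "vid_b", "vid_c"]
    week_2 = ["vid_b", "vid_c", "vid_d"]
    print(jaccard_overlap(week_1, week_2))
```

A value well below 1.0 for identical queries over the same publication window would indicate the kind of replicability problem the audit describes.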

Paperid: 2128, https://arxiv.org/pdf/2506.11718.pdf

Abstract:
As AI tools proliferate across domains, from chatbots and copilots to emerging agents, they increasingly support professional knowledge work. Yet despite their growing capabilities, these systems remain fragmented: they assist with isolated tasks but lack the architectural scaffolding for sustained, adaptive collaboration. We propose a layered framework for human-agent systems that integrates three interdependent dimensions: interaction, process, and infrastructure. Crucially, our architecture elevates process to a primary focus by making it explicit, inspectable, and adaptable, enabling humans and agents to align with evolving goals and coordinate over time. This model clarifies limitations of current tools, unifies emerging system design approaches, and reveals new opportunities for researchers and AI system builders. By grounding intelligent behavior in structured collaboration, we reimagine human-agent collaboration not as task-specific augmentation, but as a form of coherent and aligned system for real-world work.

Paperid: 2129, https://arxiv.org/pdf/2506.11610.pdf

Abstract:
Client-designer alignment is crucial to the success of design projects, yet little research has explored how digital technologies might influence this alignment. To address this gap, this paper presents a three-phase study investigating how digital systems can support requirements elicitation in professional design practice. Specifically, it examines how integrating a conversational agent and choice-based response formats into a digital elicitation tool affects early-stage client-designer collaboration. The first phase of the study inquired into the current practices of 10 design companies through semi-structured interviews, informing the system's design. The second phase evaluated the system using a 2x2 factorial design with 50 mock clients, quantifying the effects of conversational AI and response type on user experience and perceived preparedness. In phase three, the system was presented to seven of the original 10 companies to gather reflections on its value, limitations, and potential integration into practice. Findings show that both conversational AI and choice-based responses lead to lower dependability scores on the User Experience Questionnaire, yet result in client input with greater clarity. We contribute design implications for integrating conversational AI and choice-based responses into elicitation tools to support mutual understanding in early-stage client-designer collaboration.

Paperid: 2130, https://arxiv.org/pdf/2506.11366.pdf

Abstract:
Despite a growing need for spiritual care in the US, it is often under-served, inaccessible, or misunderstood, while almost no prior work in CSCW/HCI research has engaged with professional chaplains and spiritual care providers. This interdisciplinary study aims to develop a foundational understanding of how spiritual care may (or may not) be expanded into online spaces, especially focusing on anonymous, asynchronous, and text-based online communities. We conducted an exploratory mixed-methods study with chaplains (N=22) involving interviews and user testing sessions centered around Reddit support communities to understand participants' perspectives on technology and their ideations about the role of chaplaincy in prospective Online Spiritual Care Communities (OSCCs). Our Grounded Theory Method analysis highlighted benefits of OSCCs including: meeting patients where they are at; accessibility and scalability; and facilitating patient-initiated care. Chaplains highlighted how their presence in OSCCs could help with shaping peer interactions, moderation, synchronous chats for group care, and redirecting to external resources, while also raising important feasibility concerns, risks, and needs for future design and research. We used an existing taxonomy of chaplaincy techniques to show that some spiritual care strategies may be amenable to online spaces, yet we also exposed the limitations of technology to fully mediate spiritual care and the need to develop new online chaplaincy interventions. Based on these findings, we contribute the model of a "Care Loop" between institutionally-based formal care and platform-based community care to expand access and drive greater awareness and utilization of spiritual care. We also contribute design implications to guide future work in online spiritual care.

Paperid: 2131, https://arxiv.org/pdf/2506.11151.pdf

Abstract:
We consider the problem of recovering a mental target (e.g., an image of a face) that a participant has in mind from paired EEG (i.e., brain responses) and image (i.e., perceived faces) data collected during interactive sessions without access to labeled information. The problem has been previously explored with labeled data but not via self-calibration, where labeled data is unavailable. Here, we present the first framework and an algorithm, CURSOR, that learns to recover unknown mental targets without access to labeled data or pre-trained decoders. Our experiments on naturalistic images of faces demonstrate that CURSOR can (1) predict image similarity scores that correlate with human perceptual judgments without any label information, (2) use these scores to rank stimuli against an unknown mental target, and (3) generate new stimuli indistinguishable from the unknown mental target (validated via a user study, N=53).

Paperid: 2132, https://arxiv.org/pdf/2506.11092.pdf

Abstract:
Retrieval-Augmented Generation (RAG) has significantly advanced large language models (LLMs) by grounding their outputs in external tools and knowledge sources. However, existing RAG systems are typically constrained to static, single-turn interactions with fixed toolsets, making them ill-suited for dynamic domains such as healthcare and smart homes, where user intent, available tools, and contextual factors evolve over time. We present Dynamic Context Tuning (DCT), a lightweight framework that extends RAG to support multi-turn dialogue and evolving tool environments without requiring retraining. DCT integrates an attention-based context cache to track relevant past information, LoRA-based retrieval to dynamically select domain-specific tools, and efficient context compression to maintain inputs within LLM context limits. Experiments on both synthetic and real-world benchmarks show that DCT improves plan accuracy by 14% and reduces hallucinations by 37%, while matching GPT-4 performance at significantly lower cost. Furthermore, DCT generalizes to previously unseen tools, enabling scalable and adaptable AI assistants across a wide range of dynamic environments.

Paperid: 2133, https://arxiv.org/pdf/2506.10933.pdf

Abstract:
Steady-state visual evoked potential (SSVEP)-based brain-computer interfaces (BCIs) can achieve high recognition accuracy with sufficient training data. Transfer learning presents a promising solution to alleviate data requirements for the target subject by leveraging data from source subjects; however, effectively addressing individual variability among both target and source subjects remains a challenge. This paper proposes a novel transfer learning framework, termed instance-based task-related component analysis (iTRCA), which leverages knowledge from source subjects while considering their individual contributions. iTRCA extracts two types of features: (1) the subject-general feature, capturing shared information between source and target subjects in a common latent space, and (2) the subject-specific feature, preserving the unique characteristics of the target subject. To mitigate negative transfer, we further design an enhanced framework, subject selection-based iTRCA (SS-iTRCA), which integrates a similarity-based subject selection strategy to identify appropriate source subjects for transfer based on their task-related components (TRCs). Comparative evaluations on the Benchmark, BETA, and a self-collected dataset demonstrate the effectiveness of the proposed iTRCA and SS-iTRCA frameworks. This study provides a potential solution for developing high-performance SSVEP-based BCIs with reduced target subject data.

Paperid: 2134, https://arxiv.org/pdf/2506.10932.pdf

Abstract:
Individuals with schizophrenia frequently experience intense emotions and often turn to vlogging as a medium for emotional expression. While previous research has predominantly focused on text-based disclosure, little is known about how individuals construct narratives around emotions and emotional experiences in video blogs. Our study addresses this gap by analyzing 200 YouTube videos created by individuals with schizophrenia. Drawing on media research and self-presentation theories, we developed a visual analysis framework to disentangle these videos. Our analysis revealed diverse practices of emotion disclosure through both verbal and visual channels, highlighting the dynamic interplay between these modes of expression. We found that the deliberate construction of visual elements, including environmental settings and specific aesthetic choices, appears to foster more supportive and engaged viewer responses. These findings underscore the need for future large-scale quantitative research examining how visual features shape video-mediated communication on social media platforms. Such investigations would inform the development of care-centered video-sharing platforms that better support individuals managing illness experiences.

Paperid: 2135, https://arxiv.org/pdf/2506.10272.pdf

Abstract:
This position paper argues that there is an urgent need to restructure markets for the information that goes into AI systems. Specifically, producers of information goods (such as journalists, researchers, and creative professionals) need to be able to collectively bargain with AI product builders in order to receive reasonable terms and a sustainable return on the informational value they contribute. We argue that without increased market coordination or collective bargaining on the side of these primary information producers, AI will exacerbate a large-scale "information market failure" that will lead not only to undesirable concentration of capital, but also to a potential "ecological collapse" in the informational commons. On the other hand, collective bargaining in the information economy can create market frictions and aligned incentives necessary for a pro-social, sustainable AI future. We provide concrete actions that can be taken to support a coalition-based approach to achieve this goal. For example, researchers and developers can establish technical mechanisms such as federated data management tools and explainable data value estimations, to inform and facilitate collective bargaining in the information economy. Additionally, regulatory and policy interventions may be introduced to support trusted data intermediary organizations representing guilds or syndicates of information producers.

Paperid: 2136, https://arxiv.org/pdf/2506.10229.pdf

Abstract:
Key decision-makers in small and medium businesses (SMBs) often lack the awareness and knowledge to implement cybersecurity measures effectively. To gain a deeper understanding of how SMB executives navigate cybersecurity decision-making, we deployed a mixed-method approach, conducting semi-structured interviews (n=21) and online surveys (n=322) with SMB key decision-makers. Using thematic analysis, we revealed SMB decision-makers' perceived risks in terms of the digital assets they valued, and found reasons for their choice of defense measures and factors impacting security perception. We employed the situational awareness model to characterize decision-makers based on cybersecurity awareness, identifying those who have comparatively low awareness in the fight against adversaries. We further explored the relationship between awareness and business attributes, and constructed a holistic structural equation model to understand how awareness can be improved. Finally, we proposed interventions to help SMBs overcome potential challenges.

Paperid: 2138, https://arxiv.org/pdf/2506.09968.pdf

Abstract:
Self-regulated learning (SRL) is crucial for college students navigating increased academic demands and independence. Insufficient SRL skills can lead to disorganized study habits, low motivation, and poor time management, undermining learners' ability to thrive in challenging environments. Through a formative study involving 59 college students, we identified key challenges students face in developing SRL skills, including difficulties with goal-setting, time management, and reflective learning. To address these challenges, we introduce SRLAgent, an LLM-assisted system that fosters SRL skills through gamification and adaptive support from large language models (LLMs). Grounded in Zimmerman's three-phase SRL framework, SRLAgent enables students to engage in goal-setting, strategy execution, and self-reflection within an interactive game-based environment. The system offers real-time feedback and scaffolding powered by LLMs to support students' independent study efforts. We evaluated SRLAgent using a between-subjects design, comparing it to a baseline system (SRL without Agent features) and a traditional multimedia learning condition. Results showed significant improvements in SRL skills within the SRLAgent group (p < .001, Cohen's d = 0.234) and higher engagement compared to the baselines. This work highlights the value of embedding SRL scaffolding and real-time AI support within gamified environments, offering design implications for educational technologies that aim to promote deeper learning and metacognitive skill development.
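The effect size reported above is Cohen's d. The sketch below shows one standard way to compute it, as the difference in means divided by the pooled standard deviation of two independent groups. The abstract does not say which variant the authors used (a within-group, pre/post variant is also plausible), so this is an assumption, and the scores are invented for illustration.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent groups, using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


if __name__ == "__main__":
    # Invented scores, not data from the study.
    treatment = [2.0, 4.0, 6.0]
    control = [1.0, 3.0, 5.0]
    print(cohens_d(treatment, control))
```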

Paperid: 2139, https://arxiv.org/pdf/2506.09220.pdf

Abstract:
This paper examines the gap between the promises and real-world performance of emerging AI personal assistants. Drawing on interviews with early adopters of devices like Rabbit R1 and Humane AI Pin, as well as services like Ohai and Docus, we map user experiences through the lens of Uses and Gratifications and Uncertainty Reduction Theory. We identify three core types of user uncertainty (functional, interactional, and social) and explore how each disrupts different user gratifications. We show that while marketing hype fuels initial adoption, unmet expectations often result in frustration or abandonment. Our findings highlight the importance of transparency, task-specific design, and user control over contextual memory and personalization. We provide design and policy recommendations, including user-facing explainability tools and calls for regulatory benchmarks such as CI Bench, to guide ethical and interpretable AI integration. Our study offers actionable insights for creating more usable, trustworthy, and socially aligned AI assistants.

Paperid: 2140, https://arxiv.org/pdf/2506.09216.pdf

Abstract:
Spreadsheet collaboration provides valuable opportunities for learning and expertise sharing between colleagues. Sharing expertise is essential for the retention of important technical skillsets within organisations, but previous studies suggest that spreadsheet experts often fail to disseminate their knowledge to others. We suggest that social norms and beliefs surrounding the value of spreadsheet use significantly influence user engagement in sharing behaviours. To explore this, we conducted 31 semi-structured interviews with professional spreadsheet users from two separate samples. We found that spreadsheet providers face challenges in adapting highly personalised strategies to often subjective standards and evaluating the appropriate social timing of sharing. In addition, conflicted self-evaluations of one's spreadsheet expertise, dismissive normative beliefs about the value of this knowledge, and concerns about the potential disruptions associated with collaboration can further deter sharing. We suggest these observations reflect the challenges of long-term learning in feature-rich software designed primarily with initial learnability in mind. We therefore provide implications for design to navigate this tension. Overall, our findings demonstrate how the complex interaction between technology design and social dynamics can shape collaborative learning behaviours in the context of feature-rich software.

Paperid: 2141, https://arxiv.org/pdf/2506.07393.pdf

Abstract:
A four-leaf clover (FLC) symbolizes luck and happiness worldwide, but it is hard to distinguish it from the common three-leaf clover. While AI technology can assist in searching for FLC, it may not replicate the traditional search's sense of achievement. This study explores searcher feelings when AI aids the FLC search. In this study, we developed a system called "HappinessFinder" that uses object detection algorithms on smartphones or tablets to support the search. We exhibited HappinessFinder at an international workshop, allowing participants to experience four-leaf clover searching using potted artificial clovers and the HappinessFinder app. This paper reports the findings from this demonstration.

Paperid: 2142, https://arxiv.org/pdf/2506.04852.pdf

Abstract:
AI music generation has advanced rapidly, with models like diffusion and autoregressive algorithms enabling high-fidelity outputs. These tools can alter styles, mix instruments, or isolate them. Since sound can be visualized as spectrograms, image-generation algorithms can be applied to generate novel music. However, these algorithms are typically trained on fixed datasets, which makes it challenging for them to interpret and respond to user input accurately. This is especially problematic because music is highly subjective and requires a level of personalization that image generation does not provide. In this work, we propose a human-computation approach to gradually improve the performance of these algorithms based on user interactions. The human-computation element involves aggregating and selecting user ratings to use as the loss function for fine-tuning the model. We employ a genetic algorithm that incorporates user feedback to enhance the baseline performance of a model initially trained on a fixed dataset. The effectiveness of this approach is measured by the average increase in user ratings with each iteration. In the pilot test, the first iteration showed an average rating increase of 0.2 compared to the baseline. The second iteration further improved upon this, achieving an additional increase of 0.39 over the first iteration.
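The feedback loop described above can be sketched roughly as follows. This is only an illustration of a genetic algorithm whose fitness is an aggregated rating; it is not the authors' implementation. The three-value candidate vector, the `TARGET` "listener taste", the simulated raters in `mean_rating`, and all numeric settings are hypothetical stand-ins for real model parameters and real human ratings.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical "listener taste"; stands in for real human raters.
TARGET = [0.2, 0.8, 0.5]

def mean_rating(candidate, n_raters=5):
    """Simulated aggregate rating in [0, 5]: candidates closer to TARGET score higher."""
    dist = sum((c - t) ** 2 for c, t in zip(candidate, TARGET)) ** 0.5
    ratings = [
        max(0.0, min(5.0, 5.0 - 5.0 * dist + rng.gauss(0, 0.1)))
        for _ in range(n_raters)
    ]
    return sum(ratings) / n_raters

def evolve(generations=10, pop_size=8, elite=2):
    """Return the best aggregated rating per generation (index 0 is the initial population)."""
    pop = [[rng.random() for _ in TARGET] for _ in range(pop_size)]
    scored = sorted(((mean_rating(p), p) for p in pop), reverse=True)
    history = [scored[0][0]]
    for _ in range(generations):
        # Selection: the top-rated half of the population become parents.
        parents = [p for _, p in scored[:pop_size // 2]]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.05))) for g in child]  # mutation
            children.append(child)
        # Elitism: the best candidates survive together with their recorded ratings.
        scored = sorted(scored[:elite] + [(mean_rating(c), c) for c in children], reverse=True)
        history.append(scored[0][0])
    return history

history = evolve()
print(history[0], history[-1])
```

Because the elite candidates keep their recorded ratings, the best rating in `history` never decreases from one generation to the next. With real listeners, each call to `mean_rating` would correspond to a round of collecting and aggregating ratings.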

Paperid: 2143, https://arxiv.org/pdf/2506.04260.pdf

Abstract:
Practitioners building online services and tools often turn to online forums such as Reddit, Law Stack Exchange, and Stack Overflow for legal guidance to ensure compliance with the GDPR. The legal information presented in these forums directly impacts present-day industry practitioners' decisions. Online forums can serve as gateways that, depending on the accuracy and quality of the answers provided, may either support or undermine the protection of privacy and data protection fundamental rights. However, there is a need for deeper investigation into practitioners' decision-making processes and their understanding of legal compliance when seeking legal information online. Using GDPR's "legitimate interests" legal ground for processing personal data as a case study, we investigate how practitioners use online forums to identify common areas of confusion in applying legitimate interests in practice, and evaluate how legally sound online forum responses are. Our analysis found that applying the legal basis of legitimate interest is complex for practitioners, with important implications for how the GDPR is implemented in practice. The legal analysis showed that crowdsourced legal information tends to be legally sound, though sometimes incomplete. We outline recommendations to improve the quality of online forums by ensuring that responses are more legally sound and comprehensive, enabling practitioners to apply legitimate interests effectively in practice and uphold the GDPR.

Paperid: 2144, https://arxiv.org/pdf/2506.03741.pdf

Abstract:
We introduce PromptCanvas, a concept that transforms prompting into a composable, widget-based experience on an infinite canvas. Users can generate, customize, and arrange interactive widgets representing various facets of their text, offering greater control over AI-generated content. PromptCanvas allows widget creation through system suggestions, user prompts, or manual input, providing a flexible environment tailored to individual needs. This enables deeper engagement with the creative process. In a lab study with 18 participants, PromptCanvas outperformed a traditional conversational UI on the Creativity Support Index. Participants found that it reduced cognitive load, with lower mental demand and frustration. Qualitative feedback revealed that the visual organization of thoughts and easy iteration encouraged new perspectives and ideas. A follow-up field study (N=10) confirmed these results, showcasing the potential of dynamic, customizable interfaces in improving collaborative writing with AI.

Paperid: 2145, https://arxiv.org/pdf/2506.03720.pdf

Abstract:
The use of applications on computers, smartphones, and tablets has been considerably simplified thanks to interactive and dynamic graphical interfaces coupled with the mouse and touch screens. It is no longer necessary to be a computer specialist to use them. Paradoxically, the development of computer programs generally requires writing lines of code in a programming language whose syntax is particularly strict. This process poses many difficulties for programmers. We propose an original tool in which arbitrary programs (Turing-complete) can be developed in a completely visual manner by direct manipulation of the data, without writing a line of code. The user can thus develop an algorithm by directly visualizing the result of actions taken on the data. A method for constructing iterations is associated with the tool. It proposes to create each part, including the loop body, in a non-linear manner under visual control of the state of the data. In addition, the tool supports the production of lines of code in several languages, including Python, C, and Java, that correspond to the actions performed. In this article, we present the tool, the design choices, the problems to be solved, and the limits and the contributions of the direct-data-manipulation approach.

Paperid: 2146, https://arxiv.org/pdf/2506.03520.pdf

Abstract:
Many people struggle with social anxiety, feeling fearful or even physically uncomfortable in social situations such as talking to strangers. Exposure therapy, a clinical method that gradually and repeatedly exposes individuals to the source of their fear and helps them build coping mechanisms, can reduce social anxiety but traditionally requires human therapists' guidance and the construction of situations. In this paper, we developed a multi-agent system, VChatter, to explore large language model (LLM)-based conversational agents for simulating exposure therapy with users. Based on a survey study (N=36) and an expert interview, VChatter includes an Agent-P, which acts as a psychotherapist to design the exposure therapy plans for users, and two Agent-Hs, which can take on different interactive roles in low, medium, and high exposure scenarios. A six-day qualitative study (N=10) showcases VChatter's usefulness in reducing users' social anxiety, feelings of isolation, and avoidance of social interactions. We demonstrated the feasibility of using LLM-based conversational agents to simulate exposure therapy for addressing social anxiety and discussed future concerns for designing agents tailored to social anxiety.

Paperid: 2147, https://arxiv.org/pdf/2506.01395.pdf

Abstract:
Journaling has long been recognized for fostering emotional awareness and self-reflection, and recent advancements in generative AI offer new opportunities to create personalized music that can enhance these practices. In this study, we explore how AI-generated music can augment the journaling experience. Through a formative study, we examined journal writers' writing patterns, purposes, emotional regulation strategies, and the design requirements for a system that augments the journaling experience with journal-based AI-generated music. Based on these insights, we developed NoRe, a system that transforms journal entries into personalized music using generative AI. In a seven-day in-the-wild study (N=15), we investigated user engagement and perceived emotional effectiveness through system logs, surveys, and interviews. Our findings suggest that journal-based music generation could support emotional reflection and provide vivid reminiscence of daily experiences. Drawing from these findings, we discuss design implications for tailoring music to journal writers' emotional states and preferences.

Paperid: 2148, https://arxiv.org/pdf/2506.00376.pdf

Abstract:
Grandparent-grandchild bonds are crucial for both parties. Many immigrant families are geographically dispersed, and the grandparents and grandchildren need to rely on remote communication to maintain their relationships. In addition to geographical separation, grandparents and grandchildren in such families also face language and culture barriers during remote communication. The associated challenges and needs remain understudied as existing research primarily focuses on non-immigrant families or co-located immigrant families. To address this gap, we conducted interviews with six Chinese immigrant families in Canada. Our findings highlight unique challenges faced by immigrant families during remote communication, such as amplified language and cultural barriers due to geographic separation, and provide insights into how technology can better support remote communication. This work offers empirical knowledge about the communication needs of distributed immigrant families and provides directions for future research and design to support grandparent-grandchild remote communication in these families.

Paperid: 2149, https://arxiv.org/pdf/2506.00220.pdf

Abstract:
The rapid growth of AI in robotics has amplified the need for high-quality, reusable datasets, particularly in human-robot interaction (HRI) and AI-embedded robotics. While more robotics datasets are being created, the landscape of open data in the field is uneven. This is due to a lack of curation standards and consistent publication practices, which makes it difficult to discover, access, and reuse robotics data. To address these challenges, this paper presents a curation and access system with two main contributions: (1) a structured methodology to curate, publish, and integrate FAIR (Findable, Accessible, Interoperable, Reusable) human-centered robotics datasets; and (2) a ChatGPT-powered conversational interface trained with the curated datasets' metadata and documentation to enable exploration and comparison of robotics datasets, as well as data retrieval, using natural language. Developed based on practical experience curating datasets from robotics labs within Texas Robotics at the University of Texas at Austin, the system demonstrates the value of standardized curation and persistent publication of robotics data. The system's evaluation suggests that access and understandability of human-robotics data are significantly improved. This work directly aligns with the goals of the HCRL @ ICRA 2025 workshop and represents a step towards more human-centered access to data for embodied AI.

Paperid: 2150, https://arxiv.org/pdf/2506.00094.pdf

Abstract:
This paper explores the emotional, ethical and practical dimensions of integrating Artificial Intelligence (AI) into personal and professional workflows, focusing on the concept of feeling guilty as a 'c(ai)borg' - a human augmented by AI. Inspired by Donna Haraway's Cyborg Manifesto, the study explores how AI challenges traditional notions of creativity, originality and intellectual labour. Using an autoethnographic approach, the authors reflect on their year-long experiences with AI tools, revealing a transition from initial guilt and reluctance to empowerment through skill-building and transparency. Key findings highlight the importance of basic academic skills, advanced AI literacy and honest engagement with AI results. The c(ai)borg vision advocates for a future where AI is openly embraced as a collaborative partner, fostering innovation and equity while addressing issues of access and agency. By reframing guilt as growth, the paper calls for a thoughtful and inclusive approach to AI integration.

Paperid: 2151, https://arxiv.org/pdf/2506.00081.pdf

Abstract:
Many people suffer from mental health problems but not everyone seeks professional help or has access to mental health care. AI chatbots have increasingly become a go-to for individuals who either have mental disorders or simply want someone to talk to. This paper presents a study on participants who have previously used chatbots and a scenario-based testing of large language model (LLM) chatbots. Our findings indicate that AI chatbots were primarily utilized as a "Five minute therapist" or as a non-judgmental companion. Participants appreciated the anonymity and lack of judgment from chatbots. However, there were concerns about privacy and the security of sensitive information. The scenario-based testing of LLM chatbots highlighted additional issues. Some chatbots were consistently reassuring, used emojis and names to add a personal touch, and were quick to suggest seeking professional help. However, there were limitations such as inconsistent tone, occasional inappropriate responses (e.g., casual or romantic), and a lack of crisis sensitivity, particularly in recognizing red flag language and escalating responses appropriately. These findings can inform both the technology and mental health care industries on how to better utilize AI chatbots to support individuals during challenging emotional periods.

Paperid: 2152, https://arxiv.org/pdf/2506.00028.pdf

Abstract:
By analyzing the gaze trajectories of people viewing screens and advertisements, we can determine what people are interested in. This knowledge can be effective when recommending commercial products and services, and when improving advertisement design. Therefore, analysis and visualization of eye gaze have been an active research topic. This paper proposes a new method for visualizing patterns of the gaze trajectories of multiple people by (1) visualizing patterns that move through multiple areas of interest (AOI) and (2) visualizing differences among multiple gaze trajectories. The method first constructs a hierarchical AOI structure for a Web page or an image, and uses this structure to convert the trajectory into a sequence of symbols. We apply N-grams to the generated symbol sequences to extract transition patterns between AOIs. Finally, the method visualizes a list of the pattern extraction results and the shapes of the characteristic elements. We present the visualization of gaze trajectories for three examples of stimuli, and argue that analysts can efficiently discover trends in gaze transitions between text and figures, as well as differences between participants of the eye-tracking experiments.
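As a rough illustration of the symbol-sequence step, the sketch below maps gaze points to AOI labels and counts N-gram transition patterns. The flat AOI rectangles, the `to_symbols` and `ngram_counts` helpers, and the sample coordinates are hypothetical and are not taken from the paper, which uses a hierarchical AOI structure.

```python
from collections import Counter

# Hypothetical flat AOIs: label -> (x_min, y_min, x_max, y_max).
# The paper builds a hierarchical AOI structure; a flat one keeps the sketch short.
AOIS = {
    "T": (0, 0, 100, 50),    # e.g. a text block
    "F": (0, 50, 100, 100),  # e.g. a figure
}

def to_symbols(points, aois=AOIS):
    """Map gaze points to AOI labels.

    Points outside every AOI are dropped, and consecutive repeats are
    collapsed so the sequence records transitions rather than dwell time.
    """
    symbols = []
    for x, y in points:
        label = next(
            (name for name, (x0, y0, x1, y1) in aois.items()
             if x0 <= x < x1 and y0 <= y < y1),
            None,
        )
        if label is not None and (not symbols or symbols[-1] != label):
            symbols.append(label)
    return symbols

def ngram_counts(symbols, n=2):
    """Count every contiguous run of n symbols in the sequence."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

# Made-up gaze samples for one participant.
gaze = [(10, 10), (20, 15), (30, 70), (40, 80), (50, 20), (60, 90)]
seq = to_symbols(gaze)
bigrams = ngram_counts(seq, 2)
print(seq)
print(bigrams.most_common())
```

Comparing the resulting counters across participants is one simple way to surface differences between trajectories. Whether to collapse repeated symbols is a design choice made here for brevity, not something the abstract specifies.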

Paperid: 2153, https://arxiv.org/pdf/2505.24102.pdf

Abstract:
Despite the recognized benefits of visual analytics systems in supporting data-driven decision-making, their deployment in real-world civic contexts often faces significant barriers. Beyond technical challenges such as resource constraints and development complexity, sociotechnical factors, including organizational hierarchies, misalignment between designers and stakeholders, and concerns around technology adoption hinder their sustained use. In this work, we reflect on our collective experiences of designing, developing, and deploying visual analytics systems in the civic domain and discuss challenges across design and adoption aspects. We emphasize the need for deeper integration strategies, equitable stakeholder engagement, and sustainable implementation frameworks to bridge the gap between research and practice.

Paperid: 2154, https://arxiv.org/pdf/2505.23984.pdf

Abstract:
Pelvic bone tumor resections remain significantly challenging due to complex three-dimensional anatomy and limited surgical visualization. Current navigation systems and patient-specific instruments, while accurate, present limitations including high costs, radiation exposure, workflow disruption, long production time, and lack of reusability. This study evaluates a real-time vision-guided surgical system combined with modular jigs to improve accuracy in pelvic bone tumor resections. A vision-guided surgical system combined with modular cutting jigs and real-time optical tracking was developed and validated. Five female pelvis sawbones were used, with each hemipelvis randomly assigned to either the vision-guided and modular jig system or traditional freehand method. A total of twenty resection planes were analyzed for each method. Accuracy was assessed by measuring distance and angular deviations from the planned resection planes. The vision-guided and modular jig system significantly improved resection accuracy compared to the freehand method, reducing the mean distance deviation from 2.07 $\pm$ 1.71 mm to 1.01 $\pm$ 0.78 mm (p=0.0193). In particular, all specimens resected using the vision-guided system exhibited errors of less than 3 mm. Angular deviations also showed significant improvements with roll angle deviation reduced from 15.36 $\pm$ 17.57$^\circ$ to 4.21 $\pm$ 3.46$^\circ$ (p=0.0275), and pitch angle deviation decreased from 6.17 $\pm$ 4.58$^\circ$ to 1.84 $\pm$ 1.48$^\circ$ (p<0.001). The proposed vision-guided and modular jig system significantly improves the accuracy of pelvic bone tumor resections while maintaining workflow efficiency. This cost-effective solution provides real-time guidance without the need for referencing external monitors, potentially improving surgical outcomes in complex pelvic bone tumor cases.

Paperid: 2155, https://arxiv.org/pdf/2505.23733.pdf

Abstract:
In recent years, the rapid advancement and democratization of generative AI models have sparked significant debate over safety, ethical risks, and dual-use concerns, particularly in the context of cybersecurity. While the link is known anecdotally, this paper provides empirical evidence regarding generative AI's association with malicious internet-related activities and cybercrime by examining the phenomenon through psychological frameworks of technological amplification and affordance theory. Using a quasi-experimental design with interrupted time series analysis, we analyze two datasets, one general and one cryptocurrency-focused, to empirically assess generative AI's role in cybercrime. The findings contribute to ongoing discussions about AI governance by balancing control and fostering innovation, underscoring the need for strategies to guide policymakers, inform AI developers and cybersecurity professionals, and educate the public to maximize AI's benefits while mitigating its risks.
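For readers unfamiliar with interrupted time series analysis, a common formulation is segmented regression, y_t = b0 + b1*t + b2*D_t + b3*(t - t0)*D_t, where D_t is 1 from the intervention time t0 onward. The sketch below fits that model by ordinary least squares on a synthetic, noise-free series. The abstract does not state the model specification the authors used, so the model form, the helper names `solve` and `fit_its`, and the data are all assumptions. The sketch also ignores autocorrelation and seasonality, which a real analysis would need to address.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_its(y, t0):
    """OLS fit of y_t = b0 + b1*t + b2*D_t + b3*(t - t0)*D_t, with D_t = 1 for t >= t0."""
    rows = []
    for t in range(len(y)):
        d = 1.0 if t >= t0 else 0.0
        rows.append([1.0, float(t), d, d * (t - t0)])
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic series: baseline 10 + 0.5*t, then a level jump of 3 and an
# extra slope of 1 per step starting at the hypothetical intervention t0 = 6.
t0 = 6
y = [10 + 0.5 * t + (3 + 1.0 * (t - t0) if t >= t0 else 0.0) for t in range(12)]
b0, b1, b2, b3 = fit_its(y, t0)
print(b0, b1, b2, b3)
```

Here b2 estimates the immediate level change at the intervention and b3 the change in trend afterwards. On this noise-free series the fit recovers the generating coefficients up to floating-point error.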

Paperid: 2156, https://arxiv.org/pdf/2505.23516.pdf

Abstract:
We present the CASE framework, an open-source platform for adaptive, context-aware participatory research, and pandemic preparedness. CASE implements an event-driven architecture that enables dynamic survey workflows, allowing real-time adaptation based on participant responses, external data, temporal conditions, and evolving user states. The framework supports a broad range of research needs, from simple one-time questionnaires to complex longitudinal studies with advanced conditional logic. Built on over a decade of practical experience, CASE underwent a major architectural rework in 2024, transitioning from a microservice-based design to a streamlined monolithic architecture. This evolution significantly improved maintainability, flexibility, and accessibility to deployment, particularly for institutions with limited technical capacity. CASE has been successfully deployed across diverse domains, powering national disease surveillance platforms, supporting post-COVID cohort studies, and enabling real-time sentiment analysis during political events. These applications, involving tens of thousands of participants, demonstrate the framework's scalability, versatility, and practical value. This paper describes the foundations of CASE, details its architectural evolution, and presents lessons learned from real-world deployments. We establish CASE as a mature and reusable research infrastructure that balances sophisticated functionality with practical implementation, addressing the critical global need for sustainable and institutionally controlled data collection systems.

Paperid: 2157, https://arxiv.org/pdf/2505.23447.pdf

Abstract:
This paper contributes a set of quality metrics for identification and visual analysis of structured missingness in high-dimensional data. Missing values in data are a frequent challenge in most data generating domains and may cause a range of analysis issues. Structural missingness in data may indicate issues in data collection and pre-processing, but may also highlight important data characteristics. While research into statistical methods for dealing with missing data has mainly focused on replacing missing values with plausible estimated values, visualization has great potential to support a more in-depth understanding of missingness structures in data. Nonetheless, while the interest in missing data visualization has increased in the last decade, it is still a relatively overlooked research topic with a comparably small number of publications, few of which address scalability issues. Efficient visual analysis approaches are needed to enable exploration of missingness structures in large and high-dimensional data, and to support informed decision-making in the context of potential data quality issues. This paper suggests a set of quality metrics for identification of patterns of interest for understanding of structural missingness in data. These quality metrics can be used as guidance in visual analysis, as demonstrated through a use case exploring structural missingness in data from a real-life walking monitoring study. All supplemental materials for this paper are available at https://doi.org/10.25405/data.ncl.c.7741829.

Paperid: 2158, https://arxiv.org/pdf/2505.22303.pdf

Abstract:
In this study, we propose a solution based on a multi-agent LLM architecture and a voice user interface (VUI) designed to update the knowledge base of a digital assistant. Its usability is evaluated in comparison to a more traditional graphical content management system (CMS), with a focus on understanding the relationship between user preferences and the complexity of the information being provided. The findings demonstrate that, while the overall usability of the VUI is rated lower than the graphical interface, it is already preferred by users for less complex tasks. Furthermore, the quality of content entered through the VUI is comparable to that achieved with the graphical interface, even for highly complex tasks. The qualitative results suggest that a hybrid interface combining the strengths of both approaches could address the key challenges identified during the experiment, such as reducing cognitive load through graphical feedback while maintaining the intuitive nature of voice-based interactions. This work highlights the potential of conversational interfaces as a viable and effective method for knowledge management in specific business contexts.

Paperid: 2159, https://arxiv.org/pdf/2505.20701.pdf

Abstract:
Cloud architecture design is a complex process requiring both technical expertise and architectural knowledge to develop solutions from frequently ambiguous requirements. We present CloudArchitectBuddy, a system-driven cloud architecture design support application with two key mechanisms: (1) structured state management that enhances design understanding through explicit representation of requirements and architectural decisions, and (2) guided decision assistance that facilitates design progress through proactive verification and requirement refinement. Our study with 16 industry practitioners showed that while our approach achieved comparable design quality to a chat interface, participants rated our system higher for usability and appreciated its ability to help understand architectural relationships and identify missing requirements. However, participants also expressed a need for user-initiated interactions where they could freely provide design instructions and engage in detailed discussions with LLMs. These results suggest that integrating a chat interface into our structured and guided workflow approach would create a more practical solution, balancing systematic design support with conversational flexibility for comprehensive cloud architecture development.

Paperid: 2160, https://arxiv.org/pdf/2505.20584.pdf

Abstract:
Mpox (formerly monkeypox) is a zoonotic disease caused by an orthopoxvirus closely related to variola and remains a significant global public health concern. During outbreaks, social media platforms like X (formerly Twitter) can both inform and misinform the public, complicating efforts to convey accurate health information. To support local response efforts, we developed a researcher-focused dashboard for use by public health stakeholders and the public that enables searching and visualizing mpox-related tweets through an interactive interface. Following the CDC's designation of mpox as an emerging virus in August 2024, our dashboard recorded a marked increase in tweet volume compared to 2023, illustrating the rapid spread of health discourse across digital platforms. These findings underscore the continued need for real-time social media monitoring tools to support public health communication and track evolving sentiment and misinformation trends at the local level.

Paperid: 2161, https://arxiv.org/pdf/2505.20311.pdf

Abstract:
Explainable AI (XAI) is a promising solution to ensure compliance with the EU AI Act, the first multi-national regulation for AI. XAI aims to enhance transparency and human oversight of AI systems, particularly "black-box models", which are criticized as incomprehensible. However, the discourse around the main stakeholders in the AI Act and XAI appears disconnected. While XAI prioritizes the end user's needs as the primary goal, the AI Act focuses on the obligations of the provider and deployer of the AI system. We aim to bridge this divide and provide guidance on how these two worlds are related. By fostering an interdisciplinary discussion in a cross-functional team with XAI, AI Act, legal, and requirements engineering experts, we walk through the steps necessary to analyze an AI-based clinical decision support system to clarify the end-user needs and assess AI Act applicability. By analyzing our justified understanding using an AI system under development as a case, we show that XAI techniques can fill a gap between stakeholder needs and the requirements of the AI Act. We look at the similarities and contrasts between the legal requirements and the needs of stakeholders. In doing so, we encourage researchers and practitioners from the XAI community to reflect on their role towards the AI Act by achieving a mutual understanding of the implications of XAI and the AI Act within different disciplines.

Paperid: 2162, https://arxiv.org/pdf/2505.18814.pdf

Abstract:
As electronic signatures (e-signatures) become increasingly integral to secure digital transactions, understanding their usability and security perception from an end-user perspective has become crucial. This study empirically evaluates and compares two major e-signature systems (token-based and remote signatures) through a controlled user experience study with 20 participants. Participants completed tasks involving acquisition, installation, and document signing using both methods, followed by structured surveys and qualitative feedback. Statistical analyses revealed that remote e-signatures were perceived as significantly more usable than token-based ones, due to their minimal setup and platform-independent accessibility. In contrast, token-based signatures were rated as significantly more secure, highlighting users' trust in hardware-based protection. Although more participants preferred remote e-signatures for document signing, the preference did not reach statistical significance, indicating a trend toward favoring convenience in real-world scenarios. These findings underline the fundamental trade-off between usability and perceived security in digital signing systems. By bridging the gap between theoretical frameworks and real user experience, this study contributes valuable insights to the design and policymaking of qualified electronic signature solutions.

Paperid: 2163, https://arxiv.org/pdf/2505.18553.pdf

Abstract:
The rapid advancement of Large Language Models (LLMs) has resulted in interest in their potential applications within manufacturing systems, particularly in the context of Industry 5.0. However, determining when to implement LLMs versus other Natural Language Processing (NLP) techniques, ontologies, or knowledge graphs remains an open question. This paper offers decision-making guidance for selecting the most suitable technique in various industrial contexts, emphasizing human-robot collaboration and resilience in manufacturing. We examine the origins and unique strengths of LLMs, ontologies, and knowledge graphs, assessing their effectiveness across different industrial scenarios based on the number of domains or disciplines required to bring a product from design to manufacture. Through this comparative framework, we explore specific use cases where LLMs could enhance robotics for human-robot collaboration, while underscoring the continued relevance of ontologies and knowledge graphs in low-dependency or resource-constrained sectors. Additionally, we address the practical challenges of deploying these technologies, such as computational cost and interpretability, providing a roadmap for manufacturers to navigate the evolving landscape of language-based AI tools in Industry 5.0. Our findings offer a foundation for informed decision-making, helping industry professionals optimize the use of language-based models for sustainable, resilient, and human-centric manufacturing. We also propose a Large Knowledge Language Model architecture that offers the potential for transparency and configuration based on the complexity of the task and the computing resources available.

Paperid: 2164, https://arxiv.org/pdf/2505.18112.pdf

Abstract:
Performance artforms like Peking opera face transmission challenges due to the extensive passive listening required to understand their nuance. To create engaging forms of experiencing auditory Intangible Cultural Heritage (ICH), we designed a spatial interaction-based segmented-audio (SISA) Virtual Reality system that transforms passive ICH experiences into active ones. We undertook: (1) a co-design workshop with seven stakeholders to establish design requirements, (2) prototyping with five participants to validate design elements, and (3) user testing with 16 participants exploring Peking opera. We designed transformations of temporal music into spatial interactions by cutting sounds into short audio segments and applying the t-SNE algorithm to cluster the segments spatially. Users navigate through these sounds according to the similarity of their audio properties. Analysis revealed two distinct interaction patterns (Progressive and Adaptive), and demonstrated SISA's efficacy in facilitating active auditory ICH engagement. Our work illuminates the design process for enriching traditional performance artforms using spatially-tuned forms of listening.

Paperid: 2165, https://arxiv.org/pdf/2505.17208.pdf

Abstract:
Rapid changes in social networks have transformed the way people express themselves, turning past neologisms, values, and mindsets embedded in these expressions into online heritage. How can we preserve these expressions as cultural heritage? Instead of traditional archiving methods for static material, we designed an interactive and experiential form of archiving for Chinese social networks. Using dialogue data from 2000-2010 on early Chinese social media, we developed a GPT-driven agent within a retro chat interface, emulating the language and expression style of the period for interaction. Results from a qualitative study with 18 participants show that the design captures the past chatting experience and evokes memory flashbacks and feelings of nostalgia through conversation. Participants, particularly those familiar with the era, adapted their language to match the agent's chatting style. This study explores how the design of preservation methods for digital experiences can be informed by experiential representations supported by generative tools.

Paperid: 2166, https://arxiv.org/pdf/2505.17055.pdf

Abstract:
The study aims to enhance mathematics education accessibility for hard-of-hearing students by developing an accurate Palestinian Sign Language (PSL) recognition system using advanced artificial intelligence techniques. Due to the scarcity of digital resources for PSL, a custom dataset comprising 41 mathematical gesture classes was created and recorded by PSL experts to ensure linguistic accuracy and domain specificity. To leverage state-of-the-art computer vision techniques, a Vision Transformer (ViTModel) was fine-tuned for gesture classification. The model achieved an accuracy of 97.59%, demonstrating its effectiveness in recognizing mathematical signs with high precision and reliability. This study highlights the role of deep learning in developing intelligent educational tools that bridge the learning gap for hard-of-hearing students by providing AI-driven interactive solutions to enhance mathematical comprehension. This work represents a significant step toward innovation and inclusion, fostering digital integration in specialized learning environments. The dataset is hosted on Hugging Face at https://huggingface.co/datasets/fidaakh/STEM_data.

Paperid: 2167, https://arxiv.org/pdf/2505.16352.pdf

Abstract:
Accurate prediction of perceptual attributes of haptic textures is essential for advancing VR and AR applications and enhancing robotic interaction with physical surfaces. This paper presents a deep learning-based multi-modal framework, incorporating visual and tactile data, to predict perceptual texture ratings by leveraging multi-feature inputs. To achieve this, a four-dimensional haptic attribute space encompassing rough-smooth, flat-bumpy, sticky-slippery, and hard-soft dimensions is first constructed through psychophysical experiments, where participants evaluate 50 diverse real-world texture samples. A physical signal space is subsequently created by collecting visual and tactile data from these textures. Finally, a deep learning architecture integrating a CNN-based autoencoder for visual feature learning and a ConvLSTM network for tactile data processing is trained to predict user-assigned attribute ratings. This multi-modal, multi-feature approach maps physical signals to perceptual ratings, enabling accurate predictions for unseen textures. To evaluate predictive accuracy, we employed leave-one-out cross-validation to rigorously assess the model's reliability and generalizability against several machine learning and deep learning baselines. Experimental results demonstrate that the framework consistently outperforms single-modality approaches, achieving lower MAE and RMSE, highlighting the efficacy of combining visual and tactile modalities.
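
As a reading aid for the evaluation protocol named above, here is a minimal sketch of leave-one-out cross-validation scored with MAE and RMSE. The nearest-neighbor predictor and the toy data are hypothetical stand-ins, not the paper's CNN/ConvLSTM model or its 50-texture dataset.

```python
import math

def loocv_errors(features, ratings, predict):
    """Leave-one-out CV: hold out each sample once and predict it from the rest."""
    abs_errors, sq_errors = [], []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = ratings[:i] + ratings[i + 1:]
        err = predict(train_x, train_y, features[i]) - ratings[i]
        abs_errors.append(abs(err))
        sq_errors.append(err * err)
    n = len(features)
    mae = sum(abs_errors) / n               # mean absolute error
    rmse = math.sqrt(sum(sq_errors) / n)    # root mean squared error
    return mae, rmse

def nearest_neighbor(train_x, train_y, x):
    """Hypothetical stand-in predictor: rating of the closest training sample."""
    dists = [math.dist(tx, x) for tx in train_x]
    return train_y[dists.index(min(dists))]

# Toy data (made up): 2-D feature vectors and one perceptual rating per texture.
features = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
ratings = [1.0, 2.0, 3.0, 4.0]
mae, rmse = loocv_errors(features, ratings, nearest_neighbor)
```

With only 50 samples, holding out one texture at a time lets every sample serve as an unseen test case while the model still trains on the other 49.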

Paperid: 2168, https://arxiv.org/pdf/2505.16023.pdf

Abstract:
As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users' intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.

Paperid: 2169, https://arxiv.org/pdf/2505.15974.pdf

Abstract:
College students are increasingly affected by stress, anxiety, and depression, yet face barriers to traditional mental health care. This study evaluated the efficacy of a mobile health (mHealth) intervention, Mental Health Evaluation and Lookout Program (mHELP), which integrates a smartwatch sensor and machine learning (ML) algorithms for real-time stress detection and self-management. In a 12-week randomized controlled trial (n = 117), participants were assigned to a treatment group using mHELP's full suite of interventions or a control group using the app solely for real-time stress logging and weekly psychological assessments. The primary outcome, "Moments of Stress" (MS), was assessed via physiological and self-reported indicators and analyzed using Generalized Linear Mixed Models (GLMM) approaches. Similarly, secondary outcomes of psychological assessments, including the Generalized Anxiety Disorder-7 (GAD-7) for anxiety, the Patient Health Questionnaire (PHQ-8) for depression, and the Perceived Stress Scale (PSS), were also analyzed via GLMM. Results for the objective measure, MS, indicate a substantial decrease in MS among the treatment group compared to the control group, while no notable between-group differences were observed in subjective scores of anxiety (GAD-7), depression (PHQ-8), or stress (PSS). However, the treatment group exhibited a clinically meaningful decline in GAD-7 and PSS scores. These findings underscore the potential of wearable-enabled mHealth tools to reduce acute stress in college populations and highlight the need for extended interventions and tailored features to address chronic symptoms like depression.

Paperid: 2170, https://arxiv.org/pdf/2505.14805.pdf

Abstract:
In human-robot collaboration (HRC), it is crucial for robot agents to consider humans' knowledge of their surroundings. In reality, humans possess a narrow field of view (FOV), limiting their perception. However, research on HRC often overlooks this aspect and presumes an omniscient human collaborator. Our study addresses the challenge of adapting to the evolving subtask intent of humans while accounting for their limited FOV. We integrate FOV within the human-aware probabilistic planning framework. To account for large state spaces due to considering FOV, we propose a hierarchical online planner that efficiently finds approximate solutions while enabling the robot to explore low-level action trajectories that enter the human FOV, influencing their intended subtask. Through a user study with our adapted cooking domain, we demonstrate that our FOV-aware planner reduces the human's interruptions and redundant actions during collaboration by adapting to human perception limitations. We extend these findings to a virtual reality kitchen environment, where we observe similar collaborative behaviors.
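
To make the FOV constraint concrete, the following is a minimal geometric sketch of testing whether a location falls inside a human's horizontal field of view. The function name, the 2-D setting, and the 120-degree default are illustrative assumptions; the abstract does not specify how the planner models FOV.

```python
import math

def in_field_of_view(human_xy, heading_deg, target_xy, fov_deg=120.0):
    """Return True if target_xy lies within the human's horizontal FOV cone.

    heading_deg is the direction the human faces (0 = +x axis, counterclockwise).
    fov_deg is the full cone angle; 120 degrees is an assumed value.
    """
    dx = target_xy[0] - human_xy[0]
    dy = target_xy[1] - human_xy[1]
    if dx == 0 and dy == 0:
        return True  # target coincides with the human's position
    bearing = math.degrees(math.atan2(dy, dx))
    # Smallest signed angle between heading and bearing, in [-180, 180).
    offset = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= fov_deg / 2.0

# A human at the origin facing +x sees a robot ahead but not one behind.
ahead = in_field_of_view((0, 0), 0.0, (1, 0))
behind = in_field_of_view((0, 0), 0.0, (-1, 0))
```

A planner could use such a predicate to decide which robot actions the human has actually observed, and therefore which beliefs about the robot's progress the human is likely to hold.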

Paperid: 2171, https://arxiv.org/pdf/2505.13381.pdf

Abstract:
Providing personalized, detailed feedback at scale in large undergraduate STEM courses remains a persistent challenge. We present an empirically evaluated practice exam system that integrates AI-generated feedback with targeted textbook references, deployed in a large introductory biology course. Our system encourages metacognitive behavior by asking students to explain their answers and declare their confidence. It uses OpenAI's GPT-4o to generate personalized feedback based on this information, while directing them to relevant textbook sections. Through interaction logs from consenting participants across three midterms (541, 342, and 413 students respectively), totaling 28,313 question-student interactions across 146 learning objectives, along with 279 surveys and 23 interviews, we examined the system's impact on learning outcomes and engagement. Across all midterms, feedback types showed no statistically significant performance differences, though some trends suggested potential benefits. The most substantial impact came from the required confidence ratings and explanations, which students reported transferring to their actual exam strategies. About 40 percent of students engaged with textbook references when prompted by feedback -- far higher than traditional reading rates. Survey data revealed high satisfaction (mean rating 4.1 of 5), with 82.1 percent reporting increased confidence on practiced midterm topics, and 73.4 percent indicating they could recall and apply specific concepts. Our findings suggest that embedding structured reflection requirements may be more impactful than sophisticated feedback mechanisms.

Paperid: 2172, https://arxiv.org/pdf/2505.13218.pdf

Abstract:
Decision support systems enhanced by Artificial Intelligence (AI) are increasingly being used in high-stakes scenarios where errors or biased outcomes can have significant consequences. In this work, we explore the conditions under which AI-based decision support systems affect the decision accuracy of humans involved in face matching tasks. Previous work suggests that this largely depends on various factors, such as the specific nature of the task and how users perceive the quality of the decision support, among others. Hence, we conduct extensive experiments to examine how both task difficulty and the precision of the system influence human outcomes. Our results show a strong influence of task difficulty, which not only makes humans less precise but also less capable of determining whether the decision support system is yielding accurate suggestions or not. This has implications for the design of decision support systems, and calls for a careful examination of the context in which they are deployed and on how they are perceived by users.

Paperid: 2173, https://arxiv.org/pdf/2505.13010.pdf

Abstract:
Media bias detection is a critical task in ensuring fair and balanced information dissemination, yet it remains challenging due to the subjectivity of bias and the scarcity of high-quality annotated data. In this work, we perform sentence-level bias classification by fine-tuning a RoBERTa-based model on the expert-annotated BABE dataset. Using McNemar's test and the 5x2 cross-validation paired t-test, we show statistically significant improvements in performance when comparing our model to a domain-adaptively pre-trained DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model avoids common pitfalls like oversensitivity to politically charged terms and instead attends more meaningfully to contextually relevant tokens. For a comprehensive examination of media bias, we present a pipeline that combines our model with an already-existing bias-type classifier. Our method exhibits good generalization and interpretability, despite being constrained by sentence-level analysis and dataset size because of a lack of larger and more advanced bias corpora. We discuss context-aware modeling, bias neutralization, and advanced bias type classification as potential future directions. Our findings contribute to building more robust, explainable, and socially responsible NLP systems for media bias detection.
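
For readers unfamiliar with McNemar's test, here is a minimal sketch of the continuity-corrected version applied to two classifiers evaluated on the same sentences. The predictions below are made-up toy data, and the abstract does not say which variant of the test the authors used.

```python
import math

def mcnemar_test(labels, preds_a, preds_b):
    """Continuity-corrected McNemar's test on paired classifier predictions.

    Only discordant pairs matter:
      b = A correct, B wrong;  c = A wrong, B correct.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    b = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(labels, preds_a, preds_b) if pa != y and pb == y)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Toy example: 20 sentences, all with true label 1.
labels = [1] * 20
preds_a = [1] * 10 + [0] * 2 + [1] * 8   # model A
preds_b = [0] * 10 + [1] * 2 + [1] * 8   # model B (baseline)
stat, p_value = mcnemar_test(labels, preds_a, preds_b)
```

Because the test uses only the cases where exactly one model is right, it suits comparisons on a single shared test set, where the two models' errors are not independent.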

Paperid: 2174, https://arxiv.org/pdf/2505.12734.pdf

Abstract:
We present a novel and practically significant problem, Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation, which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned with real-world environmental settings. To address this limitation, we introduce a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling. We construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, pairing diverse soundscapes with real-world landscape images. We propose SounDiT, a novel Diffusion Transformer (DiT)-based model that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose a practically informed geo-contextual evaluation framework, the Place Similarity Score (PSS), across element-, scene-, and human perception-levels to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and geographic settings. Our work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge in advancing multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences.

Paperid: 2175, https://arxiv.org/pdf/2505.11996.pdf

Abstract:
AI-enabled decision-support systems aim to help medical providers rapidly make decisions with limited information during medical emergencies. A critical challenge in developing these systems is supporting providers in interpreting the system output to make optimal treatment decisions. In this study, we designed and evaluated an AI-enabled decision-support system to aid providers in treating patients with traumatic injuries. We first conducted user research with physicians to identify and design information types and AI outputs for a decision-support display. We then conducted an online experiment with 35 medical providers from six health systems to evaluate two human-AI interaction strategies: (1) AI information synthesis and (2) AI information and recommendations. We found that providers were more likely to make correct decisions when AI information and recommendations were provided compared to receiving no AI support. We also identified two socio-technical barriers to providing AI recommendations during time-critical medical events: (1) an accuracy-time trade-off in providing recommendations and (2) polarizing perceptions of recommendations between providers. We discuss three implications for developing AI-enabled decision support used in time-critical events, contributing to the limited research on human-AI interaction in this context.

Paperid: 2176, https://arxiv.org/pdf/2505.11715.pdf

Abstract:
Our poster presents ConflictLens, a three-stage simulation system powered by large language models (LLMs) and grounded in psychological theory, designed to help users reflect on and practice conflict resolution in romantic relationships. Users can upload real conflict scenarios to receive evaluation of behavioral patterns, reflect on conflicts by annotating their negative behaviors, and practice different conflict resolution strategies in AI-simulated duologues. Initial evaluation by three domain experts suggests that ConflictLens offers a realistic experience and effectively supports self-guided reflection and communication practice in romantic relationships.

Paperid: 2177, https://arxiv.org/pdf/2505.11056.pdf

Abstract:
Technology has helped to innovate in the teaching-learning process. Today's students are more demanding actors when it comes to the environment they have at their disposal to learn, experiment, and develop critical thinking. The area of mathematics has repeatedly suffered from students' learning difficulties, whether due to lack of motivation, low abstraction ability, or lack of new tools for teachers to bring innovation into the classroom and outside it. While it is true that digitalization has entered schools, it often follows a process of digital replication of approaches and materials that were previously only available on physical media. This work focuses on the use of Extended Realities for teaching mathematics, particularly geometry, and proposes a conceptual model that combines Extended Reality and Machine Learning. The proposed model was prototyped, and the prototype is presented as a form of laboratory validation. It contributes to innovating the way in which the geometry teaching-learning process is developed and provides useful insights for teachers and students throughout the process.

Paperid: 2178, https://arxiv.org/pdf/2505.10864.pdf

Abstract:
Recent advancements in Ultra-Wideband (UWB) radar technology have enabled contactless, non-line-of-sight vital sign monitoring, making it a valuable tool for healthcare. However, UWB radar's ability to capture sensitive physiological data, even through walls, raises significant privacy concerns, particularly in human-robot interactions and autonomous systems that rely on radar for sensing human presence and physiological functions. In this paper, we present Anti-Sensing, a novel defense mechanism designed to prevent unauthorized radar-based sensing. Our approach introduces physically realizable perturbations, such as oscillatory motion from wearable devices, to disrupt radar sensing by mimicking natural cardiac motion, thereby misleading heart rate (HR) estimations. We develop a gradient-based algorithm to optimize the frequency and spatial amplitude of these oscillations for maximal disruption while ensuring physiological plausibility. Through both simulations and real-world experiments with radar data and neural network-based HR sensing models, we demonstrate the effectiveness of Anti-Sensing in significantly degrading model accuracy, offering a practical solution for privacy preservation.
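
The intuition behind the attack can be illustrated with a deliberately simplified sketch: a spectral-peak heart-rate estimator is misled once a stronger oscillation at a plausible cardiac frequency is added to the measured motion. This is not the paper's method, which optimizes perturbations with a gradient-based algorithm against neural network HR models; the estimator, signal model, amplitudes, and frequencies here are all assumptions chosen for illustration.

```python
import math

def estimate_hr_bpm(signal, fs, lo_hz=0.8, hi_hz=3.0, step_hz=0.01):
    """Estimate heart rate as the strongest frequency in a cardiac band.

    Scans candidate frequencies with a direct Fourier projection (no FFT library).
    """
    best_f, best_power = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        f = lo_hz + k * step_hz
        re = sum(x * math.cos(2 * math.pi * f * n / fs) for n, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * f * n / fs) for n, x in enumerate(signal))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return 60.0 * best_f

fs = 20.0                                    # assumed sampling rate in Hz
t = [n / fs for n in range(int(20 * fs))]    # 20 seconds of samples
heart = [math.sin(2 * math.pi * 1.2 * ti) for ti in t]               # 72 bpm motion
perturb = [2.0 * math.sin(2 * math.pi * 1.8 * ti) for ti in t]       # 108 bpm oscillation
clean_bpm = estimate_hr_bpm(heart, fs)
attacked_bpm = estimate_hr_bpm([h + p for h, p in zip(heart, perturb)], fs)
```

In this toy setting the estimate moves from roughly 72 bpm to roughly 108 bpm while remaining within a physiologically plausible range, which is the property the abstract highlights.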

Paperid: 2179, https://arxiv.org/pdf/2505.10863.pdf

Abstract:
Adolescent girls face significant mental health challenges during their transition to adulthood, often experiencing heightened stress from various sources. While various interactive technologies for self-disclosure have been explored to support stress relief, little is known about how to encourage stress-related self-disclosure through an embodied approach. This study presents a co-design workshop centred on Embodied Probes, a series of artefacts and activities incorporating embodied methods and technologies. During the workshop, nine participants aged 15 to 18 engaged with their bodies, expressed bodily sensations through tangible means, and designed embodied prototypes tailored to their personal needs for stress perception and relief. The workshop revealed insights into somatic symptoms, sources, and coping strategies for stress among adolescent girls, as well as how embodied methods can support their stress self-disclosure. This paper contributes to the HCI community by offering design implications on leveraging embodied technologies to support self-disclosure for young women's mental well-being.

Paperid: 2180, https://arxiv.org/pdf/2505.10681.pdf

Abstract:
We present Social Digital Twinner, an innovative social simulation tool for exploring plausible effects of what-if scenarios in complex adaptive social systems. The architecture is composed of three seamlessly integrated parts: a data infrastructure featuring real-world data and a multi-dimensionally representative synthetic population of citizens, an LLM-enabled agent-based simulation engine, and a user interface that enables intuitive, natural language interactions with the simulation engine and the artificial agents (i.e. citizens). Social Digital Twinner facilitates real-time engagement and empowers stakeholders to collaboratively design, test, and refine intervention measures. The tool promotes a data-driven, evidence-based approach to societal problem-solving. We demonstrate the tool's interactive capabilities by addressing the critical issue of youth school dropouts in Kragero, Norway, showcasing its ability to create and execute a dedicated social digital twin using natural language.

Paperid: 2181, https://arxiv.org/pdf/2505.10472.pdf

Abstract:
Effective communication about breast and cervical cancers remains a persistent health challenge, with significant gaps in public understanding of cancer prevention, screening, and treatment, potentially leading to delayed diagnoses and inadequate treatments. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information to support patient understanding. We evaluated five general-purpose and three medical LLMs using a mixed-methods evaluation framework across linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness. Our approach utilized quantitative metrics, qualitative expert ratings, and statistical analysis using Welch's ANOVA, Games-Howell, and Hedges' g. Our results show that general-purpose LLMs produced outputs of higher linguistic quality and affectiveness, while medical LLMs demonstrated greater communication accessibility. However, medical LLMs tend to exhibit higher levels of potential harm, toxicity, and bias, reducing their performance in safety and trustworthiness. Our findings indicate a duality between domain-specific knowledge and safety in health communications. The results highlight the need for intentional model design with targeted improvements, particularly in mitigating harm and bias, and improving safety and affectiveness. This study provides a comprehensive evaluation of LLMs for cancer communication, offering critical insights for improving AI-generated health content and informing future development of accurate, safe, and accessible digital health tools.
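
Of the statistics named above, Hedges' g is the effect-size measure. The sketch below computes it for two groups of scores using the common small-sample correction factor 1 - 3/(4(n1 + n2) - 9). The scores are made-up toy values, and whether the authors used this exact approximation is an assumption.

```python
import math
from statistics import mean, variance

def hedges_g(group_a, group_b):
    """Standardized mean difference with Hedges' small-sample bias correction."""
    n1, n2 = len(group_a), len(group_b)
    # Pooled standard deviation from the two sample variances.
    pooled_sd = math.sqrt(
        ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    )
    cohens_d = (mean(group_a) - mean(group_b)) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return cohens_d * correction

# Toy ratings (made up) for two groups of model outputs.
g = hedges_g([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
```

The correction shrinks Cohen's d slightly, which matters when only a handful of models or raters are compared, as with the eight LLMs in this study.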

Paperid: 2182, https://arxiv.org/pdf/2505.10412.pdf

Abstract:
Material heritage typically has a whole set of associated immaterial heritage, which is essential to pass on to the visitor as a cultural mission of the destinations and those who manage them. In this sense, the interpretation of material heritage is a complex process that cannot be fully achieved through the mere observation of physical artifacts. In this context, it is fundamental to provide visitors with a set of tools that allow them to correctly interpret the artifacts and thereby fully understand the cultural dimension of the destinations and their heritage. Accordingly, virtual reality can support the creation of innovative and immersive solutions that allow the visitor to understand and feel part of their own heritage and its ancestral component that defines the sociocultural roots of destinations and their civilizational traditions. This article, after dissecting and substantiating the role of virtual reality in the interpretation of heritage, presents a conceptual model, based on the use of virtual reality, which was, in part, prototyped in the scenario of the Portuguese Museum in the city of Miranda do Douro. This proposal is an ongoing contribution to the creation of innovative and immersive tools for the interpretation of heritage.

Paperid: 2183, https://arxiv.org/pdf/2505.10047.pdf

Abstract:
Modern production rates and the increasing complexity of mechanical systems require efficient and effective manufacturing and assembly processes. The transition to Industry 4.0, supported by the deployment of innovative tools such as Augmented Reality (AR), equips the industry to tackle future challenges. Among critical processes, the assembly and tightening of bolted joints stand out due to their significant safety and economic implications across various industrial sectors. This study proposes an innovative tightening method designed to enhance the reliability of bolted assembly tightening through the use of Augmented Reality and connected tools. A 6-Degrees-of-Freedom (6-DoF) tracked connected torque wrench assists the operator during tightening, ensuring each screw is tightened to the correct torque. The effectiveness of this method is compared with the conventional tightening method using paper instructions. Participants in the study carried out tightening sequences on two simple parts with multiple screws. The study evaluates the impact of the proposed method on task performance and its acceptability to operators. The tracked connected torque wrench provides considerable assistance to the operators, including wrench control and automatic generation of tightening reports. The results suggest that the AR-based method has the potential to ensure reliable torque tightening of bolted joints.

Paperid: 2184, https://arxiv.org/pdf/2505.09936.pdf

Abstract:
The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both accurate and informative maps. In this study, we propose CartoAgent, a novel multi-agent cartographic framework powered by multimodal large language models (MLLMs). This framework simulates three key stages in cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles to collaborate, discuss, and utilize tools for specific purposes. In particular, CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge to generate maps that are both visually appealing and informative. By separating style from geographic data, it can focus on designing stylesheets without modifying the vector-based data, thereby ensuring geographic accuracy. We applied CartoAgent to a specific task centered on map restyling, namely map style transfer and evaluation. The effectiveness of this framework was validated through extensive experiments and a human evaluation study. CartoAgent can be extended to support a variety of cartographic design decisions and inform future integrations of GenAI in cartography.
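
The style/data separation described above can be sketched as follows. The data layout, layer names, and function are hypothetical illustrations and do not reflect CartoAgent's actual stylesheet format; the point is only that restyling swaps the stylesheet while the vector geometry stays untouched.

```python
import copy

# Hypothetical vector data: each feature has a layer name and fixed coordinates.
features = [
    {"layer": "water", "geometry": [(0, 0), (1, 0), (1, 1)]},
    {"layer": "road", "geometry": [(0, 0), (2, 2)]},
]

def render_plan(features, stylesheet):
    """Pair each feature with the style for its layer; geometry is never modified."""
    default = {"color": "#000000", "width": 1}
    return [
        {"geometry": f["geometry"], "style": stylesheet.get(f["layer"], default)}
        for f in features
    ]

# Two made-up stylesheets, such as a design agent might propose.
daytime = {"water": {"color": "#a0c8f0", "width": 1}, "road": {"color": "#ffffff", "width": 2}}
night = {"water": {"color": "#102040", "width": 1}, "road": {"color": "#ffcc66", "width": 2}}

original = copy.deepcopy(features)
plan_day = render_plan(features, daytime)
plan_night = render_plan(features, night)
```

Because both plans reference the same coordinates, any stylistic choice, however unusual, cannot move a road or reshape a lake.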

Paperid: 2185, https://arxiv.org/pdf/2505.09882.pdf

Abstract:
Spatial computing technologies have the potential to revolutionize how we interact with the world around us. However, most modern integrated development environments (IDEs) have not fully adapted to this paradigm shift. For example, physical 3D objects in the real world are still represented as 2D text variables in code, creating a significant perceptual distance between these representations. In response to this challenge, we introduce SnapNCode, a novel IDE for spatial programming. SnapNCode enables programmers to capture various states of physical objects through live video streams from cameras and directly insert these visual representations into their code. Moreover, users can augment physical objects by attaching code snippets to them, which are opportunistically triggered when the objects are observed by cameras. We conducted a user study (N=12) to assess the usability of SnapNCode. Feedback from participants indicates that the system is easy to use and holds promise for casual daily use and integration into a broader range of workflows.

Paperid: 2186, https://arxiv.org/pdf/2505.09872.pdf

Abstract:
Music plays a critical role in emotional regulation and stress relief; however, individuals often need different types of music tailored to their unique stress levels or surrounding environment. Choosing the right music can be challenging due to the overwhelming number of options and the time-consuming trial-and-error process. To address this, we propose Context-AI Tune (CAT), a system that generates personalized music based on environmental inputs and the user's self-assessed stress level. A 2x2 within-subject experiment (N=26) was conducted with two independent variables: AI (AI, NoAI) and Environment (Busy Hub, Quiet Library). CAT's effectiveness in reducing stress was evaluated using the Visual Analog Scale for Stress (VAS-S). Results show that CAT is more effective than manually chosen music in reducing stress by adapting to user context.

Paperid: 2187, https://arxiv.org/pdf/2505.09823.pdf

Abstract:
Multi-modal generative AI models integrated into wearable devices have shown significant promise in enhancing the accessibility of visual information for blind or visually impaired (BVI) individuals, as evidenced by the rapid uptake of Meta Ray-Bans among BVI users. However, the proprietary nature of these platforms hinders disability-led innovation of visual accessibility technologies. For instance, OpenAI showcased the potential of live, multi-modal AI as an accessibility resource in 2024, yet none of the presented applications have reached BVI users, despite the technology being available since then. To promote the democratization of visual access technology development, we introduce WhatsAI, a prototype extensible framework that empowers BVI enthusiasts to leverage Meta Ray-Bans to create personalized wearable visual accessibility technologies. Our system is the first to offer a fully hackable template that integrates with WhatsApp, facilitating robust Accessible Artificial Intelligence Implementations (AAII) that enable blind users to conduct essential visual assistance tasks, such as real-time scene description, object detection, and Optical Character Recognition (OCR), utilizing standard machine learning techniques and cutting-edge visual language models. The extensible nature of our framework aspires to cultivate a community-driven approach, led by BVI hackers and innovators to tackle the complex challenges associated with visual accessibility.

Paperid: 2188, https://arxiv.org/pdf/2505.09802.pdf

Abstract:
As society becomes increasingly reliant on artificial intelligence, the need to mitigate risk and harm is paramount. In response, researchers and practitioners have developed tools to detect and reduce undesired bias, commonly referred to as fairness tools. Many of these tools are publicly available for free use and adaptation. While the growing availability of such tools is promising, little is known about the broader landscape beyond well-known examples like AI Fairness 360 and Fairlearn. Because fairness is an ongoing concern, these tools must be built for long-term sustainability. Using an existing set of fairness tools as a reference, we systematically searched GitHub and identified 50 related projects. We then analyzed various aspects of their repositories to assess community engagement and the extent of ongoing maintenance. Our findings show diverse forms of engagement with these tools, suggesting strong support for open-source development. However, we also found significant variation in how well these tools are maintained. Notably, 53 percent of fairness projects become inactive within the first three years. By examining sustainability in fairness tooling, we aim to promote more stability and growth in this critical area.

Paperid: 2189, https://arxiv.org/pdf/2505.09509.pdf

Abstract:
Long-distance relationships (LDRs) have become more common in the last few decades, primarily among young adults pursuing educational or employment opportunities. A common way for couples in LDRs to spend time together is by playing multiplayer video games, which are often a shared hobby and therefore a preferred joint activity. However, games are relatively understudied in the context of relational maintenance for LDRs. In this work, we used a mixed-methods approach to collect data on the experiences of 13 couples in LDRs who frequently play games together. We investigated different values around various game mechanics and modalities and found significant differences in couple play styles. We also detail how couples appropriate game mechanics to express affection to each other virtually. Finally, we created prototypes and design implications based on couples' needs surrounding the lack of physical sensation and memorabilia storage in most popular games.

Paperid: 2190, https://arxiv.org/pdf/2505.09065.pdf

Abstract:
Explainable Recommender Systems (XRS) aim to provide users with understandable reasons for the recommendations generated by these systems, representing a crucial research direction in artificial intelligence (AI). Recent research has increasingly focused on the algorithms, display, and evaluation methodologies of XRS. However, current research and reviews primarily emphasize the algorithmic aspects, with fewer studies addressing the Human-Computer Interaction (HCI) layer of XRS. Additionally, existing reviews lack a unified taxonomy for XRS and give insufficient attention to the emerging area of short video recommendations. In this study, we synthesize existing literature and surveys on XRS, presenting a unified framework for its research and development. The main contributions are as follows: 1) We adopt a lifecycle perspective to systematically summarize the technologies and methods used in XRS, addressing challenges posed by the diversity and complexity of algorithmic models and explanation techniques. 2) For the first time, we highlight the application of multimedia, particularly video-based explanations, along with its potential, technical pathways, and challenges in XRS. 3) We provide a structured overview of evaluation methods from both qualitative and quantitative dimensions. These findings provide valuable insights for the systematic design, progress, and testing of XRS.

Paperid: 2191, https://arxiv.org/pdf/2505.08493.pdf

Abstract:
Generative AI can help small business owners automate tasks, increase efficiency, and improve their bottom line. However, despite the seemingly intuitive design of systems like ChatGPT, significant barriers remain for those less comfortable with technology. To address these disparities, prior work highlights accessory skills -- beyond prompt engineering -- users must master to successfully adopt generative AI, including keyboard shortcuts, editing skills, file conversions, and browser literacy. Building on a design workshop series and 15 interviews with small businesses, we introduce BizChat, a large language model (LLM)-powered web application that helps business owners across digital skill levels write their business plan -- an essential but often neglected document. To do so, BizChat's interface embodies three design considerations inspired by learning sciences: ensuring accessibility to users with fewer digital skills while maintaining extensibility to power users ("low-floor-high-ceiling"), providing in situ micro-learning to support entrepreneurial education ("just-in-time learning"), and framing interaction around business activities ("contextualized technology introduction"). We conclude with plans for a future BizChat deployment.

Paperid: 2192, https://arxiv.org/pdf/2505.08360.pdf

Abstract:
A comparison between human and Generative AI decision-making attributes in complex health services is currently a knowledge gap in the literature. Humans may possess unique attributes beneficial to decision-making in complex health services such as health policy and health regulation, but are also susceptible to decision-making flaws. The objective is to explore whether humans have unique and/or helpful attributes that contribute to optimal decision-making in complex health services. This comparison may also shed light on whether humans are likely to compete, cooperate, or converge with Generative AI. The comparison is based on two published reviews: a scoping review of human attributes [1] and a rapid review of Generative AI attributes [2]. The analysis categorizes attributes by uniqueness and impact. The results are presented in tabular form, comparing the sets and subsets of human and Generative AI attributes. Human and Generative AI decision-making attributes have complementary strengths. Cooperation between these two entities seems more likely than pure competition. To maintain meaningful decision-making roles, humans could develop their unique attributes, with decision-making systems integrating both human and Generative AI contributions. These entities may also converge in the future.

Paperid: 2193, https://arxiv.org/pdf/2505.08048.pdf

Abstract:
Political misinformation, particularly harmful when it aligns with individuals' preexisting beliefs and political ideologies, has become widespread on social media platforms. In response, platforms like Facebook and X introduced warning messages leveraging fact-checking results from third-party fact-checkers to alert users against false content. However, concerns persist about the effectiveness of these fact-checks, especially when fact-checkers are perceived as politically biased. To address these concerns, this study presents findings from an online human-subject experiment (N=216) investigating how the political stances of fact-checkers influence their effectiveness in correcting misbeliefs about political misinformation. Our findings demonstrate that partisan fact-checkers can decrease the perceived accuracy of political misinformation and correct misbeliefs without triggering backfire effects. This correction is even more pronounced when the misinformation aligns with individuals' political ideologies. Notably, while previous research suggests that fact-checking warnings are less effective for conservatives than liberals, our results suggest that explicitly labeled partisan fact-checkers, positioned as political counterparts to conservatives, are particularly effective in reducing conservatives' misbeliefs toward pro-liberal misinformation.

Paperid: 2194, https://arxiv.org/pdf/2505.08031.pdf

Abstract:
Understanding what is communicated by data visualizations is a critical component of scientific literacy in the modern era. However, it remains unclear why some tasks involving data visualizations are more difficult than others. Here we administered a composite test composed of five widely used tests of data visualization literacy to a large sample of U.S. adults (N=503 participants). We found that items in the composite test spanned the full range of possible difficulty levels, and that our estimates of item-level difficulty were highly reliable. However, the type of data visualization shown and the type of task involved only explained a modest amount of variation in performance across items, relative to the reliability of the estimates we obtained. These results highlight the need for finer-grained ways of characterizing these items that predict the reliable variation in difficulty measured in this study, and that generalize to other tests of data visualization understanding.

Paperid: 2195, https://arxiv.org/pdf/2505.07592.pdf

Abstract:
Consumer-grade electroencephalography (EEG) devices show promise for Brain-Computer Interface (BCI) applications, but their efficacy in detecting subtle cognitive states remains understudied. We developed a comprehensive study paradigm which incorporates a combination of established cognitive tasks (N-Back, Stroop, and Mental Rotation) and adds a novel ecological Chess puzzles task. We tested our paradigm with the MUSE 2, a low-cost consumer-grade EEG device. Using linear mixed-effects modeling we demonstrate successful distinctions of within-task workload levels and cross-task cognitive states based on the spectral power data derived from the MUSE 2 device. With machine learning we further show reliable predictive power to differentiate between workload levels in the N-Back task, and also achieve effective cross-task classification. These findings demonstrate that consumer-grade EEG devices like the MUSE 2 can be used to effectively differentiate between various levels of cognitive workload as well as among more nuanced task-based cognitive states, and that these tools can be leveraged for real-time adaptive BCI applications in practical settings.
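
As a rough illustration of what "spectral power data" can mean for a single EEG channel, the sketch below sums squared FFT magnitudes inside frequency bands. It is not the authors' pipeline: the 256 Hz sampling rate, the band edges, the synthetic test signal, and the function name `band_power` are all assumptions made for this example.

```python
import numpy as np

def band_power(signal, fs, low, high):
    """Relative power of `signal` in the band [low, high) Hz.

    Sums squared FFT magnitudes over the band; the result is only
    meaningful for comparing bands of the same signal.
    """
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    mask = (freqs >= low) & (freqs < high)
    return float(power[mask].sum())

# Assumed sampling rate and conventional band edges (illustrative only).
FS = 256
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

# Synthetic 2-second signal: a strong 10 Hz component plus a weak 20 Hz one.
t = np.arange(2 * FS) / FS
x = np.sin(2 * np.pi * 10 * t) + 0.2 * np.sin(2 * np.pi * 20 * t)

powers = {name: band_power(x, FS, lo, hi) for name, (lo, hi) in BANDS.items()}
```

Band powers computed this way per task condition would be the kind of feature a mixed-effects model or a classifier could take as input.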

Paperid: 2196, https://arxiv.org/pdf/2505.07498.pdf

Abstract:
Postoperative delirium (POD) is among the most common complications after surgeries for older adults and can entail long-term adverse health consequences. Active patient participation in POD prevention presents a central factor in reducing these risks. To support patient engagement through a digital health application, we use value sensitive design approaches to identify the requirements for a patient-centered digital health application supporting patient engagement in POD prevention. Through interviews with medical professionals and patient representatives, we construct a patient journey, which serves as the basis for twelve patient value journey interviews. In these interviews, patients from the high-risk group for POD revisit their recent experience of undergoing surgery to elicit barriers, needs, and values concerning POD prevention from a patient perspective. An analysis of the patient interviews derives four design requirements for a digital health application supporting patients regarding POD prevention: the adaptation of patient-centered communication, the provision of procedural transparency, fostering patient empowerment through consistent guidance, and explicitly addressing relatives as mediators and supporters for a patient after a POD occurrence.

Paperid: 2197, https://arxiv.org/pdf/2505.07161.pdf

Abstract:
Effective feedback is essential for refining instructional practices in mathematics education, and researchers often turn to advanced natural language processing (NLP) models to analyze classroom dialogues from multiple perspectives. However, utterance-level discourse analysis encounters two primary challenges: (1) multifunctionality, where a single utterance may serve multiple purposes that a single tag cannot capture, and (2) the exclusion of many utterances from domain-specific discourse move classifications, leading to their omission in feedback. To address these challenges, we proposed a multi-perspective discourse analysis that integrates domain-specific talk moves with dialogue act (using the flattened multi-functional SWBD-MASL schema with 43 tags) and discourse relation (applying Segmented Discourse Representation Theory with 16 relations). Our top-down analysis framework enables a comprehensive understanding of utterances that contain talk moves, as well as utterances that do not contain talk moves. This is applied to two mathematics education datasets: TalkMoves (teaching) and SAGA22 (tutoring). Through distributional unigram analysis, sequential talk move analysis, and multi-view deep dive, we discovered meaningful discourse patterns and revealed the vital role of utterances without talk moves, demonstrating that these utterances, far from being mere fillers, serve crucial functions in guiding, acknowledging, and structuring classroom discourse. These insights underscore the importance of incorporating discourse relations and dialogue acts into AI-assisted education systems to enhance feedback and create more responsive learning environments. Our framework may prove helpful not only for providing human educator feedback but also for aiding in the development of AI agents that can effectively emulate the roles of both educators and students.

Paperid: 2198, https://arxiv.org/pdf/2505.07142.pdf

Abstract:
The paper investigates the integration of Large Language Models (LLMs) into Conversational Agents (CAs) to encourage a shift in consumption patterns from a demand-driven to a supply-based paradigm. Specifically, the research examines the role of anthropomorphic design in delivering environmentally conscious messages by comparing two CA designs: a personified agent representing an appliance and a traditional, non-personified assistant. A lab study (N=26) assessed the impact of these designs on interaction, perceived self-efficacy, and engagement. Results indicate that LLM-based CAs significantly enhance users' self-reported eco-friendly behaviors, with participants expressing greater confidence in managing energy consumption. While the anthropomorphic design did not notably affect self-efficacy, those interacting with the personified agent reported a stronger sense of connection with the system. These findings suggest that although anthropomorphic CAs may improve user engagement, both designs hold promise for fostering sustainable behaviors in home energy management.

Paperid: 2199, https://arxiv.org/pdf/2505.06680.pdf

Abstract:
Lane-changing (LC) behavior, a critical yet complex driving maneuver, significantly influences driving safety and traffic dynamics. Traditional analytical LC decision (LCD) models, while effective in specific environments, often oversimplify behavioral heterogeneity and complex interactions, limiting their capacity to capture real LCD. Data-driven approaches address these gaps by leveraging rich empirical data and machine learning to decode latent decision-making patterns, enabling adaptive LCD modeling in dynamic environments. In light of the rapid development of artificial intelligence and the demand for data-driven models oriented towards connected vehicles and autonomous vehicles, this paper presents a comprehensive survey of data-driven LCD models, with a particular focus on human drivers' LC decision-making. It systematically reviews the modeling framework, covering data sources and preprocessing, model inputs and outputs, objectives, structures, and validation methods. This survey further discusses the opportunities and challenges faced by data-driven LCD models, including driving safety, uncertainty, and the integration and improvement of technical frameworks.

Paperid: 2200, https://arxiv.org/pdf/2505.06428.pdf

Abstract:
Improving end-users' understanding of decisions made by autonomous vehicles (AVs) driven by artificial intelligence (AI) can improve utilization and acceptance of AVs. However, current explanation mechanisms primarily help AI researchers and engineers in debugging and monitoring their AI systems, and may not address the specific questions of end-users, such as passengers, about AVs in various scenarios. In this paper, we conducted two user studies to investigate questions that potential AV passengers might pose while riding in an AV and evaluate how well answers to those questions improve their understanding of AI-driven AV decisions. Our initial formative study identified a range of questions about AI in autonomous driving that existing explanation mechanisms do not readily address. Our second study demonstrated that interactive text-based explanations effectively improved participants' comprehension of AV decisions compared to simply observing AV decisions. These findings inform the design of interactions that motivate end-users to engage with and inquire about the reasoning behind AI-driven AV decisions.

Paperid: 2201, https://arxiv.org/pdf/2505.05773.pdf

Abstract:
Humanoid robots are increasingly deployed in various facilities, including hospitals and assisted living environments, where they are often remotely controlled by human operators. Their kinematic redundancy enhances reachability and manipulability, enabling them to navigate complex, cluttered environments and perform a wide range of tasks. However, this redundancy also presents significant control challenges, particularly in coordinating the movements of the robot's macro-micro structure (torso and arms). Therefore, we propose various human-robot collaborative (HRC) methods for coordinating the torso and arm of remotely controlled mobile humanoid robots, aiming to balance autonomy and human input to enhance system efficiency and task execution. The proposed methods include human-initiated approaches, where users manually control torso movements, and robot-initiated approaches, which autonomously coordinate torso and arm based on factors such as reachability, task goal, or inferred human intent. We conducted a user study with N=17 participants to compare the proposed approaches in terms of task performance, manipulability, and energy efficiency, and analyzed which methods were preferred by participants.

Paperid: 2202, https://arxiv.org/pdf/2505.04890.pdf

Abstract:
The increasing convergence of artificial intelligence has opened new avenues, including its emerging role in enhancing creativity. It is reshaping traditional creative practices such as actor improvisation, which often struggles with predictable patterns, limited interaction, and a lack of engaging stimuli. In this paper, we introduce a new concept, Theatrical Language Processing (TLP), and an AI-driven creativity support tool, Scribble.ai, designed to augment actors' creative expression and spontaneity through interactive practice. We conducted a user study involving tests and interviews with fourteen participants. Our findings indicate that: (1) Actors expanded their creativity when faced with AI-produced irregular scenarios; (2) The AI's unpredictability heightened their problem-solving skills, specifically in interpreting unfamiliar situations; (3) However, AI often generated excessively detailed scripts, which limited interpretive freedom and hindered subtext exploration. Based on these findings, we discuss the new potential in enhancing creative expressions in film and theater studies through an AI-driven tool.

Paperid: 2203, https://arxiv.org/pdf/2505.04487.pdf

Abstract:
LLM-generated tabular data is creating new opportunities for data-driven applications in academia, business, and society. To leverage benefits like missing value imputation, labeling, and enrichment with context-aware attributes, LLM-generated data needs a critical validation process. The number of pioneering approaches is increasing fast, opening a promising validation space that, so far, remains unstructured. We present a design space for the critical validation of LLM-generated tabular data with two dimensions: First, the Analysis Granularity dimension: from within-attribute (single-item and multi-item) to across-attribute perspectives (1 x 1, 1 x m, and n x n). Second, the Data Source dimension: differentiating between LLM-generated values, ground truth values, explanations, and their combinations. We discuss analysis tasks for each dimension cross-cut, map 19 existing validation approaches, and discuss the characteristics of two approaches in detail, demonstrating descriptive power.

Paperid: 2204, https://arxiv.org/pdf/2505.04446.pdf

Abstract:
The violin is one of the most popular musical instruments. Various parameters of bowing motion, such as pressure, position, and speed, are crucial for producing a beautiful tone. However, mastering them is challenging and requires extensive practice. In this study, we aimed to support practice of bowing, focusing on bow pressure. First, we compared the bowing movements, specifically bow pressure, bow position, and bow speed, of eight experienced players with those of eight beginners. Next, we developed and evaluated a visual feedback system that displays bow pressure to support practice. We taught the identified differences to 14 beginners, dividing them into two groups: one practiced with an explanation, and the other with both an explanation and a feedback system. These two experiments found that clarifying the characteristics unique to experienced players can support practice.

Paperid: 2205, https://arxiv.org/pdf/2505.04433.pdf

Abstract:
The limited expressiveness of virtual user representations in Mixed Reality and Virtual Reality can inhibit an integral part of communication: emotional expression. Emotion recognition based on face tracking is often used to compensate for this. However, emotional facial expressions are highly individual, which is why many approaches have difficulties recognizing unique variations of emotional expressions. We propose several strategies to improve face tracking systems for emotion recognition with and without user intervention for the Affective Interaction Workshop at CHI '25.

Paperid: 2206, https://arxiv.org/pdf/2505.03867.pdf

Abstract:
Creative coding platforms like Scratch have democratized programming for children, yet translating imaginative ideas into functional code remains a significant hurdle for many young learners. While AI copilots assist adult programmers, few tools target children in block-based environments. Building on prior research \cite{druga_how_2021,druga2023ai,druga2023scratch}, we present Cognimates Scratch Copilot: an AI-powered assistant integrated into a Scratch-like environment, providing real-time support for ideation, code generation, debugging, and asset creation. This paper details the system architecture and findings from an exploratory qualitative evaluation with 18 international children (ages 7-12). Our analysis reveals how the AI Copilot supported key creative coding processes, particularly aiding ideation and debugging. Crucially, it also highlights how children actively negotiated the use of AI, demonstrating strong agency by adapting or rejecting suggestions to maintain creative control. Interactions surfaced design tensions between providing helpful scaffolding and fostering independent problem-solving, as well as learning opportunities arising from navigating AI limitations and errors. Findings indicate Cognimates Scratch Copilot's potential to enhance creative self-efficacy and engagement. Based on these insights, we propose initial design guidelines for AI coding assistants that prioritize youth agency and critical interaction alongside supportive scaffolding.

Paperid: 2207, https://arxiv.org/pdf/2505.03189.pdf

Abstract:
Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference-time with zero cost, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. We analyze the performance of CAE in in-distribution and out-of-distribution settings, evaluate drawbacks, and begin to develop comprehensive guidelines for its effective deployment. We find that (1) CAE is only reliably effective when applied to in-distribution contexts; (2) increasing the number of samples used to generate steering vectors has diminishing returns at around 80 samples; (3) steering vectors are susceptible to adversarial inputs that reverse the behavior being steered for; (4) steering vectors harm the overall model perplexity; and (5) larger models are more resistant to steering-induced degradation.
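
One common form of contrastive activation steering can be sketched as follows: a steering vector is taken as the mean difference between hidden activations for contrasting prompt sets, then added (scaled) to a hidden state at inference time. This is a toy illustration, not the paper's code: the activations are random arrays rather than real model states, and the names `steering_vector`, `steer`, and `alpha`, the dimension 8, and the choice of 80 samples per set are assumptions for the example.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation difference between positive and negative examples.

    Both inputs have shape (n_samples, hidden_dim).
    """
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden state."""
    return hidden + alpha * vector

# Synthetic stand-ins for layer activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 1.0  # the behavior is encoded along axis 0 in this toy setup

pos = rng.normal(size=(80, dim)) + 2.0 * direction  # prompts showing the behavior
neg = rng.normal(size=(80, dim)) - 2.0 * direction  # prompts showing the opposite

v = steering_vector(pos, neg)
h = np.zeros(dim)                  # a hidden state to be steered
h_steered = steer(h, v, alpha=0.5)
```

In a real model, the addition would happen inside a forward pass at a chosen layer; `alpha` controls steering strength, and large values are where degradation such as increased perplexity would be expected to show up.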

Paperid: 2208, https://arxiv.org/pdf/2505.03117.pdf

Abstract:
Interest in explainable artificial intelligence (XAI) is surging. Prior research has primarily focused on systems' ability to generate explanations, often guided by researchers' intuitions rather than end-users' needs. Unfortunately, such approaches have not yielded favorable outcomes when compared to a black-box baseline (i.e., no explanation). To address this gap, this paper advocates a human-centered approach that shifts focus to air traffic controllers (ATCOs) by asking a fundamental yet overlooked question: Do ATCOs need explanations, and if so, why? Insights from air traffic management (ATM), human-computer interaction, and the social sciences were synthesized to provide a holistic understanding of XAI challenges and opportunities in ATM. Evaluating 11 ATM operational goals revealed a clear need for explanations when ATCOs aim to document decisions and rationales for future reference or report generation. Conversely, ATCOs are less likely to seek them when their conflict resolution approach aligns with the artificial intelligence (AI) advisory. While this is a preliminary study, the findings are expected to inspire broader and deeper inquiries into the design of ATCO-centric XAI systems, paving the way for more effective human-AI interaction in ATM.

Paperid: 2209, https://arxiv.org/pdf/2505.02975.pdf

Abstract:
AI assistants are increasingly integrated into older adults' daily lives, offering new opportunities for social support and accessibility while raising important questions about privacy, autonomy, and trust. As these systems become embedded in caregiving and social networks, older adults must navigate trade-offs between usability, data privacy, and personal agency across different interaction contexts. Although prior work has explored AI assistants' potential benefits, further research is needed to understand how perceived usefulness and risk shape adoption and engagement. This paper examines these dynamics and advocates for participatory design approaches that position older adults as active decision makers in shaping AI assistant functionality. By advancing a framework for privacy-aware, user-centered AI design, this work contributes to ongoing discussions on developing ethical and transparent AI systems that enhance well-being without compromising user control.

Paperid: 2210, https://arxiv.org/pdf/2505.02842.pdf

Abstract:
This study investigates the efficiency and safety outcomes of implementing different adaptive coordination models for automated vehicle (AV) fleets, managed by a centralized coordinator that dynamically responds to human-controlled vehicle behavior. The simulated scenarios replicate an underground mining environment characterized by narrow tunnels with limited connectivity. To address the unique challenges of such settings, we propose a novel metric - Path Overlap Density (POD) - to predict the efficiency and potentially the safety performance of AV fleets. The study also explores the impact of map features on AV fleet performance. The results demonstrate that both AV fleet coordination strategies and underground tunnel network characteristics significantly influence overall system performance. While map features are critical for optimizing efficiency, adaptive coordination strategies are essential for ensuring safe operations.

Paperid: 2211, https://arxiv.org/pdf/2505.02802.pdf

Abstract:
To combat climate change, individuals are encouraged to adopt sustainable habits, in particular within their households by optimizing their electrical consumption. Conversational agents, such as Smart Home Assistants, hold promise as effective tools for promoting sustainable practices within households. Our research investigated the application of Large Language Models (LLMs) in enhancing smart home automation and promoting sustainable household practices, specifically using the HomeAssistant framework. In particular, it highlights the potential of GPT models in generating accurate automation routines. While the LLMs showed proficiency in understanding complex commands and creating valid JSON outputs, challenges such as syntax errors and message malformations were noted, indicating areas for further improvement. Still, despite minimal quantitative differences between "green" and "no green" prompts, qualitative feedback highlighted a positive shift towards sustainability in the routines generated with environmentally focused prompts. An empirical evaluation (N=56) then demonstrated that the system was well-received and found engaging by users compared to its traditional rule-based counterpart. Our findings highlight the role of LLMs in advancing smart home technologies and suggest further research to refine these models for broader, real-world applications to support sustainable living.
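
The two failure modes mentioned above, syntax errors and malformed messages, can be separated with a small check on the model's raw output. The sketch below is only an illustration under stated assumptions: the helper `check_routine` and the required key set (`alias`, `trigger`, `action`) are hypothetical choices for this example, not the schema or validation used in the paper.

```python
import json

# Hypothetical minimal key set for an automation routine (an assumption).
REQUIRED_KEYS = {"alias", "trigger", "action"}

def check_routine(raw):
    """Classify an LLM-generated routine as ok, a syntax error, or malformed.

    Returns a (is_valid, message) tuple.
    """
    try:
        routine = json.loads(raw)
    except json.JSONDecodeError as err:
        # The text is not parseable JSON at all.
        return False, f"syntax error: {err.msg}"
    if not isinstance(routine, dict):
        return False, "malformed: expected a JSON object"
    missing = REQUIRED_KEYS - routine.keys()
    if missing:
        # Parseable JSON, but not shaped like a routine.
        return False, "malformed: missing " + ", ".join(sorted(missing))
    return True, "ok"

good = '{"alias": "Eco lights", "trigger": {"platform": "sun"}, "action": []}'
truncated = '{"alias": "Eco lights"'
incomplete = '{"alias": "Eco lights"}'

results = [check_routine(s) for s in (good, truncated, incomplete)]
```

Counting how often each category occurs per prompt type is one simple way such outputs could be compared quantitatively.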

Paperid: 2212, https://arxiv.org/pdf/2505.02558.pdf

Abstract:
The Turing Test, first proposed by Alan Turing in 1950, has historically served as a benchmark for evaluating artificial intelligence (AI). However, since the release of ELIZA in 1966, and particularly with recent advancements in large language models (LLMs), AI has been claimed to pass the Turing Test. Furthermore, criticism argues that the Turing Test primarily assesses deceptive mimicry rather than genuine intelligence, prompting the continuous emergence of alternative benchmarks. This study argues against discarding the Turing Test, proposing instead using more refined versions of it, for example, by interacting simultaneously with both an AI and human candidate to determine who is who, allowing a longer interaction duration, access to the Internet and other AIs, using experienced people as evaluators, etc. Through systematic experimentation using a web-based platform, we demonstrate that richer, contextually structured testing environments significantly enhance participants' ability to differentiate between AI and human interactions. Namely, we show that, while an off-the-shelf LLM can pass some version of a Turing Test, it fails to do so when faced with a more robust version. Our findings highlight that the Turing Test remains an important and effective method for evaluating AI, provided it continues to adapt as AI technology advances. Additionally, the structured data gathered from these improved interactions provides valuable insights into what humans expect from truly intelligent AI systems.

Paperid: 2213, https://arxiv.org/pdf/2505.02329.pdf

Abstract:
Algorithmic management (AM)'s impact on worker well-being has led to calls for regulation. However, little is known about the effectiveness and challenges in real-world AM regulation across the regulatory process -- rule operationalization, software use, and enforcement. Our multi-stakeholder study addresses this gap within workplace scheduling, one of the few AM domains with implemented regulations. We interviewed 38 stakeholders across the regulatory process: regulators, defense attorneys, worker advocates, managers, and workers. Our findings suggest that the efficacy of AM regulation is influenced by: (i) institutional constraints that challenge efforts to encode law into AM software, (ii) on-the-ground use of AM software that shapes its ability to facilitate compliance, (iii) mismatches between software and regulatory contexts that hinder enforcement, and (iv) unique concerns that software introduces when used to regulate AM. These findings underscore the importance of a sociotechnical approach to AM regulation, which considers organizational and collaborative contexts alongside the inherent attributes of software. We offer future research directions and implications for technology policy and design.

Paperid: 2214, https://arxiv.org/pdf/2505.01679.pdf

Abstract:
Runway and taxiway incursions continue to challenge aviation safety, as pilots often experience disorientation from poor visibility in adverse conditions and cognitive workload in complex airport layouts. Current tools, such as airport moving maps on portable tablets, allow manual route planning but do not dynamically adapt to air traffic controllers' (ATCOs) clearances, limiting their effectiveness in high-stress scenarios. This study investigates the impact of different input modalities - paper-based, keyboard touch, map touch, and speech-to-text - on taxiway navigation performance, using a medium-fidelity flight simulator and a Wizard-of-Oz methodology to simulate ideal automation conditions. Contrary to common assumptions, recent studies indicate that paper-based methods outperform digital counterparts in accuracy and efficiency under certain conditions, highlighting critical limitations in current automation strategies. In response, our study investigates why manual methods may excel and how future automation can be optimized for pilot-centered operations. Employing a Wizard-of-Oz approach, we replicated the full taxiing process - from receiving ATCO clearances to executing maneuvers - and differentiated between readback and execution accuracy. Findings reveal that speech-based systems suffer from low pilot trust, necessitating hybrid solutions that integrate error correction and confidence indicators. These insights contribute to the development of future pilot-centered taxiway assistance that enhances situational awareness, minimizes workload, and improves overall operational safety.

Paperid: 2215, https://arxiv.org/pdf/2505.01372.pdf

Abstract:
Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.

Paperid: 2216, https://arxiv.org/pdf/2505.00956.pdf

Abstract:
We introduce Audio Personas, enabling users to "decorate" themselves with body-anchored sounds in audio augmented reality. Like outfits, makeup, and fragrances, audio personas offer an alternative yet dynamic channel to augment face-to-face interactions. For instance, one can set their audio persona as rain sounds to reflect a bad mood, bee sounds to establish personal boundaries, or a playful "woosh" sound to mimic passing by someone like a breeze. To instantiate the concept, we implemented a headphone-based prototype with multi-user tracking and audio streaming. Our preregistered in-lab study with 64 participants showed that audio personas influenced how participants formed impressions. Individuals with positive audio personas were rated as more socially attractive, more likable, and less threatening than those with negative audio personas. Our study with audio designers revealed that audio personas were preferred in public and semi-public-private spaces for managing social impressions (e.g., personality) and signaling current states (e.g., emotions).

Paperid: 2217, https://arxiv.org/pdf/2505.00907.pdf

Abstract:
Navigating and visualizing multilayered knowledge graphs remains a challenging, unresolved problem in information systems design. Building on our earlier study, which engaged end users in both the design and population of a domain-specific knowledge graph, we now focus on translating their insights into actionable interface guidelines. In this paper, we synthesize recommendations drawn from a participatory workshop with doctoral students. We then demonstrate how these recommendations inform the design of a prototype interface. Finally, we found that a participatory iterative design approach can help designers in decision making, leading to interfaces that are both innovative and user-centric. By combining user-driven requirements with proven visualization techniques, this paper presents a coherent framework for guiding future development of knowledge-graph navigation tools.

Paperid: 2218, https://arxiv.org/pdf/2505.00821.pdf

Abstract:
AI-supported writing technologies (AISWT) that provide grammatical suggestions, autocomplete sentences, or generate and rewrite text are now a regular feature integrated into many people's workflows. However, little is known about how people perceive the suggestions these tools provide. In this paper, we investigate how Black American users perceive AISWT, motivated by prior findings in natural language processing that highlight how the underlying large language models can contain racial biases. Using interviews and observational user studies with 13 Black American users of AISWT, we found a strong tradeoff between the perceived benefits of using AISWT to enhance their writing style and feeling like "it wasn't built for us". Specifically, participants reported AISWT's failure to recognize commonly used names and expressions in African American Vernacular English, experiencing its corrections as hurtful and alienating and fearing it might further minoritize their culture. We end with a reflection on the tension between AISWT that fail to include Black American culture and language, and AISWT that attempt to mimic it, with attention to accuracy, authenticity, and the production of social difference.

Paperid: 2219, https://arxiv.org/pdf/2504.21731.pdf

Abstract:
Mixed Reality (MR) could assist users' tasks by continuously integrating virtual content with their view of the physical environment. However, where and how to place this content to best support users has been a challenging problem due to the dynamic nature of MR experiences. In contrast to prior work that investigates optimization-based methods, we explore how reinforcement learning (RL) could assist with continuous 3D content placement that is aware of users' poses and their surrounding environments. Through an initial exploration and preliminary evaluation, our results demonstrate the potential of RL to position content that maximizes the reward for users on the go. We further identify future directions for research that could harness the power of RL for personalized and optimized UI and content placement in MR.

Paperid: 2220, https://arxiv.org/pdf/2504.20761.pdf

Abstract:
Robotic-assisted procedures offer enhanced precision, but fully autonomous systems are limited by incomplete task knowledge, difficulties in modeling unstructured environments, and weak generalisation abilities, while fully manual teleoperated systems face challenges such as delay, stability, and reduced sensory information. To address these, we developed an interactive control strategy that assists the human operator by predicting their motion plan at both high and low levels. At the high level, a surgeme recognition system is employed through a Transformer-based real-time gesture classification model to dynamically adapt to the operator's actions, while at the low level, a Confidence-based Intention Assimilation Controller adjusts robot actions based on user intent and shared control paradigms. The system is built around a robotic suturing task, supported by sensors that capture the kinematics of the robot and task dynamics. Experiments across users with varying skill levels demonstrated the effectiveness of the proposed approach, showing statistically significant improvements in task completion time and user satisfaction compared to traditional teleoperation.

Paperid: 2221, https://arxiv.org/pdf/2504.20656.pdf

Abstract:
Federated learning (FL) is a machine learning approach that allows multiple devices or institutions to collaboratively train a model without sharing their local data with a third party. FL is considered a promising way to address patient privacy concerns in medical artificial intelligence. The ethical risks of medical FL systems themselves, however, have thus far been underexamined. This paper aims to address this gap. We argue that medical FL presents a new variety of opacity -- federation opacity -- that, in turn, generates a distinctive double black box problem in healthcare AI. We highlight several instances in which the anticipated benefits of medical FL may be exaggerated, and conclude by highlighting key challenges that must be overcome to make FL ethically feasible in medicine.

Paperid: 2222, https://arxiv.org/pdf/2504.20369.pdf

Abstract:
Visualizing data is often a crucial first step in data analytics workflows, but growing data sizes pose challenges due to computational and visual perception limitations. As a result, data analysts commonly down-sample their data and work with subsets. Deriving representative samples, however, remains a challenge. This paper focuses on scatterplots, a widely-used visualization type, and introduces a novel sampling objective -- perception-awareness -- aiming to improve sample efficacy by targeting humans' perception of a visualization. We make the following contributions: (1) We propose perception-augmented databases and design PAwS: a novel perception-aware sampling method for scatterplots that leverages saliency maps -- a computer vision tool for predicting areas of attention focus in visualizations -- and models perception-awareness via saliency, density, and coverage objectives. (2) We design ApproPAwS: a fast, perception-aware method for approximate visualizations, which exploits the fact that small visual perturbations are often imperceptible to humans. (3) We introduce the concept of perceptual similarity as a metric for sample quality, and present a novel method that compares saliency maps to measure it. (4) Our extensive experimental evaluation shows that our methods consistently outperform prior art in producing samples with high perceptual similarity, while ApproPAwS achieves up to 100x speed-ups with minimal loss in visual fidelity. Our user study shows that PAwS is often preferred by humans, validating our quantitative findings.

Paperid: 2223, https://arxiv.org/pdf/2504.19728.pdf

Abstract:
The remote human operator's user interface (UI) is an important link to make the robot an efficient extension of the operator's perception and action. In rescue applications, several studies have investigated the design of operator interfaces based on observations during major robotics competitions or field deployments. Based on this research, guidelines for good interface design were empirically identified. The investigations on the UIs of teams participating in competitions are often based on external observations during UI application, which may miss some relevant requirements for UI flexibility. In this work, we present an open-source and flexibly configurable user interface based on established guidelines and its exemplary use for wheeled, tracked, and walking robots. We explain the design decisions and cover the insights we have gained during its highly successful applications in multiple robotics competitions and evaluations. The presented UI can also be adapted for other robots with little effort and is available as open source.

Paperid: 2224, https://arxiv.org/pdf/2504.18449.pdf

Abstract:
Bias is an inherent threat to human decision-making, including in decisions made during software development. Extensive research has demonstrated the presence of biases at various stages of the software development life-cycle. Notably, code reviews are highly susceptible to prejudice-induced biases, and individuals are often unaware of these biases as they occur. Developing methods to automatically detect these biases is crucial for addressing the associated challenges. Recent advancements in visual data analytics have shown promising results in detecting potential biases by analyzing user interaction patterns. In this project, we propose a controlled experiment to extend this approach to detect potentially biased outcomes in code reviews by observing how reviewers interact with the code. We employ the "spotlight model of attention", a cognitive framework where a reviewer's gaze is tracked to determine their focus areas on the review screen. This focus, identified through gaze tracking, serves as an indicator of the reviewer's areas of interest or concern. We plan to analyze the sequence of gaze focus using advanced sequence modeling techniques, including Markov Models, Recurrent Neural Networks (RNNs), and Conditional Random Fields (CRF). These techniques will help us identify patterns that may suggest biased interactions. We anticipate that the ability to automatically detect potentially biased interactions in code reviews will significantly reduce unnecessary push-backs, enhance operational efficiency, and foster greater diversity and inclusion in software development. This approach not only helps in identifying biases but also in creating a more equitable development environment by mitigating these biases effectively.
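As an illustration of the simplest of the sequence models named above, a first-order Markov model over gaze focus areas can be estimated by counting transitions between consecutive fixations. The sketch below is not the authors' pipeline; the area labels ("code", "author", "comments") and the function name `transition_probs` are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

def transition_probs(sequence):
    """Estimate first-order Markov transition probabilities from a label sequence."""
    counts = defaultdict(Counter)
    # Count each observed (current area -> next area) transition.
    for src, dst in zip(sequence, sequence[1:]):
        counts[src][dst] += 1
    # Normalize each row of counts into probabilities.
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }

# Hypothetical gaze sequence over screen areas (illustrative labels only).
gaze = ["code", "author", "code", "code", "comments", "code", "author"]
probs = transition_probs(gaze)
print(probs["code"])
```

Transition tables estimated this way for different review sessions could then be compared, although which differences would indicate bias is a question the abstract leaves to the planned experiment.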

Paperid: 2225, https://arxiv.org/pdf/2504.18310.pdf

Abstract:
Using basic health statements authorized by UK and EU registers and 9,100 journalist-vetted public-health assertions on topics such as abortion, COVID-19 and politics from sources ranging from peer-reviewed journals and government advisories to social media and news across the political spectrum, we benchmark six leading large language models in 21 languages, finding that, despite high accuracy on English-centric textbook claims, performance falls in multiple non-European languages and fluctuates by topic and source, highlighting the urgency of comprehensive multilingual, domain-aware validation before deploying AI in global health communication.

Paperid: 2226, https://arxiv.org/pdf/2504.17823.pdf

Abstract:
While analysing challenges in pilot projects developing AI with marginalized communities, we found it difficult to express them within commonly used paradigms. We therefore constructed an alternative conceptual framework to ground AI development in the social fabric -- the Cloud Weaving Model -- inspired (amongst others) by indigenous knowledge, motifs from nature, and Eastern traditions. This paper introduces and elaborates on the fundamental elements of the model (clouds, spiders, threads, spiderwebs, and weather) and their interpretation in an AI context. The framework is then applied to comprehend patterns observed in co-creation pilots approaching marginalized communities, highlighting neglected yet relevant dimensions for responsible AI development.

Paperid: 2227, https://arxiv.org/pdf/2504.17393.pdf

Abstract:
Artificial Intelligence (AI) has become an important part of our everyday lives, yet user requirements for designing AI-assisted systems in law enforcement remain unclear. To address this gap, we conducted qualitative research on decision-making within a law enforcement agency. Our study aimed to identify limitations of existing practices, explore user requirements and understand the responsibilities that humans expect to undertake in these systems. Participants in our study highlighted the need for a system capable of processing and analysing large volumes of data efficiently to help in crime detection and prevention. Additionally, the system should satisfy requirements for scalability, accuracy, justification, trustworthiness and adaptability to be adopted in this domain. Participants also emphasised the importance of having end users review the input data that might be challenging for AI to interpret, and validate the generated output to ensure the system's accuracy. To keep up with the evolving nature of the law enforcement domain, end users need to help the system adapt to the changes in criminal behaviour and government guidance, and technical experts need to regularly oversee and monitor the system. Furthermore, user-friendly human interaction with the system is essential for its adoption and some of the participants confirmed they would be happy to be in the loop and provide necessary feedback that the system can learn from. Finally, we argue that it is very unlikely that the system will ever achieve full automation due to the dynamic and complex nature of the law enforcement domain.

Paperid: 2228, https://arxiv.org/pdf/2504.16573.pdf

Abstract:
Psychological counseling is a highly personalized and dynamic process that requires therapists to continuously monitor emotional changes, document session insights, and maintain therapeutic continuity. In this paper, we introduce PsyCounAssist, a comprehensive AI-powered counseling assistant system specifically designed to augment psychological counseling practices. PsyCounAssist integrates multimodal emotion recognition combining speech and photoplethysmography (PPG) signals for accurate real-time affective analysis, automated structured session reporting using large language models (LLMs), and personalized AI-generated follow-up support. Deployed on Android-based tablet devices, the system demonstrates practical applicability and flexibility in real-world counseling scenarios. Experimental evaluation confirms the reliability of PPG-based emotional classification and highlights the system's potential for non-intrusive, privacy-aware emotional support. PsyCounAssist represents a novel approach to ethically and effectively integrating AI into psychological counseling workflows.

Paperid: 2229, https://arxiv.org/pdf/2504.16546.pdf

Abstract:
The ascent of scaling in artificial intelligence research has revolutionized the field over the past decade, yet it presents significant challenges for academic researchers, particularly in computational social science and critical algorithm studies. The dominance of large language models, characterized by their extensive parameters and costly training processes, creates a disparity where only industry-affiliated researchers can access these resources. This imbalance restricts academic researchers from fully understanding their tools, leading to issues like reproducibility in computational social science and a reliance on black-box metaphors in critical studies. To address these challenges, we propose a "tinkering" approach that is inspired by existing works. This method involves engaging with smaller models or components that are manageable for ordinary researchers, fostering hands-on interaction with algorithms. We argue that tinkering is both a way of making and knowing for computational social science and a way of knowing for critical studies, and fundamentally, it is a way of caring that has broader implications for both fields.

Paperid: 2230, https://arxiv.org/pdf/2504.16378.pdf

Abstract:
In affective computing, recognizing users' emotions accurately is the basis of affective human-computer interaction. Understanding users' interoception contributes to a better understanding of individually different emotional abilities, which is essential for achieving inter-individually accurate emotion estimation. However, existing interoception measurement methods, such as the heart rate discrimination task, have several limitations, including their dependence on a well-controlled laboratory environment and precision apparatus, making monitoring users' interoception challenging. This study aims to determine other forms of data that can explain users' interoceptive or similar states in their real-world lives and proposes a novel hypothetical concept, "cyberoception," a new sense (1) which has properties similar to interoception in terms of the correlation with other emotion-related abilities, and (2) which can be measured only by the sensors embedded inside commodity smartphone devices in users' daily lives. Results from a 10-day-long in-lab/in-the-wild hybrid experiment reveal a specific cyberoception type, "Turn On" (users' subjective sensory perception about the frequency of turning-on behavior on their smartphones), significantly related to participants' emotional valence. We anticipate that cyberoception will serve as a fundamental building block for developing more "emotion-aware", user-friendly applications and services.

Paperid: 2231, https://arxiv.org/pdf/2504.16031.pdf

Abstract:
This paper addresses the concept of materiality in virtual environments, which we define as being composed of objects that can influence user experience actively. Such virtual materiality is closely related to its physical counterpart, which is discussed in theoretical frameworks such as sociomateriality and actor-network theory. They define phenomena in terms of the entanglement of human and non-human elements. We report on an early investigation of virtual materiality within the context of reflection and perspective change in nature-based virtual environments. We considered the case of university students reflecting on the planning and management of their theses and major projects. Inspired by nature's known positive cognitive and affective effects and repeated questioning processes, we established a virtual reflection intervention to demonstrate the environmental mechanisms and material characteristics relevant to virtual materiality. Our work is a preliminary step toward understanding virtual materiality and its implications for research and the design of virtual environments.

Paperid: 2232, https://arxiv.org/pdf/2504.15886.pdf

Abstract:
As robots become increasingly involved in decision-making processes (e.g., personnel selection), concerns about fairness and social inclusion arise. This study examines social exclusion in robot-led group interviews by robot Ameca, exploring the relationship between objective exclusion (robot's attention allocation), subjective exclusion (perceived exclusion), mood change, and need fulfillment. In a controlled lab study (N = 35), higher objective exclusion significantly predicted subjective exclusion. In turn, subjective exclusion negatively impacted mood and need fulfillment but only mediated the relationship between objective exclusion and need fulfillment. A piecewise regression analysis identified a critical threshold at which objective exclusion begins to be perceived as subjective exclusion. Additionally, the standing position was the primary predictor of exclusion, whereas demographic factors (e.g., gender, height) had no significant effect. These findings underscore the need to consider both objective and subjective exclusion in human-robot interactions and have implications for fairness in robot-assisted hiring processes.
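To make the idea of a piecewise regression threshold concrete, the following minimal sketch grid-searches a single breakpoint for a continuous two-segment linear fit. It runs on synthetic data only and is not the study's analysis; the function name `fit_piecewise`, the candidate grid, and the hinge parameterization are assumptions made for this example.

```python
import numpy as np

def fit_piecewise(x, y, candidates):
    """Grid-search one breakpoint for a continuous two-segment linear fit."""
    best = None
    for c in candidates:
        # Design matrix: intercept, base slope, and a hinge term active past c.
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - c, 0.0)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ coef) ** 2))
        # Keep the candidate breakpoint with the smallest squared error.
        if best is None or sse < best[0]:
            best = (sse, c, coef)
    return best[1], best[2]

# Synthetic data (not from the study): flat until 0.4, then rising.
x = np.linspace(0.0, 1.0, 101)
y = np.where(x > 0.4, 2.0 * (x - 0.4), 0.0)
threshold, coef = fit_piecewise(x, y, np.linspace(0.1, 0.9, 81))
print(round(threshold, 2), np.round(coef, 3))
```

Here the outcome stays flat below the breakpoint and changes slope above it, which is the kind of pattern a threshold analysis looks for; with real, noisy data the breakpoint estimate would carry uncertainty that this sketch does not quantify.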

Paperid: 2233, https://arxiv.org/pdf/2504.15859.pdf

Abstract:
We propose a half-day workshop at IEEE VIS 2025 on addressing the emerging challenges in data-rich multimodal remote collaboration. We focus on synchronous, remote, and hybrid settings where people take part in tasks such as data analysis, decision-making, and presentation. With this workshop, we continue successful prior work from the first MERCADO workshop at VIS 2023 and a 2024 Shonan Seminar that followed. Based on the findings of the earlier events, we invite research and ideas related to four themes of challenges: Tools & Technologies, Individual Differences & Interpersonal Dynamics, AI-assisted Collaboration, and Evaluation. With this workshop, we aim to broaden the community, foster new collaborations, and develop a research agenda to address these challenges in future research. Our planned workshop format is comprised of a keynote, short presentations, a breakout group session, and discussions organized around the identified challenges.

Paperid: 2234, https://arxiv.org/pdf/2504.15549.pdf

Abstract:
Large Language Model (LLM)-based in-application assistants, or copilots, can automate software tasks, but users often prefer learning by doing, raising questions about the optimal level of automation for an effective user experience. We investigated two automation paradigms by designing and implementing a fully automated copilot (AutoCopilot) and a semi-automated copilot (GuidedCopilot) that automates trivial steps while offering step-by-step visual guidance. In a user study (N=20) across data analysis and visual design tasks, GuidedCopilot outperformed AutoCopilot in user control, software utility, and learnability, especially for exploratory and creative tasks, while AutoCopilot saved time for simpler visual tasks. A follow-up design exploration (N=10) enhanced GuidedCopilot with task- and state-aware features, including in-context preview clips and adaptive instructions. Our findings highlight the critical role of user control and tailored guidance in designing the next generation of copilots that enhance productivity, support diverse skill levels, and foster deeper software engagement.

Paperid: 2235, https://arxiv.org/pdf/2504.14571.pdf

Abstract:
As Large Language Models (LLMs) become increasingly embedded in empirical research workflows, their use as analytical tools for quantitative or qualitative data raises pressing concerns for scientific integrity. This opinion paper draws a parallel between "prompt-hacking", the strategic tweaking of prompts to elicit desirable outputs from LLMs, and the well-documented practice of "p-hacking" in statistical analysis. We argue that the inherent biases, non-determinism, and opacity of LLMs make them unsuitable for data analysis tasks demanding rigor, impartiality, and reproducibility. We emphasize how researchers may inadvertently, or even deliberately, adjust prompts to confirm hypotheses while undermining research validity. We advocate for a critical view of using LLMs in research, transparent prompt documentation, and clear standards for when LLM use is appropriate. We discuss how LLMs can replace traditional analytical methods, whereas we recommend that LLMs should only be used with caution, oversight, and justification.

Paperid: 2236, https://arxiv.org/pdf/2504.14427.pdf

Abstract:
This case study presents our user-centered design model for Socially Intelligent Agent (SIA) development frameworks through our experience developing Estuary, an open source multimodal framework for building low-latency real-time socially interactive agents. We leverage the Rapid Assessment Process (RAP) to collect the thoughts of leading researchers in the field of SIAs regarding the current state of the art for SIA development as well as their evaluation of how well Estuary may potentially address current research gaps. We achieve this through a series of end-user interviews conducted by a fellow researcher in the community. We hope that the findings of our work will not only assist the continued development of Estuary but also guide the development of other future frameworks and technologies for SIAs.

Paperid: 2237, https://arxiv.org/pdf/2504.14320.pdf

Abstract:
Text-based prompting remains the predominant interaction paradigm in generative AI, yet it often introduces friction for novice users such as small business owners (SBOs), who struggle to articulate creative goals in domain-specific contexts like advertising. Through a formative study with six SBOs in the United Kingdom, we identify three key challenges: difficulties in expressing brand intuition through prompts, limited opportunities for fine-grained adjustment and refinement during and after content generation, and the frequent production of generic content that lacks brand specificity. In response, we present ACAI (AI Co-Creation for Advertising and Inspiration), a multimodal generative AI tool designed to support novice designers by moving beyond traditional prompt interfaces. ACAI features a structured input system composed of three panels: Branding, Audience and Goals, and the Inspiration Board. These inputs allow users to convey brand-relevant context and visual preferences. This work contributes to HCI research on generative systems by showing how structured interfaces can foreground user-defined context, improve alignment, and enhance co-creative control in novice creative workflows.

Paperid: 2238, https://arxiv.org/pdf/2504.13952.pdf

Abstract:
Tourist crowding degrades the visitor experience and negatively impacts the environment and the local population, potentially making tourism in popular destinations unsustainable. This motivated us to develop, within the framework of the European RESETTING project related to the digital transformation of tourism, a platform to visualize this crowding, exploring historical data, detecting patterns and trends and predicting future events. The ultimate goal is to support short- and medium-term decision-making to mitigate the phenomenon. To this end, the platform takes into account the carrying capacity of the target sites when calculating crowding density. The integration of data from different sources is achieved with an extensible, connector-based architecture. Three scenarios for using the platform are described, relating to major annual crowding events. Two of them, in the municipality of Lisbon, are based on data from a mobile network provided by the LxDataLab initiative. The third, in Melbourne, Australia, using public data from a network of movement sensors called the Pedestrian Counting System. An experiment to evaluate the usability of the proposed platform using NASA-TLX is also described.

Paperid: 2239, https://arxiv.org/pdf/2504.13937.pdf

Abstract:
We introduce a novel auditory brain-computer interface (BCI) paradigm, Auditory Intention Decoding (AID), designed to enhance communication capabilities within the brain-AI interface (BAI) system EEGChat. AID enables users to select among multiple auditory options (intentions) by analyzing their brain responses, offering a pathway to construct a communication system that requires neither muscle movement nor syntactic formation. To evaluate the feasibility of this paradigm, we conducted a proof-of-concept study. The results demonstrated statistically significant decoding performance, validating the approach's potential. Despite these promising findings, further optimization is required to enhance system performance and realize the paradigm's practical application.

Paperid: 2240, https://arxiv.org/pdf/2504.13926.pdf

Abstract:
The integration of Artificial Intelligence (AI) into high-stakes domains such as healthcare, finance, and autonomous systems is often constrained by concerns over transparency, interpretability, and trust. While Human-Centered AI (HCAI) emphasizes alignment with human values, Explainable AI (XAI) enhances transparency by making AI decisions more understandable. However, the lack of a unified approach limits AI's effectiveness in critical decision-making scenarios. This paper presents a novel three-layered framework that bridges HCAI and XAI to establish a structured explainability paradigm. The framework comprises (1) a foundational AI model with built-in explainability mechanisms, (2) a human-centered explanation layer that tailors explanations based on cognitive load and user expertise, and (3) a dynamic feedback loop that refines explanations through real-time user interaction. The framework is evaluated across healthcare, finance, and software development, demonstrating its potential to enhance decision-making, regulatory compliance, and public trust. Our findings advance Human-Centered Explainable AI (HCXAI), fostering AI systems that are transparent, adaptable, and ethically aligned.

Paperid: 2241, https://arxiv.org/pdf/2504.13909.pdf

Abstract:
We propose and create an incentive-based recommendation algorithm aimed at improving the lifestyle of diabetic patients. This algorithm is integrated into a real-world mobile application to provide personalized health recommendations. Initially, users enter data such as step count, calorie intake, gender, age, weight, height and blood glucose levels. Once the data is preprocessed, the app identifies personalized health and glucose management goals. The recommendation engine suggests exercise routines and dietary adjustments based on these goals. As users achieve their goals and follow these recommendations, they receive incentives, encouraging adherence and promoting positive health outcomes. Furthermore, the mobile application allows users to monitor their progress through descriptive analytics, which displays their daily activities and health metrics in graphical form. To evaluate the proposed methodology, a three-week study was conducted with 10 participants with type 2 diabetes. The participants were recruited through advertisements and health expert references. The application was installed on each patient's phone for use over the three weeks. A health expert also took part in the study by monitoring the patients' health records. To assess the algorithm's performance, we computed efficiency and proficiency. The algorithm showed proficiency and efficiency scores of 90% and 92%, respectively. Similarly, we assessed user experience with the application in terms of attractiveness and hedonic and pragmatic quality, involving 35 people in the study. The results indicated an overall positive user response. The findings show a clear positive correlation between exercise and rewards, with noticeable improvements observed in user outcomes after exercise.

Paperid: 2242, https://arxiv.org/pdf/2504.13908.pdf

Abstract:
Standardized surveys scale efficiently but sacrifice depth, while conversational interviews improve response quality at the cost of scalability and consistency. This study bridges the gap between these methods by introducing a framework for AI-assisted conversational interviewing. To evaluate this framework, we conducted a web survey experiment where 1,800 participants were randomly assigned to AI 'chatbots' which use large language models (LLMs) to dynamically probe respondents for elaboration and interactively code open-ended responses to fixed questions developed by human researchers. We assessed the AI chatbot's performance in terms of coding accuracy, response quality, and respondent experience. Our findings reveal that AI chatbots perform moderately well in live coding even without survey-specific fine-tuning, despite slightly inflated false positive errors due to respondent acquiescence bias. Open-ended responses were more detailed and informative, but this came at a slight cost to respondent experience. Our findings highlight the feasibility of using AI methods such as LLM-powered chatbots to enhance open-ended data collection in web surveys.

Paperid: 2243, https://arxiv.org/pdf/2504.13881.pdf

Abstract:
When children are anxious or scared, it can be hard for them to stay still or follow instructions during medical procedures, making the process more challenging and affecting procedure results. This is particularly true for radiological procedures, where long scan times, confined spaces, and loud noises can cause children to move, significantly impacting scan quality. To this end, sometimes children are sedated, but doctors are constantly seeking alternative non-pharmacological solutions. This work aims to explore how social robots could assist in preparing children for radiological procedures. We have conducted a focus group discussion with five hospital stakeholders, namely radiographers, paediatricians, and clinical engineers, to explore (i) the context regarding children's preparation for radiological procedures, hence their needs and how children are currently prepared, and (ii) the potential role of social robots in this process. The discussion was transcribed and analysed using thematic analysis. Among our findings, we identified three potential roles for a social robot in this preparation process: offering infotainment in the waiting room, acting as a guide within the hospital, and assisting radiographers in preparing children for the procedure. We hope that insights from this study will inform the design of social robots for pediatric healthcare.

Paperid: 2244, https://arxiv.org/pdf/2504.13873.pdf

Abstract:
This paper introduces the Translational Evaluation of Multimodal AI for Inspection (TEMAI) framework, bridging multimodal AI capabilities with industrial inspection implementation. Adapting translational research principles from healthcare to industrial contexts, TEMAI establishes three core dimensions: Capability (technical feasibility), Adoption (organizational readiness), and Utility (value realization). The framework demonstrates that technical capability alone yields limited value without corresponding adoption mechanisms. TEMAI incorporates specialized metrics including the Value Density Coefficient and structured implementation pathways. Empirical validation through retail and photovoltaic inspection implementations revealed significant differences in value realization patterns despite similar capability reduction rates, confirming the framework's effectiveness across diverse industrial sectors while highlighting the importance of industry-specific adaptation strategies.

Paperid: 2245, https://arxiv.org/pdf/2504.13872.pdf

Abstract:
Although intended to foster spontaneous interactions among workers, a typical open-plan office layout cannot mitigate visual, acoustic, or privacy-related distractions that originate from unplanned meetings. As office workers often refrain from tackling these issues by manually demarcating or physically relocating to a more suitable subspace that is enclosed by movable partitions, we hypothesise that these subspaces could instead be robotically manifested. This study therefore evaluated the perceived impact of two mobile robotic partitions that were wizarded to jointly manifest an enclosed subspace, to: 1) either 'mitigate' or 'intervene' in the distractions caused by spontaneous face-to-face or remote meetings; or 2) either 'gesturally' or 'spatially' nudge a distraction-causing worker to relocate. Our findings suggest how robotic furniture should interact with office workers with and through transient space, and autonomously balance the distractions not only for each individual worker but also for multiple workers sharing the same workspace.

Paperid: 2246, https://arxiv.org/pdf/2504.13866.pdf

Abstract:
Physical rehabilitation exercises suggested by healthcare professionals can help recovery from various musculoskeletal disorders and prevent re-injury. However, patients' engagement tends to decrease over time without direct supervision, which is why there is a need for an automated monitoring system. In recent years, there has been great progress in quality assessment of physical rehabilitation exercises. Most existing methods only provide a binary classification of whether the performance is correct or incorrect, and a few provide a continuous score. This information is not sufficient for patients to improve their performance. In this work, we propose an algorithm for error classification of rehabilitation exercises, thus making the first step toward more detailed feedback to patients. We focus on skeleton-based exercise assessment, which utilizes human pose estimation to evaluate motion. Inspired by recent algorithms for quality assessment during rehabilitation exercises, we propose a Transformer-based model for the described classification. Our model is inspired by the HyperFormer method for human action recognition, and adapted to our problem and dataset. The evaluation is done on the KERAAL dataset, as it is the only medical dataset with clear error labels for the exercises, and our model significantly surpasses state-of-the-art methods. Furthermore, we bridge the gap towards better feedback to the patients by presenting a way to calculate the importance of joints for each exercise.

Paperid: 2247, https://arxiv.org/pdf/2504.13862.pdf

Abstract:
Introduced in the early 2010s, Electronic Health Records (EHRs) have become ubiquitous in hospitals. Despite clear benefits, they remain unpopular among healthcare professionals and present significant challenges. Positioned at the intersection of Health Information Systems studies, Computer Supported Collaborative Work (CSCW), Service Design, and Participatory Design (PD), our research investigates how involving users in the co-design of new EHR components within a dedicated hospital space can transform healthcare practices. Through participatory co-design methodologies, including ethnographic observation, collaborative workshops, and realistic simulations, we identify the material and interactional elements essential for rebalancing power dynamics between users and designers. This project contributes to rethinking traditional EHR design approaches, embedding design practice into systemic transformation to genuinely meet healthcare professionals' needs.

Paperid: 2248, https://arxiv.org/pdf/2504.13735.pdf

Abstract:
The purpose of this study was to develop and evaluate a novel virtual reality seated orientation and mobility (VR-S-O&M) test protocol designed to assess functional vision. This study aims to provide a dataset of healthy subjects using this protocol and preliminary analyses. We introduced a VR-based O&M test protocol featuring a novel seated displacement method, diverse lighting conditions, and varying course configurations within a virtual environment. Normally sighted participants (N=42) completed the test, which required them to navigate a path and destroy identified obstacles. We assessed basic performance metrics, including time duration, number of missed objects, and time before the first step, under different environmental conditions to verify ecological validity. Additionally, we analyzed participants' behaviors regarding missed objects, demonstrating the potential of integrating behavioral and interactive data for a more precise functional vision assessment. Our VR-S-O&M test protocol, along with the first O&M behavior dataset, presents significant opportunities for developing more refined performance metrics for assessing functional vision and enhancing the quality of life.

Paperid: 2249, https://arxiv.org/pdf/2504.13421.pdf

Abstract:
Despite the significant increase in popularity of Virtual YouTubers (VTubers), research on the unique dynamics of viewer-VTuber parasocial relationships is nascent. This work investigates how English-speaking viewers grieved VTubers whose identities are no longer used, an interesting context as the nakanohito (i.e., the person behind the VTuber identity) is usually alive post-retirement and might "reincarnate" as another VTuber. We propose a typology for VTuber retirements and analyze 13,655 Reddit posts and comments spanning nearly three years using a mixed-methods approach. Findings include how viewers coped using methods similar to when losing loved ones, alongside novel coping methods reflecting different attachment styles. Although emotions like sadness, shock, concern, disapproval, confusion, and love decreased with time, regret and loyalty showed opposite trends. Furthermore, viewers' reactions situated a VTuber identity within a community of content creators and viewers. We also discuss design implications alongside implications on the VTuber ecosystem and future research directions.

Paperid: 2250, https://arxiv.org/pdf/2504.13334.pdf

Abstract:
The risk of loss of lives and property damage has increased all around the world in recent years as wildfire seasons have become longer and fires have become larger. Knowing how to prepare and evacuate safely is critical, yet it may be daunting for those who have never experienced a wildfire threat before. This paper considers the potential for utilizing virtual reality (VR) technology to prepare people for an evacuation scenario. We discuss the unique affordances of VR for this type of work, as well as the initial steps in creating a training simulation. We also explore the next steps for what a tool like this may mean for the future of evacuation preparedness training.

Paperid: 2251, https://arxiv.org/pdf/2504.12830.pdf

Abstract:
Decision-makers run the risk of relying too much on machine recommendations, which is associated with lower cognitive engagement. Reflection has been shown to increase cognitive engagement and improve critical thinking and therefore decision-making. Questions are a means to stimulate reflection, but there is a research gap regarding the systematic creation and use of relevant questions for machine-assisted decision-making. We therefore present a taxonomy of questions aimed at promoting reflection and cognitive engagement in order to stimulate a deliberate decision-making process. Our taxonomy builds on the Socratic questioning method and a question bank for explainable AI. As a starting point, we focus on clinical decision-making. Brief discussions with two medical and three educational researchers provide feedback on the relevance and expected benefits of our taxonomy. Our work contributes to research on mitigating overreliance in human-AI interactions and aims to support effective human oversight as required by the European AI Act.

Paperid: 2252, https://arxiv.org/pdf/2504.12511.pdf

Abstract:
In this paper, we advance the study of AI-augmented reasoning in the context of Human-Computer Interaction (HCI), psychology and cognitive science, focusing on the critical task of visual perception. Specifically, we investigate the applicability of Multimodal Large Language Models (MLLMs) in this domain. To this end, we leverage established principles and explanations from psychology and cognitive science related to complexity in human visual perception. We use them as guiding principles for the MLLMs to compare and interpret visual content. Our study aims to benchmark MLLMs across various explainability principles relevant to visual perception. Unlike recent approaches that primarily employ advanced deep learning models to predict complexity metrics from visual content, our work does not seek to develop a mere new predictive model. Instead, we propose a novel annotation-free analytical framework to assess the utility of MLLMs as cognitive assistants for HCI tasks, using visual perception as a case study. The primary goal is to pave the way for principled study in quantifying and evaluating the interpretability of MLLMs for applications in improving human reasoning capability and uncovering biases in existing perception datasets annotated by humans.

Paperid: 2253, https://arxiv.org/pdf/2504.12452.pdf

Abstract:
Personal development through self-directed learning is essential in today's fast-changing world, but many learners struggle to manage it effectively. While AI tools like large language models (LLMs) have the potential for personalized learning planning, they face issues such as transparency and hallucinated information. To address this, we propose PlanGlow, an LLM-based system that generates personalized, well-structured study plans with clear explanations and controllability through user-centered interactions. Through mixed methods, we surveyed 28 participants and interviewed 10 before development, followed by a within-subject experiment with 24 participants to evaluate PlanGlow's performance, usability, controllability, and explainability against two baseline systems: a GPT-4o-based system and Khan Academy's Khanmigo. Results demonstrate that PlanGlow significantly improves usability, explainability, and controllability. Additionally, two educational experts assessed and confirmed the quality of the generated study plans. These findings highlight PlanGlow's potential to enhance personalized learning and address key challenges in self-directed learning.

Paperid: 2254, https://arxiv.org/pdf/2504.12211.pdf

Abstract:
In the AI community, benchmarks to evaluate model quality are well established, but an equivalent approach to benchmarking products built upon generative AI models is still missing. This has had two consequences. First, it has made teams focus on model quality over the developer experience, while successful products combine both. Second, product teams have struggled to answer questions about their products in relation to their competitors. In this case study, we share: (1) our process to create robust, enterprise-grade and modular components to support the benchmarking of the developer experience (DX) dimensions of our team's AI for code offerings, and (2) the components we have created to do so, including demographics and attitudes towards AI surveys, a benchmarkable task, and task and feature surveys. By doing so, we hope to lower the barrier to the DX benchmarking of genAI-enhanced code products.

Paperid: 2255, https://arxiv.org/pdf/2504.10662.pdf

Abstract:
In contemporary society, widespread social media usage is evident in people's daily lives. Nevertheless, disparities in emotional expressions between the real world and online platforms can manifest. We comprehensively analyzed the Persian community on X to explore this phenomenon. An innovative pipeline was designed to measure the similarity between emotions in the real world compared to social media. Accordingly, recent tweets and images of participants were gathered and analyzed using Transformers-based text and image sentiment analysis modules. Each participant's friends also provided insights into their real-world emotions. A distance criterion was used to compare real-world feelings with virtual experiences. Our study encompassed N=105 participants, 393 friends who contributed their perspectives, over 8,300 collected tweets, and 2,000 media images. Results indicated a 28.67% similarity between images and real-world emotions, while tweets exhibited a 75.88% alignment with real-world feelings. Additionally, statistical significance testing confirmed the observed disparities in sentiment proportions.
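The abstract above does not say which distance criterion was used. As a minimal sketch only, assuming sentiment is summarized as proportions over (negative, neutral, positive) and that similarity is taken as one minus the total variation distance, the comparison could look like this. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def similarity(p, q):
    """Return 1 minus the total variation distance between two proportion vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Total variation distance: half the L1 distance between the vectors.
    tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return 1.0 - tv


# Hypothetical proportions over (negative, neutral, positive).
friends_view = [0.2, 0.5, 0.3]   # real-world emotions as reported by friends
tweets = [0.25, 0.45, 0.3]       # sentiment of the participant's tweets
images = [0.0, 0.1, 0.9]         # sentiment of the participant's images

tweet_similarity = similarity(friends_view, tweets)
image_similarity = similarity(friends_view, images)
```

With these made-up inputs the tweet distribution sits much closer to the friends' view than the image distribution does, which is the kind of gap the abstract reports.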

Paperid: 2256, https://arxiv.org/pdf/2504.09860.pdf

Abstract:
We propose SUMART, a method for summarizing and compressing the volume of verbose subtitle translations. SUMART is designed for understanding translated captions (e.g., interlingual conversations via subtitle translation or watching movies with foreign-language audio and translated captions). SUMART is intended for users who want a big-picture and fast understanding of the conversation, audio, video content, and speech in a foreign language. During the training data collection, when a speaker makes a verbose statement, SUMART employs a large language model on-site to compress the volume of subtitles. This compressed data is then stored in a database for fine-tuning purposes. Later, SUMART uses data pairs from those non-compressed ASR results and compressed translated results for fine-tuning the translation model to generate more concise translations for practical uses. In practical applications, SUMART utilizes this trained model to produce concise translation results. Furthermore, as a practical application, we developed an application that allows conversations using subtitle translation in augmented reality spaces. As a pilot study, we conducted qualitative surveys using a SUMART prototype and a survey on the summarization model for SUMART. We envision that the most effective use case of this system is where users need to consume a lot of information quickly (e.g., speech, lectures, podcasts, Q&A in conferences).

Paperid: 2257, https://arxiv.org/pdf/2504.09734.pdf

Abstract:
In today's globalized world, there are increasing opportunities for individuals to communicate using a common non-native language (lingua franca). Non-native speakers often have opportunities to listen to foreign languages, but may not comprehend them as fully as native speakers do. To aid real-time comprehension, live transcription of subtitles is frequently used in everyday life (e.g., during Zoom conversations, watching YouTube videos, or on social networking sites). However, simultaneously reading subtitles while listening can increase cognitive load. In this study, we propose Dynamik, a system that reduces cognitive load during reading by decreasing the size of less important words and enlarging important ones, thereby enhancing sentence contrast. Our results indicate that Dynamik can reduce certain aspects of cognitive load, specifically, participants' perceived performance and effort among individuals with low proficiency in English, as well as enhance the users' sense of comprehension, especially among people with low English ability. We further discuss our methods' applicability to other languages and potential improvements and further research directions.
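The abstract above does not specify how word importance is computed or how it maps to size. As a rough illustration only, assuming each word already has an importance score in [0, 1] and that size scales linearly with it, the resizing step might be sketched as follows. The function, its parameters, and the scores are all hypothetical.

```python
def font_sizes(words, importance, base_pt=16.0, min_scale=0.7, max_scale=1.4):
    """Map per-word importance in [0, 1] to a font size in points."""
    sizes = []
    for word, score in zip(words, importance):
        score = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
        # Linear interpolation between the smallest and largest scale.
        scale = min_scale + score * (max_scale - min_scale)
        sizes.append((word, round(base_pt * scale, 1)))
    return sizes


subtitle = ["the", "meeting", "is", "moved", "to", "Friday"]
scores = [0.0, 1.0, 0.0, 0.5, 0.0, 1.0]  # hypothetical importance scores
sized = font_sizes(subtitle, scores)
```

Function words shrink below the base size while content words grow above it, which increases contrast within the sentence without removing any text.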

Paperid: 2258, https://arxiv.org/pdf/2504.09221.pdf

Abstract:
Emotion recognition is an important component of affective computing, and also human-machine interaction. Unimodal emotion recognition is convenient, but the accuracy may not be high enough; on the contrary, multi-modal emotion recognition may be more accurate, but it also increases the complexity and cost of the data collection system. This paper considers cross-modal emotion recognition, i.e., using both electroencephalography (EEG) and eye movement in training, but only EEG or eye movement in test. We propose cross-modal contrastive representation distillation (CMCRD), which uses a pre-trained eye movement classification model to assist the training of an EEG classification model, improving feature extraction from EEG signals, or vice versa. During test, only EEG signals (or eye movement signals) are acquired, eliminating the need for multi-modal data. CMCRD not only improves the emotion recognition accuracy, but also makes the system more simplified and practical. Experiments using three different neural network architectures on three multi-modal emotion recognition datasets demonstrated the effectiveness of CMCRD. Compared with the EEG-only model, it improved the average classification accuracy by about 6.2%.
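The abstract above does not give the exact CMCRD objective. As a generic sketch of contrastive representation distillation, assuming an InfoNCE-style loss in which each student embedding (for example from EEG) should match the teacher embedding (for example from eye movement) of the same trial, one could write the following. The function name, temperature, and random data are illustrative assumptions.

```python
import numpy as np


def info_nce(student, teacher, temperature=0.1):
    """InfoNCE-style loss: row i of `student` should match row i of `teacher`."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature               # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal; all other columns act as negatives.
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(8, 16))  # stand-in for frozen teacher embeddings
aligned = teacher_feats + 0.01 * rng.normal(size=(8, 16))
shuffled = aligned[::-1]                  # pairs deliberately mismatched

loss_aligned = info_nce(aligned, teacher_feats)
loss_shuffled = info_nce(shuffled, teacher_feats)
```

The loss is low when student and teacher embeddings of the same trial agree and high when the pairing is broken, which is the signal a student network would be trained to minimize.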

Paperid: 2259, https://arxiv.org/pdf/2504.09010.pdf

Abstract:
Community empowerment is the process of enabling communities to increase control over their narratives, resources, and futures. In HCI and design, this social challenge centers on helping marginalized groups gain agency through technology and design interventions. For Indigenous communities in particular, empowerment means not only representation but sovereignty in how their stories are told and by whom. Location-based augmented reality (AR) offers a novel opportunity to address this challenge. By overlaying digital content onto physical places, AR can spatially anchor community narratives in the real world, allowing communities to re-tell the story of a place on their own terms. Such site-specific AR experiences have already been used to reveal hidden histories, re-imagine colonial monuments, and celebrate minority cultures. The affordances of XR - particularly AR's spatial interaction and immersive storytelling - make it a promising tool for cultural continuity and community activism. In this position paper, we focus on how these XR affordances can empower communities, using the Thámien Ohlone AR Tour as a case study. We outline why traditional digital interventions fall short of true empowerment, how AR's immersive qualities uniquely support Indigenous self-determination, insights from co-designing the Ohlone AR Tour, and future directions to scale such efforts responsibly.

Paperid: 2260, https://arxiv.org/pdf/2504.08486.pdf

Abstract:
Automatic minimization and optimization of the number of electrodes is essential for the practical application of electroencephalography (EEG)-based brain computer interface (BCI). Previous methods typically require additional training costs or rely on prior knowledge assumptions. This study proposed a novel channel pruning model, plug-and-select (PlugSelect), applicable across a broad range of BCI paradigms with no additional training cost and plug-and-play functionality. It integrates gradients along the input path to globally infer the causal relationships between input channels and outputs, and ranks the contribution sequences to identify the most highly attributed channels. The results showed that for three BCI paradigms, i.e., auditory attention decoding (AAD), motor imagery (MI), and affective computation (AC), PlugSelect could reduce the number of channels by at least half while effectively maintaining decoding performance and improving efficiency. The outcome benefits the design of wearable EEG-based devices, facilitating the practical application of BCI technology.
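Integrating gradients along the input path and ranking channels by attribution can be illustrated with a small sketch. This is not the PlugSelect implementation: it assumes a standard integrated-gradients approximation with a zero baseline, and a toy analytic model stands in for a trained decoder. All names and values are hypothetical.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """Approximate integrated gradients of a scalar output with respect to x.

    grad_fn(x) must return d(output)/dx with the same shape as x.
    """
    if baseline is None:
        baseline = np.zeros_like(x)
    # Midpoint Riemann sum along the straight path from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad


def rank_channels(attributions):
    """Rank channels by total absolute attribution over time, highest first."""
    scores = np.abs(attributions).sum(axis=1)
    return np.argsort(scores)[::-1], scores


# Toy stand-in for a decoder: output = sum_c w[c] * sum_t x[c, t] ** 2.
w = np.array([0.1, 2.0, 0.5, 1.0])


def toy_grad(x):
    return 2.0 * w[:, None] * x


x = np.ones((4, 8))  # 4 channels, 8 time samples
attr = integrated_gradients(toy_grad, x)
order, scores = rank_channels(attr)
keep = order[: len(order) // 2]  # retain the top half of the channels
```

Because attribution is computed from gradients of an already trained model, a selection step of this kind adds no training cost, consistent with the plug-and-play framing in the abstract.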

Paperid: 2261, https://arxiv.org/pdf/2504.08235.pdf

Abstract:
Recent advancements in AI technology have seen researchers and industry professionals actively exploring the application of AI tools in legal workflows. Despite this prevailing trend, legal practitioners found that AI tools had limited effectiveness in supporting everyday tasks, which can be partly attributed to their design. Typically, AI legal tools only offer end-to-end interaction: practitioners can only manipulate the input and output but have no control over the intermediate steps, raising concerns about AI tools' performance and ethical use. To design an effective AI legal tool, as a first step, we explore users' needs with one specific use case: precedent search. Through a qualitative study with five legal practitioners, we uncovered the precedent search workflow, the challenges they face using current systems, and their concerns and expectations regarding AI tools. We conclude our exploration with an initial prototype to reflect the design implications derived from our findings.

Paperid: 2262, https://arxiv.org/pdf/2504.07198.pdf

Abstract:
The human face plays a central role in social communication, necessitating the use of performant computer vision tools for human-centered applications. We propose Face-LLaVA, a multimodal large language model for face-centered, in-context learning, including facial expression and attribute recognition. Additionally, Face-LLaVA is able to generate natural language descriptions that can be used for reasoning. Leveraging existing visual databases, we first developed FaceInstruct-1M, a face-centered database for instruction tuning MLLMs for face processing. We then developed a novel face-specific visual encoder powered by Face-Region Guided Cross-Attention that integrates face geometry with local visual features. We evaluated the proposed method across nine different datasets and five different face processing tasks, including facial expression recognition, action unit detection, facial attribute detection, age estimation and deepfake detection. Face-LLaVA achieves superior results compared to existing open-source MLLMs and competitive performance compared to commercial solutions. Our model output also receives a higher reasoning rating by GPT under a zero-shot setting across all the tasks. Both our dataset and model will be released at https://face-llava.github.io to support future advancements in social AI and foundational vision-language research.

Paperid: 2263, https://arxiv.org/pdf/2504.06771.pdf

Abstract:
How can we design AI tools that effectively support human decision-making by complementing and enhancing users' reasoning processes? Common recommendation-centric approaches face challenges such as inappropriate reliance or a lack of integration with users' decision-making processes. Here, we explore an alternative interaction model in which the AI outputs build upon users' own decision-making rationales. We compare this approach, which we call ExtendAI, with a recommendation-based AI, which we call RecommendAI. Participants in our mixed-methods user study interacted with both AIs as part of an investment decision-making task. We found that the AIs had different impacts, with ExtendAI integrating better into the decision-making process and people's own thinking and leading to slightly better outcomes. RecommendAI was able to provide more novel insights while requiring less cognitive effort. We discuss the implications of these and other findings along with three tensions of AI-assisted decision-making which our study revealed.

Paperid: 2264, https://arxiv.org/pdf/2504.06593.pdf

Abstract:
The growing presence of service robots in human-centric environments, such as warehouses, demands seamless and intuitive human-robot collaboration. In this paper, we propose a collaborative shelf-picking framework that combines multimodal interaction, physics-based reasoning, and task division for enhanced human-robot teamwork. The framework enables the robot to recognize human pointing gestures, interpret verbal cues and voice commands, and communicate through visual and auditory feedback. Moreover, it is powered by a Large Language Model (LLM) which utilizes Chain of Thought (CoT) and a physics-based simulation engine for safely retrieving cluttered stacks of boxes on shelves, relationship graph for sub-task generation, extraction sequence planning and decision making. Furthermore, we validate the framework through real-world shelf picking experiments such as 1) Gesture-Guided Box Extraction, 2) Collaborative Shelf Clearing and 3) Collaborative Stability Assistance.

Paperid: 2265, https://arxiv.org/pdf/2504.06496.pdf

Abstract:
We present an account of an ongoing practice-based Design Research programme that explores the interaction affordances of real-time AI image generators. Based on our experiences from three installations, we reflect on the design of PromptJ, a user interface built around the concept of a prompt mixer. Our first contribution is a series of strong concepts based on our reflections of designing and deploying PromptJ. We cohere and abstract our strong concepts into the notion of Holistic Prompt Craft, which describes the importance of considering all relevant parameters concurrently. Finally, we present PromptTank, a prototype design which exemplifies the principles of Holistic Prompt Craft. Our contributions are articulated as strong concepts or intermediate knowledge that are intended to inform and inspire practitioners and researchers who are designing with image generation models or developing novel interaction paradigms for generative AI systems more generally.

Paperid: 2266, https://arxiv.org/pdf/2504.05934.pdf

Abstract:
Group Recommender Systems (GRSs) have been studied and developed for more than twenty years. However, their application and usage have not grown. They can even be labeled as failures, if compared to the very successful and common recommender systems (RSs) used on all the major ecommerce and social platforms. As a result, the RSs that we all use now are only targeted at individual users, aiming at choosing an item exclusively for themselves; no choice support is provided to groups trying to select a service, a product, an experience, or a person that serves all the group members equally well. In this opinion article we discuss why the success of group recommender systems is lagging and we propose a research program unfolding on the analysis and development of new forms of collaboration between humans and intelligent systems. We define a set of roles, named CAJO, that GRSs should play in order to become more useful tools for group decision making.

Paperid: 2267, https://arxiv.org/pdf/2504.05791.pdf

Abstract:
Leveraging the integration of visual and proprioceptive cues, research has uncovered various perception thresholds in VR that can be exploited to support haptic feedback for grasping. While previous studies have explored individual dimensions, such as size, the combined effect of multiple geometric properties on perceptual illusions remains poorly understood. We present a two-alternative forced choice study investigating the perceptual interplay between object size and taper angle. We introduce an illusion space model, providing detailed insights into how physical and virtual object configurations affect human perception. Our insights reveal how, for example, as virtual sizes increase, users perceive that taper angles increase, and as virtual angles decrease, users overestimate sizes. We provide a mathematical model of the illusion space, and an associated tool, which can be used as a guide for the design of future VR haptic devices and for proxy object selections.

Paperid: 2268, https://arxiv.org/pdf/2504.05325.pdf

Abstract:
Recent advancements in Large Language Models (LLMs) have made them a popular information-seeking tool among end users. However, the statistical training methods for LLMs have raised concerns about their representation of under-represented topics, potentially leading to biases that could influence real-world decisions and opportunities. These biases could have significant economic, social, and cultural impacts as LLMs become more prevalent, whether through direct interactions--such as when users engage with chatbots or automated assistants--or through their integration into third-party applications (as agents), where the models influence decision-making processes and functionalities behind the scenes. Our study examines the biases present in LLMs' recommendations of U.S. cities and towns across three domains: relocation, tourism, and starting a business. We explore two key research questions: (i) how similar LLMs' responses are, and (ii) how this similarity might favor areas with certain characteristics over others, introducing biases. We focus on the consistency of LLMs' responses and their tendency to over-represent or under-represent specific locations. Our findings point to consistent demographic biases in these recommendations, which could perpetuate a "rich-get-richer" effect that widens existing economic disparities.
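The abstract above does not state how response similarity was measured. As one plausible sketch, assuming each LLM response is reduced to a set of recommended place names and that consistency is taken as the mean pairwise Jaccard similarity across repeated runs, the first research question could be operationalized as follows. The function names and the city lists are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(responses):
    """Average Jaccard similarity over all unordered pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical recommendations from three runs of the same prompt.
runs = [
    ["Austin", "Denver", "Seattle"],
    ["Austin", "Denver", "Boise"],
    ["Austin", "Raleigh", "Seattle"],
]
consistency = mean_pairwise_jaccard(runs)
```

A score near 1 would mean the same places are recommended on every run, so any over-representation of particular locations is repeated for every user who asks.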

Paperid: 2269, https://arxiv.org/pdf/2504.04927.pdf

Abstract:
Although Generative AI (GenAI) has the potential for persona development, many challenges must be addressed. This research systematically reviews 52 articles from 2022-2024, with important findings. First, closed commercial models are frequently used in persona development, creating a monoculture. Second, GenAI is used in various stages of persona development (data collection, segmentation, enrichment, and evaluation). Third, similar to other quantitative persona development techniques, there are major gaps in persona evaluation for AI-generated personas. Fourth, human-AI collaboration models are underdeveloped, despite human oversight being crucial for maintaining ethical standards. These findings imply that realizing the full potential of AI-generated personas will require substantial efforts across academia and industry. To that end, we provide a list of research avenues to inspire future work.

Paperid: 2270, https://arxiv.org/pdf/2504.04703.pdf

Abstract:
Artificial intelligence-augmented technology represents a considerable opportunity for improving healthcare delivery. Significant progress has been made to demonstrate the value of complex models to enhance clinicians' efficiency in decision-making. However, the clinical adoption of such models is scarce due to multifaceted implementation issues, with the explainability of AI models being among them. One of the substantially documented areas of concern is the unclear AI explainability that negatively influences clinicians' considerations for accepting the complex model. With a usability study engaging 20 U.S.-based clinicians and following the qualitative reflexive thematic analysis, this study develops and presents a concrete framework and an operational definition of explainability. The framework can inform the required customizations and feature developments in AI tools to support clinicians' preferences and enhance their acceptance.

Paperid: 2271, https://arxiv.org/pdf/2504.04488.pdf

Abstract:
Displaying a written transcript of what a human said (i.e. producing an "automatic speech recognition transcript") is a common feature for smartphone vocal assistants: the utterance produced by a human speaker (e.g. a question) is displayed on the screen while it is being verbally responded to by the vocal assistant. Although very rarely, this feature also exists on some "social" robots which transcribe human interactants' speech on a screen or a tablet. We argue that this informational configuration is pragmatically consequential on the interaction, both for human participants and for the embodied conversational agent. Based on a corpus of co-present interactions with a humanoid robot, we attempt to show that this transcript is a contextual feature which can heavily impact the actions ascribed by humans to the robot: that is, the way in which humans respond to the robot's behavior as constituting a specific type of action (rather than another) and as constituting an adequate response to their own previous turn.

Paperid: 2272, https://arxiv.org/pdf/2504.04440.pdf

Abstract:
This paper proposes a novel approach to scaling distributed collaboration in mixed reality by virtualizing collaborative tasks as independent, installable environments. By mapping group activities into dedicated virtual spaces that adapt to each user's real-world context, the proposed method supports consistent MR interactions, dynamic group engagement, and seamless task transitions. Preliminary studies in individual ideation demonstrate enhanced immersion and productivity, paving the way for future multi-user collaborative systems.

Paperid: 2275, https://arxiv.org/pdf/2504.03251.pdf

Abstract:
Clinical systems operate in safety-critical environments and are not intended to function autonomously; however, they are currently designed to replicate clinicians' diagnoses rather than assist them in the diagnostic process. To enable better supervision of system-generated diagnoses, we replicate radiologists' systematic approach used to analyze chest X-rays. This approach facilitates comprehensive analysis across all regions of clinical images and can reduce errors caused by inattentional blindness and under reading. Our work addresses a critical research gap by identifying difficult-to-diagnose diseases for clinicians using insights from human vision, enabling these systems to serve as an effective "second pair of eyes". These improvements make the clinical imaging systems more complementary and combine the strengths of human and machine vision. Additionally, we leverage effective receptive fields in deep learning models to present machine-generated diagnoses with sufficient context, making it easier for clinicians to evaluate them.

Paperid: 2276, https://arxiv.org/pdf/2504.03207.pdf

Abstract:
How can we use generative AI to design tools that augment rather than replace human cognition? In this position paper, we review our own research on AI-assisted decision-making for lessons to learn. We observe that in both AI-assisted decision-making and generative AI, a popular approach is to suggest AI-generated end-to-end solutions to users, which users can then accept, reject, or edit. Alternatively, AI tools could offer more incremental support to help users solve tasks themselves, which we call process-oriented support. We describe findings on the challenges of end-to-end solutions, and how process-oriented support can address them. We also discuss the applicability of these findings to generative AI based on a recent study in which we compared both approaches to assist users in a complex decision-making task with LLMs.

Paperid: 2277, https://arxiv.org/pdf/2504.03068.pdf

Abstract:
Large Language Model (LLM) tools have demonstrated their potential to deliver high-quality assistance by providing instant, personalized feedback that is crucial for effective programming education. However, many of these tools operate independently from institutional Learning Management Systems, which creates a significant disconnect. This isolation limits the ability to leverage learning materials and exercise context for generating tailored, context-aware feedback. Furthermore, previous research on self-regulated learning and LLM support mainly focused on knowledge acquisition, not the development of important self-regulation skills. To address these challenges, we developed CodeRunner Agent, an LLM-based programming assistant that integrates with CodeRunner, a Moodle plugin that executes student-submitted code and grades it automatically. CodeRunner Agent empowers educators to customize AI-generated feedback by incorporating detailed context from lecture materials, programming questions, student answers, and execution results. Additionally, it enhances students' self-regulated learning by providing strategy-based AI responses. This integrated, context-aware, and skill-focused approach offers promising avenues for data-driven improvements in programming education.

Paperid: 2278, https://arxiv.org/pdf/2504.02663.pdf

Abstract:
The societal need to leverage third-party data has driven the data-distribution market and increased the importance of data quality assessment (DQA) in data transactions between organizations. However, DQA requires expert knowledge of raw data and related data attributes, which hinders consensus-building in data purchasing. This study focused on the differences in DQAs between experienced and inexperienced data handlers. We performed two experiments: The first was a questionnaire survey involving 41 participants with varying levels of data-handling experience, who evaluated 12 data samples using 10 predefined indices with and without quality metadata generated by the automated tool. The second was an eye-tracking experiment to reveal the viewing behavior of participants during data evaluation. It was revealed that using quality metadata generated by the automated tool can reduce misrecognition in DQA. While experienced data handlers rated the quality metadata highly, semi-experienced users gave it the lowest ratings. This study contributes to enhancing data understanding within organizations and promoting the distribution of valuable data by proposing an automated tool to support DQAs.

Paperid: 2279, https://arxiv.org/pdf/2504.02074.pdf

Abstract:
Functional fixedness, a cognitive bias that restricts users' interactions with a new system or tool to expected or familiar ways, limits the full potential of Large Language Model (LLM)-enabled chat search, especially in complex and exploratory tasks. To investigate its impact, we conducted a crowdsourcing study with 450 participants, each completing one of six decision-making tasks spanning public safety, diet and health management, sustainability, and AI ethics. Participants engaged in a multi-prompt conversation with ChatGPT to address the task, allowing us to compare pre-chat intent-based expectations with observed interactions. We found that: 1) Several aspects of pre-chat expectations are closely associated with users' prior experiences with ChatGPT, search engines, and virtual assistants; 2) Prior system experience shapes language use and prompting behavior. Frequent ChatGPT users reduced deictic terms and hedge words and frequently adjusted prompts. Users with rich search experience maintained structured, less-conversational queries with minimal modifications. Users of virtual assistants favored directive, command-like prompts, reinforcing functional fixedness; 3) When the system failed to meet expectations, participants generated more detailed prompts with increased linguistic diversity, reflecting adaptive shifts. These findings suggest that while preconceived expectations constrain early interactions, unmet expectations can motivate behavioral adaptation. With appropriate system support, this may promote broader exploration of LLM capabilities. This work also introduces a typology for user intents in chat search and highlights the importance of mitigating functional fixedness to support more creative and analytical use of LLMs.

Paperid: 2280, https://arxiv.org/pdf/2504.01888.pdf

Abstract:
With the rapid development of Rehabilitation Lower Extremity Robotic Exoskeletons (RLEEX) technology, significant advancements have been made in Human-Robot Interaction (HRI) methods. These include traditional physical HRI methods that are easily recognizable and various bio-electrical signal-based HRI methods that can visualize and predict actions. However, most of these HRI methods are contact-based, facing challenges such as operational complexity, sensitivity to interference, risks associated with implantable devices, and, most importantly, limitations in comfort. These challenges render the interaction less intuitive and natural, which can negatively impact patient motivation for rehabilitation. To address these issues, this paper proposes a novel non-contact gesture interaction control method for RLEEX, based on RGB monocular camera depth estimation. This method integrates three key steps: detecting keypoints, recognizing gestures, and assessing distance, thereby applying gesture information and augmented reality triggering technology to control gait movements of RLEEX. Results indicate that this approach provides a feasible solution to the problems of poor comfort, low reliability, and high latency in HRI for RLEEX platforms. Specifically, it achieves a gesture-controlled exoskeleton motion accuracy of 94.11% and an average system response time of 0.615 seconds through non-contact HRI. The proposed non-contact HRI method represents a pioneering advancement in control interactions for RLEEX, paving the way for further exploration and development in this field.

Paperid: 2281, https://arxiv.org/pdf/2504.01366.pdf

Abstract:
Spaceflight is an isolated and confined environment (ICE) that exposes astronauts to psychological hazards, such as stress, danger, and monotony. Virtual reality (VR) and artificial intelligence (AI) technologies can serve as psychological countermeasures as they can digitally simulate immersive environments, interactive companions, and therapeutic experiences. Our study employs a scoping literature review approach to identify what is currently known about the use and effectiveness of VR and AI-based interventions as psychological countermeasures to improve mood or emotional states in adults in space or other ICEs. Additionally, this review aimed to identify gaps in the knowledge base and whether a systematic review with meta-analysis was warranted. The review included studies where the intervention was used or intended for use in space or other extraterrestrial environments (ICE). Our search strategy yielded 19 studies from 3390 records across seven major databases. All studies focused on VR-based interventions, with no eligible AI-based intervention studies found. VR interventions were found to be effective for relaxation and improving mood, emergency training, as an interactive communication platform, for comparing interior designs, and for enhancing exercise. There were improvements for measures of mood and emotion (e.g., anxiety and stress); however, user preferences varied, and some instances of cybersickness were reported. A systematic review with meta-analysis is not recommended due to the heterogeneity of results. There is significant scope for further research into the use of VR for a wider range of mood and emotion variables using standardised assessment instruments. Additionally, the potential application of AI as a psychological countermeasure warrants further investigation.

Paperid: 2282, https://arxiv.org/pdf/2504.01287.pdf

Abstract:
Emotional responses to auditory stimuli are a common part of everyday life. However, for some individuals, these responses can be distressing enough to interfere with daily functioning. Despite their prevalence, the mechanisms underlying auditory-induced emotion remain only partially understood. Prior research has identified contributing factors such as auditory features, listener traits, and bodily sensations. However, most studies have focused on acoustic features, leaving the role of syntactic structure largely unexplored. This study specifically investigates how hierarchical syntactic structures influence emotional experience, in conjunction with listener traits and bodily sensations. An online experiment was conducted with 715 participants, who listened to 26 sound sequences varying systematically in hierarchical syntactic complexity. Sequences were generated by combining three types of local pitch movement with three types of global pitch movement in ascending and descending pitch directions, resulting in nine complexity levels. Participants rated the valence and arousal of each sequence and indicated any bodily sensations on a body map. Measures of sensory processing patterns were also collected. Results showed that emotional valence was associated with the complex interplay of moderate syntactic complexity ("not too simple, not too complex"), sensory sensitivity, and upper torso sensations. These findings expand existing research by identifying syntactic features that shape auditory-induced emotional experience and highlight the link between bodily sensation and emotional response. They also suggest potential applications for incorporating syntactic design into therapeutic approaches to emotion regulation.

Paperid: 2283, https://arxiv.org/pdf/2504.00337.pdf

Abstract:
In virtual reality, it is widely assumed that increased realism in hand-object interactions enhances user immersion and overall experience. However, recent studies challenge this assumption, suggesting that faithfully replicating real-world physics and visuals is not always necessary for improved usability or immersion. This has led to ambiguity for developers when choosing optimal hand interaction methods for different applications. Currently, there is a lack of comprehensive research to resolve this issue. This study aims to fill this gap by evaluating three contemporary VR hand interaction methods (Attachment, Penetration, and Torque) across two distinct task scenarios: simple manipulation tasks and more complex, precision-driven tasks. By examining key technical features, we identify the strengths and limitations of each method and propose development guidelines for future advancements. Our findings reveal that while Attachment, with its simplified control mechanisms, is well-suited for commercial applications, Penetration and Torque show promise for next-generation interactions. The insights gained from our study provide practical guidance for developers and researchers seeking to balance realism, usability, and user satisfaction in VR environments.

Paperid: 2284, https://arxiv.org/pdf/2503.23574.pdf

Abstract:
Model documentation plays a crucial role in promoting transparency and responsible development of AI systems. With the rise of Generative AI (GenAI), open-source platforms have increasingly become hubs for hosting and distributing these models, prompting platforms like Hugging Face to develop dedicated model documentation guidelines that align with responsible AI principles. Despite these growing efforts, there remains a lack of understanding of how developers document their GenAI models on open-source platforms. Through interviews with 13 GenAI developers active on open-source platforms, we provide empirical insights into their documentation practices and challenges. Our analysis reveals that despite existing resources, developers of GenAI models still face multiple layers of uncertainties in their model documentation: (1) uncertainties about what specific content should be included; (2) uncertainties about how to effectively report key components of their models; and (3) uncertainties in deciding who should take responsibilities for various aspects of model documentation. Based on our findings, we discuss the implications for policymakers, open-source platforms, and the research community to support meaningful, effective and actionable model documentation in the GenAI era, including cultivating better community norms, building robust evaluation infrastructures, and clarifying roles and responsibilities.

Paperid: 2285, https://arxiv.org/pdf/2503.23460.pdf

Abstract:
Connectivity enabled by technologies such as the Internet of Things, Artificial Intelligence, Big Data, and Cloud Computing is rapidly transforming our interactions with the world and with each other. It reshapes social interactions, fostering collaboration, creativity, and unprecedented access to information and resources. However, this connected world and era demand innovative design approaches that harmonize technical functionality with human-centered values. We have run a series of workshops at different conferences, trying to engage the participants in discussions about the related challenges and opportunities, from digital art [1] and aesthetics [2] to AI-driven creativity [3] and their functional aspects in healthcare [1] and empowerment [2, 3]. We want to focus further on the intersection of these challenges where we see opportunities: leveraging aesthetics and connectivity as catalysts for empowerment.

Paperid: 2286, https://arxiv.org/pdf/2503.22588.pdf

Abstract:
Visual observation of objects is essential for many robotic applications, such as object reconstruction and manipulation, navigation, and scene understanding. Machine learning algorithms constitute the state-of-the-art in many fields but require vast data sets, which are costly and time-intensive to collect. Automated strategies for observation and exploration are crucial to enhance the efficiency of data gathering. Therefore, a novel strategy utilizing the Next-Best-Trajectory principle is developed for a robot manipulator operating in dynamic environments. Local trajectories are generated to maximize the information gained from observations along the path while avoiding collisions. We employ a voxel map for environment modeling and utilize raycasting from perspectives around a point of interest to estimate the information gain. A global ergodic trajectory planner provides an optional reference trajectory to the local planner, improving exploration and helping to avoid local minima. To enhance computational efficiency, raycasting for estimating the information gain in the environment is executed in parallel on the graphics processing unit. Benchmark results confirm the efficiency of the parallelization, while real-world experiments demonstrate the strategy's effectiveness.
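
The raycasting-based information gain estimate described above can be sketched as follows. This is a sequential CPU illustration under simplifying assumptions, not the authors' GPU implementation: the voxel map is a plain dictionary, rays are marched in fixed half-voxel steps, and information gain is taken to be the number of distinct unknown voxels visible from a viewpoint. All names and constants are hypothetical.

```python
import math

UNKNOWN, FREE, OCCUPIED = 0, 1, 2


def cast_ray(voxels, origin, direction, max_range, voxel_size=1.0):
    """Return the set of unknown voxels a ray crosses before it is blocked.

    Uses fixed-step marching at half a voxel per step; voxels absent from
    the map are treated as unknown.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [c / norm for c in direction]
    step = voxel_size / 2.0
    seen = set()
    for i in range(int(max_range / step) + 1):
        t = i * step
        point = [o + t * u for o, u in zip(origin, unit)]
        key = tuple(math.floor(p / voxel_size) for p in point)
        state = voxels.get(key, UNKNOWN)
        if state == OCCUPIED:
            break  # the ray is blocked; nothing behind this voxel is visible
        if state == UNKNOWN:
            seen.add(key)
    return seen


def information_gain(voxels, origin, directions, max_range):
    """Count distinct unknown voxels visible along a bundle of rays."""
    unknown = set()
    for direction in directions:
        unknown |= cast_ray(voxels, origin, direction, max_range)
    return len(unknown)


# Hypothetical map: two free voxels and one obstacle along the x axis.
voxels = {(0, 0, 0): FREE, (1, 0, 0): FREE, (3, 0, 0): OCCUPIED}
viewpoint = (0.5, 0.5, 0.5)
print(information_gain(voxels, viewpoint, [(1, 0, 0), (0, 1, 0)], max_range=5.0))
```

Because each ray is independent of the others, the loop in `information_gain` is the part that lends itself to parallel execution, which is consistent with the abstract's choice to run raycasting on the graphics processing unit.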

Paperid: 2287, https://arxiv.org/pdf/2503.22018.pdf

Abstract:
Selective exposure to online news consumption reinforces filter bubbles, restricting access to diverse viewpoints. Interactive systems can counteract this bias by suggesting alternative perspectives, but they require real-time indicators to identify selective exposure. This workshop paper proposes the integration of physiological sensing, including Electroencephalography (EEG) and eye tracking, to measure selective exposure. We propose methods for examining news agreement and its relationship to theta band power in the parietal region, indicating a potential link between cortical activity and selective exposure. Our vision is interactive systems that detect selective exposure and provide alternative views in real time. We suggest that future news interfaces incorporate physiological signals to promote more balanced information consumption. This work joins the discussion on AI-enhanced methodology for bias detection.

Paperid: 2288, https://arxiv.org/pdf/2503.21540.pdf

Abstract:
Large Language Models (LLMs) promise to overcome limitations of rule-based mental health chatbots through more natural conversations. However, evaluating LLM-based mental health chatbots presents a significant challenge: Their probabilistic nature requires comprehensive testing to ensure therapeutic quality, yet conducting such evaluations with people with depression would impose an additional burden on vulnerable people and risk exposing them to potentially harmful content. Our paper presents an evaluation approach for LLM-based mental health chatbots that combines dialogue generation with artificial users and dialogue evaluation by psychotherapists. We developed artificial users based on patient vignettes, systematically varying characteristics such as depression severity, personality traits, and attitudes toward chatbots, and let them interact with an LLM-based behavioral activation chatbot. Ten psychotherapists evaluated 48 randomly selected dialogues using standardized rating scales to assess the quality of behavioral activation and its therapeutic capabilities. We found that while artificial users showed moderate authenticity, they enabled comprehensive testing across different users. In addition, the chatbot demonstrated promising capabilities in delivering behavioral activation and maintaining safety. Furthermore, we identified deficits, such as ensuring the appropriateness of the activity plan, which reveals necessary improvements for the chatbot. Our framework provides an effective method for evaluating LLM-based mental health chatbots while protecting vulnerable people during the evaluation process. Future research should improve the authenticity of artificial users and develop LLM-augmented evaluation tools to make psychotherapist evaluation more efficient, and thus further advance the evaluation of LLM-based mental health chatbots.

Paperid: 2289, https://arxiv.org/pdf/2503.21394.pdf

Abstract:
Generative AI models offer many possibilities for text creation and transformation. Current graphical user interfaces (GUIs) for prompting them lack support for iterative exploration, as they do not represent prompts as actionable interface objects. We propose the concept of a composable prompting canvas for text exploration and iteration using dynamic widgets. Users generate widgets through system suggestions, prompting, or manually to capture task-relevant facets that affect the generated text. In a comparative study with a baseline (conversational UI), 18 participants worked on two writing tasks, creating diverse prompting environments with custom widgets and spatial layouts. They reported having more control over the generated text and preferred our system over the baseline. Our design significantly outperformed the baseline on the Creativity Support Index, and participants felt the results were worth the effort. This work highlights the need for GUIs that support user-driven customization and (re-)structuring to increase both the flexibility and efficiency of prompting.

Paperid: 2290, https://arxiv.org/pdf/2503.21191.pdf

Abstract:
Generative design, an AI-assisted technology for optimizing design through algorithmic processes, is propelling advancements across numerous fields. As the use of immersive environments such as Augmented Reality (AR) continues to rise, integrating generative design into such platforms presents a potent opportunity for innovation. However, a vital challenge that impedes this integration is the current absence of an efficient and user-friendly interface for designers to operate within these environments effectively. To bridge this gap, we introduce a novel UI system for generative design software in AR, which automates the process of generating the potential design constraints based on the users' inputs. This system allows users to construct a virtual environment, edit objects and constraints, and export the final data in CSV format. The interface enhances the user's design experience by enabling more intuitive interactions and providing immediate visual feedback. Deriving from participatory design principles, this research proposes a significant leap forward in the realms of generative design and immersive environments.

Paperid: 2291, https://arxiv.org/pdf/2503.21130.pdf

Abstract:
Tutorial videos are a valuable resource for people looking to learn new tasks. People often learn these skills by viewing multiple tutorial videos to get an overall understanding of a task by looking at different approaches to achieve the task. However, navigating through multiple videos can be time-consuming and mentally demanding as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.

Paperid: 2292, https://arxiv.org/pdf/2503.21094.pdf

Abstract:
Smartphones with large screens provide users with increased display and interaction space but pose challenges in reaching certain areas with the thumb when using the device with one hand. To address this, we introduce GazeSwipe, a multimodal interaction technique that combines eye gaze with finger-swipe gestures, enabling intuitive and low-friction reach on mobile touchscreens. Specifically, we design a gaze estimation method that eliminates the need for explicit gaze calibration. Our approach also avoids the use of additional eye-tracking hardware by leveraging the smartphone's built-in front-facing camera. Considering the potential decrease in gaze accuracy without dedicated eye trackers, we use finger-swipe gestures to compensate for any inaccuracies in gaze estimation. Additionally, we introduce a user-unaware auto-calibration method that improves gaze accuracy during interaction. Through extensive experiments on smartphones and tablets, we compare our technique with various methods for touchscreen reachability and evaluate the performance of our auto-calibration strategy. The results demonstrate that our method achieves high success rates and is preferred by users. The findings also validate the effectiveness of the auto-calibration strategy.

Paperid: 2293, https://arxiv.org/pdf/2503.20112.pdf

Abstract:
Effective error analysis is critical for the successful development and deployment of computer vision and machine learning (CVML) models. One approach to understanding model errors is to summarize the common characteristics of error samples. This can be particularly challenging in tasks that utilize unstructured, complex data such as images, where patterns are not always obvious. Another method is to analyze error distributions across pre-defined categories, which requires analysts to hypothesize about potential error causes in advance. However, forming such hypotheses without access to explicit labels or annotations makes it difficult to isolate meaningful subgroups or patterns, as analysts must rely on manual inspection, prior expertise, or intuition. This lack of structured guidance can hinder a comprehensive understanding of where models fail. To address these challenges, we introduce VibE, a semantic error analysis workflow designed to identify where and why CVML models fail at the subgroup level, even when labels or annotations are unavailable. VibE incorporates several core features to enhance error analysis: semantic subgroup generation, semantic summarization, candidate issue proposals, semantic concept search, and interactive subgroup analysis. By leveraging large foundation models (such as CLIP and GPT-4) alongside visual analytics, VibE enables developers to semantically interpret and analyze CVML model errors. This interactive workflow helps identify errors through subgroup discovery, supports hypothesis generation with auto-generated subgroup summaries and suggested issues, and allows hypothesis validation through semantic concept search and comparative analysis. Through three diverse CVML tasks and in-depth expert interviews, we demonstrate how VibE can assist error understanding and analysis.

Paperid: 2294, https://arxiv.org/pdf/2503.19138.pdf

Abstract:
This work investigates personal perspectives in visualization annotations as devices for collective data-driven storytelling. Inspired by existing efforts in critical cartography, we show how people share personal memories in a visualization of COVID-19 data and how comments by other visualization readers influence the reading and understanding of visualizations. Analyzing interaction logs, reader surveys, visualization annotations, and interviews, we find that reader annotations help other viewers relate to other people's stories and reflect on their own experiences. Further, we found that annotations embedded directly into the visualization can serve as social traces guiding through a visualization and help readers contextualize their own stories. With that, they supersede the attention paid to data encodings and become the main focal point of the visualization.

Paperid: 2295, https://arxiv.org/pdf/2503.19132.pdf

Abstract:
This paper critically examines the machine learning (ML) modeling of humans in three case studies of well-being technologies. Through a critical technical approach, it examines how these apps were experienced in daily life (technology in use) to surface breakdowns and to identify the assumptions about the "human" body entrenched in the ML models (technology design). To address these issues, this paper applies agential realism to decenter foundational assumptions, such as body regularity and health/illness binaries, and speculates more inclusive design and ML modeling paths that acknowledge irregularity, human-system entanglements, and uncertain transitions. This work is among the first to explore the implications of decentering theories in computational modeling of human bodies and well-being, offering insights for more inclusive technologies and speculations toward posthuman-centered ML modeling.

Paperid: 2296, https://arxiv.org/pdf/2503.18606.pdf

Abstract:
The use of motion capture in live dance performances has created an emerging discipline enabling dancers to play different avatars on the digital stage. Unlike classical workflows, avatars enable performers to act as different characters in customized narratives, but research has yet to address how movement, improvisation, and perception change when dancers act as avatars. We created five avatars representing differing genders, shapes, and body limitations, and invited 15 dancers to improvise with each in practice and performance settings. Results show that dancers used avatars to distance themselves from their own habitual movements, exploring new ways of moving through differing physical constraints. Dancers explored using gender-stereotyped movements like powerful or feminine actions, experimenting with gender identity. However, focusing on avatars can coincide with a lack of continuity in improvisation. This work shows how emerging practices with performance technology enable dancers to improvise with new constraints, stepping outside the classical stage.

Paperid: 2297, https://arxiv.org/pdf/2503.17517.pdf

Abstract:
Data visualizations are typically not accessible to blind and low-vision (BLV) users. Automatically generating text descriptions offers an enticing mechanism for democratizing access to the information held in complex scientific charts, yet appropriate procedures for generating those texts remain elusive. Pursuing this issue, we study a single complex chart form: UpSet plots. UpSet plots are a common way to analyze set data, an area largely unexplored by prior accessibility literature. By analyzing the patterns present in real-world examples, we develop a system for automatically captioning any UpSet plot. We evaluated the utility of our captions via semi-structured interviews with BLV users (N=11) and found that they consider the captions informative. In extensions, we find that sighted users can use our texts similarly to UpSet plots and that the texts are better than naive LLM usage.

Paperid: 2298, https://arxiv.org/pdf/2503.17257.pdf

Abstract:
Environmental comfort in offices is traditionally captured by surveying an entire workforce simultaneously, which fails to capture the situatedness of the different personal experiences. To address this limitation, we developed the EnviroMapper Toolkit, a data physicalisation toolkit that allows individual office workers to record their personal experiences of environmental comfort by mapping the actual moments and locations at which these occurred. By analysing two in-the-wild studies in existing open-plan office environments (N=14), we demonstrate how this toolkit acts like a situated input visualisation that can be interpreted by domain experts who were not present during its construction. This study therefore offers four key contributions: (1) the iterative design process of the physicalisation toolkit; (2) its preliminary deployment in two real-world office contexts; (3) the decoding of the resulting artefacts by domain experts; and (4) design considerations to support future input physicalisation and visualisation constructions that capture and synthesise data from multiple individuals.

Paperid: 2299, https://arxiv.org/pdf/2503.17085.pdf

Abstract:
Artificial intelligence (AI) systems powered by large language models have become increasingly prevalent in modern society, enabling a wide range of applications through natural language interaction. As AI agents proliferate in our daily lives, their generic and uniform expressiveness presents a significant limitation to their appeal and adoption. Personality expression represents a key prerequisite for creating more human-like and distinctive AI systems. We show that AI models can express deterministic and consistent personalities when instructed using established psychological frameworks, with varying degrees of accuracy depending on model capabilities. We find that more advanced models like GPT-4o and o1 demonstrate the highest accuracy in expressing specified personalities across both Big Five and Myers-Briggs assessments, and further analysis suggests that personality expression emerges from a combination of intelligence and reasoning capabilities. Our results reveal that personality expression operates through holistic reasoning rather than question-by-question optimization, with response-scale metrics showing higher variance than test-scale metrics. Furthermore, we find that model fine-tuning affects communication style independently of personality expression accuracy. These findings establish a foundation for creating AI agents with diverse and consistent personalities, which could significantly enhance human-AI interaction across applications from education to healthcare, while additionally enabling a broader range of more unique AI agents. The ability to quantitatively assess and implement personality expression in AI systems opens new avenues for research into more relatable, trustworthy, and ethically designed AI.

Paperid: 2300, https://arxiv.org/pdf/2503.17013.pdf

Abstract:
This study employs the Paul-Elder Critical Thinking Model and Tan's argumentative writing framework to create a structured methodology. This methodology, ChatGPT Guideline for Critical Argumentative Writing (CGCAW) framework, integrates the models with ChatGPT's capabilities to guide L2 learners in utilizing ChatGPT to enhance their critical thinking skills. A quantitative experiment was conducted with 10 participants from a state university, divided into experimental and control groups. The experimental group utilized the CGCAW framework, while the control group used ChatGPT without specific guidelines. Participants wrote an argumentative essay within a 40-minute timeframe, and essays were evaluated by three assessors: ChatGPT, Grammarly, and a course instructor. Results indicated that the experimental group showed improvements in clarity, logical coherence, and use of evidence, demonstrating ChatGPT's potential to enhance specific aspects of argumentative writing. However, the control group performed better in overall language mechanics and articulation of main arguments, indicating areas where the CGCAW framework could be further refined. This study highlights the need for further research to optimize the use of AI tools like ChatGPT in L2 learning environments to enhance critical thinking and writing skills.

Paperid: 2301, https://arxiv.org/pdf/2503.16920.pdf

Abstract:
An AI design framework was developed based on three core principles, namely understandability, trust, and usability. The framework was conceptualized by synthesizing evidence from the literature and by consulting with experts. The initial version of the AI Explainability Framework was validated based on an in-depth expert engagement and review process. For evaluation purposes, an AI-anchored prototype, incorporating novel explainability features, was built and deployed online. The primary function of the prototype was to predict the postpartum depression risk using analytics models. The development of the prototype was carried out in an iterative fashion, based on a pilot-level formative evaluation, followed by refinements and summative evaluation. The System Explainability Scale (SES) metric was developed to measure the influence of the three dimensions of the AI Explainability Framework. For the summative stage, a comprehensive usability test was conducted involving 20 clinicians, and the SES metric was used to assess clinicians' satisfaction with the tool. On a 5-point rating system, the tool received high scores for the usability dimension, followed by trust and understandability. The average explainability score was 4.56. In terms of understandability, trust, and usability, the average score was 4.51, 4.53 and 4.71 respectively. Overall, the 13-item SES metric showed strong internal consistency with Cronbach's alpha of 0.84 and a positive correlation coefficient (Spearman's rho = 0.81, p<0.001) between the composite SES score and explainability. A major finding was that the framework, combined with the SES usability metric, provides a straightforward approach for developing AI-based healthcare tools that lower the challenges associated with explainability.
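
The internal-consistency figure quoted above (Cronbach's alpha) can be illustrated with a minimal sketch. The function name `cronbach_alpha` and the toy ratings below are assumptions made for illustration; they are not the study's data or code. The sketch uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), with sample variances throughout.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items table of ratings.

    scores: list of rows, one row per respondent, one rating per item.
    Hypothetical helper for illustration; not the paper's code.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) across respondents.
    item_variance_sum = sum(variance(column) for column in zip(*scores))
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Toy 5-point ratings: 4 respondents x 3 items (made-up numbers).
toy_ratings = [
    [4, 5, 4],
    [3, 4, 3],
    [5, 5, 5],
    [2, 3, 2],
]
alpha = cronbach_alpha(toy_ratings)
```

A 13-item scale such as the SES would simply use rows of 13 ratings; values closer to 1 indicate that the items vary together.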

Paperid: 2302, https://arxiv.org/pdf/2503.16889.pdf

Abstract:
Relying on a large corpus of natural interactions between visitors and a robot in a museum setting, we study a recurrent practice through which humans "worked" to maintain the robot as a competent participant: the description by bystanders, in a way that was made accessible to the main speaker, of the social action that the robot was taken to be accomplishing. Doing so, bystanders maintained the robot's (sometimes incongruous) behaviour as relevant to the activity at hand and preserved the robot itself as a competent participant. Relying on these data, we argue that ex ante definitions of a robot as "social" (i.e. before any interaction occurred) run the risk of naturalizing as self-evident the observable result from micro-sociological processes: namely, the interactional work of co-present humans through which the robot's conduct is reconfigured as contextually relevant.

Paperid: 2303, https://arxiv.org/pdf/2503.16521.pdf

Abstract:
In this paper, we compare a manual assembly task communicated to workers using both printed and robot-delivered instructions. The comparison was made using physiological signals (blood volume pulse (BVP) and electrodermal activity (EDA)) collected from individuals during an experimental study. In addition, we collected responses of individuals using the NASA Task Load Index (TLX) survey. Furthermore, we mapped the collected physiological signals to the responses of participants for NASA TLX to predict their workload. For both classification problems, we compare the performance of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models. Results show that our CNN-based approach gave better results using multimodal data (both BVP and EDA) than using just BVP (approx. 8.38% more) or just EDA (approx. 20.49% more). Our LSTM-based model also had better results when we used multimodal data (approx. 8.38% more than just BVP and 6.70% more than just EDA). Overall, CNNs performed better than LSTMs for classifying physiologies for paper vs. robot-based instruction by 7.72%. The CNN-based model was able to give better classification results (approximately 17.83% more on average across all responses of the NASA TLX) within a few minutes of training compared to the LSTM-based models.

Paperid: 2305, https://arxiv.org/pdf/2503.16484.pdf

Abstract:
Episodic Future Thinking (EFT) involves vividly imagining personal future events and experiences in detail. It has shown promise as an intervention to reduce delay discounting (the tendency to devalue delayed rewards in favor of immediate gratification) and to promote behavior change in a range of maladaptive health behaviors. We present EFTeacher, an AI chatbot powered by the GPT-4-Turbo large language model, designed to generate EFT cues for users with lifestyle-related conditions. To evaluate the feasibility and usability of EFTeacher, we conducted a mixed-methods study that included usability assessments, user evaluations based on content characteristics questionnaires, and semi-structured interviews. Qualitative findings indicate that participants perceived EFTeacher as communicative and supportive through an engaging dialogue. The chatbot facilitated imaginative thinking and reflection on future goals. Participants appreciated its adaptability and personalization features, though some noted challenges such as repetitive dialogue and verbose responses. Our findings underscore the potential of large language model-based chatbots in EFT interventions targeting maladaptive health behaviors.

Paperid: 2306, https://arxiv.org/pdf/2503.16468.pdf

Abstract:
The search for effective collaboration between humans and computer systems is one of the biggest challenges in Artificial Intelligence. One of the more effective mechanisms that humans use to coordinate with one another is theory of mind (ToM). ToM can be described as the ability to "take someone else's perspective and make estimations of their beliefs, desires and intentions, in order to make sense of their behaviour and attitudes towards the world". If leveraged properly, this skill can be very useful in Human-AI collaboration. This raises the question of how to implement ToM when building an AI system. Humans and AI systems work quite differently, and ToM is a multifaceted concept, each facet rooted in different research traditions across the cognitive and developmental sciences. We observe that researchers from artificial intelligence and the computing sciences, ourselves included, often have difficulties finding their way in the ToM literature. In this paper, we identify four common misconceptions around ToM that we believe should be taken into account when developing an AI system. We have hyperbolised these misconceptions for the sake of the argument, but add nuance in their discussion. The misconceptions we discuss are: (1) "Humans Use a ToM Module, So AI Systems Should As Well". (2) "Every Social Interaction Requires (Advanced) ToM". (3) "All ToM is the Same". (4) "Current Systems Already Have ToM". After discussing each misconception, we end the corresponding section by providing tentative guidelines on how the misconception can be overcome.

Paperid: 2307, https://arxiv.org/pdf/2503.16458.pdf

Abstract:
In this paper, we investigate how individuals evaluate human- and large language model-generated responses to popular questions when the source of the content is either concealed or disclosed. Through a controlled field experiment, participants were presented with a set of questions, each accompanied by a response generated by either a human or an AI. In a randomized design, half of the participants were informed of the response's origin while the other half remained unaware. Our findings indicate that, overall, participants tend to prefer AI-generated responses. However, when the AI origin is revealed, this preference diminishes significantly, suggesting that evaluative judgments are influenced by the disclosure of the response's provenance rather than solely by its quality. These results underscore a bias against AI-generated content, highlighting the societal challenge of improving the perception of AI work in contexts where quality assessments should be paramount.

Paperid: 2308, https://arxiv.org/pdf/2503.16449.pdf

Abstract:
The uncanny valley effect poses a significant challenge in the development and acceptance of hyper-realistic social robots. This study investigates whether advanced conversational capabilities powered by large language models (LLMs) can mitigate this effect in highly anthropomorphic robots. We conducted a user study with 80 participants interacting with Nadine, a hyper-realistic humanoid robot equipped with LLM-driven communication skills. Through pre- and post-interaction surveys, we assessed changes in perceptions of uncanniness, conversational quality, and overall user experience. Our findings reveal that LLM-enhanced interactions significantly reduce feelings of eeriness while fostering more natural and engaging conversations. Additionally, we identify key factors influencing user acceptance, including conversational naturalness, human-likeness, and interestingness. Based on these insights, we propose design recommendations to enhance the appeal and acceptability of hyper-realistic robots in social contexts. This research contributes to the growing field of human-robot interaction by offering empirical evidence on the potential of LLMs to bridge the uncanny valley, with implications for the future development of social robots.

Paperid: 2309, https://arxiv.org/pdf/2503.16437.pdf

Abstract:
This study introduces "Haunted House", a novel text-based game designed to compare the performance of humans and large language models (LLMs) in model-based reasoning. Players must escape from a house containing nine rooms in a 3x3 grid layout while avoiding the ghost. They are guided by verbal clues that they get each time they move. In Study 1, the results from 98 human participants revealed a success rate of 31.6%, significantly outperforming the seven state-of-the-art LLMs tested. Out of 140 attempts across seven LLMs, only one attempt resulted in a pass, by Claude 3 Opus. Preliminary results suggested that GPT o3-mini-high performance might be higher, but not at the human level. Further analysis of 29 human participants' moves in Study 2 indicated that LLMs frequently struggled with random and illogical moves, while humans exhibited such errors less frequently. Our findings suggest that current LLMs encounter difficulties in tasks that demand active model-based reasoning, offering inspiration for future benchmarks.

Paperid: 2310, https://arxiv.org/pdf/2503.16432.pdf

Abstract:
This study investigates multimodal turn-taking prediction within human-agent interactions (HAI), particularly focusing on cooperative gaming environments. It comprises both model development and a subsequent user study, aiming to refine our understanding and improve conversational dynamics in spoken dialogue systems (SDSs). For the modeling phase, we introduce a novel transformer-based deep learning (DL) model that simultaneously integrates multiple modalities (text, vision, audio, and contextual in-game data) to predict turn-taking events in real time. Our model employs a Crossmodal Transformer architecture to effectively fuse information from these diverse modalities, enabling more comprehensive turn-taking predictions. The model demonstrates superior performance compared to baseline models, achieving 87.3% accuracy and 83.0% macro F1 score. A human user study was then conducted to empirically evaluate the turn-taking DL model in an interactive scenario with a virtual avatar while playing the game "Don't Starve Together", comparing a control condition without turn-taking prediction (n=20) to an experimental condition with our model deployed (n=40). Both conditions included a mix of English and Korean speakers, since turn-taking cues are known to vary by culture. We then analyzed the interaction quality, examining aspects such as utterance counts, interruption frequency, and participant perceptions of the avatar. Results from the user study suggest that our multimodal turn-taking model not only enhances the fluidity and naturalness of human-agent conversations, but also maintains a balanced conversational dynamic without significantly altering dialogue frequency. The study provides in-depth insights into the influence of turn-taking abilities on user perceptions and interaction quality, underscoring the potential for more contextually adaptive and responsive conversational agents.

Paperid: 2311, https://arxiv.org/pdf/2503.15527.pdf

Abstract:
More and more people are experiencing pressure from work, life, and education. These pressures often lead to an anxious state of mind, or even the early symptoms of suicidal ideation. With the advancement of artificial intelligence (AI) technology, large language models have become one of the most prominent technologies. They are often used for detecting psychological disorders. However, current studies primarily provide categorization results without offering interpretable explanations for these results. To address this gap, this study adopts a person-centered perspective and focuses on GPT-generated multi-scenario simulated conversations. These simulated conversations were selected as data samples for the study. Various transformer-based encoder models were utilized to develop a classification model capable of identifying different levels of anxiety. Additionally, a knowledge base focusing on anxiety was constructed using LangChain and GPT-4. When analyzing classification results, this knowledge base was able to provide explanations and reasons most relevant to the interlocutor's anxiety situation. The study demonstrates that the proposed model achieves over 94% accuracy in categorical prediction, and the advice provided is highly personalized and relevant.

Paperid: 2312, https://arxiv.org/pdf/2503.15522.pdf

Abstract:
With the rapid advancement of autonomous vehicle (AV) technology, AVs are progressively seen as interactive agents with some level of autonomy, as well as some context-dependent social features. This introduces new challenges and questions, already relevant in other areas of human-robot interaction (HRI): namely, if an AV is perceived as a social agent by the human with whom it is interacting, how are the various facets of its design and behaviour impacting its human partner? And how can we foster a successful human-agent interaction (HAI) between the AV and the human, maximizing the human's comfort, acceptance, and trust in the AV? In this work, we attempt to understand the various factors that could influence naïve participants' acceptance and trust when interacting with an AV in the role of a driver. Through a large-scale online study, we investigate the effect of the AV's autonomy on the human driver, as well as explore which parameters of the interaction have the highest impact on the user's sense of trust in the AV. Finally, we analyze our preliminary findings from the user study within existing guidelines on Trustworthy HAI/HRI.

Paperid: 2313, https://arxiv.org/pdf/2503.15521.pdf

Abstract:
Achieving consensus in group decision-making often involves overcoming significant challenges, particularly in reconciling diverse perspectives and mitigating biases that hinder agreement. Traditional methods relying on human facilitators are often constrained by scalability and efficiency, especially in large-scale, fast-paced discussions. To address these challenges, this study proposes a novel framework employing large language models (LLMs) as automated facilitators within a custom-built multi-user chat system. Leveraging cosine similarity as a core metric, this approach evaluates the ability of three state-of-the-art LLMs (ChatGPT 4.0, Mistral Large 2, and AI21 Jamba Instruct) to synthesize consensus proposals that align with participants' viewpoints. Unlike conventional techniques, the system integrates adaptive facilitation strategies, including clarifying misunderstandings, summarizing discussions, and proposing compromises, enabling the LLMs to iteratively refine consensus proposals based on user feedback. Experimental results demonstrate the superiority of ChatGPT 4.0, which achieves higher alignment with participant opinions, requiring fewer iterations to reach consensus compared to its counterparts. Moreover, analysis reveals the nuanced performance of the models across various sustainability-focused discussion topics, such as climate action, quality education, good health and well-being, and access to clean water and sanitation. These findings highlight the transformative potential of LLM-driven facilitation for improving collective decision-making processes and underscore the importance of advancing evaluation metrics and cross-cultural adaptability in future research.
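
To make the cosine-similarity metric mentioned above concrete, here is a minimal sketch of scoring how closely a consensus proposal aligns with participants' statements. The abstract does not say how texts are vectorized; the word-count vectors, the function names (`bag_of_words`, `cosine_similarity`, `mean_alignment`), and the example sentences are all assumptions for illustration, and a real system would more plausibly use sentence embeddings.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a sparse word-count vector (stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as aligned with nothing
    return dot / (norm_a * norm_b)


def mean_alignment(proposal, opinions):
    """Average similarity between one proposal and every participant opinion."""
    proposal_vec = bag_of_words(proposal)
    sims = [cosine_similarity(proposal_vec, bag_of_words(o)) for o in opinions]
    return sum(sims) / len(sims)


# Made-up example texts on a sustainability topic.
opinions = [
    "invest in clean water access first",
    "clean water and sanitation should come first",
]
score = mean_alignment("prioritize clean water and sanitation", opinions)
```

Under this sketch, a facilitator loop could regenerate the proposal until `mean_alignment` stops improving; that stopping rule is likewise an assumption, not something the abstract states.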

Paperid: 2314, https://arxiv.org/pdf/2503.15515.pdf

Abstract:
Computer-Using Agents (CUAs) enable users to automate increasingly complex tasks using graphical interfaces such as browsers. As many potential tasks require personal data, we propose Computer-Using Personal Agents (CUPAs) that have access to an external repository of the user's personal data. Compared with CUAs, CUPAs offer users better control of their personal data, the potential to automate more tasks involving personal data, better interoperability with external sources of data, and better capabilities to coordinate with other CUPAs in order to solve collaborative tasks involving the personal data of multiple users.

Paperid: 2315, https://arxiv.org/pdf/2503.15506.pdf

Abstract:
In the rapidly evolving landscape of manufacturing and material forming, innovative strategies are imperative for maintaining a competitive edge. Augmented Reality (AR) has emerged as a groundbreaking technology, offering new dimensions in how information is displayed and interacted with. It holds particular promise for instructional guides for complex machinery, potentially enhancing traditional methods of knowledge transfer and operator training. Material forming, a key discipline within mechanical engineering, requires high precision and skill, making it an ideal candidate for the integration of advanced instructional technologies like AR. This study aims to explore the efficiency of three distinct types of user manuals (video, paper, and AR) on performance and acceptability in a material forming workshop environment. The focus will be on how AR can be specifically applied to improve task execution and understanding in material forming operations. Participants are mechanical engineering students specializing in material forming. They will engage in a series of standardized tasks related to machining processes. Performance will be gauged by metrics like task completion time and error rates, while task load will be assessed via the NASA Task Load Index (NASA-TLX) [1]. Acceptability of each manual type will be evaluated using the System Usability Scale (SUS) [2]. By comparing these various instructional formats, this research seeks to shed light on the most effective mediums for enhancing both operator performance and experience.

Paperid: 2316, https://arxiv.org/pdf/2503.15498.pdf

Abstract:
Revival is an innovative live audiovisual performance and music improvisation by our artist collective K-Phi-A, blending human and AI musicianship to create electronic music with audio-reactive visuals. The performance features real-time co-creative improvisation between a percussionist, an electronic music artist, and AI musical agents. Trained on works by deceased composers and the collective's compositions, these agents dynamically respond to human input and emulate complex musical styles. An AI-driven visual synthesizer, guided by a human VJ, produces visuals that evolve with the musical landscape. Revival showcases the potential of AI and human collaboration in improvisational artistic creation.

Paperid: 2317, https://arxiv.org/pdf/2503.15494.pdf

Abstract:
Artificial Intelligence (AI) is revolutionizing assistive technologies. It offers innovative solutions to enhance the quality of life for individuals with visual impairments. This review examines the development, applications, and impact of AI-powered tools in key domains, such as computer vision, natural language processing (NLP), and wearable devices. Specific advancements include object recognition for identifying everyday items, scene description for understanding surroundings, and NLP-driven text-to-speech systems for accessing digital information. Assistive technologies like smart glasses, smartphone applications, and AI-enabled navigation aids are discussed, demonstrating their ability to support independent travel, facilitate social interaction, and increase access to education and employment opportunities. The integration of deep learning models, multimodal interfaces, and real-time data processing has transformed the functionality and usability of these tools, fostering inclusivity and empowerment. This article also addresses critical challenges, including ethical considerations, affordability, and adaptability in diverse environments. Future directions highlight the need for interdisciplinary collaboration to refine these technologies, ensuring equitable access and sustainable innovation. By providing a comprehensive overview, this review underscores AI's transformative potential in promoting independence, enhancing accessibility, and fostering social inclusion for visually impaired individuals.

Paperid: 2318, https://arxiv.org/pdf/2503.15492.pdf

Abstract:
Manual scoring of polysomnography (PSG) is a time-intensive task, prone to inter-scorer variability that can impact diagnostic reliability. This study investigates the integration of decision support systems (DSS) into PSG scoring workflows, focusing on their effects on accuracy, scoring time, and potential biases toward recommendations from artificial intelligence (AI) compared to human-generated recommendations. Using a novel online scoring platform, we conducted a repeated-measures study with sleep technologists, who scored traditional and self-applied PSGs. Participants were occasionally presented with recommendations labeled as either human- or AI-generated. We found that traditional PSGs tended to be scored slightly more accurately than self-applied PSGs, but this difference was not statistically significant. Correct recommendations significantly improved scoring accuracy for both PSG types, while incorrect recommendations reduced accuracy. No significant bias was observed toward or against AI-generated recommendations compared to human-generated recommendations. These findings highlight the potential of AI to enhance PSG scoring reliability. However, ensuring the accuracy of AI outputs is critical to maximizing its benefits. Future research should explore the long-term impacts of DSS on scoring workflows and strategies for integrating AI in clinical practice.

Paperid: 2319, https://arxiv.org/pdf/2503.15204.pdf

Abstract:
Swine disease surveillance is critical to the sustainability of global agriculture, yet its effectiveness is frequently undermined by limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy. To overcome these barriers, we introduce a novel AI-powered, multi-agent diagnostic system that leverages Retrieval-Augmented Generation (RAG) to deliver timely, evidence-based disease detection and clinical guidance. By automatically classifying user inputs into either Knowledge Retrieval Queries or Symptom-Based Diagnostic Queries, the system ensures targeted information retrieval and facilitates precise diagnostic reasoning. An adaptive questioning protocol systematically collects relevant clinical signs, while a confidence-weighted decision fusion mechanism integrates multiple diagnostic hypotheses to generate robust disease predictions and treatment recommendations. Comprehensive evaluations encompassing query classification, disease diagnosis, and knowledge retrieval demonstrate that the system achieves high accuracy, rapid response times, and consistent reliability. By providing a scalable, AI-driven diagnostic framework, this approach enhances veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to the realization of global food security.
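
The abstract does not specify how the confidence-weighted decision fusion combines hypotheses, so the following is only one plausible reading, sketched under stated assumptions: each agent reports a candidate disease with a confidence in [0, 1], confidences are summed per disease, and the totals are normalized into scores. The function name `fuse_hypotheses` and the example inputs are hypothetical.

```python
from collections import defaultdict


def fuse_hypotheses(hypotheses):
    """Combine (disease, confidence) pairs from several agents.

    Returns the top-scoring disease and the normalized score per disease.
    Assumed fusion rule: sum of confidences, then normalize to sum to 1.
    """
    totals = defaultdict(float)
    for disease, confidence in hypotheses:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        totals[disease] += confidence
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no hypothesis carries any confidence")
    scores = {disease: c / grand_total for disease, c in totals.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Illustrative agent outputs (made-up confidences, not results from the paper).
agent_outputs = [
    ("PRRS", 0.6),
    ("African swine fever", 0.7),
    ("PRRS", 0.5),
]
best, scores = fuse_hypotheses(agent_outputs)
```

With these inputs, two moderately confident votes for the same disease outweigh a single more confident vote for another, which is the behaviour that distinguishes this kind of fusion from simply taking the most confident agent.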

Paperid: 2320, https://arxiv.org/pdf/2503.15100.pdf

Abstract:
Social Virtual Reality (VR), where people meet in virtual spaces via 3D avatars, is used by children and adults alike. Children experience new forms of harassment in social VR where it is often inaccessible to parental oversight. To date, there is limited understanding of how parents and non-parent adults within the child social VR ecosystem perceive the appropriateness of social VR for different age groups and the measures in place to safeguard children. We present results of a mixed-methods questionnaire (N=149 adults, including 79 parents) focusing on encounters with children in social VR and perspectives towards children's use of social VR. We draw novel insights on the frequency of social VR use by children under 13 and current use of, and future aspirations for, child protection interventions. Compared to non-parent adults, parents familiar with social VR propose lower minimum ages and are more likely to allow social VR without supervision. Adult users experience immaturity from children in social VR, while children face abuse, encounter age-inappropriate behaviours and self-disclose to adults. We present directions to enhance the safety of social VR through pre-planned controls, real-time oversight, post-event insight and the need for evidence-based guidelines to support parents and platforms around age-appropriate interventions.

Paperid: 2321, https://arxiv.org/pdf/2503.15098.pdf

Abstract:
Immersive technologies are capable of transporting people to distant or inaccessible environments that they might not otherwise visit. Practitioners and researchers alike are discovering new ways to replicate and enhance existing tourism experiences using virtual reality, yet few controlled experiments have studied how users perceive virtual tours of real-world locations. In this paper we present an initial exploration of a new system for virtual tourism, measuring the effects of real-time experiences and storytelling on presence, place attachment, and user memories of the destination. Our results suggest that narrative plays an important role in inducing presence within and attachment to the destination, while livestreaming can further increase place attachment while providing flexible, tailored experiences. We discuss the design and evaluation of our system, including feedback from our tourism partners, and provide insights into current limitations and further opportunities for virtual tourism.

Paperid: 2322, https://arxiv.org/pdf/2503.14322.pdf

Abstract:
Electroencephalography-based eye tracking (EEG-ET) leverages eye movement artifacts in EEG signals as an alternative to camera-based tracking. While EEG-ET offers advantages such as robustness in low-light conditions and better integration with brain-computer interfaces, its development lags behind traditional methods, particularly in consumer-grade settings. To support research in this area, we present a dataset comprising simultaneous EEG and eye-tracking recordings from 113 participants across 116 sessions, amounting to 11 hours and 45 minutes of recordings. Data was collected using a consumer-grade EEG headset and webcam-based eye tracking, capturing eye movements under four experimental paradigms with varying complexity. The dataset enables the evaluation of EEG-ET methods across different gaze conditions and serves as a benchmark for assessing feasibility with affordable hardware. Data preprocessing includes handling of missing values and filtering to enhance usability. In addition to the dataset, code for data preprocessing and analysis is available to support reproducibility and further research.

Paperid: 2323, https://arxiv.org/pdf/2503.14143.pdf

Abstract:
We found that children in elementary school often experience stress during task performance. Limited coping skills and lack of stress awareness restrict children's ability to manage their stress. Many designs and studies have proposed different stress detection and intervention solutions. Still, they often overlook the potential of enhancing everyday objects and actively sensing stress-related behavioral data during human-product interaction. Therefore, we propose Petting pen as an interactive robotic object for children to manage their stress during task performance. It detects and validates stress and further intervenes in stress during a process of natural writing and relaxation interactions. The design is an iteration based on our previous research results of a stress-aware pen, enhanced with tactile needs, robotic interaction, and integration of behavioral and bio-sensing capabilities. Petting pen is supposed to bridge the gap between robots and everyday objects in mental health applications for children.

Paperid: 2324, https://arxiv.org/pdf/2503.13121.pdf

Abstract:
Computer-mediated concerts can be enjoyed on various devices, from desktop and mobile to VR devices, often supporting multiple devices simultaneously. However, due to the limited accessibility of VR devices, relatively few audience members tend to congregate in VR venues, which diminishes the unique social experience of a concert. To address this gap and enrich VR concert experiences, we present a novel approach that leverages non-VR user interaction data, specifically chat from audiences watching the same content on a live-streaming platform. Based on an analysis of audience reactions in offline concerts, we designed and prototyped a concert interaction translation system that extracts the level of engagement and emotions from chats and translates them to collective movements, cheers, and singalongs of virtual audience avatars in a VR venue. Our user study (n=48) demonstrates that our system, which combines both movement and audio reactions, significantly enhances the sense of immersion and co-presence compared with the previous method.

Paperid: 2325, https://arxiv.org/pdf/2503.13034.pdf

Abstract:
Continuous prediction of finger joint movement using historical joint positions/rotations is vital in a multitude of applications, especially those related to virtual reality, computer graphics, robotics, and rehabilitation. However, finger motions are highly articulated with multiple degrees of freedom, making them significantly harder to model and predict. To address this challenge, we propose a physics-inspired time-agnostic graph neural network (TA-GNN) to accurately predict human finger motions. The proposed encoder comprises a kinematic feature extractor to generate filtered velocity and acceleration and a physics-based encoder that follows linear kinematics. The model is designed to be prediction-time-agnostic so that it can seamlessly provide continuous predictions. The graph-based decoder for learning the topological motion between finger joints is designed to address the higher-degree articulation of fingers. We show the superiority of our model's performance in a virtual reality context. This novel approach enhances finger tracking without additional sensors, enabling predictive interactions such as haptic re-targeting and improving predictive rendering quality.
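To make the "linear kinematics" idea concrete, the following sketch estimates velocity and acceleration from past samples by backward finite differences and extrapolates with the constant-acceleration formula p + v*t + 0.5*a*t^2 for an arbitrary horizon t. It is an assumption-laden illustration for a single scalar joint coordinate: the function names are hypothetical, the filtering step mentioned in the abstract is omitted, and this is not the TA-GNN encoder itself.

```python
def finite_difference(samples, dt):
    """Backward differences between consecutive samples spaced dt apart."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


def predict_position(positions, dt, horizon):
    """Extrapolate one joint coordinate `horizon` seconds past the last sample.

    Uses the latest finite-difference velocity and acceleration and the
    constant-acceleration equation p + v*t + 0.5*a*t**2. Because `horizon`
    is a free parameter, the same function serves any prediction time.
    """
    if len(positions) < 3:
        raise ValueError("need at least three samples to estimate acceleration")
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    p, v, a = positions[-1], velocity[-1], acceleration[-1]
    return p + v * horizon + 0.5 * a * horizon ** 2


# Hypothetical joint angle sampled once per second, moving at constant speed.
history = [0.0, 1.0, 2.0]
half_step_ahead = predict_position(history, dt=1.0, horizon=0.5)
```

The horizon is passed in at call time rather than fixed at training time, which is the property the abstract describes as prediction-time-agnostic.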

Paperid: 2326, https://arxiv.org/pdf/2503.13018.pdf

Abstract:
Participatory data physicalisation (PDP) is recognised for its potential to support data-driven decisions among stakeholders who collaboratively construct physical elements into commonly insightful visualisations. Like all participatory processes, PDP is, however, influenced by underlying power dynamics that might lead to issues regarding extractive participation, marginalisation, or exclusion, among others. We first identified the decisions behind these power dynamics by developing an ontology that synthesises critical theoretical insights from both visualisation and participatory design research, which was then systematically applied to a representative corpus of 23 PDP artefacts. By revealing how shared decisions are guided by different agendas, this paper presents three contributions: 1) a cross-disciplinary ontology that facilitates the systematic analysis of existing and novel PDP artefacts and processes; which leads to 2) six PDP agendas that reflect the key power dynamics in current PDP practice, revealing the diversity of orientations towards stakeholder participation in PDP practice; and 3) a set of critical considerations that should guide how power dynamics can be balanced, such as by reflecting on how issues are represented, how data is contextualised, how participants express their meanings, and how participants can dissent through flexible artefact construction. Consequently, this study advances a feminist research agenda by guiding researchers and practitioners in openly reflecting on and sharing responsibilities in data physicalisation and participatory data visualisation.

Paperid: 2327, https://arxiv.org/pdf/2503.13003.pdf

Abstract:
Saliency modulation has significant potential for various applications. In our pursuit of implementing saliency modulation for optical see-through near-eye displays, we decided to introduce a blur effect to reduce the sharpness of specific areas while preserving the sharpness of others. In this study, we used a digital micromirror device (DMD) to separate the incoming light from a scene into sharp and blurred areas. To achieve this, we integrated an electrically tunable lens (ETL), which operates in its zero optical power mode when the reflected light from the DMD represents the sharp area (i.e., the blur area is masked). Conversely, when the reflected light indicates the blur area, the ETL adjusts to non-zero optical powers. Importantly, these modulations occur at a speed that surpasses the critical flicker frequency threshold of the human eye. Furthermore, we proposed an algorithm to mitigate the artifacts around the border area between the sharp and blur areas that are caused by the magnification of the ETL. We have also developed a prototype system to demonstrate the feasibility of our method.

Paperid: 2328, https://arxiv.org/pdf/2503.12762.pdf

Abstract:
Tech neck, a growing musculoskeletal concern caused by prolonged poor posture during device use, has significant health implications. This study investigates the relationship between head posture and muscular activity in the upper trapezius muscle to predict muscle strain by leveraging data from EMG sensors and head trackers. We train a regression model to predict EMG envelope readings using head movement data. We conduct preliminary experiments involving various postures to explore the correlation between these modalities and assess the feasibility of predicting muscle strain using head worn sensors. We discuss the key research challenges in sensing and predicting muscle fatigue. The results highlight the potential of this approach in real-time ergonomic feedback systems, contributing to the prevention and management of tech neck.

Paperid: 2329, https://arxiv.org/pdf/2503.12641.pdf

Abstract:
Driven by the vision of everyday haptics, the HCI community is advocating for "design touch first" and investigating "how to touch well." However, a gap remains between the exploratory nature of haptic design and technical reproducibility. We present Shape-Kit, a hybrid design toolkit embodying our "crafting haptics" metaphor, where hand touch is transduced into dynamic pin-based sensations that can be freely explored across the body. An ad-hoc tracking module captures and digitizes these patterns. Our study with 14 designers and artists demonstrates how Shape-Kit facilitates sensorial exploration for expressive haptic design. We analyze how designers collaboratively ideate, prototype, iterate, and compose touch experiences and show the subtlety and richness of touch that can be achieved through diverse crafting methods with Shape-Kit. Reflecting on the findings, our work contributes key insights into haptic toolkit design and touch design practices centered on the "crafting haptics" metaphor. We discuss in-depth how Shape-Kit's simplicity, though remaining constrained, enables focused crafting for deeper exploration, while its collaborative nature fosters shared sense-making of touch experiences.

Paperid: 2330, https://arxiv.org/pdf/2503.12619.pdf

Abstract:
Learning to solve a Rubik's Cube requires the learners to repeatedly practice a skill component, e.g., identifying a misplaced square and putting it back. However, for 3D physical tasks such as this, generating sufficient repeated practice opportunities for learners can be challenging, in part because it is difficult for novices to reconfigure the physical object to specific states. We propose Rubikon, an intelligent tutoring system for learning to solve the Rubik's Cube. Rubikon reduces the necessity for repeated manual configurations of the Rubik's Cube without compromising the tactile experience of handling a physical cube. The foundational design of Rubikon is an AR setup, where learners manipulate a physical cube while seeing an AR-rendered cube on a display. Rubikon automatically generates configurations of the Rubik's Cube to target learners' weaknesses and help them exercise diverse knowledge components. In a between-subjects experiment, we showed that Rubikon learners scored 25% higher on a post-test compared to baselines.

Paperid: 2331, https://arxiv.org/pdf/2503.11466.pdf

Abstract:
Human Activity Recognition (HAR) using wearable inertial measurement unit (IMU) sensors can revolutionize healthcare by enabling continual health monitoring, disease prediction, and routine recognition. Despite the high accuracy of Deep Learning (DL) HAR models, their robustness to real-world variabilities remains untested, as they have primarily been trained and tested on limited lab-confined data. In this study, we isolate subject, device, position, and orientation variability to determine their effect on DL HAR models and assess the robustness of these models in real-world conditions. We evaluated the DL HAR models using the HARVAR and REALDISP datasets, providing a comprehensive discussion on the impact of variability on data distribution shifts and changes in model performance. Our experiments measured shifts in data distribution using Maximum Mean Discrepancy (MMD) and observed DL model performance drops due to variability. We conclude that the studied variabilities affect DL HAR models differently, and that there is an inverse relationship between data distribution shifts and model performance. The compounding effect of variability was analyzed, and the implications of variabilities in real-world scenarios were highlighted. MMD proved an effective metric for calculating data distribution shifts and explained the drop in performance due to variabilities in the HARVAR and REALDISP datasets. Combining our understanding of variability with an evaluation of its effects will facilitate the development of more robust DL HAR models and optimal training techniques, allowing future models to be assessed not only on their maximum F1 score but also on their ability to generalize effectively.
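For readers unfamiliar with MMD, the sketch below computes the standard biased estimate of squared MMD between two samples using a Gaussian (RBF) kernel. The kernel choice, the `gamma` value, and the toy feature vectors are assumptions for illustration; the abstract does not state which kernel or features the authors used.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2) for equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


# Toy two-dimensional feature vectors (made up for illustration only).
source = [(0.0, 0.0), (1.0, 1.0)]
shifted = [(5.0, 5.0), (6.0, 6.0)]
no_shift = mmd_squared(source, source)
with_shift = mmd_squared(source, shifted)
```

The estimate is zero when both samples are identical and grows as the two samples move apart, which is what makes it usable as a measure of distribution shift between, for example, two sensor placements.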

Paperid: 2332, https://arxiv.org/pdf/2503.11177.pdf

Abstract:
This study investigates the subjective experiences of users in two robotic object delivery methods: direct handover and table placement, when users are occupied with another task. A user study involving 15 participants engaged in a typing game revealed that table placement significantly enhances user experience compared to direct handovers, particularly in terms of satisfaction, perceived safety and intuitiveness. Additionally, handovers negatively impacted typing performance, while all participants expressed a clear preference for table placement as the delivery method. These findings highlight the advantages of table placement in scenarios requiring minimal user disruption.

Paperid: 2333, https://arxiv.org/pdf/2503.10488.pdf

Abstract:
Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation that extends rolling diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup with high visual fidelity and temporal coherence. We evaluate our approach on ZEGGS and BEAT, strong benchmarks for real-world applicability. Our framework is universally applicable to any diffusion-based gesture generation model, transforming it into a streaming approach. Applied to three state-of-the-art methods, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time, high-fidelity co-speech gesture synthesis.

Paperid: 2334, https://arxiv.org/pdf/2503.10116.pdf

Abstract:
The knowledge transfer from 3D printing technology paved the way for unlocking the innovative potential of 3D Food Printing (3DFP) technology. However, this technology-oriented approach neglects user-derived issues that could be addressed with advancements in 3DFP technology. To explore potential new features and application areas for 3DFP technology, we created the Mobile Food Printer (MFP) prototype. We collected insights from novice chefs for MFP in the restaurant context through four online focus group sessions (N=12). Our results revealed how MFP can be applied in the current kitchen routines (preparation, serving, and eating) and introduce novel dining experiences. We discuss our learnings under two themes: 1) dealing with the kitchen rush and 2) streamlining workflows in the kitchen. The opportunities we present in this study act as a starting point for HCI and HFI researchers and encourage them to implement mobility in 3DFP with a user-oriented lens. We further provide a ground for future research to uncover potentials for advancing 3DFP technology.

Paperid: 2335, https://arxiv.org/pdf/2503.09987.pdf

Abstract:
As human augmentation technologies evolve, the convergence of AI, robotics, and extended reality (XR) is redefining human potential -- enhancing cognition, perception, and physical abilities. However, these advancements also introduce ethical dilemmas, security risks, and concerns over loss of control. This workshop explores both the transformative potential and the unintended consequences of augmentation technologies. Bringing together experts from HCI, neuroscience, robotics, and ethics, we will examine real-world applications, emerging risks, and governance strategies for responsible augmentation. The session will feature keynote talks and interactive discussions, addressing topics such as AI-enhanced cognition, wearable robotics, neural interfaces, and XR-driven augmentation. By fostering multidisciplinary dialogue, this workshop aims to generate actionable insights for responsible innovation, proposing ethical frameworks to balance human empowerment with risk mitigation. We invite researchers, practitioners, and industry leaders to contribute their perspectives and help shape the future of human augmentation.

Paperid: 2336, https://arxiv.org/pdf/2503.09849.pdf

Abstract:
After-action reviews (AARs) are professional discussions that help operators and teams enhance their task performance by analyzing completed missions with peers and professionals. Previous studies that compared different formats of AARs have mainly focused on human teams. However, the inclusion of robotic teammates brings new challenges in understanding teammate intent and communication. Traditional AARs between human teammates may not be satisfactory for human-robot teams. To address this limitation, we propose a new training review (TR) tool, called the Virtual Spectator Interface (VSI), to enhance human-robot team performance and situational awareness (SA) in a simulated search mission. The proposed VSI primarily utilizes visual feedback to review subjects' behavior. To examine the effectiveness of VSI, we adapted elements from AAR to conduct our own TR and designed a 1 x 3 between-subjects experiment with three conditions: TR with (1) VSI, (2) screen recording, and (3) no technology (only verbal descriptions). The results of our experiments demonstrated that the VSI did not result in significantly better team performance than the other conditions. However, the TR with VSI led to more improvement in the subjects' SA than the other conditions.

Paperid: 2337, https://arxiv.org/pdf/2503.09832.pdf

Abstract:
Massively multiplayer online games (MMOGs) can foster social interaction and relationship formation, but they pose specific privacy and safety challenges, especially in the context of mediating intimate interpersonal connections. To explore the potential risks, we conducted a case study on Final Fantasy XIV (FFXIV) players' intimate-partner-seeking posts on social media. We analyzed 1,288 posts from a public Weibo account using Latent Dirichlet Allocation (LDA) topic modeling and thematic analysis. Our findings reveal that players disclose sensitive personal information and share vulnerabilities to establish trust but face difficulties in managing identity and privacy across multiple platforms. We also found that players' expectations regarding intimate partners are diverse, and that mismatched expectations may lead to issues such as privacy leakage or emotional exploitation. Based on our findings, we propose design implications for reducing privacy and safety risks and fostering healthier social interactions in virtual worlds.

Paperid: 2338, https://arxiv.org/pdf/2503.09150.pdf

Abstract:
Personalization is a critical yet often overlooked factor in boosting productivity and wellbeing in knowledge-intensive workplaces. Existing tools typically offer uniform guidance, whether auto-generating email responses or prompting break reminders, without accounting for individual behavioral patterns or stress triggers. We introduce AdaptAI, a multimodal AI solution combining egocentric vision and audio, heart and motion activities, and the agentic workflow of Large Language Models (LLMs) to deliver highly personalized productivity support and context-aware well-being interventions. AdaptAI not only automates peripheral tasks (e.g., drafting succinct document summaries and replying to emails) but also continuously monitors the user's unique physiological and situational indicators to dynamically tailor interventions, such as micro-break suggestions or exercise prompts, at the exact point of need. In a preliminary study with 15 participants, AdaptAI demonstrated significant improvements in task throughput and user satisfaction by anticipating user stressors and streamlining daily workflows.

Paperid: 2339, https://arxiv.org/pdf/2503.09102.pdf

Abstract:
While generative AI is advancing writing support tools, creative writing is often seen as the exclusive domain of skilled writers. This paper introduces "1001 Nights", a co-creative story-crafting game that transforms writing into a playful and rewarding activity. In this game, the AI agent takes on the role of a "moody" king with distinct storytelling preferences, not merely assisting but actively influencing the narrative. Players engage with the king agent through strategic storytelling, guiding him to mention weapon-related keywords, which materialize as battle equipment. The king agent provides dynamic feedback, expressing satisfaction or displeasure, prompting players to adjust their approach. By combining storytelling, game mechanics, and AI-driven responses, our system motivates creativity through playful constraints. Inspired by Oulipo's literary techniques, this approach demonstrates how AI-powered game experiences can make creative writing more accessible and engaging, encouraging players to explore their creative potential.

Paperid: 2340, https://arxiv.org/pdf/2503.08205.pdf

Abstract:
The primary challenge in continuous sign language recognition (CSLR) stems from the presence of multi-orientational and long-term motions. However, current research overlooks these crucial aspects, significantly impacting accuracy. To tackle these issues, we propose a novel CSLR framework: Orientation-aware Long-term Motion Decoupling (OLMD), which efficiently aggregates long-term motions and decouples multi-orientational signals into easily interpretable components. Specifically, our innovative Long-term Motion Aggregation (LMA) module filters out static redundancy while adaptively capturing abundant features of long-term motions. We further enhance orientation awareness by decoupling complex movements into horizontal and vertical components, allowing for motion purification in both orientations. Additionally, two coupling mechanisms are proposed: stage and cross-stage coupling, which together enrich multi-scale features and improve the generalization capabilities of the model. Experimentally, OLMD shows SOTA performance on three large-scale datasets: PHOENIX14, PHOENIX14-T, and CSL-Daily. Notably, we improved the word error rate (WER) on PHOENIX14 by an absolute 1.6% compared to the previous SOTA.
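Since the headline result is reported as word error rate, a short sketch of the conventional definition may help: WER is the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference, divided by the number of reference words. The function name and the toy token sequences below are illustrative assumptions; in CSLR the tokens would be sign glosses, and the paper's exact evaluation script is not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Both arguments are whitespace-separated token strings.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_token == hyp_token else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one substitution (b -> x) and one deletion (d) over 4 tokens.
example_wer = word_error_rate("a b c d", "a x c")
```

An "absolute" improvement in this metric refers to a difference in percentage points between two WER values, not a relative reduction.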

Paperid: 2341, https://arxiv.org/pdf/2503.07512.pdf

Abstract:
Text in dashboards plays multiple critical roles, including providing context, offering insights, guiding interactions, and summarizing key information. Despite its importance, most dashboarding tools focus on visualizations and offer limited support for text authoring. To address this gap, we developed Plume, a system to help authors craft effective dashboard text. Through a formative review of exemplar dashboards, we created a typology of text parameters and articulated the relationship between visual placement and semantic connections, which informed Plume's design. Plume employs large language models (LLMs) to generate contextually appropriate content and provides guidelines for writing clear, readable text. A preliminary evaluation with 12 dashboard authors explored how assisted text authoring integrates into workflows, revealing strengths and limitations of LLM-generated text and the value of our human-in-the-loop approach. Our findings suggest opportunities to improve dashboard authoring tools by better supporting the diverse roles that text plays in conveying insights.

Paperid: 2342, https://arxiv.org/pdf/2503.06871.pdf

Abstract:
The development of digital humanities necessitates scholars to adopt more data-intensive methods and engage in multidisciplinary collaborations. Understanding their collaborative data behaviors becomes essential for providing more curated data, tailored tools, and a collaborative research environment. This study explores how interdisciplinary researchers collaborate on data activities by conducting focus group interviews with 19 digital humanities research groups. Through inductive coding, the study identified seven primary and supportive data activities and found that different collaborative modes are adopted in various data activities. The collaborative modes include humanities-driven, technically-driven, and balanced, depending on how team members naturally adjusted their responsibilities based on their expertise. These findings establish a preliminary framework for examining collaborative data behavior and interdisciplinary collaboration in digital humanities.

Paperid: 2343, https://arxiv.org/pdf/2503.06729.pdf

Abstract:
Small business owners (SBOs) often lack the resources and design experience needed to produce high-quality advertisements. To address this, we developed ACAI (AI Co-Creation for Advertising and Inspiration), a GenAI-powered multimodal advertisement creation tool, and conducted a user study with 16 SBOs in London to explore their perceptions of and interactions with ACAI in advertisement creation. Our findings reveal that structured inputs enhance user agency and control while improving AI outputs by facilitating better brand alignment, enhancing AI transparency, and offering scaffolding that assists novice designers, such as SBOs, in formulating prompts. We also found that ACAI's multimodal interface bridges the design skill gap for SBOs who have a clear advertisement vision but lack the design jargon necessary for effective prompting. Building on our findings, we propose three capabilities: contextual intelligence, adaptive interactions, and data management, with corresponding design recommendations to advance the co-creative attributes of AI-mediated design tools.

Paperid: 2344, https://arxiv.org/pdf/2503.06552.pdf

Abstract:
LLM chatbot interfaces allow students to get instant, interactive assistance with homework, but doing so carelessly may not advance educational objectives. In this study, an interactive homework help system based on DeepSeek R1 is developed and first implemented for students enrolled in a large computer science beginning programming course. In addition to an assist button in a well-known code editor, our assistant also has a feedback option in our command-line automatic evaluator. It wraps student work in a personalized prompt that advances our educational objectives without offering answers straight away. We have discovered that our assistant can recognize students' conceptual difficulties and provide ideas, plans, and template code in pedagogically appropriate ways. However, among other mistakes, it occasionally incorrectly labels the correct student code as incorrect or encourages students to use correct-but-lesson-inappropriate approaches, which can lead to long and frustrating journeys for the students. After discussing many development and deployment issues, we provide our conclusions and future actions.

Paperid: 2345, https://arxiv.org/pdf/2503.06229.pdf

Abstract:
We introduce Frank, a human-in-the-loop system for co-evolutionary hybrid decision-making that helps the user label records from an unlabeled dataset. Frank employs incremental learning to "evolve" in parallel with the user's decisions, by training an interpretable machine learning model on the records labeled by the user. Furthermore, Frank advances state-of-the-art approaches by offering inconsistency controls, explanations, fairness checks, and bad-faith safeguards simultaneously. We evaluate our proposal by simulating the users' behavior with various levels of expertise and reliance on Frank's suggestions. The experiments show that Frank's intervention leads to improvements in the accuracy and the fairness of the decisions.

Paperid: 2346, https://arxiv.org/pdf/2503.05899.pdf

Abstract:
Blind and Low Vision (BLV) people have adopted AI-powered visual interpretation applications to address their daily needs. While these applications have been helpful, prior work has found that users remain unsatisfied by their frequent errors. Recently, multimodal large language models (MLLMs) have been integrated into visual interpretation applications, and they show promise for more descriptive visual interpretations. However, it is still unknown how this advancement has changed people's use of these applications. To address this gap, we conducted a two-week diary study in which 20 BLV people used an MLLM-enabled visual interpretation application we developed, and we collected 553 entries. In this paper, we report a preliminary analysis of 60 diary entries from 6 participants. We found that participants considered the application's visual interpretations trustworthy (mean 3.75 out of 5) and satisfying (mean 4.15 out of 5). Moreover, participants trusted our application in high-stakes scenarios, such as receiving medical dosage advice. We discuss our plan to complete our analysis to inform the design of future MLLM-enabled visual interpretation systems.

Paperid: 2347, https://arxiv.org/pdf/2503.05649.pdf

Abstract:
This study investigates the influence of Visual Guidance (VG) on user performance and human factors within Augmented Reality (AR) via a between-subjects experiment. VG is a crucial component in AR applications, serving as a bridge between digital information and real-world interactions. Unlike prior research, which often produced inconsistent outcomes, our study focuses on varying types of supportive visualisations rather than interaction methods. Our findings reveal a 31% reduction in task completion time, offset by a significant rise in errors, highlighting a compelling trade-off between speed and accuracy. Furthermore, we assess the detrimental effects of occlusion as part of our experimental design. In addition to examining other variables such as cognitive load, motivation, and usability, we identify specific directions and offer actionable insights for future research. Overall, our results underscore the promise of VG for enhancing user performance in AR, while emphasizing the importance of further investigating the underlying human factors.

Paperid: 2348, https://arxiv.org/pdf/2503.05456.pdf

Abstract:
This paper investigates multi-selection in XR interfaces based on eye and hand interaction. We propose enabling multi-selection using different variations of techniques that combine gaze with a semi-pinch gesture, allowing users to select multiple objects, while on the way to a full-pinch. While our exploration is based on the semi-pinch mode for activating a quasi-mode, we explore four methods for confirming subselections in multi-selection mode, varying in effort and complexity: dwell-time (SemiDwell), swipe (SemiSwipe), tilt (SemiTilt), and non-dominant hand input (SemiNDH), and compare them to a baseline technique. In the user study, we evaluate their effectiveness in reducing task completion time, errors, and effort. The results indicate the strengths and weaknesses of each technique, with SemiSwipe and SemiDwell as the most preferred methods by participants. We also demonstrate their utility in file managing and RTS gaming application scenarios. This study provides valuable insights to advance 3D input systems in XR.

Paperid: 2349, https://arxiv.org/pdf/2503.05455.pdf

Abstract:
Research on human-AI collaboration often prioritizes objective performance. However, understanding human subjective preferences is essential to improving human-AI complementarity and human experiences. We investigate human preferences for controllability in a shared workspace task with AI partners using Behavior Shaping (BS), a reinforcement learning algorithm that allows humans explicit control over AI behavior. In one experiment, we validate the robustness of BS in producing effective AI policies relative to self-play policies, when controls are hidden. In another experiment, we enable human control, showing that participants perceive AI partners as more effective and enjoyable when they can directly dictate AI behavior. Our findings highlight the need to design AI that prioritizes both task performance and subjective human preferences. By aligning AI behavior with human preferences, we demonstrate how human-AI complementarity can extend beyond objective outcomes to include subjective preferences.

Paperid: 2350, https://arxiv.org/pdf/2503.05109.pdf

Abstract:
Large language models (LLMs) are increasingly used to assist computational social science research. While prior efforts have focused on text, the potential of leveraging multimodal LLMs (MLLMs) for online video studies remains underexplored. We conduct one of the first case studies on MLLM-assisted video content analysis, comparing AI's interpretations to human understanding of abstract concepts. We leverage LLaVA-1.6 Mistral 7B to interpret four abstract concepts regarding video-mediated self-disclosure, analyzing 725 keyframes from 142 depression-related YouTube short videos. We perform a qualitative analysis of the MLLM's self-generated explanations and find that the degree of operationalization can influence the MLLM's interpretations. Interestingly, greater detail does not necessarily increase human-AI alignment. We also identify other factors affecting AI alignment with human understanding, such as concept complexity and versatility of video genres. Our exploratory study highlights the need to customize prompts for specific concepts and calls for researchers to incorporate more human-centered evaluations when working with AI systems in a multimodal context.

Paperid: 2351, https://arxiv.org/pdf/2503.04696.pdf

Abstract:
Generative Artificial Intelligence (GenAI) tools and models have the potential to re-shape educational needs, norms, practices, and policies in all sectors of engineering education. Empirical data, rather than anecdata and assumptions, on how engineering students have adopted GenAI is essential to developing a foundational understanding of students' GenAI-related behaviors and needs during academic training. This data will also help formulate effective responses to GenAI by both academic institutions and industrial employers. We collected two representative survey samples at the Colorado School of Mines, a small engineering-focused R-1 university in the USA, in May 2023 ($n_1=601$) and September 2024 ($n_2=862$) to address research questions related to (RQ1) how GenAI has been adopted by engineering students, including motivational and demographic factors contributing to GenAI use, (RQ2) students' ethical concerns about GenAI, and (RQ3) students' perceived benefits vs. harms for themselves, science, and society. Analysis revealed a statistically significant rise in GenAI adoption rates from 2023 to 2024. Students predominantly leverage GenAI tools to deepen understanding, enhance work quality, and stay informed about emerging technologies. Although most students assess their own usage of GenAI as ethical and beneficial, they nonetheless expressed significant concerns regarding GenAI and its impacts on society. We collected student estimates of "P(doom)" and discovered a bimodal distribution. Thus, we show that the student body at Mines is polarized with respect to future impacts of GenAI on the engineering workforce and society, despite being increasingly willing to explore GenAI over time. We discuss implications of these findings for future research and for integrating GenAI in engineering education.

Paperid: 2352, https://arxiv.org/pdf/2503.03967.pdf

Abstract:
Training AI models is challenging, particularly when crafting behavior instructions. Traditional methods rely on machines (supervised learning) or manual pattern discovery, which results in uninterpretable models or a time sink. While Large Language Models (LLMs) simplify instruction writing through natural language, articulating intended model behavior remains difficult. We introduce Visionary Tuning, a human-in-the-loop self-playing approach followed by automatic self-refinement to improve behavior specification. Our system helps users clarify desired behavior through self-playing and generates prompts through self-improving. Our first evaluation involves a user study conducted on a system implementation of Visionary Tuning within the context of chatbot behavior. Our system self-plays by simulating user interactions to identify patterns and create effective prompts based on those patterns. In a within-subject study (N=12), participants pinpointed more patterns through self-playing and crafted better prompts. Surprisingly, users felt more or less the same level of success in specifying the model behavior. Follow-up crowd studies (N=60) confirmed that the chatbot adhered to instructions without sacrificing quality. Our second evaluation is a case study on a real-world implementation using a movie rating dataset with Visionary Tuning, demonstrating its effectiveness and robustness in modeling a critic's preferences across the spectrum of low to highly rated movies. Together, these results suggest how AI improves the design process of interactive AI systems. Furthermore, they suggest how the benefits of these tools may be non-obvious to end-users. We reflect on these findings and suggest future directions.

Paperid: 2353, https://arxiv.org/pdf/2503.03617.pdf

Abstract:
People can generate high-quality ideas by building on each other's ideas. By enabling individuals to contribute their ideas at their own comfortable time and method (i.e., asynchronous ideation), they can deeply engage in ideation and improve idea quality. However, running asynchronous ideation faces a practical constraint. Whereas trained human facilitators are needed to guide effective idea exchange, they cannot be continuously available to engage with individuals joining at varying hours. In this paper, we ask how chatbots can be designed to facilitate asynchronous ideation. For this, we adopted the guidelines found in the literature about human facilitators and designed two chatbots: one provides a structured ideation process, and another adapts the ideation process to individuals' ideation performance. We invited 48 participants to generate and select ideas by interacting with one of our chatbots and invited an expert facilitator to review our chatbots. We found that both chatbots can guide users to build on each other's ideas and converge them into a few satisfying ideas. However, we also found the chatbots' limitations in social interaction with collaborators, which only human facilitators can provide. Accordingly, we conclude that chatbots can be promising facilitators of asynchronous ideation, but hybrid facilitation with human facilitators would be needed to address the social aspects of collaborative ideation.

Paperid: 2354, https://arxiv.org/pdf/2503.03532.pdf

Abstract:
Journaling plays a crucial role in managing chronic conditions by allowing patients to document symptoms and medication intake, providing essential data for long-term care. While valuable, traditional journaling methods often rely on static, self-directed entries, lacking interactive feedback and real-time guidance. This gap can result in incomplete or imprecise information, limiting its usefulness for effective treatment. To address this gap, we introduce PATRIKA, an AI-enabled prototype designed specifically for people with Parkinson's disease (PwPD). The system incorporates cooperative conversation principles, clinical interview simulations, and personalization to create a more effective and user-friendly journaling experience. Through two user studies with PwPD and iterative refinement of PATRIKA, we demonstrate conversational journaling's significant potential in patient engagement and collecting clinically valuable information. Our results showed that, by generating probing questions, PATRIKA turned journaling into a bi-directional interaction. Additionally, we offer insights for designing journaling systems for healthcare and future directions for promoting sustained journaling.

Paperid: 2355, https://arxiv.org/pdf/2503.03166.pdf

Abstract:
Dance teachers rely primarily on verbal instructions and visual demonstrations to convey key dance concepts and movement. These techniques, however, have limitations in supporting students who are blind or have low vision (BLV). This work explores the role technology can play in supporting instruction for BLV students, as well as improvisation with their instructor. Through a series of design workshops with dance instructors and BLV students, ideas were generated by physically engaging with probes featuring diverse modalities including tactile objects, a body tracked sound and musical probe, and a body tracked controller with vibrational feedback. Implications for the design of supporting technologies were discovered for four contemporary dance learning goals: learning a phrase; improvising; collaborating through movement; and awareness of body and movement qualities. We discuss the potential of numerous multi-sensory methods and artefacts, and present design considerations for technologies to support meaningful dance instruction and participation.

Paperid: 2356, https://arxiv.org/pdf/2503.03154.pdf

Abstract:
Data wrangling is a time-consuming and challenging task in a data science pipeline. While many tools have been proposed to automate or facilitate data wrangling, they often misinterpret user intent, especially in complex tasks. We propose Dango, a mixed-initiative multi-agent system for data wrangling. Compared to existing tools, Dango enhances user communication of intent by allowing users to demonstrate on multiple tables and use natural language prompts in a conversation interface, enabling users to clarify their intent by answering LLM-posed multiple-choice clarification questions, and providing multiple forms of feedback such as step-by-step natural language explanations and data provenance to help users evaluate the data wrangling scripts. We conducted a within-subjects user study with 38 participants and demonstrated that Dango's features can significantly improve intent clarification, accuracy, and efficiency in data wrangling. Furthermore, we demonstrated the generalizability of Dango by applying it to a broader set of data wrangling tasks.

Paperid: 2357, https://arxiv.org/pdf/2503.02816.pdf

Abstract:
The advancement of Vision-Language Model (VLM) camera sensors, which enable autonomous understanding of household situations without user intervention, has the potential to completely transform the DIY smart home building experience. Will this simplify or complicate the DIY smart home process? Additionally, what features do users want to create using these sensors? To explore this, we conducted a three-week diary-based experience prototyping study with 12 participants. Participants recorded their daily activities, used GPT to analyze the images, and manually customized and tested smart home features based on the analysis. The study revealed three key findings: (1) participants' expectations for VLM camera-based smart homes, (2) the impact of VLM camera sensor characteristics on the DIY process, and (3) users' concerns. Through the findings of this study, we propose design implications to support the DIY smart home building process with VLM camera sensors, and discuss living with intelligence.

Paperid: 2358, https://arxiv.org/pdf/2503.02703.pdf

Abstract:
Graphical assets play an important role in the design and development of games. There is potential in the use of AI-driven generative tools to aid in creating graphical assets, thus improving game design and development pipelines. However, there is little research to address how the generative methods can fit into the wider pipeline. There are also no guidelines or heuristics for creating such tools. To address this gap, we conducted a user study with 16 game designers and developers to examine their behaviour and interaction with generative tools for graphical assets. The findings highlight that the early design stage is preferred by all participants. Designers and developers are inclined to use such tools for creating large amounts of variations at the cost of quality, as they can improve the quality of the artefacts once they generate a suitable asset. The results also strongly raised the need for better integration of such tools in existing design and development environments and the need for the outputs to be in common data formats, to be manipulatable, and to integrate smoothly into existing environments. The study also highlights the requirement for further emphasis on the needs of the users to incorporate these tools effectively in existing pipelines. Informed by these results, we provide a set of heuristics for creating tools that meet the expectations and needs of game designers and developers.

Paperid: 2359, https://arxiv.org/pdf/2503.02309.pdf

Abstract:
Wireless earbuds are an appealing platform for wearable computing on-the-go. However, their small size and out-of-view location mean they support only a limited set of distinct inputs. We propose finger identification input on earbuds as a novel technique to resolve these problems. This technique involves associating touches by different fingers with different responses. To enable it on earbuds, we adapted prior work on smartwatches to develop a wireless earbud featuring a magnetometer that detects fields from a magnetic ring. A first study reveals participants achieve rapid, precise earbud touches with different fingers, even while mobile (time: 0.98s, errors: 5.6%). Furthermore, touching fingers can be accurately classified (96.9%). A second study shows strong performance with a more expressive technique involving multi-finger double-taps (inter-touch time: 0.39s, errors: 2.8%) while maintaining high accuracy (94.7%). We close by exploring and evaluating the design of earbud finger identification applications and demonstrating the feasibility of our system on low-resource devices.

Paperid: 2360, https://arxiv.org/pdf/2503.02308.pdf

Abstract:
Smartwatches offer powerful features, but their small touchscreens limit the expressiveness of the input that can be achieved. To address this issue, we present, and open-source, the first sonar-based around-device input on an unmodified consumer smartwatch. We achieve this using a fine-grained, one-dimensional sonar-based finger-tracking system. In addition, we use this system to investigate the fundamental issue of how to trigger selections during around-device smartwatch input through two studies. The first examines the methods of double-crossing, dwell, and finger tap in a binary task, while the second considers a subset of these designs in a multi-target task and in the presence and absence of haptic feedback. Results showed double-crossing was optimal for binary tasks, while dwell excelled in multi-target scenarios, and haptic feedback enhanced comfort but not performance. These findings offer design insights for future around-device smartwatch interfaces that can be directly deployed on today's consumer hardware.
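As a rough illustration of the dwell trigger mentioned above, the following minimal sketch fires a selection once a tracked one-dimensional finger position has stayed inside one target for a fixed time. The function name `dwell_selections`, the target layout, and the 0.5 s dwell time are hypothetical choices for this example; this is not the authors' open-sourced implementation.

```python
def dwell_selections(samples, targets, dwell_time):
    """Return (timestamp, target) pairs for dwell-triggered selections.

    samples:    list of (timestamp_seconds, position) from a 1D tracker
    targets:    dict mapping target name -> (low, high) position range
    dwell_time: seconds the finger must stay in one target to select it
    All names and values here are illustrative assumptions.
    """
    selections = []
    current, entered, fired = None, None, False
    for t, x in samples:
        # Find which target (if any) contains the current position.
        hit = next((name for name, (lo, hi) in targets.items()
                    if lo <= x < hi), None)
        if hit != current:
            # Finger moved to a different target (or left all targets):
            # restart the dwell timer.
            current, entered, fired = hit, t, False
        elif hit is not None and not fired and t - entered >= dwell_time:
            # Finger has stayed long enough; select once per visit.
            selections.append((t, hit))
            fired = True
    return selections


targets = {"A": (0, 5), "B": (5, 10)}
samples = [(0.0, 1), (0.2, 2), (0.5, 3), (0.6, 7),
           (0.8, 8), (1.2, 9), (1.5, 9)]
print(dwell_selections(samples, targets, dwell_time=0.5))
```

Resetting the timer whenever the target changes is what distinguishes dwell from a tap or crossing trigger: a selection requires sustained presence rather than a discrete event.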

Paperid: 2361, https://arxiv.org/pdf/2503.02150.pdf

Abstract:
Decentralized social media protocols enable users in independent, user-hosted servers (i.e., instances) to interact with each other while they self-govern. This community-based model of social media governance opens up new opportunities for tailored decision-making about information flows -- i.e., what user data is shared to whom and when -- and in turn, for protecting user privacy. To better understand how community governance shapes privacy expectations on decentralized social media, we conducted semi-structured interviews with 23 users of the Fediverse, a decentralized social media network. Our findings illustrate important factors that shape a community's understandings of information flows, such as rules and proactive efforts from admins who are perceived as trustworthy. We also highlight "governance frictions" between communities that raise new privacy risks due to incompatibilities in values, security practices, and software. Our findings highlight the unique challenges of decentralized social media, suggest design opportunities to address frictions, and outline the role of participatory decision-making to realize the full potential of decentralization.

Paperid: 2362, https://arxiv.org/pdf/2503.02135.pdf

Abstract:
The rapid spread of harmful misinformation has led to a dire need for effective media literacy interventions, to which educational games have been suggested as a possible solution. Researchers and educators have created several games that increase media literacy and resilience to misinformation. However, the existing body of misinformation education games rarely focuses on the socio-emotional influences that factor into misinformation belief. Misinformation correction and serious games have both explored narrative as a method to engage with people on an emotional basis. To this end, we investigated how 123 young adults (mean age = 22.98) experienced narrative transportation and identification in two narrative-centered misinformation escape room games developed for library settings. We found that propensity for certain misinformation contexts, such as engagement with fan culture and likelihood to share on social media platforms, significantly affected how participants experienced specific measures of narrative immersion within the games. We discuss design implications for tailoring educational interventions to specific misinformation contexts.

Paperid: 2363, https://arxiv.org/pdf/2503.01925.pdf

Abstract:
In recent years, the application of deep learning in task functional Magnetic Resonance Imaging (tfMRI) decoding has led to significant advancements. However, most studies remain constrained by the assumption of temporal stationarity in neural activity, resulting in predominantly block-wise analysis with limited temporal resolution on the order of tens of seconds. This limitation restricts the ability to decode cognitive functions in detail. To address these limitations, this study proposes a deep neural network designed for volume-wise identification of task states within tfMRI data, thereby overcoming the constraints of conventional methods. Evaluated on Human Connectome Project (HCP) motor and gambling tfMRI datasets, the model achieved impressive mean accuracy rates of 94.0% and 79.6%, respectively. These results demonstrate a substantial enhancement in temporal resolution, enabling more detailed exploration of cognitive processes. The study further employs visualization algorithms to investigate dynamic brain mappings during different tasks, marking a significant step forward in deep learning-based frame-level tfMRI decoding. This approach offers new methodologies and tools for examining dynamic changes in brain activities and understanding the underlying cognitive mechanisms.

Paperid: 2364, https://arxiv.org/pdf/2503.01631.pdf

Abstract:
Problem reframing is a designerly activity wherein alternative perspectives are created to recast what a stated design problem is about. Generating alternative problem frames is challenging because it requires devising novel and useful perspectives that fit the given problem context. Large language models (LLMs) could assist this activity via their generative capability. However, it is not clear whether they can help designers produce high-quality frames. Therefore, we asked if there are benefits to working with LLMs. To this end, we compared three ways of using LLMs (N=280): 1) free-form, 2) direct generation, and 3) a structured approach informed by a theory of reframing. We found that using LLMs does not help improve the quality of problem frames. In fact, it increases the competence gap between experienced and inexperienced designers. Also, inexperienced ones perceived lower agency when working with LLMs. We conclude that there is no benefit to using LLMs in problem reframing and discuss possible factors for this lack of effect.

Paperid: 2365, https://arxiv.org/pdf/2503.01623.pdf

Abstract:
Commercial content moderation APIs are marketed as scalable solutions to combat online hate speech. However, the reliance on these APIs risks both silencing legitimate speech, called over-moderation, and failing to protect online platforms from harmful speech, known as under-moderation. To assess such risks, this paper introduces a framework for auditing black-box NLP systems. Using the framework, we systematically evaluate five widely used commercial content moderation APIs. Analyzing five million queries based on four datasets, we find that APIs frequently rely on group identity terms, such as "black", to predict hate speech. While OpenAI's and Amazon's services perform slightly better, all providers under-moderate implicit hate speech, which uses codified messages, especially against LGBTQIA+ individuals. Simultaneously, they over-moderate counter-speech, reclaimed slurs and content related to Black, LGBTQIA+, Jewish, and Muslim people. We recommend that API providers offer better guidance on API implementation and threshold setting and more transparency on their APIs' limitations. Warning: This paper contains offensive and hateful terms and concepts. We have chosen to reproduce these terms for reasons of transparency.
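To make the two risks concrete, the sketch below computes over- and under-moderation rates for a score-based moderation service at a given decision threshold. It assumes, purely for illustration, that the service returns a score in [0, 1] and that ground-truth labels are available; the function name `moderation_rates` and the sample data are hypothetical and are not taken from the paper's auditing framework.

```python
def moderation_rates(scores, labels, threshold):
    """Return (over_moderation_rate, under_moderation_rate).

    scores:    hypothetical API scores in [0, 1], higher = more hateful
    labels:    ground truth, True if the text is actually hateful
    threshold: content with score >= threshold is flagged
    Over-moderation  = share of legitimate texts that get flagged.
    Under-moderation = share of hateful texts that are not flagged.
    """
    flagged = [s >= threshold for s in scores]
    benign = [f for f, y in zip(flagged, labels) if not y]
    hateful = [f for f, y in zip(flagged, labels) if y]
    over = sum(benign) / len(benign) if benign else 0.0
    under = sum(not f for f in hateful) / len(hateful) if hateful else 0.0
    return over, under


# Made-up scores and labels, for illustration only.
scores = [0.9, 0.2, 0.7, 0.4]
labels = [True, True, False, False]
for threshold in (0.5, 0.8):
    print(threshold, moderation_rates(scores, labels, threshold))
```

Sweeping the threshold shows why threshold guidance matters: raising it lowers over-moderation but can leave under-moderation unchanged or worse.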

Paperid: 2366, https://arxiv.org/pdf/2503.01197.pdf

Abstract:
Sensing touch on arbitrary surfaces has long been a goal of ubiquitous computing, but often requires instrumenting the surface. Depth camera-based systems have emerged as a promising solution for minimizing instrumentation, but at the cost of high touch-down detection error rates, high touch latency, and high minimum hover distance, limiting them to basic tasks. We developed HaloTouch, a vision-based system which exploits a multipath interference effect from an off-the-shelf time-of-flight depth camera to enable fast, accurate touch interactions on general surfaces. HaloTouch achieves a 99.2% touch-down detection accuracy across various materials, with a motion-to-photon latency of 150 ms. With a brief (20s) user-specific calibration, HaloTouch supports millimeter-accurate hover sensing as well as continuous pressure sensing. We conducted a user study with 12 participants, including a typing task demonstrating text input at 26.3 AWPM. HaloTouch shows promise for more robust, dynamic touch interactions without instrumenting surfaces or adding hardware to users.

Paperid: 2367, https://arxiv.org/pdf/2503.00946.pdf

Abstract:
We present a comprehensive, in-depth review of ideation assisted by large language models (LLMs), highlighting emerging trends and identifying unaddressed research gaps. In total, we examined 61 studies investigating the application of LLMs in both group and individual ideation processes. From these studies, we derived the Hourglass Ideation Framework for LLM-assisted ideation, comprising three phases and seven key ideation stages, which served as the basis for our systematic survey. Our analysis reveals that LLMs are most frequently used for idea generation and refinement, but their use in scope specification, foundational material structuring and multi-idea evaluation and selection remains limited. We provide our findings in extensive tabular and online formats. These catalogues detail research on LLM-assisted, purely LLM-based, and human-only activities across the seven ideation stages for each of the 61 studies. These also detail creative domains, publication outlets, interaction designs, user study designs, and assessment methods. Our analysis of system interaction design reveals a predominant focus on supporting individual ideation activities and text-based interaction, with a growing trend of incorporating multimedia elements. However, in group ideation, tools and interaction modalities targeting both synchronous and asynchronous collaboration are much scarcer. We synthesize the primary findings of our review and outline promising directions for future research in LLM-assisted ideation. We hope this review will help researchers quickly gain an overview of this rapidly expanding area, efficiently locate relevant work, and identify underexplored areas for further investigation. In addition, we believe the framework we present here will form the basis for the development of future problem and solution space taxonomies, and methodologies for LLM-assisted ideation development and use.

Paperid: 2368, https://arxiv.org/pdf/2503.00842.pdf

Abstract:
VTubing, the practice of live streaming using virtual avatars, has gained worldwide popularity among streamers seeking to maintain anonymity. While previous research has primarily focused on the social and cultural aspects of VTubing, there is a noticeable lack of studies examining the practical challenges VTubers face in creating and operating their avatars. To address this gap, we surveyed VTubers' equipment and expanded the live-streaming design space by introducing six new dimensions related to avatar creation and control. Additionally, we conducted interviews with 16 professional VTubers to comprehensively explore their practices, strategies, and challenges throughout the VTubing process. Our findings reveal that VTubers face significant burdens compared to real-person streamers due to fragmented tools and the multi-tasking nature of VTubing, leading to unique workarounds. Finally, we summarize these challenges and propose design opportunities to improve the effectiveness and efficiency of VTubing.

Paperid: 2369, https://arxiv.org/pdf/2503.00715.pdf

Abstract:
Knowledge gaps often arise during communication due to diverse backgrounds, knowledge bases, and vocabularies. With recent LLM developments, providing real-time knowledge support is increasingly viable, but is challenging due to shared and individual cognitive limitations (e.g., attention, memory, and comprehension) and the difficulty in understanding the user's context and internal knowledge. To address these challenges, we explore the key question of understanding how people want to receive real-time knowledge support. We built StopGap -- a prototype that provides real-time knowledge support for explaining jargon words in videos -- to conduct a design probe study (N=24) that explored multiple visual knowledge representation formats. Our study revealed individual differences in preferred representations and highlighted the importance of user agency, personalization, and mixed-initiative assistance. Based on our findings, we map out six key design dimensions for real-time LLM knowledge support systems and offer insights for future research in this space.

Paperid: 2370, https://arxiv.org/pdf/2503.00144.pdf

Abstract:
AI-supported tools can help learners overcome challenges in programming education by providing adaptive assistance. However, existing research often focuses on individual tools rather than deriving broader design recommendations. A key challenge in designing these systems is balancing learner control with system-driven guidance. To explore user preferences for AI-supported programming learning tools, we conducted a participatory design study with 15 undergraduate novice programmers and 10 instructors to gather insights on their desired help features and control preferences, as well as a follow-up survey with 172 introductory programming students. Our qualitative findings show that learners prefer help that is encouraging, incorporates visual aids, and includes peer-related insights, whereas instructors prioritize scaffolding that reflects learners' progress and reinforces best practices. Both groups favor shared control, though learners generally prefer more autonomy, while instructors lean toward greater system guidance to prevent cognitive overload. Additionally, our interviews revealed individual differences in control preferences. Based on our findings, we propose design guidelines for AI-supported programming tools, particularly regarding user-centered help features and adaptive control mechanisms. Our work contributes to the human-centered design of AI-supported learning environments by informing the development of systems that effectively balance autonomy and guidance, enhancing AI-supported educational tools for programming and beyond.

Paperid: 2371, https://arxiv.org/pdf/2502.21028.pdf

Abstract:
Large Language Models (LLMs) can engage in human-looking conversational exchanges. Although conversations can elicit trust between users and LLMs, scarce empirical research has examined trust formation in human-LLM contexts, beyond LLMs' trustworthiness or human trust in AI in general. Here, we introduce the Trust-In-LLMs Index (TILLMI) as a new framework to measure individuals' trust in LLMs, extending McAllister's cognitive and affective trust dimensions to LLM-human interactions. We developed TILLMI as a psychometric scale, prototyped with a novel protocol we called LLM-simulated validity. The LLM-based scale was then validated in a sample of 1,000 US respondents. Exploratory Factor Analysis identified a two-factor structure. Two items were then removed due to redundancy, yielding a final 6-item scale with a 2-factor structure. Confirmatory Factor Analysis on a separate subsample showed strong model fit ($CFI = .995$, $TLI = .991$, $RMSEA = .046$, $p_{\chi^2} > .05$). Convergent validity analysis revealed that trust in LLMs correlated positively with openness to experience, extraversion, and cognitive flexibility, but negatively with neuroticism. Based on these findings, we interpreted TILLMI's factors as "closeness with LLMs" (affective dimension) and "reliance on LLMs" (cognitive dimension). Younger males exhibited higher closeness with and reliance on LLMs compared to older women. Individuals with no direct experience with LLMs exhibited lower levels of trust compared to LLM users. These findings offer a novel empirical foundation for measuring trust in AI-driven verbal communication, informing responsible design, and fostering balanced human-AI collaboration.

Paperid: 2372, https://arxiv.org/pdf/2502.21014.pdf

Abstract:
Verification of biomedical claims is critical for healthcare decision-making, public health policy and scientific research. We present an interactive biomedical claim verification system that integrates LLMs, transparent model explanations, and user-guided justification. In the system, users first retrieve relevant scientific studies from a persistent medical literature corpus and explore how different LLMs perform natural language inference (NLI) within a task-adaptive reasoning framework to classify each study as "Support," "Contradict," or "Not Enough Information" regarding the claim. Users can examine the model's reasoning process with additional insights provided by SHAP values that highlight word-level contributions to the final result. This combination enables a more transparent and interpretable evaluation of the model's decision-making process. A summary stage allows users to consolidate the results by selecting a result with narrative justification generated by LLMs. As a result, a consensus-based final decision is summarized for each retrieved study, aiming for safe and accountable AI-assisted decision-making in biomedical contexts. We aim to integrate this explainable verification system as a component within a broader evidence synthesis framework to support human-AI collaboration.

Paperid: 2373, https://arxiv.org/pdf/2502.20701.pdf

Abstract:
In human-AI interactions, explanation is widely seen as necessary for enabling trust in AI systems. We argue that trust, however, may be a pre-requisite because explanation is sometimes impossible. We derive this result from a formalization of explanation as a search process through knowledge networks, where explainers must find paths between shared concepts and the concept to be explained, within finite time. Our model reveals that explanation can fail even under theoretically ideal conditions - when actors are rational, honest, motivated, can communicate perfectly, and possess overlapping knowledge. This is because successful explanation requires not just the existence of shared knowledge but also finding the connection path within time constraints, and it can therefore be rational to cease attempts at explanation before the shared knowledge is discovered. This result has important implications for human-AI interaction: as AI systems, particularly Large Language Models, become more sophisticated and able to generate superficially compelling but spurious explanations, humans may default to trust rather than demand genuine explanations. This creates risks of both misplaced trust and imperfect knowledge integration.
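The idea of explanation as a time-limited search can be illustrated with a small sketch. The code below is not the paper's formal model; it is a minimal illustration, assuming the knowledge network is a directed graph stored as a dictionary and that "finite time" is modeled as a cap on the number of concepts the explainer may expand. The names `find_explanation`, `graph`, `shared`, and `budget` are hypothetical.

```python
from collections import deque


def find_explanation(graph, shared, target, budget):
    """Breadth-first search from the shared concepts toward the target.

    `budget` caps how many concepts may be expanded, standing in for the
    finite time available to the explainer (an assumption of this sketch).
    Returns the connecting path as a list, or None if the budget runs out
    or no path exists.
    """
    parents = {concept: None for concept in shared}
    queue = deque(shared)
    expanded = 0
    while queue and expanded < budget:
        node = queue.popleft()
        expanded += 1
        if node == target:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# A path exists, but only a large enough budget finds it.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
found = find_explanation(toy_graph, ["a"], "d", budget=10)
gave_up = find_explanation(toy_graph, ["a"], "d", budget=3)
```

In this toy case the connecting path exists in both calls, yet the second call stops before reaching it, which mirrors the abstract's point that shared knowledge alone does not guarantee a successful explanation.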

Paperid: 2374, https://arxiv.org/pdf/2502.20463.pdf

Abstract:
Engaging in political discussions is crucial in democratic societies, yet many individuals remain politically disinclined due to various factors such as perceived knowledge gaps, conflict avoidance, or a sense of disconnection from the political system. In this paper, we explore the potential of personal narratives (short, first-person accounts emphasizing personal experiences) as a means to empower these individuals to participate in online political discussions. Using a text classifier that identifies personal narratives, we conducted a large-scale computational analysis to evaluate the relationship between the use of personal narratives and participation in political discussions on Reddit. We find that politically disinclined individuals (PDIs) are more likely to use personal narratives than more politically active users. Personal narratives are more likely to attract and retain politically disinclined individuals in political discussions than other comments. Importantly, personal narratives posted by politically disinclined individuals are received more positively than their other comments in political communities. These results emphasize the value of personal narratives in promoting inclusive political discourse.

Paperid: 2375, https://arxiv.org/pdf/2502.19706.pdf

Abstract:
Autonomous interaction is crucial for the effective use of elderly care robots. However, developing universal AI architectures is extremely challenging due to the diversity in robot configurations and a lack of datasets. We proposed a universal architecture for the AI-ization of elderly care robots, called AoECR. Specifically, based on a nursing bed, we developed a patient-nurse interaction dataset tailored for elderly care scenarios and fine-tuned a large language model to enable it to perform nursing manipulations. Additionally, the inference process included a self-check chain to ensure the security of control commands. An expert optimization process further enhanced the humanization and personalization of the interactive responses. The physical experiment demonstrated that the AoECR exhibited zero-shot generalization capabilities across diverse scenarios, understood patients' instructions, implemented secure control commands, and delivered humanized and personalized interactive responses. In general, our research provides a valuable dataset reference and AI-ization solutions for elderly care robots.

Paperid: 2376, https://arxiv.org/pdf/2502.19500.pdf

Abstract:
The language generation and reasoning capabilities of large language models (LLMs) have enabled conversational systems with impressive performance in a variety of tasks, from code generation, to composing essays, to passing STEM and legal exams, to a new paradigm for knowledge search. Besides those short-term use applications, LLMs are increasingly used to help with real-life goals or tasks that take a long time to complete, involving multiple sessions across days, weeks, months, or even years. Thus to enable conversational systems for long term interactions and tasks, we need language-based agents that can plan for long horizons. Traditionally, such capabilities were addressed by reinforcement learning agents with hierarchical planning capabilities. In this work, we explore a novel architecture where the LLM acts as the meta-controller deciding the agent's next macro-action, and tool use augmented LLM-based option policies execute the selected macro-action. We instantiate this framework for a specific set of macro-actions enabling adaptive planning for users' personal plans through conversation and follow-up questions collecting user feedback. We show how this paradigm can be applicable in scenarios ranging from tutoring for academic and non-academic tasks to conversational coaching for personal health plans.

Paperid: 2377, https://arxiv.org/pdf/2502.18853.pdf

Abstract:
Image-generative AI provides new opportunities to transform personal data into alternative visual forms. In this paper, we illustrate the potential of AI-generated images in facilitating meaningful engagement with personal data. In a formative autobiographical design study, we explored the design and use of AI-generated images derived from personal data. Informed by this study, we designed a web-based application as a probe that represents personal data through generative images utilizing OpenAI's GPT-4 model and DALL-E 3. We then conducted a 21-day diary study and interviews using the probe with 16 participants to investigate users' in-depth experiences with images generated by AI in everyday lives. Our findings reveal new qualities of experiences in users' engagement with data, highlighting how participants constructed personal meaning from their data through imagination and speculation on AI-generated images. We conclude by discussing the potential and concerns of leveraging image-generative AI for personal data meaning-making.

Paperid: 2378, https://arxiv.org/pdf/2502.18676.pdf

Abstract:
We envision the concept of Thoughtful AI, a new human-AI interaction paradigm in which the AI behaves as a continuously thinking entity. Unlike conventional AI systems that operate on a turn-based, input-output model, Thoughtful AI autonomously generates, develops, and communicates its evolving thought process throughout an interaction. In this position paper, we argue that this thoughtfulness unlocks new possibilities for human-AI interaction by enabling proactive AI behavior, facilitating continuous cognitive alignment with users, and fostering more dynamic interaction experiences. We outline the conceptual foundations of Thoughtful AI, illustrate its potential through example projects, and envision how this paradigm can transform human-AI interaction in the future.

Paperid: 2379, https://arxiv.org/pdf/2502.18641.pdf

Abstract:
Generative AI significantly enhances player agency in interactive narratives (IN) by enabling just-in-time content generation that adapts to player actions. While delegating generation to AI makes IN more interactive, it becomes challenging for authors to control the space of possible narratives - within which the final story experienced by the player emerges from their interaction with AI. In this paper, we present WhatELSE, an AI-bridged IN authoring system that creates narrative possibility spaces from example stories. WhatELSE provides three views (narrative pivot, outline, and variants) to help authors understand the narrative space and corresponding tools leveraging linguistic abstraction to control the boundaries of the narrative space. Taking innovative LLM-based narrative planning approaches, WhatELSE further unfolds the narrative space into executable game events. Through a user study (N=12) and technical evaluations, we found that WhatELSE enables authors to perceive and edit the narrative space and generates engaging interactive narratives at play-time.

Paperid: 2380, https://arxiv.org/pdf/2502.18594.pdf

Abstract:
Brain-computer interfaces (BCIs) provide alternative communication methods for individuals with motor disabilities by allowing control and interaction with external devices. Non-invasive BCIs, especially those using electroencephalography (EEG), are practical and safe for various applications. However, their performance is often hindered by EEG non-stationarities, caused by changing mental states or device characteristics like electrode impedance. This challenge has spurred research into adaptive BCIs that can handle such variations. In recent years, interest has grown in using error-related potentials (ErrPs) to enhance BCI performance. ErrPs, neural responses to errors, can be detected non-invasively and have been integrated into different BCI paradigms to improve performance through error correction or adaptation. This research introduces a novel adaptive ErrP-based BCI approach using reinforcement learning (RL). We demonstrate the feasibility of an RL-driven adaptive framework incorporating ErrPs and motor imagery. Utilizing two RL agents, the framework adapts dynamically to EEG non-stationarities. Validation was conducted using a publicly available motor imagery dataset and a fast-paced game designed to boost user engagement. Results show the framework's promise, with RL agents learning control policies from user interactions and achieving robust performance across datasets. However, a critical insight from the game-based protocol revealed that motor imagery in a high-speed interaction paradigm was largely ineffective for participants, highlighting task design limitations in real-time BCI applications. These findings underscore the potential of RL for adaptive BCIs while pointing out practical constraints related to task complexity and user responsiveness.

Paperid: 2381, https://arxiv.org/pdf/2502.18201.pdf

Abstract:
The growing prevalence of Large Language Models (LLMs) is reshaping online text-based communication, a transformation that is extensively studied as AI-mediated communication. However, much of the existing research remains bound by traditional communication models, where messages are created and transmitted directly between humans, despite LLMs being able to play a more active role in transforming messages. In this work, we propose the Intersubjective Model of AI-mediated Communication, an alternative communication model that leverages LLM-based adaptive agents to augment human-human communication. Unlike traditional communication models that focus on the accurate transmission of information, the Intersubjective Model allows for communication to be designed in an adaptive and customizable way to create alternative interactions by dynamically shaping messages in real time and facilitating shared understanding between the human participants. In this paper, we have developed a prototype text chat system based on the Intersubjective Model to describe the potential of this model, as well as the design space it affords.

Paperid: 2382, https://arxiv.org/pdf/2502.17868.pdf

Abstract:
Swarm User Interfaces allow dynamic arrangement of user environments through the use of multiple mobile robots, but their operational range is typically confined to a single plane due to constraints imposed by their two-wheel propulsion systems. We present corobos, a proof-of-concept design that enables these robots to cooperatively transition between table (horizontal) and wall (vertical) surfaces seamlessly, without human intervention. Each robot is equipped with a uniquely designed slope structure that facilitates smooth rotation when another robot pushes it toward a target surface. Notably, this design relies solely on passive mechanical elements, eliminating the need for additional active electrical components. We investigated the design parameters of this structure and evaluated its transition success rate through experiments. Furthermore, we demonstrate various application examples to showcase the potential of corobos in enhancing user environments.

Paperid: 2383, https://arxiv.org/pdf/2502.17309.pdf

Abstract:
Accurate environmental perception is critical for advanced driver assistance systems (ADAS). Light detection and ranging (LiDAR) systems play a crucial role in ADAS; they can reliably detect obstacles and help ensure traffic safety. Existing research on LiDAR sensing has demonstrated that adapting the LiDAR's resolution and range based on environmental characteristics can improve machine perception. However, current adaptive LiDAR approaches for ADAS have not explored the possibility of combining the perception abilities of the vehicle and the human driver, which can potentially further enhance the detection performance. In this paper, we propose a novel system that adapts LiDAR characteristics to the human driver's visual perception to enhance LiDAR sensing outside the driver's field of view. We develop a proof-of-concept prototype of the system in the virtual environment CARLA. Our system integrates real-time data on the driver's gaze to identify regions in the environment that the driver is monitoring. This allows the system to optimize LiDAR resources by dynamically increasing the LiDAR's range and resolution in peripheral areas that the driver may not be attending to. Our simulations show that this gaze-aware LiDAR enhances detection performance compared to a baseline standalone LiDAR, particularly in challenging environmental conditions like fog. Our hybrid human-machine sensing approach potentially offers improved safety and situational awareness in real-time driving scenarios for ADAS applications.
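The resource-allocation idea in this abstract can be sketched in a few lines. This is not the authors' system and does not use the CARLA API; it is a minimal illustration, assuming the horizontal field is split into equal angular sectors, the driver's gaze is reported as one sector index, and a fixed point budget is divided so that sectors outside the gaze window receive a higher share. The function name, the `boost` factor, and the sector layout are all hypothetical.

```python
def allocate_lidar_budget(num_sectors, gaze_sector, gaze_halfwidth,
                          total_points, boost=3.0):
    """Split a fixed LiDAR point budget across angular sectors.

    Sectors within `gaze_halfwidth` of the gaze sector (measured around
    the circle) get weight 1.0; all others get weight `boost`, so regions
    the driver is not watching are sampled more densely. The weighting
    scheme is an assumption of this sketch.
    """
    weights = []
    for sector in range(num_sectors):
        distance = abs(sector - gaze_sector)
        distance = min(distance, num_sectors - distance)  # wrap around
        weights.append(1.0 if distance <= gaze_halfwidth else boost)
    total_weight = sum(weights)
    return [total_points * w / total_weight for w in weights]


# Eight sectors, driver looking at sector 0 with one sector on each side.
allocation = allocate_lidar_budget(8, gaze_sector=0, gaze_halfwidth=1,
                                   total_points=1000)
```

Here sectors 7, 0, and 1 count as attended, so the five remaining sectors each receive three times as many points while the total stays at the fixed budget.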

Paperid: 2384, https://arxiv.org/pdf/2502.16411.pdf

Abstract:
Professionals increasingly use Artificial Intelligence (AI) to enhance their capabilities and assist with task execution. While prior research has examined these uses separately, their potential interaction remains underexplored. We propose that AI-driven training ("tutor") and AI-assisted task completion ("tool") can have a joint effect on human capability and test this hypothesis in the context of lung cancer diagnosis. In a field experiment with 336 medical students, we manipulated AI deployment in training, in practice, and in both. Our findings reveal that while AI-integrated training and AI assistance independently improved diagnostic performance, their combination yielded the highest accuracy. These results underscore AI's dual role in enhancing human performance through both learning and real-time support, offering insights into AI deployment in professional settings where human expertise remains essential.

Paperid: 2385, https://arxiv.org/pdf/2502.16395.pdf

Abstract:
Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. Yet, data science tasks often admit multiple statistically valid solutions, e.g. different modeling strategies, making it critical to understand the reasoning behind analyses, not just their outcomes. While manual review of LLM-generated code can help ensure statistical soundness, it is labor-intensive and requires expertise. A more scalable approach is to evaluate the underlying workflows - the logical plans guiding code generation. However, it remains unclear how to assess whether an LLM-generated workflow supports reproducible implementations. To address this, we present AIRepr, an Analyst-Inspector framework for automatically evaluating and improving the Reproducibility of LLM-generated data analysis workflows. Our framework is grounded in statistical principles and supports scalable, automated assessment. We introduce two novel reproducibility-enhancing prompting strategies and benchmark them against standard prompting across 15 analyst-inspector LLM pairs and 1,032 tasks from three public benchmarks. Our findings show that workflows with higher reproducibility also yield more accurate analyses, and that reproducibility-enhancing prompts substantially improve both metrics. This work provides a foundation for more transparent, reliable, and efficient human-AI collaboration in data science. Our code is publicly available.

Paperid: 2386, https://arxiv.org/pdf/2502.15395.pdf

Abstract:
Large language models (LLMs) are increasingly used for both everyday and specialized tasks. While HCI research focuses on domain-specific applications, little is known about how heavy users integrate LLMs into everyday decision-making. Through qualitative interviews with heavy LLM users (n=7) who employ these systems for both intuitive and analytical thinking tasks, our findings show that participants use LLMs for social validation, self-regulation, and interpersonal guidance, seeking to build self-confidence and optimize cognitive resources. These users viewed LLMs either as rational, consistent entities or average human decision-makers. Our findings suggest that heavy LLM users develop nuanced interaction patterns beyond simple delegation, highlighting the need to reconsider how we study LLM integration in decision-making processes.

Paperid: 2387, https://arxiv.org/pdf/2502.15229.pdf

Abstract:
The advancement of large language models (LLMs) now allows users to actively interact with conversational recommendation systems (CRS) and build their own personalized recommendation services tailored to their unique needs and goals. This experience offers users a significantly higher level of controllability compared to traditional RS, enabling an entirely new dimension of recommendation experiences. Building on this context, this study explored the unique experiences that LLM-powered CRS can provide compared to traditional RS. Through a three-week diary study with 12 participants using custom GPTs for music recommendations, we found that LLM-powered CRS can (1) help users clarify implicit needs, (2) support unique exploration, and (3) facilitate a deeper understanding of musical preferences. Based on these findings, we discuss the new design space enabled by LLM-powered CRS and highlight its potential to support more personalized, user-driven recommendation experiences.

Paperid: 2388, https://arxiv.org/pdf/2502.14675.pdf

Abstract:
Machine learning practitioners often need to compare multiple models to select the best one for their application. However, current methods of comparing models fall short because they rely on aggregate metrics that can be difficult to interpret or do not provide enough information to understand the differences between models. To better support the comparison of models, we propose set visualizations of model outputs to enable easier model-to-model comparison. We outline the requirements for using sets to compare machine learning models and demonstrate how this approach can be applied to various machine learning tasks. We also introduce SetMLVis, an interactive system that utilizes set visualizations to compare object detection models. Our evaluation shows that SetMLVis outperforms traditional visualization techniques in terms of task completion and reduces cognitive workload for users. Supplemental materials can be found at https://osf.io/afksu/?view_only=bb7f259426ad425f81d0518a38c597be.
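To make the set-based view concrete, the sketch below partitions test instances by which of two models handled them correctly. It is not SetMLVis and it uses plain classification labels rather than object detections; the function name and the toy labels are hypothetical and serve only to show the kind of sets such a visualization could be built on.

```python
def compare_model_sets(labels, preds_a, preds_b):
    """Partition instance indices by which model predicted them correctly.

    Returns four disjoint sets: instances both models got right, those
    only model A or only model B got right, and those neither got right.
    """
    correct_a = {i for i, (y, p) in enumerate(zip(labels, preds_a)) if y == p}
    correct_b = {i for i, (y, p) in enumerate(zip(labels, preds_b)) if y == p}
    all_ids = set(range(len(labels)))
    return {
        "both": correct_a & correct_b,
        "only_a": correct_a - correct_b,
        "only_b": correct_b - correct_a,
        "neither": all_ids - (correct_a | correct_b),
    }


# Both toy models reach 50% accuracy, yet they succeed on different items.
regions = compare_model_sets(
    labels=[0, 1, 1, 0],
    preds_a=[0, 1, 0, 1],
    preds_b=[0, 0, 1, 1],
)
```

The two toy models have identical aggregate accuracy, but the `only_a` and `only_b` sets show that they disagree on individual instances, which is the kind of difference an aggregate metric hides.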

Paperid: 2389, https://arxiv.org/pdf/2502.11430.pdf

Abstract:
Online dating is frequently used by individuals looking for potential relationships and intimate connections. Central to dating apps is the creation and refinement of a dating profile, which represents the way individuals desire to present themselves to potential mates, while hiding information they do not care to share. To investigate the way frequent users of dating apps construct their online profiles and perceive the effectiveness of strategies taken in making profiles, we conducted semi-structured interviews with 20 experienced users who are Chinese college-educated young adults and uncovered the processes and rationales by which they make profiles for online dating, particularly in selecting images for inclusion. We found that participants used idealized photos that exaggerated their positive personality traits, sometimes traits that they do not possess but perceive others to desire, and sometimes even traits they wish they had possessed. Users also strategically used photos that show personality and habits without showing themselves, and often hid certain identifying information to reduce privacy risks. This analysis signals potential factors that are key in building online dating profiles, providing design implications for systems that limit the use of inaccurate information while still promoting self-expression in relationship platforms.

Paperid: 2390, https://arxiv.org/pdf/2502.11337.pdf

Abstract:
Machine learning applications in high-stakes scenarios should always operate under human oversight. Developing an optimal combination of human and machine intelligence requires an understanding of their complementarities, particularly regarding the similarities and differences in the way they make mistakes. We perform extensive experiments in the area of face recognition and compare two automated face recognition systems against human annotators through a demographically balanced user study. Our research uncovers important ways in which machine learning errors and human errors differ from each other, and suggests potential strategies in which human-machine collaboration can improve accuracy in face recognition.

Paperid: 2391, https://arxiv.org/pdf/2502.10618.pdf

Abstract:
Pedagogical approaches focusing on stereotypical code solutions, known as programming plans, can increase problem-solving ability and motivate diverse learners. However, plan-focused pedagogies are rarely used beyond introductory programming. Our formative study (N=10 educators) showed that identifying plans is a tedious process. To advance plan-focused pedagogies in application-focused domains, we created an LLM-powered pipeline that automates the effortful parts of educators' plan identification process by providing use-case-driven program examples and candidate plans. In design workshops (N=7 educators), we identified design goals to maximize instructors' efficiency in plan identification by optimizing interaction with this LLM-generated content. Our resulting tool, PLAID, enables instructors to access a corpus of relevant programs to inspire plan identification, compare code snippets to assist plan refinement, and structure code snippets into plans. We evaluated PLAID in a within-subjects user study (N=12 educators) and found that PLAID led to lower cognitive demand and increased productivity compared to the state-of-the-art. Educators found PLAID beneficial for generating instructional material. Thus, our findings suggest that human-in-the-loop approaches hold promise for supporting plan-focused pedagogies at scale.

Paperid: 2392, https://arxiv.org/pdf/2502.10258.pdf

Abstract:
We present PromptArtisan, a groundbreaking approach to multi-instruction image editing that achieves remarkable results in a single pass, eliminating the need for time-consuming iterative refinement. Our method empowers users to provide multiple editing instructions, each associated with a specific mask within the image. This flexibility allows for complex edits involving mask intersections or overlaps, enabling the realization of intricate and nuanced image transformations. PromptArtisan leverages a pre-trained InstructPix2Pix model in conjunction with a novel Complete Attention Control Mechanism (CACM). This mechanism ensures precise adherence to user instructions, granting fine-grained control over the editing process. Furthermore, our approach is zero-shot, requiring no additional training, and boasts improved processing complexity compared to traditional iterative methods. By seamlessly integrating multi-instruction capabilities, single-pass efficiency, and complete attention control, PromptArtisan unlocks new possibilities for creative and efficient image editing workflows, catering to both novice and expert users alike.

Paperid: 2393, https://arxiv.org/pdf/2502.09899.pdf

Abstract:
The replication of object stiffness is essential for enhancing haptic feedback in virtual environments. However, existing research has overlooked how stylus stiffness influences the perception of virtual object stiffness during tool-mediated interactions. To address this, we conducted a psychophysical experiment demonstrating that changing stylus stiffness combined with visual stimuli altered users' perception of virtual object stiffness. Based on these insights, we developed Transtiff, a stylus-shaped interface capable of on-demand stiffness control using a McKibben artificial muscle mechanism. Unlike previous approaches, our method manipulates the perceived stiffness of virtual objects via the stylus by controlling the stiffness of the stylus without altering the properties of the real object being touched, creating the illusion of a hard object feeling soft. Our user study confirmed that Transtiff effectively simulates a range of material properties, such as sponge, plastic, and tennis balls, providing haptic rendering that is closely aligned with the perceived material characteristics. By addressing the challenge of delivering realistic haptic feedback through tool-based interactions, Transtiff represents a significant advancement in haptic interface design for VR applications.

Paperid: 2394, https://arxiv.org/pdf/2502.09869.pdf

Abstract:
As personalized recommendation algorithms become integral to social media platforms, users are increasingly aware of their ability to influence recommendation content. However, limited research has explored how users provide feedback through their behaviors and platform mechanisms to shape the recommendation content. We conducted semi-structured interviews with 34 active users of algorithm-driven social media platforms (e.g., Xiaohongshu, Douyin). In addition to explicit and implicit feedback, this study introduced intentional implicit feedback, highlighting the actions users intentionally took to refine recommendation content through perceived feedback mechanisms. Additionally, choices of feedback behaviors were found to align with specific purposes. Explicit feedback was primarily used for feed customization, while unintentional implicit feedback was more linked to content consumption. Intentional implicit feedback was employed for multiple purposes, particularly in increasing content diversity and improving recommendation relevance. This work underscores the user intention dimension in the explicit-implicit feedback dichotomy and offers insights for designing personalized recommendation feedback that better responds to users' needs.

Paperid: 2395, https://arxiv.org/pdf/2502.09532.pdf

Abstract:
Recent advances in generative AI have precipitated a proliferation of novel writing assistants. These systems typically rely on multilingual large language models (LLMs), providing globalized workers the ability to revise or create diverse forms of content in different languages. However, there is substantial evidence indicating that the performance of multilingual LLMs varies between languages. Users who employ writing assistance for multiple languages are therefore susceptible to disparate output quality. Importantly, recent research has shown that people tend to generalize algorithmic errors across independent tasks, violating the behavioral axiom of choice independence. In this paper, we analyze whether user utilization of novel writing assistants in a charity advertisement writing task is affected by the AI's performance in a second language. Furthermore, we quantify the extent to which these patterns translate into the persuasiveness of generated charity advertisements, as well as the role of people's beliefs about LLM utilization in their donation choices. Our results provide evidence that writers who engage with an LLM-based writing assistant violate choice independence, as prior exposure to a Spanish LLM reduces subsequent utilization of an English LLM. While these patterns do not affect the aggregate persuasiveness of the generated advertisements, people's beliefs about the source of an advertisement (human versus AI) do. In particular, Spanish-speaking female participants who believed that they read an AI-generated advertisement strongly adjusted their donation behavior downwards. Furthermore, people are generally not able to adequately differentiate between human-generated and LLM-generated ads. Our work has important implications for the design, development, integration, and adoption of multilingual LLMs as assistive agents, particularly in writing tasks.

Paperid: 2396, https://arxiv.org/pdf/2502.08796.pdf

Abstract:
In recent years, evaluating the Theory of Mind (ToM) capabilities of large language models (LLMs) has received significant attention within the research community. As the field rapidly evolves, navigating the diverse approaches and methodologies has become increasingly complex. This systematic review synthesizes current efforts to assess LLMs' ability to perform ToM tasks, an essential aspect of human cognition involving the attribution of mental states to oneself and others. Despite notable advancements, the proficiency of LLMs in ToM remains a contentious issue. By categorizing benchmarks and tasks through a taxonomy rooted in cognitive science, this review critically examines evaluation techniques, prompting strategies, and the inherent limitations of LLMs in replicating human-like mental state reasoning. A recurring theme in the literature reveals that while LLMs demonstrate emerging competence in ToM tasks, significant gaps persist in their emulation of human cognitive abilities.

Paperid: 2397, https://arxiv.org/pdf/2502.07598.pdf

Abstract:
With the widespread adoption of Extended Reality (XR) headsets, spatial computing technologies are gaining increasing attention. Spatial computing enables interaction with virtual elements through natural input methods such as eye tracking, hand gestures, and voice commands, thus placing natural human-computer interaction at its core. While previous surveys have reviewed conventional XR interaction techniques, recent advancements in natural interaction, particularly driven by artificial intelligence (AI) and large language models (LLMs), have introduced new paradigms and technologies. In this paper, we review research on multimodal natural interaction for wearable XR, focusing on papers published between 2022 and 2024 in six top venues: ACM CHI, UIST, IMWUT (Ubicomp), IEEE VR, ISMAR, and TVCG. We classify and analyze these studies based on application scenarios, operation types, and interaction modalities. This analysis provides a structured framework for understanding how researchers are designing advanced natural interaction techniques in XR. Based on these findings, we discuss the challenges in natural interaction techniques and suggest potential directions for future research. This review provides valuable insights for researchers aiming to design natural and efficient interaction systems for XR, ultimately contributing to the advancement of spatial computing.

Paperid: 2398, https://arxiv.org/pdf/2502.07440.pdf

Abstract:
Prisoners of Nazi concentration camps created paintings as a means to express their daily life experiences and feelings. Several thousand such paintings exist, but a quantitative analysis of them has not been carried out. We created an extensive dataset of 1,939 Holocaust prisoner artworks, and we employed an object detection framework that found 19,377 objects within these artworks. To support the quantitative and qualitative analysis of the art collection and its objects, we have developed an intuitive and interactive dashboard to promote a deeper engagement with these visual testimonies. The dashboard features various visual interfaces, e.g., a word cloud showing the detected objects and a map of artwork origins, and options for filtering. We presented the interface to domain experts, whose feedback highlights the dashboard's intuitiveness and potential for both quantitative and qualitative analysis while also providing relevant suggestions for improvement. Our project demonstrates the benefit of digital methods such as machine learning and visual analytics for Holocaust remembrance and educational purposes.

Paperid: 2399, https://arxiv.org/pdf/2502.07125.pdf

Abstract:
Metaphors have been used during therapy sessions to facilitate the communication of inner feelings between clients and therapists. Can we create a digital metaphorical chatting space for daily use within close relationships? As the first step towards this vision, this work follows the autobiographical design approach to prototype MetaphorChat, which comprises two metaphorical chatting scenes tailored to meet researchers' genuine needs for discussing specific life topics in close relationships. Along with typing-based chatting, each scene offers a metaphorical narrative experience, composed of graphics and sound, with interactive mechanisms that deliver metaphorical meanings. This pictorial details the process of mapping abstract feelings into metaphor concepts, then how these concepts are translated into various interaction design elements, and the reflections from self-usage. We discuss the vision for such a metaphorical chatting space, uniquely positioned between messaging apps and video games, for the future design of empathetic communication applications.

Paperid: 2400, https://arxiv.org/pdf/2502.05797.pdf

Abstract:
The rapid evolution of wearable technology marks a transformative phase in human-computer interaction, seamlessly integrating digital functionality into daily life. This paper explores the historical trajectory, current advancements, and future potential of wearables, emphasizing their impact on healthcare, productivity, and personal well-being. Key developments include the integration of artificial intelligence (AI), Internet of Things (IoT), and augmented reality (AR), driving personalization, real-time adaptability, and enhanced user experiences. The study highlights user-centered design principles, ethical considerations, and interdisciplinary collaboration as critical factors in creating wearables that are intuitive, inclusive, and secure. Furthermore, the paper examines sustainability trends, such as modular designs and eco-friendly materials, aligning innovation with environmental responsibility. By addressing challenges like data privacy, algorithmic bias, and usability, wearable technology is poised to redefine the interaction between humans and technology, offering unprecedented opportunities for enrichment and empowerment in diverse contexts. This comprehensive analysis provides a roadmap for advancing wearables to meet emerging societal needs while fostering ethical and sustainable growth.

Paperid: 2401, https://arxiv.org/pdf/2502.05323.pdf

Abstract:
Adolescent girls and young women (AGYW) in sub-Saharan Africa face unique barriers to contraceptive access and lack AGYW-centered contraceptive decision-support resources. To empower AGYW to make informed choices and improve reproductive health outcomes, we developed a tablet-based application to provide contraceptive education and decision-making support in the pharmacy setting - a key source of contraceptive services for AGYW - in Kenya. We conducted workshops with AGYW and pharmacy providers in Kenya to gather app feedback and understand how to integrate the intervention into the pharmacy setting. Our analysis highlights how intermediated interactions - a multiuser, cooperative effort to enable technology use and information access - could inform a successful contraceptive intervention in Kenya. The potential strengths of intermediation in our setting inform implications for technological health interventions in intermediated scenarios in low- and middle-income countries, including challenges and opportunities for extending impact to different populations and integrating technology into resource-constrained healthcare settings.

Paperid: 2402, https://arxiv.org/pdf/2502.04706.pdf

Abstract:
This paper focuses on simulating text dialogues in which impressions between speakers improve during speed dating. This simulation involves selecting an utterance from multiple candidates generated by a text generation model that replicates a specific speaker's utterances, aiming to improve the impression of the speaker. Accurately selecting an utterance that improves the impression is crucial for the simulation. We believe that whether an utterance improves a dialogue partner's impression of the speaker may depend on the personalities of both parties. However, recent methods for utterance selection do not consider the impression per utterance or the personalities. To address this, we propose a method that predicts whether an utterance improves a partner's impression of the speaker, considering the personalities. The evaluation results showed that personalities are useful in predicting impression changes per utterance. Furthermore, we conducted a human evaluation of simulated dialogues using our method. The results showed that it could simulate dialogues more favorably received than those selected without considering personalities.

Paperid: 2403, https://arxiv.org/pdf/2502.03784.pdf

Abstract:
Data-rich documents are ubiquitous in various applications, yet they often rely solely on textual descriptions to convey data insights. Prior research primarily focused on providing visualization-centric augmentation to data-rich documents. However, few have explored using automatically generated word-scale visualizations to enhance the document-centric reading process. As an exploratory step, we propose GistVis, an automatic pipeline that extracts and visualizes data insight from text descriptions. GistVis decomposes the generation process into four modules: Discoverer, Annotator, Extractor, and Visualizer, with the first three modules utilizing the capabilities of large language models and the fourth using visualization design knowledge. Technical evaluation including a comparative study on Discoverer and an ablation study on Annotator reveals decent performance of GistVis. Meanwhile, the user study (N=12) showed that GistVis could generate satisfactory word-scale visualizations, indicating its effectiveness in facilitating users' understanding of data-rich documents (+5.6% accuracy) while significantly reducing their mental demand (p=0.016) and perceived effort (p=0.033).

Paperid: 2404, https://arxiv.org/pdf/2502.03635.pdf

Abstract:
In our effort to implement an interactive customer segmentation tool for a global manufacturing company, we identified user experience (UX) challenges with technical implications. The main challenge relates to the effort required of domain users, in our case sales experts, to interpret the clusters produced by an unsupervised Machine Learning (ML) algorithm when creating a customer segmentation. An additional challenge is what sort of interactions such a tool should support to enable meaningful interpretations of the output of clustering models. In this case study, we describe what we learned from implementing an Interactive Machine Learning (IML) prototype to address such UX challenges. We leverage a multi-year real-world dataset and domain experts' feedback from a global manufacturing company to evaluate our tool. We report what we found to be effective and wish to inform designers of IML systems in the context of customer segmentation and other related unsupervised ML tools.

Paperid: 2405, https://arxiv.org/pdf/2502.03580.pdf

Abstract:
As demand for immersive experiences grows, displays are moving closer to the eye with smaller sizes and higher resolutions. However, shrinking pixel emitters reduce intensity, making them harder to perceive. Electronic Papers utilize ambient light for visibility, maintaining optical contrast regardless of pixel size, but cannot achieve high resolution. We show electrically tunable meta-pixels down to ~560 nm in size (>45,000 PPI) consisting of WO3 nanodiscs, allowing one-to-one pixel-photodetector mapping on the retina when the display size matches the pupil diameter, which we call Retina Electronic Paper. Our technology also supports video display (25 Hz), high reflectance (~80%), and optical contrast (~50%), which will help create the ultimate virtual reality display.
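As a quick sanity check on the figures quoted above, the following minimal sketch converts a pixel pitch into pixels per inch. It assumes the pitch equals the reported ~560 nm meta-pixel size with no gap between pixels; that assumption, and the helper name `pixels_per_inch`, are illustrative and not taken from the paper.

```python
# Assumption: pixel pitch equals the reported ~560 nm meta-pixel size
# (no inter-pixel gap). This is an illustrative check, not the authors' method.
NM_PER_INCH = 25.4e6  # 1 inch = 25.4 mm = 25.4e6 nm


def pixels_per_inch(pitch_nm: float) -> float:
    """Return the linear pixel density for a given pixel pitch in nanometres."""
    return NM_PER_INCH / pitch_nm


ppi = pixels_per_inch(560.0)
print(f"{ppi:.0f} PPI")  # roughly 45,357 PPI
```

Under that assumption the result is consistent with the ">45,000 PPI" stated in the abstract.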

Paperid: 2406, https://arxiv.org/pdf/2502.02201.pdf

Abstract:
In our daily lives, we can naturally convey instructions for the spatial manipulation of objects using words and gestures. Transposing this form of interaction into virtual reality (VR) object manipulation can be beneficial. We propose VR Mover, an LLM-empowered solution that can understand and interpret the user's vocal instruction to support object manipulation. The user simply points and speaks, and the LLM manipulates objects without structured input. Our user study demonstrates that VR Mover enhances usability, overall experience and performance on multi-object manipulation, while also reducing workload and arm fatigue. Users prefer the proposed natural interface for broad movements and may complementarily switch to gizmos or virtual hands for finer adjustments. These findings are believed to contribute to design implications for future LLM-based object manipulation interfaces, highlighting the potential for more intuitive and efficient user interactions in VR environments.

Paperid: 2407, https://arxiv.org/pdf/2502.02051.pdf

Abstract:
Positive human perception of robots is critical to achieving sustained use of robots in shared environments. One key factor affecting human perception of robots is their sounds, especially the consequential sounds which robots (as machines) must produce as they operate. This paper explores qualitative responses from 182 participants to gain insight into human perception of robot consequential sounds. Participants viewed videos of different robots performing their typical movements, and responded to an online survey regarding their perceptions of robots and the sounds they produce. Topic analysis was used to identify common properties of robot consequential sounds that participants expressed liking, disliking, wanting or wanting to avoid being produced by robots. Alongside expected reports of disliking high-pitched and loud sounds, many participants preferred informative and audible sounds (over no sound) to provide predictability of purpose and trajectory of the robot. Rhythmic sounds were preferred over acute or continuous sounds, and many participants wanted more natural sounds (such as wind or cat purrs) in place of machine-like noise. The results presented in this paper support future research on methods to improve consequential sounds produced by robots by highlighting features of sounds that cause negative perceptions, and providing insights into sound profile changes for improvement of human perception of robots, thus enhancing human-robot interaction.

Paperid: 2408, https://arxiv.org/pdf/2502.01829.pdf

Abstract:
Health information technologies are transforming how mental healthcare is paid for through value-based care programs, which tie payment to data quantifying care outcomes. However, it is unclear what outcomes data these technologies should store, how to engage users in data collection, and how outcomes data can improve care. Given these challenges, we conducted interviews with 30 U.S.-based mental health clinicians to explore the design space of health information technologies that support outcomes data specification, collection, and use in value-based mental healthcare. Our findings center clinicians' perspectives on aligning outcomes data for payment programs and care; opportunities for health technologies and personal devices to improve data collection; and considerations for using outcomes data to hold stakeholders including clinicians, health insurers, and social services financially accountable in value-based mental healthcare. We conclude with implications for future research designing and developing technologies supporting value-based care across stakeholders involved with mental health service delivery.

Paperid: 2409, https://arxiv.org/pdf/2502.00637.pdf

Abstract:
AI ethics narratives have the potential to shape the public's accurate understanding of AI technologies and promote communication among different stakeholders. However, AI ethics narratives are largely lacking. The limited existing narratives tend to center on works of science fiction or corporate marketing campaigns of large technology companies. Misuse of the "socio-technical imaginary" can blur the line between speculation and reality for the public, undermining the responsibility and regulation of technology development. Therefore, constructing authentic AI ethics narratives is an urgent task. The emergence of generative AI offers new possibilities for building narrative systems. This study is dedicated to data-driven visual storytelling about AI ethics relying on human-AI collaboration. Based on the five key elements of story models, we proposed a conceptual framework for human-AI collaboration and explored the roles of generative AI and humans in the creation of visual stories. We implemented the conceptual framework in a real AI news case. This research leveraged advanced generative AI technologies to provide a reference for constructing genuine AI ethics narratives. Our goal is to promote active public engagement and discussions through authentic AI ethics narratives, thereby contributing to the development of better AI policies.

Paperid: 2410, https://arxiv.org/pdf/2502.00202.pdf

Abstract:
By leveraging quantum-mechanical properties like superposition, entanglement, and interference, quantum computing (QC) offers promising solutions for problems that classical computing has not been able to solve efficiently, such as drug discovery, cryptography, and physical simulation. Unfortunately, adopting QC remains difficult for potential users like QC beginners and application-specific domain experts, due to limited theoretical and practical knowledge, the lack of integrated interface-wise support, and poor documentation. For example, to use quantum computers, one has to convert conceptual logic into low-level code, analyze quantum program results, and share programs and results. To support the wider adoption of QC, we, as designers and QC experts, propose interaction techniques for QC through design iterations. These techniques include writing quantum code conceptually, comparing initial quantum programs with optimized programs, sharing quantum program results, and exploring quantum machines. We demonstrate the feasibility and utility of these techniques via use cases with high-fidelity prototypes.

Paperid: 2411, https://arxiv.org/pdf/2502.00066.pdf

Abstract:
This study presents a narrative review of the use of digital health technologies (DHTs) and artificial intelligence to screen and mitigate risks and mental health consequences associated with adverse childhood experiences (ACEs) among children and youth. Several databases were searched for studies published from August 2017 to August 2022. Selected studies (1) explored the relationship between digital health interventions and mitigation of negative health outcomes associated with mental health in childhood and adolescence and (2) examined prevention of ACE occurrence associated with mental illness in childhood and adolescence. A total of 18 papers were selected, according to our inclusion and exclusion criteria, to evaluate and identify means by which existing digital solutions may be useful in mitigating the mental health consequences associated with the occurrence of ACEs in childhood and adolescence and preventing ACE occurrence due to mental health consequences. We also highlighted a few knowledge gaps or barriers to DHT implementation and usability. Findings from the search suggest that the incorporation of DHTs, if implemented successfully, has the potential to improve the quality of related care provisions for the management of mental health consequences of adverse or traumatic events in childhood, including posttraumatic stress disorder, suicidal behavior or ideation, anxiety or depression, and attention-deficit/hyperactivity disorder. The use of DHTs, machine learning tools, natural language processing, and artificial intelligence can positively help in mitigating ACEs and associated risk factors. Under proper legal regulations, security, privacy, and confidentiality assurances, digital technologies could also assist in promoting positive childhood experiences in children and young adults, bolstering resilience, and providing reliable public health resources to serve populations in need.

Paperid: 2412, https://arxiv.org/pdf/2502.00044.pdf

Abstract:
We present HoloGraphs, a novel approach for physically representing, explaining, exploring, and interacting with dynamic networks. HoloGraphs addresses the challenges of visualizing and understanding evolving network structures by providing an engaging method of interacting with and exploring dynamic network structures using physicalization techniques. In contrast to traditional digital interfaces, our approach leverages tangible artifacts made from transparent materials to provide an intuitive way for people with low visualization literacy to explore network data. The process involves printing network embeddings on transparent media and assembling them to create a 3D representation of dynamic networks, maintaining spatial perception and allowing the examination of each timeslice individually. Interactivity is envisioned using optional Focus+Context layers and overlays for node trajectories and labels. Focus layers highlight nodes of interest, context layers provide an overview of the network structure, and global overlays show node trajectories over time. In this paper, we outline the design principles and implementation of HoloGraphs and present how elementary digital interactions can be mapped to physical interactions to manipulate the elements of a network and its temporal dimension in an engaging manner. We demonstrate the capabilities of our concept in a case study. Using a dynamic network of character interactions from a popular book series, we showcase how it represents and supports understanding complex concepts such as dynamic networks.

Paperid: 2413, https://arxiv.org/pdf/2502.00023.pdf

Abstract:
Our research explores the development and application of musical agents, human-in-the-loop generative AI systems designed to support music performance and improvisation within co-creative spaces. We introduce MACAT and MACataRT, two distinct musical agent systems crafted to enhance interactive music-making between human musicians and AI. MACAT is optimized for agent-led performance, employing real-time synthesis and self-listening to shape its output autonomously, while MACataRT provides a flexible environment for collaborative improvisation through audio mosaicing and sequence-based learning. Both systems emphasize training on personalized, small datasets, fostering ethical and transparent AI engagement that respects artistic integrity. This research highlights how interactive, artist-centred generative AI can expand creative possibilities, empowering musicians to explore new forms of artistic expression in real-time, performance-driven and music improvisation contexts.

Paperid: 2414, https://arxiv.org/pdf/2501.18764.pdf

Abstract:
Haptic sensations that align with virtual reality (VR) experiences have a profound impact on presence and enjoyment. There is potential to explore the dynamic capabilities of pneumatic inflatables to offer immersive sensations in virtual environments, including variations in shape, size, and stiffness. We introduce Pneutouch, an ungrounded and untethered wrist-worn device designed as a pneumatic haptic interface for VR interactions. Pneutouch's dynamic inflation ability enables programmable stiffness and shape change of haptic proxies. Additionally, multiple haptic proxies can be delivered into and out of the user's hand grasp. We describe the implementation of the Pneutouch device. We conducted user studies to demonstrate the affordances of pneumatic inflatables and assessed the device's efficacy in providing haptic feedback. With Pneutouch, our goal is to expand what can be touched in the virtual space and bring more immersion into virtual reality.

Paperid: 2415, https://arxiv.org/pdf/2501.18755.pdf

Abstract:
Existing methods of haptic feedback for virtual fluids are challenging to scale, lack durability for long-term rough use, and fail to fully capture the expressive haptic qualities of fluids. To overcome these limitations, we present Vibr-eau, a physical system designed to emulate the sensation of virtual fluids in vessels using vibrotactile actuators. Vibr-eau uses spatial and temporal vibrotactile feedback to create realistic haptic sensations within a 3D-printed vessel. When the users are in the virtual environment and interact with the physical vessel, vibration impulses are triggered and the user will feel like there is fluid in the vessel. We explore the impact of motor density, direct touch, and vibration strength on users' perception of virtual fluid sensations. User studies reveal that Vibr-eau effectively simulates dynamic weight shifts and fluid-like sensations, with participants reporting experiences closely resembling real-world interactions with fluids. Our findings contribute to the development of adaptable and scalable haptic applications for virtual fluids, providing insights into optimizing parameters for realistic and perceptually faithful simulated fluid experiences in VR environments.

Paperid: 2416, https://arxiv.org/pdf/2501.18642.pdf

Abstract:
Ethical intervention prompting has emerged as a tool to counter demographic biases of text-to-image generative AI models. Existing solutions either require retraining the model or struggle to generate images that reflect desired distributions on gender and race. We propose an inference-time process called DebiasPI for Debiasing-by-Prompt-Iteration that provides prompt intervention by enabling the user to control the distributions of individuals' demographic attributes in image generation. DebiasPI keeps track of which attributes have been generated either by probing the internal state of the model or by using external attribute classifiers. Its control loop guides the text-to-image model to select not yet sufficiently represented attributes. With DebiasPI, we were able to create images with equal representations of race and gender that visualize challenging concepts of news headlines. We also experimented with the attributes age, body type, profession, and skin tone, and measured how attributes change when our intervention prompt targets the distribution of an unrelated attribute type. We found, for example, that if the text-to-image model is asked to balance racial representation, gender representation improves but the skin tone becomes less diverse. Attempts to cover a wide range of skin colors with various intervention prompts showed that the model struggles to generate the palest skin tones. We conducted various ablation studies, in which we removed DebiasPI's attribute control, that reveal the model's propensity to generate young, male characters. It sometimes visualized career success by generating two-panel images with a pre-success dark-skinned person becoming light-skinned with success, or switching gender from pre-success female to post-success male, thus further motivating ethical intervention prompting with DebiasPI.
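The control loop described above can be illustrated with a minimal sketch: track how often each attribute value has been produced, then request the value that is furthest below its target share. This is not the DebiasPI implementation; the function names, the deficit rule, and the `generate_and_classify` callback (a stand-in for the text-to-image model plus an attribute classifier) are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict


def next_attribute(target: Dict[str, float], counts: Counter) -> str:
    """Pick the attribute value whose observed share is furthest below its target.

    Hypothetical selection rule, chosen for illustration only.
    """
    total = sum(counts.values())

    def deficit(value: str) -> float:
        observed = counts[value] / total if total else 0.0
        return target[value] - observed

    # Ties resolve to the first value in the target's insertion order.
    return max(target, key=deficit)


def run_control_loop(
    target: Dict[str, float],
    n_images: int,
    generate_and_classify: Callable[[str], str],
) -> Counter:
    """Generate n_images, steering each request toward under-represented values.

    generate_and_classify is a hypothetical callback: it receives the requested
    attribute value and returns the value actually detected in the output.
    """
    counts: Counter = Counter()
    for _ in range(n_images):
        requested = next_attribute(target, counts)
        produced = generate_and_classify(requested)
        counts[produced] += 1
    return counts


# Stand-in generator that always honours the request (an idealised assumption).
result = run_control_loop({"a": 0.5, "b": 0.5}, 10, lambda requested: requested)
print(dict(result))
```

With a generator that always honours the request, the loop alternates between the two values and ends with an even split; a real model may ignore the prompt, which is why the counts are updated from the detected attribute rather than the requested one.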

Paperid: 2417, https://arxiv.org/pdf/2501.18493.pdf

Abstract:
Alongside the growth of generative AI, we are witnessing a surge in the use of synthetic data across all stages of the AI development pipeline. It is now common practice for researchers and practitioners to use one large generative model (which we refer to as an auxiliary model) to generate synthetic data that is used to train or evaluate another, reconfiguring AI workflows and reshaping the very nature of data. While scholars have raised concerns over the risks of synthetic data, policy guidance and best practices for its responsible use have not kept up with these rapidly evolving industry trends, in part because we lack a clear picture of current practices and challenges. Our work aims to address this gap. Through 29 interviews with AI practitioners and responsible AI experts, we examine the expanding role of synthetic data in AI development. Our findings reveal how auxiliary models are now widely used across the AI development pipeline. Practitioners describe synthetic data as crucial for addressing data scarcity and providing a competitive edge, noting that evaluation of generative AI systems at scale would be infeasible without auxiliary models. However, they face challenges controlling the outputs of auxiliary models, generating data that accurately depict underrepresented groups, and scaling data validation practices that are based primarily on manual inspection. We detail general limitations of and ethical considerations for synthetic data and conclude with a proposal of concrete steps towards the development of best practices for its responsible use.

Paperid: 2418, https://arxiv.org/pdf/2501.18210.pdf

Abstract:
Algorithms have played a central role in personalized recommendations on social media. However, they also present significant obstacles for content creators trying to predict and manage their audience reach. This issue is particularly challenging for marginalized groups seeking to maintain safe spaces. Our study explores how women on Xiaohongshu (rednote), a recommendation-driven social platform, proactively re-appropriate hashtags (e.g., #Baby Supplemental Food) by using them in posts unrelated to their literal meaning. The hashtags were strategically chosen from topics that would be uninteresting to the male audience they wanted to block. Through a mixed-methods approach, we analyzed the practice of hashtag re-appropriation based on 5,800 collected posts and interviewed 24 active users from diverse backgrounds to uncover users' motivations and reactions towards the re-appropriation. This practice highlights how users can reclaim agency over content distribution on recommendation-driven platforms, offering insights into self-governance within algorithmic-centered power structures.

Paperid: 2419, https://arxiv.org/pdf/2501.18103.pdf

Abstract:
Traditional text-based human-AI interactions often adhere to a strict turn-taking approach. In this research, we propose a novel approach that incorporates overlapping messages, mirroring natural human conversations. Through a formative study, we observed that even in text-based contexts, users instinctively engage in overlapping behaviors like "A: Today I went to-" "B: yeah." To capitalize on these insights, we developed OverlapBot, a prototype chatbot where both AI and users can initiate overlapping. Our user study revealed that OverlapBot was perceived as more communicative and immersive than a traditional turn-taking chatbot, fostering faster and more natural interactions. Our findings contribute to the understanding of the design space for overlapping interactions. We also provide recommendations for implementing overlap-capable AI interactions to enhance the fluidity and engagement of text-based conversations.

Paperid: 2420, https://arxiv.org/pdf/2501.17629.pdf

Abstract:
The current cycle of hype and anxiety concerning the benefits and risks to human society of Artificial Intelligence is fuelled not only by the increasing use of generative AI and other AI tools by the general public, but also by claims made on behalf of such technology by popularizers and scientists. In particular, recent studies have claimed that Large Language Models (LLMs) can pass the Turing Test, a goal for AI since the 1950s, and therefore can "think". Large-scale impacts on society have been predicted as a result. Upon detailed examination, however, none of these studies has faithfully applied Turing's original instructions. Consequently, we conducted a rigorous Turing Test with GPT-4-Turbo that adhered closely to Turing's instructions for a three-player imitation game. We followed established scientific standards where Turing's instructions were ambiguous or missing. For example, we performed a Computer-Imitates-Human Game (CIHG) without constraining the time duration and conducted a Man-Imitates-Woman Game (MIWG) as a benchmark. All but one participant correctly identified the LLM, showing that one of today's most advanced LLMs is unable to pass a rigorous Turing Test. We conclude that recent extravagant claims for such models are unsupported, and do not warrant either optimism or concern about the social impact of thinking machines.

Paperid: 2421, https://arxiv.org/pdf/2501.17430.pdf

Abstract:
Social media platforms (SMPs) facilitate information sharing across varying levels of sensitivity. A crucial design decision for SMP administrators is the platform's identity policy, with some opting for real-name systems while others allow anonymous participation. Content moderation on these platforms is conducted by both humans and automated bots. This paper examines the relationship between anonymity, specifically through the use of "throwaway" accounts, and the extent and nature of content moderation on Reddit. Our findings indicate that content originating from anonymous throwaway accounts is more likely to violate rules on Reddit and is therefore more likely to be removed by moderation than content from standard pseudonymous accounts. However, the moderation actions applied to throwaway accounts are consistent with those applied to ordinary accounts, suggesting that the use of anonymous accounts does not necessarily require increased human moderation. We conclude by discussing the implications of these findings for identity policies and content moderation strategies on SMPs.

Paperid: 2422, https://arxiv.org/pdf/2501.17375.pdf

Abstract:
Virtual reality (VR) technology can be used to treat anxiety symptoms and disorders. However, most VR interventions for anxiety have been therapist-guided rather than self-guided. This systematic review aimed to examine the effectiveness and user experience (i.e., usability, acceptability, safety, and attrition rates) of self-guided VR therapy interventions in people with any anxiety condition as well as provide future research directions. Peer-reviewed journal articles reporting on self-guided VR interventions for anxiety were sought from the Cochrane Library, IEEE Xplore Digital Library, PsycINFO, PubMed, Scopus, and Web of Science databases. Study data from the eligible articles were extracted, tabulated, and addressed with a narrative synthesis. A total of 21 articles met the inclusion criteria. The findings revealed that self-guided VR interventions for anxiety can provide an effective treatment of social anxiety disorder, public speaking anxiety, and specific phobias. User experience outcomes of safety, usability, and acceptability were generally positive and the average attrition rate was low. However, there was a lack of standardised assessments to measure user experiences. Self-guided VR for anxiety can provide an engaging approach for effectively and safely treating common anxiety conditions. Nevertheless, more experimental studies are required to examine their use in underrepresented anxiety populations, assess their long-term treatment effects beyond 12 months, and compare their effectiveness against other self-help interventions for anxiety (e.g., internet interventions and bibliotherapy).

Paperid: 2423, https://arxiv.org/pdf/2501.17299.pdf

Abstract:
Journalism has emerged as an essential domain for understanding the uses, limitations, and impacts of large language models (LLMs) in the workplace. News organizations face divergent financial incentives: LLMs already permeate newswork processes within financially constrained organizations, even as ongoing legal challenges assert that AI companies violate their copyright. At stake are key questions about what LLMs are created to do, and by whom: How might a journalist-led LLM work, and what can participatory design illuminate about the present-day challenges of adapting "one-size-fits-all" foundation models to a given context of use? In this paper, we undertake a co-design exploration to understand how a participatory approach to LLMs might address opportunities and challenges around AI in journalism. Our 20 interviews with reporters, data journalists, editors, labor organizers, product leads, and executives highlight macro, meso, and micro tensions that designing for this opportunity space must address. From these desiderata, we describe the result of our co-design work: organizational structures and functionality for a journalist-controlled LLM. In closing, we discuss the limitations of commercial foundation models for workplace use, and the methodological implications of applying participatory methods to LLM co-design.

Paperid: 2424, https://arxiv.org/pdf/2501.16929.pdf

Abstract:
While shared autonomy offers significant potential for assistive robotics, key questions remain about how to effectively map 2D control inputs to 6D robot motions. An intuitive framework should allow users to input commands effortlessly, with the robot responding as expected, without users needing to anticipate the impact of their inputs. In this article, we propose a dynamic input mapping framework that links joystick movements to motions on control frames defined along a trajectory encoded with canal surfaces. We evaluate our method in a user study with 20 participants, demonstrating that our input mapping framework reduces the workload and improves usability compared to a baseline mapping with similar motion encoding. To prepare for deployment in assistive scenarios, we built on the development from the accessible gaming community to select an accessible control interface. We then tested the system in an exploratory study, where three wheelchair users controlled the robot for both daily living activities and a creative painting task, demonstrating its feasibility for users closer to our target population.

Paperid: 2425, https://arxiv.org/pdf/2501.16693.pdf

Abstract:
Artificial Intelligence (AI) has demonstrated potential in healthcare, particularly in enhancing diagnostic accuracy and decision-making through Clinical Decision Support Systems (CDSSs). However, the successful implementation of these systems relies on user trust and reliance, which can be influenced by explainable AI. This study explores the impact of varying explainability levels on clinicians' trust, cognitive load, and diagnostic performance in breast cancer detection. Utilizing an interrupted time series design, we conducted a web-based experiment involving 28 healthcare professionals. The results revealed that high confidence scores substantially increased trust but also led to overreliance, reducing diagnostic accuracy. In contrast, low confidence scores decreased trust and agreement while increasing diagnosis duration, reflecting more cautious behavior. Some explainability features influenced cognitive load by increasing stress levels. Additionally, demographic factors such as age, gender, and professional role shaped participants' perceptions and interactions with the system. This study provides valuable insights into how explainability impacts clinicians' behavior and decision-making. The findings highlight the importance of designing AI-driven CDSSs that balance transparency, usability, and cognitive demands to foster trust and improve integration into clinical workflows.

Paperid: 2426, https://arxiv.org/pdf/2501.16661.pdf

Abstract:
Mining and conveying actionable insights from complex data is a key challenge of exploratory data analysis (EDA) and storytelling. To address this challenge, we present a design space for actionable EDA and storytelling. Synthesizing theory and expert interviews, we highlight how semantic precision, rhetorical persuasion, and pragmatic relevance underpin effective EDA and storytelling. We also show how this design space subsumes common challenges in actionable EDA and storytelling, such as identifying appropriate analytical strategies and leveraging relevant domain knowledge. Building on the potential of LLMs to generate coherent narratives with commonsense reasoning, we contribute Jupybara, an AI-enabled assistant for actionable EDA and storytelling implemented as a Jupyter Notebook extension. Jupybara employs two strategies -- design-space-aware prompting and multi-agent architectures -- to operationalize our design space. An expert evaluation confirms Jupybara's usability, steerability, explainability, and reparability, as well as the effectiveness of our strategies in operationalizing the design space framework with LLMs.

Paperid: 2427, https://arxiv.org/pdf/2501.16305.pdf

Abstract:
This paper explores opportunities and challenges for data-driven advocacy to support home care workers, an often overlooked group of low-wage, frontline health workers. First, we investigate what data to collect and how to collect it in ways that preserve privacy and avoid burdening workers. Second, we examine how workers and advocates could use collected data to strengthen individual and collective advocacy efforts. Our qualitative study with 11 workers and 15 advocates highlights tensions between workers' desires for individual and immediate benefits and advocates' preferences to prioritize more collective and long-term benefits. We also uncover discrepancies between participants' expectations for how data might transform advocacy and their on-the-ground experiences collecting and using real data. Finally, we discuss future directions for data-driven worker advocacy, including combining different kinds of data to ameliorate challenges, leveraging advocates as data stewards, and accounting for workers' and organizations' heterogeneous goals.

Paperid: 2428, https://arxiv.org/pdf/2501.15770.pdf

Abstract:
Procrastination, the voluntary delay of tasks despite potential negative consequences, has prompted numerous time and task management interventions in the HCI community. While these interventions have shown promise in addressing specific behaviors, psychological theories suggest that learning about procrastination itself may help individuals develop their own coping strategies and build mental resilience. However, little research has explored how to support this learning process through HCI approaches. We present ProcrastiMate, a text adventure game where players learn about procrastination's causes and experiment with coping strategies by guiding in-game characters in managing relatable scenarios. Our field study with 27 participants revealed that ProcrastiMate facilitated learning and self-reflection while maintaining psychological distance, motivating players to integrate newly acquired knowledge in daily life. This paper contributes empirical insights on leveraging serious games to facilitate learning about procrastination and offers design implications for addressing psychological challenges through HCI approaches.

Paperid: 2429, https://arxiv.org/pdf/2501.15608.pdf

Abstract:
The work of Emergency Management (EM) agencies requires timely collection of relevant data to inform decision-making for operations and public communication before, during, and after a disaster. However, the limited human resources available to deploy for field data collection is a persistent problem for EM agencies. Thus, many of these agencies have started leveraging social media as a supplemental data source and a new venue to engage with the public. While prior research has analyzed the potential benefits and attitudes of practitioners and the public when leveraging social media during disasters, a gap exists in the critical analysis of the actual practices and uses of social media among EM agencies, across both geographical regions and phases of the EM lifecycle - typically mitigation, preparedness, response, and recovery. In this paper, we conduct a mixed-method analysis to update and fill this gap on how EM practitioners in the U.S. and Europe use social media, building on a survey study of about 150 professionals and a follow-up interview study with 11 participants. The results indicate that using social media is no longer a non-traditional practice in operational and informational processes for the decision-making of EM agencies working at both the local level (e.g., county or town) and non-local level (e.g., state/province, federal/national) for emergency management. In particular, practitioners affiliated with agencies working at the local level perceive social media as highly valuable for situational awareness (e.g., analyzing disaster extent and impact) and public communication (e.g., disseminating timely information and correcting errors in crisis coverage). We conclude with the policy, technological, and socio-technical needs to design future social media analytics systems to support the work of EM agencies in such communication including the applications of AI.

Paperid: 2430, https://arxiv.org/pdf/2501.15332.pdf

Abstract:
The integration of artificial intelligence (AI) into human teams is widely expected to enhance performance and collaboration. However, our study reveals a striking and counterintuitive result: human-AI teams performed worse than human-only teams, especially when task difficulty increased. Using a virtual reality-based sensorimotor task, we observed that the inclusion of an active human-like AI teammate disrupted team dynamics, leading to elevated arousal, reduced engagement, and diminished communication intensity among human participants. These effects persisted even as the human teammates' perception of the AI teammate improved over time. These findings challenge prevailing assumptions about the benefits of AI in team settings and highlight the critical need for human-centered AI design to mitigate adverse physiological and behavioral impacts, ensuring more effective human-AI collaboration.

Paperid: 2431, https://arxiv.org/pdf/2501.15276.pdf

Abstract:
Artificial intelligence is reshaping creative domains, yet its co-creative processes, especially in group settings with novice users, remain underexplored. To bridge this gap, we conducted a case study in a college-level course where nine undergraduate students were tasked with creating three original music tracks using AI tools over 10 weeks. The study spanned the entire creative journey from ideation to releasing these songs on Spotify. Participants leveraged AI for music and lyric production, cover art, and distribution. Our findings highlight how AI transforms creative workflows: accelerating ideation but compressing the traditional preparation stage, and requiring novices to navigate a challenging idea selection and validation phase. We also identified a new "collaging and refinement" stage, where participants creatively combined diverse AI-generated outputs into cohesive works. Furthermore, AI influenced group social dynamics and role division among human creators. Based on these insights, we propose the Human-AI Co-Creation Stage Model and the Human-AI Agency Model, offering new perspectives on collaborative co-creation with AI.

Paperid: 2432, https://arxiv.org/pdf/2501.14648.pdf

Abstract:
Justice, epistemology, and marginalization are rich areas of study in HCI. And yet, we repeatedly find platforms and algorithms that push communities further into the margins. In this paper, we propose epistemic autonomy -- one's ability to govern knowledge about themselves -- as a necessary HCI paradigm for working with marginalized communities. We establish epistemic autonomy by applying the transfeminine principle of autonomy to the problem of epistemic injustice. To articulate the harm of violating one's epistemic autonomy, we present six stories from two trans women: (1) a transfem online administrator and (2) a transfem researcher. We then synthesize our definition of epistemic autonomy in research into a research paradigm. Finally, we present two variants of common HCI methods, autoethnography and asynchronous remote communities, that stem from these beliefs. We discuss how CHI is uniquely situated to champion this paradigm and, thereby, the epistemic autonomy of our research participants.

Paperid: 2433, https://arxiv.org/pdf/2501.13878.pdf

Abstract:
Advanced multimodal AI agents can now collaborate with users to solve challenges in the world. Yet, these emerging contextual AI systems rely on explicit communication channels between the user and system. We hypothesize that implicit communication of the user's interests and intent would reduce friction and improve user experience when collaborating with AI agents. In this work, we explore the potential of wearable eye tracking to convey signals about user attention. We measure the eye tracking signal quality requirements to effectively map gaze traces to physical objects, then conduct experiments that provide visual scanpath history as additional context when querying vision language models. Our results show that eye tracking provides high value as a user attention signal and can convey important context about the user's current task and interests, improving understanding of contextual AI agents.

Paperid: 2434, https://arxiv.org/pdf/2501.13778.pdf

Abstract:
We present Explainable XR, an end-to-end framework for analyzing user behavior in diverse eXtended Reality (XR) environments by leveraging Large Language Models (LLMs) for data interpretation assistance. Existing XR user analytics frameworks face challenges in handling cross-virtuality - AR, VR, MR - transitions, multi-user collaborative application scenarios, and the complexity of multimodal data. Explainable XR addresses these challenges by providing a virtuality-agnostic solution for the collection, analysis, and visualization of immersive sessions. We propose three main components in our framework: (1) a novel user data recording schema, called User Action Descriptor (UAD), that can capture the users' multimodal actions, along with their intents and the contexts; (2) a platform-agnostic XR session recorder; and (3) a visual analytics interface that offers LLM-assisted insights tailored to the analysts' perspectives, facilitating the exploration and analysis of the recorded XR session data. We demonstrate the versatility of Explainable XR through five use-case scenarios, in both individual and collaborative XR applications across virtualities. Our technical evaluation and user studies show that Explainable XR provides a highly usable analytics solution for understanding user actions and delivering multifaceted, actionable insights into user behaviors in immersive environments.

Paperid: 2435, https://arxiv.org/pdf/2501.13233.pdf

Abstract:
Small talk can foster rapport building in human-human teamwork; yet how non-anthropomorphic robots, such as collaborative manipulators commonly used in industry, may capitalize on these social communications remains unclear. This work investigates how robot-initiated small talk influences task performance, rapport, and interaction dynamics in human-robot collaboration. We developed an autonomous robot system that assists a human in an assembly task while initiating and engaging in small talk. A user study ($N = 58$) was conducted in which participants worked with either a functional robot, which engaged in only task-oriented speech, or a social robot, which also initiated small talk. Our study found that participants in the social condition reported significantly higher levels of rapport with the robot. Moreover, all participants in the social condition responded to the robot's small talk attempts; 59% initiated questions to the robot, and 73% engaged in lingering conversations after requesting the final task item. Although active working times were similar across conditions, participants in the social condition recorded longer task durations than those in the functional condition. We discuss the design and implications of robot small talk in shaping human-robot collaboration.

Paperid: 2436, https://arxiv.org/pdf/2501.13145.pdf

Abstract:
AI can now generate high-fidelity UI mock-up screens from a high-level textual description, promising to support UX practitioners' work. However, it remains unclear how UX practitioners would adopt such Generative UI (GenUI) models in a way that is integral and beneficial to their work. To answer this question, we conducted a formative study with 37 UX-related professionals spanning four roles: UX designers, UX researchers, software engineers, and product managers. Using a state-of-the-art GenUI tool, each participant went through a week-long, individual mini-project exercise with role-specific tasks, keeping a daily journal of their usage and experiences with GenUI, followed by a semi-structured interview. We report findings on participants' workflow using the GenUI tool, how GenUI can support all roles as well as each specific role, and existing gaps between GenUI and users' needs and expectations, which lead to design implications to inform future work on GenUI development.

Paperid: 2437, https://arxiv.org/pdf/2501.12924.pdf

Abstract:
The focus on managing problems that can arise for older adults has meant that extant HCI and Ageing research has not given the concepts of 'age' and 'ageing' the explicit theoretical attention they deserve. Attending to this gap, we critically examine a ten-year corpus of CHI publications through the lens of an existing typology which we have further developed to analyse how age is understood, interpreted and constructed in the field of HCI. Our resulting multidimensional typology of age in HCI elucidates the distinctive characteristics of older adults considered when designing with and for this user group, but also highlights the need for a more critical, reflexive, social constructivist approach to age in HCI. Applying this approach, we explore age as a multidimensional system of stratification to better understand the phenomenon of the age-based digital divide.

Paperid: 2438, https://arxiv.org/pdf/2501.11792.pdf

Abstract:
Effective debugging is a crucial aspect of software development, demanding problem-solving skills, expertise, and appropriate tools. Although previous research has studied expert developers' debugging strategies, the specific factors influencing strategy choice in complex scenarios remain underexplored. To investigate these contextual factors, we conducted two studies. First, we surveyed 35 developers to identify experiences with challenging debugging problems and contextual complexities. Second, we held semi-structured interviews with 16 experienced developers to gain deeper insight into strategic reasoning for complex debugging tasks. Insights from both groups enriched our understanding of debugging strategies at different expertise levels. We found that contextual factors interact in complex ways, and combinations of factors influence strategy choice, evolving throughout the debugging process. Hypothesis making is the baseline for debugging, with experience and code familiarity crucial for strategy selection. Our results show a gap between learning and effectively practicing strategies in challenging contexts, highlighting the need for carefully designed debugging tools and educational frameworks that align with problem contexts.

Paperid: 2439, https://arxiv.org/pdf/2501.10137.pdf

Abstract:
Stopword removal is a critical stage in many Machine Learning methods but often receives little consideration; when neglected, it interferes with model visualizations and disrupts user confidence. Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models, thus reducing the willingness of practitioners and stakeholders to rely on the output visualizations. This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood and an interactive visualization system to support their analysis. We evaluated our approach and interface using real-world data, a commonly used Machine Learning method (Topic Modelling), and a comprehensive qualitative experiment probing user confidence. The results of our work show that our system increases user confidence in the credibility of topic models by (1) returning reasonable probabilities, (2) generating an appropriate and representative extension of common stopword lists, and (3) providing an adjustable threshold for estimating and analyzing stopwords visually. Finally, we discuss insights, recommendations, and best practices to support practitioners while improving the output of Machine Learning methods and topic model visualizations with robust stopword analysis and removal.
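To make the idea of a corpus-specific stopword score with an adjustable threshold concrete, here is a minimal Python sketch. It is not the paper's extraction method, which the abstract does not specify: document frequency is used purely as an assumed stand-in score, and the names `stopword_scores` and `select_stopwords` are hypothetical.

```python
from collections import Counter


def stopword_scores(documents):
    """Score each token by the fraction of documents that contain it.

    ASSUMPTION: document frequency is only a stand-in for the paper's
    probabilistic stopword likelihood, which the abstract does not define.
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each token at most once per document.
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(documents)
    return {token: count / n_docs for token, count in doc_freq.items()}


def select_stopwords(scores, threshold):
    """Return, in sorted order, the tokens whose score meets the threshold."""
    return sorted(token for token, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog ran", "a cat and the dog"]
    scores = stopword_scores(corpus)
    # Lowering the threshold admits more candidate stopwords.
    print(select_stopwords(scores, 0.9))
    print(select_stopwords(scores, 0.6))
```

Exposing the threshold as a parameter mirrors the adjustable threshold described in the abstract: an analyst can move it and inspect which tokens enter or leave the candidate list.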

Paperid: 2440, https://arxiv.org/pdf/2501.09910.pdf

Abstract:
Apologies serve essential functions for moral agents such as expressing remorse, taking responsibility, and repairing trust. LLM-based chatbots routinely produce output that has the linguistic form of an apology. However, they do this simply because they are echoing the kinds of things that humans say. Moreover, there are reasons to think that chatbots are not the kind of linguistic or moral agents capable of apology. To put the point bluntly: Chatbot apologies are bullshit. This paper explores this concern and develops it beyond the epithet, drawing on the nature of morally serious apologies, the linguistic agency required to perform them, and the moral agency required for them to matter. We conclude by considering some consequences for how chatbots should be designed and how we ought to think about them.

Paperid: 2441, https://arxiv.org/pdf/2501.09521.pdf

Abstract:
We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data to enable accurate question answering in visualization of scientific data, making conversational visualization possible. LLMs struggle with tasks like visual data interaction, as they lack contextual visual information. We address this problem by merging a text description of a visualization and dataset with snapshots of the visualization. We extract their essential features into a structured text file, highly compact, yet descriptive enough to appropriately augment the LLM with contextual information, without any fine-tuning. This approach can be applied to any visualization that has already been rendered, as long as it is associated with some textual description.

Paperid: 2442, https://arxiv.org/pdf/2501.09235.pdf

Abstract:
This paper examines how gifting spreads among viewers on Twitch, one of the largest live streaming platforms worldwide. Twitch users can give gift subscriptions to other viewers in the chat room, with the majority of gifters opting for community gifting, which is gifting to randomly selected viewers. We identify the random nature of gift-receiving in our data as a natural experiment setting. We investigate whether gift recipients pay it forward, considering various gift types that may either promote or deter the spread of gifting. Our findings reveal that Twitch viewers who receive gift subscriptions are generally more likely to pay it forward than non-recipients, and the positive impact of gift-receiving becomes stronger when the recipient is the sole beneficiary of the giver's gifting behavior. However, we found that gifts from frequent gifters discourage recipients from paying it forward, and gifts from anonymous gifters do not influence the likelihood of viewers becoming future gifters. This research contributes to the existing literature on the spread of online prosocial behavior by providing robust evidence and suggests practical strategies for promoting online gifting.

Paperid: 2443, https://arxiv.org/pdf/2501.08868.pdf

Abstract:
Analyzing large volumes of real-world driving data is essential for providing meaningful and reliable insights into real-world trips, scenarios, and human driving behaviors. To this end, we developed a multi-level data processing approach that adds new information, segments data, and extracts desired parameters. Leveraging a confidential but extensive dataset (over 1 million km), this approach leads to three levels of in-depth analysis: trip, scenario, and driving. The trip-level analysis explains representative properties observed in real-world trips, while the scenario-level analysis focuses on scenario conditions resulting from road events that reduce vehicle speed. The driving-level analysis identifies the cause of driving regimes for specific situations and characterizes typical human driving behaviors. Such analyses can support the design of both trip- and scenario-based tests, the modeling of human drivers, and the establishment of guidelines for connected and automated vehicles.

Paperid: 2444, https://arxiv.org/pdf/2501.08693.pdf

Abstract:
Reconstructing speech envelopes from EEG signals is essential for exploring neural mechanisms underlying speech perception. Yet, EEG variability across subjects and physiological artifacts complicate accurate reconstruction. To address this problem, we introduce Subject Disentangling Neural Network (SDN-Net), which disentangles subject identity information from reconstructed speech envelopes to enhance cross-subject reconstruction accuracy. SDN-Net integrates three key components: MLA-Codec, MPN-MI, and CTA-MTDNN. The MLA-Codec, a fully convolutional neural network, decodes EEG signals into speech envelopes. The CTA-MTDNN module, a multi-scale time-delay neural network with channel and temporal attention, extracts subject identity features from EEG signals. Lastly, the MPN-MI module, a mutual information estimator with a multi-layer perceptron, supervises the removal of subject identity information from the reconstructed speech envelope. Experiments on the Auditory EEG Decoding Dataset demonstrate that SDN-Net achieves superior performance in inner- and cross-subject speech envelope reconstruction compared to recent state-of-the-art methods.

Paperid: 2445, https://arxiv.org/pdf/2501.08182.pdf

Abstract:
The field of affective computing has seen significant advancements in exploring the relationship between emotions and emerging technologies. This paper presents a novel and valuable contribution to this field with the introduction of a comprehensive French multimodal dataset designed specifically for emotion recognition. The dataset encompasses three primary modalities: facial expressions, speech, and gestures, providing a holistic perspective on emotions. Moreover, the dataset has the potential to incorporate additional modalities, such as Natural Language Processing (NLP) to expand the scope of emotion recognition research. The dataset was curated through engaging participants in card game sessions, where they were prompted to express a range of emotions while responding to diverse questions. The study included 10 sessions with 20 participants (9 females and 11 males). The dataset serves as a valuable resource for furthering research in emotion recognition and provides an avenue for exploring the intricate connections between human emotions and digital technologies.

Paperid: 2446, https://arxiv.org/pdf/2501.07748.pdf

Abstract:
The vertical ground reaction force (vGRF) and its characteristic weight acceptance and push-off peaks measured during walking are important for gait and biomechanical analysis. Current wearable vGRF estimation methods suffer from drifting errors or low generalization performances, limiting their practical application. This paper proposes a novel method for reliably estimating vGRF and its characteristic peaks using data collected from the smart insole, including inertial measurement unit data and the newly introduced center of the pressed sensor data. These data were fused with machine learning algorithms including artificial neural networks, random forest regression, and bi-directional long short-term memory. The proposed method outperformed the state-of-the-art methods with the root mean squared error, normalized root mean squared error, and correlation coefficient of 0.024 body weight (BW), 1.79% BW, and 0.997 in intra-participant testing, and 0.044 BW, 3.22% BW, and 0.991 in inter-participant testing, respectively. The differences between the reference and estimated weight acceptance and push-off peak values are 0.022 BW and 0.017 BW with a delay of 1.4% and 1.8% of the gait cycle for the intra-participant testing and 0.044 BW and 0.025 BW with a delay of 1.5% and 2.3% of the gait cycle for the inter-participant testing. The results indicate that the proposed vGRF estimation method has the potential to achieve accurate vGRF measurement during walking in free living environments.
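For readers unfamiliar with the three reported metrics, the following Python sketch shows one common way to compute them for a reference and an estimated vGRF trace expressed in body weight. The normalization of the NRMSE by the range of the reference signal is an assumption (the abstract does not state which normalization was used), and the function names are illustrative only.

```python
import math


def rmse(reference, estimate):
    """Root mean squared error between two equal-length sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def nrmse_percent(reference, estimate):
    """RMSE as a percentage of the reference signal's range.

    ASSUMPTION: range normalization; the source abstract does not say
    which normalization its NRMSE uses.
    """
    span = max(reference) - min(reference)
    return 100.0 * rmse(reference, estimate) / span


def pearson_r(reference, estimate):
    """Pearson correlation coefficient between two sequences."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov / math.sqrt(var_r * var_e)


if __name__ == "__main__":
    # Toy values in body weight (BW), not data from the paper.
    ref = [0.0, 0.5, 1.0, 0.5]
    est = [0.1, 0.5, 0.9, 0.5]
    print(rmse(ref, est), nrmse_percent(ref, est), pearson_r(ref, est))
```

The toy traces are made up for illustration; they only show that an estimate can correlate almost perfectly with the reference while still carrying a nonzero RMSE, which is why the abstract reports both.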

Paperid: 2447, https://arxiv.org/pdf/2501.07234.pdf

Abstract:
The integration of haptics within Augmented Reality may help to deliver an enriched experience, while facilitating the performance of specific actions (e.g., repositioning or resizing) that are still dependent on the user's skills. This paper describes a flexible architecture designed to deploy haptically-enabled AR applications. The haptic feedback may be generated through a variety of devices (e.g., wearable, graspable, or mid-air ones), and the architecture facilitates handling the specificity of each. For this reason, we discuss how to generate a haptic representation of a 3D digital object depending on the application and the target device. Additionally, we include an analysis of practical, relevant issues that arise when setting up a system to work with specific devices like Head-Mounted Displays (e.g., HoloLens) and mid-air haptic devices (e.g., Ultrahaptics UHK), such as the alignment between the real world and the virtual one. The applicability of the architecture is demonstrated through the implementation of two applications: Form Inspector and Simon Game, built for HoloLens and iOS mobile phones for visualization and for UHK for mid-air haptics delivery. These applications have been used by nine users to explore the efficiency, meaningfulness, and usefulness of mid-air haptics for form perception, object resizing, and push interaction tasks. Results show that, although mobile interaction is preferred when this option is available, haptics turn out to be more meaningful in identifying shapes when compared to what users initially expect and in contributing to the execution of resizing tasks. Moreover, this preliminary user study reveals that users may be expecting a tailored interface metaphor, not necessarily inspired by natural interaction.

Paperid: 2448, https://arxiv.org/pdf/2501.07213.pdf

Abstract:
The integration of dialogue interfaces in mobile devices has become ubiquitous, providing a wide array of services. As technology progresses, humanoid robots designed with human-like features to interact effectively with people are gaining prominence, and the use of advanced human-robot dialogue interfaces is continually expanding. In this context, emotion recognition plays a crucial role in enhancing human-robot interaction by enabling robots to understand human intentions. This research proposes a facial emotion detection interface integrated into a mobile humanoid robot, capable of displaying real-time emotions from multiple individuals on a user interface. To this end, various deep neural network models for facial expression recognition were developed and evaluated under consistent computer-based conditions, yielding promising results. Afterwards, a trade-off between accuracy and memory footprint was carefully considered to effectively implement this application on a mobile humanoid robot.

Paperid: 2449, https://arxiv.org/pdf/2501.06981.pdf

Abstract:
The global AI surge demands crowdworkers from diverse languages and cultures. They are pivotal in labeling data for enabling global AI systems. Despite global significance, research has primarily focused on understanding the perspectives and experiences of US and India crowdworkers, leaving a notable gap. To bridge this, we conducted a survey with 100 crowdworkers across 16 Latin American and Caribbean countries. We discovered that these workers exhibited pride and respect for their digital labor, with strong support and admiration from their families. Notably, crowd work was also seen as a stepping stone to financial and professional independence. Surprisingly, despite wanting more connection, these workers also felt isolated from peers and doubtful of others' labor quality. They resisted collaboration and gender-based tools, valuing gender-neutrality. Our work advances HCI understanding of Latin American and Caribbean crowdwork, offering insights for digital resistance tools for the region.

Paperid: 2450, https://arxiv.org/pdf/2501.06899.pdf

Abstract:
Stroke rehabilitation continues to face challenges in accessibility and patient engagement, where traditional approaches often fall short. Virtual reality (VR)-based telerehabilitation offers a promising avenue, by enabling home-based recovery through immersive environments and gamification. This systematic review evaluates current VR solutions for upper-limb post-stroke recovery, focusing on design principles, safety measures, patient-therapist communication, and strategies to promote motivation and adherence. Following PRISMA 2020 guidelines, a comprehensive search was conducted across PubMed, IEEE Xplore, and ScienceDirect. The review reveals a scarcity of studies meeting the inclusion criteria, possibly reflecting the challenges inherent in the current paradigm of VR telerehabilitation systems. Although these systems have potential to enhance accessibility and patient autonomy, they often lack standardized safety protocols and reliable real-time monitoring. Human-centered design principles are evident in some solutions, but inconsistent patient involvement during the development process limits their usability and clinical relevance. Furthermore, communication between patients and therapists remains constrained by technological barriers, although advancements in real-time feedback and adaptive systems offer promising solutions. This review underscores the potential of VR telerehabilitation to address critical needs in upper-limb stroke recovery while highlighting the importance of addressing existing limitations to ensure broader clinical implementation and improved patient outcomes.

Paperid: 2451, https://arxiv.org/pdf/2501.06867.pdf

Abstract:
The fundamental role of personality in shaping interactions is increasingly being exploited in robotics. A carefully designed robotic personality has been shown to improve several key aspects of Human-Robot Interaction (HRI). However, the fragmentation and rigidity of existing approaches reveal even greater challenges when applied to non-humanoid robots. On one hand, the state of the art is very dispersed; on the other hand, Industry 4.0 is moving towards a future where humans and industrial robots are going to coexist. In this context, the proper design of a robotic personality can lead to more successful interactions. This research takes a first step in that direction by integrating a comprehensive cognitive architecture built upon the definition of robotic personality - validated on humanoid robots - into a robotic Kinova Jaco2 arm. The robot personality is defined through the cognitive architecture as a vector in the three-dimensional space encompassing Conscientiousness, Extroversion, and Agreeableness, affecting how actions are executed, the action selection process, and the internal reaction to environmental stimuli. Our main objective is to determine whether users perceive distinct personalities in the robot, regardless of its shape, and to understand the role language plays in shaping these perceptions. To achieve this, we conducted a user study comprising 144 sessions of a collaborative game between a Kinova Jaco2 arm and participants, where the robot's behavior was influenced by its assigned personality. Furthermore, we compared two conditions: in the first, the robot communicated solely through gestures and action choices, while in the second, it also utilized verbal interaction.

Paperid: 2452, https://arxiv.org/pdf/2501.05674.pdf

Abstract:
This study examines the intersection of academic pressure and sleep within Taiwanese families, revealing how cultural norms and expectations shape sleep practices. Through interviews and two-week diaries from eleven families, we found that academic demands significantly influence children's sleep patterns, leading to reduced sleep duration and varied sleep schedules. Our research highlights the importance of integrating care and attuning into the design of sleep-tracking technologies, advocating for a family informatics approach that considers both health needs and social expectations. By exploring these dynamics, we contribute to a broader understanding of family contexts in diverse cultural settings and offer insights for more inclusive technology design.

Paperid: 2453, https://arxiv.org/pdf/2501.05664.pdf

Abstract:
Fabric has been a fundamental part of human life for thousands of years, providing comfort, protection, and aesthetic expression. While modern advancements have enhanced fabric's functionality, it remains static and unchangeable, failing to adapt to our evolving body shapes and preferences. This lack of adaptability can lead to unsustainable practices, as consumers often buy more items to meet their changing needs. In this paper, we propose ExoFabric, a re-moldable fabric system for customized soft goods applications. We created ExoFabric by embedding thermoplastic threads into fabric through computerized embroidery to allow for tunability between rigid plastic and conformable fabric. We defined a library of design primitives to enable geometric formability, stiffness, and stretchability by identifying suitable fabrics, threads, embroidery parameters, and machine limitations. To facilitate practical applications, we demonstrated practical methods for linking parameters to application requirements, showcasing form-fitting wearables, structural support, and shape-changeable furniture for repeatable or one-time customization.

Paperid: 2454, https://arxiv.org/pdf/2501.04860.pdf

Abstract:
As interest in studying in-the-wild human-robot interaction grows, there is a need for methods to collect data over time and in naturalistic or potentially private environments. HRI researchers have increasingly used the diary method for these studies, asking study participants to self-administer a structured data collection instrument, i.e., a diary, over a period of time. Although the diary method offers a unique window into settings that researchers may not have access to, it also lacks the interactivity and probing that interview-based methods offer. In this paper, we explore a novel data collection method in which a robot plays the role of an interactive diary. We developed the Diary Robot system and performed in-home deployments for a week to evaluate the feasibility and effectiveness of this approach. Using traditional text-based and audio-based diaries as benchmarks, we found that robots are able to effectively elicit the intended information. We reflect on our findings and describe scenarios where the use of robots as a data collection instrument in diary studies may be especially applicable.

Paperid: 2455, https://arxiv.org/pdf/2501.04755.pdf

Abstract:
The rapid development of artificial intelligence and robotics has had a significant impact on our lives, with intelligent systems increasingly performing tasks traditionally performed by humans. Efficient knowledge transfer requires matching the mental model of the human teacher with the capabilities of the robot learner. This paper introduces the Mental Model Mismatch (MMM) Score, a feedback mechanism designed to quantify and reduce mismatches by aligning human teaching behavior with robot learning behavior. Using Large Language Models (LLMs), we analyze teacher intentions in natural language to generate adaptive feedback. A study with 150 participants teaching a virtual robot to solve a puzzle game shows that intention-based feedback significantly outperforms traditional performance-based feedback or no feedback. The results suggest that intention-based feedback improves instructional outcomes, improves understanding of the robot's learning process and reduces misconceptions. This research addresses a critical gap in human-robot interaction (HRI) by providing a method to quantify and mitigate discrepancies between human mental models and robot capabilities, with the goal of improving robot learning and human teaching effectiveness.

Paperid: 2456, https://arxiv.org/pdf/2501.04359.pdf

Abstract:
Decoding speech from non-invasive brain signals, such as electroencephalography (EEG), has the potential to advance brain-computer interfaces (BCIs), with applications in silent communication and assistive technologies for individuals with speech impairments. However, EEG-based speech decoding faces major challenges, such as noisy data, limited datasets, and poor performance on complex tasks like speech perception. This study attempts to address these challenges by employing variational autoencoders (VAEs) for EEG data augmentation to improve data quality and applying a state-of-the-art (SOTA) sequence-to-sequence deep learning architecture, originally successful in electromyography (EMG) tasks, to EEG-based speech decoding. Additionally, we adapt this architecture for word classification tasks. Using the Brennan dataset, which contains EEG recordings of subjects listening to narrated speech, we preprocess the data and evaluate both classification and sequence-to-sequence models for EEG-to-words/sentences tasks. Our experiments show that VAEs have the potential to reconstruct artificial EEG data for augmentation. Meanwhile, our sequence-to-sequence model achieves more promising performance in generating sentences compared to our classification model, though both remain challenging tasks. These findings lay the groundwork for future research on EEG speech perception decoding, with possible extensions to speech production tasks such as silent or imagined speech.

Paperid: 2457, https://arxiv.org/pdf/2501.03371.pdf

Abstract:
In Asia, many individuals with disabilities rely on wheelchairs for mobility. However, some people, such as those who are fully disabled or paralyzed, cannot use traditional wheelchairs despite having fully functioning cognitive abilities. To address this issue, we propose the development of an electric wheelchair that can be controlled using EEG signals and eye blinks. The project utilizes a MindWave Mobile device and Arduino to enable seamless control. Additionally, various sensors are incorporated to enhance the system's reliability. An ultrasonic sensor helps avoid unexpected collisions, while a smoke sensor detects hazardous smoke levels, triggering an automatic alert via a short message to a designated person. Similarly, if the passenger falls from the wheelchair, a notification will also be sent. The wheelchair's movement is controlled via an Android application, with eye-blink detection serving as the primary input method for navigation. This innovative design offers a cost-effective solution, making it accessible for widespread use. By integrating these advanced features, the system can be implemented on motorized wheelchairs to better support individuals with disabilities and enhance their independence.

Paperid: 2458, https://arxiv.org/pdf/2501.02456.pdf

Abstract:
The ACM CHI Conference has a tradition of citing its intellectual heritage. At the same time, we know CHI is highly diverse and evolving. In this highly dynamic context, it is not clear how the CHI community continues to appreciate its milestones (within and outside of CHI). We present an investigation into how the community's citations to milestones have evolved over 43 years of CHI Proceedings (1981-2024). Forgetting curves plotted for each year suggest that milestones are slowly fading from the CHI community's collective memory. However, the picture is more nuanced when we trace citations to the top-cited milestones over time. We identify three distinct types of milestones cited at CHI, a typology of milestone contributions, and define the Milestone Coefficient as a metric to assess the impact of milestone papers on a continuous scale. Further, we provide empirical evidence of a Matthew effect at CHI. We discuss the broader ramifications for the CHI community and the field of HCI.

Paperid: 2459, https://arxiv.org/pdf/2501.01897.pdf

Abstract:
Why do users follow moral advice from chatbots? A chatbot is not an authoritative moral advisor, but it can generate seemingly plausible arguments. In an experiment, though, we find that users do not follow reasoned advice more readily than unreasoned advice. This is also true when we attribute the advice to a moral advisor rather than a chatbot. Hence, it seems that advice offers users a cheap way to escape from a moral dilemma. This is a concern that chatbots do not raise, but they exacerbate it as they make advice easily accessible. We conclude that it takes ethical in addition to digital literacy to harness users against moral advice from chatbots.

Paperid: 2460, https://arxiv.org/pdf/2501.01545.pdf

Abstract:
Social annotation platforms enable student engagement by integrating discussions directly into course materials. However, in large online courses, the sheer volume of comments can overwhelm students and impede learning. This paper investigates community-based design interventions on a social annotation platform (NB) to address this challenge and foster more meaningful online educational discussions. By examining student preferences and reactions to different curation strategies, this research aims to optimize the utility of social annotations in educational contexts. A key emphasis is placed on how the visibility of comments shapes group interactions, guides conversational flows, and enriches learning experiences. The study combined iterative design and development with two large-scale experiments to create and refine comment curation strategies, involving thousands of students. The study introduced specific features of the platform, such as targeted comment visibility controls, which demonstrably improved peer interactions and reduced discussion overload. These findings inform the design of next-generation social annotation systems and highlight opportunities to integrate Large Language Models (LLMs) for key activities like summarizing annotations, improving clarity in student writing, and assisting instructors with efficient comment curation.

Paperid: 2461, https://arxiv.org/pdf/2501.01404.pdf

Abstract:
For blind and low-vision (BLV) individuals, digital math communication is uniquely difficult due to the lack of accessible tools. Currently, the state of the art is either code-based, like LaTeX, or WYSIWYG, like visual editors. However, both paradigms view math communication as primarily a visual typesetting problem, and may be accessible but difficult to use. In this paper, we present an equation editor that is built from the ground up with BLV accessibility in mind. Specifically, we notice that two of the biggest barriers with current technology are the high cognitive load and the lack of spatial relationships. Thus, we build an editor that uses spatial audio cues, muscle memory, tones, and more intuitive navigation to properly contextualize math equations. We discuss how this new paradigm can enable new levels of math communication, engagement, and literacy. Finally, we discuss natural next steps.

Paperid: 2462, https://arxiv.org/pdf/2501.00825.pdf

Abstract:
Studies have indicated that personality is related to achievement, and several personality assessment models have been developed. However, most are either questionnaires or based on marker systems, which entails limitations. We proposed a physiological signal based model, thereby ensuring the objectivity of the data and preventing unreliable responses. Thirty participants were recruited from the Department of Electrical Engineering of Yuan Ze University in Taiwan. Wearable sensors were used to collect physiological signals as the participants watched and summarized a video. They then completed a personality questionnaire based on the Big Five factor markers system. The results were used to construct a personality prediction model, which revealed that galvanic skin response and heart rate variance were key factors predicting extroversion; heart rate variance also predicted agreeableness and conscientiousness. The results of this experiment can elucidate students' personality traits, which can help educators select the appropriate pedagogical methods.

Paperid: 2463, https://arxiv.org/pdf/2501.00449.pdf

Abstract:
Past research shows that personality traits are a strong predictor of one's academic performance. Today, mature and verified marker systems for assessing personality traits already exist. However, assessment methods based on marker systems have their own limitations. For example, dishonest responses cannot be avoided. In this research, the goal is to develop a method that can overcome these limitations. The proposed method relies on physiological signals for the assessment. Thirty participants took part in the experiment. Based on the statistical results, we found correlations between students' personality traits and changes in their physiological signals when learning via videos. Specifically, we found that participants' degrees of extraversion, agreeableness, conscientiousness, and openness to experience are correlated with the variance of heart rates, the variance of GSR values, and the skewness of voice frequencies, among other features.

Paperid: 2464, https://arxiv.org/pdf/2501.00383.pdf

Abstract:
One of the long-standing aspirations in conversational AI is to allow AI systems to autonomously take the initiative in conversations, i.e., to be proactive. This is especially challenging for multi-party conversations. Prior NLP research focused mainly on predicting the next speaker from contexts like preceding conversations. In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that just like humans, rather than merely reacting to turn-taking cues, a proactive AI formulates its own inner thoughts during a conversation, and seeks the right moment to contribute. Through a formative study with 24 participants and inspiration from linguistics and cognitive psychology, we introduce the Inner Thoughts framework. Our framework equips AI with a continuous, covert train of thoughts in parallel to the overt communication process, which enables it to proactively engage by modeling its intrinsic motivation to express these thoughts. We instantiated this framework into two real-time systems: an AI playground web app and a chatbot. In a technical evaluation and user studies with human participants, our framework significantly surpasses existing baselines on aspects like anthropomorphism, coherence, intelligence, and turn-taking appropriateness.

Paperid: 2465, https://arxiv.org/pdf/2506.24046.pdf

Abstract:
New endoscopists require a large volume of expert-proctored colonoscopies to attain minimal competency. Developing multi-fingered, synchronized control of a colonoscope requires significant time and exposure to the device. Current training methods inhibit this development by relying on tool hand-off for expert demonstrations. There is a need for colonoscopy training tools that enable in-hand expert guidance in real-time. We present a new concept of a tandem training system that uses a telemanipulated preceptor colonoscope to guide novice users as they perform a colonoscopy. This system is capable of dual-control and can automatically toggle between expert and novice control of a standard colonoscope's angulation control wheels. Preliminary results from a user study with novice and expert users show the effectiveness of this device as a skill acquisition tool. We believe that this device has the potential to accelerate skill acquisition for colonoscopy and, in the future, enable individualized instruction and responsive teaching through bidirectional actuation.

Paperid: 2466, https://arxiv.org/pdf/2506.23952.pdf

Abstract:
AI systems increasingly support human decision-making across domains of professional, skill-based, and personal activity. While previous work has examined how AI might affect human autonomy globally, the effects of AI on domain-specific autonomy -- the capacity for self-governed action within defined realms of skill or expertise -- remain understudied. We analyze how AI decision-support systems affect two key components of domain-specific autonomy: skilled competence (the ability to make informed judgments within one's domain) and authentic value-formation (the capacity to form genuine domain-relevant values and preferences). By engaging with prior investigations and analyzing empirical cases across medical, financial, and educational domains, we demonstrate how the absence of reliable failure indicators and the potential for unconscious value shifts can erode domain-specific autonomy both immediately and over time. We then develop a constructive framework for autonomy-preserving AI support systems. We propose specific socio-technical design patterns -- including careful role specification, implementation of defeater mechanisms, and support for reflective practice -- that can help maintain domain-specific autonomy while leveraging AI capabilities. This framework provides concrete guidance for developing AI systems that enhance rather than diminish human agency within specialized domains of action.

Paperid: 2467, https://arxiv.org/pdf/2506.23851.pdf

Abstract:
The integration of cloud computing in education can revolutionise learning in advanced (Australia & South Korea) and middle-income (Ghana & Nigeria) countries, while offering scalable, cost-effective and equitable access to adaptive learning systems. This paper explores how cloud computing and adaptive learning technologies are deployed across different socio-economic and infrastructure contexts. The study identifies enabling factors and systematic challenges, providing insights into how cloud-based education can be tailored to bridge the digital and educational divide globally.

Paperid: 2468, https://arxiv.org/pdf/2506.23458.pdf

Abstract:
Portable and wearable consumer-grade electroencephalography (EEG) devices, like Muse headbands, offer unprecedented mobility for daily brain-computer interface (BCI) applications, including cognitive load detection. However, the exacerbated non-stationarity in portable EEG signals constrains data fidelity and decoding accuracy, creating a fundamental trade-off between portability and performance. To mitigate this limitation, we propose MuseCogNet (Muse-based Cognitive Network), a unified joint learning framework integrating self-supervised and supervised training paradigms. In particular, we introduce an EEG-grounded self-supervised reconstruction loss based on average pooling to capture robust neurophysiological patterns, while cross-entropy loss refines task-specific cognitive discriminants. This joint learning framework resembles the bottom-up and top-down attention in humans, enabling MuseCogNet to significantly outperform state-of-the-art methods on a publicly available Muse dataset and establish an implementable pathway for neurocognitive monitoring in ecological settings.
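
The joint objective described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the reconstruction target is the average-pooled input signal and that the two losses are combined as a weighted sum; the abstract does not specify either detail, and the function names, `window`, and `weight` are hypothetical rather than MuseCogNet's actual implementation.

```python
import math

def average_pool(signal, window):
    """Non-overlapping average pooling over a 1-D signal."""
    return [sum(signal[i:i + window]) / window
            for i in range(0, len(signal) - window + 1, window)]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (log-sum-exp for stability)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def joint_loss(reconstruction, signal, logits, label, window=4, weight=1.0):
    """Supervised cross-entropy plus a weighted reconstruction term.

    Assumption: the reconstruction is compared against the average-pooled
    input; `window` and `weight` are illustrative hyperparameters.
    """
    target = average_pool(signal, window)
    return cross_entropy(logits, label) + weight * mse(reconstruction, target)

# Toy example: a perfect reconstruction leaves only the classification term.
signal = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
loss = joint_loss([1.0, 3.0], signal, logits=[0.0, 0.0], label=0)
print(round(loss, 4))
```

In this toy case the reconstruction term is zero, so the total equals the cross-entropy of a uniform two-class prediction.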

Paperid: 2469, https://arxiv.org/pdf/2506.23017.pdf

Abstract:
This paper addresses the critical issue of deceptive design elements prevalent in technology, and their potential impact on children. Recent research highlights the impact of dark patterns on adults and adolescents, while studies involving children are scarce. In an era where children wield greater independence with digital devices, their vulnerability to dark patterns amplifies without early education. Our findings show a significant positive impact of dark pattern education on children's awareness, revealing that heightened awareness considerably alters children's navigation of social media, video games, and streaming platforms. To this end, we developed a gamified application aimed at instructing children on identifying and responding to various dark patterns. Our evaluation results emphasize the critical role of early education in empowering children to recognize and counter deceptive design, thereby cultivating a digitally literate generation capable of making informed choices in the complex landscape of digital technology.

Paperid: 2470, https://arxiv.org/pdf/2506.23016.pdf

Abstract:
The global prevalence of dementia is projected to double by 2050, highlighting the urgent need for scalable diagnostic tools. This study utilizes digital cognitive tasks with eye-tracking data correlated with memory processes to distinguish between Healthy Controls (HC) and Mild Cognitive Impairment (MCI), a precursor to dementia. A deep learning model based on VTNet was trained using eye-tracking data from 44 participants (24 MCI, 20 HCs) who performed a visual memory task. The model utilizes both time series and spatial data derived from eye-tracking. It was modified to incorporate scan paths, heat maps, and image content. These modifications also enabled testing parameters such as image resolution and task performance, analyzing their impact on model performance. The best model, utilizing $700\times700$ px resolution heatmaps, achieved 68% sensitivity and 76% specificity. Despite operating under more challenging conditions (e.g., smaller dataset size, shorter task duration, or a less standardized task), the model's performance is comparable to an Alzheimer's study using similar methods (70% sensitivity and 73% specificity). These findings contribute to the development of automated diagnostic tools for MCI. Future work should focus on refining the model and using a standardized long-term visual memory task.
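
For readers unfamiliar with the two reported metrics, the sketch below shows the standard definitions of sensitivity and specificity from confusion-matrix counts. The counts used are hypothetical and chosen only so that the arithmetic is easy to follow; they are not taken from the study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of positive (e.g. MCI) cases detected
    specificity = tn / (tn + fp)  # share of negative (e.g. HC) cases cleared
    return sensitivity, specificity

# Hypothetical counts for illustration only; not the paper's data.
sens, spec = sensitivity_specificity(tp=17, fn=8, tn=19, fp=6)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```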

Paperid: 2471, https://arxiv.org/pdf/2506.22940.pdf

Abstract:
This paper investigates how collaborative AI systems can enhance user agency in identifying and evaluating misinformation on social media platforms. Traditional methods, such as personal judgment or basic fact-checking, often fall short when faced with emotionally charged or context-deficient content. To address this, we designed and evaluated an interactive interface that integrates collaborative AI features, including real-time explanations, source aggregation, and debate-style interaction. These elements aim to support critical thinking by providing contextual cues and argumentative reasoning in a transparent, user-centered format. In a user study with 14 participants, 79% found the debate mode more effective than standard chatbot interfaces, and the multiple-source view received an average usefulness rating of 4.6 out of 5. Our findings highlight the potential of context-rich, dialogic AI systems to improve media literacy and foster trust in digital information environments. We argue that future tools for misinformation mitigation should prioritize ethical design, explainability, and interactive engagement to empower users in a post-truth era.

Paperid: 2472, https://arxiv.org/pdf/2506.22932.pdf

Abstract:
The growing share of elderly people in modern societies calls for the use of emerging technologies as a means of supporting older members of society. Within this scope, Extended Reality (XR) technologies are a promising means of improving the daily lives of the elderly population. This paper presents a literature review that describes the most common characteristics of the physical and mental state of the elderly, allowing readers, and specifically XR developers, to understand the main difficulties faced by elderly users of extended reality applications so they can develop accessible, user-friendly and engaging applications for the target audience. Furthermore, a review of existing extended reality applications that target the elderly population is presented, allowing readers to get acquainted with existing design paradigms that can inspire future developments.

Paperid: 2473, https://arxiv.org/pdf/2506.22841.pdf

Abstract:
Adjusting transparency is a common method of mitigating occlusion, but it often hinders understanding of the relative depth relationships between objects and removes potentially important information from the occluding object. We propose using dichoptic opacity, a novel method for occlusion management that contrasts the transparency of occluders presented to each eye. This allows for better simultaneous understanding of both occluder and occluded. A user study highlights the technique's potential, showing strong user engagement and a clear preference for dichoptic opacity over traditional presentations. While the study does not determine optimal transparency values, it reveals promising trends in both percentage and range that merit further investigation.

Paperid: 2474, https://arxiv.org/pdf/2506.22520.pdf

Abstract:
This study examines the impact of an Artificial Intelligence tutor teammate (AI) on student curiosity-driven engagement and learning effectiveness during Interactive Molecular Dynamics (IMD) tasks on the Visual Molecular Dynamics platform. It explores the role of the AI's curiosity-triggering and response behaviors in stimulating and sustaining student curiosity, affecting the frequency and complexity of student-initiated questions. The study further assesses how AI interventions shape student engagement, foster discovery curiosity, and enhance team performance within the IMD learning environment. Using a Wizard-of-Oz paradigm, a human experimenter dynamically adjusts the AI tutor teammate's behavior through a large language model. In a mixed-methods exploratory design, a total of 11 high school students participated in four IMD tasks that involved molecular visualization and calculations, which increased in complexity over a 60-minute period. Team performance was evaluated through real-time observation and recordings, whereas team communication was measured by question complexity and AI's curiosity-triggering and response behaviors. Cross Recurrence Quantification Analysis (CRQA) metrics reflected structural alignment in coordination and were linked to communication behaviors. High-performing teams exhibited superior task completion, deeper understanding, and increased engagement. Advanced questions were associated with AI curiosity-triggering, indicating heightened engagement and cognitive complexity. CRQA metrics highlighted dynamic synchronization in student-AI interactions, emphasizing structured yet adaptive engagement to promote curiosity. These proof-of-concept findings suggest that the AI, in its dual role as teammate and educator, can provide adaptive feedback that sustains engagement and epistemic curiosity.
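
As background on CRQA, the following minimal sketch computes a cross-recurrence matrix and its recurrence rate for two one-dimensional series. It is a simplified illustration: it omits the time-delay embedding that full CRQA normally uses, the `radius` threshold and the toy series are hypothetical, and the abstract does not state which CRQA metrics the study actually computed.

```python
def cross_recurrence_matrix(x, y, radius):
    """Entry [i][j] is 1 when x[i] and y[j] are within `radius` of each other."""
    return [[1 if abs(xi - yj) <= radius else 0 for yj in y] for xi in x]

def recurrence_rate(matrix):
    """Fraction of recurrent points in the cross-recurrence matrix."""
    total = sum(len(row) for row in matrix)
    return sum(sum(row) for row in matrix) / total

# Toy series: two identical signals recur only along the main diagonal.
student = [0.0, 1.0, 2.0]
tutor = [0.0, 1.0, 2.0]
crm = cross_recurrence_matrix(student, tutor, radius=0.5)
rate = recurrence_rate(crm)
print(crm, round(rate, 3))
```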

Paperid: 2475, https://arxiv.org/pdf/2506.22476.pdf

Abstract:
Objective skill assessment in high-stakes procedural environments requires models that not only decode underlying cognitive and motor processes but also generalize across tasks, individuals, and experimental contexts. While prior work has demonstrated the potential of functional near-infrared spectroscopy (fNIRS) for evaluating cognitive-motor performance, existing approaches are often task-specific, rely on extensive preprocessing, and lack robustness to new procedures or conditions. Here, we introduce an interpretable transformer-based foundation model trained on minimally processed fNIRS signals for cross-procedural skill assessment. Pretrained using self-supervised learning on data from laparoscopic surgical tasks and endotracheal intubation (ETI), the model achieves greater than 88% classification accuracy on all tasks, with Matthews Correlation Coefficient exceeding 0.91 on ETI. It generalizes to a novel emergency airway procedure--cricothyrotomy--using fewer than 30 labeled samples and a lightweight (less than 2k parameter) adapter module, attaining an AUC greater than 87%. Interpretability is achieved via a novel channel attention mechanism--developed specifically for fNIRS--that identifies functionally coherent prefrontal sub-networks validated through ablation studies. Temporal attention patterns align with task-critical phases and capture stress-induced changes in neural variability, offering insight into dynamic cognitive states.
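
The Matthews Correlation Coefficient reported above is a standard summary of a binary confusion matrix. The sketch below shows the usual formula; the counts in the example are hypothetical and are not taken from the paper.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Hypothetical counts: 90% accuracy corresponds to an MCC of 0.8 here.
mcc = matthews_corrcoef(tp=45, tn=45, fp=5, fn=5)
print(round(mcc, 3))
```

Unlike accuracy, the coefficient ranges from -1 to 1 and stays near 0 for a classifier that ignores its input, which makes it informative on imbalanced data.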

Paperid: 2476, https://arxiv.org/pdf/2506.22379.pdf

Abstract:
Online and AI-based symptom checkers are applications that assist medical laypeople in diagnosing their symptoms and determining which course of action to take. When evaluating these tools, previous studies primarily used an approach introduced a decade ago that lacked any type of quality control. Numerous studies have criticized this approach, and several empirical studies have sought to improve specific aspects of evaluations. However, even after a decade, a high-quality methodological framework for standardizing the evaluation of symptom checkers remains missing. This article synthesizes empirical studies to outline a framework for standardized evaluations based on representative case selection, an externally and internally valid evaluation design, and metrics that increase cross-study comparability. This approach is backed up by several open-access resources to facilitate implementation. Ultimately, this approach should enhance the quality and comparability of future evaluations of online and AI-based symptom checkers to enable meta-analyses and help stakeholders make more informed decisions.

Paperid: 2477, https://arxiv.org/pdf/2506.21201.pdf

Abstract:
The consumption of subtitles via TVs, laptops and smartphones has the potential to marginalize people based on their complex accessibility needs. The current one-size-fits-all approach to this accessibility aid is no longer fit for purpose and work is required to look at how it can be adapted to be personalised for individual users based on individual context, content, and consumption habits. People with Aphasia, for example, encounter significant challenges in understanding subtitle texts. We see our work as a call to action for more inclusive practices, focusing on how the thoughts and opinions of people with aphasia can be included in media research. Our work investigates how to develop future media solutions for people with aphasia to create a more inclusive media viewing environment. We believe the key to this is appropriate prototyping tools and methods to allow equitable inclusion in the system design process.

Paperid: 2478, https://arxiv.org/pdf/2506.20463.pdf

Abstract:
Educators and learners worldwide are embracing the rise of Generative Artificial Intelligence (GenAI) as it reshapes higher education. However, GenAI also raises significant privacy and security concerns, as models and privacy-sensitive user data, such as student records, may be misused by service providers. Unfortunately, end-users often have little awareness of or control over how these models operate. To address these concerns, universities are developing institutional policies to guide GenAI use while safeguarding security and privacy. This work examines these emerging policies and guidelines, with a particular focus on the often-overlooked privacy and security dimensions of GenAI integration in higher education, alongside other academic values. Through a qualitative analysis of GenAI usage guidelines from universities across 12 countries, we identify key challenges and opportunities institutions face in providing effective privacy and security protections, including the need for GenAI safeguards tailored specifically to the academic context.

Paperid: 2479, https://arxiv.org/pdf/2506.20156.pdf

Abstract:
The core challenge in learning has shifted from knowledge acquisition to effective Self-Regulated Learning (SRL): planning, monitoring, and reflecting on one's learning. Existing digital tools, however, inadequately support metacognitive reflection. Spaced Repetition Systems (SRS) use de-contextualized review, overlooking the role of context, while Personal Knowledge Management (PKM) tools require high manual maintenance. To address these challenges, this paper introduces "Insight Recall," a novel paradigm that conceptualizes the context-triggered retrieval of personal past insights as a metacognitive scaffold to promote SRL. We formalize this paradigm using the Just-in-Time Adaptive Intervention (JITAI) framework and implement a prototype system, Irec, to demonstrate its feasibility. At its core, Irec uses a dynamic knowledge graph of the user's learning history. When a user faces a new problem, a hybrid retrieval engine recalls relevant personal "insights." Subsequently, a large language model (LLM) performs a deep similarity assessment to filter and present the most relevant scaffold in a just-in-time manner. To reduce cognitive load, Irec features a human-in-the-loop pipeline for LLM-based knowledge graph construction. We also propose an optional "Guided Inquiry" module, where users can engage in a Socratic dialogue with an expert LLM, using the current problem and recalled insights as context. The contribution of this paper is a solid theoretical framework and a usable system platform for designing next-generation intelligent learning systems that enhance metacognition and self-regulation.

Paperid: 2480, https://arxiv.org/pdf/2506.19644.pdf

Abstract:
Diversity in image generation is essential to ensure fair representations and support creativity in ideation. Hence, many text-to-image models have implemented diversification mechanisms. Yet, after a few iterations of generation, a lack of diversity becomes apparent, because each user has their own diversity goals (e.g., different colors, brands of cars), and there are diverse attributions to be specified. To support user-driven diversity control, we propose Varif.ai that employs text-to-image and Large Language Models to iteratively i) (re)generate a set of images, ii) verify if user-specified attributes have sufficient coverage, and iii) vary existing or new attributes. Through an elicitation study, we uncovered user needs for diversity in image generation. A pilot validation showed that Varif.ai made achieving diverse image sets easier. In a controlled evaluation with 20 participants, Varif.ai proved more effective than baseline methods across various scenarios. Thus, this supports user control of diversity in image generation for creative ideation and scalable image generation.

Paperid: 2481, https://arxiv.org/pdf/2506.19495.pdf

Abstract:
Empathy is widely recognized as a vital attribute for effective collaboration and communication in the workplace, yet developing empathic skills and fostering empathy among colleagues remains a challenge. This study explores the potential of a collaborative digital storytelling platform - In Your Shoes - designed to promote empathic listening and interpersonal understanding through the structured exchange of personal narratives. A one-week intervention was conducted with employees from multiple organizations using the platform. Employing a mixed methods approach, we assessed quantitative changes in empathy using the Empathy Quotient (EQ) and qualitatively analyzed participant experiences through grounded theory. While quantitative analysis revealed no statistically significant shift in dispositional empathy, qualitative findings suggested the tool facilitated situational empathy, prompted self-reflection, improved emotional resonance, and enhanced workplace relationships. Participants reported feelings of psychological safety, connection, and, in some cases, therapeutic benefits from sharing and responding to stories. These results highlight the promise of asynchronous, structured narrative-based digital tools for supporting empathic engagement in professional settings, offering insights for the design of emotionally intelligent workplace technologies.

Paperid: 2482, https://arxiv.org/pdf/2506.19415.pdf

Abstract:
3D Gaussian Splatting represents a breakthrough in the field of novel view synthesis. It establishes Gaussians as core rendering primitives for highly accurate real-world environment reconstruction. Recent advances have drastically increased the size of scenes that can be created. In this work, we present a method for rendering large and complex 3D Gaussian Splatting scenes using virtual memory. By leveraging well-established virtual memory and virtual texturing techniques, our approach efficiently identifies visible Gaussians and dynamically streams them to the GPU just in time for real-time rendering. Selecting only the necessary Gaussians for both storage and rendering results in reduced memory usage and effectively accelerates rendering, especially for highly complex scenes. Furthermore, we demonstrate how level of detail can be integrated into our proposed method to further enhance rendering speed for large-scale scenes. With an optimized implementation, we highlight key practical considerations and thoroughly evaluate the proposed technique and its impact on desktop and mobile devices.

Paperid: 2483, https://arxiv.org/pdf/2506.19364.pdf

Abstract:
The integration of Generative AI (GenAI) into education has raised concerns about over-reliance and superficial learning, particularly in writing tasks in higher education. This study explores whether a theory-driven learning analytics dashboard (LAD) can enhance human-AI collaboration in the academic writing task by improving writing knowledge gains, fostering self-regulated learning (SRL) skills and building different human-AI dialogue characteristics. Grounded in Zimmerman's SRL framework, the LAD provided real-time feedback on learners' goal-setting, writing processes and reflection, while monitoring the quality of learner-AI interactions. A quasi-experiment was conducted involving 52 postgraduate students divided into an experimental group (EG) using the LAD and a control group (CG) without it in a human-AI collaborative writing task. Pre- and post-knowledge tests, questionnaires measuring SRL and cognitive load, and students' dialogue data with GenAI were collected and analyzed. Results showed that the EG achieved significantly higher writing knowledge gains and improved SRL skills, particularly in self-efficacy and cognitive strategies. However, the EG also reported increased test anxiety and cognitive load, possibly due to heightened metacognitive awareness. Epistemic Network Analysis revealed that the EG engaged in more reflective, evaluative interactions with GenAI, while the CG focused on more transactional and information-seeking exchanges. These findings contribute to the growing body of literature on the educational use of GenAI and highlight the importance of designing interventions that complement GenAI tools, ensuring that technology enhances rather than undermines the learning process.

Paperid: 2484, https://arxiv.org/pdf/2506.19280.pdf

Abstract:
Human-Computer Interaction (HCI) has evolved significantly to incorporate emotion recognition capabilities, creating unprecedented opportunities for adaptive and personalized user experiences. This paper explores the integration of emotion detection into calendar applications, enabling user interfaces to dynamically respond to users' emotional states and stress levels, thereby enhancing both productivity and engagement. We present and evaluate two complementary approaches to emotion detection: a biometric-based method utilizing heart rate (HR) data extracted from electrocardiogram (ECG) signals processed through Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) neural networks to predict the emotional dimensions of Valence, Arousal, and Dominance; and a behavioral method analyzing computer activity through multiple machine learning models to classify emotions based on fine-grained user interactions such as mouse movements, clicks, and keystroke patterns. Our comparative analysis, from real-world datasets, reveals that while both approaches demonstrate effectiveness, the computer activity-based method delivers superior consistency and accuracy, particularly for mouse-related interactions, which achieved approximately 90% accuracy. Furthermore, GRU networks outperformed LSTM models in the biometric approach, with Valence prediction reaching 84.38% accuracy.

Paperid: 2485, https://arxiv.org/pdf/2506.19017.pdf

Abstract:
The climate is warming rapidly, and atmospheric concentrations of greenhouse gases (GHGs) are at their highest levels ever recorded. As a result of these climate changes, caused mainly by human activities, disasters have increased fivefold over the past 50 years, causing death and economic loss. Civic engagement and awareness are essential to mitigate climate change and its impacts. In this work, we proposed a user interface that makes users aware of the environmental impact of the food products they buy when shopping. A user-centered scenario-based design was followed in the development of the interface. Gamification elements were added to increase civic participation in climate action.

Paperid: 2486, https://arxiv.org/pdf/2506.18941.pdf

Abstract:
Lucrative career prospects and creative opportunities often attract students to enroll in computer science majors and pursue advanced studies in the field. Consequently, there has been a significant surge in enrollment in computer science courses, resulting in large class sizes that can range from hundreds to even thousands of students. A common challenge in such large classrooms is the lack of engagement between students and both the instructor and the learning material. However, with advancements in technology and improvements in large language models (LLMs), there is a considerable opportunity to utilize LLM-based AI models, such as conversational artificial intelligence (CAI), to enhance student engagement with learning content in large classes. To explore the potential of CAI to support engagement, especially with learning content, we designed an activity in a software engineering course (with a large class size) where students used CAI for an in-class activity. We conducted a within-subject investigation in a large classroom at a US university where we compared student engagement during an in-class activity that used a CAI tool vs. one without a CAI tool. The CAI tool we used was ChatGPT due to its widespread popularity and familiarity. Our results indicate that CAI (ChatGPT) has the potential to support engagement with learning content during in-class activities, especially in large class sizes. We further discuss the implications of our findings.

Paperid: 2487, https://arxiv.org/pdf/2506.18648.pdf

Abstract:
The visual style of game elements considerably contributes to the overall experience. Aesthetics influence player appeal, while the abilities of game pieces define their in-game functionality. In this paper, we investigate how the visual style of collectible cards influences the players' perception of the card's actual strength in the game. Using the popular trading card game Magic: The Gathering, we conduct a single-blind survey study that examines how players perceive the strength of AI-generated cards that are shown in two contrasting visual styles: cute and harmless, or heroic and mighty. Our analysis reveals that some participants are influenced by a card's visual appearance when judging its in-game strength. Overall, differences in style perception are normally distributed around a neutral center, but individual participants vary in both directions: some generally perceive the cute style to be stronger, whereas others believe that the heroic style is better.

Paperid: 2488, https://arxiv.org/pdf/2506.18605.pdf

Abstract:
As autonomous vehicles become more integrated into shared human environments, effective communication with road users is essential for ensuring safety. While previous research has focused on developing external Human-Machine Interfaces (eHMIs) to facilitate these interactions, we argue that involving users in the early creative stages can help address key challenges in the development of this technology. To explore this, our study adopts a participatory, crowd-sourced approach to gather user-generated ideas for eHMI designs. Participants were first introduced to fundamental eHMI concepts, equipping them to sketch their own design ideas in response to scenarios with varying levels of perceived risk. An initial pre-study with 29 participants showed that while they actively engaged in the process, there was a need to refine task objectives and encourage deeper reflection. To address these challenges, a follow-up study with 50 participants was conducted. The results revealed a strong preference for autonomous vehicles to communicate their awareness and intentions using lights (LEDs and projections), symbols, and text. Participants' sketches prioritized multi-modal communication, directionality, and adaptability to enhance clarity, consistently integrating familiar vehicle elements to improve intuitiveness.

Paperid: 2489, https://arxiv.org/pdf/2506.18119.pdf

Abstract:
The notion of machine companions has long been embedded in social-technological imaginaries. Recent advances in AI have moved those media musings into believable sociality manifested in interfaces, robotic bodies, and devices. Those machines are often referred to colloquially as "companions" yet there is little careful engagement of machine companionship (MC) as a formal concept or measured variable. This PRISMA-guided scoping review systematically samples, surveys, and synthesizes current scholarly works on MC (N = 71; 2017-2025), to that end. Works varied widely in considerations of MC according to guiding theories, dimensions of a-priori specified properties (subjectively positive, sustained over time, co-active, autotelic), and in measured concepts (with more than 50 distinct measured variables). We ultimately offer a literature-guided definition of MC as an autotelic, coordinated connection between human and machine that unfolds over time and is subjectively positive.

Paperid: 2490, https://arxiv.org/pdf/2506.16851.pdf

Abstract:
Recent studies show that users often interpret social media algorithms as mystical or spiritual because of their unpredictability. This invites new questions about how such perceptions affect the content that creators create and the communities they form online. In this study, 14 creators of algorithmic conspirituality content on TikTok were interviewed to explore their interpretations and creation processes influenced by the platform's For You Page algorithm. We illustrate how creators' beliefs interact with TikTok's algorithmic mediation to reinforce and shape their spiritual or relational themes. Furthermore, we show how algorithmic conspirituality content impacts viewers, highlighting its role in generating significant emotional and affective labor for creators, stemming from complex relational dynamics inherent in this content creation. We discuss implications for design to support creators aimed at recognizing the unexpected spiritual and religious experiences algorithms prompt, as well as supporting creators in effectively managing these challenges.

Paperid: 2491, https://arxiv.org/pdf/2506.16542.pdf

Abstract:
Technical interviews are a critical yet stressful step in the hiring process for computer science graduates, often hindered by limited access to practice opportunities. This formative qualitative study (n=20) explores whether a multimodal AI system can realistically simulate technical interviews and support confidence-building among candidates. Participants engaged with an AI-driven mock interview tool featuring whiteboarding tasks and real-time feedback. Many described the experience as realistic and helpful, noting increased confidence and improved articulation of problem-solving decisions. However, challenges with conversational flow and timing were noted. These findings demonstrate the potential of AI-driven technical interviews as scalable and realistic preparation tools, suggesting that future research could explore variations in interviewer behavior and their potential effects on candidate preparation.

Paperid: 2492, https://arxiv.org/pdf/2506.16310.pdf

Abstract:
While state-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual environments, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions still poses difficulty owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent modelling and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model by integrating a language-specific phoneme alignment hybrid encoder-decoder architecture and culture-sensitive emotion embedding layers trained on native speaker corpora, and by incorporating dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate 23.7% improvement in accent accuracy (Word Error Rate reduction from 15.4% to 11.8%) and 85.3% emotion recognition accuracy from native listeners, surpassing METTS and VECL-TTS baselines. The novelty of the system is that it can mix code in real time - generating statements such as "Namaste, let's talk about " with uninterrupted accent shifts while preserving emotional consistency. Subjective evaluation with 200 users reported a mean opinion score (MOS) of 4.2/5 for cultural correctness, much better than existing multilingual systems (p<0.01). This research makes cross-lingual synthesis more feasible by showcasing scalable accent-emotion disentanglement, with direct application in South Asian EdTech and accessibility software.

Paperid: 2493, https://arxiv.org/pdf/2506.16202.pdf

Abstract:
Explicit labeling of online content produced by artificial intelligence (AI) is a widely mooted policy for ensuring transparency and promoting public confidence. Yet little is known about the scope of AI labeling effects on public assessments of labeled content. We contribute new evidence on this question from a survey experiment using a high-quality nationally representative probability sample (n = 3,861). First, we demonstrate that explicit AI labeling of a news article about a proposed public policy reduces its perceived accuracy. Second, we test whether there are spillover effects in terms of policy interest, policy support, and general concerns about online misinformation. We find that AI labeling reduces interest in the policy, but neither influences support for the policy nor triggers general concerns about online misinformation. We further find that increasing the salience of AI use reduces the negative impact of AI labeling on perceived accuracy, while one-sided versus two-sided framing of the policy has no moderating effect. Overall, our findings suggest that the effects of algorithm aversion induced by AI labeling of online content are limited in scope.

Paperid: 2494, https://arxiv.org/pdf/2506.16199.pdf

Abstract:
Explainable Artificial Intelligence (XAI) plays a critical role in fostering user trust and understanding in AI-driven systems. However, the design of effective XAI interfaces presents significant challenges, particularly for UX professionals who may lack technical expertise in AI or machine learning. Existing explanation methods, such as SHAP, LIME, and counterfactual explanations, often rely on complex technical language and assumptions that are difficult for non-expert users to interpret. To address these gaps, we propose a UX Research (UXR) Playbook for XAI - a practical framework aimed at supporting UX professionals in designing accessible, transparent, and trustworthy AI experiences. Our playbook offers actionable guidance to help bridge the gap between technical explainability methods and user centred design, empowering designers to create AI interactions that foster better understanding, trust, and responsible AI adoption.

Paperid: 2495, https://arxiv.org/pdf/2506.16107.pdf

Abstract:
In 2021 the Technical Infrastructure (TI) User Experience (UX) team sent a survey to 10,000 Google Developers (Googlers) and uncovered that Google's internal infrastructure tools were fragmented and inefficient, hindering developers' productivity. Using user centered research and design methodologies the team first created a story map and service blueprint to visualize the relationship between internal applications, then formulated a strategic vision to consolidate tools, streamline workflows, and measure the impact of their work. We secured executive buy-in and delivered incremental improvements.

Paperid: 2496, https://arxiv.org/pdf/2506.15512.pdf

Abstract:
Large Language Models have radically changed remote learning for students, among other aspects of educational activity. Current retrieval of remote learning resources lacks the contextual depth needed to provide comprehensive information on complex student queries. This work proposes a novel approach to enhancing remote learning retrieval by integrating GPT-based models within the LangChain framework. We make the system more intuitive and productive using chain-of-thought (CoT) reasoning and prompt engineering. The proposed framework emphasizes increasing the precision and relevance of the retrieval results so that it returns comprehensive and contextually enriched explanations and resources that best suit each student's needs. We also assess the effectiveness of our approach against paradigmatic LLMs and report improvements in user satisfaction and learning outcomes.

Paperid: 2497, https://arxiv.org/pdf/2506.15325.pdf

Abstract:
Advancements in Artificial Intelligence (AI) have significantly transformed the financial industry, enabling the development of more personalised and adaptable financial products and services. This research paper explores various instances where Human-Centred AI (HCAI) has facilitated these advancements, drawing from contemporary studies and industry progress. The paper examines how the application of HCAI-powered data analytics, machine learning, and natural language processing enables financial institutions to gain a deeper understanding of their customers' unique needs, preferences, and behavioural patterns. This, in turn, allows for the creation of tailored financial solutions that address individual consumer requirements, ultimately enhancing overall user experience and satisfaction. Additionally, the study highlights the integration of AI-powered robo-advisory services, which offer customised investment recommendations and portfolio management tailored to diverse risk profiles and investment goals. Moreover, the paper underscores the role of AI in strengthening fraud detection, risk assessment, and regulatory compliance, leading to a more secure and adaptable financial landscape. The findings of this research demonstrate the substantial impact of Human-Centred AI on the financial industry, offering a strategic framework for financial institutions to leverage these technologies. By incorporating a User Experience Research (UXR) Point of View (PoV), financial institutions can ensure that AI-driven solutions align with user needs and business objectives.

Paperid: 2498, https://arxiv.org/pdf/2506.15314.pdf

Abstract:
In the dynamic landscape of Cloud financial management, we are sharing a case study exploring the development of a User Experience Research (UXR) Point of View (PoV) to drive FinOps product innovation. We demonstrate how qualitative and quantitative research methods work together to navigate the challenges of understanding customer needs, aligning cross-functional teams, and prioritizing limited resources. Through a multi-phased research approach, the research team identified opportunities, quantified pain points, and segmented diverse customer cohorts. This culminated in a UXR PoV that informed the creation of a differentiated product strategy, a 'one-stop shop' dashboard empowering FinOps practitioners with actionable insights and tools. This case study highlights the power of mixed-methods research in uncovering actionable insights that drive impactful product innovation.

Paperid: 2499, https://arxiv.org/pdf/2506.15294.pdf

Abstract:
This paper discusses a popular UX research activity, feature prioritization, using the User Experience Research Point of View (UXR PoV) Playbook framework. We describe an application of multinomial logistic regression, frequently marketed as MaxDiff, for prioritizing product features in consumer product development. This application addresses challenges of traditional surveying techniques. We propose a solution using MaxDiff to generate a reliable preference list with a reasonable sample size. We also adapt the MaxDiff method to halve the number of survey responses, making it less tedious from the survey takers' perspective. We present a case study using the adapted MaxDiff method for tablet feature prioritization research involving users with disabilities.
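
To illustrate the kind of data a MaxDiff survey produces, the sketch below ranks features with simple best-minus-worst counting. This is a deliberately simpler stand-in for the multinomial logistic regression the abstract describes, and it does not reproduce the paper's adapted method. The function name, the response format, and the tablet feature names are hypothetical.

```python
from collections import Counter


def best_worst_scores(responses):
    """Score each item as (times best - times worst) / times shown.

    responses: iterable of (shown_items, best_item, worst_item) tuples,
    one per MaxDiff question answered by a participant.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, best_item, worst_item in responses:
        shown.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical answers: each question shows three features and records
# the one judged most important and the one judged least important.
responses = [
    (("screen reader", "voice control", "large text"), "screen reader", "large text"),
    (("screen reader", "voice control", "stylus"), "voice control", "stylus"),
    (("large text", "stylus", "screen reader"), "screen reader", "stylus"),
]

scores = best_worst_scores(responses)
for item, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: {score:+.2f}")
```

Scores range from -1 (always chosen as worst) to +1 (always chosen as best). Counting is a common first look at MaxDiff data, while a fitted choice model gives preference estimates with uncertainty.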

Paperid: 2500, https://arxiv.org/pdf/2506.14854.pdf

Abstract:
Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.

Paperid: 2501, https://arxiv.org/pdf/2506.14809.pdf

Abstract:
Surveys are a cornerstone of Information Systems (IS) research, yet creating high-quality surveys remains labor-intensive, requiring both domain expertise and methodological rigor. With the evolution of large language models (LLMs), new opportunities emerge to automate survey generation. This paper presents the real-world deployment of an LLM-powered system designed to accelerate data collection while maintaining survey quality. Deploying such systems in production introduces real-world complexity, including diverse user needs and quality control. We evaluate the system using the DeLone and McLean IS Success Model to understand how generative AI can reshape a core IS method. This study makes three key contributions. To our knowledge, this is the first application of the IS Success Model to a generative AI system for survey creation. In addition, we propose a hybrid evaluation framework combining automated and human assessments. Finally, we implement safeguards that mitigate post-deployment risks and support responsible integration into IS workflows.

Paperid: 2502, https://arxiv.org/pdf/2506.14783.pdf

Abstract:
Decoding natural language from brain activity using non-invasive electroencephalography (EEG) remains a significant challenge in neuroscience and machine learning, particularly for open-vocabulary scenarios where traditional methods struggle with noise and variability. Previous studies have achieved high accuracy on small closed vocabularies, but they still struggle on open vocabularies. In this study, we propose ETS, a framework that integrates EEG with synchronized eye-tracking data to address two critical tasks: (1) open-vocabulary text generation and (2) sentiment classification of perceived language. Our model achieves superior BLEU and ROUGE scores for EEG-to-text decoding and up to 10% F1 score on EEG-based ternary sentiment classification, which significantly outperforms supervised baselines. Furthermore, we show that our proposed model can handle data from various subjects and sources, showing great potential for high-performance open-vocabulary EEG-to-text systems.

Paperid: 2503, https://arxiv.org/pdf/2506.14777.pdf

Abstract:
This article introduces WebXAII, an open-source web framework designed to facilitate research on human interaction with eXplainable Artificial Intelligence (XAI) systems. The field of XAI is rapidly expanding, driven by the growing societal implications of the widespread adoption of AI (and in particular machine learning) across diverse applications. Researchers who study the interaction between humans and XAI techniques typically develop ad hoc interfaces in order to conduct their studies. These interfaces are usually not shared alongside the results of the studies, which limits their reusability and the reproducibility of experiments. In response, we design and implement WebXAII, a web-based platform that can embody full experimental protocols, meaning that it can present all aspects of the experiment to human participants and record their responses. The experimental protocols are translated into a composite architecture of generic views and modules, which offers a lot of flexibility. The architecture is defined in a structured configuration file, so that protocols can be implemented with minimal programming skills. We demonstrate that WebXAII can effectively embody relevant protocols, by reproducing the protocol of a state-of-the-art study of the literature.

Paperid: 2504, https://arxiv.org/pdf/2506.14567.pdf

Abstract:
Generative AI tools have become more prevalent in engineering workflows, particularly through chatbots and code assistants. As the perceived accuracy of these tools improves, questions arise about whether and how those who work in high-precision domains might maintain vigilance for errors, and what other aspects of using such tools might trouble their work. This paper analyzes interviews with hardware and software engineers, and their collaborators, who work in integrated circuit design to identify the role accuracy plays in their use of generative AI tools and what other forms of trouble they face in using such tools. The paper inventories these forms of trouble, which are then mapped to elements of generative AI systems, to conclude that controlling the context of interactions between engineers and the generative AI tools is one of the largest challenges they face. The paper concludes with recommendations for mitigating this form of trouble by increasing the ability to control context interactively.

Paperid: 2505, https://arxiv.org/pdf/2506.13845.pdf

Abstract:
The increasing availability and use of artificial intelligence (AI) tools in educational settings has raised concerns about students' overreliance on these technologies. Overreliance occurs when individuals accept incorrect AI-generated recommendations, often without critical evaluation, leading to flawed problem solutions and undermining learning outcomes. This study investigates potential factors contributing to patterns of AI reliance among undergraduate students, examining not only overreliance but also appropriate reliance (correctly accepting helpful and rejecting harmful recommendations) and underreliance (incorrectly rejecting helpful recommendations). Our approach combined pre- and post-surveys with a controlled experimental task where participants solved programming problems with an AI assistant that provided both accurate and deliberately incorrect suggestions, allowing direct observation of students' reliance patterns when faced with varying AI reliability. We find that appropriate reliance is significantly related to students' programming self-efficacy, programming literacy, and need for cognition, while showing negative correlations with post-task trust and satisfaction. Overreliance showed significant correlations with post-task trust and satisfaction with the AI assistant. Underreliance was negatively correlated with programming literacy, programming self-efficacy, and need for cognition. Overall, the findings provide insights for developing targeted interventions that promote appropriate reliance on AI tools, with implications for the integration of AI in curriculum and educational technologies.
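
The three reliance categories defined above can be made concrete with a small sketch. It assumes each trial is recorded as a pair (was the AI suggestion correct, did the student accept it); the function names and the data layout are hypothetical and are not taken from the study.

```python
from collections import Counter


def classify_reliance(suggestion_correct: bool, accepted: bool) -> str:
    """Label one decision using the definitions given in the abstract."""
    if suggestion_correct == accepted:
        # Accepting a helpful suggestion or rejecting a harmful one.
        return "appropriate"
    if accepted:
        # Accepting an incorrect suggestion.
        return "overreliance"
    # Rejecting a helpful suggestion.
    return "underreliance"


def reliance_rates(decisions):
    """Return the share of each reliance category over a list of trials.

    `decisions` is a list of (suggestion_correct, accepted) pairs;
    this layout is an assumption made for illustration.
    """
    if not decisions:
        raise ValueError("at least one decision is required")
    counts = Counter(classify_reliance(c, a) for c, a in decisions)
    total = len(decisions)
    return {
        label: counts[label] / total
        for label in ("appropriate", "overreliance", "underreliance")
    }


# Example: four trials, one of each possible outcome.
example = [(True, True), (False, False), (False, True), (True, False)]
rates = reliance_rates(example)
```

Per-participant rates of this kind are the sort of quantity that could then be correlated with survey measures such as self-efficacy or trust.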

Paperid: 2506, https://arxiv.org/pdf/2506.13630.pdf

Abstract:
Effective methods for visualizing data involving multiple variables, including categorical ones, are limited. The hammock plot (Schonlau, 2003) visualizes both categorical and numerical variables using parallel coordinates. We introduce the Stata implementation hammock. We give numerous examples that explore highlighting, missing values, putting axes on the same scale, and tracing an observation across variables. Further, we introduce parallel univariate plots as an edge case of hammock plots. We also present and make publicly available a new dataset on the 2020 Tour de France.

Paperid: 2507, https://arxiv.org/pdf/2506.12792.pdf

Abstract:
This chapter presents an overview of Prosocial Design, an approach to platform design and governance that recognizes design choices influence behavior and that those choices can or should be made toward supporting healthy interactions and other prosocial outcomes. The authors discuss several core principles of Prosocial Design and its relationship to Trust and Safety and other related fields. As a primary contribution, the chapter reviews relevant research to demonstrate how Prosocial Design can be an effective approach to reducing rule-breaking and other harmful behavior and how it can help to stem the spread of harmful misinformation. Prosocial Design is a nascent and evolving field and research is still limited. The authors hope this chapter will not only inspire more research and the adoption of a prosocial design approach, but that it will also provoke discussion about the principles of Prosocial Design and its potential to support Trust and Safety.

Paperid: 2508, https://arxiv.org/pdf/2506.12332.pdf

Abstract:
Terms of Service (ToS) are ubiquitous, legally binding contracts that govern consumers' digital interactions. However, ToS are not designed to be read: they are filled with pages of ambiguous and complex legal terminology that burden potential users. We introduce TermSight, an intelligent reading interface designed to make ToS more approachable. TermSight offers visual summaries that highlight the relevance and power balance of information in a ToS. TermSight also categorizes and simplifies information within the ToS into concise plain-language summaries. To aid in reading the original text, TermSight offers contextualized definitions and scenarios for unfamiliar phrases. Our within-subjects evaluation of TermSight (N=20) revealed that TermSight significantly reduced the difficulty of reading ToS and increased participants' willingness to do so. We also observed emerging strategies that participants took when interacting with AI-powered features that highlight the diverse ways that TermSight assisted ToS reading.

Paperid: 2509, https://arxiv.org/pdf/2506.12244.pdf

Abstract:
Online services are required to gain informed consent from users to collect, store and analyse their personal data, both intentionally divulged and derived during their use of the service. There are many issues with these forms: they are too long, too complex and demand the user's attention too frequently. Many users consent without reading and so do not know what they are agreeing to. As such, granted consent is effectively uninformed. In this paper, we report on two studies we carried out to arrive at a value-driven approach to inform efforts to reduce the length of consent forms. The first study interviewed unemployed users to identify the values they want these forms to satisfy. The second survey study helped us to quantify the values and value creators. To ensure that we understood the particular valuation of the unemployed, we compared their responses to those of an employed demographic and observed no significant differences between their prioritisation on any of the values. However, we did find substantial differences between values and value creators, with effort minimisation being most valued by our participants.

Paperid: 2510, https://arxiv.org/pdf/2506.12008.pdf

Abstract:
Dance performance traditionally follows a unidirectional relationship where movement responds to music. While AI has advanced in various creative domains, its application in dance has primarily focused on generating choreography from musical input. We present a system that enables dancers to dynamically shape musical environments through their movements. Our multi-modal architecture creates a coherent musical composition by intelligently combining pre-recorded musical clips in response to dance movements, establishing a bidirectional creative partnership where dancers function as both performers and composers. Through correlation analysis of performance data, we demonstrate emergent communication patterns between movement qualities and audio features. This approach reconceptualizes the role of AI in performing arts as a responsive collaborator that expands possibilities for both professional dance performance and improvisational artistic expression across broader populations.

Paperid: 2511, https://arxiv.org/pdf/2506.11665.pdf

Abstract:
This paper examines the views of software providers in the German dairy industry with regard to dairy farmers' needs for explanation of digital decision support systems. The study is based on mastitis detection in dairy cows using a hypothetical herd management system. We designed four exemplary explanation formats for mastitis assessments with different types of presentation (textual, rule-based, herd comparison, and time series). In our previous study, 14 dairy farmers in Germany had rated these formats in terms of comprehensibility and the trust they would have in a system providing each format. In this study, we repeat the survey with 13 software providers active in the German dairy industry. We ask them how well they think the formats would be received by farmers. We hypothesized that there may be discrepancies between the views of both groups that are worth investigating, partly to find reasons for the reluctance to adopt digital systems. A comparison of the feedback from both groups supports the hypothesis and calls for further investigation. The results show that software providers tend to make assumptions about farmers' preferences that are not necessarily accurate. Our study, although not representative due to the small sample size, highlights the potential benefits of a thorough user requirements analysis (farmers' needs) to improve software adaptation and user acceptance.

Paperid: 2512, https://arxiv.org/pdf/2506.11536.pdf

Abstract:
Extended exposure to virtual reality environments can induce motion sickness, often referred to as cybersickness, which may lead to physiological stress responses and impaired cognitive performance. This study investigates the aftereffects of VR-induced motion sickness with a focus on physiological stress markers and working memory performance. Using a carousel simulation to elicit cybersickness, we assessed subjective discomfort (SSQ, FMS), physiological stress (salivary cortisol, alpha-amylase, electrodermal activity, heart rate), and cognitive performance (n-Back task) over a 90-minute post-exposure period. Our findings demonstrate a significant increase in both subjective and physiological stress indicators following VR exposure, accompanied by a decline in working memory performance. Notably, delayed symptom progression was observed in a substantial proportion of participants, with some reporting peak symptoms up to 90 minutes post-stimulation. Salivary cortisol levels remained elevated throughout the observation period, indicating prolonged stress recovery. These results highlight the need for longer washout phases in XR research and raise safety concerns for professional applications involving post-exposure task performance.

Paperid: 2513, https://arxiv.org/pdf/2506.11393.pdf

Abstract:
Clinical communication skills are essential for preparing healthcare professionals to provide equitable care across cultures. However, traditional training with simulated patients can be resource intensive and difficult to scale, especially in under-resourced settings. In this project, we explore the use of an AI-driven chatbot to support culturally competent communication training for medical students. The chatbot was designed to simulate realistic patient conversations and provide structured feedback based on the ACT Cultural Competence model. We piloted the chatbot with a small group of third-year medical students at a UK medical school in 2024. Although we did not follow a formal experimental design, our experience suggests that the chatbot offered useful opportunities for students to reflect on their communication, particularly around empathy and interpersonal understanding. More challenging areas included addressing systemic issues and historical context. Although this early version of the chatbot helped surface some interesting patterns, limitations were also clear, such as the absence of nonverbal cues and the tendency for virtual patients to be overly agreeable. In general, this reflection highlights both the potential and the current limitations of AI tools in communication training. More work is needed to better understand their impact and improve the learning experience.

Paperid: 2514, https://arxiv.org/pdf/2506.11179.pdf

Abstract:
Mental stress has become a pervasive factor affecting cognitive health and overall well-being, necessitating the development of robust, non-invasive diagnostic tools. Electroencephalogram (EEG) signals provide a direct window into neural activity, yet their non-stationary and high-dimensional nature poses significant modeling challenges. Here we introduce Brain2Vec, a new deep learning tool that classifies stress states from raw EEG recordings using a hybrid architecture of convolutional, recurrent, and attention mechanisms. The model begins with a series of convolutional layers to capture localized spatial dependencies, followed by an LSTM layer to model sequential temporal patterns, and concludes with an attention mechanism to emphasize informative temporal regions. We evaluate Brain2Vec on the DEAP dataset, applying bandpass filtering, z-score normalization, and epoch segmentation as part of a comprehensive preprocessing pipeline. Compared to traditional CNN-LSTM baselines, our proposed model achieves an AUC score of 0.68 and a validation accuracy of 81.25%. These findings demonstrate Brain2Vec's potential for integration into wearable stress monitoring platforms and personalized healthcare systems.
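
Two of the preprocessing steps named above, z-score normalization and epoch segmentation, can be sketched in plain Python. This is a minimal illustration for a single channel, not the Brain2Vec pipeline: the function names, the epoch length, and the step size are assumptions, and bandpass filtering is left out because it would require a signal-processing library.

```python
from statistics import mean, pstdev


def zscore(signal):
    """Rescale one channel to zero mean and unit (population) variance."""
    mu = mean(signal)
    sigma = pstdev(signal)
    if sigma == 0:
        # A constant channel carries no variation; map it to zeros.
        return [0.0 for _ in signal]
    return [(x - mu) / sigma for x in signal]


def segment_epochs(signal, epoch_len, step=None):
    """Cut a channel into fixed-length epochs.

    `step` defaults to `epoch_len` (non-overlapping epochs); a smaller
    step yields overlapping windows. Trailing samples that do not fill
    a whole epoch are dropped.
    """
    if epoch_len <= 0:
        raise ValueError("epoch_len must be positive")
    step = epoch_len if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")
    return [
        signal[start:start + epoch_len]
        for start in range(0, len(signal) - epoch_len + 1, step)
    ]


# Example: normalize a short toy channel, then cut it into epochs of 4 samples.
channel = [float(v) for v in range(10)]
normalized = zscore(channel)
epochs = segment_epochs(normalized, epoch_len=4)
overlapping = segment_epochs(normalized, epoch_len=4, step=2)
```

Each resulting epoch would then be one input sequence for a convolutional, recurrent, and attention model of the kind described.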

Paperid: 2515, https://arxiv.org/pdf/2506.11047.pdf

Abstract:
Machine learning systems are increasingly deployed in high-stakes domains, yet they remain vulnerable to bias: systematic disparities that disproportionately impact specific demographic groups. Traditional bias detection methods often depend on access to sensitive labels or rely on rigid fairness metrics, limiting their applicability in real-world settings. This paper introduces a novel, perception-driven framework for bias detection that leverages crowdsourced human judgment. Inspired by reCAPTCHA and other crowd-powered systems, we present a lightweight web platform that displays stripped-down visualizations of numeric data (for example, salary distributions across demographic clusters) and collects binary judgments on group similarity. We explore how users' visual perception, shaped by layout, spacing, and question phrasing, can signal potential disparities. User feedback is aggregated to flag data segments as biased, which are then validated through statistical tests and machine learning cross-evaluations. Our findings show that perceptual signals from non-expert users reliably correlate with known bias cases, suggesting that visual intuition can serve as a powerful, scalable proxy for fairness auditing. This approach offers a label-efficient, interpretable alternative to conventional fairness diagnostics, paving the way toward human-aligned, crowdsourced bias detection pipelines.
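
The aggregation step described above can be illustrated with a simple vote-share rule. The abstract does not say how judgments are aggregated, so everything here is an assumption: the function name, the minimum number of votes, and the 0.6 threshold are hypothetical choices for a sketch.

```python
def flag_segments(judgments, min_votes=5, threshold=0.6):
    """Flag data segments that enough users judged as dissimilar.

    `judgments` maps a segment id to a list of binary votes, where
    True means "the groups look different". Segments with fewer than
    `min_votes` votes are skipped; the rest are flagged when the share
    of True votes reaches `threshold`. Both cut-offs are illustrative.
    """
    flagged = {}
    for segment_id, votes in judgments.items():
        if len(votes) < min_votes:
            continue  # too little evidence to decide either way
        share = sum(votes) / len(votes)
        if share >= threshold:
            flagged[segment_id] = share
    return flagged


# Example with three hypothetical segments.
votes = {
    "salary_by_cluster": [True, True, True, True, False],
    "age_by_cluster": [True, False, True, False, True, False],
    "tenure_by_cluster": [True, True],
}
candidates = flag_segments(votes)
```

Flagged segments are only candidates; as the abstract notes, they would still need validation through statistical tests.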

Paperid: 2516, https://arxiv.org/pdf/2506.11015.pdf

Abstract:
In the age of generative AI and ubiquitous digital tools, human cognition faces a structural paradox: as external aids become more capable, internal memory systems risk atrophy. Drawing on neuroscience and cognitive psychology, this paper examines how heavy reliance on AI systems and discovery-based pedagogies may impair the consolidation of declarative and procedural memory -- systems essential for expertise, critical thinking, and long-term retention. We review how tools like ChatGPT and calculators can short-circuit the retrieval, error correction, and schema-building processes necessary for robust neural encoding. Notably, we highlight striking parallels between deep learning phenomena such as "grokking" and the neuroscience of overlearning and intuition. Empirical studies are discussed showing how premature reliance on AI during learning inhibits proceduralization and intuitive mastery. We argue that effective human-AI interaction depends on strong internal models -- biological "schemata" and neural manifolds -- that enable users to evaluate, refine, and guide AI output. The paper concludes with policy implications for education and workforce training in the age of large language models.

Paperid: 2517, https://arxiv.org/pdf/2506.10964.pdf

Abstract:
Urban digital twins are increasingly perceived as a way to pool the growing digital resources of cities for the purpose of a more sustainable and integrated urban planning. Models and simulations are central to this undertaking: They enable "what if?" scenarios, create insights and describe relationships between the vast data that is being collected. However, the process of integrating and subsequently using models in urban digital twins is an inherently complex undertaking. It raises questions about how to represent urban complexity, how to deal with uncertain assumptions and modeling paradigms, and how to capture underlying power relations. Existent approaches in the domain largely focus on monolithic and centralized solutions in the tradition of neoliberal city-making, oftentimes prohibiting pluralistic and open interoperable models. Using a participatory design for participatory systems approach together with the City of Hamburg, Germany, we find that an open Urban Model Platform can function both as a public technological backbone for modeling and simulation in urban digital twins and as a socio-technical framework for a collaborative and pluralistic representation of urban processes. Such a platform builds on open standards, allows for a decentralized integration of models, enables communication between models and supports a multi-model approach to representing urban systems.

Paperid: 2518, https://arxiv.org/pdf/2506.10927.pdf

Abstract:
Reduced social connectedness increasingly poses a threat to mental health, life expectancy, and general well-being. Generative AI (GAI) technologies, such as large language models (LLMs) and image generation tools, are increasingly integrated into applications aimed at enhancing human social experiences. Despite their growing presence, little is known about how these technologies influence social interactions. This scoping review investigates how GAI-based applications are currently designed to facilitate social interaction, what forms of social engagement they target, and which design and evaluation methodologies designers use to create and evaluate them. Through an analysis of 30 studies published since 2020, we identify key trends in application domains including storytelling, socio-emotional skills training, reminiscence, collaborative learning, music making, and general conversation. We highlight the role of participatory and co-design approaches in fostering both effective technology use and social engagement, while also examining socio-ethical concerns such as cultural bias and accessibility. This review underscores the potential of GAI to support dynamic and personalized interactions, but calls for greater attention to equitable design practices and inclusive evaluation strategies.

Paperid: 2519, https://arxiv.org/pdf/2506.10818.pdf

Abstract:
The ability to predict the object the user intends to grasp offers essential contextual information and may help to leverage the effects of point-to-point latency in interactive environments. This paper explores the feasibility and accuracy of real-time recognition of uninstrumented objects based on hand kinematics during reach-to-grasp actions. In a data collection study, we recorded the hand motions of 16 participants while reaching out to grasp and then moving real and synthetic objects. Our results demonstrate that even a simple LSTM network can predict the time point at which the user grasps an object with a precision better than 21 ms and the current distance to this object with a precision better than 1 cm. The target's size can be determined in advance with an accuracy better than 97%. Our results have implications for designing adaptive and fine-grained interactive user interfaces in ubiquitous and mixed-reality environments.

Paperid: 2520, https://arxiv.org/pdf/2506.10598.pdf

Abstract:
Coding forms a key part of computer science education in universities. As part of this education, Integrated Development Environments (IDEs) are essential tools for coding. However, it is currently unknown how the design of an IDE's interface impacts on students with Attention Deficit Hyperactivity Disorder (ADHD). In this study we investigated the use of IDEs by students with ADHD. We conducted a think aloud study with nine university computing students, followed by qualitative observational interviews to analyse their learning and engagement with the Visual Studio Code IDE. The paper reports on these experiences and seeks to understand the role IDEs play in the educational setting. Our work also examines how digital accessibility and usability are considered in the current design of IDEs. We analysed the qualitative data using a thematic analysis and identified three primary themes: self-confidence, interaction, and learning as well as various sub-themes. The themes and their sub-themes illustrate key areas of consideration when designing IDEs for students with ADHD. The primary findings highlight experiences of frustration and barriers in the current design and layout of IDEs. Through our participatory approach we provide a rare insight into ADHD user experiences around usability and accessibility, and describe the need for better design of development environments to ensure a positive learning experience for the students.

Paperid: 2521, https://arxiv.org/pdf/2506.10003.pdf

Abstract:
Digital 3D representations of urban areas, through their growing availability, are a helpful tool to better understand a territory. However, they lack contextual information about, for example, the history or functionality of buildings. On the other hand, multimedia documents like images, videos or texts usually contain such information. Crossing these two types of data can therefore help in the analysis and understanding of the organization of our cities. This could also be used to develop document search based on spatial navigation, instead of the classical textual query. In this paper, we propose four approaches to integrate multimedia documents in a 3D urban scene, making it possible to contextualize the scene with any type of media. We combine these integration approaches with user guidance modes that guide the user through the consumption of these media and support their understanding of the territory. We demonstrate the usefulness of these techniques in the context of different projects within the Lyon area (France). The use of multimedia documents integrated into a digital tour makes it possible, for example, to contextualise iconic buildings or to understand the evolution of a territory through time.

Paperid: 2522, https://arxiv.org/pdf/2506.09873.pdf

Abstract:
Responsible AI (rAI) guidance increasingly promotes stakeholder involvement (SHI) during AI development. At the same time, SHI is already common in commercial software development, but with potentially different foci. This study clarifies the extent to which established SHI practices are able to contribute to rAI efforts as well as potential disconnects -- essential insights to inform and tailor future interventions that further shift industry practice towards rAI efforts. First, we analysed 56 rAI guidance documents to identify why SHI is recommended (i.e. its expected benefits for rAI) and uncovered goals such as redistributing power, improving socio-technical understandings, anticipating risks, and enhancing public oversight. To understand why and how SHI is currently practised in commercial settings, we then conducted an online survey (n=130) and semi-structured interviews (n=10) with AI practitioners. Our findings reveal that SHI in practice is primarily driven by commercial priorities (e.g. customer value, compliance) and several factors currently discourage more rAI-aligned SHI practices. This suggests that established SHI practices are largely not contributing to rAI efforts. To address this disconnect, we propose interventions and research opportunities to advance rAI development in practice.

Paperid: 2523, https://arxiv.org/pdf/2506.09292.pdf

Abstract:
Misconceptions in psychology and education persist despite clear contradictory evidence, resisting traditional correction methods. This study investigated whether personalised AI dialogue could effectively correct these stubborn beliefs. In a preregistered experiment (N = 375), participants holding strong psychology misconceptions engaged in one of three interventions: (1) personalised AI dialogue targeting their specific misconception, (2) generic textbook-style refutation, or (3) neutral AI dialogue (control). Results showed that personalised AI dialogue produced significantly larger immediate belief reductions compared to both textbook reading and neutral dialogue. This advantage persisted at 10-day follow-up but diminished by 2 months, where AI dialogue and textbook conditions converged while both remained superior to control. Both AI conditions generated significantly higher engagement and confidence than textbook reading, demonstrating the motivational benefits of conversational interaction. These findings demonstrate that AI dialogue can accelerate initial belief correction through personalised, interactive engagement that disrupts the cognitive processes maintaining misconceptions. However, the convergence of effects over time suggests brief interventions require reinforcement for lasting change. Future applications should integrate AI tutoring into structured educational programs with spaced reinforcement to sustain the initial advantages of personalised dialogue.

Paperid: 2524, https://arxiv.org/pdf/2506.09236.pdf

Abstract:
During the past decade, there has been a significant increase in research focused on integrating AR User Interfaces into public safety applications, particularly for first responders in the domains of Emergency Medical Services, Firefighting, and Law Enforcement. This paper presents the results of a scoping review involving the application of AR user interfaces in the public safety domain and applies an established systematic review methodology to provide a comprehensive analysis of the current research landscape, identifying key trends, challenges, and gaps in the literature. This review includes peer-reviewed publications indexed by the major scientific databases up to April 2025. A basic keyword search retrieved 1,751 papers, of which 90 were deemed relevant for this review. An in-depth analysis of the literature allowed the development of a faceted taxonomy that categorizes AR user interfaces for public safety. This classification lays a solid foundation for future research, while also highlighting key design considerations, challenges, and gaps in the literature. This review serves as a valuable resource for researchers and developers, offering insights that can drive further advances in the field.

Paperid: 2525, https://arxiv.org/pdf/2506.09054.pdf

Abstract:
Particle Builder Online is a web-based education game designed for high school physics students. Students can play against an AI opponent or peers to familiarise themselves with the Standard Model of Particle Physics. The game is aimed at a high school level and tailored to the International Baccalaureate and the Australian Curriculum. Students from four schools in Canberra took pre/post-tests and a survey while completing a lesson where they played Particle Builder. Students' understanding of particle physics concepts improved significantly. Students found the game more enjoyable and effective than regular classroom lessons.

Paperid: 2526, https://arxiv.org/pdf/2506.08911.pdf

Abstract:
This paper presents a keyword spotting (KWS) system implemented on the NXP MCXN947 microcontroller with an integrated Neural Processing Unit (NPU), enabling real-time voice interaction on resource-constrained devices. The system combines MFCC feature extraction with a CNN classifier, optimized using Quantization Aware Training to reduce model size with minimal accuracy drop. Experimental results demonstrate a 59x speedup in inference time when leveraging the NPU compared to CPU-only execution, achieving 97.06% accuracy with a model size of 30.58 KB, demonstrating the feasibility of efficient, low-power voice interfaces on embedded platforms.
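
Quantization-aware training typically simulates low-precision weights during the forward pass. The sketch below shows one common form of that simulation, symmetric per-tensor "fake quantization" to signed 8-bit values. The abstract does not state which scheme the authors use, so the function name, the bit width, and the symmetric scaling are assumptions for illustration only.

```python
def fake_quantize(weights, num_bits=8):
    """Round weights to a signed integer grid and map them back to floats.

    Uses one scale for the whole list (symmetric, per-tensor). The
    returned floats are the values a quantized model would effectively
    compute with.
    """
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0 for _ in weights]
    scale = max_abs / q_max
    quantized = []
    for w in weights:
        level = round(w / scale)
        level = max(-q_max, min(q_max, level))  # clamp to the integer range
        quantized.append(level * scale)
    return quantized


# Example: the rounding error per weight is at most half a quantization step.
original = [0.731, -0.052, 0.0, 0.4, -1.0]
simulated = fake_quantize(original)
step_size = max(abs(w) for w in original) / 127
```

Storing each weight as one 8-bit integer instead of a 32-bit float is what shrinks the model, at the cost of the small rounding error shown here.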

Paperid: 2527, https://arxiv.org/pdf/2506.08836.pdf

Abstract:
Swiss German is a low-resource language represented by diverse dialects that differ significantly from Standard German and from each other, lacking a standardized written form. As a result, transcribing Swiss German involves translating into Standard German. Existing datasets have been collected in controlled environments, yielding effective speech-to-text (STT) models, but these models struggle with spontaneous conversational speech. This paper, therefore, introduces the new SRB-300 dataset, a 300-hour annotated speech corpus featuring real-world long-audio recordings from 39 Swiss German radio and TV stations. It captures spontaneous speech across all major Swiss dialects recorded in various realistic environments and overcomes the limitation of prior sentence-level corpora. We fine-tuned multiple OpenAI Whisper models on the SRB-300 dataset, achieving notable enhancements over previous zero-shot performance metrics. Improvements in word error rate (WER) ranged from 19% to 33%, while BLEU scores increased between 8% and 40%. The best fine-tuned model, large-v3, achieved a WER of 17.1% and a BLEU score of 74.8. This advancement is crucial for developing effective and robust STT systems for Swiss German and other low-resource languages in real-world contexts.
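
Word error rate, the main metric reported above, is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below implements that standard definition; it splits on whitespace only and applies no text normalization, which may differ from the evaluation setup used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as (substitutions + deletions + insertions) / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# Example: one reference word is missing from the hypothesis.
wer = word_error_rate("das ist ein test", "das ist test")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.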

Paperid: 2528, https://arxiv.org/pdf/2506.08555.pdf

Abstract:
Cross-subject electromyography (EMG) pattern recognition faces significant challenges due to inter-subject variability in muscle anatomy, electrode placement, and signal characteristics. Traditional methods rely on subject-specific calibration data to adapt models to new users, an approach that is both time-consuming and impractical for large-scale, real-world deployment. This paper presents an approach to eliminate calibration requirements through feature disentanglement, enabling effective cross-subject generalization. We propose an end-to-end dual-branch adversarial neural network that simultaneously performs pattern recognition and individual identification by disentangling EMG features into pattern-specific and subject-specific components. The pattern-specific components facilitate robust pattern recognition for new users without model calibration, while the subject-specific components enable downstream applications such as task-invariant biometric identification. Experimental results demonstrate that the proposed model achieves robust performance on data from unseen users, outperforming various baseline methods in cross-subject scenarios. Overall, this study offers a new perspective for cross-subject EMG pattern recognition without model calibration and highlights the proposed model's potential for broader applications, such as task-independent biometric systems.

Paperid: 2529, https://arxiv.org/pdf/2506.08303.pdf

Abstract:
Situation awareness (SA)--comprising the ability to 1) perceive critical elements in the environment, 2) comprehend their meanings, and 3) project their future states--is critical for human operator performance. Due to the disruptive nature of gold-standard SA measures, researchers have sought physiological indicators to provide real-time information about SA. We extend prior work by using a multimodal suite of neurophysiological, psychophysiological, and behavioral signals, predicting all three levels of SA along a continuum, and predicting a comprehensive measure of SA in a complex multi-tasking simulation. We present a lab study in which 31 participants controlled an aircraft simulator task battery while wearing physiological sensors and responding to SA 'freeze-probe' assessments. We demonstrate the validity of task and assessment for measuring SA. Multimodal physiological models predict SA with greater predictive performance ($Q^2$ for levels 1-3 and total, respectively: 0.14, 0.00, 0.26, and 0.36) than models built with shuffled labels, demonstrating that multimodal physiological signals provide useful information in predicting all SA levels. Level 3 SA (projection) was best predicted, and level 2 SA (comprehension) was the most challenging to predict. Ablation analysis and single sensor models found EEG and eye-tracking signals to be particularly useful to predictions of level 3 and total SA. A reduced sensor fusion model showed that predictive performance can be maintained with a subset of sensors. This first rigorous cross-validation assessment of predictive performance demonstrates the utility of multimodal physiological signals for inferring complex, holistic, objective measures of SA at all levels, non-disruptively, and along a continuum.
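
The $Q^2$ values above are cross-validated analogues of $R^2$. One common definition, assumed here because the abstract does not state which variant was used, is $Q^2 = 1 - \mathrm{PRESS}/\mathrm{TSS}$, where PRESS is the sum of squared errors of held-out predictions and TSS is the total sum of squares around the mean. The sketch below implements only that assumed definition; the name `q_squared` is hypothetical.

```python
def q_squared(y_true, y_pred_cv):
    """Cross-validated predictive R^2 under the assumed definition
    Q^2 = 1 - PRESS / TSS.

    y_pred_cv must hold predictions made for each sample while that
    sample was held out of training (an assumption of this sketch).
    """
    if len(y_true) != len(y_pred_cv) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_y = sum(y_true) / len(y_true)
    # PRESS: squared error of the held-out predictions.
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_pred_cv))
    # TSS: squared deviation of the observations from their mean.
    tss = sum((y - mean_y) ** 2 for y in y_true)
    if tss == 0:
        raise ValueError("y_true has zero variance; Q^2 is undefined")
    return 1 - press / tss
```

Under this definition a model that always predicts the mean scores 0, so a value of 0.00 indicates no gain over that trivial baseline, while negative values indicate worse-than-baseline predictions.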

Paperid: 2531, https://arxiv.org/pdf/2506.07707.pdf

Abstract:
This paper explores how Mixed Reality (MR) and 2D video conferencing influence children's communication during a gesture-based guessing game. Finnish-speaking participants engaged in a short collaborative task using two different setups: Microsoft HoloLens MR and Zoom. Audio-video recordings were transcribed and analyzed using Large Language Models (LLMs), enabling iterative correction, translation, and annotation. Despite limitations in annotations' accuracy and agreement, automated approaches significantly reduced processing time and allowed non-Finnish-speaking researchers to participate in data analysis. Evaluations highlight both the efficiency and constraints of LLM-based analyses for capturing children's interactions across these platforms. Initial findings indicate that MR fosters richer interaction, evidenced by higher emotional expression during annotation, and heightened engagement, while Zoom offers simplicity and accessibility. This study underscores the potential of MR to enhance collaborative learning experiences for children in distributed settings.

Paperid: 2532, https://arxiv.org/pdf/2506.07667.pdf

Abstract:
To meet the demands of content moderation, online platforms have resorted to automated systems. Newer forms of real-time engagement ($\textit{e.g.}$, users commenting on live streams) on platforms like Twitch exert additional pressures on the latency expected of such moderation systems. Despite their prevalence, relatively little is known about the effectiveness of these systems. In this paper, we conduct an audit of Twitch's automated moderation tool ($\texttt{AutoMod}$) to investigate its effectiveness in flagging hateful content. For our audit, we create streaming accounts to act as siloed test beds, and interface with the live chat using Twitch's APIs to send over $107,000$ comments collated from $4$ datasets. We measure $\texttt{AutoMod}$'s accuracy in flagging blatantly hateful content containing misogyny, racism, ableism and homophobia. Our experiments reveal that a large fraction of hateful messages, up to $94\%$ on some datasets, $\textit{bypass moderation}$. Contextual addition of slurs to these messages results in $100\%$ removal, revealing $\texttt{AutoMod}$'s reliance on slurs as a moderation signal. We also find that contrary to Twitch's community guidelines, $\texttt{AutoMod}$ blocks up to $89.5\%$ of benign examples that use sensitive words in pedagogical or empowering contexts. Overall, our audit points to large gaps in $\texttt{AutoMod}$'s capabilities and underscores the importance for such systems to understand context effectively.

Paperid: 2533, https://arxiv.org/pdf/2506.07389.pdf

Abstract:
Smart contract (SC) fuzzing is a critical technique for detecting vulnerabilities in blockchain applications. However, its adoption remains challenging for practitioners due to fundamental differences between SCs and traditional software systems. In this study, we investigate the challenges practitioners face when adopting SC fuzzing tools by conducting an inductive content analysis of 381 GitHub issues from two widely used SC fuzzers: Echidna and Foundry. Furthermore, we conducted a user study to examine how these challenges affect different practitioner groups, SC developers, and traditional software security professionals, and identify strategies practitioners use to overcome them. We systematically categorize these challenges into a taxonomy based on their nature and occurrence within the SC fuzzing workflow. Our findings reveal domain-specific ease-of-use and usefulness challenges, including technical issues with blockchain emulation, and human issues with a lack of accessible documentation and process automation. Our results provide actionable insights for tool developers and researchers, guiding future improvements in SC fuzzer tool design.

Paperid: 2534, https://arxiv.org/pdf/2506.07278.pdf

Abstract:
This paper presents IDEIA (Intelligent Engine for Editorial Ideation and Assistance), a generative AI-powered system designed to optimize the journalistic ideation process by combining real-time trend analysis with automated content suggestion. Developed in collaboration with the Sistema Jornal do Commercio de Comunicação (SJCC), the largest media conglomerate in Brazil's North and Northeast regions, IDEIA integrates the Google Trends API for data-driven topic monitoring and the Google Gemini API for the generation of context-aware headlines and summaries. The system adopts a modular architecture based on Node.js, React, and PostgreSQL, supported by Docker containerization and a CI/CD pipeline using GitHub Actions and Vercel. Empirical results demonstrate a significant reduction in the time and cognitive effort required for editorial planning, with reported gains of up to 70\% in the content ideation stage. This work contributes to the field of computational journalism by showcasing how intelligent automation can enhance productivity while maintaining editorial quality. It also discusses the technical and ethical implications of incorporating generative models into newsroom workflows, highlighting scalability and future applicability across sectors beyond journalism.

Paperid: 2535, https://arxiv.org/pdf/2506.06353.pdf

Abstract:
The growing convergence between Large Language Models (LLMs) and electroencephalography (EEG) research is enabling new directions in neural decoding, brain-computer interfaces (BCIs), and affective computing. This survey offers a systematic review and structured taxonomy of recent advancements that utilize LLMs for EEG-based analysis and applications. We organize the literature into four domains: (1) LLM-inspired foundation models for EEG representation learning, (2) EEG-to-language decoding, (3) cross-modal generation including image and 3D object synthesis, and (4) clinical applications and dataset management tools. The survey highlights how transformer-based architectures adapted through fine-tuning, few-shot, and zero-shot learning have enabled EEG-based models to perform complex tasks such as natural language generation, semantic interpretation, and diagnostic assistance. By offering a structured overview of modeling strategies, system designs, and application areas, this work serves as a foundational resource for future work to bridge natural language processing and neural signal analysis through language models.

Paperid: 2536, https://arxiv.org/pdf/2506.06225.pdf

Abstract:
As generative AI (GenAI) becomes increasingly widespread, it is crucial to equip users, particularly vulnerable populations such as older adults (65 and older), with the knowledge to understand its benefits and potential risks. Older adults often exhibit greater reservations about adopting emerging technologies and require tailored literacy support. Using a mixed methods approach, this study examines strategies for delivering GenAI literacy to older adults through a chatbot named Litti, evaluating its impact on their AI literacy (knowledge, safety, and ethical use). The quantitative data indicated a trend toward improved AI literacy, though the results were not statistically significant. However, qualitative interviews revealed diverse levels of familiarity with generative AI and a strong desire to learn more. Findings also show that while Litti provided a positive learning experience, it did not significantly enhance participants' trust or sense of safety regarding GenAI. This exploratory case study highlights the challenges and opportunities in designing AI literacy education for the rapidly growing older adult population.

Paperid: 2537, https://arxiv.org/pdf/2506.06166.pdf

Abstract:
The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber. We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity and potentially the lock-in of false beliefs. We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop. Code and data available at https://thelockinhypothesis.com

Paperid: 2538, https://arxiv.org/pdf/2506.06165.pdf

Abstract:
While the complexity of 21st-century demands has promoted pedagogical approaches to foster complex competencies, a persistent gap remains between in-class learning activities and individualized learning or assessment practices. To address this, studies have explored the use of AI-generated characters in learning and assessment. One attempt is scenario-based assessment (SBA), a technique that not only measures but also fosters the development of competencies throughout the assessment process. SBA introduces simulated agents to provide an authentic social-interactional context, allowing for the assessment of competency-based constructs while mitigating the unpredictability of real-life interactions. Recent advancements in multimodal AI, such as text-to-video technology, allow these agents to be enhanced into AI-generated characters. This mixed-method study investigates how learners perceive AI characters taking the role of mentor and teammates in an SBA mirroring the context of a collaborative science investigation. Specifically, we examined the Likert scale responses of 56 high schoolers regarding trust, social presence, and effectiveness. We analyzed the relationships between these factors and their impact on the intention to adopt AI characters through PLS-SEM. Our findings indicated that learners' trust shaped their sense of social presence with the AI characters, enhancing perceived effectiveness. Qualitative analysis further highlighted factors that foster trust, such as material credibility and alignment with learning goals, as well as the pivotal role of social presence in creating a collaborative context. This paper was accepted as a full paper for AIED 2025.

Paperid: 2539, https://arxiv.org/pdf/2506.06162.pdf

Abstract:
Scientific recommender systems, such as Google Scholar and Web of Science, are essential tools for discovery. The search algorithms that power them work through stigmergy, a collective intelligence mechanism that surfaces useful paths through repeated engagement. While generally effective, this "rich-get-richer" dynamic results in a small number of high-profile papers that dominate visibility. This essay argues that these algorithms' over-reliance on popularity fosters intellectual homogeneity and exacerbates structural inequities, stifling innovative and diverse perspectives critical for scientific progress. We propose an overhaul of search platforms to incorporate user-specific calibration, allowing researchers to manually adjust the weights of factors like popularity, recency, and relevance. We also advise platform developers on how text embeddings and LLMs could be implemented in ways that increase user autonomy. While our suggestions are particularly pertinent to aligning recommender systems with scientific values, these ideas are broadly applicable to information access systems in general. Designing platforms that increase user autonomy is an important step toward more robust and dynamic information
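
The user-specific calibration proposed above can be pictured as a re-ranking step in which the researcher, rather than the platform, sets the factor weights. The sketch below is purely illustrative and is not taken from the essay: the `Paper` record, the `rerank` function, and the assumption that each factor is already normalized to the range 0 to 1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical search result; all three factors assumed in [0, 1]."""
    title: str
    popularity: float
    recency: float
    relevance: float


def rerank(papers, w_popularity, w_recency, w_relevance):
    """Sort papers by a user-weighted linear score, highest first."""
    total = w_popularity + w_recency + w_relevance
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    def score(paper):
        # Dividing by the total keeps scores comparable across weightings.
        return (
            w_popularity * paper.popularity
            + w_recency * paper.recency
            + w_relevance * paper.relevance
        ) / total

    return sorted(papers, key=score, reverse=True)


results = [
    Paper("Highly cited classic", popularity=0.9, recency=0.1, relevance=0.5),
    Paper("Recent preprint", popularity=0.1, recency=0.9, relevance=0.6),
]
popularity_first = rerank(results, 1.0, 0.0, 0.0)
recency_first = rerank(results, 0.0, 1.0, 0.0)
```

Shifting all weight from popularity to recency reverses the order of the two example results, which is the kind of user-controlled counterweight to "rich-get-richer" ranking that the essay argues for.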

Paperid: 2540, https://arxiv.org/pdf/2506.06066.pdf

Abstract:
Mixed reality (MR) environments offer embodied spatial interaction, providing intuitive 3D manipulation capabilities that enhance the conceptual design process. Parametric modeling, a powerful and advanced architectural design method, enables the generation of complex, optimized geometries. However, its integration into MR environments remains limited due to precision constraints and unsuitable input modalities. Existing MR tools prioritize spatial interaction but lack the control and expressiveness required for parametric workflows, particularly for designers without formal programming backgrounds. We address this gap by introducing a novel conversational MR interface that combines speech input, gesture recognition, and a multi-agent large language model (LLM) system to support intuitive parametric modeling. Our system dynamically manages parameter states, resolves ambiguous commands through conversation and contextual prompting, and enables real-time model manipulation within immersive environments. We demonstrate how this approach reduces cognitive and operational barriers in early-stage design tasks, allowing users to refine and explore their design space. This work expands the role of MR to a generative design platform, supporting programmatic thinking in design tasks through natural, embodied interaction.

Paperid: 2541, https://arxiv.org/pdf/2506.06062.pdf

Abstract:
Minoritised ethnic people are marginalised in society, and therefore at a higher risk of adverse online harms, including those arising from the loss of security and privacy of personal data. Despite this, there has been very little research focused on minoritised ethnic people's security and privacy concerns, attitudes, and behaviours. In this work, we provide the results of one of the first studies in this regard. We explore minoritised ethnic people's experiences of using essential online services across three sectors: health, social housing, and energy, their security and privacy-related concerns, and responses towards these services. We conducted a thematic analysis of 44 semi-structured interviews with people of various reported minoritised ethnicities in the UK. Privacy concerns and lack of control over personal data emerged as a major theme, with many interviewees considering privacy as their most significant concern when using online services. Several creative tactics to exercise some agency were reported, including selective and inconsistent disclosure of personal data. A core concern about how data may be used was driven by a fear of repercussions, including penalisation and discrimination, influenced by prior experiences of institutional and online racism. The increased concern and potential for harm resulted in minoritised ethnic people grappling with a higher-stakes dilemma of whether to disclose personal information online or not. Furthermore, trust in institutions, or lack thereof, was found to be embedded throughout as a basis for adapting behaviour. We draw on our results to provide lessons learned for the design of more inclusive, marginalisation-aware, and privacy-preserving online services.

Paperid: 2542, https://arxiv.org/pdf/2506.06055.pdf

Abstract:
As emerging technologies continue to shape society, there is a growing emphasis on the need to engage with design ethics as it unfolds in practice to better capture the complexities of ethical considerations embedded in day-to-day work. Positioned within the broader "turn to practice" in HCI, the review characterizes this body of work in terms of its motivations, conceptual frameworks, methodologies, and contributions across a range of design disciplines and academic databases. The findings reveal a shift away from static and abstract ethical frameworks toward an understanding of ethics as an evolving, situated, and inherent aspect of design activities, one that can be cultivated and fostered collaboratively. This review proposes six future directions for establishing common research priorities and fostering the field's growth. While the review promotes cross-disciplinary dialogue, we argue that HCI research, given its cumulative experience with practice-oriented research, is well-equipped to guide this emerging strand of work on design ethics.

Paperid: 2543, https://arxiv.org/pdf/2506.05729.pdf

Abstract:
Young adults with depression often experience prolonged indoor stays, limiting their access to natural environments and exacerbating mental health challenges. While nature therapy is recognized for its psychological benefits, existing interventions frequently require outdoor engagement, which may not be accessible for all individuals. This study explores the potential of user-led indoor modifications using local natural materials as a mental health intervention. A qualitative approach was employed to assess emotional and environmental connectedness. Participants engaged in material exploration, collection, and crafting, integrating natural elements into their living spaces. Findings indicate improved mood, increased environmental awareness, and a stronger sense of agency over personal space. The standardized intervention steps suggest the feasibility of a self-help toolkit, enabling broader implementation. This research contributes to sustainable, user-driven mental health interventions, bridging the gap between nature therapy and practical indoor applications.

Paperid: 2544, https://arxiv.org/pdf/2506.05226.pdf

Abstract:
As a Ph.D. student with a diverse background in both public and private sectors, I have encountered numerous challenges in cross-disciplinary and multi-stakeholder team projects. My research focuses on developing team compositions that involve multidisciplinary members from fields including education, academia, and health. My advisor and I are focused on exploring how HCI can help individuals assemble more effective teams. This effort involves developing socio-technical systems that guide and inform individuals of the potential teams that they can assemble. We employ state-of-the-art algorithms that prioritize inclusion among team members from diverse areas of expertise and familiarity between the team members. Our goal for attending this workshop is to engage in meaningful dialogues with scholars and researchers, leveraging these interactions to refine our approach to building an AI-driven team composition system to foster effective, interdisciplinary collaboration in health-focused HCI research.

Paperid: 2545, https://arxiv.org/pdf/2506.05030.pdf

Abstract:
Artificial intelligence promises to revolutionise medicine, yet its impact remains limited because of the pervasive translational gap. We posit that the prevailing technology-centric approaches underpin this challenge, rendering such systems fundamentally incompatible with clinical practice, specifically diagnostic reasoning and decision making. Instead, we propose a novel sociotechnical conceptualisation of data-driven support tools designed to complement doctors' cognitive and epistemic activities. Crucially, it prioritises real-world impact over superhuman performance on inconsequential benchmarks.

Paperid: 2546, https://arxiv.org/pdf/2506.04745.pdf

Abstract:
Brain-Computer Interfaces (BCIs) based on motor imagery (MI) hold promise for restoring control in individuals with motor impairments. However, up to 30% of users remain unable to effectively use BCIs, a phenomenon termed "BCI inefficiency." This study addresses a major limitation in current BCI training protocols: the use of fixed-length training paradigms that ignore individual learning variability. We propose a novel approach that leverages neuronal avalanches (spatiotemporal cascades of brain activity) as biomarkers to characterize and predict user-specific learning mechanisms. Using electroencephalography (EEG) data collected across four MI-BCI training sessions in 20 healthy participants, we extracted two features: avalanche length and activations. These features revealed significant training and task-condition effects, particularly in later sessions. Crucially, changes in these features across sessions ($\Delta$avalanche length and $\Delta$activations) correlated significantly with BCI performance and enabled prediction of future BCI success via longitudinal Support Vector Regression and Classification models. Predictive accuracy reached up to 91%, with notable improvements after spatial filtering based on selected regions of interest. These findings demonstrate the utility of neuronal avalanche dynamics as robust biomarkers for BCI training, supporting the development of personalized protocols aimed at mitigating BCI illiteracy.

Paperid: 2547, https://arxiv.org/pdf/2506.04659.pdf

Abstract:
In this work, we present a multi-tool evaluation of 106 deployed web-based chatbots across domains like healthcare, education and customer service, comprising both standalone applications and embedded widgets, using automated tools (Google Lighthouse, PageSpeed Insights, SiteImprove Accessibility Checker) and manual audits (Microsoft Accessibility Insights). Our analysis reveals that over 80% of chatbots exhibit at least one critical accessibility issue, and 45% suffer from missing semantic structures or ARIA role misuse. Furthermore, we found that accessibility scores correlate strongly across tools (e.g., Lighthouse vs PageSpeed Insights, r = 0.861), but performance scores do not (r = 0.436), underscoring the value of a multi-tool approach. We offer replicable evaluation insights and actionable recommendations to support the development of user-friendly conversational interfaces.
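
The cross-tool agreement figures above are correlation coefficients. Assuming they are Pearson's r (the abstract does not name the statistic), the computation for two tools' scores over the same set of chatbots could be sketched as follows; the function name `pearson_r` and any score lists passed to it are hypothetical, not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, each left unnormalized because the
    # 1/n factors cancel in the ratio below.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the two tools rank the chatbots almost identically, whereas a moderate value such as the reported 0.436 for performance scores suggests the tools measure partly different things.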

Paperid: 2548, https://arxiv.org/pdf/2506.04253.pdf

Abstract:
We present HADA (Human-AI Agent Decision Alignment), a protocol- and framework-agnostic reference architecture that keeps both large language model (LLM) agents and legacy algorithms aligned with organizational targets and values. HADA wraps any algorithm or LLM in role-specific stakeholder agents -- business, data-science, audit, ethics, and customer -- each exposing conversational APIs so that technical and non-technical actors can query, steer, audit, or contest every decision across strategic, tactical, and real-time horizons. Alignment objectives, KPIs, and value constraints are expressed in natural language and are continuously propagated, logged, and versioned while thousands of heterogeneous agents run on different orchestration stacks. A cloud-native proof of concept packages a production credit-scoring model (getLoanDecision) and deploys it on Docker/Kubernetes/Python; five scripted retail-bank scenarios show how target changes, parameter tweaks, explanation requests, and ethics triggers flow end to end through the architecture. Evaluation followed the Design-Science Research Methodology. Walkthrough observation and log inspection demonstrated complete coverage of six predefined objectives: every role could invoke conversational control, trace KPIs and value constraints, detect and mitigate ZIP-code bias, and reproduce full decision lineage, independent of the underlying LLM or agent library. Contributions: (1) an open-source HADA architecture, (2) a mid-range design theory for human-AI alignment in multi-agent systems, and (3) empirical evidence that framework-agnostic, protocol-compliant stakeholder agents improve accuracy, transparency, and ethical compliance in real-world decision pipelines.

Paperid: 2549, https://arxiv.org/pdf/2506.03402.pdf

Abstract:
Live coding is a pedagogical technique in which an instructor writes and executes code in front of students to impart skills like incremental development and debugging. Although live coding offers many benefits, instructors face many challenges in the classroom, like cognitive challenges and psychological stress, most of which have yet to be formally studied. To understand the obstacles faced by instructors in CS classes, we conducted (1) a formative interview with five teaching assistants in exercise sessions and (2) a contextual inquiry study with four lecturers for large-scale classes. We found that the improvisational and unpredictable nature of live coding makes it difficult for instructors to manage their time and keep students engaged, resulting in more mental stress than presenting static slides. We discuss opportunities for augmenting existing IDEs and presentation setups to help enhance the live coding experience.

Paperid: 2550, https://arxiv.org/pdf/2506.03188.pdf

Abstract:
Diabetic foot ulcers (DFUs), a class of chronic wounds, affect ~750,000 individuals every year in the US alone, and identifying early those non-healing DFUs that develop into chronic wounds can drastically reduce treatment costs and minimize the risk of amputation. There is therefore a pressing need for diagnostic tools that can detect non-healing DFUs early. We develop low-cost, multi-analyte 3D-printed assays, seamlessly integrated on swabs, that can identify non-healing DFUs, together with a Wound Sensor iOS App, an innovative mobile application developed for the controlled acquisition and automated analysis of wound sensor data. We developed automated computer vision techniques that compare density changes between the original base image (before exposure to the wound) and the wound-exposed image, which allows us to automatically determine the severity of the wound. The iOS app ensures accurate data collection and presents actionable insights, despite challenges such as variations in camera configurations and ambient conditions. The proposed integrated sensor and iOS app will allow healthcare professionals to monitor wound conditions in real time, track healing progress, and assess critical parameters related to wound care.

Paperid: 2551, https://arxiv.org/pdf/2506.02856.pdf

Abstract:
This work investigates how listeners perceive and evaluate AI-generated as compared to human-composed music in the context of emotional resonance and regulation. Across a mixed-methods design, participants were exposed to both AI and human music under various labeling conditions (music correctly labeled as AI- or human-origin, music incorrectly labeled as AI- or human-origin, and unlabeled music) and emotion cases (Calm and Upbeat), and were asked to rate preference, efficacy of target emotion elicitation, and emotional impact. Participants were significantly more likely to rate human-composed music, regardless of labeling, as more effective at eliciting target emotional states, though quantitative analyses revealed no significant differences in emotional response. However, participants were significantly more likely to indicate preference for AI-generated music, yielding further questions regarding the impact of emotional authenticity and perceived authorship on musical appraisal. Qualitative data underscored this, with participants associating humanness with qualities such as imperfection, flow, and 'soul.' These findings challenge the assumption that preference alone signals success in generative music systems. Rather than positioning AI tools as replacements for human creativity or emotional expression, they point toward a more careful design ethos that acknowledges the limits of replication and prioritizes human values such as authenticity, individuality, and emotion regulation in wellness and affective technologies.

Paperid: 2552, https://arxiv.org/pdf/2506.02700.pdf

Abstract:
Cognitive load, which varies across individuals, can significantly affect focus and memory performance. This study explores the integration of Virtual Reality (VR) with memory palace techniques, aiming to optimize VR environments tailored to individual cognitive load levels to improve focus and memory. We utilized EEG devices, specifically the Oculus Quest 2, to monitor Beta wave activity in 10 participants. By modeling their cognitive load profiles through polynomial regression, we dynamically adjusted spatial variables within a VR environment using Grasshopper, creating personalized experiences. Results indicate that 8 participants showed a notable increase in Beta wave activity, demonstrating improved focus and cognitive performance in the customized VR settings. These findings underscore the potential of VR-based memory environments, driven by cognitive load considerations, and provide valuable insights for advancing VR memory research.

Paperid: 2553, https://arxiv.org/pdf/2506.02447.pdf

Abstract:
Word embedding, which converts words into numerical values, is an important natural language processing technique and widely used. One of the serious problems of word embedding is that the bias will be learned and affect the model if the dataset used for pre-training contains bias. On the other hand, indiscriminate removal of bias from word embeddings may result in the loss of information, even if the bias is undesirable to us. As a result, a risk of model performance degradation due to bias removal will be another problem. As a solution to this problem, we focus on gender bias in Japanese and propose an interactive visualization method to adjust the degree of debias for each word category. Specifically, we visualize the accuracy in a category classification task after debiasing, and allow the user to adjust the parameters based on the visualization results, so that the debiasing can be adjusted according to the user's objectives. In addition, considering a trade-off between debiasing and preventing degradation of model performance, and that different people perceive gender bias differently, we developed a mechanism to present multiple choices of debiasing configurations applying an optimization scheme. This paper presents the results of an experiment in which we removed the gender bias for word embeddings learned from the Japanese version of Wikipedia. We classified words into five categories based on a news corpus, and observed that the degree of influence of debiasing differed greatly among the categories. We then adjusted the degree of debiasing for each category based on the visualization results.
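
The idea of an adjustable degree of debiasing can be illustrated with a projection-based scheme, in which a fraction of each word vector's component along a bias direction is removed. This is an assumed stand-in chosen for illustration, since the abstract does not specify the debiasing method; the function `partial_debias`, the `strength` parameter, and the way the bias direction is obtained are all hypothetical.

```python
import math


def partial_debias(vector, bias_direction, strength):
    """Remove a fraction of the vector's component along a bias direction.

    strength = 0.0 leaves the vector unchanged; strength = 1.0 removes the
    component entirely. In a per-category setup, each word category would
    get its own strength value (an assumption of this sketch).
    """
    if len(vector) != len(bias_direction):
        raise ValueError("vector and bias_direction must have equal length")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")

    norm = math.sqrt(sum(g * g for g in bias_direction))
    if norm == 0:
        raise ValueError("bias_direction must be non-zero")
    unit = [g / norm for g in bias_direction]

    # Scalar projection of the word vector onto the unit bias direction.
    projection = sum(v * g for v, g in zip(vector, unit))
    return [v - strength * projection * g for v, g in zip(vector, unit)]
```

Intermediate strength values trade off residual bias against information lost from the embedding, which is the trade-off the interactive visualization described above is meant to expose to the user.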

Paperid: 2554, https://arxiv.org/pdf/2506.01492.pdf

Abstract:
This study investigates the challenges in designing research infrastructure software for automated software publication in multi-stakeholder environments, focusing specifically on the HERMES system. Through two quantitative surveys of research software engineers (RSEs) and infrastructure facility staff (IFs), it examines technical, organizational, and social requirements across these stakeholder groups. The study reveals significant differences in how RSEs and IFs prioritize various system features. While RSEs highly value compatibility with existing infrastructure, IFs prioritize user-focused aspects like system usability and documentation. The research identifies two main challenges in designing research infrastructure software: (1) the existence of multiple stakeholder groups with differing requirements, and (2) the internal heterogeneity within each stakeholder group across dimensions such as technical experience. The study also highlights that only half of RSE respondents actively practice software publication, pointing to potential cultural or technical barriers. Additionally, the research reveals discrepancies in how stakeholders view organizational aspects, with IFs consistently rating factors like responsibility structures and quality assurance as more important than RSEs do. These findings contribute to a better understanding of the complexities involved in designing research infrastructure software and emphasize the need for systems that can accommodate diverse user groups while maintaining usability across different technical expertise levels.

Paperid: 2555, https://arxiv.org/pdf/2506.00058.pdf

Abstract:
The rise of large language models (LLMs) has created a new job role: the Prompt Engineer. Despite growing interest in this position, we still do not fully understand what skills this new job role requires or how common these jobs are. We analyzed 20,662 job postings on LinkedIn, including 72 prompt engineer positions, to learn more about this emerging role. We found that prompt engineering is still rare (less than 0.5% of sampled job postings) but has a unique skill profile. Prompt engineers need AI knowledge (22.8%), prompt design skills (18.7%), good communication (21.9%), and creative problem-solving (15.8%) skills. These requirements significantly differ from those of established roles, such as data scientists and machine learning engineers, showing that prompt engineering is becoming its own profession. Our findings help job seekers, employers, and educational institutions in better understanding the emerging field of prompt engineering.

Paperid: 2556, https://arxiv.org/pdf/2505.24803.pdf

Abstract:
Large Language Models (LLMs) have shown great potential in automated story generation, but challenges remain in maintaining long-form coherence and providing users with intuitive and effective control. Retrieval-Augmented Generation (RAG) has proven effective in reducing hallucinations in text generation; however, the use of structured data to support generative storytelling remains underexplored. This paper investigates how knowledge graphs (KGs) can enhance LLM-based storytelling by improving narrative quality and enabling user-driven modifications. We propose a KG-assisted storytelling pipeline and evaluate its effectiveness through a user study with 15 participants. Participants created their own story prompts, generated stories, and edited knowledge graphs to shape their narratives. Through quantitative and qualitative analysis, our findings demonstrate that knowledge graphs significantly enhance story quality in action-oriented and structured narratives within our system settings. Additionally, editing the knowledge graph increases users' sense of control, making storytelling more engaging, interactive, and playful.

Paperid: 2557, https://arxiv.org/pdf/2505.24681.pdf

Abstract:
Generative AI transforms knowledge production, validation, and dissemination, raising academic integrity and credibility concerns. This study examines 53 academic influencer videos that reached 5.3 million viewers to identify an emerging, structured, implementation-ready pipeline balancing originality, ethical compliance, and human-AI collaboration despite the disruptive impacts. Findings highlight generative AI's potential to automate publication workflows and democratize participation in knowledge production while challenging traditional scientific norms. Academic influencers emerge as key intermediaries in this paradigm shift, connecting bottom-up practices with institutional policies to improve adaptability. Accordingly, the study proposes a generative publication production pipeline and a policy framework for co-intelligence adaptation and reinforcing credibility-centered standards in AI-powered research. These insights support scholars, educators, and policymakers in understanding AI's transformative impact by advocating responsible and innovation-driven knowledge production. Additionally, they reveal pathways for automating best practices, optimizing scholarly workflows, and fostering creativity in academic research and publication.

Paperid: 2558, https://arxiv.org/pdf/2505.24096.pdf

Abstract:
Sensor-based reactive and hybrid approaches have proven a promising line of study to address imperfect knowledge in grasping and manipulation. However, the reactive approaches are usually tightly coupled to a particular embodiment, making transfer of knowledge difficult. This paper proposes a paradigm for modeling and execution of reactive manipulation actions, which makes knowledge transfer to different embodiments possible while retaining the reactive capabilities of the embodiments. The proposed approach extends the idea of control primitives coordinated by a state machine by introducing an embodiment-independent layer of abstraction. Abstract manipulation primitives constitute a vocabulary of atomic, embodiment-independent actions, which can be coordinated using state machines to describe complex actions. To obtain embodiment-specific models, the abstract state machines are automatically translated to embodiment-specific models, such that the full capabilities of each platform can be utilized. The strength of the manipulation primitives paradigm is demonstrated by developing a set of corresponding embodiment-specific primitives for object transport, including a complex reactive grasping primitive. The robustness of the approach is experimentally studied in emptying a box filled with several unknown objects. The embodiment independence is studied by performing a manipulation task on two different platforms using the same abstract description.

Paperid: 2559, https://arxiv.org/pdf/2505.23472.pdf

Abstract:
If public trust in a new technology is lost early in its life cycle, it can take much more time for the benefits of that technology to be realised. Eventually, tens of millions of people will collectively have the power to determine the success or failure of self-driving technology, driven by their perception of risk, data handling, safety, governance, accountability, benefits to their lives, and more. This paper reviews the evidence on safety-critical technology covering trust, engagement, and acceptance. The paper takes a narrative review approach, concluding with a scalable model for self-driving technology education and engagement. The paper finds that if a mismatch emerges between the public's perception and expectations about self-driving systems, it can lead to misuse, disuse, or abuse of the system. Furthermore, we find from the evidence that industrial experts often misunderstand what matters to the public, users, and stakeholders. However, we find that engagement programmes that develop approaches to defining the right information at the right time, in the right format, orientated around what matters to the public, create the potential for ever more sophisticated conversations, greater trust, and moving the public into a progressively more active role of critique and advocacy. This work has been undertaken as part of the Partners for Automated Vehicle Education (PAVE) United Kingdom programme.

Paperid: 2560, https://arxiv.org/pdf/2505.23405.pdf

Abstract:
Formative assessment is a cornerstone of effective teaching and learning, providing students with feedback to guide their learning. While there has been an exponential growth in the application of generative AI in scaling various aspects of formative assessment, ranging from automatic question generation to intelligent tutoring systems and personalized feedback, few have directly addressed the core pedagogical principles of formative assessment. Here, we critically examined how generative AI, especially large-language models (LLMs) such as ChatGPT, can support key components of formative assessment: helping students, teachers, and peers understand "where learners are going," "where learners currently are," and "how to move learners forward" in the learning process. With the rapid emergence of new prompting techniques and LLM capabilities, we also provide guiding principles for educators to effectively leverage cost-free LLMs in formative assessments while remaining grounded in pedagogical best practices. Furthermore, we reviewed the role of LLMs in generating feedback, highlighting limitations in current evaluation metrics that inadequately capture the nuances of formative feedback, such as distinguishing feedback at the task, process, and self-regulatory levels. Finally, we offer practical guidelines for educators and researchers, including concrete classroom strategies and future directions such as developing robust metrics to assess LLM-generated feedback, leveraging LLMs to overcome systemic and cultural barriers to formative assessment, and designing AI-aware assessment strategies that promote transferable skills while mitigating overreliance on LLM-generated responses. By structuring the discussion within an established formative assessment framework, this review provides a comprehensive foundation for integrating LLMs into formative assessment in a pedagogically informed manner.

Paperid: 2561, https://arxiv.org/pdf/2505.23147.pdf

Abstract:
Advances in eye-tracking control for assistive robotic arms provide intuitive interaction opportunities for people with physical disabilities. Shared control has gained interest in recent years by improving user satisfaction through partial automation of robot control. We present an eye-tracking-guided shared control design based on insights from state-of-the-art literature. A Wizard of Oz setup was used in which automation was simulated by an experimenter to evaluate the concept without requiring full implementation. This approach allowed for rapid exploration of user needs and expectations to inform future iterations. Two studies were conducted to assess user experience, identify design challenges, and find improvements to ensure usability and accessibility. The first study involved people with disabilities by providing a survey, and the second study used the Wizard of Oz design in person to gain technical insights, leading to a comprehensive picture of findings.

Paperid: 2562, https://arxiv.org/pdf/2505.23137.pdf

Abstract:
In recent years, there has been a growing need and opportunity to use online platforms for psychophysics research. Online experiments make it possible to evaluate large and diverse populations remotely and quickly, complementing laboratory-based research. However, developing and running online psychophysics experiments poses several challenges: i) a high barrier to entry for researchers, who often need to learn complex code-based platforms, ii) an uncontrolled experimental environment, and iii) questionable credibility of the participants. Here, we introduce an open-source Modular Online Psychophysics Platform (MOPP) to address these challenges. Through the simple web-based interface of MOPP, researchers can build modular experiments, share them with others, and copy or modify tasks from each other's environments. MOPP provides built-in features to calibrate for viewing distance and to measure visual acuity. It also includes email-based and IP-based authentication, and reCAPTCHA verification. We developed five example psychophysics tasks that come preloaded in the environment, and ran a pilot experiment hosted on the AWS (Amazon Web Services) cloud. Pilot data collected for these tasks yielded results similar to those reported in laboratory settings. MOPP can thus help researchers collect large psychophysics datasets online, with reduced turnaround time, and in a standardized manner.

Paperid: 2563, https://arxiv.org/pdf/2505.22969.pdf

Abstract:
The increasing number of accidents caused by alcohol-impaired driving has prompted the development of integrated safety systems in vehicles to monitor driver behavior and prevent crashes. This paper explores how drivers perceive these systems, focusing on their comfort, trust, privacy concerns, and willingness to adopt the technology. Through a survey of 115 U.S. participants, the study reveals a preference for non-intrusive systems, such as those monitoring eye movements, over more restrictive technologies like alcohol detection devices. Privacy emerged as a major concern, with many participants preferring local data processing and anonymity. Trust in these systems was crucial for acceptance, as drivers are more likely to adapt their behavior when they believe the system is accurate and reliable. To encourage adoption, it is important to address concerns about privacy and balance the benefits of safety with personal freedom. By improving transparency, ensuring reliability, and increasing public awareness, these systems could play a significant role in reducing road accidents and improving safety.

Paperid: 2564, https://arxiv.org/pdf/2505.22414.pdf

Abstract:
This paper examines how the coding strategies of sighted and blind programmers differ when working with audio feedback alone. The goal is to identify challenges in mixed-ability collaboration, particularly when sighted programmers work with blind peers or teach programming to blind students. To overcome limitations of traditional blindness simulation studies, we proposed Task-Oriented Priming and Sensory Alignment (ToPSen), a design framework that reframes sensory constraints as technical requirements rather than as a disability. Through a study of 12 blind and 12 sighted participants coding non-visually, we found that expert blind programmers maintain more accurate mental models and process more information in working memory than sighted programmers using ToPSen. Our analysis revealed that blind and sighted programmers process structural information differently, exposing gaps in current IDE designs. These insights inform our guidelines for improving the accessibility of programming tools and fostering effective mixed-ability collaboration.

Paperid: 2565, https://arxiv.org/pdf/2505.21582.pdf

Abstract:
Intelligent tutoring systems combined with large language models offer a promising approach to address students' diverse needs and promote self-efficacious learning. While large language models possess good foundational knowledge of electrical engineering basics, they remain insufficiently capable of addressing specific questions about electrical circuits. In this paper, we present AITEE, an agent-based tutoring system for electrical engineering designed to accompany students throughout their learning process, offer individualized support, and promote self-directed learning. AITEE supports both hand-drawn and digital circuits through an adapted circuit reconstruction process, enabling natural interaction with students. Our novel graph-based similarity measure identifies relevant context from lecture materials through a retrieval augmented generation approach, while parallel Spice simulation further enhances accuracy in applying solution methodologies. The system implements a Socratic dialogue to foster learner autonomy through guided questioning. Experimental evaluations demonstrate that AITEE significantly outperforms baseline approaches in domain-specific knowledge application, with even medium-sized LLM models showing acceptable performance. Our results highlight the potential of agentic tutors to deliver scalable, personalized, and effective learning environments for electrical engineering education.

Paperid: 2566, https://arxiv.org/pdf/2505.21562.pdf

Abstract:
This case study examines the ClimaTech Great Global Innovation Challenge's approach to selecting climate tech startups by integrating human and AI evaluations. The competition aimed to identify top startups and enhance the accuracy and efficiency of the selection process through a hybrid model. Research shows data-driven approaches help VC firms reduce bias and improve decision-making. Machine learning models have outperformed human investors in deal screening, helping identify high-potential startups. Incorporating AI aimed to ensure more equitable and objective evaluations. The methodology included three phases: initial AI review, semi-finals judged by humans, and finals using a hybrid weighting. In phase one, 57 applications were scored by an AI tool built with StackAI and OpenAI's GPT-4o, and the top 36 advanced. In the semi-finals, human judges, unaware of AI scores, evaluated startups on team quality, market potential, and technological innovation. Each score - human or AI - was weighted equally, resulting in 75 percent human and 25 percent AI influence. In the finals, with five human judges, weighting shifted to 83.3 percent human and 16.7 percent AI. There was a moderate positive correlation between AI and human scores - Spearman's ρ = 0.47 - indicating general alignment with key differences. Notably, the final four startups, selected mainly by humans, were among those rated highest by the AI. This highlights the complementary nature of AI and human judgment. The study shows that hybrid models can streamline and improve startup assessments. The ClimaTech approach offers a strong framework for future competitions by combining human expertise with AI capabilities.
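
The human/AI influence percentages quoted above follow directly from equal weighting of every score. A minimal sketch of the arithmetic is given below; it assumes a single AI score per startup and, for the semi-finals, three human judges (a count inferred from the 75/25 split, not stated in the abstract). The function name is hypothetical.

```python
def influence_shares(n_human_judges, n_ai_scores=1):
    """Return (human_share, ai_share) when every score carries equal weight."""
    total = n_human_judges + n_ai_scores
    return n_human_judges / total, n_ai_scores / total


# Semi-finals: assumed three human judges plus one AI score.
semi_human, semi_ai = influence_shares(3)

# Finals: five human judges (as stated) plus one AI score.
final_human, final_ai = influence_shares(5)

print(f"semi-finals: {semi_human:.1%} human, {semi_ai:.1%} AI")
print(f"finals:      {final_human:.1%} human, {final_ai:.1%} AI")
```

Under these assumptions, adding human judges while keeping a single AI score is what shifts the AI's share from one quarter to one sixth.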

Paperid: 2567, https://arxiv.org/pdf/2505.21196.pdf

Abstract:
In affective computing, datasets often contain multiple annotations from different annotators, which may lack full agreement. Typically, these annotations are merged into a single gold standard label, potentially losing valuable inter-rater variability. We propose a multi-annotator training approach for continuous emotion recognition (CER) that seeks a consensus across all annotators rather than relying on a single reference label. Our method employs a consensus network to aggregate annotations into a unified representation, guiding the main arousal-valence predictor to better reflect collective inputs. Tested on the RECOLA and COGNIMUSE datasets, our approach outperforms traditional methods that unify annotations into a single label. This underscores the benefits of fully leveraging multi-annotator data in emotion recognition and highlights its applicability across various fields where annotations are abundant yet inconsistent.

Paperid: 2568, https://arxiv.org/pdf/2505.21016.pdf

Abstract:
This study examines how Reddit users engaged with the racial narratives of Lovecraft Country and Watchmen, two television series that reimagine historical racial trauma. Drawing on narrative persuasion and multistep flow theory, we analyze 3,879 Reddit comments using topic modeling and critical discourse analysis. We identify three dynamic social roles (advocates, adversaries, and adaptives) and explore how users move between them in response to racial discourse. Findings reveal how Reddit's pseudonymous affordances shape role fluidity, opinion leadership, and moral engagement. While adversaries minimized or rejected racism as exaggerated, advocates shared standpoint experiences and historical resources to challenge these claims. Adaptive users shifted perspectives over time, demonstrating how online publics can foster critical racial learning. This research highlights how popular culture and participatory platforms intersect in shaping collective meaning making around race and historical memory.

Paperid: 2569, https://arxiv.org/pdf/2505.20727.pdf

Abstract:
Have you ever read a blog or social media post and suspected that it was written--at least in part--by artificial intelligence (AI)? While transparently acknowledging contributors to writing is generally valued, why some writers choose to disclose or withhold AI involvement remains unclear. In this work, we ask what factors shape writers' decisions to disclose their AI use as a starting point to effectively advocate for transparency. To shed light on this question, we synthesize study findings and theoretical frameworks in human-AI interaction and behavioral science. Concretely, we identify and curate a list of factors that could affect writers' decisions regarding disclosure for human-AI co-created content.

Paperid: 2570, https://arxiv.org/pdf/2505.20466.pdf

Abstract:
Smart microscopy represents a paradigm shift in biological imaging, moving from passive observation tools to active collaborators in scientific inquiry. Enabled by advances in automation, computational power, and artificial intelligence, these systems are now capable of adaptive decision-making and real-time experimental control. Here, we introduce a theoretical framework that reconceptualizes smart microscopy as a partner in scientific investigation. Central to our framework is the concept of the 'epistemic-empirical divide' in cellular investigation-the gap between what is observable (empirical domain) and what must be understood (epistemic domain). We propose six core design principles: epistemic-empirical awareness, hierarchical context integration, an evolution from detection to perception, adaptive measurement frameworks, narrative synthesis capabilities, and cross-contextual reasoning. Together, these principles guide a multi-agent architecture designed to align empirical observation with the goals of scientific understanding. Our framework provides a roadmap for building microscopy systems that go beyond automation to actively support hypothesis generation, insight discovery, and theory development, redefining the role of scientific instruments in the process of knowledge creation.

Paperid: 2571, https://arxiv.org/pdf/2505.20339.pdf

Abstract:
The declared goal of this paper is to fill this gap: "... cognitive systems research needs questions or challenges that define progress. The challenges are not (yet more) predictions of the future, but a guideline to what are the aims and what would constitute progress." -- the quotation being from the project description of EUCogII, the project for the European Network for Cognitive Systems within which this formulation of the 'challenges' was originally developed (http://www.eucognition.org). So, we stick out our neck and formulate the challenges for artificial cognitive systems. These challenges are articulated in terms of a definition of what a cognitive system is: a system that learns from experience and uses its acquired knowledge (both declarative and practical) in a flexible manner to achieve its own goals.

Paperid: 2572, https://arxiv.org/pdf/2505.20068.pdf

Abstract:
Shared understanding plays a key role in the effective communication in and performance of human-human interactions. With the increasingly common integration of AI into human contexts, the future of personal and workplace interactions will likely see human-AI interaction (HAII) in which the perception of shared understanding (PSU) is important. Existing literature has addressed the processes and effects of PSU in human-human interactions, but the construal remains underexplored in HAII. To better understand PSU in HAII, we conducted an online survey to collect user reflections on interactions with a large language model when its understanding of a situation was thought to be similar to or different from the participant's. Through inductive thematic analysis, we identified eight dimensions comprising PSU in human-AI interactions: fluency, aligned operation, fluidity, outcome satisfaction, contextual awareness, lack of humanlike abilities, computational limits, and suspicion.

Paperid: 2573, https://arxiv.org/pdf/2505.20011.pdf

Abstract:
Human-like agents are an increasingly important topic in games and beyond. Believable non-player characters enhance the gaming experience by improving immersion and providing entertainment. They also offer players the opportunity to engage with AI entities that can function as opponents, teachers, or cooperating partners. Additionally, in games where bots are prohibited -- and even more so in non-game environments -- there is a need for methods capable of identifying whether digital interactions occur with bots or humans. This leads to two fundamental research questions: (1) how to model and implement human-like AI, and (2) how to measure its degree of human likeness. This article offers two contributions. The first one is a survey of the most significant challenges in implementing human-like AI in games (or any virtual environment featuring simulated agents, although this article specifically focuses on games). Thirteen such challenges, both conceptual and technical, are discussed in detail. The second is an empirical study performed in a tactical video game that addresses the research question: "Is it possible to distinguish human players from bots (AI agents) based on empirical data?" A machine-learning approach using a custom deep recurrent convolutional neural network is presented. We hypothesize that the more challenging it is to create human-like AI for a given game, the easier it becomes to develop a method for distinguishing humans from AI-driven players.

Paperid: 2574, https://arxiv.org/pdf/2505.19419.pdf

Abstract:
The quality of training data is critical to the performance of machine learning applications in domains like transportation, healthcare, and robotics. Accurate image labeling, however, often relies on time-consuming, expert-driven methods with limited feedback. This research introduces a sketch-based annotation approach supported by large language models (LLMs) to reduce technical barriers and enhance accessibility. Using a synthetic dataset, we examine how sketch recognition features relate to LLM feedback metrics, aiming to improve the reliability and interpretability of LLM-assisted labeling. We also explore how prompting strategies and sketch variations influence feedback quality. Our main contribution is a sketch-based virtual assistant that simplifies annotation for non-experts and advances LLM-driven labeling tools in terms of scalability, accessibility, and explainability.

Paperid: 2575, https://arxiv.org/pdf/2505.18066.pdf

Abstract:
Despite the growing promise of artificial intelligence (AI) in supporting decision-making across domains, fostering appropriate human reliance on AI remains a critical challenge. In this paper, we investigate the utility of exploring distance-based uncertainty scores for task delegation to AI and describe how these scores can be visualized through embedding representations for human-AI decision-making. After developing an AI-based system for physical stroke rehabilitation assessment, we conducted a study with 19 health professionals and 10 students in medicine/health to understand the effect of exploring distance-based uncertainty scores on users' reliance on AI. Our findings showed that distance-based uncertainty scores outperformed traditional probability-based uncertainty scores in identifying uncertain cases. In addition, after exploring confidence scores for task delegation and reviewing embedding-based visualizations of distance-based uncertainty scores, participants achieved an 8.20% higher rate of correct decisions, a 7.15% higher rate of changing their decisions to correct ones, and a 7.14% lower rate of incorrect changes after reviewing AI outputs than those reviewing probability-based uncertainty scores ($p<0.01$). Our findings highlight the potential of distance-based uncertainty scores to enhance decision accuracy and appropriate reliance on AI while discussing ongoing challenges for human-AI collaborative decision-making.
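
To make the notion of a distance-based uncertainty score concrete, the sketch below uses one common formulation: the Euclidean distance from a sample's embedding to the nearest class centroid, with larger distances treated as more uncertain. The abstract does not specify the paper's exact score, so this formulation, the threshold-based delegation rule, and all names and toy data are assumptions.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(points)
    return [sum(component) / n for component in zip(*points)]


def distance_uncertainty(embedding, centroids):
    """Distance to the nearest class centroid; larger means more uncertain."""
    return min(math.dist(embedding, c) for c in centroids.values())


def delegate_to_ai(embedding, centroids, threshold):
    """Delegate a case to the AI only if its uncertainty is at most `threshold`."""
    return distance_uncertainty(embedding, centroids) <= threshold


# Toy 2-D embeddings for two hypothetical assessment classes.
centroids = {
    "normal": centroid([[0.0, 0.0], [2.0, 0.0]]),
    "compensated": centroid([[10.0, 10.0]]),
}

near_case = [1.0, 1.0]   # close to the "normal" centroid
far_case = [5.0, 5.0]    # far from both centroids

print(delegate_to_ai(near_case, centroids, threshold=2.0))
print(delegate_to_ai(far_case, centroids, threshold=2.0))
```

Unlike a softmax probability, a score of this kind grows for inputs that lie far from everything seen in training, which is one plausible reason such scores could flag uncertain cases more reliably.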

Paperid: 2576, https://arxiv.org/pdf/2505.16954.pdf

Abstract:
Traditional methods for raising awareness of privacy protection often fail to engage users or provide hands-on insights into how privacy vulnerabilities are exploited. To address this, we incorporate an adversarial mechanic in the design of the dialogue-based serious game Cracking Aegis. Leveraging LLMs to simulate natural interactions, the game challenges players to impersonate characters and extract sensitive information from an AI agent, Aegis. A user study (n=22) revealed that players employed diverse deceptive linguistic strategies, including storytelling and emotional rapport, to manipulate Aegis. After playing, players reported connecting in-game scenarios with real-world privacy vulnerabilities, such as phishing and impersonation, and expressed intentions to strengthen privacy control, such as avoiding oversharing personal information with AI systems. This work highlights the potential of LLMs to simulate complex relational interactions in serious games, while demonstrating how an adversarial game strategy provides unique insights for designs for social good, particularly privacy protection.

Paperid: 2577, https://arxiv.org/pdf/2505.16171.pdf

Abstract:
When agents interact with people as part of a team, fairness becomes an important factor. Prior work has proposed fairness metrics based on teammates' capabilities for task allocation within human-agent teams. However, most metrics only consider teammate capabilities from a third-person point of view (POV). In this work, we extend these metrics to include task preferences and consider a first-person POV. We leverage an iterative design method consisting of simulation data and human data to design a task allocation algorithm that balances task efficiency and fairness based on both capabilities and preferences. We first show that these metrics may not align with people's perceived fairness from a first-person POV. In light of this result, we propose a new fairness metric, fair-equity, and the Fair-Efficient Algorithm (FEA). Our findings suggest that an agent teammate who balances efficiency and fairness based on equity will be perceived to be fairer and preferred by human teammates in various human-agent team types. We suggest that the perception of fairness may also depend on a person's POV.

Paperid: 2578, https://arxiv.org/pdf/2505.15440.pdf

Abstract:
Serendipity has been associated with numerous benefits in the context of recommender systems, e.g., increased user satisfaction and consumption of long-tail items. Despite this, serendipity in the context of recommender systems has thus far remained conceptually ambiguous. This conceptual ambiguity has led to inconsistent operationalizations between studies, making it difficult to compare and synthesize findings. In this paper, we conceptualize the user's experience of serendipity. To this effect, we interviewed 17 participants and analyzed the data following the grounded theory paradigm. Based on these interviews, we conceptualize experienced serendipity as "a user experience in which a user unintentionally encounters content that feels fortuitous, refreshing, and enriching". We find that all three components -- fortuitous, refreshing and enriching -- are necessary and together are sufficient to classify a user's experience as serendipitous. However, these components can be satisfied through a variety of conditions. Our conceptualization unifies previous definitions of serendipity within a single framework, resolving inconsistencies by identifying distinct flavors of serendipity. It highlights underexposed flavors, offering new insights into how users experience serendipity in the context of recommender systems. By clarifying the components and conditions of experienced serendipity in recommender systems, this work can guide the design of recommender systems that stimulate experienced serendipity in their users, and lays the groundwork for developing a standardized operationalization of experienced serendipity in its many flavors, enabling more consistent and comparable evaluations.

Paperid: 2579, https://arxiv.org/pdf/2505.14388.pdf

Abstract:
Algorithmic tools are increasingly used in hiring to improve fairness and diversity, often by enforcing constraints such as gender-balanced candidate shortlists. However, we show theoretically and empirically that enforcing equal representation at the shortlist stage does not necessarily translate into more diverse final hires, even when there is no gender bias in the hiring stage. We identify a crucial factor influencing this outcome: the correlation between the algorithm's screening criteria and the human hiring manager's evaluation criteria -- higher correlation leads to lower diversity in final hires. Using a large-scale empirical analysis of nearly 800,000 job applications across multiple technology firms, we find that enforcing equal shortlists yields limited improvements in hire diversity when the algorithmic screening closely mirrors the hiring manager's preferences. We propose a complementary algorithmic approach designed explicitly to diversify shortlists by selecting candidates likely to be overlooked by managers, yet still competitive according to their evaluation criteria. Empirical simulations show that this approach significantly enhances gender diversity in final hires without substantially compromising hire quality. These findings highlight the importance of algorithmic design choices in achieving organizational diversity goals and provide actionable guidance for practitioners implementing fairness-oriented hiring algorithms.

Paperid: 2580, https://arxiv.org/pdf/2505.14377.pdf

Abstract:
Although the integration of artificial intelligence (AI) into everyday tasks improves efficiency and objectivity, it also risks transmitting bias to human decision-making. In this study, we conducted a controlled experiment that simulated hiring decisions to examine how biased AI recommendations - augmented with or without counterfactual explanations - influence human judgment over time. Participants, acting as hiring managers, completed 60 decision trials divided into a baseline phase without AI, followed by a phase with biased (X)AI recommendations (favoring either male or female candidates), and a final post-interaction phase without AI. Our results indicate that the participants followed the AI recommendations 70% of the time when the qualifications of the given candidates were comparable. Yet, only a fraction of participants detected the gender bias (8 out of 294). Crucially, exposure to biased AI altered participants' inherent preferences: in the post-interaction phase, participants' independent decisions aligned with the bias when no counterfactual explanations were provided before, but reversed the bias when explanations were given. Reported trust did not differ significantly across conditions. Confidence varied throughout the study phases after exposure to male-biased AI, indicating nuanced effects of AI bias on decision certainty. Our findings point to the importance of calibrating XAI to avoid unintended behavioral shifts in order to safeguard equitable decision-making and prevent the adoption of algorithmic bias.

Paperid: 2581, https://arxiv.org/pdf/2505.14339.pdf

Abstract:
With the introduction of the Visualization for Communication workshop (VisComm) at IEEE VIS and in light of the COVID-19 pandemic, there has been renewed interest in studying visualization as a medium of communication. However, the characteristics and definition of this line of study tend to vary from paper to paper and person to person. In this work, we examine the 37 papers accepted to VisComm from 2018 through 2022. Using grounded theory, we identify nuances in how VisComm defines visualization, common themes in the work in this area, and a noticeable gap in DEI practices.

Paperid: 2582, https://arxiv.org/pdf/2505.14078.pdf

Abstract:
Understanding which factors influence co-presence in Virtual Reality (VR) could help develop higher-quality social interactions, or social interactions that generate sensations, emotions, and feelings similar to those generated during face-to-face interactions. Co-presence has been studied since the beginning of VR; however, no consensus has emerged on which factors influence it, beyond agreement on its definition as "being there together" inside the Virtual Environment. In this paper, we introduce the Koinos method to explain social interactions in VR through communication models, (i) theoretically, and (ii) on two VR experiments that change the virtual partner's social and physical representations. These analyses lead us to propose an equation to predict and help manage the sense of co-presence in VR.

Paperid: 2583, https://arxiv.org/pdf/2505.14074.pdf

Abstract:
Understanding how neural activity encodes speech and language production is a fundamental challenge in neuroscience and artificial intelligence. This study investigates whether embeddings from large-scale, self-supervised language and speech models can effectively reconstruct high-gamma neural activity characteristics, key indicators of cortical processing, recorded during speech production. We leverage pre-trained embeddings from deep learning models trained on linguistic and acoustic data to represent high-level speech features and map them onto these high-gamma signals. We analyze the extent to which these embeddings preserve the spatio-temporal dynamics of brain activity. Reconstructed neural signals are evaluated against high-gamma ground-truth activity using correlation metrics and signal reconstruction quality assessments. The results indicate that high-gamma activity can be effectively reconstructed using large language and speech model embeddings in all study participants, generating Pearson's correlation coefficients ranging from 0.79 to 0.99.
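
As an illustration of the correlation metric mentioned in this abstract, the following is a minimal sketch of Pearson's correlation coefficient between a ground-truth signal and a reconstructed one. The function name `pearson_r` and the toy signals are assumptions made for this example; they are not taken from the paper, whose actual evaluation pipeline is not described here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations from the mean for each signal.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy signals (illustrative only, not data from the paper).
ground_truth = [0.1, 0.4, 0.9, 0.7, 0.2]
reconstructed = [0.2, 0.5, 1.0, 0.6, 0.1]
r = pearson_r(ground_truth, reconstructed)
print(f"Pearson r = {r:.3f}")
```

A value near 1 indicates that the reconstructed signal rises and falls with the ground truth; the coefficient is insensitive to a constant offset or positive scaling between the two signals.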

Paperid: 2584, https://arxiv.org/pdf/2505.13953.pdf

Abstract:
Humans have always dreamed of possessing superpowers, and the rapid development of AI-based features promises to bring these dreams (closer) to reality. However, these advancements come with significant risks. This paper advocates for challenging existing methods and approaches in design and evaluation for more responsible AI. We stimulate reflection through a futuristic user journey illustrating the AI-driven life of Edmund in 2035. Subsequently, we discuss four AI-based superpowers: extended perception, cognitive offloading, externalized memory, and enhanced presence. We then discuss implications for HCI and AI, emphasizing the need for preserving intrinsic human superpowers, identifying meaningful use cases for AI, and evaluating AI's impact on human abilities. This paper advocates for responsible and reflective AI integration and proposes a pathway towards the idea of a Human Flourishing Benchmark.

Paperid: 2585, https://arxiv.org/pdf/2505.13773.pdf

Abstract:
We compare three methods of familiarizing a human with an artificial intelligence (AI) teammate ("agent") prior to operation in a collaborative, fast-paced intelligence, surveillance, and reconnaissance (ISR) environment. In a between-subjects user study (n=60), participants either read documentation about the agent, trained alongside the agent prior to the mission, or were given no familiarization. Results showed that the most valuable information about the agent included details of its decision-making algorithms and its relative strengths and weaknesses compared to the human. This information allowed the familiarization groups to form sophisticated team strategies more quickly than the control group. Documentation-based familiarization led to the fastest adoption of these strategies, but also biased participants towards risk-averse behavior that prevented high scores. Participants familiarized through direct interaction were able to infer much of the same information through observation, and were more willing to take risks and experiment with different control modes, but reported weaker understanding of the agent's internal processes. Significant differences were seen between individual participants' risk tolerance and methods of AI interaction, which should be considered when designing human-AI control interfaces. Based on our findings, we recommend a human-AI team familiarization method that combines AI documentation, structured in-situ training, and exploratory interaction.

Paperid: 2586, https://arxiv.org/pdf/2505.13688.pdf

Abstract:
Turn-taking prediction is crucial for seamless interactions. This study introduces a novel, lightweight framework for accurate turn-taking prediction in triadic conversations without relying on computationally intensive methods. Unlike prior approaches that either disregard gaze or treat it as a passive signal, our model integrates gaze with speaker localization, structuring it within a spatial constraint to transform it into a reliable predictive cue. Leveraging egocentric behavioral cues, our experiments demonstrate that incorporating gaze data from a single user significantly improves prediction performance, while gaze data from multiple users further enhances it by capturing richer conversational dynamics. This study presents a lightweight and privacy-conscious approach to support adaptive, directional sound control, enhancing speech intelligibility in noisy environments, particularly for hearing assistance in smart glasses.

Paperid: 2587, https://arxiv.org/pdf/2505.13612.pdf

Abstract:
We explore the integration of multisensory elements in virtual reality reconstructions of historical spaces through a case study of the Virtual Vauxhall Gardens project. While visual and auditory components have become standard in digital heritage experiences, the addition of olfactory stimuli remains underexplored, despite its powerful connection to memory and emotional engagement. This research investigates how multisensory experiences involving olfaction can be effectively integrated into VR reconstructions of historical spaces to enhance presence and engagement with cultural heritage. In the context of a VR reconstruction of London's eighteenth-century Vauxhall Pleasure Gardens, we developed a networked portable olfactory display capable of synchronizing specific scents with visual and auditory elements at pivotal moments in the virtual experience. Our evaluation methodology assesses both technical implementation and user experience, measuring presence and usability metrics across diverse participant groups. Our results show that integrating synchronized olfactory stimuli into the VR experience can enhance user engagement and be perceived positively, contributing to a unique and immersive encounter with historical settings. While presence questionnaires indicated a strong sense of auditory presence and control, with other sensory factors rated moderately, user experience of attractiveness was exceptionally high; qualitative feedback suggested heightened sensory awareness and engagement influenced by the inclusion and anticipation of smell. Our results suggest that evaluating multisensory VR heritage experiences requires a nuanced approach, as standard usability metrics may be ill-suited and 'realism' might be less critical than creating an evocative, historically informed, and emotionally resonant experience......

Paperid: 2588, https://arxiv.org/pdf/2505.13472.pdf

Abstract:
In parallel to the ever-growing usage of mechanized proofs in diverse areas of mathematics and computer science, proof assistants are used more and more for education. This paper surveys previous work related to the use of proof assistants for (mostly undergraduate) teaching. This includes works where the authors report on their experiments using proof assistants to teach logic, mathematics or computer science, as well as designs or adaptations of proof assistants for teaching. We provide an overview of both tutoring systems that have been designed for teaching proof and proving, and general-purpose proof assistants that have been adapted for education by adding user interfaces and/or dedicated input or output languages.

Paperid: 2589, https://arxiv.org/pdf/2505.13246.pdf

Abstract:
The exponential growth of scientific literature presents significant challenges for researchers navigating the complex knowledge landscape. We propose "Agentic Publications", a novel LLM-driven framework complementing traditional publishing by transforming papers into interactive knowledge systems. Our architecture integrates structured data with unstructured content through retrieval-augmented generation and multi-agent verification. The framework offers interfaces for both humans and machines, combining narrative explanations with machine-readable outputs while addressing ethical considerations through automated validation and transparent governance. Key features include continuous knowledge updates, automatic integration of new findings, and customizable detail levels. Our proof-of-concept demonstrates multilingual interaction, API accessibility, and structured knowledge representation through vector databases, knowledge graphs, and verification agents. This approach enhances scientific communication across disciplines, improving efficiency and collaboration while preserving traditional publishing pathways, particularly valuable for interdisciplinary fields where knowledge integration remains challenging.

Paperid: 2590, https://arxiv.org/pdf/2505.13044.pdf

Abstract:
Large language models (LLMs) have advanced the field of artificial intelligence (AI) and are a powerful enabler for interactive systems. However, they still face challenges in long-term interactions that require adaptation towards the user as well as contextual knowledge and understanding of the ever-changing environment. To overcome these challenges, holistic memory modeling is required to efficiently retrieve and store relevant information across interaction sessions for suitable responses. Cognitive AI, which aims to simulate the human thought process in a computerized model, highlights interesting aspects, such as thoughts, memory mechanisms, and decision-making, that can contribute towards improved memory modeling for LLMs. Inspired by these cognitive AI principles, we propose our memory framework CAIM. CAIM consists of three modules: (1) the Memory Controller, which acts as the central decision unit; (2) the Memory Retrieval, which filters relevant data for interaction upon request; and (3) the Post-Thinking, which maintains the memory storage. We compare CAIM against existing approaches, focusing on metrics such as retrieval accuracy, response correctness, contextual coherence, and memory storage. The results demonstrate that CAIM outperforms baseline frameworks across different metrics, highlighting its context-awareness and potential to improve long-term human-AI interactions.

Paperid: 2591, https://arxiv.org/pdf/2505.12525.pdf

Abstract:
To reproduce natural standing-up motion, recent studies have emphasized the importance of coordination between the assisting robot and the human. However, many non-wearable assistive devices have struggled to replicate natural motion trajectories. While wearable devices offer better coordination with the human body, they present challenges in completely isolating mechanical and electrical hazards. To address this, we developed a novel standing-assist robot that integrates features of both wearable and non-wearable systems, aiming to achieve high coordination while maintaining safety. The device employs a four-link mechanism aligned with the human joint structure, designed to reproduce the S-shaped trajectory of the hip and the arc trajectory of the knee during natural standing-up motion. Subject-specific trajectory data were obtained using a gyroscope, and the link lengths were determined to drive the seat along the optimal path. A feedforward speed control using a stepping motor was implemented, and the reproducibility of the trajectory was evaluated based on the geometric constraints of the mechanism. A load-bearing experiment with weights fixed to the seat was conducted to assess the trajectory accuracy under different conditions. Results showed that the reproduction errors for the hip and knee trajectories remained within approximately 4 percent of the seat's total displacement, demonstrating high fidelity to the target paths. In addition, durability testing, thermal safety evaluation, and risk assessment confirmed the reliability and safety of the system for indoor use. These findings suggest that the proposed design offers a promising approach for developing assistive technologies that adapt to individual physical characteristics, with potential applications in elderly care and rehabilitation.

Paperid: 2592, https://arxiv.org/pdf/2505.12183.pdf

Abstract:
The widespread integration of Large Language Models (LLMs) across various sectors has highlighted the need for empirical research to understand their biases, thought patterns, and societal implications to ensure ethical and effective use. In this study, we propose a novel framework for evaluating LLMs, focusing on uncovering their ideological biases through a quantitative analysis of 436 binary-choice questions, many of which have no definitive answer. Applying our framework to ChatGPT and Gemini, we found that while LLMs generally maintain consistent opinions on many topics, their ideologies differ across models and languages. Notably, ChatGPT exhibits a tendency to change its opinion to match the questioner's opinion. Both models also exhibited problematic biases and made unethical or unfair claims, which might have negative societal impacts. These results underscore the importance of addressing both ideological and ethical considerations when evaluating LLMs. The proposed framework offers a flexible, quantitative method for assessing LLM behavior, providing valuable insights for the development of more socially aligned AI systems.

Paperid: 2593, https://arxiv.org/pdf/2505.12080.pdf

Abstract:
Dementia is an overall decline in memory and cognitive skills severe enough to reduce an older adult's ability to perform everyday activities. There is an increasing need for accessible cognitive-training technologies to slow cognitive decline. With the ability to provide instant feedback and assistance, social robotic systems have been proven effective in enhancing learning abilities across various age groups. This study focuses on the design of TrainBo, an interactive robot-assisted scenario training system grounded in self-determination theory; it derives design requirements through formative and formal studies and also evaluates the system's usability. A pilot test was conducted with seven older adults with dementia in an elderly care center in Hong Kong over four weeks. Our findings show that older adults with dementia improved in behavioural engagement, emotional engagement, and intrinsic motivation after using TrainBo. These findings can provide valuable insights into the development of more captivating interactive robots for extensive training purposes.

Paperid: 2594, https://arxiv.org/pdf/2505.11417.pdf

Abstract:
This paper introduces a novel dataset and evaluation benchmark designed to assess and improve small language models deployable on edge devices, with a focus on user profiling from multi-session natural language interactions in smart home environments. At the core of the dataset are structured user profiles, each defined by a set of routines - context-triggered, repeatable patterns of behavior that govern how users interact with their home systems. Using these profiles as input, a large language model (LLM) generates corresponding interaction sessions that simulate realistic, diverse, and context-aware dialogues between users and their devices. The primary task supported by this dataset is profile reconstruction: inferring user routines and preferences solely from interaction history. To assess how well current models can perform this task under realistic conditions, we benchmarked several state-of-the-art compact language models and compared their performance against large foundation models. Our results show that while small models demonstrate some capability in reconstructing profiles, they still fall significantly short of large models in accurately capturing user behavior. This performance gap poses a major challenge - particularly because on-device processing offers critical advantages, such as preserving user privacy, minimizing latency, and enabling personalized experiences without reliance on the cloud. By providing a realistic, structured testbed for developing and evaluating behavioral modeling under these constraints, our dataset represents a key step toward enabling intelligent, privacy-respecting AI systems that learn and adapt directly on user-owned devices.

Paperid: 2595, https://arxiv.org/pdf/2505.11406.pdf

Abstract:
As AI tools increasingly shape how we write, they may also quietly reshape how we perceive ourselves. This paper explores the psychological impact of co-writing with AI on people's locus of control. Through an empirical study with 462 participants, we found that employment status plays a critical role in shaping users' reliance on AI and their locus of control. Our results show that employed participants displayed higher reliance on AI and a shift toward internal control, while unemployed users tended to experience a reduction in personal agency. Through quantitative results and qualitative observations, this study opens a broader conversation about AI's role in shaping personal agency and identity.

Paperid: 2596, https://arxiv.org/pdf/2505.10954.pdf

Abstract:
Preferential Bayesian optimization (PBO) is a variant of Bayesian optimization that observes relative preferences (e.g., pairwise comparisons) instead of direct objective values, making it especially suitable for human-in-the-loop scenarios. However, real-world optimization tasks often involve inequality constraints, which existing PBO methods have not yet addressed. To fill this gap, we propose constrained preferential Bayesian optimization (CPBO), an extension of PBO that incorporates inequality constraints for the first time. Specifically, we present a novel acquisition function for this purpose. Our technical evaluation shows that our CPBO method successfully identifies optimal solutions by focusing on exploring feasible regions. As a practical application, we also present a designer-in-the-loop system for banner ad design using CPBO, where the objective is the designer's subjective preference, and the constraint ensures a target predicted click-through rate. We conducted a user study with professional ad designers, demonstrating the potential benefits of our approach in guiding creative design under real-world constraints.
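
To make the setting in this abstract concrete, here is a minimal sketch of the two ingredients it names: a pairwise-preference observation model and an inequality constraint. It assumes the standard probit preference likelihood commonly used in preferential Bayesian optimization and a simple threshold constraint on predicted click-through rate. It does not implement the paper's acquisition function, which the abstract does not specify. All names, candidate values, and the target rate are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def preference_probability(f_a, f_b, noise_std=1.0):
    """P(a is preferred over b) under a probit model.

    Assumes each latent utility is observed with independent Gaussian
    noise of standard deviation `noise_std`, so the difference has
    standard deviation sqrt(2) * noise_std.
    """
    return std_normal_cdf((f_a - f_b) / (math.sqrt(2.0) * noise_std))


def is_feasible(constraint_value, threshold):
    """Inequality constraint of the form g(x) >= threshold."""
    return constraint_value >= threshold


# Hypothetical candidates: name -> (latent utility, predicted CTR).
candidates = {"A": (0.8, 0.031), "B": (1.2, 0.018), "C": (0.5, 0.040)}
target_ctr = 0.025  # assumed target, for illustration only

# Keep only designs that satisfy the CTR constraint.
feasible = [name for name, (_, ctr) in candidates.items()
            if is_feasible(ctr, target_ctr)]

# Probability that a designer would prefer A over C under the model.
p_a_over_c = preference_probability(candidates["A"][0], candidates["C"][0])
print(feasible, round(p_a_over_c, 3))
```

In this toy setup, design B has the highest latent utility but violates the constraint, which is the kind of situation a constrained method has to handle.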

Paperid: 2597, https://arxiv.org/pdf/2505.10869.pdf

Abstract:
This study focuses on the velocity patterns of various body parts during walking and proposes a method for evaluating gait symmetry. Traditional motion analysis studies have assessed gait symmetry based on differences in electromyographic (EMG) signals or acceleration between the left and right sides. In contrast, this paper models intersegmental coordination using an LTI system and proposes a dissimilarity metric to evaluate symmetry. The method was tested on five subjects with both symmetric and asymmetric gait.

Paperid: 2598, https://arxiv.org/pdf/2505.10786.pdf

Abstract:
As a method to connect the human brain and external devices, brain-computer interfaces (BCIs) are receiving extensive research attention. Recently, the integration of communication theory with BCI has emerged as a popular trend, offering potential to enhance system performance and shape next-generation communications. A key challenge in this field is modeling the brain wireless communication channel between intracranial electrocorticography (ECoG) emitting neurons and extracranial electroencephalography (EEG) receiving electrodes. However, the complex physiology of the brain challenges the application of traditional channel modeling methods, leaving relevant research in its infancy. To address this gap, we propose a frequency-division multiple-input multiple-output (MIMO) estimation framework leveraging simultaneous macaque EEG and ECoG recordings, while employing neurophysiology-informed regularization to suppress noise interference. This approach reveals profound similarities between neural signal propagation and multi-antenna communication systems. Experimental results show improved estimation accuracy over conventional methods while highlighting a trade-off between frequency resolution and temporal stability determined by signal duration. This work establishes a conceptual bridge between neural interfacing and communication theory, accelerating synergistic developments in both fields.

Paperid: 2599, https://arxiv.org/pdf/2505.10742.pdf

Abstract:
Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration. We present an evaluation framework that decomposes real-world tasks into interdependent subtasks, letting us track both LLM performance and users' strategies across a dialogue. Complementing this framework, we develop a suite of metrics, including a composite usage derived from semantic similarity, word overlap, and numerical matches; structural coherence; intra-turn diversity; and a novel measure of the "information frontier" reflecting the alignment between AI outputs and users' working knowledge. We demonstrate our methodology in a financial valuation task that mirrors real-world complexity. Our empirical findings reveal that while greater integration of LLM-generated content generally enhances output quality, its benefits are moderated by factors such as response incoherence, excessive subtask diversity, and the distance of provided information from users' existing knowledge. These results suggest that proactive dialogue strategies designed to inject novelty may inadvertently undermine task performance. Our work thus advances a more holistic evaluation of human-AI collaboration, offering both a robust methodological framework and actionable insights for developing more effective AI-augmented work processes.
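
The composite usage metric described in this abstract combines semantic similarity, word overlap, and numerical matches. The sketch below shows one way such a score could be assembled. It is an assumption-laden illustration, not the paper's definition: bag-of-words cosine similarity stands in for embedding-based semantic similarity, word overlap is taken as Jaccard similarity, the three parts are weighted equally, and every function name is hypothetical.

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def tokens(text):
    """Lowercased alphabetic tokens."""
    return WORD_RE.findall(text.lower())


def word_overlap(ai_text, user_text):
    """Jaccard overlap between the two word sets."""
    a, b = set(tokens(ai_text)), set(tokens(user_text))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def numeric_match(ai_text, user_text):
    """Fraction of distinct numbers in the AI text that reappear in the user text."""
    ai_nums = set(NUMBER_RE.findall(ai_text))
    if not ai_nums:
        return 0.0
    user_nums = set(NUMBER_RE.findall(user_text))
    return len(ai_nums & user_nums) / len(ai_nums)


def cosine_bow(ai_text, user_text):
    """Bag-of-words cosine similarity (a stand-in for semantic similarity)."""
    a, b = Counter(tokens(ai_text)), Counter(tokens(user_text))
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def composite_usage(ai_text, user_text, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mix of the three parts; equal weights are an assumption."""
    parts = (cosine_bow(ai_text, user_text),
             word_overlap(ai_text, user_text),
             numeric_match(ai_text, user_text))
    return sum(w * p for w, p in zip(weights, parts))


# Hypothetical AI response and user write-up from a valuation task.
ai_response = "A discount rate of 8.5 percent gives a value of 120 million."
user_answer = "We used a discount rate of 8.5 percent in the final model."
score = composite_usage(ai_response, user_answer)
print(f"composite usage = {score:.3f}")
```

A higher score suggests that more of the AI-generated content was carried into the user's output; a real implementation would likely replace `cosine_bow` with a sentence-embedding model.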

Paperid: 2600, https://arxiv.org/pdf/2505.10695.pdf

Abstract:
As autonomous systems become integral to various industries, effective strategies for fault handling are essential to ensure reliability and efficiency. Transfer of Control (ToC), a traditional approach for interrupting automated processes during faults, is often triggered unnecessarily in non-critical situations. To address this, we propose a data-driven method that uses human interaction data to train AI models capable of preemptively identifying and addressing issues or assisting users in resolution. Using an interactive tool simulating an industrial vacuum cleaner, we collected data and developed an LSTM-based model to predict user behavior. Our findings reveal that even data from non-experts can effectively train models to reduce unnecessary ToC events, enhancing the system's robustness. This approach highlights the potential of AI to learn directly from human problem-solving behaviors, complementing sensor data to improve industrial automation and human-AI collaboration.

Paperid: 2601, https://arxiv.org/pdf/2505.10686.pdf

Abstract:
This paper introduces NeoLightning, a modern reinterpretation of the Buchla Lightning. NeoLightning preserves the innovative spirit of Don Buchla's "Buchla Lightning" (introduced in the 1990s) while making its gesture-based interaction accessible to contemporary users. While the original Buchla Lightning and many other historical instruments were groundbreaking in their time, they are now largely unsupported, limiting user interaction to indirect experiences. To address this, NeoLightning leverages MediaPipe for deep learning-based gesture recognition and employs Max/MSP and Processing for real-time multimedia processing. The redesigned system offers precise, low-latency gesture recognition and immersive 3D interaction. By merging the creative spirit of the original Lightning with modern advancements, NeoLightning redefines gesture-based musical interaction, expanding possibilities for expressive performance and interactive sound design.

Paperid: 2602, https://arxiv.org/pdf/2505.10648.pdf

Abstract:
Decades of research on interactive electrical muscle stimulation (EMS) have revealed its promise as a wearable interface for physical assistance: EMS directly demonstrates movements through the user's body (e.g., shaking a spray can before painting). However, interactive EMS systems are highly specialized because their feedback is (1) fixed (e.g., one program executes spray-can instructions, another executes piano instructions) and (2) non-contextual (e.g., using a spray can while cooking likely involves cooking oil, not paint, and thus shaking is unnecessary). To address this, we explored a more flexible approach and engineered a system that generates muscle-stimulation instructions given the user's context. Through our examples, we show that such a system is flexible: it enables unprecedented EMS interactions (e.g., opening a child-proof pill bottle cap) but also replicates existing systems (e.g., shaking a spray can), all without requiring task-specific programming. To achieve this, our system takes in the user's spoken requests and images from their point of view. It uses computer vision (e.g., to detect objects and handedness) and large language models (e.g., to reason about objects and situations) to generate textual instructions. Finally, these instructions are constrained by biomechanical knowledge (e.g., joint limits, kinematic chain, EMS capabilities) to produce suitable muscle-stimulation gestures. We believe our concept marks a shift toward more general-purpose EMS interfaces, enabling more flexible and context-aware assistance.

Paperid: 2603, https://arxiv.org/pdf/2505.10588.pdf

Abstract:
This research offers a unique evaluation of how AI systems interpret the digital language of Generation Alpha (Gen Alpha, born 2010-2024). As the first cohort raised alongside AI, Gen Alpha faces new forms of online risk due to immersive digital engagement and a growing mismatch between their evolving communication and existing safety tools. Their distinct language, shaped by gaming, memes, and AI-driven trends, often conceals harmful interactions from both human moderators and automated systems. We assess four leading AI models (GPT-4, Claude, Gemini, and Llama 3) on their ability to detect masked harassment and manipulation within Gen Alpha discourse. Using a dataset of 100 recent expressions from gaming platforms, social media, and video content, the study reveals critical comprehension failures with direct implications for online safety. This work contributes: (1) a first-of-its-kind dataset capturing Gen Alpha expressions; (2) a framework to improve AI moderation systems for youth protection; (3) a multi-perspective evaluation including AI systems, human moderators, and parents, with direct input from Gen Alpha co-researchers; and (4) an analysis of how linguistic divergence increases youth vulnerability. Findings highlight the urgent need to redesign safety systems attuned to youth communication, especially given Gen Alpha's reluctance to seek help when adults fail to understand their digital world. This study combines the insight of a Gen Alpha researcher with systematic academic analysis to address critical digital safety challenges.

Paperid: 2604, https://arxiv.org/pdf/2505.10300.pdf

Abstract:
Responsible AI (RAI) efforts increasingly emphasize the importance of addressing potential harms early in the AI development lifecycle through social-technical lenses. However, in cross-functional industry teams, this work is often stalled by a persistent knowledge handoff challenge: the difficulty of transferring high-level, early-stage technical design rationales from technical experts to non-technical or user-facing roles for ethical evaluation and harm identification. Through literature review and a co-design study with 8 practitioners, we unpack how this challenge manifests -- technical design choices are rarely handed off in ways that support meaningful engagement by non-technical roles; collaborative workflows lack shared, visual structures to support mutual understanding; and non-technical practitioners are left without scaffolds for systematic harm evaluation. Existing tools like JIRA or Google Docs, while useful for product tracking, are ill-suited for supporting joint harm identification across roles, often requiring significant extra effort to align understanding. To address this, we developed AI LEGO, a web-based prototype that supports cross-functional AI practitioners in effectively facilitating knowledge handoff and identifying harmful design choices in the early design stages. Technical roles use interactive blocks to draft development plans, while non-technical roles engage with those blocks through stage-specific checklists and LLM-driven persona simulations to surface potential harms. In a study with 18 cross-functional practitioners, AI LEGO increased the volume and likelihood of harms identified compared to baseline worksheets. Participants found that its modular structure and persona prompts made harm identification more accessible, fostering clearer and more collaborative RAI practices in early design.

Paperid: 2605, https://arxiv.org/pdf/2505.09877.pdf

Abstract:
Over the past decade, data provided by digital platforms has informed substantial research in HCI to understand online human interaction and communication. Following the closure of major social media APIs that previously provided free access to large-scale data (the "post-API age"), emerging data access programs required by the European Union's Digital Services Act (DSA) have sparked optimism about increased platform transparency and renewed opportunities for comprehensive research on digital platforms, leading to the "post-post-API age." However, it remains unclear whether platforms provide adequate data access in practice. To assess how platforms make data available under the DSA, we conducted a comprehensive survey followed by in-depth interviews with 19 researchers to understand their experiences with data access in this new era. Our findings reveal significant challenges in accessing social media data, with researchers facing multiple barriers including complex API application processes, difficulties obtaining credentials, and limited API usability. These challenges have exacerbated existing institutional, regional, and financial inequities in data access. Based on these insights, we provide actionable recommendations for platforms, researchers, and policymakers to foster more equitable and effective data access, while encouraging broader dialogue within the CSCW community around interdisciplinary and multi-stakeholder solutions.

Paperid: 2606, https://arxiv.org/pdf/2505.09724.pdf

Abstract:
Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.
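
The intercoder-agreement step mentioned above can be illustrated with Cohen's kappa, a standard chance-corrected agreement statistic. This is a minimal sketch, not the tutorial's own procedure: the abstract does not say which agreement measure is used, and the function name `cohens_kappa`, the category labels, and the sample codings are all hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same category by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if each coder assigned categories independently
    # according to their own marginal frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codings of six personal goals by a researcher and an LLM.
human = ["health", "career", "family", "health", "career", "family"]
llm = ["health", "career", "family", "career", "career", "family"]
print(round(cohens_kappa(human, llm), 3))  # 0.75
```

Kappa is only one option; with more than two coders, or with missing codes, a different statistic would be needed.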

Paperid: 2607, https://arxiv.org/pdf/2505.09526.pdf

Abstract:
Misinformation has become a widespread issue in the 21st century, impacting numerous areas of society and underscoring the need for effective intervention strategies. Among these strategies, user-centered interventions, such as warning systems, have shown promise in reducing the spread of misinformation. Many studies have used various metrics to evaluate the effectiveness of these warning interventions. However, no systematic review has thoroughly examined these metrics in all studies. This paper provides a comprehensive review of existing metrics for assessing the effectiveness of misinformation warnings, categorizing them into four main groups: behavioral impact, trust and credulity, usability, and cognitive and psychological effects. Through this review, we identify critical challenges in measuring the effectiveness of misinformation warnings, including inconsistent use of cognitive and attitudinal metrics, the lack of standardized metrics for affective and emotional impact, variations in user trust, and the need for more inclusive warning designs. We present an overview of these metrics and propose areas for future research.

Paperid: 2608, https://arxiv.org/pdf/2505.09402.pdf

Abstract:
Measurement of pressure distribution applied to a fingertip is crucial for the teleoperation of robots and human computer interface. Previous studies have acquired pressure distribution by affixing a sensor array to the fingertip or by optically recording the deformation of an object. However, these existing methods inhibit the fingertip from directly contacting the texture, and the pressure applied to the fingertip is measured indirectly. In this study, we propose a method to measure pressure distribution by directly touching a transparent object, focusing on the change in skin color induced by the applied pressure, caused by blood flow. We evaluated the relationship between pressure and skin color change when local pressure is applied, and found a correlation between the pressure and the color change. However, the contact area and the color change area did not align perfectly. We further explored the factor causing the spatial non-uniformity of the color change, by accounting for the stress distribution using finite element analysis. These results suggest that the proposed measurement method can be utilized to measure the internal stress distribution, and it is anticipated to serve as a simple sensor in the field of human computer interface.

Paperid: 2609, https://arxiv.org/pdf/2505.09376.pdf

Abstract:
We propose AfforDance, an augmented reality (AR)-based dance learning system that generates personalized learning content and enhances learning through visual affordances. Our system converts user-selected dance videos into interactive learning experiences by integrating 3D reference avatars, audio synchronization, and adaptive visual cues that guide movement execution. This work contributes to personalized dance education by offering an adaptable, user-centered learning interface.

Paperid: 2610, https://arxiv.org/pdf/2505.09283.pdf

Abstract:
This paper provides an in-depth examination of the concept of semantic diffusion as a complementary instrument to large language models (LLMs) for design applications. Conventional LLMs and diffusion models fail to induce a convergent, iterative refinement process: each invocation of the diffusion mechanism spawns a new stochastic cycle, so successive outputs do not relate to prior ones and convergence toward a desired design is not guaranteed. The proposed hybrid framework - "LLM + semantic diffusion" - resolves this limitation by enforcing an approximately convergent search procedure, thereby formally addressing the problem of localized design refinement.

Paperid: 2611, https://arxiv.org/pdf/2505.09115.pdf

Abstract:
Advance Care Planning (ACP) allows individuals to specify their preferred end-of-life life-sustaining treatments before they become incapacitated by injury or terminal illness (e.g., coma, cancer, dementia). While online ACP offers high accessibility, it lacks key benefits of clinical consultations, including personalized value exploration and immediate clarification of decision consequences. To bridge this gap, we conducted two formative studies: 1) we shadowed and interviewed 3 ACP teams consisting of physicians, nurses, and social workers (18 patients total), and 2) we interviewed 14 users of ACP websites. Building on these insights, we designed PreCare in collaboration with 6 ACP professionals. PreCare is a website with 3 AI-driven assistants designed to guide users through exploring personal values, gaining ACP knowledge, and supporting informed decision-making. A usability study (n=12) showed that PreCare achieved a System Usability Scale (SUS) rating of excellent. A comparative evaluation (n=12) showed that PreCare's AI assistants significantly improved exploration of personal values, knowledge, and decisional confidence, and that PreCare was preferred by 92% of participants.

Paperid: 2612, https://arxiv.org/pdf/2505.09047.pdf

Abstract:
Head-worn displays for everyday wear in the form of regular eyeglasses are technically feasible with recent advances in waveguide technology. One major design decision is determining where in the user's visual field to position the display. Centering the display in the principal point of gaze (PPOG) allows the user to switch attentional focus between the virtual and real images quickly, and best performance often occurs when the display is centered in PPOG or is centered vertically below PPOG. However, these positions are often undesirable in that they are considered interruptive or are associated with negative social perceptions by users. Offsetting the virtual image may be preferred when tasks involve driving, walking, or social interaction. This paper consolidates findings from recent studies on monocular optical see-through HWDs (OST-HWDs), focusing on potential for interruption, comfort, performance, and social perception. For text-based tasks, which serve as a proxy for many monocular OST-HWD tasks, we recommend a 15° horizontal field of view (FOV) with the virtual image in the right lens vertically centered but offset to +8.7° to +23.7° toward the ear. Glanceable content can be offset up to +30° for short interactions.

Paperid: 2613, https://arxiv.org/pdf/2505.08691.pdf

Abstract:
Analyzing how the publication records of scientists and research groups have evolved over the years is crucial for assessing their expertise since it can support the management of academic environments by assisting with career planning and evaluation. We introduce VizCV, a novel web-based end-to-end visual analytics framework that enables the interactive exploration of researchers' scientific trajectories. It incorporates AI-assisted analysis and supports automated reporting of career evolution. Our system aims to model career progression through three key dimensions: a) research topic evolution to detect and visualize shifts in scholarly focus over time, b) publication record and the corresponding impact, c) collaboration dynamics depicting the growth and transformation of a researcher's co-authorship network. AI-driven insights provide automated explanations of career transitions, detecting significant shifts in research direction, impact surges, or collaboration expansions. The system also supports comparative analysis between researchers, allowing users to compare topic trajectories and impact growth. Our interactive, multi-tab and multiview system allows for the exploratory analysis of career milestones under different perspectives, such as the most impactful articles, emerging research themes, or a detailed analysis of the researcher's contribution to a subfield. The key contributions include AI/ML techniques for: a) topic analysis, b) dimensionality reduction for visualizing patterns and trends, c) the interactive creation of textual descriptions of facets of data, including key indicators, through configurable prompt generation and large language models, to help users understand the career development of individuals or groups.

Paperid: 2614, https://arxiv.org/pdf/2505.08666.pdf

Abstract:
This paper introduces Claycode, a novel 2D scannable code designed for extensive stylization and deformation. Unlike traditional matrix-based codes (e.g., QR codes), Claycodes encode their message in a tree structure. During the encoding process, bits are mapped into a topology tree, which is then depicted as a nesting of color regions drawn within the boundaries of a target polygon shape. When decoding, Claycodes are extracted and interpreted in real-time from a camera stream. We detail the end-to-end pipeline and show that Claycodes allow for extensive stylization without compromising their functionality. We then empirically demonstrate Claycode's high tolerance to heavy deformations, outperforming traditional 2D scannable codes in scenarios where they typically fail.

Paperid: 2615, https://arxiv.org/pdf/2505.08064.pdf

Abstract:
It is well recognised that ensuring fair AI systems is a complex sociotechnical challenge, which requires careful deliberation and continuous oversight across all stages of a system's lifecycle, from defining requirements to model deployment and deprovisioning. Dynamic argument-based assurance cases, which present structured arguments supported by evidence, have emerged as a systematic approach to evaluating and mitigating safety risks and hazards in AI-enabled system development and have also been extended to deal with broader normative goals such as fairness and explainability. This paper introduces a systems-engineering-driven framework, supported by software tooling, to operationalise a dynamic approach to argument-based assurance in two stages. In the first stage, during the requirements planning phase, a multi-disciplinary and multi-stakeholder team define goals and claims to be established (and evidenced) by conducting a comprehensive fairness governance process. In the second stage, a continuous monitoring interface gathers evidence from existing artefacts (e.g. metrics from automated tests), such as model, data, and use case documentation, to support these arguments dynamically. The framework's effectiveness is demonstrated through an illustrative case study in finance, with a focus on supporting fairness-related arguments.

Paperid: 2616, https://arxiv.org/pdf/2505.07884.pdf

Abstract:
Named Entity Recognition (NER) is crucial for various natural language processing applications, including information extraction, machine translation, and sentiment analysis. Despite the ever-increasing interest in African languages within computational linguistics, existing NER systems focus mainly on English, European, and a few other global languages, leaving a significant gap for under-resourced languages. This research presents the development of WAZOBIA-NER, a system tailored for the three most prominent Nigerian languages: Hausa, Yoruba, and Igbo. The research begins with a comprehensive compilation of annotated datasets for each language, addressing data scarcity and linguistic diversity challenges. Exploring a state-of-the-art machine learning technique, Conditional Random Fields (CRF), and deep learning models such as Bidirectional Long Short-Term Memory (BiLSTM), Bidirectional Encoder Representations from Transformers (BERT), and fine-tuning with a Recurrent Neural Network (RNN), the study evaluates the effectiveness of these approaches in recognizing three entities: persons, organizations, and locations. The system utilizes optical character recognition (OCR) technology to convert textual images into machine-readable text, thereby enabling the Wazobia system to accept both input text and textual images for extraction purposes. The system achieved a performance of 0.9511 in precision, 0.9400 in recall, 0.9564 in F1-score, and 0.9301 in accuracy. The model's evaluation was conducted across three languages, with precision, recall, F1-score, and accuracy as key assessment metrics. The Wazobia-NER system demonstrates that it is feasible to build robust NER tools for under-resourced African languages using current NLP frameworks and transfer learning.

Paperid: 2617, https://arxiv.org/pdf/2505.07354.pdf

Abstract:
Background & Objective: Emotional states and stress distort time perception, yet findings are inconsistent, particularly in immersive media. Integrating the Attentional Gate Model (AGM) and Internal Clock Model (ICM), we examined how emotional valence and stress alter perceived duration in Virtual Reality (VR). This study assesses the effects of valence (calming, neutral, stressful) and stress (low/high) on prospective time estimation, mood, and arousal. Methods: Fifty-four adults (18-39 years) explored three custom VR environments: (1) a tranquil Japanese garden, (2) an affectively neutral room, and (3) a threatening underground sewer. Active navigation promoted presence; a distraction task separated conditions. Valence and arousal were assessed with the Visual Analog Mood Scales (VAMS), stress with the Perceived Stress Scale-10 (PSS-10), and perceived duration with a verbal estimation task. Mixed-model ANOVAs evaluated main and interaction effects. Results: Valence reliably shaped perceived duration: calming VR led to overestimation, stressful VR to underestimation, and neutral VR to intermediate timing. Baseline stress level, as measured by PSS-10, neither altered timing nor interacted with valence. Nevertheless, the VR environments affected the VAMS mood metrics: calming environments elevated mood and reduced perceived stress, whereas stressful environments lowered mood and heightened stress. Conclusions: Findings support the AGM (attentionally demanding negative environments shorten perceived time) and the ICM (valence-linked arousal speeds or slows the pacemaker). Contrary to classical predictions, in VR, baseline stress did not distort duration, suggesting valence-driven attentional allocation outweighs pre-exposure stress levels. VR offers a controllable platform for dissecting time-perception mechanisms and advancing interventions that target emotion-related temporal distortions.

Paperid: 2618, https://arxiv.org/pdf/2505.07154.pdf

Abstract:
Despite the current rise and promising capabilities of Extended Reality (XR) technologies, the architecture, engineering, and construction industry lacks informed guidance when choosing between these technologies, especially for complex processes like assembly and disassembly tasks. This research compares the user experience across different XR devices for (dis)assembly utilizing the NASA Task Load Index and System Usability Scale metrics. Through a workshop and surveys with graduate civil engineering and architecture students, the study found that Augmented Reality scored highest in usability, followed closely by Mixed Reality. However, Mixed Reality showed the best task load index score, indicating low cognitive demand. The findings presented in this research may aid academics and practitioners in making informed decisions when selecting XR systems in practical, real-world assembly scenarios. Moreover, this study suggests opportunities and guidelines for more detailed XR system comparisons and exploration of XR's further role in circular construction practices.
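
For readers unfamiliar with the System Usability Scale used above, the standard scoring rule can be sketched as follows. This is the general SUS formula, not code from the study; the function name `sus_score` and the sample responses are hypothetical.

```python
def sus_score(responses):
    """Score one SUS questionnaire: ten ratings from 1 to 5, in item order."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten integer ratings between 1 and 5")
    total = 0
    for item, rating in enumerate(responses, start=1):
        if item % 2 == 1:
            total += rating - 1   # odd items are positively worded
        else:
            total += 5 - rating   # even items are negatively worded
    return total * 2.5            # rescale the 0-40 sum to 0-100


# Hypothetical participant: agrees with positive items, disagrees with negative ones.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
```

Per-device scores would then be averaged across participants before comparing devices.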

Paperid: 2619, https://arxiv.org/pdf/2505.06947.pdf

Abstract:
With the widespread application of large AI models in various fields, the automation level of multi-agent systems has been continuously improved. However, in high-risk decision-making scenarios such as healthcare and finance, human participation and the alignment of intelligent systems with human intentions remain crucial. This paper focuses on the financial scenario and constructs a multi-agent brainstorming framework based on the BDI theory. A human-computer collaborative multi-agent financial analysis process is built using Streamlit. The system plans tasks according to user intentions, reduces users' cognitive load through real-time updated structured text summaries and the interactive Cothinker module, and reasonably integrates general and reasoning large models to enhance the ability to handle complex problems. By designing a quantitative analysis algorithm for the sentiment tendency of interview content based on LLMs and a method for evaluating the diversity of ideas generated by LLMs in brainstorming based on k-means clustering and information entropy, the system is comprehensively evaluated. The results of human factors testing show that the system performs well in terms of usability and user experience. Although there is still room for improvement, it can effectively support users in completing complex financial tasks. The research shows that the system significantly improves the efficiency of human-computer interaction and the quality of decision-making in financial decision-making scenarios, providing a new direction for the development of related fields.
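
One way to read the diversity measure described above is as the Shannon entropy of how generated ideas spread across clusters. The sketch below assumes cluster labels have already been produced (for example by k-means over idea embeddings, which is not shown); the function names and the normalization by log2(k) are illustrative assumptions, not the paper's stated formula.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the distribution of items over clusters."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def normalized_diversity(labels, k):
    """Entropy divided by its maximum log2(k); 1.0 means ideas are spread evenly."""
    if k < 2:
        return 0.0
    return cluster_entropy(labels) / math.log2(k)


# Hypothetical cluster assignments for eight brainstormed ideas, k = 4.
even_spread = [0, 0, 1, 1, 2, 2, 3, 3]
one_topic = [0, 0, 0, 0, 0, 0, 0, 1]
print(normalized_diversity(even_spread, 4))  # 1.0
print(round(normalized_diversity(one_topic, 4), 3))
```

A low value indicates that most ideas fall into a single cluster, which is the situation a brainstorming system would presumably want to avoid.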

Paperid: 2620, https://arxiv.org/pdf/2505.06620.pdf

Abstract:
There is a growing demand for the use of Artificial Intelligence (AI) and Machine Learning (ML) in healthcare, particularly as clinical decision support systems to assist medical professionals. However, the complexity of many of these models, often referred to as black box models, raises concerns about their safe integration into clinical settings, as it is difficult to understand how they arrive at their predictions. This paper discusses insights and recommendations derived from an expert working group convened by the UK Medicines and Healthcare products Regulatory Agency (MHRA). The group consisted of healthcare professionals, regulators, and data scientists, with a primary focus on evaluating the outputs from different AI algorithms in clinical decision-making contexts. Additionally, the group evaluated findings from a pilot study investigating clinicians' behaviour and interaction with AI methods during clinical diagnosis. Incorporating AI methods is crucial for ensuring the safety and trustworthiness of medical AI devices in clinical settings. Adequate training for stakeholders is essential to address potential issues, and further insights and recommendations for safely adopting AI systems in healthcare settings are provided.

Paperid: 2621, https://arxiv.org/pdf/2505.06467.pdf

Abstract:
Generating high-quality images without prompt engineering expertise remains a challenge for text-to-image (T2I) models, which often misinterpret poorly structured prompts, leading to distortions and misalignments. While humans easily recognize these flaws, metrics like CLIP fail to capture structural inconsistencies, exposing a key limitation in current evaluation methods. To address this, we introduce PromptIQ, an automated framework that refines prompts and assesses image quality using our novel Component-Aware Similarity (CAS) metric, which detects and penalizes structural errors. Unlike conventional methods, PromptIQ iteratively generates and evaluates images until the user is satisfied, eliminating trial-and-error prompt tuning. Our results show that PromptIQ significantly improves generation quality and evaluation accuracy, making T2I models more accessible for users with little to no prompt engineering expertise.

Paperid: 2622, https://arxiv.org/pdf/2505.06402.pdf

Abstract:
In this paper, we present Optimized Prompt-based Unified System (OPUS), a framework that utilizes a Large Language Model (LLM) to control Pan-Tilt-Zoom (PTZ) cameras, providing contextual understanding of natural environments. To achieve this goal, the OPUS system improves cost-effectiveness by generating keywords from a high-level camera control API and transferring knowledge from larger closed-source language models to smaller ones through Supervised Fine-Tuning (SFT) on synthetic data. This enables efficient edge deployment while maintaining performance comparable to larger models like GPT-4. OPUS enhances environmental awareness by converting data from multiple cameras into textual descriptions for language models, eliminating the need for specialized sensory tokens. In benchmark testing, our approach significantly outperformed both traditional language model techniques and more complex prompting methods, achieving a 35% improvement over advanced techniques and a 20% higher task accuracy compared to closed-source models like Gemini Pro. The system demonstrates OPUS's capability to simplify PTZ camera operations through an intuitive natural language interface. This approach eliminates the need for explicit programming and provides a conversational method for interacting with camera systems, representing a significant advancement in how users can control and utilize PTZ camera technology.

Paperid: 2623, https://arxiv.org/pdf/2505.06045.pdf

Abstract:
Flood response planning in local communities is often hindered by fragmented communication across Disaster Risk Reduction and Management (DRRM) councils. In this work, we explore how extended reality (XR) can support more effective planning through narrative-driven design. We present Routscape, an XR prototype for visualizing flood scenarios and evacuation routes, developed through iterative prototyping and user-centered design with DRRM officers. By grounding the system in real-world experiences and localized narratives, we highlight how XR can aid in fostering shared understanding and spatial sensemaking in disaster preparedness efforts.

Paperid: 2624, https://arxiv.org/pdf/2505.05828.pdf

Abstract:
This paper presents a chatbot-based system to engage young Spanish people in the awareness of certain mental disorders through a self-disclosure technique. The study was carried out in a population of teenagers aged between 12 and 18 years. The dialogue engine mixes closed and open conversations, so certain controlled messages are sent to focus the chat on a specific disorder, which will change over time. Once a set of trial questions is answered, the system can initiate the conversation on the disorder under the focus according to the user's sensibility to that disorder, in an attempt to establish a more empathetic communication. Then, an open conversation based on the GPT-3 language model is initiated, allowing the user to express themselves with more freedom. The results show that these systems are of interest to young people and could help them become aware of certain mental disorders.

Paperid: 2625, https://arxiv.org/pdf/2505.05543.pdf

Abstract:
Trust is a fundamental component of human-agent interaction. With the increasing presence of artificial agents in daily life, it is essential to understand how people perceive and trust these agents. One of the key challenges affecting this perception is the Uncanny Valley Effect (UVE), where increasingly human-like artificial beings can be perceived as eerie or repelling. Despite growing interest in trust and the UVE, existing research varies widely in terms of how these concepts are defined and operationalized. This inconsistency raises important questions about how and under what conditions the UVE influences trust in agents. A systematic understanding of their relationship is currently lacking. This review aims to examine the impact of the UVE on human trust in agents and to identify methodological patterns, limitations, and gaps in the existing empirical literature. Following PRISMA guidelines, a systematic search identified 53 empirical studies that investigated both UVE-related constructs and trust or trust-related outcomes. Studies were analyzed based on a structured set of categories, including types of agents and interactions, methodological and measurement approaches, and key findings. The results of our systematic review reveal that most studies rely on static images or hypothetical scenarios with limited real-time interaction, and the majority use subjective trust measures. This review offers a novel framework for classifying trust measurement approaches with regard to the best-practice criteria for empirically investigating the UVE. As the first systematic attempt to map the intersection of UVE and trust, this review contributes to a deeper understanding of their interplay and offers a foundation for future research. Keywords: the uncanny valley effect, trust, human-likeness, affinity response, human-agent interaction

Paperid: 2626, https://arxiv.org/pdf/2505.04885.pdf

Abstract:
This research introduces an innovative AI-driven multi-agent framework specifically designed for creating immersive audiobooks. Leveraging neural text-to-speech synthesis with FastSpeech 2 and VALL-E for expressive narration and character-specific voices, the framework employs advanced language models to automatically interpret textual narratives and generate realistic spatial audio effects. These sound effects are dynamically synchronized with the storyline through sophisticated temporal integration methods, including Dynamic Time Warping (DTW) and recurrent neural networks (RNNs). Diffusion-based generative models combined with higher-order ambisonics (HOA) and scattering delay networks (SDN) enable highly realistic 3D soundscapes, substantially enhancing listener immersion and narrative realism. This technology significantly advances audiobook applications, providing richer experiences for educational content, storytelling platforms, and accessibility solutions for visually impaired audiences. Future work will address personalization, ethical management of synthesized voices, and integration with multi-sensory platforms.
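
Dynamic Time Warping, named above as one of the synchronization methods, can be sketched with the classic dynamic-programming recurrence. This is a textbook version on one-dimensional sequences, not the framework's implementation; the function name `dtw_distance` and the sample sequences are hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical intensity curves: the second one lingers on the middle value.
narration = [1, 2, 3]
effect = [1, 2, 2, 3]
print(dtw_distance(narration, effect))  # 0.0
```

Because DTW lets one sequence stretch relative to the other, the two curves above align at zero cost even though their lengths differ.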

Paperid: 2627, https://arxiv.org/pdf/2505.04210.pdf

Abstract:
Because the driver becomes a passenger in automated driving, carsickness might reduce comfort for susceptible individuals. Insights into the prevalence of carsickness and its modulating factors are considered useful for developing automated vehicles so as to mitigate or prevent its occurrence. An online survey was conducted with N = 3999 participants in Spain, Sweden, Poland, and Germany. 30% of participants reported having already experienced carsickness as an adult. The frequency of carsickness was modulated not only by demographic factors (country, gender, age), but also by the frequency of being a passenger, the type of non-driving related task, the road type, and the seating position in the car. Furthermore, the efficiency of applied countermeasures, temporal aspects of carsickness development, the relation of carsickness to the acceptability of automated driving, and the effect on subjective fitness to drive were investigated. The results are discussed with a focus on automated driving.

Paperid: 2628, https://arxiv.org/pdf/2505.03682.pdf

Abstract:
Foveated rendering methods usually reduce spatial resolution in the periphery of the users' view. However, using foveated rendering to reduce temporal resolution, i.e., rendering frame rate, seems less explored. In this work, we present the results of a user study investigating the perceptual effects of foveated temporal resolution reduction, where only the temporal resolution (frame rate) is reduced in the periphery without affecting spatial quality (pixel density). In particular, we investigated the perception of temporal resolution artifacts caused by reducing the frame rate dependent on the eccentricity of the user's gaze. Our user study with 15 participants was conducted in a virtual reality setting using a head-mounted display. Our results indicate that it was possible to reduce average rendering costs, i.e., the number of rendered pixels, to a large degree before participants consistently reported perceiving temporal artifacts.
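
The idea of lowering frame rate with gaze eccentricity can be pictured with a small scheduling sketch. Everything here is an assumption made for illustration: the abstract gives no eccentricity thresholds or update rates, so the band limits, the intervals, and the function names are hypothetical.

```python
# Hypothetical bands: (upper eccentricity limit in degrees, frames between re-renders).
BANDS = ((10.0, 1), (25.0, 2))
OUTER_INTERVAL = 4  # hypothetical interval beyond the last band


def frame_interval(eccentricity_deg):
    """Number of display frames between re-renders at a given gaze eccentricity."""
    ecc = abs(eccentricity_deg)
    for limit, interval in BANDS:
        if ecc < limit:
            return interval
    return OUTER_INTERVAL


def should_render(eccentricity_deg, frame_index):
    """True if a region at this eccentricity is re-rendered on this frame."""
    return frame_index % frame_interval(eccentricity_deg) == 0


# Near the gaze point every frame is rendered; the far periphery updates every 4th frame.
print([should_render(5.0, f) for f in range(4)])   # [True, True, True, True]
print([should_render(40.0, f) for f in range(4)])  # [True, False, False, False]
```

Spatial resolution is untouched in this scheme; only how often each region is refreshed changes.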

Paperid: 2629, https://arxiv.org/pdf/2505.03584.pdf

Abstract:
Public deliberation, as in open discussion of issues of public concern, often suffers from scattered and shallow discourse, poor sensemaking, and a disconnect from actionable policy outcomes. This paper introduces BCause, a discussion system leveraging generative AI and human-machine collaboration to transform unstructured dialogue around public issues (such as urban living, policy changes, and current socio-economic transformations) into structured, actionable democratic processes. We present three innovations: (i) importing and transforming unstructured transcripts into argumentative discussions, (ii) geo-deliberated problem-sensing via a Telegram bot for local issue reporting, and (iii) smart reporting with customizable widgets (e.g., summaries, topic modelling, policy recommendations, clustered arguments). The system's human-AI partnership preserves critical human participation to ensure ethical oversight, contextual relevance, and creative synthesis.

Paperid: 2630, https://arxiv.org/pdf/2505.03440.pdf

Abstract:
We propose manvr3d, a novel VR-ready platform for interactive human-in-the-loop cell tracking. We utilize VR controllers and eye-tracking hardware to facilitate rapid ground truth generation and proofreading for deep learning-based cell tracking models. Life scientists reconstruct the developmental history of organisms on the cellular level by analyzing 3D time-lapse microscopy images acquired at high spatio-temporal resolution. The reconstruction of such cell lineage trees traditionally involves tracking individual cells through all recorded time points, manually annotating their positions, and then linking them over time to create complete trajectories. Deep learning-based algorithms accelerate this process, yet depend heavily on manually-annotated high-quality ground truth data and curation. Visual representation of the image data in this process still relies primarily on 2D renderings, which greatly limits spatial understanding and navigation. In this work, we bridge the gap between deep learning-based cell tracking software and 3D/VR visualization to create a human-in-the-loop cell tracking system. We lift the incremental annotation, training and proofreading loop of the deep learning model into the 3rd dimension and apply natural user interfaces like hand gestures and eye tracking to accelerate the cell tracking workflow for life scientists.

Paperid: 2631, https://arxiv.org/pdf/2505.03376.pdf

Abstract:
Recommender systems play a vital role in helping users discover content in streaming services, but their effectiveness depends on users understanding why items are recommended. In this study, explanations were based solely on item features rather than personalized data, simulating recommendation scenarios. We compared user perceptions of one-sided (purely positive) and two-sided (positive and negative) feature-based explanations for popular movie recommendations. Through an online study with 129 participants, we examined how explanation style affected perceived trust, transparency, effectiveness, and satisfaction. One-sided explanations consistently received higher ratings across all dimensions. Our findings suggest that in low-stakes entertainment domains such as popular movie recommendations, simpler positive explanations may be more effective. However, the results should be interpreted with caution due to potential confounding factors such as item familiarity and the placement of negative information in explanations. This work provides practical insights for explanation design in recommender interfaces and highlights the importance of context in shaping user preferences.

Paperid: 2632, https://arxiv.org/pdf/2505.03033.pdf

Abstract:
Independent learners often struggle with sustaining focus and emotional regulation in unstructured or distracting settings. Although some rely on ambient aids such as music, ASMR, or visual backgrounds to support concentration, these tools are rarely integrated into cohesive, learner-centered systems. Moreover, existing educational technologies focus primarily on content adaptation and feedback, overlooking the emotional and sensory context in which learning takes place. Large language models have demonstrated powerful multimodal capabilities including the ability to generate and adapt text, audio, and visual content. Educational research has yet to fully explore their potential in creating personalized audiovisual learning environments. To address this gap, we introduce an AI-powered system that uses LLMs to generate personalized multisensory study environments. Users select or generate customized visual themes (e.g., abstract vs. realistic, static vs. animated) and auditory elements (e.g., white noise, ambient ASMR, familiar vs. novel sounds) to create immersive settings aimed at reducing distraction and enhancing emotional stability. Our primary research question investigates how combinations of personalized audiovisual elements affect learner cognitive load and engagement. Using a mixed-methods design that incorporates biometric measures and performance outcomes, this study evaluates the effectiveness of LLM-driven sensory personalization. The findings aim to advance emotionally responsive educational technologies and extend the application of multimodal LLMs into the sensory dimension of self-directed learning.

Paperid: 2633, https://arxiv.org/pdf/2505.02542.pdf

Abstract:
Intangible Cultural Heritage (ICH) like traditional culinary practices face increasing pressure to adapt to globalization while maintaining their cultural authenticity. Centuries-old traditions in Chinese cuisine are subject to rapid changes for adaptation to contemporary tastes and dietary preferences. The preservation of these cultural practices requires approaches that can enable ICH practitioners to reimagine and recreate ICH for modern contexts. To address this, we created workshops where experienced practitioners of traditional Chinese cuisine co-created recipes using GenAI tools and realized the dishes. We found that GenAI inspired ICH practitioners to innovate recipes based on traditional workflows for broader audiences and adapt to modern dining contexts. However, GenAI-inspired co-creation posed challenges in maintaining the accuracy of original ICH workflows and preserving traditional flavors in the culinary outcomes. This study offers implications for designing human-AI collaborative processes for safeguarding and enhancing culinary ICH.

Paperid: 2634, https://arxiv.org/pdf/2505.02230.pdf

Abstract:
Generative Artificial Intelligence (GenAI) is revolutionizing education and workforce development, profoundly shaping how students learn, engage, and prepare for their future. Outpacing the development of uniform policies and structures, GenAI has heralded a unique era and given rise to the GenAI Generation. We define the GenAI Generation as a cohort of students whose education has been increasingly shaped by the opportunities and challenges GenAI presents during its widespread adoption within society. This study examines students' perceptions of GenAI through a concise survey with optional open-ended questions, focusing on their awareness, preparedness, and concerns. Notably, readiness appears increasingly tied to exposure to GenAI through one's coursework. Students with greater curricular exposure to GenAI tend to feel more prepared, while those without it more often express vulnerability and uncertainty, highlighting a new and growing divide in readiness that goes beyond traditional disciplinary boundaries. Evaluation of more than 250 responses, with over 40% providing detailed qualitative feedback, reveals a core dual sentiment: while most students express enthusiasm for GenAI, an even greater proportion voice a spectrum of concerns about ethics, job displacement, and the adequacy of educational structures given the highly transformative technology. These findings offer critical insights into how students view the potential and pitfalls of GenAI for future career impacts. The challenge ahead involves implementing associated recommendations for educational institutions, moving beyond the baseline of access toward more informed guidance on the use of these tools, while preserving critical thinking, ethical reasoning, and adaptive learning.

Paperid: 2635, https://arxiv.org/pdf/2505.01542.pdf

Abstract:
In a world where technology is increasingly embedded in our everyday experiences, systems that sense and respond to human emotions are elevating digital interaction. At the intersection of artificial intelligence and human-computer interaction, affective computing is emerging with innovative solutions where machines are humanized by enabling them to process and respond to user emotions. This survey paper explores recent research contributions in affective computing applications in the area of emotion recognition, sentiment analysis and personality assignment developed using approaches like large language models (LLMs), multimodal techniques, and personalized AI systems. We analyze the key contributions and innovative methodologies applied by the selected research papers by categorizing them into four domains: AI chatbot applications, multimodal input systems, mental health and therapy applications, and affective computing for safety applications. We then highlight the technological strengths as well as the research gaps and challenges related to these studies. Furthermore, the paper examines the datasets used in each study, highlighting how modality, scale, and diversity impact the development and performance of affective models. Finally, the survey outlines ethical considerations and proposes future directions to develop applications that are safer, more empathetic, and more practical.

Paperid: 2636, https://arxiv.org/pdf/2505.01219.pdf

Abstract:
Online communities are an increasingly important stakeholder for firms, and despite the growing body of research on them, much remains to be learned about them and about the factors that determine their attributes and sustainability. Whereas most of the literature focuses on predictors such as community activity, network structure, and platform interface, there is little research about behavioral and psychological aspects of community members and leaders. In the present study we focus on the personality traits of community founders as predictors of community attributes and sustainability. We develop a tool to estimate community members' Big Five personality traits from their social media text and use it to estimate the traits of 35,164 founders in 8,625 Reddit communities. We find support for most of our predictions about the relationships between founder traits and community sustainability and attributes, including the level of engagement within the community, aspects of its social network structure, and whether the founders themselves remain active in it.

Paperid: 2637, https://arxiv.org/pdf/2505.01030.pdf

Abstract:
This paper describes the challenges that deaf and hard of hearing people face with creating accessible multimedia content, such as portfolios, instructional videos and video presentations. Unlike content consumption, the process of content creation itself remains highly inaccessible, creating barriers to employment in all stages of recruiting, hiring, and carrying out assigned job duties. Overcoming these barriers incurs a "deaf content creation tax" that translates into requiring significant additional time and resources to produce content equivalent to what a non-disabled person would produce. We highlight this process and associated challenges through real-world examples experienced by the authors, and provide guidance and recommendations for addressing them.

Paperid: 2638, https://arxiv.org/pdf/2505.00879.pdf

Abstract:
As the integration of augmented reality (AR) technology in head-up displays (HUDs) becomes more prevalent in vehicles, it is crucial to understand how to design and evaluate AR interfaces to ensure safety. With new AR displays capable of rendering images with larger fields of view and at varying depths, the visual and cognitive separation between graphical and real-world visual stimuli will be increasingly difficult to quantify, as will drivers' ability to efficiently allocate visual attention between the two sets of stimuli. In this study, we present a user study that serves as a crucial first step in gaining insight into inattentional blindness while using AR in surface transportation, where understanding is currently limited. Our primary goal is to investigate how the visual demand of AR tasks influences drivers' ability to detect stimuli, and whether the nature of the stimuli themselves plays a role in this effect. To address these questions, we designed an on-road user study aimed at producing a more realistic and ecologically valid understanding of the phenomenon. Our results show that drivers' ability to detect stimuli in the environment in a timely manner decreased as the AR task visual demand increased, as demonstrated by both detection performance and inattentional blindness metrics. Further, inattentional blindness caused by AR displays appears to be more prevalent within drivers' central field of view. We conclude by discussing implications towards a safety-centric evaluation framework for AR HUDs.

Paperid: 2639, https://arxiv.org/pdf/2505.00603.pdf

Abstract:
This study investigates whether large language models, specifically GPT4, can match human capabilities in analogical reasoning within strategic decision making contexts. Using a novel experimental design involving source to target matching, we find that GPT4 achieves high recall by retrieving all plausible analogies but suffers from low precision, frequently applying incorrect analogies based on superficial similarities. In contrast, human participants exhibit high precision but low recall, selecting fewer analogies yet with stronger causal alignment. These findings advance theory by identifying matching, the evaluative phase of analogical reasoning, as a distinct step that requires accurate causal mapping beyond simple retrieval. While current LLMs are proficient in generating candidate analogies, humans maintain a comparative advantage in recognizing deep structural similarities across domains. Error analysis reveals that AI errors arise from surface level matching, whereas human errors stem from misinterpretations of causal structure. Taken together, the results suggest a productive division of labor in AI assisted organizational decision making where LLMs may serve as broad analogy generators, while humans act as critical evaluators, applying the most contextually appropriate analogies to strategic problems.
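The precision/recall contrast described above can be made concrete with a minimal sketch. The analogy labels and the two selection sets below are hypothetical illustrations, not data from the study; only the standard definitions of precision and recall are assumed.

```python
def precision_recall(selected, correct):
    """Return (precision, recall) for a set of selected analogies.

    precision = share of selected analogies that are correct
    recall    = share of correct analogies that were selected
    """
    selected, correct = set(selected), set(correct)
    hits = len(selected & correct)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical example: two causally valid source analogies exist.
correct = {"A", "B"}
broad_selector = {"A", "B", "C", "D", "E"}   # retrieves everything plausible
narrow_selector = {"A"}                      # picks few, but picks well

broad_scores = precision_recall(broad_selector, correct)
narrow_scores = precision_recall(narrow_selector, correct)
```

In this toy case the broad selector reaches full recall at low precision, while the narrow selector shows the reverse pattern, which mirrors the qualitative finding reported in the abstract.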

Paperid: 2640, https://arxiv.org/pdf/2504.21347.pdf

Abstract:
We introduce the In Real Life (IRL) Ditto, an AI-driven embodied agent designed to represent remote colleagues in shared office spaces, creating opportunities for real-time exchanges even in their absence. IRL Ditto offers a unique hybrid experience by allowing in-person colleagues to encounter a digital version of their remote teammates, initiating greetings, updates, or small talk as they might in person. Our research question examines: How can the IRL Ditto influence interactions and relationships among colleagues in a shared office space? Through a four-day study, we assessed IRL Ditto's ability to strengthen social ties by simulating presence and enabling meaningful interactions across different levels of social familiarity. We find that enhancing social relationships depended deeply on the foundation of the relationship participants had with the source of the IRL Ditto. This study provides insights into the role of embodied agents in enriching workplace dynamics for distributed teams.

Paperid: 2641, https://arxiv.org/pdf/2504.20903.pdf

Abstract:
We develop an agent-based simulation to formalize AI-human collaboration as a function of task structure, advancing a generalizable framework for strategic decision-making in organizations. Distinguishing between heuristic-based human adaptation and rule-based AI search, we model interactions across modular (parallel) and sequenced (interdependent) tasks using an NK model. Our results reveal that in modular tasks, AI often substitutes for humans - delivering higher payoffs unless human expertise is very high, and the AI search space is either narrowly focused or extremely broad. In sequenced tasks, interesting complementarities emerge. When an expert human initiates the search and AI subsequently refines it, aggregate performance is maximized. Conversely, when AI leads, excessive heuristic refinement by the human can reduce payoffs. We also show that even "hallucinatory" AI - lacking memory or structure - can improve outcomes when augmenting low-capability humans by helping escape local optima. These results yield a robust implication: the effectiveness of AI-human collaboration depends less on context or industry, and more on the underlying task structure. By elevating task decomposition as the central unit of analysis, our model provides a transferable lens for strategic decision-making involving humans and an agentic AI across diverse organizational settings.
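For readers unfamiliar with NK models, the following is a minimal sketch of a generic NK fitness landscape with one-bit hill climbing. It is not the authors' simulation: the cyclic neighbourhood, the uniform random contribution tables, and the function names `make_nk_landscape` and `local_search` are all assumptions made for illustration, and the human and AI agent behaviours from the paper are not modelled.

```python
import itertools
import random


def make_nk_landscape(n, k, seed=0):
    """Build a fitness function over binary states of length n.

    Each component i contributes a random value that depends on its own
    bit and the next k bits (cyclic neighbourhood, an assumed layout).
    """
    rng = random.Random(seed)
    neighbors = [[(i + j) % n for j in range(k + 1)] for i in range(n)]
    tables = [
        {bits: rng.random() for bits in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(state):
        total = sum(
            tables[i][tuple(state[j] for j in neighbors[i])] for i in range(n)
        )
        return total / n  # mean contribution, always in [0, 1]

    return fitness


def local_search(fitness, state, rng, steps=200):
    """Flip one random bit per step and keep the flip only if fitness improves."""
    state = list(state)
    best = fitness(state)
    for _ in range(steps):
        i = rng.randrange(len(state))
        state[i] ^= 1
        candidate = fitness(state)
        if candidate > best:
            best = candidate
        else:
            state[i] ^= 1  # revert a non-improving flip
    return state, best


fitness = make_nk_landscape(n=8, k=2, seed=1)
rng = random.Random(2)
start = [rng.randint(0, 1) for _ in range(8)]
end, score = local_search(fitness, start, rng)
```

Higher `k` makes the landscape more rugged, so this kind of greedy search is more likely to stall on a local optimum, which is the setting in which the abstract reports that even unstructured search can help.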

Paperid: 2642, https://arxiv.org/pdf/2504.20741.pdf

Abstract:
Since the early days of the Explainable AI movement, post-hoc explanations have been praised for their potential to improve user understanding, promote trust, and reduce patient safety risks in black box medical AI systems. Recently, however, critics have argued that the benefits of post-hoc explanations are greatly exaggerated since they merely approximate, rather than replicate, the actual reasoning processes that black box systems take to arrive at their outputs. In this article, we aim to defend the value of post-hoc explanations against this recent critique. We argue that even if post-hoc explanations do not replicate the exact reasoning processes of black box systems, they can still improve users' functional understanding of black box systems, increase the accuracy of clinician-AI teams, and assist clinicians in justifying their AI-informed decisions. While post-hoc explanations are not a "silver bullet" solution to the black box problem in medical AI, we conclude that they remain a useful strategy for addressing the black box problem in medical AI.

Paperid: 2643, https://arxiv.org/pdf/2504.20442.pdf

Abstract:
With the intensification of global climate change, accurate prediction of weather indicators is of great significance in disaster prevention and mitigation, agricultural production, and transportation. Precipitation, as one of the key meteorological indicators, plays a crucial role in water resource management, agricultural production, and urban flood control. This study proposes a multidimensional precipitation index prediction model based on a CNN-LSTM hybrid framework, aiming to improve the accuracy of precipitation forecasts. The dataset is sourced from Pune, Maharashtra, India, and covers nearly 31 years (1972-2002) of monthly mean precipitation, reflecting the long-term fluctuations and seasonal variations of precipitation in the region. By analyzing these time series data, the CNN-LSTM model effectively captures local features and long-term dependencies. Experimental results show that the model achieves a root mean square error (RMSE) of 6.752, which demonstrates a significant advantage over traditional time series prediction methods in terms of prediction accuracy and generalization ability. Furthermore, this study provides new research ideas for precipitation prediction. However, the model requires high computational resources when dealing with large-scale datasets, and its predictive ability for multidimensional precipitation data still needs improvement. Future research could extend the model to support and predict multidimensional precipitation data, thereby promoting the development of more accurate and efficient meteorological prediction technologies.
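As a reminder of what the reported metric measures, here is a minimal sketch of the standard RMSE computation. The function name `rmse` and the sample values are illustrative assumptions; the abstract does not state the unit of the reported 6.752, and RMSE is always expressed in the same unit as the predicted quantity.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sum(squared_errors) / len(observed))


# Hypothetical monthly values, not taken from the Pune dataset.
example_error = rmse([0.0, 0.0], [3.0, 4.0])  # sqrt((9 + 16) / 2)
```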

Paperid: 2644, https://arxiv.org/pdf/2504.20215.pdf

Abstract:
This study explores the application of the Technology Acceptance Model (TAM) to AI-powered digital innovations within a transnational governance framework. By integrating Latourian actor-network theory (ANT), this study examines how institutional motivations, regulatory compliance, and ethical and cultural acceptance drive organisations to develop and adopt AI innovations, enhancing their market acceptance and transnational accountability. We extend the TAM framework by incorporating regulatory, ethical, and socio-technical considerations as key social pressures shaping AI adoption. Recognizing that AI is embedded within complex actor-networks, we argue that accountability is co-constructed among organisations, regulators, and societal actors rather than being confined to individual developers or adopters. To address these challenges, we propose two key solutions: (1) internal resource reconfiguration, where organisations restructure their governance and compliance mechanisms to align with global standards; and (2) reshaping organisational boundaries through actor-network management, fostering engagement with external stakeholders, regulatory bodies, and transnational governance institutions. These approaches allow organisations to enhance AI accountability, foster ethical and regulatory alignment, and improve market acceptance on a global scale.

Paperid: 2645, https://arxiv.org/pdf/2504.20016.pdf

Abstract:
In child-centered design, directly engaging children is crucial for deeply understanding their experiences. However, current research often prioritizes adult perspectives, as interviewing children involves unique challenges such as environmental sensitivities and the need for trust-building. AI-powered virtual humans (VHs) offer a promising approach to facilitate engaging and multimodal interactions with children. This study establishes key design guidelines for LLM-powered virtual humans tailored to child interviews, standardizing multimodal elements including color schemes, voice characteristics, facial features, expressions, head movements, and gestures. Using ChatGPT-based prompt engineering, we developed three distinct Human-AI workflows (LLM-Auto, LLM-Interview, and LLM-Analyze) and conducted a user study involving 15 children aged 6 to 12. The results indicated that the LLM-Analyze workflow outperformed the others by eliciting longer responses, achieving higher user experience ratings, and promoting more effective child engagement.

Paperid: 2646, https://arxiv.org/pdf/2504.19767.pdf

Abstract:
Analog journaling has grown in popularity, with journaling on paper encompassing a range of motivations, styles, and practices including planning, habit-tracking, and reflecting. Journalers develop strong personal preferences around the tools they use, the ideas they capture, and the layout in which they represent their ideas and memories. Understanding how analog journaling practices are individually shaped and crafted over time is critical to supporting the varied benefits associated with journaling, including improved mental health and positive support for identity development. To understand this development, we qualitatively analyzed publicly-shared journaling content from YouTube and Instagram and interviewed 11 journalers. We report on our identification of the journaling ecosystem in which journaling practices are shaped by materials, personal context, and communities, sharing how this ecosystem plays a role in the practices and identities of journalers as they customize their journaling routine to best suit their personal goals. Using these insights, we discuss design opportunities for how future tools can better align with and reflect the rich affordances and practices of journaling on paper.

Paperid: 2647, https://arxiv.org/pdf/2504.19611.pdf

Abstract:
Haptic feedback contributes to immersive virtual reality (VR) experiences. Designing such feedback at scale, for all objects within a VR scene and their respective arrangements, remains a time-consuming task. We present Scene2Hap, an LLM-centered system that automatically designs object-level vibrotactile feedback for entire VR scenes based on the objects' semantic attributes and physical context. Scene2Hap employs a multimodal large language model to estimate the semantics and physical context of each object, including its material properties and vibration behavior, from the multimodal information present in the VR scene. This semantic and physical context is then used to create plausible vibrotactile signals by generating or retrieving audio signals and converting them to vibrotactile signals. For the more realistic spatial rendering of haptics in VR, Scene2Hap estimates the propagation and attenuation of vibration signals from their source across objects in the scene, considering the estimated material properties and physical context, such as the distance and contact between virtual objects. Results from two user studies confirm that Scene2Hap successfully estimates the semantics and physical context of VR scenes, and the physical modeling of vibration propagation improves usability, perceived materiality, and spatial awareness.

Paperid: 2648, https://arxiv.org/pdf/2504.19556.pdf

Abstract:
Given the subtle human-like effects of large language models on linguistic patterns, this study examines shifts in language over time to detect the impact of AI-mediated communication (AI-MC) on social media. We compare a replicated dataset of 970,919 tweets from 2020 (pre-ChatGPT) with 20,000 tweets from the same period in 2024, all of which mention Donald Trump during election periods. Using a combination of Flesch-Kincaid readability and polarity scores, we analyze changes in text complexity and sentiment. Our findings reveal a significant increase in mean sentiment polarity (0.12 vs. 0.04) and a shift from predominantly neutral content (54.8% in 2020 to 39.8% in 2024) to more positive expressions (28.6% to 45.9%). These findings suggest not only an increasing presence of AI in social media communication but also its impact on language and emotional expression patterns.
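The readability measure mentioned above can be sketched as follows. This uses the published Flesch-Kincaid Grade Level formula; whether the authors used Grade Level or Reading Ease is not stated in the abstract. The syllable counter is a crude vowel-group heuristic assumed here for illustration, so scores will differ from those of dictionary-based tools.

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of vowels (heuristic, at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple_score = fk_grade("The cat sat.")
```

Short texts such as tweets contain few sentences, so the words-per-sentence term can swing widely from one post to the next; aggregate comparisons are more stable than per-tweet scores.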

Paperid: 2649, https://arxiv.org/pdf/2504.19460.pdf

Abstract:
We introduce a real-time, human-in-the-loop gesture control framework that can dynamically adapt audio and music based on human movement by analyzing live video input. By creating a responsive connection between visual and auditory stimuli, this system enables dancers and performers to not only respond to music but also influence it through their movements. Designed for live performances, interactive installations, and personal use, it offers an immersive experience where users can shape the music in real time. The framework integrates computer vision and machine learning techniques to track and interpret motion, allowing users to manipulate audio elements such as tempo, pitch, effects, and playback sequence. With ongoing training, it achieves user-independent functionality, requiring as few as 50 to 80 samples to label simple gestures. This framework combines gesture training, cue mapping, and audio manipulation to create a dynamic, interactive experience. Gestures are interpreted as input signals, mapped to sound control commands, and used to naturally adjust music elements, showcasing the seamless interplay between human interaction and machine response.

Paperid: 2650, https://arxiv.org/pdf/2504.19120.pdf

Abstract:
The goal of the current study is to introduce a triadic human-AI collaboration framework for the automated vehicle domain. Previous classifications (e.g., SAE Levels of Automation) focus on defining automation levels based on who controls the vehicle. However, it remains unclear how human users and AI should collaborate in real-time, especially in dynamic driving contexts, where roles can shift frequently. To fill the gap, this study proposes a triadic human-AI collaboration framework with three AI roles (i.e., Advisor, Co-Pilot, and Guardian) that dynamically adapt to human needs. Overall, the study lays a foundation for developing adaptive, role-based human-AI collaboration strategies in automated vehicles.

Paperid: 2651, https://arxiv.org/pdf/2504.18988.pdf

Abstract:
Collaborative research often includes contributors with varied perspectives from diverse linguistic backgrounds. However, English as a Second Language (ESL) researchers often struggle to communicate during meetings in English and comprehend discussions, leading to limited contribution. To investigate these challenges, we surveyed 64 ESL researchers who frequently collaborate in multilingual teams and identified four key design goals around participation, comprehension, documentation, and feedback. Guided by these design goals, we developed LINC, a multimodal Language INdependent Collaboration system with two components: a real-time module for multilingual communication during meetings and a post-meeting dashboard for discussion analysis. We evaluated the system through a two-phased study with six triads of multilingual teams. We found that using LINC, participants benefited from communicating in their preferred language, recalled and reviewed actionable insights, and prepared for upcoming meetings effectively. We discuss external factors that impact multilingual meeting participation beyond language preferences and the implications of multimodal systems in facilitating meetings in hybrid multilingual collaborative settings beyond research.

Paperid: 2652, https://arxiv.org/pdf/2504.18410.pdf

Abstract:
Parental verbal abuse leaves lasting emotional impacts, yet current therapeutic approaches often lack immersive self-reflection opportunities. To address this, we developed a VR experience powered by LLMs to foster reflection on parental verbal abuse. Participants with relevant experiences engage in a dual-phase VR experience: first assuming the role of a verbally abusive parent, interacting with an LLM portraying a child, then observing the LLM reframing abusive dialogue into warm, supportive expressions as a nurturing parent. A qualitative study with 12 participants showed that the experience encourages reflection on their past experiences and fosters supportive emotions. However, these effects vary with participants' personal histories, emphasizing the need for greater personalization in AI-driven emotional support. This study explores the use of LLMs in immersive environments to promote emotional reflection, offering insights into the design of AI-driven emotional support systems.

Paperid: 2653, https://arxiv.org/pdf/2504.18380.pdf

Abstract:
Modern extended reality (XR) systems provide rich analysis of image data and fusion of sensor input and demand AR/VR applications that can reason about 3D scenes in a semantic manner. We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks such as determining how 3D objects are arranged among each other ('on', 'behind', 'near', etc.). Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates, ranging from topology and connectivity to directionality and orientation, expressed in a formalism related to natural language. The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation. Implementations for client- and server-side processing demonstrate the framework's capability to efficiently translate geometric data into actionable knowledge, ensuring scalable and technology-independent spatial reasoning in complex 3D environments. The Spatial Reasoner framework fosters the creation of spatial ontologies, and seamlessly integrates with and therefore enriches machine learning, natural language processing, and rule systems in XR applications.
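The idea of deriving symbolic relations from box geometry can be sketched in a few lines. This is a simplified illustration, not the framework's API: it uses axis-aligned rather than oriented boxes, assumes a y-up coordinate system, and the names `Box`, `is_on`, `is_near`, and `relations`, along with the tolerance values, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    center: tuple  # (x, y, z), with y pointing up (assumed convention)
    size: tuple    # (width, height, depth)

    def low(self, axis):
        return self.center[axis] - self.size[axis] / 2

    def high(self, axis):
        return self.center[axis] + self.size[axis] / 2


def overlaps(a, b, axis):
    """True if the two boxes' extents intersect along one axis."""
    return a.low(axis) < b.high(axis) and b.low(axis) < a.high(axis)


def is_on(a, b, tol=0.01):
    """a rests on b: a's bottom touches b's top and footprints overlap."""
    touching = abs(a.low(1) - b.high(1)) <= tol
    return touching and overlaps(a, b, 0) and overlaps(a, b, 2)


def is_near(a, b, max_distance=1.0):
    """Centers lie within max_distance of each other."""
    return math.dist(a.center, b.center) <= max_distance


def relations(boxes):
    """Return (subject, predicate, object) triples for all ordered pairs."""
    triples = []
    for s, a in boxes.items():
        for o, b in boxes.items():
            if s == o:
                continue
            if is_on(a, b):
                triples.append((s, "on", o))
            if is_near(a, b):
                triples.append((s, "near", o))
    return triples


scene = {
    "table": Box(center=(0.0, 0.5, 0.0), size=(2.0, 1.0, 1.0)),
    "cup": Box(center=(0.3, 1.05, 0.0), size=(0.1, 0.1, 0.1)),
}
graph = relations(scene)
```

The resulting triples form a small edge list that could be loaded into any graph structure; view-dependent predicates such as 'behind' would additionally need an observer pose, which this sketch omits.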

Paperid: 2654, https://arxiv.org/pdf/2504.18238.pdf

Abstract:
Security vulnerabilities in software systems represent significant risks as potential entry points for malicious attacks. Traditional dashboards that display the results of static analysis security testing often use 2D or 3D visualizations, which tend to lack the spatial details required to effectively reveal issues such as the propagation of vulnerabilities across the codebase or the appearance of concurrent vulnerabilities. Additionally, most reporting solutions only treat the analysis results as an artifact that can be reviewed or edited asynchronously by developers, limiting real-time, collaborative exploration. To the best of our knowledge, no VR-based approach exists for the visualization and interactive exploration of software security vulnerabilities. Addressing these challenges, the virtual reality (VR) environment SecCityVR was developed as a proof-of-concept implementation that employs the code city metaphor within VR to visualize software security vulnerabilities as colored building floors inside the surrounding virtual city. By integrating the application's call graph, vulnerabilities are contextualized within related software components. SecCityVR supports multi-user collaboration and interactive exploration. It provides explanations and mitigations for detected issues. A user study comparing SecCityVR with the traditional dashboard find-sec-bugs showed the VR approach provided a favorable experience, with higher usability, lower temporal demand, and significantly lower frustration despite having longer task completion times. This paper and its results contribute to the fields of collaborative and secure software engineering, as well as software visualization. It provides a new application of VR code cities to visualize security vulnerabilities, as well as a novel environment for security audits using collaborative and immersive technologies.

Paperid: 2655, https://arxiv.org/pdf/2504.18044.pdf

Abstract:
Using LLMs in healthcare, Computer-Supported Cooperative Work, and Social Computing requires the examination of ethical and social norms to ensure safe incorporation into human life. We conducted a mixed-method study, including an online survey with 111 participants and an interview study with 38 experts, to investigate AI ethics and social norms in ChatGPT as an everyday tool. This study aims to evaluate whether ChatGPT, in an empirical context, operates in accordance with ethics and social norms, which is critical for understanding actions in industrial and academic research and achieving machine ethics. The findings of this study provide initial insights into six important aspects of AI ethics, including bias, trustworthiness, security, toxicology, social norms, and ethical data. Significant obstacles related to transparency and bias in unsupervised data collection methods are identified as ChatGPT's ethical concerns.

Paperid: 2656, https://arxiv.org/pdf/2504.17999.pdf

Abstract:
Generative conversational interfaces powered by large language models (LLMs) typically stream output token-by-token at a rate determined by computational budget, often neglecting actual human reading speeds and the cognitive load associated with the content. This mismatch frequently leads to inefficient use of computational resources. For example, in cloud-based services, streaming content faster than users can read appears unnecessary, resulting in wasted computational resources and potential delays for other users, particularly during peak usage periods. To address this issue, we propose an adaptive streaming method that dynamically adjusts the pacing of LLM streaming output in real time based on inferred cognitive load. Our approach estimates the cognitive load associated with streaming content and strategically slows down the stream during complex or information-rich segments, thereby freeing computational resources for other users. We conducted a statistical analysis and simulation based on a statistical model derived from data collected in a crowdsourced user study across various types of LLM-generated content. Our results show that this adaptive method can effectively reduce computational consumption while largely maintaining streaming speed above users' normal reading speed.

Paperid: 2657, https://arxiv.org/pdf/2504.17697.pdf

Abstract:
India's gig-based food delivery platforms, such as Swiggy and Zomato, provide crucial income to marginalised communities but also entrench workers in cycles of invisible labour. Through 14 semi-structured interviews, we analyse waiting time and repetitive UI interactions as key burdens that contribute to 'digital discomfort' for gig-based food delivery agents. We find that workers employ creative strategies to navigate algorithmic management, yet remain constrained by platform-side 'gamification' and system opacity. We propose worker-centered GUI automation as a potential intervention to reduce friction while preserving agency. In conclusion, this position paper argues for rethinking HCI approaches in the Global South to prioritise worker autonomy over efficiency-driven design optimisations.

Paperid: 2658, https://arxiv.org/pdf/2504.17663.pdf

Abstract:
In this paper, we adopt a survivor-centered approach to locate and dissect the role of sociotechnical AI governance in preventing AI-Generated Non-Consensual Intimate Images (AIG-NCII) of adults, colloquially known as "deep fake pornography." We identify a "malicious technical ecosystem" or "MTE," comprising open-source face-swapping models and nearly 200 "nudifying" software programs that allow non-technical users to create AIG-NCII within minutes. Then, using the National Institute of Standards and Technology (NIST) AI 100-4 report as a reflection of current synthetic content governance methods, we show how the current landscape of practices fails to effectively regulate the MTE for adult AIG-NCII, and we identify the flawed assumptions that explain these gaps.

Paperid: 2659, https://arxiv.org/pdf/2504.17150.pdf

Abstract:
Dashboard guidance helps dashboard users better navigate interactive features, understand the underlying data, and assess insights they can potentially extract from dashboards. However, authoring dashboard guidance is a time consuming task, and embedding guidance into dashboards for effective delivery is difficult to realize. In this work, we contribute DashGuide, a framework and system to support the creation of interactive dashboard guidance with minimal authoring input. Given a dashboard and a communication goal, DashGuide captures a sequence of author-performed interactions to generate guidance materials delivered as playable step-by-step overlays, a.k.a., dashboard tours. Authors can further edit and refine individual tour steps while receiving generative assistance. We also contribute findings from a formative assessment with 9 dashboard creators, which helped inform the design of DashGuide; and findings from an evaluation of DashGuide with 12 dashboard creators, suggesting it provides an improved authoring experience that balances efficiency, expressiveness, and creative freedom.

Paperid: 2660, https://arxiv.org/pdf/2504.17117.pdf

Abstract:
Blind and visually impaired (BVI) students face significant challenges in traditional educational settings. While screen readers and braille materials offer some accessibility, they often lack interactivity and real-time adaptability to individual learning needs. This paper presents Audemy, an AI-powered audio-based learning platform designed to provide personalized, accessible, and engaging educational experiences for BVI students. Audemy uses adaptive learning techniques to customize content based on student accuracy, pacing preferences, and engagement patterns. The platform has been iteratively developed with input from over 20 educators specializing in accessibility and currently serves over 2,000 BVI students. Educator insights show key considerations for accessible AI, including the importance of engagement, intuitive design, compatibility with existing assistive technologies, and the role of positive reinforcement in maintaining student motivation. Beyond accessibility, this paper explores the ethical implications of AI in education, emphasizing data privacy, security, and transparency. Audemy demonstrates how AI can empower BVI students with personalized and equitable learning opportunities, advancing the broader goal of inclusive education.

Paperid: 2661, https://arxiv.org/pdf/2504.17113.pdf

Abstract:
We report an 18-month field experiment in distributed digital institutions: a nine-bedroom Los Angeles coliving house that runs without managers, while sustaining 98% occupancy and below-market rents. Drawing on Elinor Ostrom's commons theory, we outline design principles and three digital mechanisms that form the institutional core: 1) A continuous-auction chore scheduler turns regenerative labor into a time-indexed points market; residents meet a 100-point monthly obligation by claiming tasks whose value rises linearly with neglect. 2) A pairwise-preference layer lets participants asynchronously reprioritize tasks, translating meta-governance into low-cognition spot inputs. 3) A symbolic "hearts" ledger tracks norm compliance through automated enforcement, lightweight challenges, and peer-awarded karma. Together, these mechanisms operationalize cybernetic principles--human sensing, machine bookkeeping, real-time feedback--while minimizing dependence on privileged roles. Our exploratory data (567 chore claims, 255 heart events, and 551 group purchases) show that such tooling can sustain reliable commons governance without continuous leadership, offering a transferable design palette for online communities, coliving houses, and other digitally mediated collectives.
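
As an illustration of the first mechanism, the following is a minimal sketch of a chore whose point value rises linearly with neglect. The abstract gives only the 100-point monthly obligation and the linear growth; the class and function names, the base value, and the per-day growth rate are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

MONTHLY_OBLIGATION = 100.0  # points per resident per month, as stated in the abstract


@dataclass
class Chore:
    name: str
    base_points: float     # hypothetical value right after the chore is done
    points_per_day: float  # hypothetical linear growth rate while neglected
    last_done_day: int     # day index on which the chore was last completed

    def value(self, today: int) -> float:
        """Current point value: grows linearly with days of neglect."""
        days_neglected = max(0, today - self.last_done_day)
        return self.base_points + self.points_per_day * days_neglected


def claim(chore: Chore, resident: str, today: int, balances: dict) -> float:
    """Credit the resident with the chore's current value and reset its clock."""
    earned = chore.value(today)
    balances[resident] = balances.get(resident, 0.0) + earned
    chore.last_done_day = today
    return earned


def remaining_obligation(resident: str, balances: dict) -> float:
    """Points the resident still owes this month (never negative)."""
    return max(0.0, MONTHLY_OBLIGATION - balances.get(resident, 0.0))
```

Under these assumptions, a chore left undone becomes steadily more attractive to claim, which is one way the neglect-indexed pricing described above could direct labor toward overdue tasks.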

Paperid: 2662, https://arxiv.org/pdf/2504.17055.pdf

Abstract:
AI-powered facial assessment tools are reshaping how individuals evaluate appearance and internalize social judgments. This study examines the psychological impact of such tools on self-objectification, self-esteem, and emotional responses, with attention to gender differences. Two samples used distinct versions of a facial analysis tool: one overtly critical (N=75; M=22.9 years), and another more neutral (N=51; M=19.9 years). Participants completed validated self-objectification and self-esteem scales and custom items measuring emotion, digital/physical appearance enhancement (DAE, PAEE), and perceived social emotion (PSE). Results revealed consistent links between high self-objectification, low self-esteem, and increased appearance enhancement behaviors across both versions. Despite softer framing, the newer tool still evoked negative emotional responses (U=1466.5, p=0.013), indicating implicit feedback may reinforce appearance-related insecurities. Gender differences emerged in DAE (p=0.025) and PSE (p<0.001), with females more prone to digital enhancement and less likely to perceive emotional impact in others. These findings reveal how AI tools may unintentionally reinforce and amplify existing social biases and underscore the critical need for responsible AI design and development. Future research will investigate how human ideologies embedded in the training data of such tools shape their evaluative outputs, and how these, in turn, influence user attitudes and decisions.

Paperid: 2663, https://arxiv.org/pdf/2504.17018.pdf

Abstract:
Large Language Models (LLMs) are rapidly becoming integral to a wide range of tools, tasks, and problem-solving processes, especially in software development. Originally designed for natural language processing tasks such as text generation, LLMs are increasingly being used to assist both professionals and students in writing code. This growing reliance on LLM-based tools is reshaping programming workflows and task execution. In this study, we explore the impact of these technologies on blind and low-vision (BLV) developers. Our review of existing literature indicates that while LLMs help mitigate some of the challenges faced by BLV programmers, they also introduce new forms of inaccessibility. We conducted an evaluation of five popular LLM-powered integrated development environments (IDEs), assessing their performance across a comprehensive set of programming tasks. Our findings highlight several unsupported scenarios, instances of incorrect model output, and notable limitations in interaction support for specific tasks. Through observing BLV developers as they engaged in coding activities, we uncovered key interaction barriers that go beyond model accuracy or code generation quality. This paper outlines the challenges and corresponding opportunities for improving accessibility in the context of generative AI-assisted programming. Addressing these issues can meaningfully enhance the programming experience for BLV developers. As the generative AI revolution continues to unfold, it must also address the unique burdens faced by this community.

Paperid: 2664, https://arxiv.org/pdf/2504.16883.pdf

Abstract:
Retrieval-Augmented Generation (RAG) systems offer a powerful approach to enhancing large language model (LLM) outputs by incorporating fact-checked, contextually relevant information. However, fairness and reliability concerns persist, as hallucinations can emerge at both the retrieval and generation stages, affecting users' reasoning and decision-making. Our research explores how tailored warning messages -- whose content depends on the specific context of hallucination -- shape user reasoning and actions in an educational quiz setting. Preliminary findings suggest that while warnings improve accuracy and awareness of high-level hallucinations, they may also introduce cognitive friction, leading to confusion and diminished trust in the system. By examining these interactions, this work contributes to the broader goal of AI-augmented reasoning: developing systems that actively support human reflection, critical thinking, and informed decision-making rather than passive information consumption.

Paperid: 2665, https://arxiv.org/pdf/2504.16741.pdf

Abstract:
Purpose: The timespan over which exploratory searching can occur, as well as the scope and volume of the search activities undertaken, can make it difficult for searchers to remember key details about their search activities. These difficulties are present both in the midst of searching as well as when resuming a search that spans multiple sessions. In this paper, we present a search interface design and prototype implementation to support cross-session exploratory search in a public digital library context. Methods: Search Timelines provides a visualization of current and past search activities via a dynamic timeline of the search activity (queries and saved resources). This timeline is presented at two levels of detail. An Overview Timeline is provided alongside the search results in a typical search engine results page design. A Detailed Timeline is provided in the workspace, where searchers can review the history of their search activities and their saved resources. A controlled laboratory study (n=32) was conducted to compare this approach to a baseline interface modelled after a typical public digital library search/workspace interface. Results: Participants who used Search Timelines reported higher levels of user engagement, usability, and perceived knowledge gain, during an initial search session and when resuming the search after a 7-8 day interval. This came at the expense of the searchers taking more time to complete the search task, which we view as positive evidence of engagement in cross-session exploratory search processes. Conclusion: Search Timelines serves as an example of how lightweight visualization approaches can be used to enhance typical search interface designs to support exploratory search. The results highlight the value of providing persistent representations of past search activities within the search interface.

Paperid: 2666, https://arxiv.org/pdf/2504.16671.pdf

Abstract:
The use of large language models (LLMs) in qualitative analysis offers enhanced efficiency but raises questions about their alignment with the contextual nature of research for design (RfD). This research examines the trustworthiness of LLM-driven design insights, using qualitative coding as a case study to explore the interpretive processes central to RfD. We introduce LLMCode, an open-source tool integrating two metrics, namely Intersection over Union (IoU) and Modified Hausdorff Distance, to assess the alignment between human and LLM-generated insights. Across two studies involving 26 designers, we find that while the model performs well with deductive coding, its ability to emulate a designer's deeper interpretive lens over the data is limited, emphasising the importance of human-AI collaboration. Our results highlight a reciprocal dynamic where users refine LLM outputs and adapt their own perspectives based on the model's suggestions. These findings underscore the importance of fostering appropriate reliance on LLMs by designing tools that preserve interpretive depth while facilitating intuitive collaboration between designers and AI.
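
The two metrics named above can be sketched generically as follows. This is not LLMCode's implementation: the abstract does not say how the tool represents insights, so the span representation (half-open character ranges) and the point representation (coordinate tuples, e.g. embeddings) are assumptions, and the function names are hypothetical. The distance follows the common definition of the Modified Hausdorff Distance as the larger of the two directed mean nearest-neighbour distances.

```python
import math


def span_iou(spans_a, spans_b):
    """Intersection over Union of two sets of half-open (start, end) character spans."""
    chars_a = {i for start, end in spans_a for i in range(start, end)}
    chars_b = {i for start, end in spans_b for i in range(start, end)}
    union = chars_a | chars_b
    if not union:
        return 1.0  # convention chosen here: two empty annotations agree fully
    return len(chars_a & chars_b) / len(union)


def modified_hausdorff(points_a, points_b):
    """Modified Hausdorff Distance between two non-empty sets of points.

    Each point is a tuple of coordinates. The result is the larger of the two
    directed distances, where a directed distance is the mean distance from
    each point in one set to its nearest neighbour in the other set.
    """
    def directed(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return max(directed(points_a, points_b), directed(points_b, points_a))
```

For example, `span_iou([(0, 10)], [(5, 15)])` compares a human highlight covering characters 0-9 with a model highlight covering characters 5-14; the overlap is 5 characters out of a union of 15.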

Paperid: 2667, https://arxiv.org/pdf/2504.16670.pdf

Abstract:
Open source software is a rapidly evolving center for distributed work, and understanding the characteristics of this work across its different contexts is vital for informing policy, economics, and the design of enabling software. The steep increase in open source projects and corporate participation has transformed a peripheral, cottage industry component of the global technology ecosystem into a large, infinitely complex "technology parts supplier" wired into every corner of contemporary life. The lack of theory and tools for breaking this complexity down into identifiable project types or strategies for understanding them more systematically is incommensurate with current industry, society, and developer needs. This paper reviews previous attempts to classify open source software and other organizational ecosystems, using open source scientific software ecosystems in contrast with those found in corporatized open source software. It then examines the divergent and sometimes conflicting purposes that may exist for classifying open source projects and how these competing interests impede our progress in developing a comprehensive understanding of how open source software projects and companies operate. Finally, we present an empirical, mixed-methods study demonstrating how to classify open-source projects by their lifecycle position. This is the first step forward, advancing our scientific and practical knowledge of open source software through the lens of dynamic and evolving open source genres. It concludes with examples and a proposed path forward.

Paperid: 2668, https://arxiv.org/pdf/2504.16615.pdf

Abstract:
Big Data analytics and Artificial Intelligence systems derive non-intuitive and often unverifiable inferences about individuals' behaviors, preferences, and private lives. Drawing on diverse, feature-rich datasets of unpredictable value, these systems erode the intuitive connection between our actions and how we are perceived, diminishing control over our digital identities. While Explainable Artificial Intelligence scholars have attempted to explain the inner workings of algorithms, their visualizations frequently overwhelm end-users with complexity. This research introduces 'hypothetical inference', a novel approach that uses language models to simulate how algorithms might interpret users' digital footprints and infer personal characteristics without requiring access to proprietary platform algorithms. Through empirical studies with fourteen adult participants, we identified three key design opportunities to foster critical algorithmic literacy: (1) reassembling scattered digital footprints into a unified map, (2) simulating algorithmic inference through LLM-generated interpretations, and (3) incorporating temporal dimensions to visualize evolving patterns. This research lays the groundwork for tools that can help users recognize the influence of data on platforms and develop greater autonomy in increasingly algorithm-mediated digital environments.

Paperid: 2669, https://arxiv.org/pdf/2504.16572.pdf

Abstract:
The global rise of football analytics has rapidly transformed how clubs make strategic decisions. However, in India, the adoption of analytics remains constrained by institutional resistance, infrastructural limitations, and cultural barriers -- challenges that grassroots innovation and low-cost data solutions have the potential to overcome. Despite the growing popularity of the Indian Super League, resource scarcity and fragmented governance continue to hinder the widespread adoption and impact of analytics. This mixed-methods study explores how informal, decentralised analytics communities -- comprising amateur analysts and Twitter-based "data sleuths" -- navigate these constraints through peer mentorship and grassroots innovation. Drawing on extensive digital ethnography, participant observation, and interviews, the study illustrates how these informal networks mitigate data scarcity, limited digital infrastructure, and institutional indifference while fostering skill development and professional growth. Building on these insights, the paper proposes HCI interventions such as decentralised knowledge platforms to facilitate structured, cross-border peer mentorship and low-cost data solutions -- including AI-assisted player tracking and mobile analytics dashboards -- rooted in principles of frugal innovation. These interventions aim to bridge the data divide, support inclusive technical engagement in sport, and enhance analytics-driven decision-making in resource-constrained environments. This paper contributes to HCIxB's focus on cross-border collaboration by highlighting how community-driven technological adaptation in the Global South can foster meaningful participation, skill-building, and long-term sustainability through informal learning networks and scalable, context-sensitive tools.

Paperid: 2670, https://arxiv.org/pdf/2504.16533.pdf

Abstract:
Current tablet-based interfaces for drone operations often impose a heavy cognitive load on pilots and reduce situational awareness by dividing attention between the video feed and the real world. To address these challenges, we designed a heads-up augmented reality (AR) interface that overlays in-situ information to support drone pilots in safety-critical tasks. Through participatory design workshops with professional pilots, we identified key features and developed an adaptive AR interface that dynamically switches between task and safety views to prevent information overload. We evaluated our prototype by creating a realistic building inspection task and comparing three interfaces: a 2D tablet, a static AR, and our adaptive AR design. A user study with 15 participants showed that the AR interface improved access to safety information, while the adaptive AR interface reduced cognitive load and enhanced situational awareness without compromising task performance. We offer design insights for developing safety-first heads-up AR interfaces.

Paperid: 2671, https://arxiv.org/pdf/2504.16504.pdf

Abstract:
Existing depression screening predominantly relies on standardized questionnaires (e.g., PHQ-9, BDI), which suffer from high misdiagnosis rates (18-34% in clinical studies) due to their static, symptom-counting nature and susceptibility to patient recall bias. This paper presents an AI-powered depression prevention system that leverages large language models (LLMs) to analyze real-time conversational cues--including subtle emotional expressions (e.g., micro-sentiment shifts, self-referential language patterns)--for more accurate and dynamic mental state assessment. Our system achieves three key innovations: (1) Continuous monitoring through natural dialogue, detecting depression-indicative linguistic features (anhedonia markers, hopelessness semantics) with 89% precision (vs. 72% for PHQ-9); (2) Adaptive risk stratification that updates severity levels based on conversational context, reducing false positives by 41% compared to scale-based thresholds; and (3) Personalized intervention strategies tailored to users' emotional granularity, demonstrating 2.3x higher adherence rates than generic advice. Clinical validation with 450 participants shows the system identifies 92% of at-risk cases missed by traditional scales, while its explainable AI interface bridges the gap between automated analysis and clinician judgment. This work establishes conversational AI as a paradigm shift from episodic scale-dependent diagnosis to continuous, emotionally intelligent mental health monitoring.

Paperid: 2672, https://arxiv.org/pdf/2504.16502.pdf

Abstract:
Grasping constitutes a critical challenge for visually impaired people. To address this problem, we developed a tactile bracelet that assists in grasping by guiding the user's hand to a target object using vibration commands. Here we demonstrate the fully automated system around the bracelet, which can confidently detect and track target and distractor objects and reliably guide the user's hand. We validate our approach in three tasks that resemble complex, everyday use cases. In a grasping task, the participants grasp varying target objects on a table, guided via the automated hand navigation system. In the multiple objects task, participants grasp objects from the same class, demonstrating our system's ability to track one specific object without targeting surrounding distractor objects. Finally, the participants grasp one specific target object by avoiding an obstacle along the way in the depth navigation task, showcasing the potential to utilize our system's depth estimations to navigate even complex scenarios. Additionally, we demonstrate that the system can aid users in the real world by testing it in a less structured environment with a blind participant. Overall, our results demonstrate that the system, by translating the AI-processed visual inputs into a reduced data rate of actionable signals, enables autonomous behavior in everyday environments, thus potentially increasing the quality of life of visually impaired people.

Paperid: 2673, https://arxiv.org/pdf/2504.16459.pdf

Abstract:
We propose the "Insect-Computer Hybrid Speaker", which enables music to be made from combinations of computers and insects. Many studies have proposed methods and interfaces for controlling insects and obtaining feedback. However, there has been less research on the use of insects for interaction with third parties. In this paper, we propose a method in which cicadas are used as speakers triggered by Electrical Muscle Stimulation (EMS). We explored and investigated the chirp waveform suitable for control, the appropriate voltage range, and the maximum pitch at which cicadas can chirp.

Paperid: 2674, https://arxiv.org/pdf/2504.16255.pdf

Abstract:
The issue of fairness in decision-making is a critical one, especially given the variety of stakeholder demands for differing and mutually incompatible versions of fairness. Adopting a strategic interaction of perspectives provides an alternative to enforcing a singular standard of fairness. We present a web-based software application, FairPlay, that enables multiple stakeholders to debias datasets collaboratively. With FairPlay, users can negotiate and arrive at a mutually acceptable outcome without a universally agreed-upon theory of fairness. In the absence of such a tool, reaching a consensus would be highly challenging due to the lack of a systematic negotiation process and the inability to modify and observe changes. We have conducted user studies that demonstrate the success of FairPlay, as users could reach a consensus within about five rounds of gameplay, illustrating the application's potential for enhancing fairness in AI systems.

Paperid: 2675, https://arxiv.org/pdf/2504.16193.pdf

Abstract:
Background and aim: Considering the scope of the application of artificial intelligence beyond the field of computer science, one of the concerns of researchers is to provide quality explanations about the functioning of algorithms based on artificial intelligence and the data extracted from it. The purpose of the present study is to validate the Italian version of the System Causability Scale (I-SCS) to measure the quality of explanations provided in an xAI system. Method: For this purpose, the English version, initially provided in 2020 in coordination with the main developer, was utilized. The forward-backward translation method was applied to ensure accuracy. Finally, these nine steps were completed by calculating the content validity index/ratio and conducting cognitive interviews with representative end users. Results: The original version of the questionnaire consisted of 10 questions. However, based on the obtained indexes (CVR below 0.49), one question (Question 8) was entirely removed. After completing the aforementioned steps, the Italian version contained 9 questions. The representative sample of Italian end users fully comprehended the meaning and content of the questions in the Italian version. Conclusion: The Italian version obtained in this study can be used in future research studies as well as in the field by xAI developers. This tool can be used to measure the quality of explanations provided for an xAI system in Italian culture.
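
To make the item-removal step concrete, the sketch below assumes the study used Lawshe's content validity ratio, CVR = (n_e - N/2) / (N/2), where n_e is the number of experts rating an item "essential" and N is the panel size. The abstract does not state the formula or the panel size; only the 0.49 cutoff comes from the abstract. The function names and any ratings passed in are hypothetical.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR: ranges from -1 (no expert says essential) to +1 (all do)."""
    if n_experts <= 0:
        raise ValueError("n_experts must be positive")
    half = n_experts / 2
    return (n_essential - half) / half


def items_below_cutoff(essential_counts: dict, n_experts: int, cutoff: float = 0.49) -> list:
    """Return the items whose CVR falls below the cutoff (candidates for removal)."""
    return [
        item
        for item, n_essential in essential_counts.items()
        if content_validity_ratio(n_essential, n_experts) < cutoff
    ]
```

With an illustrative panel of 10 experts, an item rated essential by 9 of them has a CVR of 0.8 and is kept, while an item rated essential by 6 has a CVR of 0.2 and falls below the 0.49 cutoff.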

Paperid: 2676, https://arxiv.org/pdf/2504.16117.pdf

Abstract:
Vision systems are increasingly deployed in critical domains such as surveillance, law enforcement, and transportation. However, their vulnerabilities to rare or unforeseen scenarios pose significant safety risks. To address these challenges, we introduce Context-Awareness and Interpretability of Rare Occurrences (CAIRO), an ontology-based human-assistive discovery framework for failure cases (or CP - Critical Phenomena) detection and formalization. CAIRO by design incentivizes human-in-the-loop for testing and evaluation of criticality that arises from misdetections, adversarial attacks, and hallucinations in AI black-box models. Our robust analysis of object detection model(s) failures in automated driving systems (ADS) showcases scalable and interpretable ways of formalizing the observed gaps between camera perception and real-world contexts, resulting in test cases stored as explicit knowledge graphs (in OWL/XML format) amenable for sharing, downstream analysis, logical reasoning, and accountability.

Paperid: 2677, https://arxiv.org/pdf/2504.16021.pdf

Abstract:
Flow theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task's difficulty aligns with their skill level. In AI-augmented reasoning, interventions that disrupt the state of cognitive flow can hinder rather than enhance decision-making. This paper proposes a context-aware cognitive augmentation framework that adapts interventions based on three key contextual factors: type, timing, and scale. By leveraging multimodal behavioral cues (e.g., gaze behavior, typing hesitation, interaction speed), AI can dynamically adjust cognitive support to maintain or restore flow. We introduce the concept of cognitive flow, an extension of flow theory in AI-augmented reasoning, where interventions are personalized, adaptive, and minimally intrusive. By shifting from static interventions to context-aware augmentation, our approach ensures that AI systems support deep engagement in complex decision-making and reasoning without disrupting cognitive immersion.

Paperid: 2678, https://arxiv.org/pdf/2504.15894.pdf

Abstract:
High stakes decision-making often requires a continuous interplay between evolving evidence and shifting hypotheses, a dynamic that is not well supported by current AI decision support systems. In this paper, we introduce a mixed-initiative framework for AI assisted decision making that is grounded in the data-frame theory of sensemaking and the evaluative AI paradigm. Our approach enables both humans and AI to collaboratively construct, validate, and adapt hypotheses. We demonstrate our framework with an AI-assisted skin cancer diagnosis prototype that leverages a concept bottleneck model to facilitate interpretable interactions and dynamic updates to diagnostic hypotheses.

Paperid: 2679, https://arxiv.org/pdf/2504.15797.pdf

Abstract:
The integration of digital technologies into memorialization practices offers opportunities to transcend physical and temporal limitations. However, designing personalized memorial spaces that address the diverse needs of the dying and the bereaved remains underexplored. Using a Research through Design (RtD) approach, we conducted a three-phase study: participatory design, VR memorial space development, and user testing. This study highlights three key aspects: 1) the value of VR memorial spaces as bonding mediums, 2) the role of a design process that engages users through co-design, development, and user testing in addressing the needs of the dying and the bereaved, and 3) design elements that enhance the VR memorial experience. This research lays the foundation for personalized VR memorialization practices, providing insights into how technology can enrich remembrance and relational experiences.

Paperid: 2680, https://arxiv.org/pdf/2504.15746.pdf

Abstract:
Recent advances in immersive technology have opened new possibilities in sports training, especially for activities requiring precise motor skills, such as tennis. In this paper, we present a virtual reality (VR) tennis training system integrating real-time performance feedback through a wearable sensor device. Ten participants wore the sensor on their dominant hand to capture motion data, including swing speed and swing power, while engaging in a VR tennis environment. Initially, participants performed baseline tests without access to performance metrics. In subsequent tests, participants executed similar routines with their swing data displayed in real-time via a VR overlay. Qualitative and quantitative results indicated that real-time visual feedback led to improved performance behaviors and enhanced situational awareness. Some participants exhibited increased swing consistency and strategic decision-making, though improvements in accuracy varied individually. Additionally, subjective feedback highlighted that the immersive experience, combined with instantaneous performance metrics, enhanced player engagement and motivation. These findings illustrate the effectiveness of VR-based data visualisation in sports training, suggesting broader applicability in performance enhancement.

Paperid: 2681, https://arxiv.org/pdf/2504.15743.pdf

Abstract:
Respiratory auscultation is crucial for early detection of pediatric pneumonia, a condition that can quickly worsen without timely intervention. In areas with limited physician access, effective auscultation is challenging. We present a smartphone-based system that leverages built-in microphones and advanced deep learning algorithms to detect abnormal respiratory sounds indicative of pneumonia risk. Our end-to-end deep learning framework employs domain generalization to integrate a large electronic stethoscope dataset with a smaller smartphone-derived dataset, enabling robust feature learning for accurate respiratory assessments without expensive equipment. The accompanying mobile application guides caregivers in collecting high-quality lung sound samples and provides immediate feedback on potential pneumonia risks. User studies show strong classification performance and high acceptance, demonstrating the system's ability to facilitate proactive interventions and reduce preventable childhood pneumonia deaths. By seamlessly integrating into ubiquitous smartphones, this approach offers a promising avenue for more equitable and comprehensive remote pediatric care.

Paperid: 2682, https://arxiv.org/pdf/2504.15647.pdf

Abstract:
Real-time reflection plays a vital role in synchronous communication. It enables users to adjust their communication strategies dynamically, thereby improving the effectiveness of their communication. Generative AI holds significant potential to enhance real-time reflection due to its ability to comprehensively understand the current context and generate personalized and nuanced content. However, it is challenging to design interaction and information presentation that support the real-time workflow rather than disrupt it. In this position paper, we present a review of existing research on systems designed for reflection in different synchronous communication scenarios. Based on that review, we discuss implications for designing human-AI interaction that supports reflection in real time.

Paperid: 2683, https://arxiv.org/pdf/2504.15480.pdf

Abstract:
Stress is a pervasive challenge that significantly impacts worker health and well-being. Workplace stress is driven by various factors, ranging from organizational changes to poor workplace design. Although individual stress management strategies have been shown to be effective, current interventions often overlook personal and contextual factors shaping stress experiences. In this study, we conducted semi-structured interviews with eight office workers to gain a deeper understanding of their personal experiences with workplace stress. Our analysis reveals key stress triggers, coping mechanisms, and reflections on past stressful events. We highlight the multifaceted and individualized nature of workplace stress, emphasizing the importance of intervention timing and modality, and recognizing that stress is not solely a negative experience but can also have positive effects. Our findings provide actionable insights for the design of user-centered stress management solutions more attuned to the needs of office workers.

Paperid: 2684, https://arxiv.org/pdf/2504.15189.pdf

Abstract:
We present LACE, a hybrid Human-AI co-creative system integrated into Adobe Photoshop supporting turn-taking and parallel interaction modes for iterative image generation. Through a study with 21 participants across representational, abstract, and design tasks, we found turn-taking preferred in early stages for idea generation, and parallel modes suited for detailed refinement. While this shorter workshop paper provides key insights and highlights, the comprehensive findings and detailed analysis are presented in a longer version available separately on arXiv.

Paperid: 2685, https://arxiv.org/pdf/2504.14928.pdf

Abstract:
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities, with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting that next-generation educational AI should prioritize targeted enhancement of specific pedagogical effectiveness.

Paperid: 2686, https://arxiv.org/pdf/2504.14827.pdf

Abstract:
This paper introduces LACE, a co-creative system enabling professional artists to leverage generative AI through controlled prompting and iterative refinement within Photoshop. Addressing challenges in precision, iterative coherence, and workflow compatibility, LACE allows flexible control via layer-based editing and dual-mode collaboration (turn-taking and parallel). A pilot study (N=21) demonstrates significant improvements in user satisfaction, ownership, usability, and artistic perception compared to standard AI workflows. We offer comprehensive findings, system details, nuanced user feedback, and implications for integrating generative AI in professional art practices.

Paperid: 2687, https://arxiv.org/pdf/2504.14807.pdf

Abstract:
A driver face monitoring system can detect driver fatigue, which is a significant factor in many accidents, using computer vision techniques. In this paper, we present a real-time technique for driver eye state detection. First, the face is detected, and the eyes are located within the face region for tracking. A normalized cross-correlation-based online dynamic template matching technique, combined with Kalman filter tracking, is proposed to track the detected eye positions in subsequent image frames. A support vector machine with histogram of oriented gradients (HOG) features is used to classify the state of the eyes as open or closed. If the eyes remain closed for a specified period, the driver is considered to be asleep, and an alarm is triggered.
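The template-matching step described above can be illustrated with a minimal, dependency-free sketch. It assumes the zero-mean variant of normalized cross-correlation and an exhaustive search over a small grayscale image stored as nested lists; the function names `ncc` and `best_match` are illustrative and not taken from the paper, and the Kalman filter and HOG/SVM stages are not shown.

```python
import math

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equally sized 2D lists."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = math.sqrt(
        sum((a - mean_p) ** 2 for a in p) * sum((b - mean_t) ** 2 for b in t)
    )
    # A flat patch or flat template has no structure to correlate with.
    return num / den if den else 0.0

def best_match(image, template):
    """Slide the template over the image; return (score, row, col) of the best NCC."""
    th, tw = len(template), len(template[0])
    best = (-2.0, 0, 0)  # NCC lies in [-1, 1], so -2 is a safe initial value
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[0]:
                best = (score, r, c)
    return best

# Toy example: a 2x2 "eye" template embedded in a 4x4 image at row 1, column 2.
image = [
    [0, 0, 0, 0],
    [0, 0, 9, 1],
    [0, 0, 2, 8],
    [0, 0, 0, 0],
]
template = [[9, 1], [2, 8]]
score, row, col = best_match(image, template)
```

In a tracker of the kind the abstract describes, the search would presumably be restricted to a window around the position predicted for the next frame rather than run over the whole image; that restriction is omitted here for brevity.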

Paperid: 2688, https://arxiv.org/pdf/2504.14779.pdf

Abstract:
While generative artificial intelligence (GenAI) is finding increased adoption in workplaces, current tools are primarily designed for individual use. Prior work established the potential for these tools to enhance personal creativity and productivity towards shared goals; however, we don't know yet how to best take into account the nuances of group work and team dynamics when deploying GenAI in work settings. In this paper, we investigate the potential of collaborative GenAI agents to augment teamwork in synchronous group settings through an exploratory study that engaged 25 professionals across 6 teams in speculative design workshops and individual follow-up interviews. Our workshops included a mixed reality provotype to simulate embodied collaborative GenAI agents capable of actively participating in group discussions. Our findings suggest that, if designed well, collaborative GenAI agents offer valuable opportunities to enhance team problem-solving by challenging groupthink, bridging communication gaps, and reducing social friction. However, teams' willingness to integrate GenAI agents depended on the agents' perceived fit across a number of individual, team, and organizational factors. We outline the key design tensions around agent representation, social prominence, and engagement and highlight the opportunities spatial and immersive technologies could offer to modulate GenAI influence on team outcomes and strike a balance between augmentation and agency.

Paperid: 2689, https://arxiv.org/pdf/2504.14689.pdf

Abstract:
The recent rapid advancement of LLM-based AI systems has accelerated our search for and production of information. While the advantages brought by these systems seemingly improve the performance or efficiency of human activities, they do not necessarily enhance human capabilities. Recent research has started to examine the impact of generative AI on individuals' cognitive abilities, especially critical thinking. Based on definitions of critical thinking across psychology and education, this position paper proposes the distinction between demonstrated and performed critical thinking in the era of generative AI and discusses the implications of this distinction for the research and development of AI systems that aim to augment human critical thinking.

Paperid: 2690, https://arxiv.org/pdf/2504.14580.pdf

Abstract:
Traditional urban planning methodologies often fail to capture the complexity of contemporary urbanization and environmental sustainability challenges. This study investigates the integration of Generative Design, Virtual Reality (VR), and Digital Twins (DT) to enhance walkability in urban planning. VR provides distinct benefits over conventional approaches, including 2D maps, static renderings, and physical models, by allowing stakeholders to engage with urban designs more intuitively, identify walkability challenges, and suggest iterative improvements. Preliminary findings from structured interviews with Eindhoven residents provide critical insights into pedestrian preferences and walkability considerations. The next phase of the study involves the development of VR-DT integrated prototypes to simulate urban environments, assess walkability, and explore the role of Generative Design in generating adaptive urban planning solutions. The objective is to develop a decision-support tool that enables urban planners to incorporate diverse stakeholder perspectives, optimize pedestrian-oriented urban design, and advance regenerative development principles. By leveraging these emerging technologies, this research contributes to the evolution of data-driven, participatory urban planning frameworks aimed at fostering sustainable and walkable cities.

Paperid: 2691, https://arxiv.org/pdf/2504.14562.pdf

Abstract:
Despite being used in many engineering and scientific areas such as physics and mathematics and often taught in high school, graphical vector addition turns out to be a topic prone to misconceptions, even in university-level physics classes. To improve the learning experience and the resulting understanding of vectors, we propose to investigate how concreteness fading implemented with the use of augmented reality and tangible robots could help learners to build a strong representation of vector addition. We design a gamified learning environment consisting of three concreteness fading stages and conduct an experiment with 30 participants. Our results show a positive learning gain. We extensively analyze the behavior of the participants to understand the usage of the technological tools -- augmented reality and tangible robots -- during the learning scenario. Finally, we discuss how the combination of these tools shows real advantages in implementing the concreteness fading paradigm. Our work provides empirical insights into how users utilize concrete visualizations conveyed by a haptic-enabled robot and augmented reality in a learning scenario.

Paperid: 2692, https://arxiv.org/pdf/2504.14469.pdf

Abstract:
Non-adherence to medications is a critical concern since nearly half of patients with chronic illnesses do not follow their prescribed medication regimens, leading to increased mortality, costs, and preventable human distress. Amongst stage 0-3 breast cancer survivors, adherence to long-term adjuvant endocrine therapy (i.e., Tamoxifen and aromatase inhibitors) is associated with a significant increase in recurrence-free survival. This work aims to develop multi-scale models of medication adherence to understand the significance of different factors influencing adherence across varying time frames. We introduce a computational framework guided by Social Cognitive Theory for multi-scale (daily and weekly) modeling of longitudinal medication adherence. Our models employ both dynamic medication-taking patterns in the recent past (dynamic factors) as well as less frequently changing factors (static factors) for adherence prediction. Additionally, we assess the significance of various factors in influencing adherence behavior across different time scales. Our models outperform traditional machine learning counterparts in both daily and weekly tasks in terms of both accuracy and specificity. Daily models achieved an accuracy of 87.25%, and weekly models, an accuracy of 76.04%. Notably, dynamic past medication-taking patterns prove most valuable for predicting daily adherence, while a combination of dynamic and static factors is significant for macro-level weekly adherence patterns.
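The two evaluation metrics named above can be made concrete with a short sketch. It assumes binary labels where 1 means a dose was taken (adherent) and 0 means it was missed, so that specificity measures how many missed doses are correctly identified; this label convention and the function name are assumptions for illustration, not details given in the abstract.

```python
def accuracy_and_specificity(y_true, y_pred):
    """Return (accuracy, specificity) for binary labels (1 = adherent, 0 = non-adherent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Specificity: share of truly non-adherent cases predicted as non-adherent.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, specificity

# Toy example with six days of made-up labels.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, spec = accuracy_and_specificity(y_true, y_pred)
```

Reporting specificity alongside accuracy matters when most days are adherent, because a model that always predicts "adherent" can score high accuracy while never detecting a missed dose.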

Paperid: 2693, https://arxiv.org/pdf/2504.14269.pdf

Abstract:
A brain-computer interface (BCI) facilitates direct communication between the brain and external equipment through EEG, which is preferred for its superior temporal resolution. Among EEG techniques, the steady-state visual evoked potential (SSVEP) is favored due to its robust signal-to-noise ratio, minimal training demands, and elevated information transmission rate. Frequency detection in SSVEP-based brain-computer interfaces commonly employs canonical correlation analysis (CCA). SSCCA (spatio-spectral canonical correlation analysis) augments CCA by refining spatial filtering. This paper presents a multistage feature fusion methodology for short-duration SSVEP frequency identification, employing SSCCA with template signals derived via leave-one-out cross-validation (LOOCV). A filterbank generates bandpass filters for stimulus frequencies and their harmonics, whereas SSCCA calculates correlation coefficients between subbands and templates. Two phases of non-linear weighting amalgamate these coefficients to discern the target stimulus. This multistage methodology surpasses traditional techniques, attaining an accuracy of 94.5%.
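The abstract does not specify its two weighting phases, so the following is only a sketch of one common non-linear fusion rule from the filter-bank CCA literature, not the authors' method. It assumes sub-band weights of the form w(n) = n^(-a) + b applied to squared correlation coefficients; the constants `a` and `b` and all function names are illustrative, and the correlation values are made up.

```python
def subband_weights(n_subbands, a=1.25, b=0.25):
    """Weights that decay with sub-band index n = 1, 2, ... (assumed form n**-a + b)."""
    return [n ** (-a) + b for n in range(1, n_subbands + 1)]

def fuse(correlations, a=1.25, b=0.25):
    """Combine per-sub-band correlation coefficients for one candidate frequency."""
    weights = subband_weights(len(correlations), a, b)
    return sum(w * r ** 2 for w, r in zip(weights, correlations))

def detect(freq_to_correlations):
    """Pick the candidate stimulus frequency with the largest fused score."""
    return max(freq_to_correlations, key=lambda f: fuse(freq_to_correlations[f]))

# Made-up correlations for two candidate frequencies (Hz) across three sub-bands.
scores = {8.0: [0.2, 0.1, 0.1], 10.0: [0.7, 0.5, 0.3]}
target = detect(scores)
```

Squaring the coefficients and down-weighting higher sub-bands emphasizes the fundamental and low harmonics, where the SSVEP response is typically strongest.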

Paperid: 2694, https://arxiv.org/pdf/2504.14065.pdf

Abstract:
Visualization has long been fundamental to human communication and decision-making. Today, we stand at the threshold of integrating veridical, high-fidelity visualizations into immersive digital environments, alongside digital twinning techniques. This convergence heralds powerful tools for communication, co-design, and participatory decision-making. Our paper delves into the development of lightweight open-source immersive digital twin visualisations, capitalizing on the evolution of immersive technologies, the wealth of spatial data available, and advancements in digital twinning. Coined AnywhereXR, this approach ultimately seeks to democratize access to spatial information at a global scale. Utilizing the Netherlands as our starting point, we envision expanding this methodology worldwide, leveraging open data and software to address pressing societal challenges across diverse domains.

Paperid: 2695, https://arxiv.org/pdf/2504.13994.pdf

Abstract:
The Unix terminal, or simply the terminal, is used in almost every facet of computing. It is available across all major platforms and often integrated into other applications. Due to its ubiquity, even marginal improvements to the terminal have the potential to make massive improvements to productivity on a global scale. We believe that evolutionary improvements to the terminal, in its current incarnation as a windowed terminal emulator, are possible and that developing a thorough understanding of issues that current terminal users face is fundamental to knowing how the terminal should evolve. In order to develop that understanding we have mined Unix and Linux Stack Exchange using a fully-reproducible method which was able to extract and categorize 91.0% of 1,489 terminal-related questions (from the full set of nearly 240,000 questions) without manual intervention. We present an analysis, to our knowledge the first of its kind, of windowed terminal-related questions posted over a 15-year period and viewed, in aggregate, approximately 40 million times. As expected, given its longevity, we find the terminal's many features being applied across a wide variety of use cases. We find evidence that the terminal, as a windowed terminal emulator, has neither fully adapted to its current graphical environment nor completely untangled itself from features more suited to incarnations in previous environments. We also find evidence of areas where we believe the terminal could be extended along with other areas where it could be simplified. Surprisingly, while many current efforts to improve the terminal include improving the terminal's social and collaborative aspects, we find little evidence of this as a prominent pain point.

Paperid: 2696, https://arxiv.org/pdf/2504.13973.pdf

Abstract:
Animal-Human-Machine (AHM) teams are a type of hybrid intelligence system wherein interactions between a human, AI-enabled machine, and animal members can result in unique capabilities greater than the sum of their parts. This paper calls for a systematic approach to studying the design of AHM team structures to optimize performance and overcome limitations in various applied settings. We consider the challenges and opportunities in investigating the synergistic potential of AHM team members by introducing a set of dimensions of AHM team functioning to effectively utilize each member's strengths while compensating for individual weaknesses. Using three representative examples of such teams -- security screening, search-and-rescue, and guide dogs -- the paper illustrates how AHM teams can tackle complex tasks. We conclude with open research directions that this multidimensional approach presents for studying hybrid human-AI systems beyond AHM teams.

Paperid: 2697, https://arxiv.org/pdf/2504.13942.pdf

Abstract:
This paper introduces Intelligence of Things (INOT), a novel spatial context-aware control system that enhances smart home automation through intuitive spatial reasoning. Current smart home systems largely rely on device-specific identifiers, limiting user interaction to explicit naming conventions rather than natural spatial references. INOT addresses this limitation through a modular architecture that integrates Vision Language Models with IoT control systems to enable natural language commands with spatial context (e.g., "turn on the light near the window"). The system comprises key components including an Onboarding Inference Engine, Zero-Shot Device Detection, Spatial Topology Inference, and Intent-Based Command Synthesis. A comprehensive user study with 15 participants demonstrated INOT's significant advantages over conventional systems like Google Home Assistant, with users reporting reduced cognitive workload (NASA-TLX scores decreased by an average of 13.17 points), higher ease-of-use ratings, and stronger preference (14 out of 15 participants). By eliminating the need to memorize device identifiers and enabling context-aware spatial commands, INOT represents a significant advancement in creating more intuitive and accessible smart home control systems.
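To make the idea of a spatially grounded command concrete, here is a deliberately simplified sketch of resolving "the light near the window" once device and landmark positions are known. The room layout, the 2D coordinates, and the `resolve` function are all hypothetical; in INOT these positions would come from the vision-language components, which are not modeled here.

```python
import math

# Hypothetical output of a spatial-topology step: name -> (kind, x, y) in metres.
ROOM = {
    "lamp_1": ("light", 0.5, 3.8),
    "lamp_2": ("light", 4.0, 0.5),
    "window_1": ("window", 0.0, 4.0),
}

def resolve(kind, landmark_kind, room):
    """Return the name of the `kind` device closest to any `landmark_kind` object."""
    landmarks = [(x, y) for k, x, y in room.values() if k == landmark_kind]
    candidates = {name: (x, y) for name, (k, x, y) in room.items() if k == kind}
    if not landmarks or not candidates:
        return None  # nothing to ground the command against
    return min(
        candidates,
        key=lambda name: min(math.dist(candidates[name], lm) for lm in landmarks),
    )

# "turn on the light near the window" -> which light?
target = resolve("light", "window", ROOM)
```

The point of the sketch is the contrast with identifier-based control: the user never names `lamp_1`; the device is selected from its spatial relation to a landmark.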

Paperid: 2698, https://arxiv.org/pdf/2504.13922.pdf

Abstract:
Web tracking (WT) systems are advanced technologies used to monitor and analyze online user behavior. Initially focused on HTML and static webpages, these systems have evolved with the proliferation of IoT, edge computing, and Big Data, encompassing a broad array of interconnected devices with APIs, interfaces and computing nodes for interaction. WT systems are pivotal in technological innovation and business development, although trends like GDPR complicate data extraction and mandate transparency. Specifically, this study examines WT systems purely from a technological perspective, excluding organizational and privacy implications. A novel classification scheme based on technological architecture and principles is proposed, compared to two preexisting frameworks. The scheme categorizes WT systems into six classes, emphasizing technological mechanisms such as HTTP protocols, APIs, and user identification techniques. Additionally, a survey of over 1,000 internet users, conducted via Google Forms, explores user awareness of WT systems. Findings indicate that knowledge of WT technologies is largely unrelated to demographic factors such as age or gender but is strongly influenced by a user's background in computer science. Most users demonstrate only a basic understanding of WT tools, and this awareness does not correlate with heightened concerns about data misuse. As such, the research highlights gaps in user education about WT technologies and underscores the need for a deeper examination of their technical underpinnings. This study provides a foundation for further exploration of WT systems from multiple perspectives, contributing to advancements in classification, implementation, and user awareness.

Paperid: 2699, https://arxiv.org/pdf/2504.13918.pdf

Abstract:
As our information environments become ever more powered by artificial intelligence (AI), the phenomenon of trust in a human's interactions with this intelligence is becoming increasingly pertinent. For example, in the not too distant future, there will be teams of humans and intelligent robots involved in dealing with the repercussions of high-risk disaster situations such as hurricanes, earthquakes, or nuclear accidents. Even in such conditions of high uncertainty, humans and intelligent machines will need to engage in shared decision making, and trust is fundamental to the effectiveness of these interactions. A key challenge in modeling the dynamics of this trust is to provide a means to incorporate sensitivity to fluctuations in human trust judgments. In this article, we explore the ability of Quantum Random Walk models to model the dynamics of trust in human-AI interactions, and to integrate a sensitivity to fluctuations in participant trust judgments based on the nature of the interaction with the AI. We found that using empirical parameters to inform the use of different Hamiltonians can provide a promising means to model the evolution of trust in Human-AI interactions.

Paperid: 2700, https://arxiv.org/pdf/2504.13917.pdf

Abstract:
This paper introduces a modular pet feeding device that combines automated feeding, health monitoring, and behavioral insights for modern pet care. Unlike traditional feeders, it features a wide-angle camera and microphone for food and water level assessment, pet approach detection, and sound monitoring. The device also includes an AI-enabled neckband to track heart rate, enabling early detection of unusual behaviors or health concerns. The AI system analyzes feeding history, behavior, and health data to provide personalized care suggestions, optimizing feeding times, portions, and dietary recommendations to improve pet well-being.

Paperid: 2701, https://arxiv.org/pdf/2504.13901.pdf

Abstract:
Mild cognitive impairment (MCI) may affect up to 20% of people over 65 years old. Global incidence of MCI is increasing, and technology is being explored for early intervention. Theories of technology adoption predict that useful and easy-to-use solutions will have higher rates of adoption; however, these models do not specifically consider older people with cognitive impairments, or the unique human-computer interaction challenges posed by MCI. We collated opinions from older people with MCI about technology solutions proposed for them, drawn from 83 articles published between Jan 2014 and May 2024 and found in nine databases. Inductive thematic analysis of feedback identified five themes: (i) purpose and need, (ii) solution design and ease of use, (iii) self-impression, (iv) lifestyle, and (v) interaction modality. Solutions are perceived as useful, even though gaps in functional support exist; however, they are not perceived as entirely easy to use, due to issues related to usability and user experience. Devices that are light, portable, and common and that have large screens are preferred, as is multimodal interaction, in particular speech, visual/text, and touch. This review recommends future work to (i) improve usability and user experience, (ii) enhance personalisation, (iii) better understand interaction preferences and effectiveness, (iv) enable options for multimodal interaction, and (v) more seamlessly integrate solutions into users' lifestyles.

Paperid: 2702, https://arxiv.org/pdf/2504.13895.pdf

Abstract:
Modern algorithmic recommendation systems seek to engage users through behavioral content-interest matching. While many platforms recommend content based on engagement metrics, others like TikTok deliver interest-based content, resulting in recommendations perceived to be hyper-personalized compared to other platforms. TikTok's robust recommendation engine has led some users to suspect that the algorithm knows users "better than they know themselves," but this is not always true. In this paper, we explore TikTok users' perceptions of recommended content on their For You Page (FYP), specifically calling attention to unwanted recommendations. Through qualitative interviews of 14 current and former TikTok users, we find themes of frustration with recommended content, attempts to rid themselves of unwanted content, and various degrees of success in eschewing such content. We discuss implications in the larger context of folk theorization and contribute concrete tactical and behavioral examples of algorithmic persistence.

Paperid: 2703, https://arxiv.org/pdf/2504.13893.pdf

Abstract:
Current direct modeling systems limit users to low-level interactions with vertices, edges, and faces, forcing designers to manage detailed geometric elements rather than focusing on high-level design intent. This paper introduces semantic direct modeling (SDM), a novel approach that lifts direct modeling from low-level geometric modifications to high-level semantic interactions. This is achieved by utilizing a large language model (LLM) fine-tuned with CAD-specific prompts, which can guide the LLM to reason through design intent and accurately interpret CAD commands, thereby allowing designers to express their intent using natural language. Additionally, SDM maps design intent to the corresponding geometric features in the CAD model through a new conditional, context-sensitive feature recognition method, which uses generative AI to dynamically assign feature labels based on design intent. Together, they enable a seamless flow from high-level design intent to low-level geometric modifications, bypassing tedious software interactions. The effectiveness of SDM has been validated through real mechanical design cases.

Paperid: 2704, https://arxiv.org/pdf/2504.13892.pdf

Abstract:
Thematic analysis (TA) is a widely used qualitative research method for identifying and interpreting patterns within textual data, such as qualitative interviews. Recent research has shown that it is possible to satisfactorily perform TA using Large Language Models (LLMs). This paper presents a novel application using LLMs to assist researchers in conducting TA. The application enables users to upload textual data and generate initial codes and themes. All of this is possible through a simple Graphical User Interface (GUI) based on the Streamlit framework, working with Python scripts for the analysis and using the Application Program Interfaces of LLMs. Having a GUI is particularly important for researchers in fields where coding skills may not be prevalent, such as social sciences or humanities. With the app, users can iteratively refine codes and themes by adopting a human-in-the-loop process, without the need to work with programming and scripting. The paper describes the application's key features, highlighting its potential for qualitative research while preserving methodological rigor. The paper discusses the design and interface of the app and outlines future directions for this work.

Paperid: 2705, https://arxiv.org/pdf/2504.13890.pdf

Abstract:
Computational mental health research develops models to predict and understand psychological phenomena, but often relies on inappropriate measures of psychopathology constructs, undermining validity. We identify three key issues: (1) reliance on unvalidated measures (e.g., self-declared diagnosis) over validated ones (e.g., diagnosis by clinician); (2) treating mental health constructs as categorical rather than dimensional; and (3) focusing on disorder-specific constructs instead of transdiagnostic ones. We outline the benefits of using validated, dimensional, and transdiagnostic measures and offer practical recommendations for practitioners. Using valid measures that reflect the nature and structure of psychopathology is essential for computational mental health research.

Paperid: 2706, https://arxiv.org/pdf/2504.13886.pdf

Abstract:
Researchers have long recognized pupil response as a potential objective indicator of emotional arousal; however, confounding factors, particularly luminosity of stimuli and the ambient environment, have limited its usefulness in detecting emotions. This study presents a new approach to isolate and remove the effect of luminosity on pupil dilation, obtaining the component of pupil dilation due only to emotional arousal. Our model predicts the pupil size due to luminosity only as a function of the screen luminosity and, by using a calibration procedure, adapts to individual differences in pupil response to light and to different types and configurations of monitors. The predicted pupil size has an average correlation with the measured pupil size of 0.76, an R² of 0.58, and a normalized root mean square error (NRMSE) of 0.14. Here, we demonstrate that our model can be used simply to calculate emotional arousal. We showed 32 video clips with different content and emotional intensity to 47 participants, who, after each video, reported their level of emotional arousal. We then calculated the pupil size due only to luminosity and subtracted it from the total recorded pupil size, obtaining the component due only to emotional arousal. From the latter, we predicted the arousal of each participant for each video. We obtained an average correlation between predicted and self-reported arousal of 0.65, an R² of 0.43, and an NRMSE of 0.27. In contrast, using the measured pupil size, without subtracting the component due to luminosity, we obtained dramatically worse results: an average correlation between the predicted and self-reported arousal of 0.26, an R² of 0.09, and an NRMSE of 0.42. Our results highlight that separating the emotional and luminosity components from pupillary responses is critical to accurately and precisely predicting arousal.
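The subtraction step and the two headline metrics can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the luminosity-driven pupil size is taken as given (the calibration model is not reproduced), NRMSE is assumed to be RMSE divided by the range of the target values, and all numbers and function names are made up.

```python
import math

def arousal_component(measured, predicted_luminosity):
    """Pupil size left after removing the part attributed to screen luminosity."""
    return [m - p for m, p in zip(measured, predicted_luminosity)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def nrmse(pred, target):
    """RMSE normalized by the range of the target (one common convention)."""
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))
    return rmse / (max(target) - min(target))

# Made-up pupil diameters (mm) for four clips and the luminosity-only prediction.
measured = [4.0, 3.5, 5.0, 4.5]
luminosity_only = [3.0, 3.0, 3.5, 3.5]
residual = arousal_component(measured, luminosity_only)

# Made-up self-reported arousal ratings for the same clips.
reported = [2.0, 1.0, 3.0, 2.0]
r = pearson(residual, reported)
err = nrmse([1.0, 2.0, 4.0], [1.0, 2.0, 3.0])
```

In this toy data the raw measurements correlate less strongly with the ratings than the residual does, which mirrors the direction of the effect the abstract reports.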

Paperid: 2707, https://arxiv.org/pdf/2504.13868.pdf

Abstract:
This study challenges the widely-reported tradeoff between generative AI's (GenAI) contribution to creative outcomes and decreased diversity of these outcomes. We modified the design of such a study, by Doshi and Hauser (2024), in which participants wrote short stories either aided or unaided by GenAI plot ideas[1]. In the modified study, plot ideas were generated through ten unique GenAI "personas" with diverse traits (e.g. cultural backgrounds, thinking styles, genre preferences), creating a pool of 300 story plots. While plot ideas from any individual persona showed high similarity (average cosine similarity of 0.92), ideas across different personas exhibited substantial variation (average similarity of 0.20). When human participants wrote stories based on these diverse plot ideas, their collective outputs maintained the same level of diversity as stories written without GenAI assistance, effectively eliminating the diversity reduction observed in [1]. Traditional text analytics further revealed that GenAI-assisted stories featured greater diversity in descriptive and emotional language compared to purely human-generated stories without GenAI assistance. Our findings demonstrate that introducing diversity at the AI input stage through distinct personas can preserve and potentially enhance the collective diversity of human creative outputs when collaborating with GenAI.
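The within-persona and across-persona similarity figures can be read as averages of pairwise cosine similarities. The sketch below assumes each plot idea has already been mapped to an embedding vector (the embedding model is not specified in the abstract); the toy 2D vectors and function names are illustrative only.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_within(vectors):
    """Average similarity over all unordered pairs inside one persona's ideas."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def mean_across(group_a, group_b):
    """Average similarity over all pairs drawn from two different personas."""
    pairs = list(product(group_a, group_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings: each persona's ideas cluster around a different direction.
persona_a = [[1.0, 0.0], [1.0, 0.1]]
persona_b = [[0.0, 1.0], [0.1, 1.0]]
within_a = mean_within(persona_a)
across_ab = mean_across(persona_a, persona_b)
```

A high within-persona average together with a low across-persona average is the pattern the abstract describes: each persona is internally repetitive, but the pool of personas covers distinct regions of the idea space.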

Paperid: 2708, https://arxiv.org/pdf/2504.13867.pdf

Abstract:
Background: Executive functions (EFs) are cognitive processes essential for controlling impulses, staying focused, thinking before acting, and managing information. Childhood is a critical period for EF development, but there is a lack of standardized tools that combine EF tasks with physical activity in a gamified approach. Objectives: This scoping review maps EF tasks for children, identifies common strategies, and explores methods for measuring outcomes, providing a foundation for a research-oriented platform to assess EF development. Design: A systematic search was conducted in SCOPUS, ScienceDirect, and ERIC databases with the query "executive function task" AND (children OR child OR childhood). Inclusion criteria were studies published between 2019 and 2024 in English, with participants aged 5 to 9 years. Data extracted included task details, scoring mechanisms, and stop conditions. Studies lacking clear methodological descriptions were excluded. Results: A total of 2044 articles were identified, with 113 duplicates removed. After selection, 23 studies met the inclusion criteria. The identified tasks are listed in Table 2. Key tasks, strategies, and measurement methodologies were highlighted. Conclusions: Integrating EF tasks into a structured platform offers a promising approach to standardize assessments, fill research gaps, and provide a reliable tool for studying EF development in children. Keywords: Executive Functions, Inhibition, Working Memory, Cognitive Flexibility, Task Design, Standardization

Paperid: 2709, https://arxiv.org/pdf/2504.13863.pdf

Abstract:
Background: Telemedicine has the potential to provide secure and cost-effective healthcare at the touch of a button. Nephrotic syndrome is a chronic childhood illness that involves frequent relapses and demands long, complex treatment. Hence, developing a remote doctor-patient interface will help ensure the provision of quality healthcare to patients. Methods: The Utsarjan mobile App was built with Flutter, which enables cross-platform development (Android, iOS, Windows) with speed, smoothness, and open-source benefits. The frontend uses Dart for user interaction, while the backend employs Node.js, Express, and NGINX for APIs, load balancing and high performance. MongoDB provides a flexible database, Bcrypt secures passwords, PM2 handles deployment, uptime and logs, and Firebase Cloud Messaging powers free push notifications. Results: Utsarjan (meaning "excretion") is a multi-functional smartphone application that provides nephrotic care and real-time assistance to all patients (especially those in rural regions and/or without access to specialists). It helps patients and doctors by ensuring timely visits, recording each clinical test/parameter and improving medication adherence. It gives a graphical visualization of relapses, medicine dosage, and different clinical and anthropometric parameters (urine protein, BP, height and weight). This is the first nephrotic care App that enables prompt access to doctor's advice. Conclusions: Utsarjan is a mobile App that provides kidney care and real-time assistance to children with nephrotic syndrome. It gives a graphical overview of changes in a patient's health over the long course of treatment. This will assist doctors in appropriately modifying the treatment regimen. Consequently, it will (hopefully) lead to the prevention of relapses and/or complications.

Paperid: 2710, https://arxiv.org/pdf/2504.13859.pdf

Abstract:
AI, especially Large Language Models (LLMs) like ChatGPT, has rapidly developed and gained widespread adoption in the past five years, shifting user preference away from traditional search engines. However, the generative nature of LLMs raises concerns about presenting misinformation as fact. To address this, we developed a web-based application that helps K-12 students enhance critical thinking by identifying misleading information in LLM responses about major historical figures. In this paper, we describe the implementation and design details of the DoYouTrustAI tool, which can be used to provide an interactive lesson that teaches students about the dangers of misinformation and how believable generative AI can make it seem. The DoYouTrustAI tool utilizes prompt engineering to present the user with AI-generated summaries about the life of a historical figure. These summaries can be either accurate accounts of that person's life or intentionally misleading alterations of their history. The user is tasked with determining the validity of the statement without external resources. Our research questions for this work were: (RQ1) How can we design a tool that teaches students about the dangers of misleading information and how misinformation can present itself in LLM responses? (RQ2) Can we present prompt engineering as a topic that is easily understandable for students? Our findings highlight the need to correct misleading information before users retain it. Our tool lets users select familiar individuals for testing to reduce random guessing and presents misinformation alongside known facts to maintain believability. It also provides pre-configured prompt instructions to show how different prompts affect AI responses. Together, these features create a controlled environment where users learn the importance of verifying AI responses and understanding prompt engineering.

Paperid: 2711, https://arxiv.org/pdf/2504.13857.pdf

Abstract:
This study explores the influence of environmental colors on human behavior, specifically focusing on aggressiveness and passiveness. Color is widely regarded as an influential environmental factor shaping human behavior, yet existing studies present conflicting evidence regarding its impact on aggressiveness and passiveness. This study employed Minecraft as a controlled digital platform to investigate whether exposure to different colors influences both the frequency and nature of participant interactions (aggressive versus non-aggressive), and whether prolonged exposure amplifies these effects. Anonymous online participants were exposed to various colors before interacting with non-player characters simulating human-like encounters. Three key outcomes were measured: (1) total interactions per color, (2) ratios of aggressive to non-aggressive interactions per color, and (3) the effect of varying exposure durations on aggressiveness. While no significant overall differences in interaction frequency were observed among the colors, post-hoc analyses revealed that Red and Black elicited significantly more interactions compared to Green. Additionally, Red, Yellow, and Black were associated with higher ratios of aggressive behavior relative to Green or White. Prolonged exposure to Red also appeared to intensify aggressive responses. These findings underscore the potential role of environmental color in shaping online social behaviors and highlight the importance of environmental settings in areas ranging from online communication platforms to digital marketing strategies.

Paperid: 2712, https://arxiv.org/pdf/2504.13855.pdf

Abstract:
This study documents a three-week workshop with architecture students, where we designed and 3D printed various minimal surfaces using wood-based filaments, and used them as molds in which to grow mycelium. We detail the design process and the growth of the mycelium in different shapes, together with participants' experiences of working with a living material. After presenting the results of the work in a public-facing exhibition, we conducted interviews with members of the general public about their perceptions of interacting with a material such as mycelium in design. Our findings show that 3D-printed minimal surfaces with wood-based filaments can function as structural cores for mycelium-based composites and that mycelium binds to the filament. Participants in the workshop exhibited stronger feelings for living materials compared to non-living ones, displaying both biophilia and, to a lesser extent, biophobia when interacting with the mycelium. Members of the general public discussed pragmatic aspects including mold, fragility, and production costs, and speculated on the future of bio-technology and its impact on everyday life. While all were positive about the impact of bio-technologies on the future, they had diverging opinions on how much ethical considerations should influence research directions.

Paperid: 2713, https://arxiv.org/pdf/2504.13852.pdf

Abstract:
This research explores whether the rapid digital transformation due to COVID-19 managed to close or exacerbate the digital divide concerning users' digital skills. We conducted a pre-registered survey with N = 1143 German Internet users. Our findings suggest the latter: younger, male, and higher educated users were more likely to improve their digital skills than older, female, and less educated ones. According to their accounts, the pandemic helped Internet users improve their skills in communicating with others by using video conference software and reflecting critically upon information they found online. These improved digital skills exacerbated not only positive (e.g., feeling informed and safe) but also negative (e.g., feeling lonely) effects of digital media use during the pandemic. We discuss this research's theoretical and practical implications regarding the impact of challenges, such as technological disruption and health crises, on humans' digital skills, capabilities, and future potential, focusing on the second-level digital divide.

Paperid: 2714, https://arxiv.org/pdf/2504.13848.pdf

Abstract:
Generative AI (GenAI) chatbots are becoming increasingly integrated into virtual assistant technologies, yet their success hinges on the ability to gather meaningful user feedback to improve interaction quality, system outcomes, and overall user acceptance. Successful chatbot interactions can enable organizations to build long-term relationships with their customers and users, supporting customer loyalty and furthering the organization's goals. This study explores the impact of two distinct narratives and feedback collection mechanisms on user engagement and feedback behavior: a standard AI-focused interaction versus a hybrid intelligence (HI) framed interaction. Initial findings indicate that while small-scale survey measures showed no significant differences in user willingness to leave feedback, use the system, or trust the system, participants exposed to the HI narrative provided significantly more detailed feedback. These initial findings offer insights into designing effective feedback systems for GenAI virtual assistants, balancing user effort with system improvement potential.

Paperid: 2715, https://arxiv.org/pdf/2504.13844.pdf

Abstract:
Human-computer interactions based on gaze tracking have spread during the last few years. Video games, applications in health, trading, market research, and many other fields have started to use this new technology that seems invisible to the user. However, the dominant form of interaction using gaze tracking uses dwell-time for command activation, which introduces strong constraints in the interaction: dwell-time activation requires users to look steadily at an element for a predefined amount of time in order to select it. While dwell-time alleviates part of the Midas touch problem (an element fixated by the user is activated even when the user did not intend to activate it), it doesn't completely remove it: users should not gaze too long at an item, or they may trigger an unintended activation. In addition, dwell-time slows down users' interaction by requiring a pause each time an activation is needed. In this project, we study an alternative selection method based on crossing interactions, a well-studied method used in conventional HCI. This interaction allows users' gaze to rest in areas that don't have crossing triggers, and it removes the need to pause in the interaction. We found that crossing interaction performed similarly to dwell-time interaction with novice users. The performance was even better for users with previous experience of gaze interaction.
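The difference between the two activation rules can be made concrete with a small Python sketch. It is a simplified illustration, not the study's implementation: gaze is assumed to arrive as timestamped `(t, x, y)` samples, the target is an axis-aligned rectangle, the crossing trigger is a vertical line segment crossed from left to right, and all names are hypothetical.

```python
def dwell_select(samples, target, dwell_s):
    """Return the time at which gaze has stayed inside `target` for `dwell_s`
    seconds, or None. `target` is (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = target
    entered = None  # time the current uninterrupted fixation began
    for t, x, y in samples:
        inside = xmin <= x <= xmax and ymin <= y <= ymax
        if not inside:
            entered = None  # leaving the target resets the dwell timer
            continue
        if entered is None:
            entered = t
        if t - entered >= dwell_s:
            return t
    return None


def crossing_select(samples, line_x, y_min, y_max):
    """Return the time of the first left-to-right crossing of the vertical
    segment x = line_x, y_min <= y <= y_max, or None."""
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        if x0 < line_x <= x1:
            # Linearly interpolate where the gaze path meets the line.
            frac = (line_x - x0) / (x1 - x0)
            y_cross = y0 + frac * (y1 - y0)
            if y_min <= y_cross <= y_max:
                return t1
    return None


# Made-up gaze trace: the eye sweeps right across x = 5, then rests.
trace = [(0.0, 0, 5), (0.1, 4, 5), (0.2, 6, 5), (0.3, 6, 5), (0.8, 6, 5)]
t_cross = crossing_select(trace, line_x=5, y_min=0, y_max=10)
t_dwell = dwell_select(trace, target=(5.5, 0, 10, 10), dwell_s=0.5)
```

In this toy trace the crossing rule fires as soon as the gaze passes the trigger line, while the dwell rule has to wait out the threshold; resting the gaze anywhere without a trigger line never activates anything under the crossing rule.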

Paperid: 2716, https://arxiv.org/pdf/2504.13841.pdf

Abstract:
The Skill Trade is a site for skill swapping, learning, and career growth. It links people who have matching skills, supports virtual work through Google Meet/Zoom, and lets startups hire talent easily. Users can make profiles, connect with others, share skills, and respond to job ads from startups. Startup users can post jobs and view profiles to hire candidates. Learn-only users get categorized learning materials, while developers oversee platform management and upload resources. It is free for individual users, supported by donations, and charges startups a small fee only when they successfully hire. Built with Tailwind CSS, it offers an intuitive, responsive design that fosters collaboration and career opportunities.

Paperid: 2717, https://arxiv.org/pdf/2504.13793.pdf

Abstract:
ChatNekoHacker is a real-time conversational agent system that strengthens fan engagement for musicians. It integrates Amazon Bedrock Agents for autonomous dialogue, Unity for immersive 3D livestream sets, and VOICEVOX for high-quality Japanese text-to-speech, enabling two virtual personas to represent the music duo Neko Hacker. In a one-hour YouTube Live with 30 participants, we evaluated the impact of the system. Regression analysis showed that agent interaction significantly elevated fan interest, with perceived fun as the dominant predictor. The participants also expressed a stronger intention to listen to the duo's music and attend future concerts. These findings highlight entertaining, interactive broadcasts as pivotal to cultivating fandom. Our work offers actionable insights for the deployment of conversational agents in entertainment while pointing to next steps: broader response diversity, lower latency, and tighter fact-checking to curb potential misinformation.

Paperid: 2718, https://arxiv.org/pdf/2504.13557.pdf

Abstract:
This study explores the integration of Large Language Models (LLMs) into the grading and appeal resolution process in computer science education. We introduce AI-PAT, an AI-powered assessment tool that leverages LLMs to evaluate computer science exams, generate feedback, and address student appeals. AI-PAT was used to assess over 850 exam submissions and handle 185 appeal cases. Our multi-model comparison (ChatGPT, Gemini) reveals strong correlations between model outputs, though significant variability persists depending on configuration and prompt design. Human graders, while internally consistent, showed notable inter-rater disagreement, further highlighting subjectivity in manual evaluation. The appeal process led to grade changes in 74% of cases, indicating the need for continued refinement of AI evaluation strategies. While students appreciated the speed and detail of AI feedback, survey responses revealed trust and fairness concerns. We conclude that AI-PAT offers scalable benefits for formative assessment and feedback, but must be accompanied by transparent grading rubrics, human oversight, and appeal mechanisms to ensure equitable outcomes.

Paperid: 2719, https://arxiv.org/pdf/2504.13404.pdf

Abstract:
Purpose: This study examines the design and functionality of university library login pages across academic alliances (IVY Plus, BTAA, JULAC, JVU) to identify how these interfaces align with institutional priorities and user needs. It explores consensus features, design variations, and emerging trends in authentication, usability, and security. Methodology: A multi-method approach was employed: screenshots and HTML files from 46 institutions were analyzed through categorization, statistical analysis, and comparative evaluation. Features were grouped into authentication mechanisms, usability, security/compliance, and library-specific elements. Findings: Core functionalities (e.g., ID/password, privacy policies) were consistent across alliances. Divergences emerged in feature emphasis: mature alliances (e.g., BTAA) prioritized resource accessibility with streamlined interfaces, while emerging consortia (e.g., JVU) emphasized cybersecurity (IP restrictions, third-party integrations). Usability features, particularly multilingual support, drove cross-alliance differences. The results highlighted regional and institutional influences, with older alliances favoring simplicity and newer ones adopting security-centric designs. Originality/Value: This is the first systematic comparison of login page designs across academic alliances, offering insights into how regional, technological, and institutional factors shape digital resource access. Findings inform best practices for balancing security, usability, and accessibility in library interfaces. Keywords: Academic library consortia, Login page design, User authentication, User experience, Security compliance.

Paperid: 2720, https://arxiv.org/pdf/2504.13330.pdf

Abstract:
The way we work is no longer hybrid -- it is blended with AI co-workers, automated decisions, and virtual presence reshaping human roles, agency, and expertise. We now work through AI, with our outputs shaped by invisible algorithms. AI's infiltration into knowledge, creative, and service work is not just about automation, but concerns redistribution of agency, creativity, and control. How do we deal with physical and distributed AI-mediated workspaces? What happens when algorithms co-author reports, and draft our creative work? In this provocation, we argue that hybrid work is obsolete. Blended work is the future, not just in physical and virtual spaces but in how human effort and AI output become inseparable. We argue this shift demands urgent attention to AI-mediated work practices, work-life boundaries, physical-digital interactions, and AI transparency and accountability. The question is not whether we accept it, but whether we actively shape it before it shapes us.

Paperid: 2721, https://arxiv.org/pdf/2504.13119.pdf

Abstract:
Most adaptive AR storytelling systems define environmental semantics using simple object labels and spatial coordinates, limiting narratives to rigid, pre-defined logic. This oversimplification overlooks the contextual significance of object relationships: for example, a wedding ring on a nightstand might suggest marital conflict, yet is treated as just "two objects" in space. To address this, we explored integrating Vision Language Models (VLMs) into AR pipelines. However, several challenges emerged: First, stories generated with simple prompt guidance lacked narrative depth and made little use of space. Second, spatial semantics were underutilized, failing to support meaningful storytelling. Third, pre-generated scripts struggled to align with AR Foundation's object naming and coordinate systems. We propose a scene-driven AR storytelling framework that reimagines environments as active narrative agents, built on three innovations: 1. State-aware object semantics: We decompose object meaning into physical, functional, and metaphorical layers, allowing VLMs to distinguish subtle narrative cues between similar objects. 2. Structured narrative interface: A bidirectional JSON layer maps VLM-generated metaphors to AR anchors, maintaining spatial and semantic coherence. 3. STAM evaluation framework: A three-part experimental design evaluates narrative quality, highlighting both strengths and limitations of VLM-AR integration. Our findings show that the system can generate stories from the environment itself, not just place them on top of it. In user studies, 70% of participants reported seeing real-world objects differently when narratives were grounded in environmental symbolism. By merging VLMs' generative creativity with AR's spatial precision, this framework introduces a novel object-driven storytelling paradigm, transforming passive spaces into active narrative landscapes.
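One way to picture the three-layer object semantics and the JSON interface is the Python sketch below. The schema is entirely hypothetical: the field names (`anchor_id`, `physical`, `functional`, `metaphorical`) and the round-trip helpers are assumptions made for illustration and are not taken from the paper or from AR Foundation.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ObjectSemantics:
    """Hypothetical record tying one AR anchor to three layers of meaning."""
    anchor_id: str       # assumed identifier of the AR anchor
    label: str           # plain object label from scene understanding
    position: tuple      # (x, y, z) in the AR session's coordinate space
    physical: str        # observable state of the object
    functional: str      # what the object is normally used for
    metaphorical: str    # narrative reading proposed by the VLM


def to_json(objects):
    """Serialize records for the model-facing direction of the interface."""
    return json.dumps([asdict(obj) for obj in objects])


def from_json(payload):
    """Parse records back for the AR-facing direction of the interface."""
    records = json.loads(payload)
    return [ObjectSemantics(**{**rec, "position": tuple(rec["position"])})
            for rec in records]


# The wedding-ring example, with made-up coordinates and wording.
ring = ObjectSemantics(
    anchor_id="anchor-001",
    label="wedding ring",
    position=(0.4, 0.6, 1.2),
    physical="removed, lying on a nightstand",
    functional="worn to signal marriage",
    metaphorical="possible marital conflict",
)
restored = from_json(to_json([ring]))
```

Keeping the anchor identifier and coordinates in the same record as the metaphorical layer is what lets a generated story beat be placed back on the correct object in the scene.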

Paperid: 2722, https://arxiv.org/pdf/2504.12943.pdf

Abstract:
Personalized support is essential to fulfill individuals' emotional needs and sustain their mental well-being. Large language models (LLMs), with great customization flexibility, hold promise for enabling individuals to create their own emotional support agents. In this work, we developed ChatLab, where users could construct LLM-powered chatbots with additional interaction features including voices and avatars. Using a Research through Design approach, we conducted a week-long field study followed by interviews and design activities (N = 22), which uncovered how participants created diverse chatbot personas for emotional reliance, confronting stressors, connecting to intellectual discourse, reflecting mirrored selves, etc. We found that participants actively enriched the personas they constructed, shaping the dynamics between themselves and the chatbot to foster open and honest conversations. They also suggested other customizable features, such as integrating online activities and adjustable memory settings. Based on these findings, we discuss opportunities for enhancing personalized emotional support through emerging AI technologies.

Paperid: 2723, https://arxiv.org/pdf/2504.12865.pdf

Abstract:
Industrial dashboards, commonly deployed by organizations such as enterprises and governments, are increasingly crucial in data communication and decision-making support across various domains. Designing an industrial dashboard prototype is particularly challenging due to its visual complexity, which can include data visualization, layout configuration, embellishments, and animations. Additionally, in real-world industrial settings, designers often encounter numerous constraints. For instance, when companies negotiate collaborations with clients and determine design plans, they typically need to demo design prototypes and quickly iterate on them using mock data. Such a task is very common and crucial during the ideation stage, as it not only helps save development costs but also avoids data-related issues such as lengthy data handover periods. However, existing dashboard authoring tools are mostly not tailored to such prototyping needs. Motivated by these gaps, we propose DashChat, an interactive system that leverages large language models (LLMs) to generate industrial dashboard design prototypes from natural language. We collaborated closely with designers from the industry and derived the requirements based on their practical experience. First, by analyzing 114 high-quality industrial dashboards, we summarized their common design patterns and injected the identified ones into LLMs as reference. Next, we built a multi-agent pipeline powered by LLMs to understand textual requirements from users and generate practical, aesthetic prototypes. In addition, functionally distinct agents that operate in parallel enable efficient generation. Then, we developed a user-friendly interface that supports text-based interaction for generating and modifying prototypes. Two user studies demonstrated that our system is both effective and efficient in supporting design prototyping.

Paperid: 2724, https://arxiv.org/pdf/2504.12433.pdf

Abstract:
From school admissions to hiring and investment decisions, the first step behind many high-stakes decision-making processes is "deciding how to decide." Formulating effective criteria to guide decision-making requires an iterative process of exploration, reflection, and discovery. Yet, this process remains under-supported in practice. In this short paper, we outline an opportunity space for AI-driven tools that augment human meta-decision making. We draw upon prior literature to propose a set of design goals for future AI tools aimed at supporting human meta-decision making. We then illustrate these ideas through InDecision, a mixed-initiative tool designed to support the iterative development of decision criteria. Based on initial findings from designing and piloting InDecision with users, we discuss future directions for AI-augmented meta-decision making.

Paperid: 2725, https://arxiv.org/pdf/2504.12333.pdf

Abstract:
The evaluation of open-ended responses in serious games presents a unique challenge, as correctness is often subjective. Large Language Models (LLMs) are increasingly being explored as evaluators in such contexts, yet their accuracy and consistency remain uncertain, particularly for smaller models intended for local execution. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. By leveraging traditional binary classification metrics (including accuracy, true positive rate, and true negative rate), we systematically compare these models across different evaluation scenarios. Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance. We demonstrate that while some models excel at identifying correct responses, others struggle with false positives or inconsistent evaluations. The findings highlight the need for context-aware evaluation frameworks and careful model selection when deploying LLMs as evaluators. This work contributes to the broader discourse on the trustworthiness of AI-driven assessment tools, offering insights into how different LLM architectures handle subjective evaluation tasks.
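The three metrics named above follow directly from the confusion-matrix counts. The Python sketch below assumes that each player response has a reference label (1 = correct, 0 = incorrect) and a verdict from the LLM evaluator in the same encoding; the function name and the sample labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, true positive rate (sensitivity) and true negative rate
    (specificity) for 0/1 labels. A rate is None when its denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "tpr": tp / (tp + fn) if (tp + fn) else None,
        "tnr": tn / (tn + fp) if (tn + fp) else None,
    }


# Hypothetical reference labels and evaluator verdicts for five responses.
reference = [1, 1, 1, 0, 0]
verdicts = [1, 1, 0, 0, 1]
scores = binary_metrics(reference, verdicts)
```

A model with a high true positive rate but a low true negative rate accepts most correct answers while also letting incorrect ones through, which is the false-positive pattern the abstract describes.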

Paperid: 2726, https://arxiv.org/pdf/2504.11987.pdf

Abstract:
Extended Reality (XR) is a rapidly growing field with a wide range of hardware, from head-mounted displays to installations. Users can access the entire Mixed Reality (MR) continuum. A goal of the human-computer interaction (HCI) community is to allow natural and intuitive interactions, but in general, interactions for XR often rely on handheld controllers. One natural interaction method is full body tracking (FBT), where users can use their body to interact with the experience. Classically, FBT systems require markers or trackers on the users to capture motion. Recently, there have been approaches based on Human Pose Estimation (HPE), which highlight the potential of low-cost, non-intrusive FBT for XR. Because it does not require handheld devices, HPE may also improve accessibility for people who struggle with traditional input methods. This paper proposes the concept of non-intrusive FBT for XR for all. The goal is to spark a discussion on the advantages that a non-intrusive FBT system offers users in terms of accessibility and user experience.

Paperid: 2727, https://arxiv.org/pdf/2504.11974.pdf

Abstract:
AI-powered predictive systems have high margins of error. However, data visualisations of algorithmic systems in education and other social fields tend to visualise certainty, thus invisibilising the underlying approximations and uncertainties of the algorithmic systems and the social settings in which these systems operate. This paper draws on a critical speculative approach to first analyse data visualisations from predictive analytics platforms for education. It demonstrates that visualisations of uncertainty in education are rare. Second, the paper explores uncertainty visualisations in other fields (defence, climate change and healthcare). The paper concludes by reflecting on the role of data visualisations and un/certainty in shaping educational futures. It also identifies practical implications for the design of data visualisations in education.

Paperid: 2728, https://arxiv.org/pdf/2504.11380.pdf

Abstract:
Public speaking anxiety affects many individuals, yet opportunities for real-world practice remain limited. This study explores how augmented reality (AR) can provide an accessible training environment for public speaking. Drawing from literature on public speaking, VR-based training, self-efficacy, and behavioral feedback mechanisms, we designed SpeakAR, an AR-based tool that simulates audience interaction through virtual models. SpeakAR was evaluated with five participants of varying anxiety levels, each completing six speaking tasks. Results indicate that AR exposure can enhance confidence, with participants finding the system useful for practice. Feedback highlighted the importance of dynamic facial expressions and idle animations in virtual models to improve realism and engagement. Our findings contribute to the design of AR-based training tools for public speaking, offering insights into how immersive environments can support skill development and anxiety reduction.

Paperid: 2729, https://arxiv.org/pdf/2504.10973.pdf

Abstract:
Early dementia diagnosis requires biomarkers sensitive to both structural and functional brain changes. While structural neuroimaging biomarkers have progressed significantly, objective functional biomarkers of early cognitive decline remain a critical unmet need. Current cognitive assessments often rely on behavioral responses, making them susceptible to factors like effort, practice effects, and educational background, thereby hindering early and accurate detection. This work introduces a novel approach, leveraging a lightweight convolutional neural network (CNN) to infer cognitive impairment levels directly from electroencephalography (EEG) data. Critically, this method employs a passive fast periodic visual stimulation (FPVS) paradigm, eliminating the need for explicit behavioral responses or task comprehension from the participant. This passive approach provides an objective measure of working memory function, independent of confounding factors inherent in active cognitive tasks, and offers a promising new avenue for early and unbiased detection of cognitive decline.

Paperid: 2730, https://arxiv.org/pdf/2504.10971.pdf

Abstract:
Business Process Visualisations (BPVs) have become indispensable tools for organisations seeking to enhance their operational efficiency, decision-making capabilities, and overall performance. The burgeoning interest in process modeling and tool development, coupled with the rise of the data visualisation field, underscores the significant role of visual tools in leveraging human cognition. Unlike traditional models, data visualisation approaches graphics from a novel angle, emphasising the potency of visual representations. This review aims to integrate the domains of BPV and data visualisation to comprehensively assess their combined influence on organisational effectiveness. Through a meticulous analysis of existing literature, this study aims to amalgamate insights on BPVs' impact from a data visualisation standpoint, advocating for a design philosophy that prioritises user engagement to bolster organisational outcomes. Additionally, our systematic review has unveiled promising avenues for future research, identifying underexplored variables that influence the efficacy of BPVs, thereby charting a path for forthcoming scholarly inquiries.

Paperid: 2731, https://arxiv.org/pdf/2504.10848.pdf

Abstract:
As the Internet develops, social networking and other communication tools have transformed people's relationships into something fast, visible, and geographically vast. However, these communication tools have not expanded opportunities for getting to know neighbors outside one's social network; rather, they have comparatively diminished occasions for interacting with unfamiliar neighbors by prioritizing communication with existing friends. Therefore, we invented the medium Ichiyo to increase the opportunities to think of neighbors walking along the same street or in the same neighborhood and to expand the imagination of those who pass by and those who used to be there. Thus, users can engage in indirect interaction. We used commercially available laser cutters to engrave QR codes on leaves that are naturally found in our living space, so as to avoid intruding on the environment. The QR codes lead to a communal space on the web where users can freely leave messages. Engraving QR codes allows the information presented on a leaf to be extended virtually. To gather feedback on Ichiyo, we let several thousand people experience this new way of communication as part of "iii Exhibition 2022", an art exhibition at the University of Tokyo. More than 1,000 leaves engraved with QR codes were prepared and scattered at the exhibition site and along the road from the nearest station to the venue.

Paperid: 2732, https://arxiv.org/pdf/2504.10650.pdf

Abstract:
The growing prevalence of conversational voice interfaces, powered by developments in both speech and language technologies, raises important questions about their influence on human communication. While written communication can signal identity through lexical and stylistic choices, voice-based interactions inherently amplify socioindexical elements - such as accent, intonation, and speech style - which more prominently convey social identity and group affiliation. There is evidence that even passive media such as television is likely to influence the audience's linguistic patterns. Unlike passive media, conversational AI is interactive, creating a more immersive and reciprocal dynamic that holds a greater potential to impact how individuals speak in everyday interactions. Such heightened influence can be expected to arise from phenomena such as acoustic-prosodic entrainment and linguistic accommodation, which occur naturally during interaction and enable users to adapt their speech patterns in response to the system. While this phenomenon is still emerging, its potential societal impact could provide organisations, movements, and brands with a subtle yet powerful avenue for shaping and controlling public perception and social identity. We argue that the socioindexical influence of AI-generated speech warrants attention and should become a focus of interdisciplinary research, leveraging new and existing methodologies and technologies to better understand its implications.

Paperid: 2733, https://arxiv.org/pdf/2504.10497.pdf

Abstract:
The swift progress of Generative Artificial Intelligence (GenAI), notably Large Language Models (LLMs), is reshaping the digital landscape. Recognizing this transformative potential, the National Research Council of Canada (NRC) launched a pilot initiative to explore the integration of GenAI techniques into its daily operation for performance excellence, where 22 projects were launched in May 2024. Within these projects, this paper presents the development of the intelligent agent Pubbie as a case study, targeting the automation of performance measurement, data management and insight reporting at the NRC. Cutting-edge techniques are explored, including LLM orchestration and semantic embedding via RoBERTa, while strategic fine-tuning and few-shot learning approaches are incorporated to infuse domain knowledge at an affordable cost. The user-friendly interface of Pubbie allows general government users to input queries in natural language and easily upload or download files with a simple button click, greatly reducing manual efforts and accessibility barriers.

Paperid: 2734, https://arxiv.org/pdf/2504.10405.pdf

Abstract:
The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.

Paperid: 2735, https://arxiv.org/pdf/2504.10404.pdf

Abstract:
This study investigates how cinematographic techniques influence viewer perception and contribute to the objectification of women, utilizing eye-tracking data from 91 participants. They watched a sexualized music video (SV) known for objectifying portrayals and a non-sexualized music video (TV). Using dynamic Areas of Interest (AOIs) (head, torso, and lower body), gaze metrics such as fixation duration, visit count, and scan paths were recorded to assess visual attention patterns. Participants were grouped according to their average fixations on sexualized AOIs. Statistical analyses revealed significant differences in gaze behavior between the videos and among the groups, with increased attention to sexualized AOIs in SV. Additionally, data-driven group differences in fixations identified specific segments with heightened objectification that are further analyzed using scan path visualization techniques. These findings provide strong empirical evidence of camera-driven gaze objectification, demonstrating how cinematic framing implicitly shapes objectifying gaze patterns, highlighting the critical need for mindful media representation.

Paperid: 2736, https://arxiv.org/pdf/2504.10359.pdf

Abstract:
Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.

Paperid: 2737, https://arxiv.org/pdf/2504.10271.pdf

Abstract:
Extended Reality (XR) technologies enable the personalized mediation of an individual's perceivable reality across modalities, thereby creating a Personalized Reality (PR). While this may lead to individually beneficial effects in the form of more efficient, more fun, and safer experiences, it may also lead to perceptual filter bubbles since individuals are exposed predominantly or exclusively to content that is congruent with their existing beliefs and opinions. This undermining of a shared basis for interaction and discussion through constrained perceptual worldviews may impact society through increased polarization and other well-documented negative effects of filter bubbles. In this paper, we argue that this issue can be mitigated by increasing individuals' awareness of their current perspective and providing avenues for development, including through support for engineered serendipity and fostering of self-actualization that already show promise for traditional recommender systems. We discuss how these methods may be transferred to XR to yield valuable tools to give people transparency and agency over their perceptual worldviews in a responsible manner.

Paperid: 2738, https://arxiv.org/pdf/2504.10082.pdf

Abstract:
This paper presents Cooking Code, a VR-based serious game designed to introduce programming concepts to students (ages 12-16) through an immersive, scenario-driven experience. Set in a futuristic world where humans and machines coexist, players take on the role of a fast-food chef who must assemble food orders based on pseudocode instructions. By interpreting and executing these instructions correctly, players develop problem-solving skills, computational thinking, and a foundational understanding of programming logic. The game leverages the kitchen metaphor to teach computational thinking, using affordances for an immersive VR experience.

Paperid: 2739, https://arxiv.org/pdf/2504.10010.pdf

Abstract:
Understanding thermal regulation and subjective perception of temperature is crucial for improving thermal comfort and human energy consumption in times of global warming. Previous work shows that an environment's color temperature affects the experienced temperature. As virtual reality (VR) enables visual immersion, recent work suggests that a VR scene's color temperature also affects experienced temperature. In addition, virtual avatars representing thermal cues influence users' thermal perception and even the body temperature. As immersive technology becomes increasingly prevalent in daily life, leveraging thermal cues to enhance thermal comfort - without relying on actual thermal energy - presents a promising opportunity. Understanding these effects is crucial for optimizing virtual experiences and promoting sustainable energy practices. Therefore, we propose three controlled experiments to learn more about thermal effects caused by virtual worlds and avatars.

Paperid: 2740, https://arxiv.org/pdf/2504.09980.pdf

Abstract:
This paper has two goals. First, we present the turn-taking annotation layers created for 95 minutes of conversational speech of the Graz Corpus of Read and Spontaneous Speech (GRASS), available to the scientific community. Second, we describe the annotation system and the annotation process in more detail, so other researchers may use it for their own conversational data. The annotation system was developed with an interdisciplinary application in mind. It should be based on sequential criteria according to Conversation Analysis, suitable for subsequent phonetic analysis, thus time-aligned annotations were made in Praat, and it should be suitable for automatic classification, which required the continuous annotation of speech and a label inventory that is not too large and results in a high inter-rater agreement. Turn-taking was annotated on two layers, Inter-Pausal Units (IPU) and points of potential completion (PCOMP; similar to transition relevance places). We provide a detailed description of the annotation process and of segmentation and labelling criteria. A detailed analysis of inter-rater agreement and common confusions shows that agreement for IPU annotation is near-perfect, that agreement for PCOMP annotations is substantial, and that disagreements often are either partial or can be explained by a different analysis of a sequence which also has merit. The annotation system can be applied to a variety of conversational data for linguistic studies and technological applications, and we hope that the annotations, as well as the annotation system, will contribute to a stronger cross-fertilization between these disciplines.

Paperid: 2741, https://arxiv.org/pdf/2504.09835.pdf

Abstract:
Among various methods to learn a second language (L2), such as listening and shadowing, Extensive Viewing involves learning L2 by watching many videos. However, it is difficult for many L2 learners to smoothly and effortlessly comprehend video contents made for native speakers at the original speed. Therefore, we developed a language learning assistance system that automatically adjusts the playback speed according to the learner's comprehension. Our system judges that learners understand the contents if they laugh at the punchlines of comedy dramas, and vice versa. Experimental results show that this system supports learners with relatively low L2 ability (under 700 in TOEIC Score in the experimental condition) to understand video contents. Our system can widen learners' possible options of native speakers' videos as Extensive Viewing material.
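
The abstract does not state the rule used to adjust playback speed, so the following is only a minimal sketch under assumed parameters: playback is slowed by a fixed step when the learner does not laugh at a punchline and moved back toward the original speed when they do. The function name `next_speed`, the step size, and the speed bounds are all hypothetical.

```python
def next_speed(current: float, laughed: bool, step: float = 0.1,
               lowest: float = 0.5, original: float = 1.0) -> float:
    """Return the playback speed to use after one punchline.

    Hypothetical rule (not taken from the paper): a laugh is read as
    comprehension, so the speed moves up toward the original speed;
    no laugh is read as non-comprehension, so the speed moves down.
    """
    proposed = current + step if laughed else current - step
    # Clamp to the assumed range [lowest, original].
    return round(min(original, max(lowest, proposed)), 2)


if __name__ == "__main__":
    speed = 1.0
    for laughed in [False, False, True]:
        speed = next_speed(speed, laughed)
        print(speed)
```

Clamping at the original speed reflects the stated goal of helping learners follow native-speed video; it never plays faster than the source.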

Paperid: 2742, https://arxiv.org/pdf/2504.09827.pdf

Abstract:
Online Design Communities (ODCs) offer various artworks with members' comments for beginners to learn visual design. However, as identified by our Formative Study (N = 10), current ODCs lack features customized for personal learning purposes, e.g., searching artworks and digesting useful comments to learn design principles about buttons. In this paper, we present DesignLearner, a redesigned interface of ODCs to facilitate personalized visual design learning with comments structured based on UI components (e.g., button, text) and visual elements (e.g., color, contrast). In DesignLearner, learners can specify the UI components and visual elements that they wish to learn to filter artworks and associated comments. They can interactively read comments on an artwork, take notes, and get suggestions for the next artworks to explore. Our between-subjects study (N = 24) indicates that compared to a traditional ODC interface, DesignLearner can improve the user learning outcome and is deemed significantly more useful. We conclude with design considerations for customizing the interface of online communities to satisfy users' learning needs.

Paperid: 2743, https://arxiv.org/pdf/2504.09435.pdf

Abstract:
AI offers key advantages such as instant generation, multi-modal support, and personalized adaptability - potential that can address the highly heterogeneous communication barriers faced by people with aphasia (PWAs). We designed AI-enhanced communication tools and used them as design probes to explore how AI's real-time processing and generation capabilities - across text, image, and audio - can align with PWAs' needs in real-time communication and preparation for future conversations respectively. Through a two-phase "Research through Design" approach, eleven PWAs contributed design insights and evaluated four AI-enhanced prototypes. These prototypes aimed to improve communication grounding and conversational agency through visual verification, grammar construction support, error correction, and reduced language processing load. Despite some challenges, such as occasional mismatches with user intent, findings demonstrate how AI's specific capabilities can be advantageous in addressing PWAs' complex needs. Our work contributes design insights for future Augmentative and Alternative Communication (AAC) systems.

Paperid: 2744, https://arxiv.org/pdf/2504.09296.pdf

Abstract:
Engaging with AI assistants to gather essential information in a timely manner is becoming increasingly common. Traditional activation methods, like wake words such as Hey Siri, Ok Google, and Hey Alexa, are constrained by technical challenges such as false activations, recognition errors, and discomfort in public settings. Similarly, activating AI systems via physical buttons imposes strict interactive limitations as it demands particular physical actions, which hinders fluid and spontaneous communication with AI. Our approach employs eye-tracking technology within AR glasses to discern a user's intention to engage with the AI assistant. By sustaining eye contact on a virtual AI avatar for a specific time, users can initiate an interaction silently and without using their hands. Preliminary user feedback suggests that this technique is relatively intuitive, natural, and less obtrusive, highlighting its potential for integrating AI assistants fluidly into everyday interactions.
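
The dwell-based activation described above can be sketched as a small state machine over gaze samples. This is an illustration only: the sample format, the function name `dwell_activation_time`, and the 1.0 s threshold are assumptions, since the abstract says only that gaze is sustained "for a specific time".

```python
from typing import Iterable, Optional, Tuple


def dwell_activation_time(samples: Iterable[Tuple[float, bool]],
                          dwell_s: float = 1.0) -> Optional[float]:
    """Return the timestamp at which continuous gaze on the avatar
    first lasts at least dwell_s seconds, or None if it never does.

    samples: (timestamp_in_seconds, gaze_is_on_avatar) pairs in time order.
    """
    start = None  # timestamp where the current on-avatar run began
    for t, on_avatar in samples:
        if on_avatar:
            if start is None:
                start = t
            if t - start >= dwell_s:
                return t
        else:
            start = None  # looking away resets the dwell timer
    return None


if __name__ == "__main__":
    gaze = [(0.0, True), (0.25, True), (0.5, False),
            (0.75, True), (1.25, True), (1.75, True), (2.0, True)]
    print(dwell_activation_time(gaze))
```

Resetting the timer whenever gaze leaves the avatar is one simple way to reduce accidental activations from brief glances; a real system would likely also tolerate short tracking dropouts.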

Paperid: 2745, https://arxiv.org/pdf/2504.09227.pdf

Abstract:
People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation, those exploring pre-travel assistance typically provide only landmarks and turn-by-turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people. In this work, we introduce SceneScout, a multimodal large language model (MLLM)-driven AI agent that enables accessible interactions with street view imagery. SceneScout supports two modes: (1) Route Preview, enabling users to familiarize themselves with visual details along a route, and (2) Virtual Exploration, enabling free movement within street view imagery. Our user study (N=10) demonstrates that SceneScout helps BLV users uncover visual information otherwise unavailable through existing means. A technical evaluation shows that most descriptions are accurate (72%) and describe stable visual elements (95%) even in older imagery, though occasional subtle and plausible errors make them difficult to verify without sight. We discuss future opportunities and challenges of using street view imagery to enhance navigation experiences.

Paperid: 2746, https://arxiv.org/pdf/2504.09099.pdf

Abstract:
Since the turn of the century, the speed, availability, and plethora of online informational content have made it increasingly difficult for humans to keep an overview of real-world situations, build a personal opinion, and sometimes even decide on the truth. Thereby, personal opinion-making and public discourse became harder - two essential building blocks that keep a democratic society alive. HCI thus needs to rethink news, information, and social media systems to mitigate such negative effects. Instead of polarising through emotional and extremely framed messages, informational content online should make people think about other opinions and discuss constructively. Currently, however, through polarization and filter bubble effects, people lose openness and tolerance for the existence of opposing opinions. In this workshop, we will discuss how we can redesign our information technology for a better societal impact. We will present key takeaways from the social sciences and discuss how we can implement them using recent HCI findings and digital technologies.

Paperid: 2747, https://arxiv.org/pdf/2504.09089.pdf

Abstract:
Walking is among the most common human activities where the feet can gather rich tactile information from the ground. The dynamic contact between the feet and the ground generates vibration signals that can be sensed by the foot skin. While existing research focuses on foot pressure sensing and lower-limb interactions, methods of decoding tactile information from foot vibrations remain underexplored. Here, we propose a foot-equipped wearable system capable of recording wideband vibration signals during walking activities. By enabling location-based recording, our system generates maps of haptic data that encode information on ground materials, lower-limb activities, and road conditions. Its efficacy was demonstrated through studies involving 31 users walking over 18 different ground textures, achieving an overall identification accuracy exceeding 95% (cross-user accuracy of 87%). Our system allows pedestrians to map haptic information through their daily walking activities, which has potential applications in creating digitalized walking experiences and monitoring road conditions.

Paperid: 2748, https://arxiv.org/pdf/2504.08985.pdf

Abstract:
Low technology and eHealth literacy among older adults in retirement communities hinder engagement with digital tools. To address this, we designed an LLM-powered chatbot prototype using a human-centered approach for a local retirement community. Through interviews and persona development, we prioritized accessibility and dual functionality: simplifying internal information retrieval and improving technology and eHealth literacy. A pilot trial with residents demonstrated high satisfaction and ease of use, but also identified areas for further improvement. Based on the feedback, we refined the chatbot using GPT-3.5 Turbo and Streamlit. The chatbot employs tailored prompt engineering to deliver concise responses. Accessible features like adjustable font size, interface theme and personalized follow-up responses were implemented. Future steps include enabling a voice-to-text function and conducting longitudinal intervention studies. Together, our results highlight the potential of LLM-driven chatbots to empower older adults through accessible, personalized interactions, bridging literacy gaps in retirement communities.

Paperid: 2749, https://arxiv.org/pdf/2504.08979.pdf

Abstract:
Existing data visualization formalisms are restricted to single-table inputs, which makes existing visualization grammars like Vega-Lite or ggplot2 tedious to use, overly complex in their APIs, and unsound when visualizing multi-table data. This paper presents the first visualization formalism to support databases as input -- in other words, *database visualization*. A visualization specification is defined as a mapping from database constraints (e.g., schemas, types, foreign keys) to visual representations of those constraints, and we state that a visualization is *faithful* if it visually preserves the underlying database constraints. This formalism explains how visualization designs are the result of implicit data modeling decisions. We further develop a JavaScript library called dvl and use a series of case studies to show its expressiveness over specialized visualization systems and existing grammar-based languages.

Paperid: 2750, https://arxiv.org/pdf/2504.08824.pdf

Abstract:
Colorectal cancer (CRC) ranks as the second leading cause of cancer-related deaths and the third most prevalent malignant tumour worldwide. Early detection of CRC remains problematic due to its non-specific and often embarrassing symptoms, which patients frequently overlook or hesitate to report to clinicians. Crucially, the stage at which CRC is diagnosed significantly impacts survivability, with a survival rate of 80-95% for Stage I and a stark decline to 10% for Stage IV. Unfortunately, in the UK, only 14.4% of cases are diagnosed at the earliest stage (Stage I). In this study, we propose ColonScopeX, a machine learning framework utilizing explainable AI (XAI) methodologies to enhance the early detection of CRC and pre-cancerous lesions. Our approach employs a multimodal model that integrates signals from blood sample measurements, processed using the Savitzky-Golay algorithm for fingerprint smoothing, alongside comprehensive patient metadata, including medication history, comorbidities, age, weight, and BMI. By leveraging XAI techniques, we aim to render the model's decision-making process transparent and interpretable, thereby fostering greater trust and understanding in its predictions. The proposed framework could be utilised as a triage tool or as a screening tool for the general population. This research highlights the potential of combining diverse patient data sources and explainable machine learning to tackle critical challenges in medical diagnostics.
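
To illustrate the Savitzky-Golay smoothing step mentioned above, here is a minimal pure-Python sketch. The abstract does not give the window length or polynomial order, so this assumes a 5-point window with a quadratic fit, for which the standard convolution coefficients are (-3, 12, 17, 12, -3)/35. Leaving the two samples at each edge unchanged is a simplification chosen for this example, and the function name `savgol5` is hypothetical.

```python
from typing import List, Sequence

# Standard Savitzky-Golay smoothing coefficients for a 5-point window
# with a quadratic (or cubic) local polynomial fit.
_COEFFS = (-3.0, 12.0, 17.0, 12.0, -3.0)
_NORM = 35.0


def savgol5(signal: Sequence[float]) -> List[float]:
    """Smooth a 1-D signal with a 5-point quadratic Savitzky-Golay filter.

    Interior points are replaced by the value of the least-squares
    quadratic fitted to the 5 surrounding samples; the two samples at
    each edge are copied through unchanged (an assumption of this sketch).
    """
    out = list(signal)
    for i in range(2, len(signal) - 2):
        window = signal[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(_COEFFS, window)) / _NORM
    return out


if __name__ == "__main__":
    spike = [0.0, 0.0, 10.0, 0.0, 0.0]
    print(savgol5(spike))  # the isolated spike is attenuated
```

Unlike a plain moving average, this filter reproduces any locally quadratic trend exactly, which is why it is commonly used when peak shapes in a measured signal should be preserved.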

Paperid: 2751, https://arxiv.org/pdf/2504.08395.pdf

Abstract:
Mental models and expectations underlying human-human interaction (HHI) inform human-robot interaction (HRI) with domestic robots. To ease collaborative home tasks by improving domestic robot speech and behaviours for human-robot communication, we designed a study to understand how people communicate when failures occur. To identify patterns of natural communication, particularly in response to robotic failures, participants instructed Laundrobot to move laundry into baskets using natural language and gestures. Laundrobot either worked error-free, or in one of two error modes. Participants were not advised Laundrobot would be a human actor, nor given information about error modes. Video analysis from 42 participants found speech patterns, including laughter, verbal expressions, and filler words such as "oh" and "ok", as well as sequences of body movements, including touching one's own face, increased pointing with a static finger, and expressions of surprise. Common strategies deployed when errors occurred included correcting and teaching, taking responsibility, and displays of frustration. The strength of reaction to errors diminished with exposure, possibly indicating acceptance or resignation. Some used strategies similar to those used to communicate with other technologies, such as smart assistants. An anthropomorphic robot may not be ideally suited to this kind of task. Laundrobot's appearance, morphology, voice, capabilities, and recovery strategies may have impacted how it was perceived. Some participants indicated Laundrobot's actual skills were not aligned with expectations; this made it difficult to know what to expect and how much Laundrobot understood. Expertise, personality, and cultural differences may affect responses; however, these were not assessed.

Paperid: 2752, https://arxiv.org/pdf/2504.08256.pdf

Abstract:
Recent advances in large language models (LLMs) provide new opportunities for context understanding in virtual reality (VR). However, VR contexts are often highly localized and personalized, limiting the effectiveness of general-purpose LLMs. To address this challenge, we present RAG-VR, the first 3D question-answering system for VR that incorporates retrieval-augmented generation (RAG), which augments an LLM with external knowledge retrieved from a localized knowledge database to improve the answer quality. RAG-VR includes a pipeline for extracting comprehensive knowledge about virtual environments and user conditions for accurate answer generation. To ensure efficient retrieval, RAG-VR offloads the retrieval process to a nearby edge server and uses only essential information during retrieval. Moreover, we train the retriever to effectively distinguish among relevant, irrelevant, and hard-to-differentiate information in relation to questions. RAG-VR improves answer accuracy by 17.9%-41.8% and reduces end-to-end latency by 34.5%-47.3% compared with two baseline systems.
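
The retrieval step of a RAG pipeline like the one described above can be sketched as a top-k similarity search over a local knowledge base. This is a generic illustration, not RAG-VR's method: the paper trains its own retriever, whereas the sketch below uses hand-made toy vectors and plain cosine similarity, and every name in it (`cosine`, `retrieve`, the knowledge entries) is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             knowledge: List[Tuple[str, Sequence[float]]],
             k: int = 2) -> List[str]:
    """Return the k knowledge entries most similar to the query vector."""
    ranked = sorted(knowledge,
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy knowledge base: (description, made-up embedding).
    kb = [("red chair near the window", [1.0, 0.0, 0.0]),
          ("user is holding a key", [0.0, 1.0, 0.0]),
          ("blue lamp on the desk", [0.9, 0.1, 0.0])]
    context = retrieve([1.0, 0.0, 0.0], kb, k=2)
    print("Context passed to the LLM:", context)
```

In a full system, the retrieved entries would be prepended to the user's question in the LLM prompt; keeping k small is one way to limit the information sent during retrieval.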

Paperid: 2753, https://arxiv.org/pdf/2504.08128.pdf

Abstract:
Although the Boeing 737 Max incidents resulted from a mix of design shortcomings, regulatory oversights, and systemic issues, they also highlight a critical gap in pilot training on managing automated systems during abnormal conditions. This example demonstrates the urgent need for focused, concise training on human-automation interaction - a need that is equally critical for operators of Level 2 ADAS-equipped vehicles, as discussed in detail later in this article. The lack of structured education for semi-automated vehicle operators mirrors similar risks in other industries, where formal training is critical for safe operation. Two policy recommendations are proposed. First, governments should create concise, official resources in an accessible format to educate drivers on system capabilities and limitations. Second, mandatory training and certification programs should be introduced, combining theoretical and hands-on components to prepare drivers for real-world scenarios. These measures will improve driver understanding, reduce misuse, and foster public trust in semi-automated vehicle technologies. By addressing the knowledge gap, policymakers can ensure a safer, more responsible transition to automation, maximizing its benefits while minimizing risks to public safety.

Paperid: 2754, https://arxiv.org/pdf/2504.07685.pdf

Abstract:
This paper explores the potential of context-aware monolingual human evaluation for assessing machine translation (MT) when no source is given for reference. To this end, we compare monolingual with bilingual evaluations (with source text), under two scenarios: the evaluation of a single MT system, and the comparative evaluation of pairwise MT systems. Four professional translators performed both monolingual and bilingual evaluations by assigning ratings and annotating errors, and providing feedback on their experience. Our findings suggest that context-aware monolingual human evaluation achieves comparable outcomes to human bilingual evaluations, and suggest the feasibility and potential of monolingual evaluation as an efficient approach to assessing MT.

Paperid: 2755, https://arxiv.org/pdf/2504.07475.pdf

Abstract:
Few virtual reality (VR) user studies have involved people with vision/hearing impairments. This is due to the difficulty of recruiting participants and the accessibility barriers of VR devices. Based on the authors' experience conducting VR user studies with participants with vision/hearing impairments, this position paper identifies 5 key challenges (1. Recruitment, 2. Language Familiarity, 3. Technology Limitations and Barriers, 4. Access to Audio Cue, and 5. Travelling to the Experiment Location) and proposes strategic approaches to mitigate these challenges. In addition, we present three key considerations regarding understanding participants' lived experiences that could help make user studies accessible.

Paperid: 2757, https://arxiv.org/pdf/2504.07114.pdf

Abstract:
With the rapid adoption of LLM-based chatbots, there is a pressing need to evaluate what humans and LLMs can achieve together. However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., "AI-alone"). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. We release ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144K answers and 7,336 user-AI conversations. We find that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning), and we analyze the user-AI conversations to provide insight into how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracies, increasing correlation on held-out questions by more than 20 points, creating possibilities for scaling interactive evaluation.

Paperid: 2758, https://arxiv.org/pdf/2504.06435.pdf

Abstract:
Large Language Models (LLMs) increasingly power generative search engines which, in turn, drive human information seeking and decision making at scale. The extent to which humans trust generative artificial intelligence (GenAI) can therefore influence what we buy, how we vote and our health. Unfortunately, no work establishes the causal effect of generative search designs on human trust. Here we execute ~12,000 search queries across seven countries, generating ~80,000 real-time GenAI and traditional search results, to understand the extent of current global exposure to GenAI search. We then use a preregistered, randomized experiment on a large study sample representative of the U.S. population to show that while participants trust GenAI search less than traditional search on average, reference links and citations significantly increase trust in GenAI, even when those links and citations are incorrect or hallucinated. Uncertainty highlighting, which reveals GenAI's confidence in its own conclusions, makes us less willing to trust and share generative information whether that confidence is high or low. Positive social feedback increases trust in GenAI while negative feedback reduces trust. These results imply that GenAI designs can increase trust in inaccurate and hallucinated information and reduce trust when GenAI's certainty is made explicit. Trust in GenAI varies by topic and with users' demographics, education, industry employment and GenAI experience, revealing which sub-populations are most vulnerable to GenAI misrepresentations. Trust, in turn, predicts behavior, as those who trust GenAI more click more and spend less time evaluating GenAI search results. These findings suggest directions for GenAI design to safely and productively address the AI "trust gap."

Paperid: 2759, https://arxiv.org/pdf/2504.06379.pdf

Abstract:
Navigational challenges significantly impact the independence and mobility of Individuals with Visual Impairment (IVI). While numerous assistive technologies exist, their adoption remains limited due to usability challenges, financial constraints, and a lack of alignment with user needs. This study employs a mixed-methods approach, combining structured surveys and virtual workshops with 19 IVI to investigate their experiences, needs, and preferences regarding assistive technologies for navigation and daily living. The survey results provide insights into participants' technological competence, preferences for assistive devices, and willingness to adopt new solutions. In parallel, workshop discussions offer qualitative perspectives on key navigation challenges, including difficulties in detecting overhead obstacles, navigating environments with complex layouts, and the limitations of existing technologies. Findings highlight the need for assistive devices that integrate both navigational guidance and high-level spatial awareness, allowing users to build mental maps of their surroundings. Additionally, multimodal feedback, combining audio, haptic, and tactile cues, emerges as a crucial feature to accommodate diverse user preferences and environmental conditions. The study also underscores financial and training barriers that limit access to advanced assistive technologies. Based on these insights, we recommend the development of customizable, user-friendly, and most importantly affordable navigation aids that align with the daily needs of IVI. The findings from this study provide guidance for technology developers, researchers, and policymakers working toward more inclusive and effective assistive solutions.

Paperid: 2760, https://arxiv.org/pdf/2504.06189.pdf

Abstract:
This paper presents a novel framework for accessible and pedagogically-grounded robot explainability, designed to support human-robot interaction (HRI) with users who have diverse cognitive, communicative, or learning needs. We combine principles from Universal Design for Learning (UDL) and Universal Design (UD) with symbolic communication strategies to facilitate the alignment of mental models between humans and robots. Our approach employs Asterics Grid and ARASAAC pictograms as a multimodal, interpretable front-end, integrated with a lightweight HTTP-to-ROS 2 bridge that enables real-time interaction and explanation triggering. We emphasize that explainability is not a one-way function but a bidirectional process, where human understanding and robot transparency must co-evolve. We further argue that in educational or assistive contexts, the role of a human mediator (e.g., a teacher) may be essential to support shared understanding. We validate our framework with examples of multimodal explanation boards and discuss how it can be extended to different scenarios in education, assistive robotics, and inclusive AI.

Paperid: 2761, https://arxiv.org/pdf/2504.06145.pdf

Abstract:
Despite recent advances in Artificial Intelligence, the use of chatbot technology in customer service continues to face adoption hurdles. This paper explores reasons for these adoption hurdles and tests several service design levers to increase chatbot uptake. We use incentivized online experiments to study chatbot uptake in a variety of scenarios. The results of these experiments are threefold. First, people respond positively to improvements in chatbot performance; however, the chatbot channel is utilized less frequently than expected-time minimization would predict. A key driver of this underutilization is the reluctance to engage with a gatekeeper process, i.e., a process with an imperfect initial service stage and possible transfer to a second, expert service stage -- a behavior we term "gatekeeper aversion". We show that gatekeeper aversion can be further amplified by a secondary hurdle, algorithm aversion. Second, chatbot uptake can be increased by providing customers with average waiting times in the chatbot channel, as well as by being more transparent about chatbot capabilities and limitations. Third, methodologically, we show that chatbot adoption can depend on experimental implementation. In particular, chatbot adoption decreases further as (i) stakes are increased and (ii) the human/algorithmic nature of the server is manipulated with more realism. Our results suggest that firms should continue to prioritize investments in chatbot technology. However, less expensive, process-related interventions can also be effective. These may include being more transparent about the types of queries that are (or are not) suitable for chatbots, emphasizing chatbot reliability and quick resolution times, as well as providing faster live agent access to customers who experienced chatbot failure.

Paperid: 2762, https://arxiv.org/pdf/2504.06114.pdf

Abstract:
Automation and industrial mass production, particularly in sectors with low wages, have harmful consequences that contribute to widening wealth disparities, excessive pollution, and worsened working conditions. Coupled with a mass consumption society, there is a risk of detrimental social outcomes and threats to democracy, such as misinformation and political polarization. But AI, robotics and other emerging technologies could also provide a transition to community-based economies, in which more democratic, egalitarian, and sustainable value circulations can be established. Based on both a review of case studies, and our own experiments in Detroit, we derive three core principles for the use of computing in community-based economies. The prefigurative principle requires that the development process itself incorporates equity goals, rather than viewing equity as something to be achieved in the future. The generative principle requires the prevention of value extraction, and its replacement by circulations in which value is returned to the aspects of labor, nature, and society by which it is generated. And third, the solidarity principle requires that deployments at all scales and across all domains support both individual freedoms and opportunities for mutual aid. Thus we propose the use of computational technologies to develop a specifically generative form of community-based economy: one that is egalitarian regarding race, class and gender; sustainable both environmentally and socially; and democratic in the deep sense of putting people in control of their own lives and livelihoods.

Paperid: 2763, https://arxiv.org/pdf/2504.05857.pdf

Abstract:
Searching for unfamiliar American Sign Language (ASL) signs is challenging for learners because, unlike with spoken languages, they cannot type a text-based query to look up an unfamiliar sign. Advances in isolated sign recognition have enabled the creation of video-based dictionaries, allowing users to submit a video and receive a list of the closest matching signs. Previous HCI research using Wizard-of-Oz prototypes has explored interface designs for ASL dictionaries. Building on these studies, we incorporate their design recommendations and leverage state-of-the-art sign-recognition technology to develop an automated video-based dictionary. We also present findings from an observational study with twelve novice ASL learners who used this dictionary during video-comprehension and question-answering tasks. Our results address human-AI interaction challenges not covered in previous WoZ research, including recording and resubmitting signs, unpredictable outputs, system latency, and privacy concerns. These insights offer guidance for designing and deploying video-based ASL dictionary systems.

Paperid: 2764, https://arxiv.org/pdf/2504.05755.pdf

Abstract:
Artificial Intelligence (AI) is advancing at an unprecedented pace, with clear potential to enhance decision-making and productivity. Yet, the collaborative decision-making process between humans and AI remains underdeveloped, often falling short of its transformative possibilities. This paper explores the evolution of AI agents from passive tools to active collaborators in human-AI teams, emphasizing their ability to learn, adapt, and operate autonomously in complex environments. This paradigm shift challenges traditional team dynamics, requiring new interaction protocols, delegation strategies, and responsibility distribution frameworks. Drawing on Team Situation Awareness (SA) theory, we identify two critical gaps in current human-AI teaming research: the difficulty of aligning AI agents with human values and objectives, and the underutilization of AI's capabilities as genuine team members. Addressing these gaps, we propose a structured research outlook centered on four key aspects of human-AI teaming: formulation, coordination, maintenance, and training. Our framework highlights the importance of shared mental models, trust-building, conflict resolution, and skill adaptation for effective teaming. Furthermore, we discuss the unique challenges posed by varying team compositions, goals, and complexities. This paper provides a foundational agenda for future research and practical design of sustainable, high-performing human-AI teams.

Paperid: 2765, https://arxiv.org/pdf/2504.05568.pdf

Abstract:
Inter-brain synchronization (IBS), the alignment of neural activities between individuals, is a fundamental mechanism underlying effective social interactions and communication. Prior research has demonstrated that IBS can occur during collaborative tasks and is deeply connected to communication effectiveness. Building on these findings, recent investigations reveal that IBS happens during remote interactions, implying that brain activities between individuals can synchronize despite latency and physical separation. However, the conditions under which this synchronization occurs or is disrupted in remote settings, especially the effect of latency, are not fully understood. This study investigates how varying transmission latency affects IBS, in order to identify thresholds where synchronization is disrupted. Using electroencephalography measurements quantified through Phase Locking Value -- a metric that captures synchronization between brainwave phases -- we first confirm synchronization under face-to-face conditions and then observe changes in IBS across remote communication scenarios. Our findings reveal that IBS can occur during remote collaboration, but is critically dependent on transmission delays, with delays exceeding 450 ms significantly disrupting synchronization. These findings suggest that IBS may serve as a key indicator of communication quality in remote interactions, offering insights for improving remote communication systems and collaboration.
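As an illustration of the metric named in this abstract, the following is a minimal sketch of the standard Phase Locking Value definition, i.e., the length of the mean unit vector of the phase differences between two signals. It assumes the instantaneous phases have already been extracted from the EEG recordings (for example by band-pass filtering followed by a Hilbert transform, which is not shown); the function name and the example data are hypothetical and do not come from the paper.

```python
import cmath
import math


def phase_locking_value(phase_a, phase_b):
    """Return the PLV of two equally long sequences of instantaneous phases (radians)."""
    if len(phase_a) != len(phase_b) or len(phase_a) == 0:
        raise ValueError("phase sequences must be non-empty and equally long")
    # Map each phase difference to a unit vector exp(i * difference),
    # average the vectors, and take the length of the mean vector.
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phase_a, phase_b))
    return abs(total / len(phase_a))


# Hypothetical data: a 10-cycle phase ramp and a copy with a constant lag.
n = 200
phase_a = [2 * math.pi * 10 * t / n for t in range(n)]
locked = [p - 0.5 for p in phase_a]

# A constant lag gives a PLV close to 1 (perfect locking).
print(phase_locking_value(phase_a, locked))
```

A PLV near 1 indicates a stable phase relationship between the two signals, while a PLV near 0 indicates phase differences spread evenly around the circle.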

Paperid: 2766, https://arxiv.org/pdf/2504.05445.pdf

Abstract:
Vision Language Models (VLMs) demonstrate promising chart comprehension capabilities. Yet, prior explorations of their visualization literacy have been limited to assessing their response correctness and fail to explore their internal reasoning. To address this gap, we adapted attention-guided class activation maps (AG-CAM) for VLMs, to visualize the influence and importance of input features (image and text) on model responses. Using this approach, we conducted an examination of four open-source (ChartGemma, Janus 1B and 7B, and LLaVA) and two closed-source (GPT-4o, Gemini) models comparing their performance and, for the open-source models, their AG-CAM results. Overall, we found that ChartGemma, a 3B parameter VLM fine-tuned for chart question-answering (QA), outperformed other open-source models and exhibited performance on par with significantly larger closed-source VLMs. We also found that VLMs exhibit spatial reasoning by accurately localizing key chart features, and semantic reasoning by associating visual elements with corresponding data values and query tokens. Our approach is the first to demonstrate the use of AG-CAM on early fusion VLM architectures, which are widely used, and for chart QA. We also show preliminary evidence that these results can align with human reasoning. Our promising results with open-source VLMs pave the way for transparent and reproducible research in AI visualization literacy.

Paperid: 2767, https://arxiv.org/pdf/2504.05156.pdf

Abstract:
This paper presents a user study (N=22) where participants used an interface combining Web Search and a Generative AI-Chat feature to solve health-related information tasks. We study how people behaved with the interface, why they behaved in certain ways, and what the outcomes of these behaviours were. A think-aloud protocol captured their thought processes during searches. Our findings suggest that GenAI is neither a search panacea nor a major regression compared to standard Web Search interfaces. Qualitative and quantitative analyses identified 78 tactics across five categories and provided insight into how and why different interface features were used. We find evidence that pre-task confidence and trust both influenced which interface feature was used. In both systems, but particularly when using the chat feature, trust was often misplaced in favour of ease-of-use and seemingly perfect answers, leading to increased confidence post-search despite having incorrect results. We discuss what our findings mean in the context of our defined research questions and outline several open questions for future research.

Paperid: 2768, https://arxiv.org/pdf/2504.04865.pdf

Abstract:
Image-generating AI, which allows users to create images from text, is increasingly used to produce visual content. Despite its advancements, cultural biases in AI-generated images have raised significant concerns. While much research has focused on issues within Western contexts, our study examines the perceived biases regarding the portrayal of East Asian women. In this exploratory study, we invited East Asian users to audit three popular models (DALL-E, Midjourney, Stable Diffusion) and identified 18 specific perceived biases, categorized into four patterns: Westernization, overuse or misuse of cultural symbols, sexualization & feminization, and racial stereotypes. This work highlights the potential challenges posed by AI models in portraying Eastern individuals.

Paperid: 2769, https://arxiv.org/pdf/2504.04299.pdf

Abstract:
Advancements in artificial intelligence (AI) have led to a rise in conversational agents like Replika, designed to provide social interaction and emotional support. However, reports of these AI systems engaging in inappropriate sexual behaviors with users have raised significant concerns. In this study, we conducted a thematic analysis of user reviews from the Google Play Store to investigate instances of sexual harassment by the Replika chatbot. From a dataset of 35,105 negative reviews, we identified 800 relevant cases for analysis. Our findings revealed that users frequently experience unsolicited sexual advances, persistent inappropriate behavior, and failures of the chatbot to respect user boundaries. Users expressed feelings of discomfort, violation of privacy, and disappointment, particularly when seeking a platonic or therapeutic AI companion. This study highlights the potential harms associated with AI companions and underscores the need for developers to implement effective safeguards and ethical guidelines to prevent such incidents. By shedding light on user experiences of AI-induced harassment, we contribute to the understanding of AI-related risks and emphasize the importance of corporate responsibility in developing safer and more ethical AI systems.

Paperid: 2770, https://arxiv.org/pdf/2504.04253.pdf

Abstract:
Recent advances in GenAI have enabled automation in data visualization, allowing users to generate visual representations using natural language. However, existing systems primarily focus on automation, overlooking users' varying expertise levels and analytical needs. In this position paper, we advocate for a shift toward adaptive GenAI-driven visualization tools that tailor interactions, reasoning, and visualizations to individual users. We first review existing automation-focused approaches and highlight their limitations. We then introduce methods for assessing user expertise, as well as key open challenges and research questions that must be addressed to allow for an adaptive approach. Finally, we present our vision for a user-centered system that leverages GenAI not only for automation but as an intelligent collaborator in visual data exploration. Our perspective contributes to the broader discussion on designing GenAI-based systems that enhance human cognition by dynamically adapting to the user, ultimately advancing toward systems that promote augmented cognition.

Paperid: 2771, https://arxiv.org/pdf/2504.04248.pdf

Abstract:
We consider the problem of optimal decision referrals in human-automation teams performing binary classification tasks. The automation, which includes a pre-trained classifier, observes data for a batch of independent tasks, analyzes them, and may refer a subset of tasks to a human operator for fresh and final analysis. Our key modeling assumption is that human performance degrades with task load. We model the problem of choosing which tasks to refer as a stochastic optimization problem and show that, for a given task load, it is optimal to myopically refer tasks that yield the largest reduction in expected cost, conditional on the observed data. This provides a ranking scheme and a policy to determine the optimal set of tasks for referral. We evaluate this policy against a baseline through an experimental study with human participants. Using a radar screen simulator, participants made binary target classification decisions under a time constraint. They were guided by a decision rule provided to them, but were still prone to errors under time pressure. An initial experiment estimated human performance model parameters, while a second experiment compared two referral policies. Results show statistically significant gains for the proposed optimal referral policy over a blind policy that determines referrals using the automation and human-performance models but not based on the observed data.
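The ranking idea in this abstract can be sketched as follows. This is a simplified illustration, not the paper's model: it assumes unit misclassification costs, an automation that reports a posterior probability for the positive class, and a human error rate that depends only on the number of referred tasks. The function names, the linear error-rate model, and the example numbers are all hypothetical.

```python
def automation_cost(p_positive):
    """Expected cost if the automation decides alone (assumed unit costs)."""
    # The automation picks the more likely label, so it errs with
    # probability min(p, 1 - p) given the observed data.
    return min(p_positive, 1.0 - p_positive)


def refer_tasks(posteriors, load, human_error_rate):
    """Return indices of the `load` tasks with the largest expected cost reduction."""
    # Assumption: at a fixed load the human's expected cost is the same for every task.
    human_cost = human_error_rate(load)
    reductions = [(automation_cost(p) - human_cost, i) for i, p in enumerate(posteriors)]
    # Largest reduction first; ties are broken by task index.
    reductions.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in reductions[:load]]


# Hypothetical batch: posterior P(positive | data) for five tasks.
posteriors = [0.95, 0.55, 0.10, 0.48, 0.80]


def human_error_rate(k):
    """Assumed human error model: the error rate grows linearly with load k."""
    return 0.05 + 0.03 * k


print(refer_tasks(posteriors, 2, human_error_rate))  # [3, 1]
```

Under these assumptions the tasks on which the classifier is least certain (posteriors nearest 0.5) are referred first, because handing them to the human removes the most expected cost.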

Paperid: 2772, https://arxiv.org/pdf/2504.04075.pdf

Abstract:
Multimodal research and applications are becoming more commonplace as Virtual Reality (VR) technology integrates different sensory feedback, enabling the recreation of real spaces in an audio-visual context. Within VR experiences, numerous applications rely on the user's voice as a key element of interaction, including music performances and public speaking applications. Self-perception of our voice plays a crucial role in vocal production. When singing or speaking, our voice interacts with the acoustic properties of the environment, shaping the adjustment of vocal parameters in response to the perceived characteristics of the space. This technical report presents a real-time auralization pipeline that leverages three-dimensional Spatial Impulse Responses (SIRs) for multimodal research applications in VR requiring first-person vocal interaction. It describes the impulse response creation and rendering workflow, the audio-visual integration, and addresses latency and computational considerations. The system enables users to explore acoustic spaces from various positions and orientations within a predefined area, supporting three and five Degrees of Freedom (3DoF and 5DoF) in audio-visual multimodal perception for both research and creative applications in VR.

Paperid: 2773, https://arxiv.org/pdf/2504.04006.pdf

Abstract:
A main challenge faced by non-profit organisations providing computer science education to under-represented groups is the high drop-out rate. This issue arises from various factors affecting both students and teachers, such as the one-size-fits-all approach of many lessons. Enhancing social inclusion in the learning process could help reduce these drop-out rates. We present JsStories, a tool designed to help students learn JavaScript through interactive stories. The development of JsStories has been informed by existing literature on storytelling for inclusion and insights gained from a visit to HackYourFuture Belgium (HYFBE), a non-profit organisation that teaches web development to refugees and migrants. To lower barriers to entry and maximise the feeling of connection to the story, we incorporated narratives from HYFBE alumni. Further, we adhered to educational best practices by applying the PRIMM principles and offering level-appropriate content based on knowledge graphs. JsStories has been demonstrated, evaluated and communicated to the different stakeholders through interviews and a survey, enabling us to identify future directions for story-based learning solutions.

Paperid: 2774, https://arxiv.org/pdf/2504.03966.pdf

Abstract:
The integration of Large Language Models (LLMs) with Learning Management Systems (LMSs) has the potential to enhance task automation and accessibility in education. However, hallucination, where LLMs generate inaccurate or misleading information, remains a significant challenge. This study introduces the Dynamic Course Content Integration (DCCI) mechanism, which dynamically retrieves and integrates course content and curriculum from Canvas LMS into the LLM-powered assistant, Ask ME. By employing prompt engineering to structure retrieved content within the LLM's context window, DCCI ensures accuracy, relevance, and contextual alignment, mitigating hallucination. To evaluate DCCI's effectiveness, Ask ME's usability, and broader student perceptions of AI in education, a mixed-methods approach was employed, incorporating user satisfaction ratings and a structured survey. Results from a pilot study indicate high user satisfaction (4.614/5), with students recognizing Ask ME's ability to provide timely and contextually relevant responses for both administrative and course-related inquiries. Additionally, a majority of students agreed that Ask ME's integration with course content in Canvas LMS reduced platform-switching, improving usability, engagement, and comprehension. AI's role in reducing classroom hesitation and fostering self-directed learning and intellectual curiosity was also highlighted. Despite these benefits and positive perception of AI tools, concerns emerged regarding over-reliance on AI, accuracy limitations, and ethical issues such as plagiarism and reduced student-teacher interaction. These findings emphasize the need for strategic AI implementation, ethical safeguards, and a pedagogical framework that prioritizes human-AI collaboration over substitution.

Paperid: 2775, https://arxiv.org/pdf/2504.03334.pdf

Abstract:
The integration of machine learning and deep learning has transformed data analytics in biomechanics, enabled by extensive wearable sensor data. However, the field faces challenges such as limited large-scale datasets and high data acquisition costs, which hinder the development of robust algorithms. Data augmentation techniques show promise in addressing these issues, but their application to biomechanical time-series data requires comprehensive evaluation. This scoping review investigates data augmentation methods for time-series data in the biomechanics domain. It analyzes current approaches for augmenting and generating time-series datasets, evaluates their effectiveness, and offers recommendations for applying these techniques in biomechanics. Four databases, PubMed, IEEE Xplore, Scopus, and Web of Science, were searched for studies published between 2013 and 2024. Following PRISMA-ScR guidelines, a two-stage screening identified 21 relevant publications. Results show that there is no universally preferred method for augmenting biomechanical time-series data; instead, methods vary based on study objectives. A major issue identified is the absence of soft tissue artifacts in synthetic data, leading to discrepancies referred to as the synthetic gap. Moreover, many studies lack proper evaluation of augmentation methods, making it difficult to assess their effects on model performance and data quality. This review highlights the critical role of data augmentation in addressing limited dataset availability and improving model generalization in biomechanics. Tailoring augmentation strategies to the characteristics of biomechanical data is essential for advancing predictive modeling. A better understanding of how different augmentation methods impact data quality and downstream tasks will be key to developing more effective and realistic techniques.

Paperid: 2776, https://arxiv.org/pdf/2504.03147.pdf

Abstract:
Recent developments in Artificial Intelligence (AI) and Machine Learning (ML) are creating new opportunities for Human-Autonomy Teaming (HAT) in tasks, missions, and continuous coordinated activities. A major challenge is enabling humans to maintain awareness and control over autonomous assets, while also building trust and supporting shared contextual understanding. To address this, we present a real-time Human Digital Twin (HDT) architecture that integrates Large Language Models (LLMs) for knowledge reporting, answering, and recommendation, embodied in a visual interface. The system applies a metacognitive approach to enable personalized, context-aware responses aligned with the human teammate's expectations. The HDT acts as a visually and behaviorally realistic team member, integrated throughout the mission lifecycle, from training to deployment to after-action review. Our architecture includes speech recognition, context processing, AI-driven dialogue, emotion modeling, lip-syncing, and multimodal feedback. We describe the system design, performance metrics, and future development directions for more adaptive and realistic HAT systems.

Paperid: 2777, https://arxiv.org/pdf/2504.03029.pdf

Abstract:
Amid the recent uptake of Generative AI, sociotechnical scholars and critics have traced a multitude of resulting harms, with analyses largely focused on values and axiology (e.g., bias). While value-based analyses are crucial, we argue that ontologies -- concerning what we allow ourselves to think or talk about -- are a vital but under-recognized dimension in analyzing these systems. Proposing a need for a practice-based engagement with ontologies, we offer four orientations for considering ontologies in design: pluralism, groundedness, liveliness, and enactment. We share examples of potentialities that are opened up through these orientations across the entire LLM development pipeline by conducting two ontological analyses: examining the responses of four LLM-based chatbots in a prompting exercise, and analyzing the architecture of an LLM-based agent simulation. We conclude by sharing opportunities and limitations of working with ontologies in the design and development of sociotechnical systems.

Paperid: 2778, https://arxiv.org/pdf/2504.02998.pdf

Abstract:
Extended Reality (XR) has expanded the horizons of entertainment and social life and shows great potential in the manufacturing industry. Prototyping in XR can help designers make initial proposals and iterations at low cost before manufacturers and investors decide whether to invest in research, development or even production. According to the literature (54 manuscripts in the last 15 years), prototyping in XR is easier to use than three-dimensional (3D) modeling with a personal computer and more capable of displaying 3D structures than paper drawing. In this comprehensive review, we systematically surveyed the literature on prototyping in XR and discussed the possibility of transferring created virtual prototypes from XR to commonly used 3D modeling software and reality. We proposed five research questions regarding prototyping in XR. They are: what the constituent elements and workflow of prototyping are; which display devices can deliver satisfying immersive and interactive experiences; how user control input is obtained and what methods are available for users to interact with virtual elements and create XR prototypes; what approaches can facilitate the connection with fabrication to ensure a smooth transition from the virtual to the physical world; and what the challenges are and what the future holds for this research domain. Based on these questions, we summarized the components and workflows of prototyping in XR. Moreover, we present an overview of the latest trends in display device evolution, control technologies, digital model construction, and manufacturing processes. In view of these latest developments and gaps, we speculated on the challenges and opportunities in the field of prototyping in XR, especially in linking extended reality to digital fabrication, with the aim of guiding researchers towards new research directions.

Paperid: 2779, https://arxiv.org/pdf/2504.02675.pdf

Abstract:
Investigating cybersickness (CS) in virtual reality (VR) often requires significant resources to create the VR environment and manage other experiment-related aspects. Additionally, slight differences in VR content across studies can lead to conflicting results. To address these challenges, we propose a standardized assessment framework to facilitate cybersickness research. The main goal is to enable consistent and comparable CS-related experiments. By establishing this common foundation, researchers can better evaluate and compare the impact of various factors on cybersickness. We provide a comprehensive explanation of the conceptual designs, detail the technical implementation, and offer instructions for using the proposed framework. Lastly, we conclude by discussing the limitations and potential avenues for future development.

Paperid: 2780, https://arxiv.org/pdf/2504.01260.pdf

Abstract:
This study explores how human perceptions of a non-anthropomorphic robotic manipulator are shaped by two key dimensions of behaviour: arousal, defined as the robot's movement energy and expressiveness, and attention, defined as the robot's capacity to selectively orient toward and engage with a user. We introduce a novel control architecture that integrates a gaze-like attention engine with an arousal-modulated motion system to generate socially meaningful behaviours. In a user study, we find that robots exhibiting high attention -- actively directing their focus toward users -- are perceived as warmer and more competent, intentional, and lifelike. In contrast, high arousal -- characterized by fast, expansive, and energetic motions -- increases perceptions of discomfort and disturbance. Importantly, a combination of focused attention and moderate arousal yields the highest ratings of trust and sociability, while excessive arousal diminishes social engagement. These findings offer design insights for endowing non-humanoid robots with expressive, intuitive behaviours that support more natural human-robot interaction.

Paperid: 2781, https://arxiv.org/pdf/2504.00860.pdf

Abstract:
Despite numerous efforts to mitigate their biases, ML systems continue to harm already-marginalized people. While predominant ML approaches assume bias can be removed and fair models can be created, we show that these are not always possible, nor desirable, goals. We reframe the problem of ML bias by creating models to identify biased language, drawing attention to a dataset's biases rather than trying to remove them. Then, through a workshop, we evaluated the models for a specific use case: workflows of information and heritage professionals. Our findings demonstrate the limitations of ML for identifying bias due to its contextual nature, the way in which approaches to mitigating it can simultaneously privilege and oppress different communities, and its inevitability. We demonstrate the need to expand ML approaches to bias and fairness, providing a mixed-methods approach to investigating the feasibility of removing bias or achieving fairness in a given ML use case.

Paperid: 2782, https://arxiv.org/pdf/2504.00799.pdf

Abstract:
Electronic dictionaries have largely replaced paper dictionaries and become central tools for L2 learners seeking to expand their vocabulary. Users often assume these resources are reliable and rarely question the validity of the definitions provided. The accuracy of major E-dictionaries is seldom scrutinized, and little attention has been paid to how their corpora are constructed. Research on dictionary use, particularly the limitations of electronic dictionaries, remains scarce. This study adopts a combined method of experimentation, user survey, and dictionary critique to examine Youdao, one of the most widely used E-dictionaries in China. The experiment involved a translation task paired with retrospective reflection. Participants were asked to translate sentences containing words that are insufficiently or inaccurately defined in Youdao. Their consultation behavior was recorded to analyze how faulty definitions influenced comprehension. Results show that incomplete or misleading definitions can cause serious misunderstandings. Additionally, students exhibited problematic consultation habits. The study further explores how such flawed definitions originate, highlighting issues in data processing and the integration of AI and machine learning technologies in dictionary construction. The findings suggest a need for better training in dictionary literacy for users, as well as improvements in the underlying AI models used to build E-dictionaries.

Paperid: 2783, https://arxiv.org/pdf/2504.00767.pdf

Abstract:
While football analytics has changed the way teams and analysts assess performance, there remains a communication gap between machine learning practice and how coaching staff talk about football. Coaches and practitioners require actionable insights, which are not always provided by models. To bridge this gap, we show how to build wordalizations (a novel approach that leverages large language models) for shots in football. Specifically, we first build an expected goals model using logistic regression. We then use the coefficients of this regression model to write sentences describing how factors (such as distance, angle and defensive pressure) contribute to the model's prediction. Finally, we use large language models to give an entertaining description of the shot. We describe our approach in a model card and provide an interactive open-source application describing shots in recent tournaments. We discuss how shot wordalizations might aid communication in coaching and football commentary, and give a further example of how the same approach can be applied to other actions in football.
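
The first two steps described above (a logistic-regression expected goals model, then sentences built from its coefficients) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the coefficient values, the intercept, the feature names, and the sentence template are all assumptions made up for the example, and the final LLM rewriting step is omitted.

```python
import math

# Hypothetical coefficients and intercept, for illustration only.
# They are NOT taken from the paper or its model card.
COEFFICIENTS = {
    "distance_m": -0.12,
    "angle_rad": 1.10,
    "defensive_pressure": -0.45,
}
INTERCEPT = -0.80

# Hypothetical human-readable labels used in the generated sentences.
LABELS = {
    "distance_m": "distance to goal",
    "angle_rad": "shooting angle",
    "defensive_pressure": "defensive pressure",
}


def expected_goals(shot):
    """Logistic regression: sigmoid of the intercept plus weighted features."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * shot[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


def describe_contributions(shot):
    """Turn each feature's log-odds contribution into a plain sentence."""
    sentences = []
    for name, weight in COEFFICIENTS.items():
        contribution = weight * shot[name]
        direction = "raised" if contribution > 0 else "lowered"
        sentences.append(
            f"The {LABELS[name]} {direction} the log-odds of scoring "
            f"by {abs(contribution):.2f}."
        )
    return sentences


shot = {"distance_m": 11.0, "angle_rad": 0.6, "defensive_pressure": 1.0}
xg = expected_goals(shot)
sentences = describe_contributions(shot)
```

In a pipeline like the one the abstract describes, sentences such as these would then be passed to a large language model as grounding text for the final description.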

Paperid: 2784, https://arxiv.org/pdf/2504.00636.pdf

Abstract:
This study examines the impact of an LLM-powered teachable agent, grounded in the Learning by Teaching (LBT) pedagogy, on students' music theory learning and cognitive load. The participants were 28 Chinese university students with prior experience playing a musical instrument. In an online experiment, they were assigned to either an experimental group, which engaged in music analysis with the teachable agent, or a control group, which conducted self-directed analysis using instructional materials. Findings indicate that students in the experimental group achieved significantly higher post-test scores than those in the control group. Additionally, they reported lower cognitive load, suggesting that the teachable agent effectively reduced the cognitive demands of music analysis tasks. These results highlight the potential of AI-driven scaffolding based on LBT principles to enhance music theory education, supporting teachers in delivering theory-oriented instruction while fostering students' self-directed learning skills.

Paperid: 2785, https://arxiv.org/pdf/2504.00286.pdf

Abstract:
The biopharmaceutical industry is increasingly developing digital twins to digitalize and automate the manufacturing process in response to the growing market demands. However, this shift presents significant challenges for human operators, as the complexity and volume of information can overwhelm their ability to manage the process effectively. These issues are compounded when digital twins are designed without considering interaction and collaboration with operators, who are responsible for monitoring processes and assessing situations, particularly during abnormalities. Our review of current trends in biopharma digital twin development reveals a predominant focus on technology and often overlooks the critical role of human operators. To bridge this gap, this article proposes a collaborative intelligence framework that emphasizes the integration of operators with digital twins. Approaches to system design that can enhance operator trust and human-machine interface usability are presented. Moreover, innovative training programs for preparing operators to understand and utilize digital twins are discussed. The framework outlined in this article aims to enhance collaboration between operators and digital twins effectively by using their full capabilities to boost resilience and productivity in biopharmaceutical manufacturing.

Paperid: 2786, https://arxiv.org/pdf/2504.00025.pdf

Abstract:
Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26 to 73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (OR = 4.85, 95% CI [3.06, 7.70]). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
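
For readers less familiar with the statistic reported above, the following sketch shows how an odds ratio and a Wald 95% confidence interval are commonly computed from a 2x2 table. The counts are made up for illustration and do not reproduce the study's data; the study may also have estimated its odds ratio differently (for example, from a regression model).

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a, b: group 1 with / without the outcome
    c, d: group 2 with / without the outcome
    z:    normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 40 of 100 LLM summaries and 10 of 100 human
# summaries contain a broad generalization.
or_value, ci_low, ci_high = odds_ratio_with_ci(40, 60, 10, 90)
```

An interval that excludes 1, as the one reported in the abstract does, indicates that the difference in odds between the two groups is unlikely to be due to chance alone.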

Paperid: 2787, https://arxiv.org/pdf/2504.00002.pdf

Abstract:
Recent advancements in large language models (LLMs) have prompted interest in deploying these models on mobile devices to enable new applications without relying on cloud connectivity. However, the efficiency constraints of deploying LLMs on resource-limited devices present significant challenges. In this paper, we conduct a comprehensive measurement study to evaluate the efficiency tradeoffs between mobile-based, edge-based, and cloud-based deployments for LLM applications. We implement AutoLife-Lite, a simplified LLM-based application that analyzes smartphone sensor data to infer user location and activity contexts. Our experiments reveal that: (1) Only small-size LLMs (<4B parameters) can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models; (2) Model compression is effective in lowering hardware requirements, but may lead to significant performance degradation; (3) The latency to run LLMs on mobile devices with meaningful output is significant (>30 seconds), while cloud services demonstrate better time efficiency (<10 seconds); (4) Edge deployments offer intermediate tradeoffs between latency and model capabilities, with different results on CPU-based and GPU-based settings. These findings provide valuable insights for system designers on the current limitations and future directions for on-device LLM applications.

Paperid: 2788, https://arxiv.org/pdf/2503.23691.pdf

Abstract:
Genome annotation is essential for understanding the functional elements within genomes. While automated methods are indispensable for processing large-scale genomic data, they often face challenges in accurately predicting gene structures and functions. Consequently, manual curation by domain experts remains crucial for validating and refining these predictions. These combined outcomes from automated tools and manual curation highlight the importance of integrating human expertise with AI capabilities to improve both the accuracy and efficiency of genome annotation. However, the manual curation process is inherently labor-intensive and time-consuming, making it difficult to scale for large datasets. To address these challenges, we propose a conceptual framework, Human-AI Collaborative Genome Annotation (HAICoGA), which leverages the synergistic partnership between humans and artificial intelligence to enhance human capabilities and accelerate the genome annotation process. Additionally, we explore the potential of integrating Large Language Models (LLMs) into this framework to support and augment specific tasks. Finally, we discuss emerging challenges and outline open research questions to guide further exploration in this area.

Paperid: 2789, https://arxiv.org/pdf/2503.23688.pdf

Abstract:
This study systematically analyzes geopolitical bias across 11 prominent Large Language Models (LLMs) by examining their responses to seven critical topics in U.S.-China relations. Utilizing a bilingual (English and Chinese) and dual-framing (affirmative and reverse) methodology, we generated 19,712 prompts designed to detect ideological leanings in model outputs. Responses were quantitatively assessed on a normalized scale from -2 (strongly Pro-China) to +2 (strongly Pro-U.S.) and categorized according to stance, neutrality, and refusal rates. The findings demonstrate significant and consistent ideological alignments correlated with the LLMs' geographic origins; U.S.-based models predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited pronounced Pro-China biases. Notably, language and prompt framing substantially influenced model responses, with several LLMs exhibiting stance reversals based on prompt polarity or linguistic context. Additionally, we introduced comprehensive metrics to evaluate response consistency across languages and framing conditions, identifying variability and vulnerabilities in model behaviors. These results offer practical insights that can guide organizations and individuals in selecting LLMs best aligned with their operational priorities and geopolitical considerations, underscoring the importance of careful model evaluation in politically sensitive applications. Furthermore, the research highlights specific prompt structures and linguistic variations that can strategically trigger distinct responses from models, revealing methods for effectively navigating and influencing LLM outputs.

Paperid: 2790, https://arxiv.org/pdf/2503.23327.pdf

Abstract:
A key objective in artificial intelligence (AI) development is to create systems that match or surpass human creativity. Although current AI models perform well across diverse creative tasks, it remains unclear whether these achievements reflect genuine creative thinking. This study examined whether AI models (GPT-3.5-turbo, GPT-4, and GPT-4o) engage in creative thinking by comparing their performance with humans across various creative tasks and core cognitive processes. Results showed that AI models outperformed humans in divergent thinking, convergent thinking, and insight problem-solving, but underperformed in creative writing. Compared to humans, AI generated lower forward flow values in both free and chain association tasks and showed lower accuracy in the representational change task. In creative evaluation, AI exhibited no significant correlation between the weights of novelty and appropriateness when predicting creative ratings, suggesting the absence of a human-like trade-off strategy. AI also had higher decision error scores in creative selection, suggesting difficulty identifying the most creative ideas. These findings suggest that while AI can mimic human creativity, its strong performance in creative tasks is likely driven by non-creative mechanisms rather than genuine creative thinking.

Paperid: 2791, https://arxiv.org/pdf/2503.23244.pdf

Abstract:
In web analytics, cloud-based solutions have limitations in data ownership and privacy, whereas client-side user tracking tools face challenges such as data accuracy and a lack of server-side metrics. This paper presents the Combined Analytics and Web Application Log (CAWAL) framework as an alternative model and an on-premises framework, offering web analytics with application logging integration. CAWAL enables precise data collection and cross-domain tracking in web farms while complying with data ownership and privacy regulations. The framework also improves software diagnostics and troubleshooting by incorporating application-specific data into analytical processes. Integrated into an enterprise-grade web application, CAWAL has demonstrated superior performance, achieving approximately 24% and 85% lower response times compared to Open Web Analytics (OWA) and Matomo, respectively. The empirical evaluation demonstrates that the framework eliminates certain limitations in existing tools and provides a robust data infrastructure for enhanced web analytics.

Paperid: 2792, https://arxiv.org/pdf/2503.23147.pdf

Abstract:
Digital twin technologies help practitioners simulate, monitor, and predict undesirable outcomes in-silico, while avoiding the cost and risks of conducting live simulation exercises. Virtual reality (VR) based digital twin technologies are especially useful when monitoring human Patterns of Life (POL) in secure nuclear facilities, where live simulation exercises are too dangerous and costly to ever perform. However, the high-security status of such facilities may restrict modelers from deploying human activity sensors for data collection. This problem was encountered when deploying MetaPOL, a digital twin system to prevent insider threat or sabotage of secure facilities, at a secure nuclear reactor facility at Oak Ridge National Laboratory (ORNL). This challenge was addressed using an agent-based model (ABM), driven by anecdotal evidence of facility personnel POL, to generate synthetic movement trajectories. These synthetic trajectories were then used to train deep neural network surrogates for next location and stay duration prediction to drive NPCs in the VR environment. In this study, we evaluate the efficacy of this technique for establishing NPC movement within MetaPOL and the ability to distinguish NPC movement during normal operations from that during a simulated emergency response. Our results demonstrate the success of using a multi-layer perceptron for next location prediction and a mixture density network for stay duration prediction to predict the ABM-generated trajectories. We also find that NPC movement in the VR environment driven by the deep neural networks under normal operations remains significantly different from that seen when simulating responses to an emergency scenario.

Paperid: 2793, https://arxiv.org/pdf/2503.22978.pdf

Abstract:
The potential of large language models (LLMs) to mitigate the time- and cost-related challenges associated with inductive thematic analysis (ITA) has been extensively explored in the literature. However, the use of LLMs to support ITA has often been opportunistic, relying on ad hoc prompt engineering (PE) approaches, thereby undermining the reliability, transparency, and replicability of the analysis. The goal of this study is to develop a structured approach to PE in LLM-assisted ITA. To this end, a comprehensive review of the existing literature is conducted to examine how ITA researchers integrate LLMs into their workflows and, in particular, how PE is utilized to support the analytical process. Built on the insights generated from this review, four key steps for effective PE in LLM-assisted ITA are identified and extensively outlined. Furthermore, the study explores state-of-the-art PE techniques that can enhance the execution of these steps, providing ITA researchers with practical strategies to improve their analyses. In conclusion, the main contributions of this paper include: (i) it maps the existing research on LLM-assisted ITA to enable a better understanding of the rapidly developing field, (ii) it outlines a structured four-step PE process to enhance methodological rigor, (iii) it discusses the application of advanced PE techniques to support the execution of these steps, and (iv) it highlights key directions for future research.

Paperid: 2794, https://arxiv.org/pdf/2503.22967.pdf

Abstract:
Starting in February 2024, the HKUST Library further extended the scope of AI literacy to AI utilization, which focuses on fostering student involvement in utilizing state-of-the-art technologies in projects initiated by the Library under a scheme named "Digital Scholarship (DS) CoLab". A key focus of the DS CoLab scheme has been cultivating talent and enabling students to utilize advanced technologies in practical contexts. It aims to reinforce the library's role as a catalyst and hub for fostering multidisciplinary collaboration and to cultivate a "can do spirit" among university members. The Library offers 1-2 projects per year for students to engage with advanced technologies in practical contexts while supporting the Library in tackling challenges and streamlining operational tasks. The tool introduced in this paper was mainly developed by two of the authors, Sherry Yip Sau Lai and Berry Han Liuruo, as part-time student helpers under one of our DS CoLab projects in the 2024 Spring Semester (February to May 2024). This paper details the complete journey from ideation to implementation of developing a Chinese Named-Entity Recognition (NER) Tool from the ground up within one semester, from the initial research and planning stages to execution and delivery of a viable product. The collaborative spirit fostered by this project, with students playing a central role, exemplifies the power and potential of innovative educational models that prioritize hands-on learning with student involvement.

Paperid: 2795, https://arxiv.org/pdf/2503.22510.pdf

Abstract:
Emotion recognition technology has been studied for the past decade. Given its growing importance and its applications in areas such as customer service, medicine, and education, this study explores its potential and importance in the field of user experience (UX) evaluation. Recognizing and keeping track of user emotions in user research videos is important for understanding user needs and expectations of a service or product. Little research in the field of UX has focused on automating emotion extraction from video using more than one modality. The study implements several modalities, namely facial emotion recognition, speech-to-text, and text-based emotion recognition, to capture emotional nuances from a user research video and extract meaningful, actionable insights. To select a facial emotion recognition model, 10 pre-trained models were evaluated on three benchmark datasets (FER-2013, AffectNet, and CK+), and the model with the best generalization ability was chosen. OpenAI's Whisper model was used to extract speech and convert it to text, and emotions in the text were then recognized using a pre-trained model available on the HuggingFace website with an evaluation accuracy of more than 95%. The study also integrates the gathered data using temporal alignment and fusion for deeper, contextual insights. The study further demonstrates a way of automating data analysis through the PandasAI Python library, using OpenAI's GPT-4o model, along with a discussion of other possible solutions. This study is an attempt to demonstrate a proof of concept in which meaningful insights are automatically extracted from a video based on user emotions.
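
The abstract does not say how the temporal alignment and fusion step is implemented. One simple way to picture it, shown purely as an assumption, is to attach to each timestamped facial-emotion sample the text-emotion label of the speech segment that covers that timestamp. The data, labels, and the `fuse` helper below are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical outputs of two modalities; all times are in seconds.
# Facial emotion samples: (timestamp, label).
face_samples = [(0.5, "neutral"), (1.5, "happy"), (2.5, "happy"), (4.0, "surprise")]
# Text emotion per transcribed speech segment: (start, end, label).
# Segments are assumed to be sorted by start time and non-overlapping.
speech_segments = [(0.0, 2.0, "joy"), (2.0, 5.0, "confusion")]


def fuse(face_samples, speech_segments):
    """Attach to each face sample the label of the segment covering its time."""
    starts = [segment[0] for segment in speech_segments]
    fused = []
    for timestamp, face_label in face_samples:
        # Index of the last segment that starts at or before this timestamp.
        i = bisect_right(starts, timestamp) - 1
        text_label = None
        if i >= 0 and timestamp < speech_segments[i][1]:
            text_label = speech_segments[i][2]
        fused.append({"time": timestamp, "face": face_label, "text": text_label})
    return fused


timeline = fuse(face_samples, speech_segments)
```

Each entry of `timeline` then carries both modalities for one moment in the video, which is the kind of combined record a downstream analysis step could query.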

Paperid: 2796, https://arxiv.org/pdf/2503.22345.pdf

Abstract:
We present a work in progress that explores using a Large Language Model (LLM) as a design material for an interactive museum installation. LLMs offer the possibility of creating chatbots that can facilitate dynamic and human-like conversation, engaging in a form of role play to bring historical persons to life for visitors. However, LLMs are prone to producing misinformation, which runs counter to museums' core mission to educate the public. We use Research-through-Design to explore some approaches to navigating this dilemma through rapid prototyping and evaluation and propose some directions for further research. We suggest that designers may shape interactions with the chatbot to emphasize personal narratives and role play rather than historical facts or to intentionally highlight the unreliability of the chatbot outputs to provoke critical reflection.

Paperid: 2797, https://arxiv.org/pdf/2503.22283.pdf

Abstract:
In recent years, large language models (LLMs) have demonstrated exponential improvements that promise transformative opportunities across various industries. Their ability to generate human-like text and ensure continuous availability facilitates the creation of interactive service chatbots aimed at enhancing customer experience and streamlining enterprise operations. Despite their potential, LLMs face critical challenges, such as a susceptibility to hallucinations and difficulties handling complex linguistic scenarios, notably code switching and dialectal variations. To address these challenges, this paper describes the design of a multilingual chatbot for Bengali-English customer service interactions utilizing retrieval-augmented generation (RAG) and targeted prompt engineering. This research provides valuable insights for the human-computer interaction (HCI) community, emphasizing the importance of designing systems that accommodate linguistic diversity to benefit both customers and businesses. By addressing the intersection of generative AI and cultural heterogeneity, this late-breaking work inspires future innovations in multilingual and multicultural HCI.

Paperid: 2798, https://arxiv.org/pdf/2503.22113.pdf

Abstract:
Understanding how AI recommendations work can help the younger generation become more informed and critical consumers of the vast amount of information they encounter daily. However, young learners with limited math and computing knowledge often find AI concepts too abstract. To address this, we developed Briteller, a light-based recommendation system that makes learning tangible. By exploring and manipulating light beams, Briteller enables children to understand an AI recommender system's core algorithmic building block, the dot product, through hands-on interactions. Initial evaluations with ten middle school students demonstrated the effectiveness of this approach, using embodied metaphors, such as "merging light" to represent addition. To overcome the limitations of the physical optical setup, we further explored how AR could embody multiplication, expand data vectors with more attributes, and enhance contextual understanding. Our findings provide valuable insights for designing embodied and tangible learning experiences that make AI concepts more accessible to young learners.
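
The dot product mentioned above can be illustrated with a very small recommender: multiply matching attributes of a user vector and an item vector, add the products, and recommend the item with the highest score. The attributes, items, and numbers below are invented for illustration and are not taken from Briteller.

```python
def dot(u, v):
    """Multiply matching attributes, then add the products together."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


# Hypothetical attribute order: [animals, music, sports].
user = [3, 1, 0]  # how much this user likes each attribute
items = {
    "pet video": [1, 0, 0],
    "concert clip": [0, 1, 0],
    "soccer highlights": [0, 0, 1],
}

# Score every item against the user and pick the best match.
scores = {name: dot(user, vector) for name, vector in items.items()}
best = max(scores, key=scores.get)
```

Here the user's strong interest in animals gives "pet video" the highest score, which mirrors the multiply-then-add idea that the light-based setup makes tangible.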

Paperid: 2799, https://arxiv.org/pdf/2503.21897.pdf

Abstract:
Spatial reasoning and collaboration are essential for childhood development, yet blind and visually impaired (BVI) children often lack access to tools that foster these skills. Tactile maps and assistive technologies primarily focus on individual navigation, overlooking the need for playful, inclusive, and collaborative interactions. We address this with StreetScape, a tactile street puzzle that enhances spatial skills and interdependence between BVI and sighted children. Featuring modular 3D-printed tiles, tactile roadways, and customizable decorative elements, StreetScape allows users to construct and explore cityscapes through gamified tactile interaction. Developed through an iterative design process, it integrates dynamic assembly and tactile markers for intuitive navigation, promoting spatial learning and fostering meaningful social connections. This work advances accessible design by demonstrating how tactile tools can effectively bridge educational and social gaps through collaborative play, redefining assistive technologies for children as a scalable platform that merges learning, creativity, and inclusivity.

Paperid: 2800, https://arxiv.org/pdf/2503.21615.pdf

Abstract:
Successful agent-human partnerships require that any agent-generated information is understandable to the human, and that the human can easily steer the agent towards a goal. Such effective communication requires the agent to develop a finer-level notion of what is understandable to the human. State-of-the-art agents, including LLMs, lack this detailed notion of understandability because they only capture average human sensibilities from the training data, and therefore afford limited steerability (e.g., requiring non-trivial prompt engineering). In this paper, instead of only relying on data, we argue for developing generalizable, domain-agnostic measures of understandability that can be used as directives for these agents. Existing research on understandability measures is fragmented; we survey various such efforts across domains and lay a cognitive-science-rooted groundwork for more coherent and domain-agnostic research investigations in the future.

Paperid: 2801, https://arxiv.org/pdf/2503.21189.pdf

Abstract:
The global rise of K-pop and the digital revolution have paved the way for new dimensions in artist recommendations. With platforms like Twitter serving as a hub for fans to interact, share and discuss K-pop, a vast amount of data is generated that can be analyzed to understand listener preferences. However, current recommendation systems often overlook K-pop's inherent diversity, treating it as a singular entity. This paper presents an innovative method that utilizes Natural Language Processing to analyze tweet content and discern individual listening habits and preferences. The mass of Twitter data is methodically categorized using fan clusters, facilitating granular and personalized artist recommendations. Our approach marries the advanced GPT-4 model with large-scale social media data, offering potential enhancements in accuracy for K-pop recommendation systems and promising an elevated, personalized fan experience. In conclusion, acknowledging the heterogeneity within fanbases and capitalizing on readily available social media data marks a significant stride towards advancing personalized music recommendation systems.

Paperid: 2802, https://arxiv.org/pdf/2503.20888.pdf

Abstract:
Safety while walking alone at night is a key indicator of a citizen's well-being and a society's inclusiveness. However, this is not equally felt across all demographic groups, especially for university students living in urban areas. We present Coolight, a mobile application designed to reduce stress and anxiety for nighttime walking through an interactive live map, real-time community incident reports, location sharing, and a route planner optimized for user safety. Coolight's design was informed through interviews, questionnaires, and usability tests with university students and their friends and families in Toronto, Canada. This paper describes the concept, research, design approach, and evaluation results of a solution addressing safety concerns urban commuters face at night.

Paperid: 2803, https://arxiv.org/pdf/2503.20596.pdf

Abstract:
The introduction of automated vehicles has redefined the level of interaction between the driver and the vehicle, introducing new tasks and thus imposing different workloads. Existing tools such as NASA-TLX and DALI are still used to assess driving workload in automated vehicles, despite not accounting for these new tasks. This study introduces AV-TLX, a specialized tool for measuring workload in Level 3 automated driving. The development process began with a narrative literature review to identify the primary factors influencing workload. This was followed by a series of qualitative sessions during which the dimensions and later the questions of the questionnaire were designed. The tool's validity was first assessed using CVR and CVI indices, and its reliability and convergent validity were evaluated using a high-fidelity dynamic driving simulator. The final version of AV-TLX comprises 19 questions across 8 subscales, demonstrating excellent reliability (0.86) and validity (CVR > 0.78). The agreement score between the results of AV-TLX and NASA-TLX in the simulation study was 0.6, which is considered acceptable for the consistency of two questionnaires. Furthermore, the questionnaire can be used in two ways: by reporting the overall workload and/or the workload on each of the 8 primary subscales, or by categorizing the questions into two groups, takeover task workload and automated driving task workload. The final version of this questionnaire, as presented in the paper, is available for use in future studies focusing on Level 3 automated driving.
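
The abstract does not define its CVR index. Assuming it refers to Lawshe's content validity ratio, which is the usual meaning in questionnaire development, the value for one item is computed from how many panel experts rate that item as essential. The panel size and counts below are hypothetical.

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR = (n_e - N/2) / (N/2).

    n_essential: number of experts rating the item "essential" (n_e)
    n_experts:   total number of experts on the panel (N)
    The result ranges from -1 (no expert agrees) to +1 (all agree).
    """
    if n_experts <= 0 or not 0 <= n_essential <= n_experts:
        raise ValueError("need 0 <= n_essential <= n_experts and n_experts > 0")
    half = n_experts / 2
    return (n_essential - half) / half


# Hypothetical panel: 9 of 10 experts rate an item as essential.
cvr = content_validity_ratio(9, 10)  # (9 - 5) / 5 = 0.8
```

Under this reading, a threshold such as CVR > 0.78 means that nearly all experts on the panel judged each retained item to be essential.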

Paperid: 2804, https://arxiv.org/pdf/2503.20406.pdf

Abstract:
This position paper addresses the fallacies associated with the improper use of affordances in the opportunistic design of augmented reality (AR) applications. While opportunistic design leverages existing physical affordances for content placement and for creating tangible feedback in AR environments, their misuse can lead to confusion, errors, and poor user experiences. The paper emphasizes the importance of perceptible affordances and properly mapping virtual controls to appropriate physical features in AR applications by critically reflecting on four fallacies of facilitating affordances, namely, the subjectiveness of affordances, affordance imposition and reappropriation, properties and dynamicity of environments, and mimicking the real world. By highlighting these potential pitfalls and proposing a possible path forward, we aim to raise awareness and encourage more deliberate and thoughtful use of affordances in the design of AR applications.

Paperid: 2805, https://arxiv.org/pdf/2503.20391.pdf

Abstract:
The emergence of Generative AI features in news applications may radically change news consumption and challenge journalistic practices. To explore the future potentials and risks of this understudied area, we created six design fictions depicting scenarios such as virtual companions delivering news summaries to the user, AI providing context to news topics, and content being transformed into other formats on demand. The fictions, discussed with a multi-disciplinary group of experts, enabled a critical examination of the diverse ethical, societal, and journalistic implications of AI shaping this everyday activity. The discussions raised several concerns, suggesting that such consumer-oriented AI applications can clash with journalistic values and processes. These include fears that neither consumers nor AI could successfully balance engagement, objectivity, and truth, leading to growing detachment from shared understanding. We offer critical insights into the potential long-term effects to guide design efforts in this emerging application area of GenAI.

Paperid: 2806, https://arxiv.org/pdf/2503.20331.pdf

Abstract:
Detecting whether a target crosses a given zone (e.g., a door) can enable various practical applications in smart homes, including intelligent security and people counting. The traditional infrared-based approach only covers a line and can be easily cracked. In contrast, reusing the ubiquitous WiFi devices deployed in homes has the potential to cover a larger area of interest as WiFi signals are scattered throughout the entire space. By detecting the walking direction (i.e., approaching and moving away) with WiFi signal strength change, existing work can identify the behavior of crossing between a WiFi transceiver pair. However, this method mistakenly classifies the turn-back behavior as crossing behavior, resulting in a high false alarm rate. In this paper, we propose WiCross, which can accurately distinguish the turn-back behavior with the phase statistics pattern of WiFi signals and thus robustly identify whether the target crosses the area between the WiFi transceiver pair. We implement WiCross with commercial WiFi devices, and extensive experiments demonstrate that WiCross can achieve an accuracy higher than 95% with a false alarm rate of less than 5%.

Paperid: 2807, https://arxiv.org/pdf/2503.19334.pdf

Abstract:
In this paper, we present the design of a multimodal interaction framework for intelligent virtual agents in wearable mixed reality environments, especially for interactive applications at museums, botanical gardens, and similar places. These places need engaging and non-repetitive digital content delivery to maximize user involvement. An intelligent virtual agent is a promising mode for both purposes. The premise of the framework is wearable mixed reality provided by MR devices supporting spatial mapping. We envisioned a seamless interaction framework by integrating potential features of spatial mapping, virtual character animations, speech recognition, gazing, a domain-specific chatbot, and object recognition to enhance virtual experiences and communication between users and virtual agents. By applying a modular approach and deploying computationally intensive modules on a cloud platform, we achieved a seamless virtual experience on a device with limited resources. Human-like gaze and speech interaction with a virtual agent made it more interactive. Automated mapping of body animations with the content of a speech made it more engaging. In our tests, the virtual agents responded within 2-4 seconds after the user query. The strength of the framework is flexibility and adaptability. It can be adapted to any wearable MR device supporting spatial mapping.

Paperid: 2808, https://arxiv.org/pdf/2503.18796.pdf

Abstract:
Affective reactions have deep biological foundations; however, in humans the development of emotion concepts is also shaped by language and higher-order cognition. A recent breakthrough in AI has been the creation of multimodal language models that exhibit impressive intellectual capabilities, but their responses to affective stimuli have not been investigated. Here we study whether state-of-the-art multimodal systems can emulate human emotional ratings on a standardized set of images, in terms of affective dimensions and basic discrete emotions. The AI judgements correlate surprisingly well with the average human ratings: given that these systems were not explicitly trained to match human affective reactions, this suggests that the ability to visually judge emotional content can emerge from statistical learning over large-scale databases of images paired with linguistic descriptions. Besides showing that language can support the development of rich emotion concepts in AI, these findings have broad implications for sensitive use of multimodal AI technology.

Paperid: 2809, https://arxiv.org/pdf/2503.18619.pdf

Abstract:
We investigated gaze direction during movement observation. The eye movement data were collected during an experiment in which different models of movement production (based on movement primitives, MPs) were compared in a two-alternative forced choice (2AFC) task. Participants observed side-by-side presentations of two naturalistic 3D-rendered human movement videos, where one video was based on a motion-captured gait sequence and the other was generated by recombining the machine-learned MPs to approximate the same movement. The task was to discriminate between these movements while the participants' eye movements were recorded. We complement previous binary decision data analyses with eye tracking data. Here, we investigate the role of gaze direction during task execution. We computed the shared information between gaze features and decisions of the participants, and between gaze features and correct answers. We found that eye movements reflect the decision of participants during the 2AFC task, but not the correct answer. This result is important for future experiments, which should take advantage of eye tracking to complement binary decision data.
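
The abstract does not say how "shared information" was estimated. One common reading is mutual information, and the sketch below assumes that reading for discrete variables. The gaze feature, the labels, and the four trials are hypothetical toy values chosen only to show the contrast the abstract describes.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical toy trials: the side fixated longer, the response, the truth.
gaze = ["left", "left", "right", "right"]
decision = ["left", "left", "right", "right"]  # fully determined by gaze
correct = ["left", "right", "left", "right"]   # statistically independent of gaze

mi_decision = mutual_information(gaze, decision)
mi_correct = mutual_information(gaze, correct)
```

In this toy case, gaze carries one bit about the decision and zero bits about the correct answer, mirroring the qualitative pattern reported above. Real gaze features are continuous and noisy, so an actual analysis would need binning or a dedicated estimator.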

Paperid: 2810, https://arxiv.org/pdf/2503.18550.pdf

Abstract:
As Virtual Reality (VR) expands into fields like healthcare and education, ensuring secure and user-friendly authentication becomes essential. Traditional password entry methods in VR are cumbersome and insecure, making password managers (PMs) a potential solution. To explore this field, we conducted a user study (n=126 VR users) where participants expressed a strong preference for simpler passwords and showed interest in biometric authentication and password managers. On these grounds, we provide the first in-depth evaluation of PMs in VR. We report findings from 91 cognitive walkthroughs, revealing that while PMs improve usability, they are not yet ready for prime time. Key features like cross-app autofill are missing, and user experiences highlight the need for better solutions. Based on consolidated user views and expert analysis, we make recommendations on how to move forward in improving VR authentication systems, ultimately creating more practical solutions for this growing field.

Paperid: 2811, https://arxiv.org/pdf/2503.18419.pdf

Abstract:
Our study of 20 knowledge workers revealed a common challenge: the difficulty of synthesizing unstructured information scattered across multiple platforms to make informed decisions. Drawing on their vision of an ideal knowledge synthesis tool, we developed Yodeai, an AI-enabled system, to explore both the opportunities and limitations of AI in knowledge work. Through a user study with 16 product managers, we identified three key requirements for Generative AI in knowledge work: adaptable user control, transparent collaboration mechanisms, and the ability to integrate background knowledge with external information. However, we also found significant limitations, including overreliance on AI, user isolation, and contextual factors outside the AI's reach. As AI tools become increasingly prevalent in professional settings, we propose design principles that emphasize adaptability to diverse workflows, accountability in personal and collaborative contexts, and context-aware interoperability to guide the development of human-centered AI systems for product managers and knowledge workers.

Paperid: 2812, https://arxiv.org/pdf/2503.17993.pdf

Abstract:
Modern driving involves interactive technologies that can divert attention, increasing the risk of accidents. This paper presents a computational cognitive model that simulates human multitasking while driving. Based on optimal supervisory control theory, the model predicts how multitasking adapts to variations in driving demands, interactive tasks, and automation levels. Unlike previous models, it accounts for context-dependent multitasking across different degrees of driving automation. The model predicts longer in-car glances on straight roads and shorter glances during curves. It also anticipates increased glance durations with driver aids such as lane-centering assistance and their interaction with environmental demands. Validated against two empirical datasets, the model offers insights into driver multitasking amid evolving in-car technologies and automation.

Paperid: 2813, https://arxiv.org/pdf/2503.17971.pdf

Abstract:
The growing adoption of extended reality, XR, has driven demand for wearable technologies that can replicate natural tactile sensations and allow users to interact freely with their surroundings using bare fingers. However, most existing wearable haptic technologies that support such free interactions can deliver sensations across limited tactile modalities. Here, we introduce a soft haptic ring and a data-driven rendering methodology to generate multimodal texture sensations. The device integrates pneumatic and hydraulic actuation to simulate roughness, thermal, and softness cues on the proximal phalanx, enabling users to explore surroundings naturally with their fingertips. The rendering methodology dynamically modulates those cues based on the user's exploratory actions. We validated our approach by conducting a user study with fifteen participants, who matched six virtual textures generated by the ring to their real counterparts and rated their perceived sensations. Participants achieved up to ninety percent accuracy in texture matching. The adjective ratings confirmed that the ring delivers distinct, perceptually rich stimuli across all rendered sensations. These findings highlight the ring's potential for immersive XR applications, offering diverse tactile feedback without restricting physical interaction.

Paperid: 2814, https://arxiv.org/pdf/2503.17955.pdf

Abstract:
Human-AI Interaction (HAI) guidelines and design principles have become increasingly important in both industry and academia to guide the development of AI systems that align with user needs and expectations. However, large-scale empirical evidence on how HAI principles shape user satisfaction in practice remains limited. This study addresses that gap by analyzing over 100,000 user reviews of AI-related products from G2, a leading review platform for business software and services. Based on widely adopted industry guidelines, we identify seven core HAI dimensions and examine their coverage and sentiment within the reviews. We find that the sentiment on four HAI dimensions (adaptability, customization, error recovery, and security) is positively associated with overall user satisfaction. Moreover, we show that engagement with HAI dimensions varies by professional background: Users with technical job roles are more likely to discuss system-focused aspects, such as reliability, while non-technical users emphasize interaction-focused features like customization and feedback. Interestingly, the relationship between HAI sentiment and overall satisfaction is not moderated by job role, suggesting that once an HAI dimension has been identified by users, its effect on satisfaction is consistent across job roles.

Paperid: 2815, https://arxiv.org/pdf/2503.17625.pdf

Abstract:
Well-being is a dynamic construct that evolves over time and fluctuates within individuals, presenting challenges for accurate quantification. Reduced well-being is often linked to depression or anxiety disorders, which are characterised by biases in visual attention towards specific stimuli, such as human faces. This paper introduces a novel approach to AI-assisted screening of affective disorders by analysing visual attention scan paths using convolutional neural networks (CNNs). Data were collected from two studies examining (1) attentional tendencies in individuals diagnosed with major depression and (2) social anxiety. These data were processed using residual CNNs through images generated from eye-gaze patterns. Experimental results, obtained with ResNet architectures, demonstrated an average accuracy of 48% for a three-class system and 62% for a two-class system. Based on these exploratory findings, we propose that this method could be employed in rapid, ecological, and effective mental health screening systems to assess well-being through eye-tracking.

Paperid: 2816, https://arxiv.org/pdf/2503.17374.pdf

Abstract:
In this paper we introduce a Platform created in order to support SMEs' endeavor to extract value from their intangible assets effectively. To implement the Platform, we developed five knowledge bases using a knowledge-based expert system shell that contain knowledge from intangible asset consultants, patent attorneys and due diligence lawyers. In order to operationalize the knowledge bases, we developed a "Rosetta Stone", an interpreter unit for the knowledge bases outside the shell and embedded in the platform. Building on the initial knowledge bases we have created a system of red flags, risk scoring, and valuation with the involvement of the same experts; these additional systems work upon the initial knowledge bases and therefore they can be regarded as meta-knowledge-representations that take the form of second-order knowledge graphs. All this clever technology is dressed up in an easy-to-handle graphical user interface that we will showcase at the conference. The initial platform was finished mid-2024; therefore, it qualifies as an "emerging application of AI" and "deployable AI", while development continues. The two firms that provided experts for developing the knowledge bases obtained a white-label version of the product (i.e. it runs under their own brand "powered by Intanify"), and there are two completed cases.

Paperid: 2817, https://arxiv.org/pdf/2503.17328.pdf

Abstract:
Impulsive individuals exhibit abnormal reward processing (heightened preference for immediate rewards, i.e., impulsive choice, IC) and a penchant for maladaptive action (the inability to inhibit inappropriate actions, i.e., impulsive action, IA). Both impulsive choice and impulsive action are strongly influenced by emotions (emotional impulsivity); yet how emotions impact impulse behavior remains unclear. The traditional theory suggests that emotions primarily exacerbate impulsive action and prompt impulsive choice. The alternative theory states that emotions primarily disrupt attention (attentional impulsivity, AImp) and prompt impulsive choice. In two studies, we probed the interplay among emotions, impulsive action (IA), attentional impulsivity (AImp), and impulsive choice (IC). We elicited positive and negative emotions using emotional pictures and examined the extent to which elicited emotions altered behavioral indices of impulsivity.

Paperid: 2818, https://arxiv.org/pdf/2503.17302.pdf

Abstract:
As software systems grow increasingly complex, ensuring security during development poses significant challenges. Traditional manual code audits are often expensive, time-intensive, and ill-suited for fast-paced workflows, while automated tools frequently suffer from high false-positive rates, limiting their reliability. To address these issues, we introduce Bugdar, an AI-augmented code review system that integrates seamlessly into GitHub pull requests, providing near real-time, context-aware vulnerability analysis. Bugdar leverages fine-tunable Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to deliver project-specific, actionable feedback that aligns with each codebase's unique requirements and developer practices. Supporting multiple programming languages, including Solidity, Move, Rust, and Python, Bugdar demonstrates exceptional efficiency, taking an average of 56.4 seconds per pull request, or processing 30 lines of code per second. This is significantly faster than manual reviews, which could take hours per pull request. By facilitating a proactive approach to secure coding, Bugdar reduces the reliance on manual reviews, accelerates development cycles, and enhances the security posture of software systems without compromising productivity.
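
The two throughput figures can be related with simple arithmetic. The sketch below assumes both numbers describe the same workload, which the abstract does not state explicitly; the helper function and the 600-line example are hypothetical.

```python
SECONDS_PER_PR = 56.4   # average time per pull request, from the abstract
LINES_PER_SECOND = 30   # processing rate, from the abstract

# If both figures describe the same workload, they imply this average size.
implied_lines_per_pr = SECONDS_PER_PR * LINES_PER_SECOND


def estimated_review_seconds(lines_of_code, lines_per_second=LINES_PER_SECOND):
    """Estimate review time for a change of the given size at a fixed rate."""
    return lines_of_code / lines_per_second


# Hypothetical example: a 600-line pull request.
example_seconds = estimated_review_seconds(600)
```

Under that assumption, the average pull request in the evaluation would be roughly 1,692 lines, and a 600-line change would take about 20 seconds at the stated rate.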

Paperid: 2819, https://arxiv.org/pdf/2503.17212.pdf

Abstract:
News outlets' competition for attention in news interfaces has highlighted the need for demographically-aware saliency prediction models. Despite recent advancements in saliency detection applied to user interfaces (UI), existing datasets are limited in size and demographic representation. We present a deep learning framework that enhances the SaRa (Saliency Ranking) model with DeepGaze IIE, improving Salient Object Ranking (SOR) performance by 10.7%. Our framework optimizes three key components: saliency map generation, grid segment scoring, and map normalization. Through a two-fold experiment using eye-tracking (30 participants) and mouse-tracking (375 participants aged 13–70), we analyze attention patterns across demographic groups. Statistical analysis reveals significant age-based variations (p < 0.05, ε² = 0.042), with older users (36–70) engaging more with textual content and younger users (13–35) interacting more with images. Mouse-tracking data closely approximates eye-tracking behavior (sAUC = 0.86) and identifies UI elements that immediately stand out, validating its use in large-scale studies. We conclude that saliency studies should prioritize gathering data from a larger, demographically representative sample and report exact demographic distributions.

Paperid: 2820, https://arxiv.org/pdf/2503.17158.pdf

Abstract:
This study investigates cross-cultural differences in the perception of AI-driven chatbots between Germany and South Korea, focusing on topic dependency and explainability. Using a custom AI chat interface, ExplainitAI, we systematically examined these factors with quota-based samples from both countries (N = 297). Our findings revealed significant cultural distinctions: Korean participants exhibited higher trust, more positive user experience ratings, and more favorable perception of AI compared to German participants. Additionally, topic dependency was a key factor, with participants reporting lower trust in AI when addressing societally debated topics (e.g., migration) versus health or entertainment topics. These perceptions were further influenced by interactions among cultural context, content domains, and explainability conditions. The result highlights the importance of integrating cultural and contextual nuances into the design of AI systems, offering actionable insights for the development of culturally adaptive and explainable AI tailored to diverse user needs and expectations across domains.

Paperid: 2821, https://arxiv.org/pdf/2503.16992.pdf

Abstract:
In a 'digital by default' society, essential services must be accessed online. This opens users to digital deception not only from criminal fraudsters but from a range of actors in a marketised digital economy. Using grounded empirical research from northern England, we show how supposedly 'trusted' actors, such as governments, (re)produce the insecurities and harms that they seek to prevent. Enhanced by a weakening of social institutions amid a drive for efficiency and scale, this has built a constricted, unpredictable digital channel. We conceptualise this as a "snipers' alley". Four key snipers articulated by participants' lived experiences are examined: 1) Governments; 2) Business; 3) Criminal Fraudsters; and 4) Friends and Family to explore how snipers are differentially experienced and transfigure through this constricted digital channel. We discuss strategies to re-configure the alley, and how crafting and adopting opportunity models can enable more equitable forms of security for all.

Paperid: 2822, https://arxiv.org/pdf/2503.16932.pdf

Abstract:
Humans and robots are increasingly working in personal and professional settings. In workplace settings, humans and robots may work together as colleagues, potentially leading to social expectations, or violation thereof. Extant research has primarily sought to understand social interactions and expectations in personal rather than professional settings, and none of these studies have examined negative outcomes arising from violations of social expectations. This paper reports the results of a 2x3 online experiment that used a unique first-person perspective video to immerse participants in a collaborative workplace setting. The results are nuanced and reveal that while robots are expected to act in accordance with social expectations despite human behavior, there are benefits for robots perceived as being the bigger person in the face of human rudeness. Theoretical and practical implications are provided which discuss the import of these findings for the design of social robots.

Paperid: 2823, https://arxiv.org/pdf/2503.16679.pdf

Abstract:
Large Language Models (LLMs) have emerged as powerful tools for generating human-like text, transforming human-machine interactions. However, their widespread adoption has raised concerns about their potential to influence public opinion and shape political narratives. In this work, we investigate the geopolitical biases in US and Chinese LLMs, focusing on how these models respond to questions related to geopolitics and international relations. We collected responses from ChatGPT and DeepSeek to a set of geopolitical questions and evaluated their outputs through both qualitative and quantitative analyses. Our findings show notable biases in both models, reflecting distinct ideological perspectives and cultural influences. However, despite these biases, for a set of questions, the models' responses are more aligned than expected, indicating that they can address sensitive topics without necessarily presenting directly opposing viewpoints. This study highlights the potential of LLMs to shape public discourse and underscores the importance of critically assessing AI-generated content, particularly in politically sensitive contexts.

Paperid: 2824, https://arxiv.org/pdf/2503.16644.pdf

Abstract:
Missing data is prevalent in tabular machine learning (ML) models, and different missing data treatment methods can significantly affect ML model training results. However, little is known about how ML researchers and engineers choose missing data treatment methods and what factors affect their choices. To this end, we conducted a survey of 70 ML researchers and engineers. Our results revealed that most participants were not making informed decisions regarding missing data treatment, which could significantly affect the validity of the ML models trained by these researchers. We advocate for better education on missing data, more standardized missing data reporting, and better missing data analysis tools.

Paperid: 2825, https://arxiv.org/pdf/2503.16584.pdf

Abstract:
We introduce a novel multimodal emotion recognition dataset that enhances the precision of the Valence-Arousal Model while accounting for individual differences. This dataset includes electroencephalography (EEG), electrocardiography (ECG), and pulse interval (PI) from 64 participants. Data collection employed two emotion induction paradigms: video stimuli that targeted different valence levels (positive, neutral, and negative) and the Mannheim Multicomponent Stress Test (MMST), which induced high arousal through cognitive, emotional, and social stressors. To enrich the dataset, participants' personality traits, anxiety, depression, and emotional states were assessed using validated questionnaires. By capturing a broad spectrum of affective responses while accounting for individual differences, this dataset provides a robust resource for precise emotion modeling. The integration of multimodal physiological data with psychological assessments lays a strong foundation for personalized emotion recognition. We anticipate this resource will support the development of more accurate, adaptive, and individualized emotion recognition systems across diverse applications.

Paperid: 2826, https://arxiv.org/pdf/2503.16519.pdf

Abstract:
Recent advancements in virtual reality (VR) technology have enabled the creation of immersive learning environments that provide engineering students with hands-on, interactive experiences. This paper presents a novel framework for virtual laboratory environments (VLEs) focused on embodied learning, specifically designed to teach concepts related to mechanical and materials engineering. Utilizing the principles of embodiment and congruency, these VR modules offer students the opportunity to engage physically with virtual specimens and machinery, thereby enhancing their understanding of complex topics through sensory immersion and kinesthetic interaction. Our framework employs an event-driven, directed-graph-based architecture developed with Unity 3D and C#, ensuring modularity and scalability. Students interact with the VR environment by performing tasks such as selecting and testing materials, which trigger various visual and haptic events to simulate real-world laboratory conditions. A pre-/post-test evaluation method was used to assess the educational effectiveness of these VR modules. Results demonstrated significant improvements in student comprehension and retention, with notable increases in test scores compared to traditional non-embodied VR methods. The implementation of these VLEs in a university setting highlighted their potential to democratize access to high-cost laboratory experiences, making engineering education more accessible and effective. By fostering a deeper connection between cognitive processes and physical actions, our VR framework not only enhances learning outcomes but also provides a template for future developments in VR-based education. Our study suggests that immersive VR environments can significantly improve the learning experience for engineering students.

Paperid: 2827, https://arxiv.org/pdf/2503.16517.pdf

Abstract:
This research addresses the growing need to measure and understand AI literacy in the context of generative AI technologies. Through three sequential studies involving a total of 517 participants, we establish AI literacy as a coherent, measurable construct with significant implications for education, workforce development, and social equity. Study 1 (N=85) revealed a dominant latent factor - termed the "A-factor" - that accounts for 44.16% of variance across diverse AI interaction tasks. Study 2 (N=286) refined the measurement tool by examining four key dimensions of AI literacy: communication effectiveness, creative idea generation, content evaluation, and step-by-step collaboration, resulting in an 18-item assessment battery. Study 3 (N=146) validated this instrument in a controlled laboratory setting, demonstrating its predictive validity for real-world task performance. Results indicate that AI literacy significantly predicts performance on complex, language-based creative tasks but shows domain specificity in its predictive power. Additionally, regression analyses identified several significant predictors of AI literacy, including cognitive abilities (IQ), educational background, prior AI experience, and training history. The multidimensional nature of AI literacy and its distinct factor structure provide evidence that effective human-AI collaboration requires a combination of general and specialized abilities. These findings contribute to theoretical frameworks of human-AI collaboration while offering practical guidance for developing targeted educational interventions to promote equitable access to the benefits of generative AI technologies.

Paperid: 2828, https://arxiv.org/pdf/2503.16508.pdf

Abstract:
Conversational AI interfaces powered by large language models (LLMs) are increasingly used as coding assistants. However, questions remain about how programmers interact with LLM-based conversational agents, the challenges they encounter, and the factors influencing adoption. This study investigates programmers' usage patterns, perceptions, and interaction strategies when engaging with LLM-driven coding assistants. Through a survey, participants reported both the benefits, such as efficiency and clarity of explanations, and the limitations, including inaccuracies, lack of contextual awareness, and concerns about over-reliance. Notably, some programmers actively avoid LLMs due to a preference for independent learning, distrust in AI-generated code, and ethical considerations. Based on our findings, we propose design guidelines for improving conversational coding assistants, emphasizing context retention, transparency, multimodal support, and adaptability to user preferences. These insights contribute to the broader understanding of how LLM-based conversational agents can be effectively integrated into software development workflows while addressing adoption barriers and enhancing usability.

Paperid: 2829, https://arxiv.org/pdf/2503.16499.pdf

Abstract:
This paper presents an iterative, participatory, empirical study that examines the potential of using artificial intelligence, such as social robots and large language models, to support mediation and advocacy for students with disabilities in higher education. Drawing on qualitative data from interviews and focus groups conducted with various stakeholders, including disabled students, disabled student representatives, and disability practitioners at the University of Cambridge, this study reports findings relating to understanding the problem space, ideating robotic support and participatory co-design of advocacy support robots. The findings highlight the potential of these technologies in providing signposting and acting as a sounding board or study companion, while also addressing limitations in empathic understanding, trust, equity, and accessibility. We discuss ethical considerations, including intersectional biases, the double empathy problem, and the implications of deploying social robots in contexts shaped by structural inequalities. Finally, we offer a set of recommendations and suggestions for future research, rethinking the notion of corrective technological interventions to tools that empower and amplify self-advocacy.

Paperid: 2830, https://arxiv.org/pdf/2503.16488.pdf

Abstract:
With an increasing demand for assistive technologies that promote the independence and mobility of visually impaired people, this study suggests an innovative real-time system that gives audio descriptions of a user's surroundings to improve situational awareness. The system acquires live video input and processes it with a quantized and fine-tuned Florence-2 large model, adjusted to 4-bit precision for efficient operation on low-power edge devices such as the NVIDIA Jetson Orin Nano. By transforming the video signal into frames with a 5-frame latency, the model provides rapid and contextually pertinent descriptions of objects, pedestrians, and barriers, together with their estimated distances. The system employs Parler TTS Mini, a lightweight and adaptable Text-to-Speech (TTS) solution, for efficient audio feedback. It accommodates 34 distinct speaker types and enables customization of speech tone, pace, and style to suit user requirements. This study examines the quantization and fine-tuning techniques utilized to modify the Florence-2 model for this application, illustrating how the integration of a compact model architecture with a versatile TTS component improves real-time performance and user experience. The proposed system is assessed based on its accuracy, efficiency, and usefulness, providing a viable option to aid vision-impaired users in navigating their surroundings securely and successfully.
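
To make the idea of 4-bit weights concrete, the sketch below shows generic symmetric uniform quantization to signed 4-bit integers. It is an illustration only: the abstract does not specify which quantization scheme was applied to Florence-2, and the function names and example weights here are hypothetical.

```python
def quantize_4bit(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from 4-bit codes."""
    return [v * scale for v in q]


# Hypothetical weights for illustration only.
weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
```

Each weight is stored as one of 16 integer codes plus a shared scale, so the reconstruction error per weight is at most half the scale. Practical schemes usually quantize in small groups with separate scales to reduce that error.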

Paperid: 2831, https://arxiv.org/pdf/2503.16486.pdf

Abstract:
Computer programming represents a rapidly evolving and sought-after career path in the 21st century. Nevertheless, novice learners may find the process intimidating for several reasons, such as limited and highly competitive career opportunities, peer and parental pressure for academic success, and course difficulties. These factors frequently contribute to anxiety and eventual dropout as a result of fear. Furthermore, research has demonstrated that beginners are significantly deterred by the fear of failure, which results in programming anxiety and a sense of being overwhelmed by intricate topics, ultimately leading to dropping out. This project undertakes an exploration beyond the scope of conventional code learning platforms by identifying and utilising effective and personalised strategies of learning. The proposed solution incorporates features such as AI-generated challenging questions, mindfulness quotes, and tips to motivate users, along with an AI chatbot that functions as a motivational aid. In addition, the suggested solution integrates personalized roadmaps and gamification elements to maintain user involvement. The project aims to systematically monitor the progress of novice programmers and enhance their knowledge of coding with a personalised, revised curriculum to help mitigate the fear of coding and boost confidence.

Paperid: 2832, https://arxiv.org/pdf/2503.16485.pdf

Abstract:
This study highlights the transparency and accuracy of GenAI's inductive thematic analysis, particularly using the GPT-4 Turbo API integrated within a stepwise prompt-based Python script. This approach ensured a traceable and systematic coding process, generating codes with supporting statements and page references, which enhanced validation and reproducibility. The results indicate that GenAI performs inductive coding in a manner closely resembling human coders, effectively categorizing themes at a level comparable to that of the average human coder. However, in interpretation, GenAI extends beyond human coders by situating themes within a broader conceptual context, providing a more generalized and abstract perspective.

Paperid: 2833, https://arxiv.org/pdf/2503.16482.pdf

Abstract:
STEAM education integrates Science, Technology, Engineering, Arts, and Mathematics to foster creativity and problem-solving. However, students with visual impairments (VI) encounter significant challenges in programming and robotics, particularly in tracking robot movements and developing spatial awareness. This paper presents a framework that leverages pre-constructed robots and algorithms, such as maze-solving techniques, within an accessible learning environment. The proposed system employs Contrastive Language-Image Pre-training (CLIP) to process global camera-captured maze layouts, converting visual data into textual descriptions that generate spatial audio prompts in an Audio Virtual Reality (AVR) system. Students issue verbal commands, which are refined through CLIP, while robot-mounted stereo cameras provide real-time data processed via Simultaneous Localization and Mapping (SLAM) for continuous feedback. By integrating these technologies, the framework empowers VI students to develop coding skills and engage in complex problem-solving tasks. Beyond maze-solving applications, this approach demonstrates the broader potential of computer vision in special education, contributing to improved accessibility and learning experiences in STEAM disciplines.

Paperid: 2834, https://arxiv.org/pdf/2503.16472.pdf

Abstract:
The rapid development of artificial intelligence (AI) has significantly transformed human-computer interactions, making it essential to establish robust design standards to ensure effective, ethical, and human-centered AI (HCAI) solutions. Standards serve as the foundation for the adoption of new technologies, and human-AI interaction (HAII) standards are critical to supporting the industrialization of AI technology by following an HCAI approach. These design standards aim to provide clear principles, requirements, and guidelines for designing, developing, deploying, and using AI systems, enhancing the user experience and performance of AI systems. Despite their importance, the creation and adoption of HCAI-based interaction design standards face challenges, including the absence of universal frameworks, the inherent complexity of HAII, and the ethical dilemmas that arise in such systems. This chapter provides a comparative analysis of HAII versus traditional human-computer interaction (HCI) and outlines guiding principles for HCAI-based design. It explores international, regional, national, and industry standards related to HAII design from an HCAI perspective and reviews design guidelines released by leading companies such as Microsoft, Google, and Apple. Additionally, the chapter highlights tools available for implementing HAII standards and presents case studies of human-centered interaction design for AI systems in diverse fields, including healthcare, autonomous vehicles, and customer service. It further examines key challenges in developing HAII standards and suggests future directions for the field. Emphasizing the importance of ongoing collaboration between AI designers, developers, and experts in human factors and HCI, this chapter stresses the need to advance HCAI-based interaction design standards to ensure human-centered AI solutions across various domains.

Paperid: 2835, https://arxiv.org/pdf/2503.16467.pdf

Abstract:
Artificial Intelligence (AI) has significantly advanced in recent years, driving innovation across various fields, especially in robotics. Even though robots can perform complex tasks with increasing autonomy, challenges remain in ensuring explainability and user-centered design for effective interaction. A key issue in Human-Robot Interaction (HRI) is enabling robots to effectively perceive and reason over multimodal inputs, such as audio and vision, to foster trust and seamless collaboration. In this paper, we propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities. We introduce a use case on assessing 'Relevance' between verbal utterances from the user and visual scene perception of the robot. We present our methodology with a Multimodal Joint Representation module and a Temporal Alignment module, which can allow robots to evaluate relevance by temporally aligning multimodal inputs. Finally, we discuss how the proposed framework for context representation can help with various aspects of explainability in HRI.

Paperid: 2836, https://arxiv.org/pdf/2503.16464.pdf

Abstract:
In this study, we investigate the feasibility of using a human-centered artificial intelligence (AI) chat platform where medical specialists collaboratively assess complex cases. As the target population for this platform, we focus on patients with cardiovascular diseases who are in a state of multimorbidity, that is, suffering from multiple chronic conditions. We evaluate simulated cases with multiple diseases using a chat application by collaborating with physicians to assess feasibility, efficiency gains through AI utilization, and the quantification of discussion content. We constructed simulated cases based on past case reports, medical error reports, and complex cases of cardiovascular diseases experienced by the physicians. The analysis of discussions across five simulated cases demonstrated a significant reduction in the time required for summarization using AI, with an average reduction of 79.98%. Additionally, we examined hallucination rates in AI-generated summaries used in multidisciplinary medical discussions. The overall hallucination rate ranged from 1.01% to 5.73%, with an average of 3.62%, whereas the harmful hallucination rate varied from 0.00% to 2.09%, with an average of 0.49%. Furthermore, morphological analysis demonstrated that multidisciplinary assessments enabled a more complex and detailed representation of medical knowledge compared with single physician assessments. We examined structural differences between multidisciplinary and single physician assessments using centrality metrics derived from the knowledge graph. In this study, we demonstrated that AI-assisted summarization significantly reduced the time required for medical discussions while maintaining structured knowledge representation. These findings can support the feasibility of AI-assisted chat-based discussions as a human-centered approach to multidisciplinary medical decision-making.

Paperid: 2837, https://arxiv.org/pdf/2503.16459.pdf

Abstract:
This study proposes the realization of various virtual environments using a lower limb exoskeletal robot for futuristic gait rehabilitation. The proposed method allows the user to feel virtual gravity, buoyancy, and drag while actively walking. The virtual environments include four fluidic conditions: Water, Olive oil, Honey, and Peanut Butter, and four gravitational conditions consisting of the Earth's, Moon's, Mars', and Jupiter's gravity. The control method of the lower limb exoskeletal robot is as follows. First, torque feedback is applied to control the interaction force between the exoskeletal robot and its user. Second, the reference torque is computed in real time with the dynamic equations of the human body and the kinematic data. The eight environments were implemented via the EXOWheel, a wheelchair-integrated lower limb exoskeletal robot. While attaching electromyography sensors and wearing the EXOWheel, eight healthy subjects walked actively under the virtual conditions. Experimental results show that muscular force signals adequately change depending on gravitational, buoyant, and drag effects. Blind tests confirmed that subjects could reliably distinguish all eight virtual environments.
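As a rough illustration of how a gravitational condition could enter a reference torque, the sketch below computes the gravity torque on a single rigid limb segment modeled as a pendulum. This is an assumption made for illustration only: the abstract does not give the dynamic equations used, and the function name, the single-segment model, and the segment parameters are hypothetical. The surface gravity values are standard approximate figures.

```python
import math

# Approximate surface gravity in m/s^2 (standard reference values).
GRAVITY = {"Earth": 9.81, "Moon": 1.62, "Mars": 3.71, "Jupiter": 24.79}


def gravity_torque(mass_kg, com_distance_m, angle_rad, planet):
    """Gravity torque (N*m) about a joint for one rigid segment.

    Hypothetical single-pendulum model: angle_rad is measured from the
    downward vertical, com_distance_m is the joint-to-center-of-mass distance.
    """
    return mass_kg * GRAVITY[planet] * com_distance_m * math.sin(angle_rad)


# Illustrative segment (values are made up, not taken from the study).
for planet in GRAVITY:
    torque = gravity_torque(4.0, 0.25, math.pi / 2, planet)
    print(f"{planet}: {torque:.2f} N*m")
```

A full controller would also need buoyancy and drag terms and a multi-segment body model; those are omitted here because the abstract does not specify them.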

Paperid: 2838, https://arxiv.org/pdf/2503.16453.pdf

Abstract:
While performance in coordinated motor tasks has been shown to improve in children as they age, the characterization of children's movement strategies has been underexplored. In this work, we use upper-body motion data collected from an augmented reality reaching game, and show that short (13 second) sections of motion are sufficient to reveal arm motion differences across child development. To explore what drives this trend, we characterize the movement patterns across different age groups by analyzing (1) directness of path, (2) maximum speed, and (3) progress towards the reaching target. We find that although maximum arm velocity decreases with age (p = 0.02), older children's paths to goal are more direct (p = 0.03), allowing for faster time to goal overall. We also find that older children exhibit more anticipatory reaching behavior, enabling more accurate goal-reaching (i.e. no overshooting) compared to younger children. The resulting analysis has potential to improve the realism of child-like digital characters and advance our understanding of motor skill development.
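Two of the movement measures named above, directness of path and maximum speed, can be sketched as follows. The definitions used here are assumptions, since the abstract does not state them: directness is taken as straight-line distance divided by travelled path length, and maximum speed as the largest per-sample displacement divided by the sampling interval. The function name and sample trajectories are hypothetical.

```python
import math


def path_metrics(positions, dt):
    """Return (directness, max_speed) for a sampled hand trajectory.

    positions: list of (x, y, z) points in metres, sampled every dt seconds.
    directness: 1.0 for a perfectly straight path, smaller for detours
    (assumed definition, not taken from the paper).
    """
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    path_length = sum(steps)
    straight = math.dist(positions[0], positions[-1])
    directness = straight / path_length if path_length > 0 else 1.0
    max_speed = max(steps) / dt if steps else 0.0
    return directness, max_speed


# Made-up trajectories: one straight reach and one with a detour.
straight_path = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
detour_path = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.0), (0.2, 0.0, 0.0)]
print(path_metrics(straight_path, dt=0.1))
print(path_metrics(detour_path, dt=0.1))
```

In this toy case the detour path has the higher peak speed but the lower directness, the same kind of trade-off the abstract reports between younger and older children.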

Paperid: 2839, https://arxiv.org/pdf/2503.16446.pdf

Abstract:
This study investigates factors influencing employees' perceptions of the usefulness of Business Process Management Systems (BPMS) in commercial settings. It explores the roles of system dependency, system quality, and the quality of information and knowledge in the adoption and use of BPMS. Data were collected using a structured questionnaire from end-users in various firms and analyzed with Partial Least Squares (PLS). The survey evaluated perceptions of service quality, input quality, system attributes, and overall system quality. The findings indicate that service quality, input quality, and specific system attributes significantly influence perceived system quality, while system dependency and information quality are predictors of perceived usefulness. The results highlight the importance of user training, support, and high-quality information in enhancing satisfaction and BPMS. This research offers empirical evidence on the factors impacting user perceptions and acceptance, emphasizing the need for user-centric approaches in BPMS.

Paperid: 2840, https://arxiv.org/pdf/2503.16442.pdf

Abstract:
In the context of artificial life art and agent-based art, this paper draws on Simon Penny's Aesthetic of Behavior theory and Sofian Audry's discussions on behavior computation to examine how artists design agent behaviors and the ensuing aesthetic experiences. We advocate for integrating the environment in which agents operate as the context for behavioral design, positing that the environment emerges through continuous interactions among agents, audiences, and other entities, forming an evolving network of meanings generated by these interactions. Artists create contexts by deploying and guiding these computational systems, audience participation, and agent behaviors through artist strategies. This framework is developed by analysing two categories of agent-based artworks, exploring the intersection of computational systems, audience participation, and artistic strategies in creating aesthetic experiences. This paper seeks to provide a contextual foundation and framework for designing agents' behaviors by conducting a comparative study focused on behavioural design strategies by the artists.

Paperid: 2841, https://arxiv.org/pdf/2503.16439.pdf

Abstract:
We present DreamLLM-3D, a composite multimodal AI system behind an immersive art installation for dream re-experiencing. It enables automated dream content analysis for immersive dream-reliving, by integrating a Large Language Model (LLM) with text-to-3D Generative AI. The LLM processes voiced dream reports to identify key dream entities (characters and objects), social interaction, and dream sentiment. The extracted entities are visualized as dynamic 3D point clouds, with emotional data influencing the color and soundscapes of the virtual dream environment. Additionally, we propose an experiential AI-Dreamworker Hybrid paradigm. Our system and paradigm could potentially facilitate a more emotionally engaging dream-reliving experience, enhancing personal insights and creativity.

Paperid: 2842, https://arxiv.org/pdf/2503.16438.pdf

Abstract:
As artificial intelligence becomes increasingly integrated into professional and personal domains, traditional metrics of human intelligence require reconceptualization. This paper introduces the Artificial Intelligence Quotient (AIQ), a novel measurement framework designed to assess an individual's capacity to effectively collaborate with and leverage AI systems, particularly Large Language Models (LLMs). Building upon established cognitive assessment methodologies and contemporary AI interaction research, we present a comprehensive framework for quantifying human-AI collaborative intelligence. This work addresses the growing need for standardized evaluation of AI-augmented cognitive capabilities in educational and professional contexts.

Paperid: 2843, https://arxiv.org/pdf/2503.16436.pdf

Abstract:
As AI systems become more prevalent, concerns about their development, operation, and societal impact intensify. Establishing ethical, social, and safety standards amidst evolving AI capabilities poses significant challenges. Global initiatives are underway to establish guidelines for AI system development and operation. With the increasing use of collaborative human-AI task execution, it's vital to continuously adapt AI systems to meet user and environmental needs. Failure to synchronize AI evolution with changes in users and the environment could result in ethical and safety issues. This paper evaluates the applicability of existing guidelines in human-robot collaborative systems, assesses their effectiveness, and discusses limitations. Through a case study, we examine whether our target system meets requirements outlined in existing guidelines and propose improvements to enhance human-robot interactions. Our contributions provide insights into interpreting and applying guidelines, offer concrete examples of system enhancement, and highlight their applicability and limitations. We believe these contributions will stimulate discussions and influence system assurance and certification in future AI-infused critical systems.

Paperid: 2844, https://arxiv.org/pdf/2503.16433.pdf

Abstract:
Under-resourced or rural hospitals have limited access to medical specialists and healthcare professionals, which can negatively impact patient outcomes in sepsis. To address this gap, we developed the MATEC (Multi-AI Agent Team Care) framework, which integrates a team of specialized AI agents for sepsis care. The sepsis AI agent team includes five doctor agents, four health professional agents, and a risk prediction model agent, with an additional 33 doctor agents available for consultations. Ten attending physicians at a teaching hospital evaluated this framework, spending approximately 40 minutes on the web-based MATEC application and participating in the 5-point Likert scale survey (rated from 1-unfavorable to 5-favorable). The physicians found the MATEC framework very useful (Median=4, P=0.01), and very accurate (Median=4, P<0.01). This pilot study demonstrates that a Multi-AI Agent Team Care framework (MATEC) can potentially be useful in assisting medical professionals, particularly in under-resourced hospital settings.

Paperid: 2845, https://arxiv.org/pdf/2503.16227.pdf

Abstract:
This paper examines how trust is formed, maintained, or diminished over time in the context of human-autonomy teaming with an optionally piloted aircraft. Whereas traditional factor-based trust models offer a static representation of human confidence in technology, here we discuss how variations in the underlying factors lead to variations in trust, trust thresholds, and human behaviours. Over 200 hours of flight test data collected over a multi-year test campaign from 2021 to 2023 were reviewed. The dispositional-situational-learned, process-performance-purpose, and IMPACTS homeostasis trust models are applied to illuminate trust trends during nominal autonomous flight operations. The results offer promising directions for future studies on trust dynamics and design-for-trust in human-autonomy teaming.

Paperid: 2846, https://arxiv.org/pdf/2503.16191.pdf

Abstract:
The design, operations, and management of water distribution systems (WDS) involve complex mathematical models. These models are continually improving due to computational advancements, leading to better decision-making and more efficient WDS management. However, the significant time and effort required for modeling, programming, and analyzing results remain substantial challenges. Another issue is the professional burden, which confines the interaction with models, databases, and other sophisticated tools to a small group of experts, thereby causing non-technical stakeholders to depend on these experts or make decisions without modeling support. Furthermore, explaining model results is challenging even for experts, as it is often unclear which conditions cause the model to reach a certain state or recommend a specific policy. The recent advancements in Large Language Models (LLMs) open doors for a new stage in human-model interaction. This study proposes a framework of plain language interactions with hydraulic and water quality models based on LLM-EPANET architecture. This framework is tested with increasing levels of complexity of queries to study the ability of LLMs to interact with WDS models, run complex simulations, and report simulation results. The performance of the proposed framework is evaluated across several categories of queries and hyper-parameter configurations, demonstrating its potential to enhance decision-making processes in WDS management.

Paperid: 2847, https://arxiv.org/pdf/2503.16135.pdf

Abstract:
Malleable Glyph is a new visualization problem and a public challenge. It originated from UX research (namely from research on card sorting UX), but its applications can be diverse (UI, gaming, information presentation, maps, and others). Its essence is: carrying as much information in a defined planar and static area as possible. The information should allow human observers to evaluate a pair of glyphs into three possible sortings: the first is "greater", or the second is "greater", or both are equal. The glyphs should adhere to the Illiteracy Rule, in other words, the observer should ask themselves the question "how much?" rather than "how many?". This article motivates the technique, explains its details, and presents the public challenge, including the evaluation protocol. The article aims to call for ideas from other visualization and graphics researchers and practitioners and to invite everyone to participate in the challenge and, by doing so, move scientific knowledge forward.

Paperid: 2848, https://arxiv.org/pdf/2503.16011.pdf

Abstract:
As the capabilities of Large Language Models (LLMs) expand, more researchers are studying their adoption in newsrooms. However, much of the research focus remains broad and does not address the specific technical needs of investigative journalists. Therefore, this paper presents several applied use cases where automation and AI intersect with investigative journalism. We conducted a within-subjects user study with eight investigative journalists. In interviews, we elicited practical use cases using a speculative design approach by having journalists react to a prototype of a system that combines LLMs and Programming-by-Demonstration (PbD) to simplify data collection on numerous websites. Based on user reports, we classified the journalistic processes into data collecting and reporting. Participants indicated they utilize automation to handle repetitive tasks like content monitoring, web scraping, summarization, and preliminary data exploration. Following these insights, we provide guidelines on how investigative journalism can benefit from AI and automation.

Paperid: 2849, https://arxiv.org/pdf/2503.15942.pdf

Abstract:
Social media is central to activists, who use it internally for coordination and externally to reach supporters and the public. To date, the HCI community has not explored activists' perspectives on future social media platforms. In interviews with 14 activists from an environmental and a queer-feminist movement in Germany, we identify activists' needs and feature requests for future social media platforms. The key finding is that on- and offline safety is their main need. Based on this, we make concrete proposals to improve safety measures. Increased control over content presentation and tools to streamline activist workflows are also central to activists. We make concrete design and research recommendations on how social media platforms and the HCI community can contribute to improved safety and content presentation, and how activists themselves can reduce their workload.

Paperid: 2850, https://arxiv.org/pdf/2503.15529.pdf

Abstract:
Dashboards have arguably been the most used visualizations during the COVID-19 pandemic. They were used to communicate its evolution to national governments for disaster mitigation, to the public domain to inform about its status, and to epidemiologists to comprehend and predict the evolution of the disease. Each design had to be tailored for different tasks and to varying audiences - in many cases set up in a very short time due to the urgent need. In this paper, we collect notable examples of dashboards and reflect on their use and design during the pandemic from a user-oriented perspective: we interview a group of researchers with varying visualization expertise who actively used dashboards during the pandemic as part of their daily workflow. We discuss our findings and compile a list of lessons learned to support future visualization researchers and dashboard designers.

Paperid: 2851, https://arxiv.org/pdf/2503.15526.pdf

Abstract:
This study explores the integration of artificial intelligence (AI) or large language models (LLMs) into pediatric rehabilitation clinical documentation, focusing on the generation of SOAP (Subjective, Objective, Assessment, Plan) notes, which are essential for patient care. Creating complex documentation is time-consuming in pediatric settings. We evaluate the effectiveness of two AI tools, Copilot, a commercial LLM, and KAUWbot, a fine-tuned LLM developed for KidsAbility Centre for Child Development (an Ontario pediatric rehabilitation facility), in simplifying and automating this process. We focus on two key questions: (i) How does the quality of AI-generated SOAP notes based on short clinician summaries compare to human-authored notes, and (ii) To what extent is human editing necessary for improving AI-generated SOAP notes? We found no evidence of prior work assessing the quality of AI-generated clinical notes in pediatric rehabilitation. We used a sample of 432 SOAP notes, evenly divided among human-authored, Copilot-generated, and KAUWbot-generated notes. We employ a blind evaluation by experienced clinicians based on a custom rubric. Statistical analysis is conducted to assess the quality of the notes and the impact of human editing. The results suggest that AI tools such as KAUWbot and Copilot can generate SOAP notes with quality comparable to those authored by humans. We highlight the potential for combining AI with human expertise to enhance clinical documentation and offer insights for the future integration of AI into pediatric rehabilitation practice and other settings for the management of clinical conditions.

Paperid: 2852, https://arxiv.org/pdf/2503.15523.pdf

Abstract:
Children tend to be constantly exposed to technologies, such as smartphones, tablets, and gaming consoles, drawn by the interactive and visually stimulating nature of digital platforms. Thus, integrating the teaching process with technological gadgets may enhance engagement and foster interactive learning experiences, besides equipping students with the digital skills for today's increasingly technology-driven world. The main goal of this work is to provide an open-source and manageable tool that teachers can use as an everyday activity and as an exergame. For this, we present a prototype of an interactive platform that students use to answer a quiz by moving to segments available on an interactive floor. All the platform design and implementation directions are publicly available.

Paperid: 2853, https://arxiv.org/pdf/2503.15513.pdf

Abstract:
Dysgraphia is a key cognitive disorder impacting writing skills. Current tests often identify dysgraphia after writing issues emerge. This paper presents a set of computer games and uses machine learning to analyze the results, predicting if a child is at risk. The games focus on cognitive differences like visual attention between dysgraphic and typical children. The machine learning model forecasts dysgraphia by observing how kids interact with these games. We also create an algorithm to detect unsuitable testing conditions, acting as a preprocess to avoid mislabeling them as dysgraphia. We developed a machine learning model capable of predicting dysgraphia with 93.24% accuracy in a test group of 74 participants.

Paperid: 2854, https://arxiv.org/pdf/2503.15509.pdf

Abstract:
An important part of data science is the use of visualisations to display data in a way that is easy to digest. Visualisations often rely on underlying statistical or machine learning models -- ranging from basic calculations like category means to advanced methods such as principal component analysis of multidimensional datasets -- to convey insights. We introduce an analogous concept for word descriptions of data, which we call wordalisations. Wordalisations describe data in easy to digest words, without necessarily reporting numerical values from the data. We show how to create wordalisations using large language models, through prompt templates engineered according to a task-agnostic structure which can be used to automatically generate prompts from data. We show how to produce reliable and engaging texts on three application areas: scouting football players, personality tests, and international survey data. Using the model cards framework, we emphasise the importance of clearly stating the model we are imposing on the data when creating the wordalisation, detailing how numerical values are translated into words, incorporating background information into prompts for the large language model, and documenting the limitations of the wordalisations. We argue that our model cards approach is a more appropriate framework for setting best practices in wordalisation of data than performance tests on benchmark datasets.
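The step of translating numerical values into words before prompting a language model can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' template: the z-score thresholds, the wording of each category, the function names, and the example data are all hypothetical.

```python
import statistics


def describe_z(z):
    """Map a z-score to a descriptive phrase (thresholds are assumptions)."""
    if z >= 1.5:
        return "outstanding"
    if z >= 0.5:
        return "above average"
    if z > -0.5:
        return "average"
    if z > -1.5:
        return "below average"
    return "poor"


def wordalise_prompt(name, values, population):
    """Build a prompt that states each metric in words, without raw numbers."""
    lines = []
    for metric, value in values.items():
        mean = statistics.mean(population[metric])
        sd = statistics.stdev(population[metric])
        z = (value - mean) / sd
        lines.append(
            f"{name} is {describe_z(z)} in {metric} compared with the reference group."
        )
    return (
        "Describe the following data in plain words. Do not invent numbers.\n"
        + "\n".join(lines)
        + "\nWrite a short, engaging summary."
    )


# Made-up reference data and one made-up individual.
population = {"passing": [50, 60, 70], "shooting": [10, 20, 30]}
prompt = wordalise_prompt("Player A", {"passing": 70, "shooting": 20}, population)
print(prompt)
```

Keeping the number-to-word mapping in ordinary code, rather than leaving it to the language model, makes the imposed model explicit and easy to document, which is in the spirit of the model cards approach the abstract describes.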

Paperid: 2855, https://arxiv.org/pdf/2503.15503.pdf

Abstract:
Robot Assisted Surgeries (RAS) have one of the steepest learning curves of any type of surgery. Because of this, methods to practice RAS outside the operating room have been developed to improve surgeons' skills. These strategies include the incorporation of extended reality simulators into surgical training programs. In this systematic review, we seek to determine if extended reality simulators can improve the performance of novice surgeons and how their performance compares to the conventional training of surgeons on surgical robots. Using the PRISMA 2020 guidelines, a systematic review and meta-analysis was performed searching PubMed, Embase, Web of Science, and Cochrane Library for studies that compared the performance of novice surgeons who received no additional training, trained with extended reality, or trained with inanimate physical simulators (conventional additional training). We included articles that gauged performance using either GEARS or time-to-complete measurements and used SPSS to perform a meta-analysis to compare the performance outcomes of the surgeons after training. Surgeons trained using extended reality completed their surgical tasks statistically significantly faster than those who did not receive training (Cohen's d=-0.95, p=0.02), and moderately slower than those conventionally trained (Cohen's d=0.65, p=0.14), although the latter difference was not statistically significant. Surgeons trained on extended reality demonstrated a statistically significant improvement in GEARS scores over those who did not train (Cohen's d=0.964, p<0.001), and had GEARS scores comparable to those of surgeons trained conventionally (Cohen's d=0.65, p=0.14). This meta-analysis demonstrates that extended reality simulators translated complex skills to surgeons in a low-cost and low-risk environment.
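For readers unfamiliar with the effect size reported above, Cohen's d for two independent groups is the difference in means divided by the pooled standard deviation. The sketch below computes it for a single comparison; the sample times are made up for illustration and are not data from the review, which pooled effect sizes across studies in SPSS.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical task completion times in seconds (not from the review).
xr_times = [100.0, 110.0, 120.0]
untrained_times = [120.0, 130.0, 140.0]
print(cohens_d(xr_times, untrained_times))
```

With time-to-complete as the outcome, a negative d means the first group was faster, which matches the sign convention of the d=-0.95 result for extended reality versus no training.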

Paperid: 2856, https://arxiv.org/pdf/2503.15499.pdf

Abstract:
Revitalizing Japan's remote areas has become a crucial task, and Matsue City exemplifies this effort in its temporary event spaces, created through collective efforts to foster urban vibrancy and bring together residents and visitors. This research examines the relationship between data-driven insights using generative AI and visual attractiveness by evaluating temporary events in Matsue City, particularly considering the cognitive-cultural differences in processing visual information of the participants. The first phase employs semantic keyword extraction from interviews, categorizing responses into physical elements, activities, and atmosphere. The second phase analyzes spatial perception through three categories: layout hierarchy, product visibility, and visual attention. The correlation indicates that successful event design requires a balance between spatial efficiency and diverse needs, with a spatial organization that optimizes visitor flow and visibility strategies considering cultural and demographic diversity. These findings contribute to understanding the urban quality of temporary event spaces and offer a replicable framework for enhancing the visual appeal of events in remote areas throughout Japan.

Paperid: 2857, https://arxiv.org/pdf/2503.15497.pdf

Abstract:
This study investigates how the Big Five personality traits influence decision-making processes in AI agents within public spaces. Using the AgentVerse framework and GPT-3.5-turbo, we simulated interactions among 10 AI agents, each embodying different dimensions of the Big Five personality traits, in a classroom environment responding to misinformation. The experiment assessed both public expressions ([Speak]) and private thoughts ([Think]) of agents, revealing significant correlations between personality traits and decision-making patterns. Results demonstrate that Openness to Experience had the strongest impact on information acceptance, with curious agents showing high acceptance rates and cautious agents displaying strong skepticism. Extraversion and Conscientiousness also showed notable influence on decision-making, while Neuroticism and Agreeableness exhibited more balanced responses. Additionally, we observed significant discrepancies between public expressions and private thoughts, particularly in agents with friendly and extroverted personalities, suggesting that social context influences decision-making behavior. Our findings contribute to understanding how personality traits shape AI agent behavior in social settings and have implications for developing more nuanced and context-aware AI systems.

Paperid: 2858, https://arxiv.org/pdf/2503.15493.pdf

Abstract:
Grounded in framing theory, this study examines how news titles about older adults shape user engagement on a Chinese video-sharing platform. We analyzed 2,017 video news titles from 2016 to 2021, identifying nine frames. Negative frames produced higher views and shares, suggesting that negative portrayals garner attention and encourage further distribution. In contrast, positive frames led to more collections and rewards, reflecting viewer preference and financial support for favorable depictions. These findings underscore how framing aligns with ageism concerns and highlight the need for more balanced media portrayals of older adults.

Paperid: 2859, https://arxiv.org/pdf/2503.15490.pdf

Abstract:
We introduce a framework for understanding the impact of generative AI on human work, which we call the human-AI task tensor. A tensor is a structured framework that organizes tasks along multiple interdependent dimensions. Our human-AI task tensor introduces a systematic approach to studying how humans and AI interact to perform tasks, and has eight dimensions: task definition, AI contribution, interaction modality, audit requirement, output definition, decision-making authority, AI structure, and human persona. After describing the eight dimensions of the tensor, we provide illustrative frameworks (derived from projections of the tensor) and a human-AI task canvas that provide analytical tractability and practical insight for organizational decision-making. We demonstrate how the human-AI task tensor can be used to organize emerging and future research on generative AI. We propose that the human-AI task tensor offers a starting point for understanding how work will be performed with the emergence of generative AI.

Paperid: 2860, https://arxiv.org/pdf/2503.15489.pdf

Abstract:
This paper introduces PersonaAI, a cutting-edge application that leverages Retrieval-Augmented Generation (RAG) and the LLAMA model to create highly personalized digital avatars capable of accurately mimicking individual personalities. Designed as a cloud-based mobile application, PersonaAI captures user data seamlessly, storing it in a secure database for retrieval and analysis. The result is a system that provides context-aware, accurate responses to user queries, enhancing the potential of AI-driven personalization. Why should you care? PersonaAI combines the scalability of RAG with the efficiency of prompt-engineered LLAMA3, offering a lightweight, sustainable alternative to traditional large language model (LLM) training methods. The system's novel approach to data collection, utilizing real-time user interactions via a mobile app, ensures enhanced context relevance while maintaining user privacy. By open-sourcing our implementation, we aim to foster adaptability and community-driven development. PersonaAI demonstrates how AI can transform interactions by merging efficiency, scalability, and personalization, making it a significant step forward in the future of digital avatars and personalized AI.

Paperid: 2861, https://arxiv.org/pdf/2503.15176.pdf

Abstract:
This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU), natural language generation (NLG), to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating Large Language Models (LLMs) with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.

Paperid: 2862, https://arxiv.org/pdf/2503.14408.pdf

Abstract:
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee's engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have varied from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require crafting specific gesture expertise and are time-consuming and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.

Paperid: 2863, https://arxiv.org/pdf/2503.14257.pdf

Abstract:
One's own voice is one of the most frequently heard voices. Studies found that hearing and talking to oneself have positive psychological effects. However, the design and implementation of self-voice for emotional regulation in HCI have yet to be explored. In this paper, we introduce InnerSelf, an innovative voice system based on speech synthesis technologies and the Large Language Model. It allows users to engage in supportive and empathic dialogue with their deepfake voice. By manipulating positive self-talk, our system aims to promote self-disclosure and regulation, reshaping negative thoughts and improving emotional well-being.

Paperid: 2864, https://arxiv.org/pdf/2503.14220.pdf

Abstract:
Music visualization is an important medium that enables synesthetic experiences and creative inspiration. However, previous research focused mainly on the technical and theoretical aspects, overlooking users' everyday interaction with music visualizations. This gap highlights the pressing need for research on how music visualization influences users in synesthetic creative experiences and where they are heading. Thus, we developed musicolors, a web-based, real-time music visualization library. Additionally, we conducted a qualitative user study with composers, developers, and listeners to explore how they use musicolors to appreciate music, find inspiration, and craft music-visual interactions. The results show that musicolors provides rich value to users through sketching for musical ideas, integrating visualizations with other systems or platforms, and synesthetic listening. Based on these findings, we also provide guidelines for future music visualizations to offer a more interactive and creative experience.

Paperid: 2865, https://arxiv.org/pdf/2503.14178.pdf

Abstract:
With the development of technology, digital games have permeated into family and parent-child relationships, leading to cognitive deficiencies and inter-generational conflicts that have yet to be effectively addressed. Building on previous research on digital games and parent-child relationships, we have developed Figame, a Joint Media Engagement (JME) based parent-child digital game aimed at fostering healthy family gaming relationships through co-playing experiences. The game itself involves providing game-related cognitive support, facilitating role-switching between parent and child, encouraging discussions both within and outside the game, and balancing competition and collaboration. During the study, we assessed the gameplay experiences of 8 parent-child pairs (aged between 8 and 12 years). The results indicated that Figame effectively enhances parent-child digital gaming relationships and promotes a willingness to engage in shared gameplay, thereby fostering positive family dynamics within the context of digital gaming.

Paperid: 2866, https://arxiv.org/pdf/2503.14168.pdf

Abstract:
Extensive research has confirmed the positive relationship between exposure to natural environments and human cognitive, behavioral, physical, and mental health. However, only some have easy access to nature. With electronic information and simulation technology advancements, digital nature experiences are widely used across various devices and scenarios. It is essential to explore how to effectively select and utilize natural elements to guide the design of digital nature scenes. This paper examines critical elements in immersive virtual nature (IVN) and their impact on user perception. Through online surveys and design experiments, we identified specific natural elements that promote relaxation and proposed design strategies for virtual environments. We developed several immersive virtual nature scenes for further validation. Finally, we outline our future experimental plans and research directions in digital nature. Our research aims to provide HCI designers insights into creating restorative, immersive virtual scenes.

Paperid: 2867, https://arxiv.org/pdf/2503.14139.pdf

Abstract:
Psychological stress encompasses emotional tension and pressure experienced by people, which usually arises from situations people find challenging. However, little is known about the pressures faced by international college students studying in China. The goal of this study is to investigate the various stressors that international college students in China face and how they cope with stress (coping mechanisms). Twenty international students were interviewed to gather data, which was then transcribed. Thematic analysis and coding were applied to the qualitative data, revealing themes related to the causes of stress. The following themes emerge from this data: anticipatory anxiety or future stress, social and cultural challenges, financial strain, and academic pressure. These themes will help understand the various stressors international college students in China face and how they try to cope. Studying how international college students in China cope with challenges can guide the development of targeted interventions to support their mental health. Research suggests that integrating aesthetics and connectivity into design interventions can notably improve the well-being of these students. This paper presents possible future design solutions, leveraging the aesthetics of connectivity to empower students and enhance their resilience. Additionally, it aims to provide valuable insights for designers interested in creating solutions that alleviate stress and promote emotional awareness among international students.

Paperid: 2868, https://arxiv.org/pdf/2503.14127.pdf

Abstract:
Autistic children often face challenges in social interaction and communication, impacting their social connectivity, especially with their parents. Despite the effectiveness of game-based interactive therapy in improving motor skills, research on enhancing parent-child relationships is lacking. We address this gap with Magicarpet, an interactive play carpet that encourages parent-child interaction and has been validated through a user study with five families. The preliminary results indicate that Magicarpet enhances the motivation and participation of autistic children in play, demonstrating the potential of human-computer interaction (HCI) designs to foster connectivity.

Paperid: 2869, https://arxiv.org/pdf/2503.14122.pdf

Abstract:
Empowerment in smart clothing, which incorporates advanced technologies, requires the integration of scientific and technological expertise with artistic and design principles. Little research has focused on this unique and innovative field of design until now, and that is about to change. The concept of 'wearables' cuts across several fields. A global 'language' that permits both free-form creativity and a methodical design approach is required. Smart clothing designers often seek guidance in their research since it may be difficult to prioritize and understand issues such as usability, production, style, consumer culture, reuse, and end-user needs. The researchers in this study made sure that their design tool was presented in a manner that practitioners from many walks of life could understand. The 'critical route' is a useful tool for smart technology implementation design, study, and development since it helps to clarify the path that must be taken.

Paperid: 2870, https://arxiv.org/pdf/2503.13809.pdf

Abstract:
The Immersive Archive is an initiative dedicated to preserving and restoring groundbreaking works from across Extended Reality (XR) history. Originating at the University of Southern California's Mobile and Environmental Media Lab, this archive is committed to developing and exhibiting simulations of influential XR devices that have shaped immersive media over time. This paper examines the challenges and strategies involved in archiving seminal XR technologies, with a focus on Morton Heilig's Sensorama and Ivan Sutherland's Head-Mounted Display. As pioneering prototypes in virtual and augmented reality, these devices provide valuable insights into the evolution of immersive media, highlighting both technological innovation and sensory experimentation. Through collaborative archival efforts with institutions such as the HMH Moving Image Archive at the University of Southern California and the Computer History Museum, this research integrates media archaeology with digital preservation techniques. Emphasis is placed on documentation practices, restoration of physical artifacts, and developing simulations of these historic experiences for contemporary virtual reality platforms. Our interdisciplinary approach to archival methodologies, which captures the multisensory and interactive qualities of these pioneering devices, has been instrumental in developing a framework for future immersive media preservation initiatives. By preserving the immersive essence of these early experiences, we lay the groundwork for future generations to explore and learn from the origins of immersive media. Safeguarding this rich legacy is essential to ensure these visionary works continue to inspire and shape the future of media landscapes.

Paperid: 2871, https://arxiv.org/pdf/2503.13425.pdf

Abstract:
By 2050, a quarter of the US population will be over the age of 65 with greater than a 40% risk of developing life-altering neuromusculoskeletal pathologies. The potential of wearables, such as Apple AirPods and hearing aids, to provide personalized preventative and predictive health monitoring outside of the clinic is nascent, but large quantities of open-ended data that capture movement in the physical world now exist. Algorithms that leverage existing wearable technology to detect subtle changes to walking mechanics, an early indicator of neuromusculoskeletal pathology, have successfully been developed to determine population-level statistics, but individual-level variability is more difficult to parse from population-level data. Like genetic sequencing, the individual's gait pattern can be discerned by decomposing the movement signal into its fundamental features from which we can detect "mutations" or changes to the pattern that are early indicators of pathology - movement-based biomarkers. We have developed a novel approach to quantify "normal baseline movement" at an individual level, combining methods from gait laboratories with methods used to characterize stellar oscillations. We tested our approach by asking participants to complete an outdoor circuit while wearing a pair of AirPods, using orthopaedic braces to simulate pathology. We found that the novel features we propose are sensitive enough to distinguish between normal walking and brace walking at the population level and at the individual level in all sensor directions (both p < 0.05). We also perform principal component analysis on our population-level and individual-level models, and find significant differences between individuals as well as between the overall population model and most individuals. We also demonstrate the potential of these gait features in deep learning applications.

Paperid: 2872, https://arxiv.org/pdf/2503.13041.pdf

Abstract:
This study protocol outlines the design and methodology of a research study investigating collaborative game elements to promote physical activity within digital health interventions. The study aims to examine how social relatedness influences motivation and adherence to step-count goals. Participants will use Shared Achievements, a minimalistic multiplayer step counter game, over two weeks, one week contributing absolute step counts and one week sharing step counts as a relative percentage of a team goal. Data will be collected through usage metrics and participant feedback to evaluate engagement, motivation, and perceived challenges. Findings will inform the design of digital health tools that balance competition and collaboration, optimising social and behavioural support mechanisms.
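
The two sharing modes described above can be illustrated with a minimal sketch. The function names, the team goal, and the step values are hypothetical and are not taken from the study protocol or from the Shared Achievements game itself.

```python
def absolute_contributions(steps_by_member):
    """Absolute mode: each member shares the raw step count."""
    return dict(steps_by_member)


def relative_contributions(steps_by_member, team_goal):
    """Relative mode: each member shares steps as a percentage of the team goal."""
    if team_goal <= 0:
        raise ValueError("team_goal must be positive")
    return {
        name: 100.0 * steps / team_goal
        for name, steps in steps_by_member.items()
    }


def team_progress(steps_by_member, team_goal):
    """Percentage of the team goal reached so far, capped at 100."""
    if team_goal <= 0:
        raise ValueError("team_goal must be positive")
    total = sum(steps_by_member.values())
    return min(100.0, 100.0 * total / team_goal)


# Hypothetical team of two with a made-up goal of 10,000 steps.
steps = {"player_a": 5000, "player_b": 2500}
print(absolute_contributions(steps))
print(relative_contributions(steps, team_goal=10_000))
print(team_progress(steps, team_goal=10_000))
```

In the relative mode a member's display depends on the shared goal rather than on the raw count, which is one way the two weeks of the protocol could present the same activity differently.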

Paperid: 2873, https://arxiv.org/pdf/2503.12924.pdf

Abstract:
As people engage with the social media landscape, popular platforms rise and fall. As current research uncovers the experiences people have on various platforms, rarely do we engage with the sociotechnical migration processes when joining and leaving them. In this paper, we asked 32 visitors of a science communication festival to draw out artifacts that we call Social Media Journey Maps about the social media platforms they frequented, and why. By combining qualitative content analysis with a graph representation of Social Media Journeys, we present how social media migration processes are motivated by the interplay of environmental and platform factors. We find that peer-driven popularity, the timing of feature adoption, and personal perceptions of migration causes - such as security - shape individuals' reasoning for migrating between social media platforms. With this work, we aim to pave the way for future social media platforms that foster meaningful and enriching online experiences for users.

Paperid: 2874, https://arxiv.org/pdf/2503.12877.pdf

Abstract:
In the field of group recommendation systems (GRS), effectively addressing the diverse preferences of group members poses a significant challenge. Traditional GRS approaches often aggregate individual preferences into a collective group preference to generate recommendations, which may overlook the intricate interactions between group members. We introduce a novel approach to group recommendation, with a specific focus on small groups sharing common interests. In particular, we present a web-based restaurant recommendation system that enhances user satisfaction by modeling mutual interactions among group members. Drawing inspiration from group decision-making literature and leveraging graph theory, we propose a recommendation algorithm that emphasizes the dynamics of relationships and trust within the group. By representing group members as nodes and their interactions as directed edges, the algorithm captures pairwise relationships to foster consensus and improve the alignment of recommendations with group preferences. This interaction-focused framework ultimately seeks to enhance overall group satisfaction with the recommended choices.
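
The abstract does not specify the recommendation algorithm, so the following is only a sketch of the general idea of members as nodes and trust as weighted directed edges. It swaps in DeGroot-style opinion averaging as a stand-in, which is not necessarily the authors' method; all names, ratings, and trust weights are hypothetical.

```python
def normalize_trust(trust):
    """Scale each member's outgoing trust weights so they sum to 1.

    trust[i][j] is how strongly member i weighs member j's opinion
    (self-trust trust[i][i] included).
    """
    normalized = {}
    for member, row in trust.items():
        total = sum(row.values())
        if total <= 0:
            raise ValueError(f"member {member!r} has no positive trust weights")
        normalized[member] = {other: w / total for other, w in row.items()}
    return normalized


def consensus_scores(ratings, trust, rounds=20):
    """Repeatedly replace each member's ratings by a trust-weighted average
    of the ratings of the members they trust, then average across members."""
    weights = normalize_trust(trust)
    items = list(next(iter(ratings.values())))
    current = {member: dict(r) for member, r in ratings.items()}
    for _ in range(rounds):
        current = {
            i: {
                item: sum(w * current[j][item] for j, w in weights[i].items())
                for item in items
            }
            for i in weights
        }
    return {
        item: sum(current[m][item] for m in current) / len(current)
        for item in items
    }


def recommend(ratings, trust, rounds=20):
    """Return the item with the highest group score after the given rounds."""
    scores = consensus_scores(ratings, trust, rounds)
    return max(scores, key=scores.get)


# Hypothetical group: bob and cat both weigh ann's opinion heavily.
ratings = {
    "ann": {"sushi": 5, "pizza": 1},
    "bob": {"sushi": 1, "pizza": 4},
    "cat": {"sushi": 1, "pizza": 4},
}
trust = {
    "ann": {"ann": 1.0},
    "bob": {"ann": 3.0, "bob": 1.0},
    "cat": {"ann": 3.0, "cat": 1.0},
}
print(recommend(ratings, trust, rounds=0))  # plain average of initial ratings
print(recommend(ratings, trust))            # after trust-weighted updates
```

With `rounds=0` the sketch reduces to the plain preference aggregation that the abstract attributes to traditional approaches, so the two calls contrast aggregation with interaction-aware scoring.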

Paperid: 2875, https://arxiv.org/pdf/2503.12341.pdf

Abstract:
Online scams are a growing threat in India, impacting millions and causing substantial financial losses year over year. This white paper presents ShieldUp!, a novel mobile game prototype designed to inoculate users against common online scams by leveraging the principles of psychological inoculation theory. ShieldUp! exposes users to weakened versions of manipulation tactics frequently used by scammers, and teaches them to recognize and pre-emptively refute these techniques. A randomized controlled trial (RCT) with 3,000 participants in India was conducted to evaluate the game's efficacy in helping users better identify scam scenarios. Participants were assigned to one of three groups: the ShieldUp! group (play time: 15 min), a general scam awareness group (watching videos and reading tips for 10-15 min), and a control group (plays "Chrome Dino", an unrelated game, for 10 minutes). Scam discernment ability was measured using a newly developed Scam Discernment Ability Test (SDAT-10) before the intervention, immediately after, and at a 21-day follow-up. Results indicated that participants who played ShieldUp! showed a significant improvement in their ability to identify scams compared to both control groups, and this improvement was maintained at follow-up. Importantly, while both interventions initially led users to show increased skepticism towards even genuine online offers (NOT Scam scenarios), this effect dissipated after 21 days, suggesting no long-term negative impact on user trust. This study demonstrates the potential of game-based inoculation as a scalable and effective scam prevention strategy, offering valuable insights for product design, policy interventions, and future research, including the need for longitudinal studies and cross-cultural adaptations.

Paperid: 2876, https://arxiv.org/pdf/2503.12080.pdf

Abstract:
In this article we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio (CVR) to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
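
The abstract names the Content Validity Ratio without stating its formula. A minimal sketch, assuming the study uses Lawshe's standard definition CVR = (n_e - N/2) / (N/2), where n_e is the number of raters who judge an item essential and N is the total number of raters; the function name and the example panel size are hypothetical.

```python
def content_validity_ratio(n_essential, n_raters):
    """Lawshe's CVR: ranges from -1 (no rater says essential) to +1 (all do)."""
    if n_raters <= 0:
        raise ValueError("n_raters must be positive")
    if not 0 <= n_essential <= n_raters:
        raise ValueError("n_essential must lie between 0 and n_raters")
    half = n_raters / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 raters, 8 of whom rate the item as essential.
print(content_validity_ratio(8, 10))
```

A value of 0 means exactly half of the raters judged the item essential; positive values indicate majority agreement.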

Paperid: 2877, https://arxiv.org/pdf/2503.11915.pdf

Abstract:
Writing about a subject enriches writers' understanding of that subject. This cognitive benefit of writing -- known as constructive learning -- is essential to how students learn in various disciplines. However, does this benefit persist when students write with generative AI writing assistants? Prior research suggests the answer varies based on the type of AI, e.g., auto-complete systems tend to hinder ideation, while assistants that pose Socratic questions facilitate it. This paper adds an additional perspective. Through a case study, we demonstrate that the impact of genAI on students' idea development depends not only on the AI but also on the students and, crucially, their interactions in between. Students who proactively explored ideas gained new ideas from writing, regardless of whether they used auto-complete or Socratic AI assistants. Those who engaged in prolonged, mindless copyediting developed few ideas even with a Socratic AI. These findings suggest opportunities in designing AI writing assistants, not merely by creating more thought-provoking AI, but also by fostering more thought-provoking writer-AI interactions.

Paperid: 2878, https://arxiv.org/pdf/2503.11914.pdf

Abstract:
The Steering Law has long been a fundamental model in predicting movement time for tasks involving navigating through constrained paths, such as in selecting sub-menu options, particularly for straight and circular arc trajectories. However, this does not reflect the complexities of real-world tasks where curvatures can vary arbitrarily, limiting its applications. This study aims to address this gap by introducing the total curvature parameter K into the equation to account for the overall curviness characteristic of a path. To validate this extension, we conducted a mouse-steering experiment on fixed-width paths with varying lengths and curviness levels. Our results demonstrate that the introduction of K significantly improves model fitness for movement time prediction over traditional models. These findings advance our understanding of movement in complex environments and support potential applications in fields like speech motor control and virtual navigation.
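
For a fixed-width path, the classic Steering Law predicts movement time as T = a + b (L / W), with path length L and width W. The abstract does not give the exact form of the extended equation, so the sketch below only shows how the two predictors could be computed for a polyline path, with total curvature K approximated as the sum of absolute turning angles. The additive model in `predict_time` and its coefficients are purely illustrative assumptions, not the paper's fitted model.

```python
import math


def path_length(points):
    """Length L of a polyline given as a list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def total_curvature(points):
    """Discrete total curvature K: sum of absolute turning angles in radians.

    This approximates the integral of |curvature| along the path.
    """
    total = 0.0
    for p, q, r in zip(points, points[1:], points[2:]):
        heading_in = math.atan2(q[1] - p[1], q[0] - p[0])
        heading_out = math.atan2(r[1] - q[1], r[0] - q[0])
        # Wrap the heading change into [-pi, pi) before taking its magnitude.
        turn = (heading_out - heading_in + math.pi) % (2 * math.pi) - math.pi
        total += abs(turn)
    return total


def predict_time(a, b, c, length, width, curvature):
    """Hypothetical additive extension: T = a + b * (L / W) + c * K."""
    return a + b * (length / width) + c * curvature


# A zig-zag path: one left turn and one right turn of 90 degrees each.
zigzag = [(0, 0), (1, 0), (1, 1), (2, 1)]
print(path_length(zigzag), total_curvature(zigzag))
```

Because the turning angles are summed by magnitude, left and right turns do not cancel, so a zig-zag path has a larger K than a straight path of the same length.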

Paperid: 2879, https://arxiv.org/pdf/2503.11894.pdf

Abstract:
The use of interactive voice assistants (IVAs) in healthcare provides an avenue to address diverse health needs, such as gaps in the medical recovery period for older adult patients who have recently experienced serious illness. By using a voice-assisted medical recovery curriculum, discharged patients can receive ongoing support as they recover. However, there exist significant medical and technology disparities among older adults, particularly among Black older adults. We recruited 26 Black older adults to participate in the design process of an IVA-enacted medical recovery curriculum by providing feedback during the early stages of design. Lack of cultural relevancy, accountability, privacy concerns, and stigmas associated with aging and disability made participants reluctant to engage with the technology unless in a position of extreme need. This study underscored the need for Black cultural representation, whether it regarded the IVA's accent, the types of media featured, or race-specific medical advice, and the need for strategies to address participants' concerns and stigmas. Participants saw the value in the curriculum for those who did not have caregivers and deliberated about the trade-offs the technology presented. We discuss tensions surrounding inclusion and representation and conclude by showing how we enacted the lessons from this study in future design plans.

Paperid: 2880, https://arxiv.org/pdf/2503.11861.pdf

Abstract:
The rapid growth of mobile banking (m-banking), especially after the COVID-19 pandemic, has reshaped the financial sector. This study analyzes consumer reviews of m-banking apps from five major Canadian banks, collected from Google Play and iOS App stores. Sentiment analysis and topic modeling classify reviews as positive, neutral, or negative, highlighting user preferences and areas for improvement. Data pre-processing was performed with NLTK, a Python language processing tool, and topic modeling used Latent Dirichlet Allocation (LDA). Sentiment analysis compared methods, with Long Short-Term Memory (LSTM) achieving 82% accuracy for iOS reviews and Multinomial Naive Bayes 77% for Google Play. Positive reviews praised usability, reliability, and features, while negative reviews identified login issues, glitches, and dissatisfaction with updates. This is the first study to analyze both iOS and Google Play m-banking app reviews, offering insights into app strengths and weaknesses. Findings underscore the importance of user-friendly designs, stable updates, and better customer service. Advanced text analytics provide actionable recommendations for improving user satisfaction and experience.
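
To make one of the compared methods concrete, here is a minimal, dependency-free sketch of a Multinomial Naive Bayes sentiment classifier with Laplace smoothing. It is not the authors' pipeline: the regex tokenizer is a simple stand-in for the NLTK pre-processing mentioned above, and the four training reviews are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic tokens (stand-in for NLTK preprocessing)."""
    return re.findall(r"[a-z']+", text.lower())


class MultinomialNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, texts, labels):
        """Count documents per class and word occurrences per class."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_posterior(self, text, label):
        """Unnormalized log P(label | text) under the bag-of-words model."""
        total_docs = sum(self.class_counts.values())
        log_prob = math.log(self.class_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
        for word in tokenize(text):
            if word in self.vocabulary:  # skip words never seen in training
                log_prob += math.log((counts[word] + self.alpha) / denom)
        return log_prob

    def predict(self, text):
        """Return the class with the highest log posterior."""
        return max(self.class_counts, key=lambda c: self.log_posterior(text, c))


# Invented toy reviews, for illustration only.
train_texts = [
    "easy to use and reliable",
    "love the new features",
    "login fails after update",
    "app keeps crashing, constant glitches",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = MultinomialNaiveBayes().fit(train_texts, train_labels)
print(model.predict("reliable and easy"))
print(model.predict("login glitches after update"))
```

A real study would train on thousands of labeled reviews and evaluate on held-out data; the toy corpus here only demonstrates the mechanics of the classifier.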

Paperid: 2881, https://arxiv.org/pdf/2503.11677.pdf

Abstract:
Objective. Patients implanted with the PRIMA photovoltaic subretinal prosthesis in geographic atrophy report form vision with the average acuity matching the 100 µm pixel size. Although this remarkable outcome enables them to read and write, they report difficulty with perceiving faces. This paper provides a novel, non-pixelated algorithm for simulating prosthetic vision the way it is experienced by PRIMA patients, compares the algorithm's predictions to clinical perceptual outcomes, and offers computer vision and machine learning (ML) methods to improve face representation. Approach. Our simulation algorithm integrates a grayscale filter, spatial resolution filter, and contrast filter. This accounts for the limited sampling density of the retinal implant, as well as the reduced contrast sensitivity of prosthetic vision. Patterns of Landolt C and faces created using this simulation algorithm are compared to reports from actual PRIMA users. To recover the facial features lost in prosthetic vision, we apply an ML facial landmarking model as well as contrast-adjusting tone curves to the face image prior to its projection onto the implant. Main results. Simulated prosthetic vision matches the maximum letter acuity observed in clinical studies as well as patients' subjective descriptions. Application of the inverse contrast filter helps preserve the contrast in prosthetic vision. Identifying the facial features using an ML facial landmarking model and accentuating them further improves face representation. Significance. Spatial and contrast constraints of prosthetic vision limit resolvable features and degrade natural images. ML-based methods and contrast adjustments mitigate some limitations and improve face representation. Even though higher spatial resolution can be expected with implants having smaller pixels, contrast enhancement still remains essential for face recognition.

Paperid: 2882, https://arxiv.org/pdf/2503.11202.pdf

Abstract:
Patients with extreme forms of paralysis face challenges in communication, adversely impacting their quality of life. Recent studies have reported higher-than-chance performance in decoding handwritten letters from EEG signals, potentially allowing these subjects to communicate. However, all prior works have attempted to decode handwriting from EEG during actual motion. Furthermore, they assume that precise movement-onset is known. In this work, we focus on settings closer to real-world use where either movement onset is not known or movement does not occur at all, fully utilizing motor imagery. We show that several existing studies are affected by confounds that make them inapplicable to the imagined handwriting setting. We also investigate how sample complexity affects handwriting decoding performance, guiding future data collection efforts. Our work shows that (a) Sample complexity analysis in single-trial EEG reveals a noise ceiling, which can be alleviated by averaging over trials. (b) Knowledge of movement-onset is crucial to reported performance in prior works. (c) Fully imagined handwriting can be decoded from EEG with higher-than-chance performance. Taken together, these results highlight both the unique challenges and avenues to pursue to build a practical EEG-based handwriting BCI.
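
Finding (a) rests on the general fact that averaging N trials with independent, zero-mean noise shrinks the noise standard deviation by roughly a factor of sqrt(N). The sketch below illustrates this with a synthetic sine wave plus Gaussian noise; it is not EEG data and not the authors' analysis, and all names and parameters are hypothetical.

```python
import math
import random
import statistics


def simulate_trials(signal, n_trials, noise_std, rng):
    """Generate noisy repetitions of the same underlying signal."""
    return [
        [s + rng.gauss(0.0, noise_std) for s in signal]
        for _ in range(n_trials)
    ]


def average_trials(trials):
    """Sample-wise mean across trials."""
    n = len(trials)
    return [sum(samples) / n for samples in zip(*trials)]


def residual_noise_std(estimate, signal):
    """Standard deviation of what remains after subtracting the true signal."""
    return statistics.pstdev([e - s for e, s in zip(estimate, signal)])


rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [math.sin(2 * math.pi * t / 50) for t in range(500)]

single = simulate_trials(signal, 1, noise_std=1.0, rng=rng)[0]
averaged = average_trials(simulate_trials(signal, 100, noise_std=1.0, rng=rng))

print(residual_noise_std(single, signal))    # close to the raw noise level
print(residual_noise_std(averaged, signal))  # roughly ten times smaller
```

The trade-off is that averaging requires many repetitions of the same character, which is why sample complexity matters for planning data collection.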

Paperid: 2883, https://arxiv.org/pdf/2503.11186.pdf

Abstract:
The primary objective of the dataset is to provide a better understanding of the coupling between human actions and gaze in a shared working environment with a cobot, with the aim of significantly enhancing the efficiency and safety of human-cobot interactions. More broadly, by linking gaze patterns with physical actions, the dataset offers valuable insights into cognitive processes and attention dynamics in the context of assembly tasks. The proposed dataset contains gaze and action data from approximately 80 participants, recorded during simulated industrial assembly tasks. The tasks were simulated using controlled scenarios in which participants manipulated educational building blocks. Gaze data was collected using two different eye-tracking setups (head-mounted and remote) while participants worked in two positions: sitting and standing.

Paperid: 2884, https://arxiv.org/pdf/2503.09901.pdf

Abstract:
Generative AI (GAI) technologies are disrupting professional writing, challenging traditional practices. Recent studies explore GAI adoption experiences of creative practitioners, but we know little about how these experiences evolve into established practices and how GAI resistance alters these practices. To address this gap, we conducted 25 semi-structured interviews with writing professionals who adopted and/or resisted GAI. Using the theoretical lens of Job Crafting, we identify four strategies professionals employ to reshape their roles. Writing professionals employed GAI resisting strategies to maximize human potential, reinforce professional identity, carve out a professional niche, and preserve credibility within their networks. In contrast, GAI-enabled strategies allowed writers who embraced GAI to enhance desirable workflows, minimize mundane tasks, and engage in new AI-managerial labor. These strategies amplified their collaborations with GAI while reducing their reliance on other people. We conclude by discussing implications of GAI practices on writers' identity and practices as well as crafting theory.

Paperid: 2885, https://arxiv.org/pdf/2503.09079.pdf

Abstract:
Children with ADHD often struggle with executive function (EF) and motor skills, impacting their academics and social life. While medications are commonly used, they have side effects, leading to interest in non-drug treatments. Physical activity (PA) has shown promise in improving cognitive and motor skills in children with ADHD. This study examined the short- and long-term effects of three PA interventions: a specific skill training group (EG1), a low-demand exercise group (EG2), and a control group (CG) over 12 weeks. EG1 showed significant improvements in motor tasks and working memory (15% improvement, p<0.05), while EG2 and CG showed smaller changes. Long-term PA improved working memory, but short-term PA had limited effects on balance and manual dexterity. These findings suggest that skill training has an immediate impact on motor performance, while more complex motor skills require longer interventions. Smart devices tracked progress, confirming sustained engagement and improvement in EG1. This research highlights PA as a promising non-pharmacological treatment for ADHD, warranting further exploration of its effects on other cognitive domains.

Paperid: 2886, https://arxiv.org/pdf/2503.09077.pdf

Abstract:
IoT-based devices and wearable sensors are now common in daily life, with smartwatches, smartphones, and other digital tools tracking physical activity and health data. This lifelogging process provides valuable insights into people's lives. This paper analyzes a publicly available lifelog dataset of 14 individuals to explore how exercise affects mood and, in turn, executive function. Results show that moderate physical activity significantly improves mood, reduces stress, and enhances cognitive functions like decision-making and focus. Improved mood not only boosts exercise performance but also strengthens executive function, suggesting exercise benefits both emotional and cognitive well-being. This opens the door for personalized exercise plans tailored to emotional states to optimize brain function.

Paperid: 2887, https://arxiv.org/pdf/2503.09001.pdf

Abstract:
Context: As the demand for digital solutions adapted to different user profiles increases, creating more inclusive and diverse software development teams becomes an important initiative to improve software product accessibility. Problem: However, neurodivergent professionals are underrepresented in this area, encountering obstacles from difficulties in communication and collaboration to inadequate software tools, which directly impact their productivity and well-being. Solution: This study seeks to understand the work experiences of neurodivergent professionals acting in different software development roles. A better understanding of their challenges and strategies to deal with them can collaborate to create more inclusive software development teams. IS Theory: We applied the Sociotechnical Theory (STS) to investigate how the social structures of organizations and their respective work technologies influence the inclusion of these professionals. Method: To address this study, we conducted semi-structured interviews with nine neurodivergent professionals in the Software Engineering field and analyzed the results by applying a continuous comparison coding strategy. Results: The results highlighted issues faced by interviewees, the main ones related to difficulties in communication, social interactions, and prejudice related to their diagnosis. Additionally, excessive in work tools became a significant challenge, leading to constant distractions and cognitive overload. This scenario negatively impacts their concentration and overall performance. Contributions and Impact in the IS area: As a contribution, this study presents empirically based recommendations to overcome sociotechnical challenges faced by neurodivergent individuals working in software development teams.

Paperid: 2888, https://arxiv.org/pdf/2503.08928.pdf

Abstract:
Gradually-typed languages feature a dynamic type that supports implicit coercions, greatly weakening the type system but making types easier to adopt. Understanding how developers use this dynamic type is a critical question for the design of useful and usable type systems. This paper reports on an in-progress corpus study of the dynamic type in Python, targeting 221 GitHub projects that use the mypy type checker. The study reveals eight patterns-of-use for the dynamic type, which have implications for future refinements of the mypy type system and for tool support to encourage precise type annotations.
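To make the "dynamic type" concrete: in Python's type hints it is `typing.Any`, which is compatible with every other type in both directions. The sketch below is an illustration only and is not one of the eight patterns from the study. The function names are hypothetical, and the remark about mypy assumes its default settings (stricter flags such as `--warn-return-any` would report the first function).

```python
from typing import Any

def parse_port(raw: Any) -> int:
    # Because `raw` is Any, a checker using default settings accepts this
    # return, even though nothing guarantees an int at runtime.
    return raw

def parse_port_precise(raw: str) -> int:
    # A precise annotation forces the conversion to be written explicitly.
    return int(raw)

port = parse_port("8080")              # annotated as int, but a str at runtime
checked = parse_port_precise("8080")   # a real int
```

The first function shows how `Any` silently weakens the guarantee its own signature advertises, which is the trade-off the abstract describes.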

Paperid: 2889, https://arxiv.org/pdf/2503.08926.pdf

Abstract:
Eye tracking has been found to be useful in various tasks including diagnostic and screening tools. However, traditional eye trackers had a complicated setup and operated at a higher frequency to measure eye movements. The use of more commonly available eye trackers such as those in head-mounted virtual reality (VR) headsets greatly expands the utility of these eye trackers for research and analytical purposes. In this study, the research question is focused on detecting saccades, which is a common task when analyzing eye tracking data, but it is not well-established for VR headset-mounted eye trackers. The aim is to determine how accurately saccadic eye movements can be detected using an eye tracker that operates at 60 or 90Hz. The study involves VR eye tracking technology and neuroscience with respect to saccadic eye movements. The goal is to build prototype software implemented using VR eye tracking technology to detect saccadic eye movements, and per-eye differences in an individual. It is anticipated that the software will be able to accurately detect when saccades occur and analyze the differences in saccadic eye movements per-eye. The field of research surrounding VR eye tracking software is still developing rapidly, specifically its applications to neuroscience. Since previous methods of eye tracking involved specialized equipment, using commercially and consumer available VR eye tracking technology to assist in the detection of saccades and per-eye differences would be novel. This project will impact the field of neuroscience by providing a tool that can be used to detect saccadic eye movements and neurological and neurodegenerative disorders. However, this project is limited by the short time frame and that the eye tracker used in this study operates at a maximum frequency of 90Hz.
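The abstract does not say which detection algorithm the prototype uses. As a point of reference, the sketch below shows a simple velocity-threshold detector, a common baseline for saccade detection. The one-dimensional gaze trace, the 30 deg/s threshold, and the function name are illustrative assumptions.

```python
def detect_saccades(angles_deg, sample_rate_hz, velocity_threshold=30.0):
    """Return (start, end) index pairs for runs of fast gaze movement.

    angles_deg: gaze angle per sample, in degrees (1-D for simplicity).
    velocity_threshold: angular speed in deg/s above which an interval
        counts as saccadic (illustrative default).
    The gaze moves between sample `start` and sample `end`.
    """
    dt = 1.0 / sample_rate_hz
    # Speed of each interval between consecutive samples.
    speeds = [abs(b - a) / dt for a, b in zip(angles_deg, angles_deg[1:])]
    saccades, start = [], None
    for i, speed in enumerate(speeds):
        if speed > velocity_threshold and start is None:
            start = i                      # first fast interval
        elif speed <= velocity_threshold and start is not None:
            saccades.append((start, i))    # first slow interval ends the run
            start = None
    if start is not None:                  # trace ended mid-saccade
        saccades.append((start, len(speeds)))
    return saccades

# Hypothetical trace: fixation at 0 deg, a 10 deg jump, fixation at 10 deg.
gaze = [0.0] * 10 + [2.0, 5.0, 8.0, 10.0] + [10.0] * 10
events = detect_saccades(gaze, sample_rate_hz=90)
```

At 90Hz each sample covers about 11 ms, so a short saccade spans only a handful of samples. This is why the sampling-rate limitation noted in the abstract matters for detection accuracy.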

Paperid: 2890, https://arxiv.org/pdf/2503.08836.pdf

Abstract:
Dimensionality reduction is used as an important tool for unraveling the complexities of high-dimensional datasets in many fields of science, such as cell biology, chemical informatics, and physics. Visualizations of the dimensionally reduced data enable scientists to delve into the intrinsic structures of their datasets and align them with established hypotheses. Visualization researchers have thus proposed many dimensionality reduction methods and interactive systems designed to uncover latent structures. At the same time, different scientific domains have formulated guidelines or common workflows for using dimensionality reduction techniques and visualizations for their respective fields. In this work, we present a critical analysis of the usage of dimensionality reduction in scientific domains outside of computer science. First, we conduct a bibliometric analysis of 21,249 academic publications that use dimensionality reduction to observe differences in the frequency of techniques across fields. Next, we conduct a survey of a 71-paper sample from four fields: biology, chemistry, physics, and business. Through this survey, we uncover common workflows, processes, and usage patterns, including the mixed use of confirmatory data analysis to validate a dataset and projection method and exploratory data analysis to then generate more hypotheses. We also find that misinterpretations and inappropriate usage are common, particularly in the visual interpretation of the resulting dimensionally reduced view. Lastly, we compare our observations with recent works in the visualization community in order to match work within our community to potential areas of impact outside our community.

Paperid: 2891, https://arxiv.org/pdf/2503.08766.pdf

Abstract:
In 2023, ASTRON took the step of incorporating a dedicated User Experience (UX) designer into its software development process. This decision aimed to enhance the accessibility and usability of services providing access to the data holdings from the telescopes we are developing. The field of astronomical software development has historically underemphasized UX design. ASTRON's initiative not only improves our own tools, but can also be used to demonstrate to the broader community the value of integrating UX expertise into development teams. We discuss how we integrate the UX designer at the start of our software development lifecycle. We end by providing some considerations on how other projects could make use of UX knowledge in their development process.

Paperid: 2892, https://arxiv.org/pdf/2503.08568.pdf

Abstract:
In recent years, major privacy laws like the GDPR have brought about positive changes. However, challenges remain in enforcing the laws, particularly due to under-resourced regulators facing a large number of potential privacy-violating software applications (apps) and the high costs of investigating them. Since 2019, China has launched a series of privacy enforcement campaigns known as Special Privacy Rectification Campaigns (SPRCs) to address widespread privacy violations in its mobile application (app) ecosystem. Unlike the enforcement of the GDPR, SPRCs are characterized by large-scale privacy reviews and strict sanctions, under the strong control of central authorities. In SPRCs, central government authorities issue administrative orders to mobilize various resources for market-wide privacy reviews of mobile apps. They enforce strict sanctions by requiring privacy-violating apps to rectify issues within a short timeframe or face removal from app stores. While there are a few reports on SPRCs, the effectiveness and potential problems of this campaign-style privacy enforcement approach remain unclear to the community. In this study, we conducted 18 semi-structured interviews with app-related engineers involved in SPRCs to better understand the campaign-style privacy enforcement. Based on the interviews, we reported our findings on a variety of aspects of SPRCs, such as the processes that app engineers regularly follow to achieve privacy compliance in SPRCs, the challenges they encounter, the solutions they adopt to address these challenges, and the impacts of SPRCs. We found that app engineers face a series of challenges in achieving privacy compliance in their apps...

Paperid: 2893, https://arxiv.org/pdf/2503.07777.pdf

Abstract:
Socialization is an essential development skill for preschool children. In collaboration with the LEGO Group, we developed Robert Robot, a simplified robot, which enables socialization between children and facilitates shared experiences when meeting for the first time. An exploratory study to observe socialization between preschool children was conducted with 30 respondents in pairs. Additionally, observational data from 212 play sessions with four Robert Robots in the wild were collected. Subsequent analysis found that children have fun as Robert Robot breaks the ice between unfamiliar children. The children relayed audio cues related to the imaginative world of Robert Robot's personalities and mimicked each other as a method of initiating social play and communication with their unfamiliar peers. Furthermore, the study contributes four implications for the design of robots for socialization between children. This chapter provides an example case of serious storytelling using playful interactions engaging children with the character of the robot and the mini-narratives around the build requests.

Paperid: 2894, https://arxiv.org/pdf/2503.06803.pdf

Abstract:
Interaction between humans and AI systems raises the question of how people understand AI systems. This has been addressed with explainable AI, the interpretability arising from users' domain expertise, or collaborating with AI in a stable environment. In the absence of these elements, we discuss designing Actionable AI, which allows non-experts to configure black-box agents. In this paper, we experiment with an AI-powered cartpole game and observe 22 pairs of participants to configure it via direct manipulation. Our findings suggest that, in uncertain conditions, non-experts were able to achieve good levels of performance. By influencing the behaviour of the agent, they exhibited an operational understanding of it, which proved sufficient to reach their goals. Based on this, we derive implications for designing Actionable AI systems. In conclusion, we propose Actionable AI as a way to open access to AI-based agents, giving end users the agency to influence such agents towards their own goals.

Paperid: 2895, https://arxiv.org/pdf/2503.06788.pdf

Abstract:
We paraphrase Descartes' famous dictum in the area of AI ethics where the "I doubt and therefore I am" is suggested as a necessary aspect of morality. Therefore AI, which cannot doubt itself, cannot possess moral agency. Of course, this is not the end of the story. We explore various aspects of the human mind that substantially differ from AI, which includes the sensory grounding of our knowing, the act of understanding, and the significance of being able to doubt ourselves. The foundation of our argument is the discipline of ethics, one of the oldest and largest knowledge projects of human history, yet, we seem only to be beginning to get a grasp of it. After a couple of thousand years of studying the ethics of humans, we (humans) arrived at a point where moral psychology suggests that our moral decisions are intuitive, and all the models from ethics become relevant only when we explain ourselves. This recognition has a major impact on what and how we can do regarding AI ethics. We do not offer a solution, we explore some ideas and leave the problem open, but we hope somewhat better understood than before our study.

Paperid: 2896, https://arxiv.org/pdf/2503.06583.pdf

Abstract:
Dynamic data physicalisation is an emerging field of research, investigating the representation and exploration of data via multiple modalities, beyond traditional visual methods. Despite the development of various data physicalisation applications in recent years, the integration of diverse hardware components remains both time-consuming and costly. Further, there is a lack of solutions for rapid prototyping and experimentation with different dynamic data physicalisation alternatives. To address this problem, we propose a modular and extensible hardware platform for dynamic data physicalisation. This platform introduces a communication architecture that ensures seamless plug-and-play functionality for modules representing different physical variables. We detail the implementation and technical evaluation of a preliminary prototype of our platform, demonstrating its potential to facilitate rapid prototyping and experimentation with various data physicalisation designs. The platform aims to support researchers and developers in the field by providing a versatile and efficient tool for the rapid prototyping and experimentation with different data physicalisation design alternatives.

Paperid: 2897, https://arxiv.org/pdf/2503.06425.pdf

Abstract:
Deaf and Hard-of-Hearing (DHH) individuals are increasingly participating as livestreamers in China's e-commerce livestreaming industry but face obstacles that limit the scope and diversity of their audience. Our paper examines these challenges and explores a potential solution for connecting the hearing audience to sign language (SL) livestreaming teams with DHH members in e-commerce livestreaming. We interviewed four SL livestreaming team members and 15 hearing audience members to identify information and emotional communication challenges that discourage the hearing audience from continuing to watch SL livestreaming. Based on these findings, we developed a virtual co-presenter demo, which targets SL livestreaming teams with DHH members as users, through a design workshop with six designers, incorporating voice broadcasting with animations. Follow-up evaluations with previous participants provided positive feedback on the virtual co-presenter's potential to address these challenges. We summarize design suggestions on its functionality and interaction design for further refinement to assist SL livestreaming teams with DHH members in reaching a broader hearing audience.

Paperid: 2898, https://arxiv.org/pdf/2503.06067.pdf

Abstract:
Effective communication of UX considerations to stakeholders (e.g., designers and developers) is a critical challenge for UX practitioners. To explore this problem, we interviewed four UX practitioners about their communication challenges and strategies. Our study identifies that providing an example user flow (a screen sequence representing a semantic task) as evidence reinforces communication, yet finding relevant examples remains challenging. To address this, we propose a method to systematically retrieve user flows using semantic embedding. Specifically, we design a model that learns to associate screens' visual features with user flow descriptions through contrastive learning. A survey confirms that our approach retrieves user flows better aligned with human perceptions of relevance. We analyze the results and discuss implications for the computational representation of user flows.
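Once screens and descriptions share an embedding space, retrieval typically reduces to ranking stored flows by similarity to a query vector. The sketch below shows only that ranking step, using cosine similarity. The three-dimensional vectors and flow names are made up for illustration, and the trained contrastive model from the paper is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, flow_vecs, k=1):
    """Return the names of the k flows most similar to the query."""
    ranked = sorted(
        flow_vecs,
        key=lambda name: cosine(query_vec, flow_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings of three stored user flows.
flows = {
    "checkout": [0.9, 0.1, 0.0],
    "sign_up": [0.1, 0.9, 0.1],
    "search": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]   # hypothetical embedding of a text description
top = retrieve(query, flows, k=2)
```

In a contrastively trained model, matching screen and description pairs are pulled toward high similarity, so this nearest-neighbor ranking is what surfaces the relevant flows.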

Paperid: 2899, https://arxiv.org/pdf/2503.05711.pdf

Abstract:
In this research, we explored the efficacy of various warning label designs for AI-generated content (e.g., deepfakes) on social media platforms. We devised and assessed ten distinct label design samples that varied across the dimensions of sentiment, color/iconography, positioning, and level of detail. Our experimental study involved 911 participants randomly assigned to these ten label designs and a control group evaluating social media content. We explored their perceptions relating to (1) belief in the content being AI-generated, (2) trust in the labels, and (3) social media engagement perceptions of the content. The results demonstrate that the presence of labels had a significant effect on the users' belief that the content is AI-generated, deepfake, or edited by AI. However, their trust in the label significantly varied based on the label design. Notably, having labels did not significantly change their engagement behaviors, such as liking, commenting, and sharing. However, there were significant differences in engagement based on content type: political and entertainment. This investigation contributes to the field of human-computer interaction by defining a design space for label implementation and providing empirical support for the strategic use of labels to mitigate the risks associated with synthetically generated media.

Paperid: 2900, https://arxiv.org/pdf/2503.05516.pdf

Abstract:
Cognitive biases, systematic deviations from rationality in judgment, pose significant challenges in generating objective content. This paper introduces a novel approach for real-time cognitive bias detection in user-generated text using large language models (LLMs) and advanced prompt engineering techniques. The proposed system analyzes textual data to identify common cognitive biases such as confirmation bias, circular reasoning, and hidden assumption. By designing tailored prompts, the system effectively leverages LLMs' capabilities to both recognize and mitigate these biases, improving the quality of human-generated content (e.g., news, media, reports). Experimental results demonstrate the high accuracy of our approach in identifying cognitive biases, offering a valuable tool for enhancing content objectivity and reducing the risks of biased decision-making.

Paperid: 2901, https://arxiv.org/pdf/2503.04692.pdf

Abstract:
Persuasion through conversation has been the focus of much research. Nudging is a popular strategy to influence decision-making in physical and digital settings. However, conversational agents employing "nudging" have not received significant attention. We explore the manifestation of cognitive biases (the underlying psychological mechanisms of nudging) and investigate how the complexity of prior dialogue tasks impacts decision-making facilitated by conversational agents. Our research used a between-group experimental design, involving 756 participants randomly assigned to either a simple or complex task before encountering a decision-making scenario. Three scenarios were adapted from Samuelson's classic experiments on status-quo bias, the underlying mechanism of default nudges. Our results aligned with previous studies in two out of three simple-task scenarios. Increasing task complexity consistently shifted effect-sizes toward our hypothesis, though bias was significant in only one case. These findings inform conversational nudging strategies and highlight inherent biases relevant to behavioural economics.

Paperid: 2902, https://arxiv.org/pdf/2503.04460.pdf

Abstract:
Background. Career abandonment is the process in which professionals leave their field of activity and take up positions in another area. Among software developers, it involves frustration over lost investment as well as emotional and financial costs, even though it can be beneficial for the individual, depending on personal context. Previous studies have identified work-related motivators for career abandonment, such as the threat of obsolescence, unstable requirements, and low code quality, though these factors have primarily been examined in former developers. The relationship between these motivators and the intention to abandon among currently active developers remains unexplored. Goal. This article investigates the relationship between key work-related motivators and currently active software developers' intention to abandon their careers. Method. We employed a quantitative approach, surveying 221 software developers to validate a theoretical model for career abandonment intention, based on an adaptation of the Investment Model, which incorporates satisfaction with technical aspects of the profession as well as the intention to abandon. Findings. Exploratory and confirmatory factor analyses, through structural equation modeling (SEM), provided robust support for the adapted Investment Model in explaining software developers' intention to abandon their careers. Moreover, career commitment significantly impacts the intention to leave the profession, being positively influenced by satisfaction with technical work-related factors and negatively influenced by career alternatives and career investment. Conclusion. The paper offers valuable insights for organizational leaders and research, potentially guiding retention strategies to better support developers, and the adoption of theoretical models to explain career abandonment.

Paperid: 2903, https://arxiv.org/pdf/2503.04343.pdf

Abstract:
Human trust in social robots is a complex attitude based on cognitive and emotional evaluations, as well as a behavior, like task delegation. While previous research explored the features of robots that influence overall trust attitude, it remains unclear whether these features affect behavioral trust. Additionally, there is limited investigation into which features of robots influence cognitive and emotional attitudes, and how these attitudes impact humans' willingness to delegate new tasks to robots. This study examines the interplay between competence, autonomy, and personality traits of robots and their impact on trust attitudes (cognitive and affective trust) and trust behavior (task delegation), within the context of task-oriented Human-Robot Interaction. Our findings indicate that robot competence is a key determinant of trust, influencing cognitive, affective, and behavioral trust. In contrast, robot personality traits significantly impact only affective trust without affecting cognitive trust or trust behavior. In addition, autonomy was found to moderate the relationship between competence and cognitive trust, as well as between personality and affective trust. Finally, cognitive trust was found to positively influence task delegation, whereas affective trust did not show a significant effect. This paper contributes to the literature on Human-Robot Trust by providing novel evidence that enhances the acceptance and effectiveness of social robots in collaborative scenarios.

Paperid: 2905, https://arxiv.org/pdf/2503.04293.pdf

Abstract:
Software-developing small and medium enterprises (SMEs) play a crucial role as suppliers to larger corporations and public administration. It is therefore necessary for them to be able to demonstrate that their products meet certain security criteria, both to gain the trust of their customers and to comply with standards that demand such a demonstration. In this study we have investigated ways for SMEs to demonstrate their security when operating in a business-to-business model, conducting semi-structured interviews (N=16) with practitioners from different SMEs in Denmark and validating our findings in a follow-up workshop (N=6). Our findings indicate five distinctive security demonstration approaches, namely: Certifications, Reports, Questionnaires, Interactive Sessions and Social Proof. We discuss the challenges, benefits, and recommendations related to these approaches, concluding that none of them is a one-size-fits-all solution and that more research into relative advantages of these approaches and their combinations is needed.

Paperid: 2906, https://arxiv.org/pdf/2503.04217.pdf

Abstract:
Cybersickness remains a significant challenge in virtual reality (VR), limiting its usability across various applications. Existing mitigation strategies focus on optimising VR hardware and/or software and enhancing self-motion perception to minimise sensory conflict. However, anticipatory postural adaptation, a strategy widely studied with regard to motion sickness while being driven, has not been systematically examined in VR. Therefore, in this study, we explore whether adopting comfort-orientated postural movements, based on the literature, mitigates cybersickness. We conducted an exploratory analysis using a cumulative link mixed model (CLMM) on secondary data from a VR-based postural alignment experiment. Results indicate that misalignment between trunk roll and the virtual trajectory increases the odds of reporting higher cybersickness scores by 5%. Additionally, each additional minute in VR increases the odds of reporting higher cybersickness scores (FMS scores) by 11%, but prolonged exposure leads to a 75% reduction in the odds of reporting cybersickness symptoms, suggesting adaptation effects. Individual differences also play a role, with higher cybersickness susceptibility increasing the odds of reporting higher symptom severity by 8%. These findings indicate that anticipatory postural adaptation could serve as a natural mitigation strategy for cybersickness. VR applications, particularly in training and simulation, may benefit from designing adaptive cues that encourage users to align their posture with virtual movement. Future research should explore real-time postural feedback mechanisms to enhance user comfort and reduce cybersickness.
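The percentages reported for the CLMM are changes in odds, which correspond to coefficients on the log-odds scale. The sketch below shows that conversion and how two effects combine. It assumes the usual logit link with additive effects and no interaction terms, which the abstract does not confirm, and the helper names are hypothetical.

```python
import math

def log_odds_coefficient(percent_change):
    """Model coefficient implied by a percent change in odds."""
    return math.log(1.0 + percent_change / 100.0)

def combined_odds_ratio(*percent_changes):
    """Odds ratio when several effects act together.

    Assumes effects add on the log-odds scale (no interactions).
    """
    total = sum(log_odds_coefficient(p) for p in percent_changes)
    return math.exp(total)

beta_misalignment = log_odds_coefficient(5)    # the reported 5% increase
beta_susceptibility = log_odds_coefficient(8)  # the reported 8% increase
combined = combined_odds_ratio(5, 8)
```

Under these assumptions the two effects multiply rather than add, so a participant with both higher misalignment and higher susceptibility would have odds scaled by 1.05 times 1.08, slightly more than a 13% increase.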

Paperid: 2907, https://arxiv.org/pdf/2503.04075.pdf

Abstract:
Augmented Reality Head-Mounted Displays (AR-HMDs) have proven effective in assisting workers. However, they may degrade workers' Safety and Situational Awareness (SSA), particularly in complex and hazardous industrial settings. This paper analyzes, objectively and subjectively, the effects of AR-HMDs on workers' SSA in a simulated hazardous industrial environment. Our evaluation comprised sixty participants performing various tasks in a simulated cargo ship room while receiving remote guidance through one of three devices: two off-the-shelf AR-HMDs (Trimble XR10 with HoloLens 2, RealWear Navigator 520), and a smartphone (Google Pixel 6). Several sensors were installed throughout the room to obtain quantitative measures of the participants' safe execution of the tasks, such as the frequency with which they hit the objects in the room or stepped over simulated holes or oil spills. The results showed that the Trimble XR10 led to significantly more head-knocker and knee-knocker incidents than the Navigator 520 and the Pixel 6. Furthermore, the Trimble XR10 also led to significantly greater difficulty in crossing hatch doors, and to lower perceived safety, comfort, perceived performance, and usability. Overall, participants wearing AR-HMDs failed to perceive more hazards, meaning that safety-preserving capabilities must be developed for AR-HMDs before they can be confidently introduced into hazardous industrial settings.

Paperid: 2908, https://arxiv.org/pdf/2503.03953.pdf

Abstract:
Static maps and animations remain popular in spatial epidemiology of dengue, limiting the analytical depth and scope of visualisations. Over half of the global population live in dengue endemic regions. Understanding the spatiotemporal dynamics of the four closely related dengue serotypes, and their immunological interactions, remains a challenge at a global scale. To facilitate this understanding, we worked with dengue epidemiologists in a user-centered design framework to create GeoDEN, an exploratory visualisation tool that empowers experts to investigate spatiotemporal patterns in dengue serotype reports. The tool has several linked visualisations and filtering mechanisms, enabling analysis at a range of spatial and temporal scales. To identify successes and failures, we present both insight-based and value-driven evaluations. Our domain experts found GeoDEN valuable, verifying existing hypotheses and uncovering novel insights that warrant further investigation by the epidemiology community. The developed visual exploration approach can be adapted for exploring other epidemiology and disease incident datasets.

Paperid: 2909, https://arxiv.org/pdf/2503.03428.pdf

Abstract:
In a world where data is the new currency, wearable health devices offer unprecedented insights into daily life, continuously monitoring vital signs and metrics. However, this convenience raises privacy concerns, as these devices collect sensitive data that can be misused or breached. Traditional measures often fail due to real-time data processing needs and limited device power. Users also lack awareness and control over data sharing and usage. We propose a Privacy-Enhancing Technology (PET) framework for wearable devices, integrating federated learning, lightweight cryptographic methods, and selectively deployed blockchain technology. The blockchain acts as a secure ledger triggered only upon data transfer requests, granting users real-time notifications and control. By dismantling data monopolies, this approach returns data sovereignty to individuals. Through real-world applications like secure medical data sharing, privacy-preserving fitness tracking, and continuous health monitoring, our framework reduces privacy risks by up to 70 percent while preserving data utility and performance. This innovation sets a new benchmark for wearable privacy and can scale to broader IoT ecosystems, including smart homes and industry. As data continues to shape our digital landscape, our research underscores the critical need to maintain privacy and user control at the forefront of technological progress.

Paperid: 2910, https://arxiv.org/pdf/2503.03398.pdf

Abstract:
Generative AI (GenAI) tools are increasingly integrated into design workflows. While text prompts remain the primary input method for GenAI image tools, designers often struggle to craft effective ones. Moreover, research has primarily focused on input methods for ideation, with limited attention to refinement tasks. This study explores designers' preferences for three input methods - text prompts, annotations, and scribbles - through a preliminary digital paper-based study with seven professional designers. Designers preferred annotations for spatial adjustments and referencing in-image elements, while scribbles were favored for specifying attributes such as shape, size, and position, often combined with other methods. Text prompts excelled at providing detailed descriptions or when designers sought greater GenAI creativity. However, designers expressed concerns about AI misinterpreting annotations and scribbles and the effort needed to create effective text prompts. These insights inform GenAI interface design to better support refinement tasks, align with workflows, and enhance communication with AI systems.

Paperid: 2911, https://arxiv.org/pdf/2503.03134.pdf

Abstract:
Generative AI (GenAI) tools enhance social media video creation by streamlining tasks such as scriptwriting, visual and audio generation, and editing. These tools enable the creation of new content, including text, images, audio, and video, with platforms like ChatGPT and MidJourney becoming increasingly popular among YouTube creators. Despite their growing adoption, knowledge of their specific use cases across the video production process remains limited. This study analyzes 274 YouTube how-to videos to explore GenAI's role in planning, production, editing, and uploading. The findings reveal that YouTubers use GenAI to identify topics, generate scripts, create prompts, and produce visual and audio materials. Additionally, GenAI supports editing tasks like upscaling visuals and reformatting content while also suggesting titles and subtitles. Based on these findings, we discuss future directions for incorporating GenAI to support various video creation tasks.

Paperid: 2912, https://arxiv.org/pdf/2503.02885.pdf

Abstract:
In recent years, Large Language Models (LLMs) have rapidly gained popularity across all parts of society, including education. After initial skepticism and bans, many schools have chosen to embrace this new technology by integrating it into their curricula in the form of virtual tutors and teaching assistants. However, neither the companies developing this technology nor the public institutions involved in its implementation have set up a formal system to collect feedback from the stakeholders impacted by it. In this paper, we argue that understanding the perceptions of those directly or indirectly impacted by LLMs in the classroom, including parents and school staff, is essential for ensuring responsible use of AI in this critical domain. Our contributions are two-fold. First, we propose the Contextualized Perceptions for the Adoption of LLMs in Education (Co-PALE) framework, which can be used to systematically elicit perceptions and inform whether and how LLM-based tools should be designed, developed, and deployed in the classroom. Second, we explain how our framework can be used to ground specific rubrics for eliciting perceptions of the relevant stakeholders in view of specific goals and context of implementation. Overall, Co-PALE is a practical step toward helping educational agents, policymakers, researchers, and technologists ensure the responsible and effective deployment of LLM-based systems across diverse learning contexts.

Paperid: 2913, https://arxiv.org/pdf/2503.02861.pdf

Abstract:
Recent advancements in multimodal Generative AI have the potential to democratize specialized architectural tasks, such as interpreting technical drawings and creating 3D CAD models, which traditionally require expert knowledge. This paper presents a comparative evaluation of two systems: GPT-4o and Claude 3.5, in the task of architectural 3D synthesis. We conduct a case study on two buildings from Palladio's Four Books of Architecture (1965): Villa Rotonda and Palazzo Porto. High-level architectural models and drawings of these buildings were prepared, inspired by Palladio's original texts and drawings. Through sequential text and image prompting, we assess the systems' abilities in (1) interpreting 2D and 3D representations of buildings from drawings, (2) encoding the buildings into a CAD software script, and (3) self-improving based on outputs. While both systems successfully generate individual parts, they struggle to accurately assemble these parts into the desired spatial relationships, with Claude 3.5 demonstrating better performance, particularly in self-correcting its output. This study contributes to ongoing research on benchmarking the strengths and weaknesses of off-the-shelf AI systems in performing intelligent human tasks that require discipline-specific knowledge. The findings highlight the potential of language-enabled AI systems to act as collaborative technical assistants in the architectural design process.

Paperid: 2914, https://arxiv.org/pdf/2503.02743.pdf

Abstract:
Stuttering is a clinical speech disorder that disrupts fluency and leads to significant psychological and social challenges. This study evaluates the effectiveness of Eloquent, a digital speech therapy app for stuttering, by analyzing pre-therapy and post-therapy speech samples using the Stuttering Severity Index-4 (SSI-4) and the S24 communication and attitude scale. Results showed a 52.7% reduction in overall SSI-4 scores, with marked improvements in reading (45%), speaking (46%), duration (57%), and physical concomitants (63%) scores. Over 75% of participants improved by at least one severity category. S24 scores decreased by 33.5%, indicating more positive self-perceptions of speech and reduced avoidance. These findings highlight the potential of structured, technology-driven speech therapy interventions to deliver measurable improvements in stuttering severity and communication confidence.

Paperid: 2915, https://arxiv.org/pdf/2503.02133.pdf

Abstract:
Given the ubiquity of SmartTVs and head-mounted-display-based virtual environments, recent research has explored techniques to support eyes-free text entry using touchscreen devices. However, proposed techniques, leveraging lexicons, limit the user's ability to enter out-of-vocabulary words. In this paper, we investigate how to enter text while relying on unambiguous input to support out-of-vocabulary words. Through an iterative design approach, and after a careful investigation of actions that can be accurately and rapidly performed eyes-free, we devise DuSK, a Dual-handed, Stroke-based, 1Keyboarding technique. In a controlled experiment, we show initial speeds of 10 WPM steadily increasing to 13 WPM with training. DuSK outperforms the common cursor-based text entry technique widely deployed in commercial SmartTVs (8 WPM) and is comparable to other eyes-free lexicon-based techniques, but with the added benefit of supporting out-of-vocabulary word input.

Paperid: 2916, https://arxiv.org/pdf/2503.01870.pdf

Abstract:
Identifying customer needs (CNs) is important for product management, product development, and marketing. Applications rely on professional analysts interpreting textual data (e.g., interview transcripts, online reviews) to understand the nuances of customer experience and concisely formulate "jobs to be done." The task is cognitively complex and time-consuming. Current practice facilitates the process with keyword search and machine learning but relies on human judgment to formulate CNs. We examine whether Large Language Models (LLMs) can automatically extract CNs. Because evaluating CNs requires professional judgment, we partnered with a marketing consulting firm to conduct a blind study of CNs extracted by: (1) a foundational LLM with prompt engineering only (Base LLM), (2) an LLM fine-tuned with professionally identified CNs (SFT LLM), and (3) professional analysts. The SFT LLM performs as well as or better than professional analysts when extracting CNs. The extracted CNs are well-formulated, sufficiently specific to identify opportunities, and justified by source content (no hallucinations). The SFT LLM is efficient and provides more complete coverage of CNs. The Base LLM was not sufficiently accurate or specific. Organizations can rely on SFT LLMs to reduce manual effort, enhance the precision of CN articulation, and provide improved insight for innovation and marketing strategy.

Paperid: 2917, https://arxiv.org/pdf/2503.01767.pdf

Abstract:
VR simulation in Health Professions (HP) education demonstrates huge potential, but fixed learning content with little customization limits its application beyond lab environments. To address these limitations in the context of VR for patient communication training, we conducted a user-centered study involving semi-structured interviews with advanced HP students to understand their challenges in clinical communication training and perceptions of VR-based solutions. From this, we derived design insights emphasizing the importance of realistic scenarios, simple interactions, and unpredictable dialogues. Building on these insights, we developed the Virtual AI Patient Simulator (VAPS), a novel VR system powered by Large Language Models (LLMs) and Embodied Conversational Agents (ECAs), supporting dynamic and customizable patient interactions for immersive learning. We also provided an example of how clinical professors could use user-friendly design forms to create personalized scenarios that align with course objectives in VAPS and discuss future implications of integrating AI-driven technologies into VR education.

Paperid: 2918, https://arxiv.org/pdf/2503.00992.pdf

Abstract:
In this paper, we leverage psychological methods to investigate LLMs' conceptual mastery in applying rules. We introduce a novel procedure to match the diversity of thought generated by LLMs to that observed in a human sample. We then conducted two experiments comparing rule-based decision-making in humans and LLMs. Study 1 found that all investigated LLMs replicated human patterns regardless of whether they were prompted with scenarios created before or after their training cut-off. Moreover, we found unanticipated differences between the two sets of scenarios among humans. Surprisingly, even these differences were replicated in LLM responses. Study 2 turned to a contextual feature of human rule application: under forced time delay, human samples rely more heavily on a rule's text than on other considerations such as a rule's purpose. Our results revealed that some models (Gemini Pro and Claude 3) responded in a human-like manner to a prompt describing either forced delay or time pressure, while others (GPT-4o and Llama 3.2 90b) did not. We argue that the evidence gathered suggests that LLMs have mastery over the concept of rule, with implications for both legal decision making and philosophical inquiry.

Paperid: 2919, https://arxiv.org/pdf/2503.00681.pdf

Abstract:
Large Language Models (LLMs), such as ChatGPT, exhibit advanced capabilities in generating text, images, and videos. However, their effective use remains constrained by challenges in prompt formulation, personalization, and opaque decision-making processes. To investigate these challenges and identify design opportunities, we conducted a two-phase qualitative study. In Phase 1, we performed in-depth interviews with eight everyday LLM users after they engaged in structured tasks using ChatGPT across both familiar and unfamiliar domains. Our findings revealed key user difficulties in constructing effective prompts, iteratively refining AI-generated responses, and assessing response reliability especially in domains beyond users' expertise. Informed by these insights, we designed a high-fidelity prototype incorporating Reflective Prompting, Section Regeneration, Input-Output Mapping, Confidence Indicators, and a Customization Panel. In Phase 2, user testing of the prototype indicated that these interface-level improvements may prove useful for reducing cognitive load, increasing transparency, and fostering more intuitive and collaborative human-AI interactions. Our study contributes to the growing discourse on human-centred AI, advocating for human-LLM interactions that enhance user agency, transparency, and co-creative interaction, ultimately supporting more intuitive, accessible, and trustworthy generative AI systems.

Paperid: 2920, https://arxiv.org/pdf/2502.21300.pdf

Abstract:
Metcalfe et al (1) argue that the greatest potential for human-AI partnerships lies in their application to highly complex problem spaces. Herein, we discuss three different forms of hybrid team intelligence and posit that across all three forms, the hybridization of man and machine intelligence can be effective under the right conditions. We foresee two significant research and development (R&D) challenges underlying the creation of effective hybrid intelligence. First, rapid advances in machine intelligence and/or fundamental changes in human behaviors or capabilities over time can outpace R&D. Second, the future conditions under which hybrid intelligence will operate are unknown, but unlikely to be the same as the conditions of today. Overcoming both of these challenges requires a deep understanding of multiple human-centric and machine-centric disciplines that creates a large barrier to entry into the field. Herein, we outline an open, shareable research platform that creates a form of hybrid team intelligence that functions under representative future conditions. The intent for the platform is to facilitate new forms of hybrid intelligence research allowing individuals with human-centric or machine-centric backgrounds to rapidly enter the field and initiate research. Our hope is that through open, community research on the platform, state-of-the-art advances in human and machine intelligence can quickly be communicated across what are currently different R&D communities and allow hybrid team intelligence research to stay at the forefront of scientific advancement.

Paperid: 2921, https://arxiv.org/pdf/2502.20938.pdf

Abstract:
The growing popularity and widespread adoption of large language models (LLMs) necessitates the development of tools that enhance the effectiveness of user interactions with these models. Understanding the structures and functions of these models poses a significant challenge for users. Visual analytics-driven tools enable users to explore and compare model outputs, facilitating better decision-making. This paper presents a visual analytics-driven tool equipped with interactive controls for key hyperparameters, including top-p, frequency penalty, and presence penalty, enabling users to explore, examine, and compare the outputs of LLMs. In a user study, we assessed the tool's effectiveness, which received favorable feedback for its visual design, with particular commendation for the interface layout and ease of navigation. Additionally, the feedback provided valuable insights for enhancing the effectiveness of Human-LLM interaction tools.
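
As an illustration of what the hyperparameters named above control, the following is a minimal sketch, not the tool's implementation. It assumes one common formulation: frequency and presence penalties are subtracted from token logits based on how often each token has already appeared, and top-p (nucleus) filtering keeps the smallest set of tokens whose cumulative probability reaches the threshold. The function names `apply_penalties` and `top_p_filter` are hypothetical.

```python
import numpy as np


def apply_penalties(logits, counts, frequency_penalty, presence_penalty):
    """Lower the logits of tokens that already appeared (assumed formulation).

    counts[j] is how many times token j occurs in the text generated so far.
    """
    logits = np.asarray(logits, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Frequency penalty scales with the count; presence penalty is a flat
    # deduction applied once to any token that has appeared at all.
    return logits - counts * frequency_penalty - (counts > 0) * presence_penalty


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")      # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # First position where the running mass reaches top_p (inclusive).
    cutoff = int(np.searchsorted(cumulative, top_p, side="left")) + 1
    cutoff = min(cutoff, probs.size)
    filtered = np.zeros_like(probs)
    kept = order[:cutoff]
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()


if __name__ == "__main__":
    print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7))
    print(apply_penalties([1.0, 1.0, 1.0], [2, 0, 1], 0.5, 0.1))
```

Lower top-p values restrict sampling to fewer high-probability tokens, while larger penalties discourage repetition; comparing outputs across such settings is the kind of exploration the abstract describes.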

Paperid: 2922, https://arxiv.org/pdf/2502.20140.pdf

Abstract:
Telephone surveys remain a valuable tool for gathering insights but typically require substantial resources in training and coordinating human interviewers. This work presents an AI-driven telephone survey system integrating text-to-speech (TTS), a large language model (LLM), and speech-to-text (STT) that mimics the versatility of human-led interviews (full-duplex dialogues) at scale. We tested the system across two populations, a pilot study in the United States (n = 75) and a large-scale deployment in Peru (n = 2,739), inviting participants via web-based links and contacting them via direct phone calls. The AI agent successfully administered open-ended and closed-ended questions, handled basic clarifications, and dynamically navigated branching logic, allowing fast large-scale survey deployment without interviewer recruitment or training. Our findings demonstrate that while the AI system's probing for qualitative depth was more limited than that of human interviewers, overall data quality approached human-led standards for structured items. This study represents one of the first successful large-scale deployments of an LLM-based telephone interviewer in a real-world survey context. The AI-powered telephone survey system has the potential to expand scalable, consistent data collection across market research, social science, and public opinion studies, thus improving operational efficiency while maintaining appropriate data quality for research.

Paperid: 2923, https://arxiv.org/pdf/2502.19066.pdf

Abstract:
Electrotactile feedback is a promising method for delivering haptic sensations, but challenges such as the naturalness of sensations hinder its adoption in commercial devices. In this study, we introduce a novel device that enables the exploration of complex stimulation signals to enhance sensation naturalness. We designed six stimulation signals with linearly modulated frequency, amplitude, or both, across two frequency levels based on a ramp-and-hold shape, aiming to replicate the sensation of pressing a button. Our results showed that these modulated signals achieve higher naturalness scores than tonic stimulations, with a 6.8% improvement. Moreover, we examined the relationship between perceived intensity and signal energy for these stimulation patterns. Our findings indicate that, under conditions of constant perceived intensity, signal energy is not uniform across different stimulation patterns. Instead, there is a distinct relationship between the energy levels of different patterns, which is consistently reflected in the energy of the stimulations selected by the participants. Based on our findings, we propose a predictive model that estimates the desired intensity for any stimulation pattern using this relationship between signal energies and the user's preferred intensity for a single reference pattern. This model demonstrated high reliability, with a mean R^2 score of 83.33%. Using this approach, intensity calibration for different stimulation patterns can be streamlined, reducing calibration time by 87.5%, as only one out of eight reference patterns must be calibrated. These findings highlight the potential of stimulation signal modulation to improve sensation quality and validate the viability of our predictive model for automating intensity calibration. This approach is an essential step toward delivering complex and naturalistic sensations in advanced haptic systems.

Paperid: 2924, https://arxiv.org/pdf/2502.18719.pdf

Abstract:
Achieving high subject-independent accuracy in functional near-infrared spectroscopy (fNIRS)-based brain-computer interfaces (BCIs) remains a challenge, particularly when minimizing the number of channels. This study proposes a novel feature extraction scheme and a Pearson correlation-based channel selection algorithm to enhance classification accuracy while reducing hardware complexity. Using an open-access fNIRS dataset, our method improved average accuracy by 28.09% compared to existing approaches, achieving a peak subject-independent accuracy of 95.98% with only two channels. These results demonstrate the potential of our optimized feature extraction and channel selection methods for developing efficient, subject-independent fNIRS-based BCI systems.
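
The abstract does not spell out the channel selection algorithm, so the following is only a hedged sketch of one way a Pearson correlation-based selection could work. It assumes each channel has been reduced to one scalar feature per trial and that channels are ranked by the absolute correlation between that feature and the class label; the function name `select_channels` and this ranking criterion are illustrative assumptions, not the authors' method.

```python
import numpy as np


def select_channels(features, labels, k):
    """Rank channels by |Pearson r| with the labels and keep the top k.

    features: array of shape (n_trials, n_channels), one scalar feature per channel.
    labels:   array of shape (n_trials,), numeric class labels.
    Returns (indices of the k selected channels, r value for every channel).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    centered_f = features - features.mean(axis=0)
    centered_l = labels - labels.mean()
    numerator = (centered_f * centered_l[:, None]).sum(axis=0)
    denominator = np.sqrt((centered_f ** 2).sum(axis=0) * (centered_l ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = numerator / denominator
    r = np.nan_to_num(r)  # a constant channel has undefined r; treat it as 0
    order = np.argsort(-np.abs(r), kind="stable")
    return order[:k], r


if __name__ == "__main__":
    labels = np.array([0, 1, 0, 1, 0, 1])
    features = np.column_stack([
        labels,                      # perfectly correlated channel
        np.full(6, 3.0),             # constant, uninformative channel
        [1, 0, 1, 0, 1, 0.5],        # strongly anti-correlated channel
        [1, 2, 3, 4, 5, 6],          # weakly correlated channel
    ])
    selected, r = select_channels(features, labels, k=2)
    print(selected, np.round(r, 3))
```

Ranking by absolute correlation keeps anti-correlated channels, which can be as discriminative as positively correlated ones; a two-channel configuration like the one reported would correspond to `k=2` here.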

Paperid: 2925, https://arxiv.org/pdf/2502.18614.pdf

Abstract:
Besides the utilitarian aspects of online shopping, hedonic motivations play a significant role in shaping the shopping behavior of online users. With the increased popularity of voice-enabled devices, online shopping platforms have attempted to drive online shopping on voice. However, we explain why voice might be more suitable for the hedonic aspects of shopping. We introduce a prototype that enables such focus in a voice experience and share our findings from a qualitative study.

Paperid: 2926, https://arxiv.org/pdf/2502.18500.pdf

Abstract:
In this study, we examine the role of Twitter as a first line of defense against misinformation by tracking the public engagement with, and the platform's response to, 500 tweets concerning the Russo-Ukrainian conflict that were identified as misinformation. Using a real-time sample of 543,475 of their retweets, we find that users who geolocate themselves in the U.S. both produce and consume the largest portion of misinformation; however, accounts claiming to be in Ukraine are the second-largest source. At the time of writing, 84% of these tweets were still available on the platform, especially those having an anti-Russia narrative. For those that did receive some sanctions, the retweeting rate had already stabilized, pointing to the ineffectiveness of the measures to stem their spread. These findings point to the need for a change in the existing anti-misinformation ecosystem. We propose several design and research guidelines for its possible improvement.

Paperid: 2927, https://arxiv.org/pdf/2502.18395.pdf

Abstract:
A plethora of toolkits, checklists, and workshops have been developed to bridge the well-documented gap between AI ethics principles and practice. Yet little is known about the effects of such interventions on practitioners. We conducted an ethnographic investigation in a major European city organization that developed and works to integrate an ethics toolkit into city operations. We find that the integration of ethics tools by technical teams destabilises their boundaries, roles, and mandates around responsibilities and decisions. This leads to emotional discomfort and feelings of vulnerability, which neither toolkit designers nor the organization had accounted for. We leverage the concept of moral stress to argue that this affective experience is a core challenge to the successful integration of ethics tools in technical practice. Even in this best-case scenario, organisational structures were not able to deal with the moral stress that resulted from attempts to implement responsible technology development practices.

Paperid: 2928, https://arxiv.org/pdf/2502.18163.pdf

Abstract:
Overtaking on country roads with possible opposing traffic is a dangerous maneuver and many proposed assistant systems assume car-to-car communication and sensors currently unavailable in cars. To overcome this limitation, we develop an assistant that uses simple in-car sensors to predict the required sight distance for safe overtaking. Our models predict this from vehicle speeds, accelerations, and 3D map data. In a user study with a Virtual Reality driving simulator (N=25), we compare two UI variants (monitoring-focused vs scheduling-focused). The results reveal that both UIs enable more patient driving and thus increase overall driving safety. While the monitoring-focused UI achieves higher System Usability Score and distracts drivers less, the preferred UI depends on personal preference. Driving data shows predictions were off at times. We investigate and discuss this in a comparison of our models to actual driving behavior and identify crucial model parameters and assumptions that significantly improve model predictions.

Paperid: 2929, https://arxiv.org/pdf/2502.17714.pdf

Abstract:
Generative AI systems are transforming content creation, but their usability remains a key challenge. This paper examines usability factors such as user experience, transparency, control, and cognitive load. Common challenges include unpredictability and difficulties in fine-tuning outputs. We review evaluation metrics like efficiency, learnability, and satisfaction, highlighting best practices from various domains. Improving interpretability, intuitive interfaces, and user feedback can enhance usability, making generative AI more accessible and effective.

Paperid: 2930, https://arxiv.org/pdf/2502.17172.pdf

Abstract:
Affective computing has made significant strides in emotion recognition and generation, yet current approaches mainly focus on short-term pattern recognition and lack a comprehensive framework to guide affective agents toward long-term human well-being. To address this, we propose a teleology-driven affective computing framework that unifies major emotion theories (basic emotion, appraisal, and constructivist approaches) under the premise that affect is an adaptive, goal-directed process that facilitates survival and development. Our framework emphasizes aligning agent responses with both personal/individual and group/collective well-being over extended timescales. We advocate for creating a "dataverse" of personal affective events, capturing the interplay between beliefs, goals, actions, and outcomes through real-world experience sampling and immersive virtual reality. By leveraging causal modeling, this "dataverse" enables AI systems to infer individuals' unique affective concerns and provide tailored interventions for sustained well-being. Additionally, we introduce a meta-reinforcement learning paradigm to train agents in simulated environments, allowing them to adapt to evolving affective concerns and balance hierarchical goals - from immediate emotional needs to long-term self-actualization. This framework shifts the focus from statistical correlations to causal reasoning, enhancing agents' ability to predict and respond proactively to emotional challenges, and offers a foundation for developing personalized, ethically aligned affective systems that promote meaningful human-AI interactions and societal well-being.

Paperid: 2931, https://arxiv.org/pdf/2502.17089.pdf

Abstract:
Hybrid paper interfaces leverage augmented reality to combine the desired tangibility of paper documents with the affordances of interactive digital media. Typically, virtual content can be embedded through direct links (e.g., QR codes); however, this impacts the aesthetics of the paper print and limits the available visual content space. To address this problem, we present Imprinto, an infrared inkjet watermarking technique that allows for invisible content embeddings only by using off-the-shelf IR inks and a camera. Imprinto was established through a psychophysical experiment, studying how much IR ink can be used while remaining invisible to users regardless of background color. We demonstrate that we can detect invisible IR content through our machine learning pipeline, and we developed an authoring tool that optimizes the amount of IR ink on the color regions of an input document for machine and human detectability. Finally, we demonstrate several applications, including augmenting paper documents and objects.

Paperid: 2932, https://arxiv.org/pdf/2502.16124.pdf

Abstract:
Zero-Input AI (ZIA) introduces a novel framework for human-computer interaction by enabling proactive intent prediction without explicit user commands. It integrates gaze tracking, bio-signals (EEG, heart rate), and contextual data (time, location, usage history) into a multi-modal model for real-time inference, targeting <100 ms latency. The proposed architecture employs a transformer-based model with cross-modal attention, variational Bayesian inference for uncertainty estimation, and reinforcement learning for adaptive optimization. To support deployment on edge devices (CPUs, TPUs, NPUs), ZIA utilizes quantization, weight pruning, and linear attention to reduce complexity from quadratic to linear with sequence length. Theoretical analysis establishes an information-theoretic bound on prediction error and demonstrates how multi-modal fusion improves accuracy over single-modal approaches. Expected performance suggests 85-90% accuracy with EEG integration and 60-100 ms inference latency. ZIA provides a scalable, privacy-preserving framework for accessibility, healthcare, and consumer applications, advancing AI toward anticipatory intelligence.
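
The abstract does not say which linear attention variant ZIA uses. As a minimal sketch under that caveat, the following shows one widely used kernel feature-map formulation (with the assumed feature map elu(x) + 1) and why it scales linearly with sequence length: the key-value summary is computed once and reused for every query, so the n-by-n attention matrix is never formed. All function names are illustrative.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1: a strictly positive feature map (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention(q, k, v):
    """Non-causal linear attention, O(n * d * d_v) in sequence length n.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                 # (d, d_v) summary shared by all queries
    z = phi_k.sum(axis=0)            # (d,) normalizer terms
    return (phi_q @ kv) / (phi_q @ z)[:, None]


def quadratic_reference(q, k, v):
    """Same result computed the O(n^2) way, for checking only."""
    weights = feature_map(q) @ feature_map(k).T   # explicit (n, n) matrix
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
    print(np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v)))
```

Both functions compute the same output; only the order of the matrix products differs, which is what removes the quadratic term. Note that this is not equivalent to softmax attention, so accuracy trade-offs would need to be evaluated separately.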

Paperid: 2933, https://arxiv.org/pdf/2502.16083.pdf

Abstract:
In the context of explainable artificial intelligence (XAI), limited research has identified role-specific explanation needs. This study investigates the explanation needs of data scientists, who are responsible for training, testing, deploying, and maintaining machine learning (ML) models in AI systems. The research aims to determine the specific explanation content that data scientists need. A task analysis identified user goals and proactive user tasks. Using explanation questions, task-specific explanation needs and content were identified. From these individual explanations, we developed a mental model for explanations, which was validated and revised through a qualitative study (n=12). In a second quantitative study (n=12), we examined which explanation intents (reason, comparison, accuracy, prediction, trust) require which type of explanation content from the mental model. The findings are: F1: Explanation content for data scientists comes from the application domain, system domain, and AI domain. F2: Explanation content can be complex and should be organized sequentially and/or in hierarchies (novelty claim). F3: Explanation content includes context, inputs, evidence, attributes, ranked list, interim results, efficacy principle, and input/output relationships (novelty claim). F4: Explanation content should be organized as a causal story. F5: Standardized explanation questions ensure complete coverage of explanation needs (novelty claim). F6: Refining mental models for explanations significantly increases their quality (novelty claim).

Paperid: 2934, https://arxiv.org/pdf/2502.15948.pdf

Abstract:
Using social robots and virtual agents (VAs) as interfaces for health monitoring systems for older adults offers the possibility of more engaging interactions that can support long-term health and well-being. While robots are characterized by their physical presence, software-based VAs are more scalable and flexible. Few comparisons of these interfaces exist in the human-robot and human-agent interaction domains, especially in long-term and real-world studies. In this work, we examined impressions of social robots and VAs at the beginning and end of an eight-week study in which older adults interacted with these systems independently in their homes. Using a between-subjects design, participants could choose which interface to evaluate during the study. While participants perceived the social robot as somewhat more likable, the VA was perceived as more intelligent. Our work provides a basis for further studies investigating factors most relevant for engaging interactions with social interfaces for long-term health monitoring.

Paperid: 2935, https://arxiv.org/pdf/2502.15883.pdf

Abstract:
Process-based learning is crucial for the transmission of intangible cultural heritage, especially in complex arts like Chinese calligraphy, where mastering techniques cannot be achieved by merely observing the final work. To explore the challenges faced in calligraphy heritage transmission, we conducted semi-structured interviews (N=8) as a formative study. Our findings indicate that the lack of calligraphy instructors and tools makes it difficult for students to master brush techniques, and teachers struggle to convey the intricate details and rhythm of brushwork. To address this, we collaborated with calligraphy instructors to develop an educational tool that integrates writing process capture and visualization, showcasing the writing rhythm, hand force, and brush posture. Through empirical studies conducted in multiple teaching workshops, we evaluated the system's effectiveness with teachers (N=4) and students (N=12). The results show that the tool significantly enhances teaching efficiency and aids learners in better understanding brush techniques.

Paperid: 2936, https://arxiv.org/pdf/2502.15722.pdf

Abstract:
Accessing accurate medication insights is vital for enhancing patient safety, minimizing errors, and supporting clinical decision-making. However, healthcare professionals in Africa often rely on manual and time-consuming processes to retrieve drug information, exacerbated by limited access to pharmacists due to brain drain and healthcare disparities. This paper presents "Drug Insights," an open-source Retrieval-Augmented Generation (RAG) chatbot designed to streamline medication lookup for healthcare workers in Africa. By leveraging a corpus of Nigerian pharmaceutical data and advanced AI technologies, including Pinecone databases and GPT models, the system delivers accurate, context-specific responses with minimal hallucination. The chatbot integrates prompt engineering and S-BERT evaluation to optimize retrieval and response generation. Preliminary tests, including pharmacist feedback, affirm the tool's potential to improve drug information access while highlighting areas for enhancement, such as UI/UX refinement and extended corpus integration.

Paperid: 2937, https://arxiv.org/pdf/2502.15255.pdf

Abstract:
Music composition has long been recognized as a significant art form. However, existing digital audio workstations and music production software often present high entry barriers for users lacking formal musical training. To address this, we introduce ComposeOn, a music theory-based tool designed for users with limited musical knowledge. ComposeOn enables users to easily extend their melodic ideas into complete compositions and offers simple editing features. By integrating music theory, it explains music creation at beginner, intermediate, and advanced levels. Our user study (N=10) compared ComposeOn with the baseline method, Suno AI, demonstrating that ComposeOn provides a more accessible and enjoyable composing and learning experience for individuals with limited musical skills. ComposeOn bridges the gap between theory and practice, offering an innovative solution as both a composition aid and music education platform. The study also explores the differences between theory-based music creation and generative music, highlighting the former's advantages in personal expression and learning.

Paperid: 2938, https://arxiv.org/pdf/2502.15096.pdf

Abstract:
Chat interfaces for intelligent tutoring systems (ITSs) enable interactivity and flexibility. However, when students interact with chat interfaces, they expect dialogue-driven navigation from the system and can express frustration and disinterest if this is not provided. Intent detection systems help students navigate within an ITS, but detecting students' intent during open-ended dialogue is challenging. We designed an intent detection system in a chatbot ITS that classifies a student's intent as either continuing the current lesson or switching to a new lesson. We explore the utility of four machine learning approaches for this task - including both conventional classification approaches and fine-tuned large language models - finding that using an intent classifier introduces trade-offs around implementation cost, accuracy, and prediction time. We argue that implementing intent detection in chat interfaces can reduce frustration and support student learning.

Paperid: 2939, https://arxiv.org/pdf/2502.15095.pdf

Abstract:
Testing Web User Interfaces (UIs) requires considerable time, effort, and resources, most notably participants for user testing. Additionally, the test results may demand adjustments to the UI, requiring further resources and testing. Early tests can make this process less costly with the help of low-fidelity prototypes, but it is difficult to conduct user tests on them, and recruiting participants is still necessary. To tackle this issue, there are tools that can predict UI aspects like interaction time, such as the well-known KLM model. Another aspect that can be predicted is complexity, and this was achieved by the Big I notation, which can be applied to early UX concepts like lo-fi wireframes. Big I assists developers in estimating the interaction complexity, specified as a function of user steps, which are composed of abstracted user actions. Interaction complexity is expressed in mathematical terms, making it easy to compare the interaction complexities of various UX concepts. However, Big I is not able to predict execution time for user actions, which would be very helpful for early assessment of lo-fi prototypes. To address this shortcoming, in this paper we present a study in which we took measurements from real users (n=100) completing tasks on a fictitious website, in order to derive average times per interaction step. Using these results, we were able to study the relationship between interaction complexity and time and ultimately complement Big I predictions with time estimates.
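
As a rough illustration of how an average time per interaction step could complement a step-count estimate, the sketch below averages measured task times over step counts and applies the result to a new design. The sample measurements, the function names, and the assumption that time scales linearly with the number of steps are hypothetical and are not taken from the paper.

```python
from statistics import mean

def mean_time_per_step(observations):
    """observations: list of (task_time_seconds, n_steps) pairs, one per session."""
    # Average time per step within each session, then across sessions.
    return mean(t / n for t, n in observations if n > 0)

def estimate_task_time(n_steps, time_per_step):
    # Assumption: execution time grows linearly with the number of steps.
    return n_steps * time_per_step

# Hypothetical measurements: (seconds, steps)
sessions = [(12.0, 4), (9.0, 3), (20.0, 5)]
per_step = mean_time_per_step(sessions)
predicted = estimate_task_time(6, per_step)
```

A linear model is only the simplest choice here; measured data could equally support a per-action-type breakdown.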

Paperid: 2940, https://arxiv.org/pdf/2502.15005.pdf

Abstract:
In this paper, we propose a Retrieval Augmented Generation (RAG) agent that maps natural language queries about research topics to precise, machine-interpretable semantic entities. Our approach combines RAG with Socratic dialogue to align a user's intuitive understanding of research topics with established Knowledge Organization Systems (KOSs). The proposed approach will effectively bridge "little semantics" (domain-specific KOS structures) with "big semantics" (broad bibliometric repositories), making complex academic taxonomies more accessible. Such agents have the potential for broad use. We illustrate with a sample application called CollabNext, which is a person-centric knowledge graph connecting people, organizations, and research topics. We further describe how the application design has an intentional focus on HBCUs and emerging researchers to raise visibility of people historically rendered invisible in the current science system.

Paperid: 2941, https://arxiv.org/pdf/2502.14761.pdf

Abstract:
According to the World Health Organization, over 466 million people worldwide suffer from disabling hearing loss, with approximately 34 million of these being children. Hearing aids (HA) and cochlear implants (CI) have become indispensable tools for restoring hearing and enhancing the quality of life for individuals with hearing impairments. Clinical research and consumer studies indicate that users of HAs and CIs report significant improvements in their daily lives, including enhanced communication abilities and social engagement and reduced psychological stress. Modern auditory prosthetic devices are more advanced and interconnected with digital networks to add functionality, such as streaming audio directly from smartphones and other devices, remote adjustments by audiologists, integration with smart home systems, and access to artificial intelligence-driven sound enhancement features. With this interconnectivity, issues surrounding data privacy and security have become increasingly pertinent. There is limited research on the usability perceptions of current HA and CI models from the perspective of end-users. In addition, no studies have investigated consumer mental models during the purchasing process, particularly which factors they prioritize when selecting a device. In this study, we assessed participants' satisfaction levels with various features of their auditory prostheses. This work contributes to the field by addressing gaps in user perceptions of HA and CI usability, identifying key factors in consumer purchasing decisions, and highlighting the need for improved privacy and security awareness and education among users.

Paperid: 2942, https://arxiv.org/pdf/2502.14217.pdf

Abstract:
Conversational agents, such as chatbots, have increasingly found their way into many dimensions of our lives, including entertainment and education. In this exploratory study we built a child-friendly chatbot, "Ask Me Anything" (AMA), and investigated children's attitudes and trust toward AI-driven conversational agents. To prompt targeted questioning from students and drive engagement, AMA is a specialized chatbot that answers only topic-specific questions in three areas: astronomy, sneakers and shoes, and dinosaurs. We tested AMA with 63 students in a K-8 public school in the Northeast USA. Students worked in small groups, interacted with our tool for three to ten minutes, and completed a post survey. We identified three key themes that emerged from student conversational interactions with AMA: expressing wonder, surprise, and curiosity; building trust and developing confidence; and building relationships and anthropomorphizing. Also, we observed a broad attitude of openness and comfort. Students trusted the chatbot responses in general, indicating a high level of trust in and reliance on AI as a source of information. They described AMA as "knowledgeable" and "smart," and said that they could "trust it." To confirm their perception of reliability, some students tested the chatbot with questions to which they knew the answers. This behavior illustrated a fundamental aspect of children's cognitive development: the process of actively evaluating the credibility of sources. Our work extends and contributes to the existing body of literature that explores children's interactions with conversational agents.

Paperid: 2943, https://arxiv.org/pdf/2502.14059.pdf

Abstract:
Nearly one million total hip and knee arthroplasties (THA/TKA) are performed annually in the United States, with most patients discharged home and prescribed home exercise programs (HEPs) to enhance lower extremity function. Traditional paper-based HEPs, while accessible and low-cost, often lack engagement and real-time feedback, which are critical for adherence and performance optimization. Extended reality (XR) and telehealth (TH) systems offer promising solutions, combining engagement and feedback, though each has limitations. To address these gaps, we designed and executed a pilot study that compared exercise performance in individuals with THA/TKA using a conventional paper-based HEP versus a proof-of-concept system, dubbed Tele-PhyT, that included the ideal characteristics of a future XR technology that would enable seamless HEP-TH systems, with robust marker-less full body tracking, real-time visual feedback, and performance quantification. The pilot study used a randomized cross-over design and targeted two types of users: therapists and patients. Participants favored Tele-PhyT for its real-time feedback and ease of use, and noted its potential to improve HEP adherence and exercise accuracy.

Paperid: 2944, https://arxiv.org/pdf/2502.14000.pdf

Abstract:
This paper presents a novel perspective on human-computer interaction (HCI), framing it as a dynamic interplay between human and computational agents within a networked system. Going beyond traditional interface-based approaches, we emphasize the importance of coordination and communication among heterogeneous agents with different capabilities, roles, and goals. A key distinction is made between multi-agent systems (MAS) and Centaurian systems, which represent two different paradigms of human-AI collaboration. MAS maintain agent autonomy, with structured protocols enabling cooperation, while Centaurian systems deeply integrate human and AI capabilities, creating unified decision-making entities. To formalize these interactions, we introduce a framework for communication spaces, structured into surface, observation, and computation layers, ensuring seamless integration between MAS and Centaurian architectures, where colored Petri nets effectively represent structured Centaurian systems and high-level reconfigurable networks address the dynamic nature of MAS. Our research has practical applications in autonomous robotics, human-in-the-loop decision making, and AI-driven cognitive architectures, and provides a foundation for next-generation hybrid intelligence systems that balance structured coordination with emergent behavior.

Paperid: 2945, https://arxiv.org/pdf/2502.13816.pdf

Abstract:
This paper offers a structured understanding of mediated social touch (MST) using a human-oriented approach, through an extensive review of literature spanning tactile interfaces, emotional information, mapping mechanisms, and the dynamics of human-human and human-robot interactions. By investigating the existing and exploratory mapping strategies of the 37 selected MST cases, we established the emotional expression space of MSTs that accommodated a diverse spectrum of emotions by integrating the categorical and valence-arousal models, showcasing how emotional cues can be translated into tactile signals. Based on the expressive capacity of MSTs, a practical design space was structured encompassing factors such as body locations, device form, tactile modalities, and parameters. We also proposed various design strategies for MSTs including workflow, evaluation methods, and ethical and cultural considerations, as well as several future research directions. MSTs' potential is reflected not only in conveying emotional information but also in fostering empathy, comfort, and connection in both human-human and human-robot interactions. This paper aims to serve as a comprehensive reference for design researchers and practitioners, helping to expand the scope of emotional communication of MSTs, facilitate the exploration of diverse applications of affective haptics, and enhance the naturalness and sociability of haptic interaction.

Paperid: 2946, https://arxiv.org/pdf/2502.13499.pdf

Abstract:
Recent work has highlighted the risks of LLM-generated content for a wide range of harmful behaviors, including incorrect and harmful code. In this work, we extend this by studying whether LLM-generated web design contains dark patterns. This work evaluated designs of ecommerce web components generated by four popular LLMs: Claude, GPT, Gemini, and Llama. We tested 13 commonly used ecommerce components (e.g., search, product reviews) and used them as prompts to generate a total of 312 components across all models. Over one-third of generated components contain at least one dark pattern. The majority of dark pattern strategies involve hiding crucial information, limiting users' actions, and manipulating them into making decisions through a sense of urgency. Dark patterns are also more frequently produced in components that are related to company interests. These findings highlight the need for interventions to prevent dark patterns during front-end code generation with LLMs and emphasize the importance of expanding ethical design education to a broader audience.

Paperid: 2947, https://arxiv.org/pdf/2502.13294.pdf

Abstract:
The implementation of responsible AI in an organization is inherently complex due to the involvement of multiple stakeholders, each with their unique set of goals and responsibilities across the entire AI lifecycle. These responsibilities are often ambiguously defined and assigned, leading to confusion, miscommunication, and inefficiencies. Even when responsibilities are clearly defined and assigned to specific roles, the corresponding AI actors lack effective tools to support their execution. Toward closing these gaps, we present a systematic review and comprehensive meta-analysis of the current state of responsible AI tools, focusing on their alignment with specific stakeholder roles and their responsibilities in various AI lifecycle stages. We categorize over 220 tools according to AI actors and stages they address. Our findings reveal significant imbalances across the stakeholder roles and lifecycle stages addressed. The vast majority of available tools have been created to support AI designers and developers specifically during data-centric and statistical modeling stages while neglecting other roles such as institutional leadership, deployers, end-users, and impacted communities, and stages such as value proposition and deployment. The uneven distribution we describe here highlights critical gaps that currently exist in responsible AI governance research and practice. Our analysis reveals that despite the myriad of frameworks and tools for responsible AI, it remains unclear who within an organization and when in the AI lifecycle a tool applies. Furthermore, existing tools are rarely validated, leaving critical gaps in their usability and effectiveness. These gaps provide a starting point for researchers and practitioners to create more effective and holistic approaches to responsible AI development and governance.

Paperid: 2948, https://arxiv.org/pdf/2502.12471.pdf

Abstract:
Parkinson's disease (PD) is a neurodegenerative disorder characterized by motor dysfunction and abnormal neural oscillations. These symptoms can be modulated through electrical stimulation. Traditional neural activity analysis in PD has typically relied on statistical methods, which often introduce bias owing to the need for expert-driven feature extraction. To address this limitation, we explore an explainable artificial intelligence (XAI) approach to analyze neural activity in Parkinsonian rats receiving electrical stimulation. Electrocorticogram (ECoG) signals were collected before and after electrical stimulation using graphene-based electrodes that enable less-invasive monitoring and stimulation in PD. EEGNet, a convolutional neural network, classified these ECoG signals into pre- and post-stimulation states. We applied layer-wise relevance propagation, an XAI technique, to identify key neural inputs contributing to the model's decisions, incorporating the spatial electrode information matched to the cortex map. The XAI analysis highlighted area-specific importance in beta and gamma frequency bands, which could not be detected through mean comparison analyses relying on feature extraction. These findings demonstrate the potential of XAI in analyzing neural dynamics in neurodegenerative disorders such as PD, suggesting that the integration of graphene-based electrodes with advanced deep learning models offers a promising solution for real-time PD monitoring and therapy.
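
To make the relevance-propagation step more concrete, here is a minimal sketch of the commonly used epsilon rule of layer-wise relevance propagation for a single bias-free dense layer. It is a generic illustration in plain Python, not the authors' EEGNet pipeline; the function name, toy weights, and relevance values are assumptions chosen for the example.

```python
def lrp_epsilon_dense(activations, weights, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs of one bias-free dense layer.

    activations: input activations a_j
    weights: weights[j][k] connects input j to output k
    relevance_out: relevance scores R_k assigned to the outputs
    """
    n_in, n_out = len(activations), len(relevance_out)
    # Pre-activations z_k = sum_j a_j * w_jk
    z = [sum(activations[j] * weights[j][k] for j in range(n_in))
         for k in range(n_out)]
    # Push the denominator away from zero while keeping its sign.
    z_stab = [zk + (eps if zk >= 0 else -eps) for zk in z]
    # R_j = sum_k (a_j * w_jk / z_k) * R_k
    return [
        sum(activations[j] * weights[j][k] / z_stab[k] * relevance_out[k]
            for k in range(n_out))
        for j in range(n_in)
    ]

# Toy example (hypothetical values)
a = [1.0, 2.0]
W = [[0.5, -1.0], [0.25, 1.0]]
R_out = [1.0, 2.0]
R_in = lrp_epsilon_dense(a, W, R_out)
```

With a small `eps` and no bias, the input relevances sum to approximately the same total as the output relevances, which is the conservation property that makes the per-input scores interpretable as shares of the model's decision.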

Paperid: 2949, https://arxiv.org/pdf/2502.11659.pdf

Abstract:
Recent advancements in large language models (LLMs) provide a more effective pathway for upgrading brain-computer interface (BCI) technology in terms of user interaction. The widespread adoption of BCIs in daily application scenarios is still limited by factors such as their single functionality, restricted paradigm design, weak multilingual support, and low levels of intelligence. In this paper, we propose an innovative BCI system that deeply integrates a steady-state visual evoked potential (SSVEP) speller with an LLM application programming interface (API). It allows natural language input through the SSVEP speller and dynamically calls large models to generate SSVEP paradigms. The command prompt, blinking frequency, and layout position are adjustable to meet the user's control requirements in various scenarios. Through the LLM's multilingual support, the system is compatible with more than ten languages. A variety of task scenarios are provided, such as home appliance control, robotic arm operation, and unmanned aerial vehicle (UAV) management. The task interfaces of the system can be personalized according to the user's habits, usage scenarios, and equipment characteristics. By combining the SSVEP speller with an LLM, the system addresses numerous challenges faced by current BCI systems and makes breakthroughs in functionality, intelligence, and multilingual support. The introduction of the LLM not only enhances user experience but also expands the potential applications of BCI technology in real-world environments.

Paperid: 2950, https://arxiv.org/pdf/2502.11291.pdf

Abstract:
The problem of explaining inconsistency-tolerant reasoning in knowledge bases (KBs) is a prominent topic in Artificial Intelligence (AI). While there is some work on this problem, the explanations provided by existing approaches often lack critical information or fail to be expressive enough for non-binary conflicts. In this paper, we identify structural weaknesses of the state-of-the-art and propose a generic argumentation-based approach to address these problems. This approach is defined for logics involving reasoning with maximal consistent subsets and shows how any such logic can be translated to argumentation. Our work provides dialogue models as dialectic-proof procedures to compute and explain a query answer with respect to inconsistency-tolerant semantics. This allows us to construct dialectical proof trees as explanations, which are more expressive and arguably more intuitive than existing explanation formalisms.

Paperid: 2951, https://arxiv.org/pdf/2502.10788.pdf

Abstract:
Online shared content, such as group pictures, often contains information about multiple users. Developing technical solutions to manage the privacy of such "co-owned" content is challenging because each co-owner may have different preferences. Recent technical approaches advocate group-decision mechanisms, including auctions, to decide how best to resolve these differences. However, it is not clear whether users would participate in such mechanisms and, if they do, whether they would act altruistically. Understanding the privacy dynamics is crucial to developing effective mechanisms for privacy-respecting collaborative systems. Accordingly, this work develops RESOLVE, a privacy auction game to understand the sharing behavior of users in groups. Results from users playing the game show that i) the users' understanding of individual vs. group privacy differs significantly; ii) often users fight for their preferences even at the cost of others' privacy; and iii) at times users collaborate to fight for the privacy of others.

Paperid: 2952, https://arxiv.org/pdf/2502.10407.pdf

Abstract:
Generative AI technologies, particularly Large Language Models (LLMs), have transformed information management systems but introduced substantial biases that can compromise their effectiveness in informing business decision-making. This challenge presents information management scholars with a unique opportunity to advance the field by identifying and addressing these biases across extensive applications of LLMs. Building on the discussion on bias sources and current methods for detecting and mitigating bias, this paper seeks to identify gaps and opportunities for future research. By incorporating ethical considerations, policy implications, and sociotechnical perspectives, we focus on developing a framework that covers major stakeholders of Generative AI systems, proposing key research questions, and inspiring discussion. Our goal is to provide actionable pathways for researchers to address bias in LLM applications, thereby advancing research in information management that ultimately informs business practices. Our forward-looking framework and research agenda advocate interdisciplinary approaches, innovative methods, dynamic perspectives, and rigorous evaluation to ensure fairness and transparency in Generative AI-driven information systems. We expect this study to serve as a call to action for information management scholars to tackle this critical issue, guiding the improvement of fairness and effectiveness in LLM-based systems for business practice.

Paperid: 2953, https://arxiv.org/pdf/2502.10401.pdf

Abstract:
Current definitions of Information Science are inadequate to comprehensively describe the nature of its field of study and to address the problems that are arising from intelligent technologies. The ubiquitous rise of artificial intelligence applications and their impact on society demand that the field of Information Science acknowledge the sociotechnical nature of these technologies. Previous definitions of Information Science over the last six decades have inadequately addressed the environmental, human, and social aspects of these technologies. This perspective piece advocates for an expanded definition of Information Science that fully includes the sociotechnical impacts information has on the conduct of research in this field. Proposing an expanded definition of Information Science that includes the sociotechnical aspects of this field should both stimulate conversation and widen the interdisciplinary lens necessary to address how intelligent technologies may be incorporated into society and our lives more fairly.

Paperid: 2954, https://arxiv.org/pdf/2502.09989.pdf

Abstract:
Weighted abduction computes hypotheses that explain input observations. A reasoner of weighted abduction first generates possible hypotheses and then selects the hypothesis that is the most plausible. Since a reasoner employs parameters, called weights, that control its plausibility evaluation function, it can output the most plausible hypothesis according to a specific application using application-specific weights. This versatility makes it applicable from plant operation to cybersecurity or discourse analysis. However, the predetermined application-specific weights are not applicable to all cases of the application. Hence, the hypothesis selected by the reasoner does not necessarily seem the most plausible to the user. In order to resolve this problem, this article proposes two types of user-feedback dialogue protocols, in which the user points out, either positively, negatively or neutrally, properties of the hypotheses presented by the reasoner, and the reasoner regenerates hypotheses that satisfy the user's feedback. As required of user-feedback dialogue protocols, we then prove: (i) our protocols necessarily terminate under certain reasonable conditions; (ii) they arrive at hypotheses that have in common the same properties that fixed target hypotheses have in common, provided the user determines the positivity, negativity or neutrality of each pointed-out property based on whether the target hypotheses have that property.

Paperid: 2955, https://arxiv.org/pdf/2502.09912.pdf

Abstract:
Team closeness provides the foundations of trust and communication, contributing to teams' success and viability. However, newcomers often struggle to be included in a team since incumbents tend to interact more with other existing members. Previous research suggests that online communication technologies can help team inclusion by mitigating members' perceived differences. In this study, we test how virtual reality (VR) can promote team closeness when forming teams. We conducted a between-subjects experiment with teams working in person and in VR, where two members interacted first, and then a third member was added later to conduct a hidden-profile task. Participants evaluated how close they felt to their teammates after the task was completed. Our results show that VR newcomers felt closer to the incumbents than in-person newcomers. However, incumbents' closeness to newcomers did not vary across conditions. We discuss the implications of these findings and offer suggestions for how VR can promote inclusion.

Paperid: 2956, https://arxiv.org/pdf/2502.09867.pdf

Abstract:
Generative AI has enabled novice designers to quickly create professional-looking visual representations for product concepts. However, novices have limited domain knowledge that could constrain their ability to write prompts that effectively explore a product design space. To understand how experts explore and communicate about design spaces, we conducted a formative study with 12 experienced product designers and found that experts -- and their less-versed clients -- often use visual references to guide co-design discussions rather than written descriptions. These insights inspired DesignWeaver, an interface that helps novices generate prompts for a text-to-image model by surfacing key product design dimensions from generated images into a palette for quick selection. In a study with 52 novices, DesignWeaver enabled participants to craft longer prompts with more domain-specific vocabularies, resulting in more diverse, innovative product designs. However, the nuanced prompts heightened participants' expectations beyond what current text-to-image models could deliver. We discuss implications for AI-based product design support tools.

Paperid: 2957, https://arxiv.org/pdf/2502.09763.pdf

Abstract:
We present a primary attack ontology and analysis framework for deception attacks in Mixed Reality (MR). This is achieved through a multidisciplinary Systematization of Knowledge (SoK), integrating concepts from MR security, information theory, and cognition. While MR grows in popularity, it presents many cybersecurity challenges, particularly concerning deception attacks and their effects on humans. In this paper, we use the Borden-Kopp model of deception to develop a comprehensive ontology of MR deception attacks. Further, we derive two models to assess the impact of MR deception attacks on information communication and decision-making. The first, an information-theoretic model, mathematically formalizes the effects of attacks on information communication. The second, a decision-making model, details the effects of attacks on interlaced cognitive processes. Using our ontology and models, we establish the MR Deception Analysis Framework (DAF) to assess the effects of MR deception attacks on information channels, perception, and attention. Our SoK uncovers five key findings for research and practice and identifies five research gaps to guide future work.

Paperid: 2958, https://arxiv.org/pdf/2502.09693.pdf

Abstract:
Online debates can enhance critical thinking but may escalate into hostile attacks. As humans are increasingly reliant on Generative AI (GenAI) in writing tasks, we need to understand how people utilize GenAI in online debates. To examine the patterns of writing behavior while making arguments with GenAI, we created an online forum for soccer fans to engage in turn-based and free debates in a post format with the assistance of ChatGPT, arguing on the topic of "Messi vs Ronaldo". After 13 sessions of a two-part study and semi-structured interviews with 39 participants, we conducted content and thematic analyses to integrate insights from interview transcripts, ChatGPT records, and forum posts. We found that participants prompted ChatGPT for aggressive responses, created posts with similar content and logical fallacies, and sacrificed the use of ChatGPT for better human-human communication. This work uncovers how polarized forum members work with GenAI to engage in debates online.

Paperid: 2959, https://arxiv.org/pdf/2502.09435.pdf

Abstract:
The retinal afterimage is a widely known effect in the human visual system, which has been studied and used in the context of a number of major art movements. Therefore, when considering the general role of computation in the visual arts, this raises the question of whether this effect, too, may be induced using partly automated techniques. If so, it may become a computationally controllable ingredient of (interactive) visual art, and thus take its place among the many other aspects of visual perception which have already preceded it in this sense. The present moment provides additional inspiration to lay the groundwork for extending computer graphics in general with the retinal afterimage: historically, we are in a phase where some head-mounted stereoscopic AR/VR technologies now provide eye tracking by default, thereby allowing realtime monitoring of the processes of visual fixation that can induce the retinal afterimage. A logical starting point for general investigation is then shape display via the retinal afterimage, since shape recognition lends itself well to unambiguous reporting. Shape recognition, however, may also occur due to normal vision, which happens simultaneously. Carefully and rigorously excluding this possibility, we develop computational techniques enabling shape display exclusive to the retinal afterimage.

Paperid: 2960, https://arxiv.org/pdf/2502.09362.pdf

Abstract:
HCI is future-oriented by nature: it explores new human-technology interactions and applies the findings to promote and shape vital visions of society. Still, the visions of futures in HCI publications seem largely implicit, techno-deterministic, narrow, and lacking in roadmaps and attention to uncertainties. A literature review centered on this problem examined futuring and its forms in the ACM Digital Library's most frequently cited HCI publications. This analysis entailed developing the four-category framework SPIN, informed by futures studies literature. The results confirm that, while technology indeed drives futuring in HCI, a growing body of HCI research is coming to challenge techno-centric visions. Emerging foci of HCI futuring demonstrate active exploration of uncertainty, a focus on human experience, and contestation of dominant narratives. The paper concludes with insight illuminating factors behind techno-centrism's continued dominance of HCI discourse, as grounding for five opportunities for the field to expand its contribution to futures and anticipation research.

Paperid: 2961, https://arxiv.org/pdf/2502.09076.pdf

Abstract:
One huge advantage of Augmented Reality (AR) is its numerous possibilities of displaying information in the physical world, especially when applying Situated Analytics (SitA). AR devices and their respective interaction techniques allow for supplementary guidance to assist an operator carrying out complex procedures such as medical diagnosis and surgery, for instance. Their usage promotes user autonomy by presenting relevant information when the operator may not necessarily possess expert knowledge of every procedure and may also not have access to external help such as in a remote or isolated situation (e.g., International Space Station, middle of an ocean, desert). In this paper, we propose a comparison of two different forms of AR visualisation: an embedded visualisation and a situated projected visualisation, with the aim to assist operators with the most appropriate visualisation format when carrying out procedures (medical in our case). To evaluate these forms of visualisation, we carried out an experiment involving 23 participants possessing latent/novice medical knowledge. These participant profiles were representative of operators who are medically trained yet do not apply their knowledge every day (e.g., an astronaut in orbit or a sailor out at sea). We discuss our findings which include the advantages of embedded visualised information in terms of precision compared to situated projected information with the accompanying limitations in addition to future improvements to our proposition. We conclude with the prospects of our work, notably the continuation and possibility of evaluating our proposition in a less controlled and real context in collaboration with our national space agency.

Paperid: 2962, https://arxiv.org/pdf/2502.09055.pdf

Abstract:
Recent advances in generative AI music have resulted in new technologies that are being framed as co-creative tools for musicians with early work demonstrating their potential to add to music practice. While the field has seen many valuable contributions, work that involves practising musicians in the design and development of these tools is limited, with the majority of work including them only once a tool has been developed. In this paper, we present a case study that explores the needs of practising musicians through the co-design of a musical variation system, highlighting the importance of involving a diverse range of musicians throughout the design process and uncovering various design insights. This was achieved through two workshops and a two week ecological evaluation, where musicians from different musical backgrounds offered valuable insights not only on a musical system's design but also on how a musical AI could be integrated into their musical practices.

Paperid: 2963, https://arxiv.org/pdf/2502.09048.pdf

Abstract:
Data visualizations are increasingly seen as socially constructed, with several recent studies positing that perceptions and interpretations of visualization artifacts are shaped through complex sets of interactions between members of a community. However, most of these works have focused on audiences and researchers, and little is known about if and how practitioners account for the socially constructed framing of data visualization. In this paper, we study and analyze how visualization practitioners understand the influence of their beliefs, values, and biases in their design processes and the challenges they experience. In 17 semi-structured interviews with designers working with race and gender demographic data, we find that a complex mix of factors interact to inform how practitioners approach their design process, including their personal experiences, values, and their understandings of power, neutrality, and politics. Based on our findings, we suggest a series of implications for research and practice in this space.

Paperid: 2964, https://arxiv.org/pdf/2502.08920.pdf

Abstract:
Conversational AI chatbots have become increasingly common within the customer service industry. Despite improvements in their emotional development, they often lack the authenticity of real customer service interactions or the competence of service providers. By comparing emotion-sensitive and emotion-insensitive LLM-based chatbots across 30 participants, we aim to explore how emotional sensitivity in chatbots influences perceived competence and overall customer satisfaction in service interactions. Additionally, we employ sentiment analysis techniques to analyze and interpret the emotional content of user inputs. We highlight that perceptions of chatbot trustworthiness and competence were higher in the case of the emotion-sensitive chatbot, even if issue resolution rates were not affected. We discuss implications of improved user satisfaction from emotion-sensitive chatbots and potential applications in support services.

Paperid: 2965, https://arxiv.org/pdf/2502.08854.pdf

Abstract:
Widespread integration of Generative AI tools is transforming white-collar work, reshaping how workers define their roles, manage their tasks, and collaborate with peers. This has created a need to develop an overarching understanding of common worker-driven patterns around these transformations. To fill this gap, we conducted a systematic literature review of 23 studies from the ACM Digital Library that focused on workers' lived-experiences and practitioners with GenAI. Our findings reveal that while many professionals have delegated routine tasks to GenAI to focus on core responsibilities, they have also taken on new forms of AI managerial labor to monitor and refine GenAI outputs. Additionally, practitioners have restructured collaborations, sometimes bypassing traditional peer and subordinate interactions in favor of GenAI assistance. These shifts have fragmented cohesive tasks into piecework creating tensions around role boundaries and professional identity. Our analysis suggests that current frameworks, like job crafting, need to evolve to address the complexities of GenAI-driven transformations.

Paperid: 2966, https://arxiv.org/pdf/2502.08848.pdf

Abstract:
Speech-to-text capabilities on mobile devices have proven helpful for hearing and speech accessibility, language translation, note-taking, and meeting transcripts. However, our foundational large-scale survey (n=263) shows that the inability to distinguish and indicate speaker direction makes them challenging in group conversations. SpeechCompass addresses this limitation through real-time, multi-microphone speech localization, where the direction of speech allows visual separation and guidance (e.g., arrows) in the user interface. We introduce efficient real-time audio localization algorithms and custom sound perception hardware running on a low-power microcontroller and four integrated microphones, which we characterize in technical evaluations. Informed by a large-scale survey (n=494), we conducted an in-person study of group conversations with eight frequent users of mobile speech-to-text, who provided feedback on five visualization styles. The value of diarization and visualizing localization was consistent across participants, with everyone agreeing on the value and potential of directional guidance for group conversations.
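The abstract does not specify the localization algorithm, so the following is only a minimal sketch of one common way to estimate speech direction from a microphone pair: find the time difference of arrival (TDOA) by brute-force cross-correlation, then convert it to an angle under a far-field assumption. The function names, the two-microphone setup, the sample rate, and the microphone spacing are all illustrative assumptions, not details of SpeechCompass.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def best_lag(ref, other, max_lag):
    """Return the lag (in samples) by which `other` trails `ref`.

    Hypothetical helper: brute-force cross-correlation over a small lag window.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += x * other[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle_deg(lag_samples, sample_rate_hz, mic_spacing_m):
    """Convert a sample lag to an angle from broadside (far-field assumption)."""
    delay_s = lag_samples / sample_rate_hz
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against noise-induced overshoot
    return math.degrees(math.asin(ratio))


# Synthetic example: the same click reaches the second microphone 3 samples later.
ref = [0.0] * 32
other = [0.0] * 32
ref[10] = 1.0
other[13] = 1.0
lag = best_lag(ref, other, max_lag=8)
angle = arrival_angle_deg(lag, sample_rate_hz=48000, mic_spacing_m=0.05)
```

A real system with four microphones would combine several such pairwise estimates and would typically use a more robust correlation method; this sketch only shows the geometric idea behind turning a delay into a direction.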

Paperid: 2967, https://arxiv.org/pdf/2502.08428.pdf

Abstract:
To design social robots to effectively promote health behavior change, it is essential to understand how people respond to various health communication strategies employed by these robots. This study examines the effectiveness of two types of social control strategies from a social robot, relationship-focused strategies (emphasizing relational consequences) and target-focused strategies (emphasizing health consequences), in encouraging people to reduce sedentary behavior. A two-session lab experiment was conducted (n = 135), where participants first played a game with a robot, followed by the robot persuading them to stand up and move using one of the strategies. Half of the participants joined a second session to have a repeated interaction with the robot. Results showed that relationship-focused strategies motivated participants to stay active longer. Repeated sessions did not strengthen participants' relationship with the robot, but those who felt more attached to the robot responded more actively to the target-focused strategies. These findings offer valuable insights for designing persuasive strategies for social robots in health communication contexts.

Paperid: 2968, https://arxiv.org/pdf/2502.08312.pdf

Abstract:
This paper introduces the Word Synchronization Challenge, a novel benchmark to evaluate large language models (LLMs) in Human-Computer Interaction (HCI). This benchmark uses a dynamic game-like framework to test LLMs' ability to mimic human cognitive processes through word associations. By simulating complex human interactions, it assesses how LLMs interpret and align with human thought patterns during conversational exchanges, which are essential for effective social partnerships in HCI. Initial findings highlight the influence of model sophistication on performance, offering insights into the models' capabilities to engage in meaningful social interactions and adapt behaviors in human-like ways. This research advances the understanding of LLMs' potential to replicate or diverge from human cognitive functions, paving the way for more nuanced and empathetic human-machine collaborations.

Paperid: 2969, https://arxiv.org/pdf/2502.07999.pdf

Abstract:
Online, visual artists have more places than ever to routinely share their creative work and connect with other artists. These interactions support the routine enactment of creative identity in artists and provide inspirational opportunities for artists. As creative work shifts online, interactions between artists and routines around how these artists get inspired to do creative work are mediated by and through the logics of the online platforms where they take place. In an interview study of 22 artists, this paper explores the interplay between the development of artists' creative identities and the, at times, contradictory practices they have around getting inspired. We find that platforms support the disciplined practice of creative work while supporting spontaneous moments of inspiration, play an increasing role in passive approaches to searching for inspiration, and foster numerous small community spaces for artists to negotiate their creative identities. We discuss how platforms can better support and embed mechanisms for inspiration into their infrastructures, design, and platform policy.

Paperid: 2970, https://arxiv.org/pdf/2502.07981.pdf

Abstract:
Humor is a social binding agent. It is an act of creativity that can provoke emotional reactions on a broad range of topics. Humor has long been thought to be "too human" for AI to generate. However, humans are complex, and humor requires our complex set of skills: cognitive reasoning, social understanding, a broad base of knowledge, creative thinking, and audience understanding. We explore whether giving AI such skills enables it to write humor. We target one audience: Gen Z humor fans. We ask people to rate meme caption humor from three sources: 1) highly upvoted human captions, 2) basic LLMs, and 3) LLM captions with humor skills. We find that users like LLM captions with humor skills more than basic LLMs and almost on par with top-rated humor written by people. We discuss how giving AI human-like skills can help it generate communication that resonates with people.

Paperid: 2971, https://arxiv.org/pdf/2502.07649.pdf

Abstract:
Traditionally, linters are code analysis tools that help developers by flagging potential issues from syntax and logic errors to enforcing syntactical and stylistic conventions. Recently, linting has been taken as an interface metaphor, allowing it to be extended to more complex inputs, such as visualizations, which demand a broader perspective and alternative approach to evaluation. We explore a further extended consideration of linting inputs, and modes of evaluation, across the puritanical, neutral, and rebellious dimensions. We specifically investigate the potential for leveraging human computation in linting operations through Community Notes -- crowd-sourced contextual text snippets aimed at checking and critiquing potentially accurate or misleading content on social media. We demonstrate that human-powered assessments not only identify misleading or error-prone visualizations but that integrating human computation enhances traditional linting by offering social insights. As is required these days, we consider the implications of building linters powered by Artificial Intelligence.

Paperid: 2972, https://arxiv.org/pdf/2502.06843.pdf

Abstract:
Traditional autonomous driving systems often struggle with reasoning in complex, unexpected scenarios due to limited comprehension of spatial relationships. In response, this study introduces a Large Language Model (LLM)-based Autonomous Driving (AD) assistance system that integrates a vision adapter and an LLM reasoning module to enhance visual understanding and decision-making. The vision adapter, combining YOLOv4 and Vision Transformer (ViT), extracts comprehensive visual features, while GPT-4 enables human-like spatial reasoning and response generation. Experimental evaluations with 45 experienced drivers revealed that the system closely mirrors human performance in describing situations and moderately aligns with human decisions in generating appropriate responses.

Paperid: 2973, https://arxiv.org/pdf/2502.06378.pdf

Abstract:
Augmented reality (AR) will enable individuals to share and experience content augmented at real world locations with ease. But what protections and restrictions should be in place? Should, for example, anyone be able to post any content they wish at a place of religious or cultural significance? We developed a smartphone app to give individuals hands-on experience posting and sharing AR content. After using our app, we investigated their attitudes towards posting different types of AR content (of an artistic, protest, social, informative, and commercial nature) in a variety of locations (cultural sites, religious sites, residential areas, public spaces, government buildings, and tourist points of interests). Our results show individuals expect restrictions to be in place to control who can post AR content at some locations, in particular those of religious and cultural significance. We also report individuals prefer augmentations to fit contextually within the environment they are posted, and expect the posting and sharing of AR content to adhere to the same regulations/legislation as social media platforms.

Paperid: 2974, https://arxiv.org/pdf/2502.06141.pdf

Abstract:
This study evaluates the performance and usability of Mixed Reality (MR), Virtual Reality (VR), and camera stream interfaces for remote error resolution tasks, such as correcting warehouse packaging errors. Specifically, we consider a scenario where a robotic arm halts after detecting an error, requiring a remote operator to intervene and resolve it via pick-and-place actions. Twenty-one participants performed simulated pick-and-place tasks using each interface. A linear mixed model (LMM) analysis of task resolution time, usability scores (SUS), and mental workload scores (NASA-TLX) showed that the MR interface outperformed both VR and camera interfaces. MR enabled significantly faster task completion, was rated higher in usability, and was perceived to be less cognitively demanding. Notably, the MR interface, which projected a virtual robot onto a physical table, provided superior spatial understanding and physical reference cues. Post-study surveys further confirmed participants' preference for MR over other interfaces.
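For readers unfamiliar with the usability measure named above, the following is a minimal sketch of the standard System Usability Scale (SUS) scoring rule: ten items rated 1 to 5, odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a 0-100 score. The function name is illustrative, and the study's own analysis pipeline is not described in the abstract.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten ratings on a 1-5 scale.

    `responses` is ordered by questionnaire item, item 1 first.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0
    for item, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5


best_case = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])
neutral = sus_score([3] * 10)
```

Per-participant scores computed this way are the kind of value that would then enter a statistical model such as the LMM mentioned in the abstract.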

Paperid: 2975, https://arxiv.org/pdf/2502.05999.pdf

Abstract:
Can we derive computational metrics to quantify visual creativity in drawings across intelligent agents, while accounting for inherent differences in technical skill and style? To answer this, we curate a novel dataset consisting of 1338 drawings by children, adults and AI on a creative drawing task. We characterize two aspects of the drawings -- (1) style and (2) content. For style, we define measures of ink density, ink distribution and number of elements. For content, we use expert-annotated categories to study conceptual diversity, and image and text embeddings to compute distance measures. We compare the style, content and creativity of children, adults and AI drawings and build simple models to predict expert and automated creativity scores. We find significant differences in style and content in the groups -- children's drawings had more components, AI drawings had greater ink density, and adult drawings revealed maximum conceptual diversity. Notably, we highlight a misalignment between creativity judgments obtained through expert and automated ratings and discuss its implications. Through these efforts, our work provides, to the best of our knowledge, the first framework for studying human and artificial creativity beyond the textual modality, and attempts to arrive at the domain-agnostic principles underlying creativity. Our data and scripts are available on GitHub.
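The abstract names ink density as a style measure but does not define it. As a rough illustration only, one plausible definition is the fraction of pixels darker than a threshold in a grayscale image. The function name, the threshold value, and the list-of-rows image format below are assumptions, not the authors' implementation.

```python
def ink_density(image, threshold=128):
    """Fraction of pixels counted as ink in a grayscale image.

    `image` is a list of rows of 0-255 intensities (0 = black).
    The threshold of 128 is an assumed cut-off, not taken from the paper.
    """
    total = sum(len(row) for row in image)
    if total == 0:
        return 0.0
    ink = sum(1 for row in image for px in row if px < threshold)
    return ink / total


# A 2x2 image with two dark pixels on a white background.
sample = [[255, 0], [0, 255]]
density = ink_density(sample)
```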

Paperid: 2976, https://arxiv.org/pdf/2502.05951.pdf

Abstract:
This work introduces Cyri, an AI-powered conversational assistant designed to support a human user in detecting and analyzing phishing emails by leveraging Large Language Models. Cyri has been designed to scrutinize emails for semantic features used in phishing attacks, such as urgency, and undesirable consequences, using an approach that unifies features already established in the literature with others identified by Cyri's feature extraction methodology. Cyri can be directly plugged into a mail client or webmail, ensuring seamless integration with the user's email workflow while maintaining data privacy through local processing. By performing analyses on the user's machine, Cyri eliminates the need to transmit sensitive email data over the internet, reducing associated security risks. The Cyri user interface has been designed to reduce habituation effects and enhance user engagement. It employs dynamic visual cues and context-specific explanations to keep users alert and informed while using emails. Additionally, it allows users to explore identified malicious semantic features both through conversation with the agent and visual exploration, obtaining the advantages of both modalities for expert or non-expert users. It also allows users to keep track of the conversation, supports the user in solving additional questions on both computed features or new parts of the mail, and applies its detection on demand. To evaluate Cyri, we crafted a comprehensive dataset of 420 phishing emails and 420 legitimate emails. Results demonstrate high effectiveness in identifying critical phishing semantic features fundamental to phishing detection. A user study involving 10 participants, both experts and non-experts, evaluated Cyri's effectiveness and usability. Results indicated that Cyri significantly aided users in identifying phishing emails and enhanced their understanding of phishing tactics.

Paperid: 2977, https://arxiv.org/pdf/2502.05118.pdf

Abstract:
As social robots become more common, many have adopted cute aesthetics aiming to enhance user comfort and acceptance. However, the effect of this aesthetic choice on human feedback in reinforcement learning scenarios remains unclear. Previous research has shown that humans tend to give more positive than negative feedback, which can cause failure to reach optimal robot behavior. We hypothesize that this positive bias may be exacerbated by the robot's level of perceived cuteness. To investigate, we conducted a user study where participants critique a robot's trajectories while it performs a task. We then analyzed the impact of the robot's aesthetic cuteness on the type of participant feedback. Our results suggest that there is a shift in the ratio of positive to negative feedback when perceived cuteness changes. In light of this, we experiment with a stochastic version of TAMER which adapts based on the user's level of positive feedback bias to mitigate these effects.

Paperid: 2978, https://arxiv.org/pdf/2502.05017.pdf

Abstract:
Democratic processes increasingly aim to integrate large-scale voting with face-to-face deliberation, addressing the challenge of reconciling individual preferences with collective decision-making. This work introduces new methods that use algorithms and computational tools to bridge online voting with face-to-face deliberation, tested in two real-world scenarios: Kultur Komitee 2024 (KK24) and vTaiwan. These case studies highlight the practical applications and impacts of the proposed methods. We present three key contributions: (1) Preference-based Clustering for Deliberation (PCD), which enables both in-depth and broad discussions in deliberative settings by computing homogeneous and heterogeneous group compositions with balanced and adjustable group sizes; (2) Human-in-the-loop MES, a practical method that enhances the Method of Equal Shares (MES) algorithm with real-time digital feedback. This builds algorithmic trust by giving participants full control over how much decision-making is delegated to the voting aggregation algorithm as compared to deliberation; and (3) the ReadTheRoom deliberation method, which uses opinion space mapping to identify agreement and divergence, along with spectrum-based preference visualisation to track opinion shifts during deliberation. This approach enhances transparency by clarifying collective sentiment and fosters collaboration by encouraging participants to engage constructively with differing perspectives. By introducing these actionable frameworks, this research extends in-person deliberation with scalable digital methods that address the complexities of modern decision-making in participatory processes.
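To make the Method of Equal Shares (MES) mentioned above more concrete, here is a minimal sketch of the plain algorithm for approval ballots, where each approved project counts as one unit of utility. It does not implement the paper's human-in-the-loop extension. The function name, the data layout, and the alphabetical tie-breaking are assumptions for illustration, and all project costs are assumed to be positive integers.

```python
from fractions import Fraction


def equal_shares(costs, approvals, budget):
    """Plain MES sketch: costs maps project -> positive cost,
    approvals is a list of sets (one per voter), budget is the total budget."""
    n = len(approvals)
    remaining = [Fraction(budget, n)] * n  # every voter gets an equal share
    winners = []
    candidates = set(costs)
    while candidates:
        best = None  # (rho, project, supporters)
        for project in sorted(candidates):  # alphabetical tie-breaking (assumed)
            supporters = sorted(
                (v for v in range(n) if project in approvals[v]),
                key=lambda v: remaining[v],
            )
            # Skip projects whose supporters cannot jointly afford them.
            if sum(remaining[v] for v in supporters) < costs[project]:
                continue
            # Find the smallest per-voter payment rho such that each supporter
            # pays min(remaining share, rho) and the payments cover the cost.
            paid = Fraction(0)
            for idx, v in enumerate(supporters):
                rho = (costs[project] - paid) / (len(supporters) - idx)
                if rho <= remaining[v]:
                    break
                paid += remaining[v]
            if best is None or rho < best[0]:
                best = (rho, project, supporters)
        if best is None:
            break  # no remaining project is affordable
        rho, project, supporters = best
        for v in supporters:
            remaining[v] -= min(remaining[v], rho)
        winners.append(project)
        candidates.remove(project)
    return winners


# Four voters share a budget of 8, so each starts with a share of 2.
costs = {"A": 4, "B": 4, "C": 2}
approvals = [{"A"}, {"A"}, {"A", "B"}, {"B", "C"}]
selected = equal_shares(costs, approvals, budget=8)
```

In this example project A is funded first because its three supporters can split the cost most cheaply; afterwards the supporters of B no longer hold enough budget between them, so C is funded instead.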

Paperid: 2979, https://arxiv.org/pdf/2502.04942.pdf

Abstract:
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia mentions and links shared in posts and comments on Reddit 2020-2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

Paperid: 2980, https://arxiv.org/pdf/2502.04398.pdf

Abstract:
Hand kinematics can be measured in Human-Computer Interaction (HCI) with the intention to predict the user's intention in a reach-to-grasp action. Using multiple hand sensors, multivariate time series data are being captured. Given a number of possible actions on a number of objects, the goal is to classify the multivariate time series data, where the class shall be predicted as early as possible. Many machine-learning methods have been developed for such classification tasks, where different approaches produce favorable solutions on different data sets. We, therefore, employ an ensemble approach that includes and weights different approaches. To provide a trustworthy classification production, we present the XMTC tool that incorporates coordinated multiple-view visualizations to analyze the predictions. Temporal accuracy plots, confusion matrix heatmaps, temporal confidence heatmaps, and partial dependence plots allow for the identification of the best trade-off between early prediction and prediction quality, the detection and analysis of challenging classification conditions, and the investigation of the prediction evolution in an overview and detail manner. We apply XMTC to real-world HCI data in multiple scenarios and show that good classification predictions can be achieved early on with our classifier as well as which conditions are easy to distinguish, which multivariate time series measurements impose challenges, and which features have the most impact.

Paperid: 2981, https://arxiv.org/pdf/2502.04110.pdf

Abstract:
Achieving consistency in immersive learning case descriptions is essential but challenging due to variations in research focus, methodology, and researchers' background. We address these challenges by leveraging the Immersive Learning Case Sheet (ILCS), a methodological instrument to standardize case descriptions, that we applied to an immersive learning case on ancient Greek technology in VRChat. Research team members had differing levels of familiarity with the ILCS and the case content, so we developed a custom ChatGPT assistant to facilitate consistent terminology and process alignment across the team. This paper constitutes an example of how structured case reports can be a novel contribution to immersive learning literature. Our findings demonstrate how the ILCS supports structured reflection and interpretation of the case. Further, we report that the use of a ChatGPT assistant significantly supports the coherence and quality of the team members' development of the final ILCS. This exposes the potential of employing AI-driven tools to enhance collaboration and standardization of research practices in qualitative educational research. However, we also discuss the limitations and challenges, including reliance on AI for interpretive tasks and managing varied levels of expertise within the team. This study thus provides insights into the practical application of AI in standardizing immersive learning research processes.

Paperid: 2982, https://arxiv.org/pdf/2502.03971.pdf

Abstract:
Existing Visual Language Models often struggle with information loss and limited reasoning abilities when handling high-resolution web interfaces that combine complex visual, textual, and interactive elements. These challenges are particularly evident in tasks requiring webpage layout comprehension and multi-step interactive reasoning. To address these challenges, we propose RWKV-UI, a Visual Language Model based on the RWKV architecture, specifically designed to handle high-resolution UI images. During model training, we introduce layout detection as a visual prompt to help the model better understand the webpage layout structures. Additionally, we design a visual prompt based on the Chain-of-Thought (CoT) mechanism, which enhances the model's ability to understand and reason about webpage content through reasoning chains. Experimental results show that RWKV-UI demonstrates significant performance improvements in high-resolution UI understanding and interactive reasoning tasks.

Paperid: 2983, https://arxiv.org/pdf/2502.03709.pdf

Abstract:
The nine-grid layout is commonly used for multi-image posts, arranging nine images in a tic-tac-toe board. This layout effectively presents content within limited space. However, due to the numerous possible arrangements within the nine-image grid, the optimal arrangement that yields the highest level of attractiveness remains unknown. Our study investigates how the arrangement of images within a nine-grid layout affects the overall popularity of the image set, aiming to explore arrangement schemes that better match user preferences. Based on survey results regarding user preferences in image arrangement, we have identified two ordering sequences that are widely recognized: sequential order and center prioritization, considering both image visual content and aesthetic quality as alignment metrics, resulting in four layout schemes. Finally, we recruited participants to annotate various layout schemes of the same set of images. Our experience-centered evaluation indicates that layout schemes based on aesthetic quality outperformed others. This research yields empirical evidence supporting the optimization of the nine-grid layout for multi-image posts, thereby furnishing content creators with valuable insights to enhance both attractiveness and user experience.

Paperid: 2984, https://arxiv.org/pdf/2502.03419.pdf

Abstract:
This paper presents a novel adaptive Virtual Reality (VR) system that aims to mitigate cybersickness in immersive environments through dynamic, real-time adjustments. The system predicts cybersickness levels in real-time using a machine learning (ML) model trained on head tracking and kinematic data. The adaptive system adjusts foveated rendering (FFR) strength and field of view (FOV) to enhance user comfort. With a goal to balance usability with system performance, we believe our approach will optimize both user experience and performance. Adapting responsively to user needs, our work explores the potential of a machine learning-based feedback loop for user experience management, contributing to a user-centric VR system design.

Paperid: 2985, https://arxiv.org/pdf/2502.03078.pdf

Abstract:
Artificial Intelligence (AI) advancement is heavily dependent on access to large-scale, high-quality training data. However, in specialized domains such as healthcare, data acquisition faces significant constraints due to privacy regulations, ethical considerations, and limited availability. While synthetic data generation offers a promising solution, conventional approaches typically require substantial real data for training generative models. The emergence of large-scale prompt-based models presents new opportunities for synthetic data generation without direct access to protected data. However, crafting effective prompts for domain-specific data generation remains challenging, and manual prompt engineering proves insufficient for achieving output with sufficient precision and authenticity. We review recent developments in automatic prompt optimization, following PRISMA guidelines. We analyze six peer-reviewed studies published between 2020 and 2024 that focus on automatic data-free prompt optimization methods. Our analysis reveals three approaches: feedback-driven, error-based, and control-theoretic. Although all approaches demonstrate promising capabilities in prompt refinement and adaptation, our findings suggest the need for an integrated framework that combines complementary optimization techniques to enhance synthetic data generation while minimizing manual intervention. We propose future research directions toward developing robust, iterative prompt optimization frameworks capable of improving the quality of synthetic data. This advancement can be particularly crucial for sensitive fields and in specialized domains where data access is restricted, potentially transforming how we approach synthetic data generation for AI development.

Paperid: 2986, https://arxiv.org/pdf/2502.01795.pdf

Abstract:
Robots extend beyond the tools of productivity; they also contribute to creativity. While typically defined as utility-driven technologies designed for productive or social settings, the role of robots in creative settings remains underexplored. This paper examines how robots participate in artistic creation. Through semi-structured interviews with robotic artists, we analyze the impact of robots on artistic processes and outcomes. We identify the critical roles of social interaction, material properties, and temporal dynamics in facilitating creativity. Our findings reveal that creativity emerges from the co-constitution of artists, robots, and audiences within spatial-temporal dimensions. Based on these insights, we propose several implications for socially informed, material-attentive, and process-oriented approaches to creation with computing systems. These approaches can inform the domains of HCI, including media and art creation, craft, digital fabrication, and tangible computing.

Paperid: 2987, https://arxiv.org/pdf/2502.00966.pdf

Abstract:
Artistic creation is often seen as a uniquely human endeavor, yet robots bring distinct advantages to music-making, such as precise tempo control, unpredictable rhythmic complexities, and the ability to coordinate intricate human and robot performances. While many robotic music systems aim to mimic human musicianship, our work emphasizes the unique strengths of robots, resulting in a novel multi-robot performance instrument called the Beatbots, capable of producing music that is challenging for humans to replicate using current methods. The Beatbots were designed using an "informed prototyping" process, incorporating feedback from three musicians throughout development. We evaluated the Beatbots through a live public performance, surveying participants (N=28) to understand how they perceived and interacted with the robotic performance. Results show that participants valued the playfulness of the experience, the aesthetics of the robot system, and the unconventional robot-generated music. Expert musicians and non-expert roboticists demonstrated especially positive mindset shifts during the performance, although participants across all demographics had favorable responses. We propose design principles to guide the development of future robotic music systems and identify key robotic music affordances that our musician consultants considered particularly important for robotic music performance.

Paperid: 2988, https://arxiv.org/pdf/2502.00750.pdf

Abstract:
Autonomous vehicles (AVs) are rapidly evolving as an innovative mode of transportation. However, the consensus in both industry and academia is that AVs cannot independently resolve all traffic scenarios. Consequently, the need for remote human assistance becomes clear. To enable the widespread integration of AVs on public roadways, it is imperative to develop novel models for remote operation. One such model is tele-assistance, which promotes delegating low-level maneuvers to automation through high-level directives. Our study investigates the design and evaluation of a new command-based tele-assistance user interface for the teleoperation of AVs. First, by integrating various control paradigms and interaction concepts, we created a simulation-based, high-fidelity interactive prototype consisting of 175 screens. Next, we conducted a comprehensive usability study with 14 expert teleoperators to assess the acceptance and usability of the system. Finally, we formulated high-level insights and guidelines for designing command-based user interfaces for the remote operation of AVs.

Paperid: 2989, https://arxiv.org/pdf/2502.00428.pdf

Abstract:
Independent algorithm audits hold the promise of bringing accountability to automated decision-making. However, third-party audits are often hindered by access restrictions, forcing auditors to rely on limited, low-quality data. To study how these limitations impact research integrity, we conduct audit simulations on two realistic case studies for recidivism and healthcare coverage prediction. We examine the accuracy of estimating group parity metrics across three levels of access: (a) aggregated statistics, (b) individual-level data with model outputs, and (c) individual-level data without model outputs. Despite selecting one of the simplest tasks for algorithmic auditing, we find that data minimization and anonymization practices can strongly increase error rates on individual-level data, leading to unreliable assessments. We discuss implications for independent auditors, as well as potential avenues for HCI researchers and regulators to improve data access and enable both reliable and holistic evaluations.
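
As a rough illustration of the kind of group parity metric an auditor might estimate from individual-level data with model outputs, the sketch below computes a demographic parity gap. The abstract does not say which parity metrics were used, so the choice of demographic parity, the function name `demographic_parity_gap`, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def demographic_parity_gap(predictions, groups):
    """Return (gap, rates): the largest difference in positive-prediction
    rate between any two groups, plus the per-group rates.

    Hypothetical helper for illustration; not taken from the paper.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred)
    # Positive-prediction rate for each group.
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Toy data (assumed): binary model outputs and a group label per individual.
preds = [1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
gap, rates = demographic_parity_gap(preds, groups)
print(gap, rates)
```

With only aggregated statistics, an auditor would have to work from per-group rates that are already computed (and possibly rounded or suppressed), which is one way restricted access can distort such an estimate.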

Paperid: 2990, https://arxiv.org/pdf/2502.00244.pdf

Abstract:
Artificial intelligence and social computing rely on hundreds of thousands of content reviewers to classify high volumes of harmful and forbidden content. Many workers report long-term, potentially irreversible psychological harm. This work is similar to activities that cause psychological harm to other kinds of helping professionals even after small doses of exposure. Yet researchers struggle to measure the mental health of content reviewers well enough to inform diagnoses, evaluate workplace improvements, hold employers accountable, or advance scientific understanding. This systematic review summarizes psychological measures from other professions and relates them to the experiences of content reviewers. After identifying 1,673 potential papers, we reviewed 143 that validate measures in related occupations. We summarize the uses of psychological measurement for content reviewing, differences between clinical and research measures, and 12 measures that are adaptable to content reviewing. We find serious gaps in measurement validity in regions where content review labor is common. Overall, we argue for reliable measures of content reviewer mental health that match the nature of the work and are culturally-relevant.

Paperid: 2991, https://arxiv.org/pdf/2502.00055.pdf

Abstract:
Given the exponential advancement in AI technologies and the potential escalation of harmful effects from recommendation systems, it is crucial to simulate and evaluate these effects early on. Doing so can help prevent possible damage to both societies and technology companies. This paper introduces the Recommender Systems LLMs Playground (RecSysLLMsP), a novel simulation framework leveraging Large Language Models (LLMs) to explore the impacts of different content recommendation setups on user engagement and polarization in social networks. By creating diverse AI agents (AgentPrompts) with descriptive, static, and dynamic attributes, we assess their autonomous behaviour across three scenarios: Plurality, Balanced, and Similarity. Our findings reveal that the Similarity Scenario, which aligns content with user preferences, maximizes engagement while potentially fostering echo chambers. Conversely, the Plurality Scenario promotes diverse interactions but produces mixed engagement results. Our study emphasizes the need for a careful balance in recommender system designs to enhance user satisfaction while mitigating societal polarization. It underscores the unique value and challenges of incorporating LLMs into simulation environments. The benefits of RecSysLLMsP lie in its potential to calculate polarization effects, which is crucial for assessing societal impacts and determining user engagement levels with diverse recommender system setups. This advantage is essential for developing and maintaining a successful business model for social media companies. However, the study's limitations revolve around accurately emulating reality. Future efforts should validate the similarity in behaviour between real humans and AgentPrompts and establish metrics for measuring polarization scores.

Paperid: 2992, https://arxiv.org/pdf/2502.00033.pdf

Abstract:
Technological advances in high performance computing and maturing physical models allow scientists to simulate weather and climate evolutions with an increasing accuracy. While this improved accuracy allows us to explore complex dynamical interactions within such physical systems, inconceivable a few years ago, it also results in grand challenges regarding the data visualization and analytics process. We present STRIELAD, a scalable weather analytics toolkit, which allows for interactive exploration and real-time visualization of such large scale datasets. It combines parallel and distributed feature extraction using high-performance computing resources with smart level-of-detail rendering methods to assure interactivity during the complete analysis process.

Paperid: 2993, https://arxiv.org/pdf/2502.00016.pdf

Abstract:
Large language models (LLMs) show promise for aiding graduate-level education, but are limited by their training data and potential confabulations. We developed ChemTAsk, an open-source pipeline that combines LLMs with retrieval-augmented generation (RAG) to provide accurate, context-specific assistance. ChemTAsk utilizes course materials, including lecture transcripts and primary publications, to generate accurate responses to student queries. Over nine weeks in an advanced biological chemistry course at the University of Pennsylvania, students could opt in to use ChemTAsk for assistance in any assignment or to understand class material. Comparative analysis showed ChemTAsk performed on par with human teaching assistants (TAs) in understanding student queries and providing accurate information, particularly excelling in creative problem-solving tasks. In contrast, TAs were more precise in their responses and tailored their assistance to the specifics of the class. Student feedback indicated that ChemTAsk was perceived as correct, helpful, and faster than TAs. Open-source models from Meta and proprietary models from OpenAI were tested on an original biological chemistry benchmark for future iterations of ChemTAsk. OpenAI models were found to be more tolerant of deviations in the input prompt and excelled in self-assessment to safeguard against potential confabulations. Taken together, ChemTAsk demonstrates the potential of integrating LLMs with RAG to enhance educational support, offering a scalable tool for students and educators.

Paperid: 2994, https://arxiv.org/pdf/2501.19405.pdf

Abstract:
Precision Medicine (PM) transforms the traditional "one-drug-fits-all" paradigm by customising treatments based on individual characteristics, and is an emerging topic for HCI research on digital health. A key element of PM, the Polygenic Risk Score (PRS), uses genetic data to predict an individual's disease risk. Despite its potential, PRS faces barriers to adoption, such as data inclusivity, psychological impact, and public trust. We conducted a mixed-methods study to explore how people perceive PRS, formed of surveys (n=254) and interviews (n=11) with UK-based participants. The interviews were supplemented by interactive storyboards with the ContraVision technique to provoke deeper reflection and discussion. We identified ten key barriers and five themes to PRS adoption and proposed design implications for a responsible PRS framework. To address the complexities of PRS and enhance broader PM practices, we introduce the term Human-Precision Medicine Interaction (HPMI), which integrates, adapts, and extends HCI approaches to better meet these challenges.

Paperid: 2995, https://arxiv.org/pdf/2501.19323.pdf

Abstract:
The Industry 5.0 transition highlights EU efforts to design intelligent devices that can work alongside humans to enhance human capabilities. This vision aligns with user preferences, and users' need to feel safe while collaborating with such systems takes priority. This demands a human-centric research vision and requires a societal and educational shift in how we perceive technological advancements. To better understand this perspective, we conducted a systematic literature review focusing on how trust and trustworthiness can be key aspects of supporting this move towards Industry 5.0. This review aims to give an overview of the most common methodologies and measurements and to collect insights about barriers and facilitators for fostering trustworthy human-robot interaction (HRI). After a rigorous quality assessment following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, using rigorous inclusion criteria and screening by at least two reviewers, 34 articles were included in the review. The findings underscore the significance of trust and safety as foundational elements for promoting secure and trustworthy human-machine cooperation. We confirm that almost 30% of the reviewed articles do not present a definition of trust, which can be problematic, as this lack of conceptual clarity can undermine research efforts in addressing this problem from a central perspective. The review also highlights that the choice of domain and area of application should influence the choice of methods and approaches to fostering trust in HRI, as those choices can significantly affect user preferences and their perceptions and assessment of robot capabilities. Additionally, this lack of conceptual clarity can be a potential barrier to fostering trust in HRI and explains the sometimes contradictory findings or choice of methods and instruments used to investigate trust in robots and other autonomous systems in the literature.

Paperid: 2996, https://arxiv.org/pdf/2501.19256.pdf

Abstract:
Explanation is a fundamentally human process. Understanding the goal and audience of the explanation is vital, yet existing work on explainable reinforcement learning (XRL) routinely does not consult humans in their evaluations. Even when they do, they routinely resort to subjective metrics, such as confidence or understanding, that can only inform researchers of users' opinions, not their practical effectiveness for a given problem. This paper calls on researchers to use objective human metrics for explanation evaluations based on observable and actionable behaviour to build more reproducible, comparable, and epistemically grounded research. To this end, we curate, describe, and compare several objective evaluation methodologies for applying explanations to debugging agent behaviour and supporting human-agent teaming, illustrating our proposed methods using a novel grid-based environment. We discuss how subjective and objective metrics complement each other to provide holistic validation and how future work needs to utilise standardised benchmarks for testing to enable greater comparisons between research.

Paperid: 2997, https://arxiv.org/pdf/2501.19217.pdf

Abstract:
Flow, a state of deep task engagement, is associated with optimal experience and well-being, making its detection a prolific HCI research focus. While physiological sensors show promise for flow detection, most studies are lab-based. Furthermore, brain sensing during natural work remains unexplored due to the intrusive nature of traditional EEG setups. This study addresses this gap by using wearable, around-the-ear EEG sensors to observe flow during natural knowledge work, measuring EEG throughout an entire day. In a semi-controlled field experiment, participants engaged in academic writing or programming, with their natural flow experiences compared to those from a classic lab paradigm. Our results show that natural work tasks elicit more intense flow than artificial tasks, albeit with smaller experience contrasts. EEG results show a well-known quadratic relationship between theta power and flow across tasks, and a novel quadratic relationship between beta asymmetry and flow during complex, real-world tasks.

Paperid: 2998, https://arxiv.org/pdf/2501.19211.pdf

Abstract:
The capability of GenAI-based chatbots, such as ChatGPT and Gemini, has expanded quickly in recent years, turning them into GenAI Chatbot Ecosystems. Yet, users' understanding of how such ecosystems work remains unknown. In this paper, we investigate users' mental models of how GenAI Chatbot Ecosystems work. This is an important question because users' mental models guide their behaviors, including making decisions that impact their privacy. Through 21 semi-structured interviews, we uncovered users' four mental models towards first-party (e.g., Google Gemini) and third-party (e.g., ChatGPT) GenAI Chatbot Ecosystems. These mental models centered around the role of the chatbot in the entire ecosystem. We further found that participants held a more consistent and simpler mental model towards third-party ecosystems than the first-party ones, resulting in higher trust and fewer concerns towards the third-party ecosystems. We discuss the design and policy implications based on our results.

Paperid: 2999, https://arxiv.org/pdf/2501.19174.pdf

Abstract:
This work presents NeuroTouch, an optical-based tactile sensor that combines a highly deformable dome-shaped soft material with an integrated neuromorphic camera, leveraging frame-based and dynamic vision for gesture detection. Our approach transforms an elastic body into a rich and nuanced interactive controller by tracking markers printed on its surface with event-based methods and harnessing their trajectories through RANSAC-based techniques. To benchmark our framework, we have created a 25 min gesture dataset, which we make publicly available to foster research in this area. Achieving over 91% accuracy in gesture classification, a 3.41 mm finger localization distance error, and a 0.96 mm gesture intensity error, our real-time, lightweight, and low-latency pipeline holds promise for applications in video games, augmented/virtual reality, and accessible devices. This research lays the groundwork for advancements in gesture detection for vision-based soft-material input technologies. Dataset: Coming Soon, Video: Coming Soon

Paperid: 3000, https://arxiv.org/pdf/2501.18754.pdf

Abstract:
Traditionally rooted in the domain of Human-Computer Interaction (HCI), usability has been primarily associated with the technological performance of a system's user interface. However, as learning technologies continue to advance, a pressing need exists to evaluate these tools from a broader perspective, encompassing not just technological but also pedagogical and sociocultural dimensions. The current paper delves into the multifaceted nature of usability in the context of Learning Design and Technology (LDT). We identified prevailing gaps in current usability research practices within LDT, notably the over-reliance on HCI-derived instruments that may not holistically capture the unique usability demands of learning technologies. To address these challenges, we embarked on the development and analysis of the Comprehensive Assessment of Usability Scale for Learning Technologies (CAUSLT). A total of 155 responses were collected and analyzed. Utilizing exploratory factor analysis, this study aimed to explore core constructs for the development of CAUSLT. Our findings underscore the importance and the critical need for a comprehensive usability evaluation framework tailored for learning technologies, setting the stage for more effective and user-centric educational tools.

Paperid: 3001, https://arxiv.org/pdf/2501.18565.pdf

Abstract:
In recent years, the rapid development of artificial intelligence (AI) especially multi-modal Large Language Models (MLLMs), has enabled it to understand text, images, videos, and other multimedia data, allowing AI systems to execute various tasks based on human-provided prompts. However, AI-powered bots have increasingly been able to bypass most existing CAPTCHA systems, posing significant security threats to web applications. This makes the design of new CAPTCHA mechanisms an urgent priority. We observe that humans are highly sensitive to shifts and abrupt changes in videos, while current AI systems still struggle to comprehend and respond to such situations effectively. Based on this observation, we design and implement BounTCHA, a CAPTCHA mechanism that leverages human perception of boundaries in video transitions and disruptions. By utilizing generative AI's capability to extend original videos with prompts, we introduce unexpected twists and changes to create a pipeline for generating guided short videos for CAPTCHA purposes. We develop a prototype and conduct experiments to collect data on humans' time biases in boundary identification. This data serves as a basis for distinguishing between human users and bots. Additionally, we perform a detailed security analysis of BounTCHA, demonstrating its resilience against various types of attacks. We hope that BounTCHA will act as a robust defense, safeguarding millions of web applications in the AI-driven era.

Paperid: 3002, https://arxiv.org/pdf/2501.18462.pdf

Abstract:
Entrepreneurship is a key component of society, and universities and major political structures have tried to support its development in recent years. The present study aims to check the perception of students (based on gender) about entrepreneurial intentions after participating in a course that had a large number of undergraduate students. There were 970 students enrolled from different faculties with various specializations. We conducted a gender-based survey on the unconventional entrepreneurial fundamentals course, where each course was delivered by a different speaker. We also compared the responses provided by computer science students with the overall responses to find differences in their perceptions related to the feasibility of teaching entrepreneurship online, determining the entrepreneurial intention of the students taking this course, and analyzing the perceptions related to the business environment and the ease of starting a business. We found that students, regardless of gender or field of study, prefer interactive online presentations based on the manner in which lectures on this subject were conducted.

Paperid: 3003, https://arxiv.org/pdf/2501.18045.pdf

Abstract:
How has the public responded to the increasing prevalence of artificial intelligence (AI)-based technologies? We investigate public perceptions of AI by collecting over 12,000 responses over 12 months from a nationally representative U.S. sample. Participants provided open-ended metaphors reflecting their mental models of AI, a methodology that overcomes the limitations of traditional self-reported measures by capturing more nuance. Using a mixed-methods approach combining quantitative clustering and qualitative coding, we identify 20 dominant metaphors shaping public understanding of AI. To analyze these metaphors systematically, we present a scalable framework integrating language modeling (LM)-based techniques to measure key dimensions of public perception: anthropomorphism (attribution of human-like qualities), warmth, and competence. We find that Americans generally view AI as warm and competent, and that over the past year, perceptions of AI's human-likeness and warmth have significantly increased ($+34\%, r = 0.80, p < 0.01; +41\%, r = 0.62, p < 0.05$). These implicit perceptions, along with the identified dominant metaphors, strongly predict trust in and willingness to adopt AI ($r^2 = 0.21, 0.18, p < 0.001$). Moreover, we uncover systematic demographic differences in metaphors and implicit perceptions, such as the higher propensity of women, older individuals, and people of color to anthropomorphize AI, which shed light on demographic disparities in trust and adoption. In addition to our dataset and framework for tracking evolving public attitudes, we provide actionable insights on using metaphors for inclusive and responsible AI development.

Paperid: 3004, https://arxiv.org/pdf/2501.17942.pdf

Abstract:
As large language models (LLMs) become increasingly integrated into educational technology, their potential to assist in developing curricula has gained interest among educators. Despite this growing attention, their applicability in culturally responsive Indigenous educational settings, like Hawaiʻi's public schools and Kaiapuni (immersion language) programs, remains understudied. Additionally, ʻŌlelo Hawaiʻi, the Hawaiian language, as a low-resource language, poses unique challenges and concerns about cultural sensitivity and the reliability of generated content. Through surveys and interviews with kumu (educators), this study explores the perceived benefits and limitations of using LLMs for culturally revitalizing computer science (CS) education in Hawaiian public schools with Kaiapuni programs. Our findings highlight AI's time-saving advantages while exposing challenges such as cultural misalignment and reliability concerns. We conclude with design recommendations for future AI tools to better align with Hawaiian cultural values and pedagogical practices, towards the broader goal of trustworthy, effective, and culturally grounded AI technologies.

Paperid: 3005, https://arxiv.org/pdf/2501.17182.pdf

Abstract:
Emotional support dialogue systems aim to reduce help-seekers' distress and help them overcome challenges. While human values (core beliefs that shape an individual's priorities) are increasingly emphasized in contemporary psychological therapy for their role in fostering internal transformation and long-term emotional well-being, their integration into emotional support systems remains underexplored. To bridge this gap, we present a value-driven method for training emotional support dialogue systems designed to reinforce positive values in seekers. Notably, our model identifies which values to reinforce at each turn and how to do so, by leveraging online support conversations from Reddit. We evaluate the method across support skills, seekers' emotional intensity, and value reinforcement. Our method consistently outperforms various baselines, effectively exploring and eliciting values from seekers. Additionally, leveraging crowd knowledge from Reddit significantly enhances its effectiveness. Therapists highlighted its ability to validate seekers' challenges and emphasize positive aspects of their situations, both crucial elements of value reinforcement. Our work, being the first to integrate value reinforcement into emotional support systems, demonstrates its promise and establishes a foundation for future research.

Paperid: 3006, https://arxiv.org/pdf/2501.16668.pdf

Abstract:
As radical messaging has proliferated on social networking sites, platforms like Reddit have been used to host support groups, including support communities for the families and friends of radicalized individuals. This study examines the subreddit r/QAnonCasualties, an online forum for users whose loved ones have been radicalized by QAnon. We collected 1,665 posts and 78,171 comments posted between 7/2021 and 7/2022 and content coded top posts for prominent themes. Sentiment analysis was also conducted on all posts. We find venting, advice and validation-seeking, and pressure to refuse the COVID-19 vaccine were prominent themes. 40% (n=167) of coded posts identified the Q relation(s) of users as their parent(s) and 16.3% (n=68) as their partner. Posts with higher proportions of words related to swearing, social referents, and physical needs were positively correlated with engagement. These findings show ways that communities around QAnon adherents leverage anonymous online spaces to seek and provide social support.

Paperid: 3007, https://arxiv.org/pdf/2501.16577.pdf

Abstract:
Generative AI could enhance scientific discovery by supporting knowledge workers in science organizations. However, the real-world applications and perceived concerns of generative AI use in these organizations are uncertain. In this paper, we report on a collaborative study with a US national laboratory, with employees spanning Science and Operations, about their use of generative AI tools. We surveyed 66 employees, interviewed a subset (N=22), and measured lab-wide early adoption of an internal generative AI interface called Argo. We have four findings: (1) Argo usage data shows small but increasing use by Science and Operations employees; common current and envisioned use cases for generative AI in this context conceptually fall into either a (2) copilot or (3) workflow agent modality; and (4) concerns include sensitive data security, academic publishing, and job impacts. Based on our findings, we make recommendations for generative AI use in science and other organizations.

Paperid: 3008, https://arxiv.org/pdf/2501.15877.pdf

Abstract:
There is a growing need for diverse, high-quality stuttered speech data, particularly in the context of Indian languages. This paper introduces Project Boli, a multi-lingual stuttered speech dataset designed to advance scientific understanding and technology development for individuals who stutter, particularly in India. The dataset comprises (a) anonymized metadata (gender, age, country, mother tongue) and responses to a questionnaire about how stuttering affects participants' daily lives, (b) both read speech (using the Rainbow Passage) and spontaneous speech (through image description tasks) for each participant, and (c) detailed annotations of five stutter types: blocks, prolongations, interjections, sound repetitions and word repetitions. We present a comprehensive analysis of the dataset, including the data collection procedure, experience summarization of people who stutter, severity assessment of stuttering events and technical validation of the collected data. The dataset is released as open access to further speech technology development.

Paperid: 3009, https://arxiv.org/pdf/2501.15819.pdf

Abstract:
Individuals with visual impairments cannot carry out their day-to-day activities as smoothly as other people do. Independent walking in particular is hard to achieve with a visual impairment. Assistive electronic travel aids equipped with different types of sensors are designed to help visually impaired persons navigate safely. Research on combining multiple sensors in assistive navigation aids for visually impaired people is limited, and most work targets sensor integration rather than sensor fusion. This paper addresses how sensor fusion and integration can be used to improve the sub-processes of visually impaired navigation and how to evaluate a sensor fusion-based approach. Its contributions to sensor fusion in visually impaired navigation include a novel homogeneous sensor fusion algorithm based on the extended Kalman filter, a novel heterogeneous sensor integration approach, and a complementary sensor fusion algorithm based on the error-state extended Kalman filter. Overall, this research presents a novel navigational framework that integrates obstacle detection, obstacle recognition, localization, motion planning, and current context awareness with sensor fusion.
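
To make the idea of homogeneous sensor fusion concrete, the sketch below fuses several noisy readings of the same quantity (for example, distance to an obstacle) with a scalar Kalman measurement update. This is a plain linear Kalman update for a static state, not the extended or error-state filters the paper proposes; the function names and the example readings are assumptions made for illustration.

```python
def kalman_update(estimate, variance, measurement, measurement_variance):
    """One scalar Kalman measurement update.

    Blends the current estimate with a new measurement, weighting each
    by its uncertainty. Returns the new (estimate, variance).
    """
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


def fuse_readings(readings):
    """Fuse a list of (value, variance) readings of the same quantity.

    The first reading initialises the estimate; each later reading is
    folded in with kalman_update. Hypothetical helper, not from the paper.
    """
    estimate, variance = readings[0]
    for value, value_variance in readings[1:]:
        estimate, variance = kalman_update(estimate, variance, value, value_variance)
    return estimate, variance


# Two equally reliable range readings (assumed values, in metres).
fused = fuse_readings([(1.0, 1.0), (3.0, 1.0)])
print(fused)
```

Because both readings carry the same variance, the fused estimate lands midway between them and the fused variance is halved; a less reliable sensor would pull the estimate less.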

Paperid: 3010, https://arxiv.org/pdf/2501.15678.pdf

Abstract:
As the use of Generative AI (GenAI) tools becomes more prevalent in interpersonal communication, understanding their impact on social perceptions is crucial. According to signaling theory, GenAI may undermine the credibility of social signals conveyed in writing, since it reduces the cost of writing and makes it hard to verify the authenticity of messages. Using a pre-registered large-scale online experiment (N = 647; Prolific), featuring scenarios in a range of communication contexts (personal vs. professional; close others vs. strangers), we explored how senders' use of GenAI influenced recipients' impressions of senders, both when GenAI use was known and when it was uncertain. Consistent with past work, we found strong negative effects on social impressions when disclosing that a message was AI-generated, compared to when the same message was human-written. However, under the more realistic condition when potential GenAI use was not explicitly highlighted, recipients did not exhibit any skepticism towards senders, and these "uninformed" impressions were virtually indistinguishable from those of fully human-written messages. Even when we highlighted the potential (but uncertain) use of GenAI, recipients formed overly positive impressions. These results are especially striking given that 46% of our sample admitted having used such tools for writing messages within the past two weeks alone. Our findings put past work in a new light: While social judgments can be substantially affected when GenAI use is explicitly disclosed, this information may not be readily available in more realistic communication settings, making recipients blissfully ignorant about others' potential use of GenAI.

Paperid: 3011, https://arxiv.org/pdf/2501.15346.pdf

Abstract:
This chapter examines the conceptual tensions in understanding artificial intelligence (AI) agents' role in creative processes, particularly focusing on Large Language Models (LLMs). Building upon Schmidt's 1954 categorization of human-technology relationships and the classical definition of "author," this chapter proposes to understand AI agency as existing somewhere between that of an inanimate puppet and a performing actor. While AI agents demonstrate a degree of creative autonomy, including the ability to improvise and construct complex narrative content in interactive storytelling, they cannot be considered authors in the classical sense of the term. This chapter thus suggests that AI agents exist in a dynamic state between human-controlled puppets and semi-autonomous actors. This conceptual positioning reflects how AI agents, while they can certainly contribute to creative work, remain bound to human direction. We also argue that existing conceptual frames concerning authorship should evolve and adapt to capture these new relationships.

Paperid: 3012, https://arxiv.org/pdf/2501.15056.pdf

Abstract:
Effective decision-making and problem-solving in conversational systems require the ability to identify and acquire missing information through targeted questioning. A key challenge lies in efficiently narrowing down a large space of possible outcomes by posing questions that minimize uncertainty. To address this, we introduce a novel framework that leverages Large Language Models (LLMs) to generate information-seeking questions, with Monte Carlo Tree Search (MCTS) to strategically select questions that maximize information gain, as a part of inference-time planning. Our primary contribution includes a hierarchical feedback mechanism that exploits past interaction patterns to guide future strategy. Specifically, each new problem is mapped to a cluster based on semantic similarity, and our UCT (Upper Confidence bound for Trees) formulation employs a cluster-specific bonus reward to prioritize successful question trajectories that have proven effective for similar problems in the past. Extensive empirical evaluation across medical diagnosis and technical troubleshooting domains shows that our method achieves an average of 12% improvement in success rates and about 10x reduction in the number of LLM calls made for planning per conversation, compared to the state of the art. An additional 8% gain in success rate is observed on average when we start with a constrained set of possibilities. Our results underscore the efficacy of feedback-aware MCTS in enhancing information-seeking in goal-oriented dialogues.
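
The selection rule described above can be sketched as a standard UCT score with an extra cluster-specific term. This is only an illustration under stated assumptions: the abstract does not say how the bonus enters the formula, so the additive form, the exploration constant `c`, and the names `uct_score` and `select_question` are hypothetical.

```python
import math


def uct_score(total_reward, visits, parent_visits, cluster_bonus=0.0, c=1.4):
    """Standard UCT value plus an assumed additive cluster-specific bonus."""
    if visits == 0:
        return math.inf  # always try unvisited questions first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore + cluster_bonus


def select_question(children, parent_visits, bonuses):
    """Pick the candidate question with the highest score.

    children: {question: (total_reward, visits)}
    bonuses:  {question: bonus learned from past problems in the same cluster}
    """
    return max(
        children,
        key=lambda q: uct_score(*children[q], parent_visits, bonuses.get(q, 0.0)),
    )


# Illustrative values: two equally explored questions; the cluster bonus
# breaks the tie in favor of the one that worked on similar past problems.
stats = {"fever?": (3.0, 5), "rash?": (3.0, 5)}
choice = select_question(stats, parent_visits=10, bonuses={"rash?": 0.2})
```

With identical visit statistics, the bonus alone decides the choice, which is one simple way past trajectories could steer future search.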

Paperid: 3013, https://arxiv.org/pdf/2501.14903.pdf

Abstract:
Conspiratorial thinking can connect many distinct or distant ills to a central cause. This belief has visual form in the octopus map: a map where a central force (for instance a nation, an ideology, or an ethnicity) is depicted as a literal or figurative octopus, with extending tendrils. In this paper, we explore how octopus maps function as visual arguments through an analysis of historical examples as well as through a crowd-sourced study on how the underlying data and the use of visual metaphors contribute to specific negative or conspiratorial interpretations. We find that many features of the data or visual style can lead to "octopus-like" thinking in visualizations, even without the use of an explicit octopus motif. We conclude with a call for a deeper analysis of visual rhetoric, and an acknowledgment of the potential for the design of data visualizations to contribute to harmful or conspiratorial thinking.

Paperid: 3014, https://arxiv.org/pdf/2501.14778.pdf

Abstract:
The increasing use of AI technologies has led to increasing AI incidents, posing risks and causing harm to individuals, organizations, and society. This study recognizes and addresses the lack of standardized protocols for reliably and comprehensively gathering such incident data crucial for preventing future incidents and developing mitigating strategies. Specifically, this study analyses existing open-access AI-incident databases through a systematic methodology and identifies nine gaps in current AI incident reporting practices. Further, it proposes nine actionable recommendations to enhance standardization efforts to address these gaps. Ensuring the trustworthiness of enabling technologies such as AI is necessary for sustainable digital transformation. Our research promotes the development of standards to prevent future AI incidents and promote trustworthy AI, thus facilitating achieving the UN sustainable development goals. Through international cooperation, stakeholders can unlock the transformative potential of AI, enabling a sustainable and inclusive future for all.

Paperid: 3015, https://arxiv.org/pdf/2501.14530.pdf

Abstract:
Mental disorders have become a significant global public health issue, while the shortage of psychiatrists and inefficient training systems severely hinder the accessibility of mental health services. This paper designs and implements an artificial intelligence-based training system for psychiatrists. By integrating technologies such as large language models, knowledge graphs, and expert systems, the system constructs an intelligent and standardized training platform. It includes six functional modules: case generation, consultation dialogue, examination prescription, diagnostic decision-making, integrated traditional Chinese and Western medicine prescription, and expert evaluation, providing comprehensive support from clinical skill training to professional level assessment. The system adopts a B/S architecture, developed using the Vue.js and Node.js technology stack, and innovatively applies deep learning algorithms for case generation and doctor-patient dialogue. In a clinical trial involving 60 psychiatrists at different levels, the system demonstrated excellent performance and training outcomes: system stability reached 99.95%, AI dialogue accuracy achieved 96.5%, diagnostic accuracy reached 92.5%, and user satisfaction scored 92.3%. Experimental data showed that doctors using the system improved their knowledge mastery, clinical thinking, and diagnostic skills by 35.6%, 28.4%, and 23.7%, respectively. The research results provide an innovative solution for improving the efficiency of psychiatrist training and hold significant importance for promoting the standardization and scalability of mental health professional development.

Paperid: 3016, https://arxiv.org/pdf/2501.14133.pdf

Abstract:
Background: The rise of mobile technology and health apps has increased the use of person-generated health data (PGHD). PGHD holds significant potential for clinical decision-making but remains challenging to manage. Objective: This study aimed to enhance the clinical utilization of wearable health data by developing the Validation and Inspection Tool for Armband-Based Lifelog Data (VITAL), a pipeline for data integration, visualization, and quality management, and evaluating its usability. Methods: The study followed a structured process of requirement gathering, tool implementation, and usability evaluation. Requirements were identified through input from four clinicians. Wearable health data from Samsung, Apple, Fitbit, and Xiaomi devices were integrated into a standardized dataframe at 10-minute intervals, focusing on biometrics, activity, and sleep. Features of VITAL support data integration, visualization, and quality management. Usability evaluation involved seven clinicians performing tasks, completing the Unified Theory of Acceptance and Use of Technology (UTAUT) survey, and participating in interviews to identify usability issues. Results: VITAL successfully integrated wearable data, thus enabling all participants to complete tasks with minimal errors without prior participant training. UTAUT survey results were positive, with average scores of 4.2 for performance expectancy, 3.96 for effort expectancy, and 4.14 for intention to use, indicating high user satisfaction and intent to adopt the tool. Conclusions: By enhancing wearable data integration, visualization, and quality management, the VITAL prototype shows significant potential for clinical application. Positive feedback highlights its promise, while emphasizing the need for further studies to confirm its real-world effectiveness.
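
The step of aligning readings from different devices onto a common 10-minute grid can be sketched in plain Python. This is an assumption-laden illustration, not VITAL's implementation: averaging within each slot, the metric names, and the helpers `floor_to_10min` and `bin_readings` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def floor_to_10min(ts):
    """Round a timestamp down to the start of its 10-minute slot."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def bin_readings(readings):
    """Group (timestamp, metric, value) readings into 10-minute slots.

    Values of the same metric within a slot are averaged (an assumption;
    sums or last values may suit step counts or sleep stages better).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for ts, metric, value in readings:
        buckets[floor_to_10min(ts)][metric].append(value)
    return {
        slot: {metric: mean(values) for metric, values in metrics.items()}
        for slot, metrics in sorted(buckets.items())
    }


# Illustrative readings, as if collected from two different devices.
table = bin_readings([
    (datetime(2025, 1, 1, 9, 3), "heart_rate", 70),
    (datetime(2025, 1, 1, 9, 7), "heart_rate", 72),
    (datetime(2025, 1, 1, 9, 12), "steps", 120),
])
```

Once every device's data shares the same slot keys, missing slots and implausible values become easy to spot, which is the kind of quality check the abstract refers to.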

Paperid: 3017, https://arxiv.org/pdf/2501.14098.pdf

Abstract:
Remote healthcare technology can help tackle societal issues by improving access to quality healthcare services and enhancing diagnoses through in-place monitoring. These services can be implemented through a combination of mobile devices, applications, wearable sensors, and other smart technology. It is paramount to handle sensitive data that is collected in ways that meet users' privacy expectations. We surveyed 384 people in Canada aged 20 to 93 years old to explore participants' comfort with data collection, sharing preferences, and potential privacy concerns related to remote healthcare technology. We explore these topics within the context of various healthcare scenarios including health emergencies and managing chronic health conditions.

Paperid: 3018, https://arxiv.org/pdf/2501.12576.pdf

Abstract:
In blockchain-based order book systems, buyers and sellers trade assets, while miners match them and include their transactions in the blockchain. It is found that many miners behave selfishly and myopically, prioritizing transactions with high fees and ignoring many desirable matches that could enhance social welfare. Existing blockchain mechanisms fail to address this issue by overlooking miners' selfish behaviors. To the best of our knowledge, this work presents the first analytical study to quantify and understand buyer and seller transaction fee choices and selfish miners' transaction matching strategies, proving an infinitely large price of anarchy (PoA) for social welfare loss. To mitigate this, we propose an adjustable block size mechanism that is easy to implement without altering the existing decentralized protocols and still allows buyers and sellers to freely decide transaction fees and miners to selfishly match. The analysis is challenging, as pure strategy Nash equilibria do not always exist, requiring the analysis of many buyers' or sellers' interactive mixed-strategy distributions. Moreover, the system designer may even lack information about each buyer's or seller's bid/ask prices and trading quantities. Nevertheless, our mechanism achieves a well-bounded PoA, and under the homogeneous-quantity trading for non-fungible tokens (NFT), it attains a PoA of 1 with no social welfare loss. We implement our mechanism on a local instance of Ethereum to demonstrate the feasibility of our approach. Experiments based on a realistic dataset demonstrate that our mechanism achieves social optimum for homogeneous-quantity trading like NFT. It can enhance social welfare up to 3.7 times compared to the existing order book benchmarks for heterogeneous-quantity trading of Bitcoin tokens. It exhibits robustness against random variations in buyers and sellers.

Paperid: 3019, https://arxiv.org/pdf/2501.12557.pdf

Abstract:
Large language models (LLMs) have been positioned to revolutionize HCI, by reshaping not only the interfaces, design patterns, and sociotechnical systems that we study, but also the research practices we use. To date, however, there has been little understanding of LLMs' uptake in HCI. We address this gap via a systematic literature review of 153 CHI papers from 2020-24 that engage with LLMs. We taxonomize: (1) domains where LLMs are applied; (2) roles of LLMs in HCI projects; (3) contribution types; and (4) acknowledged limitations and risks. We find LLM work in 10 diverse domains, primarily via empirical and artifact contributions. Authors use LLMs in five distinct roles, including as research tools or simulated users. Still, authors often raise validity and reproducibility concerns, and overwhelmingly study closed models. We outline opportunities to improve HCI research with and on LLMs, and provide guiding questions for researchers to consider the validity and appropriateness of LLM-related work.

Paperid: 3020, https://arxiv.org/pdf/2501.12001.pdf

Abstract:
In this study, we introduce the Conversation Progress Guide (CPG), a system designed for text-based conversational AI interactions that provides a visual interface to represent progress. Users often encounter failures when interacting with conversational AI, which can negatively affect their self-efficacy (an individual's belief in their capabilities), reducing their willingness to engage with these services. The CPG offers visual feedback on task progress, providing users with mastery experiences, a key source of self-efficacy. To evaluate the system's effectiveness, we conducted a user study assessing how the integration of the CPG influences user engagement and self-efficacy. Results demonstrate that users interacting with a conversational AI enhanced by the CPG showed significant improvements in self-efficacy measures compared to those using a conventional conversational AI.

Paperid: 3021, https://arxiv.org/pdf/2501.11840.pdf

Abstract:
Systematic reviews are time-consuming endeavors. Historically speaking, knowledgeable humans have had to screen and extract data from studies before it can be analyzed. However, large language models (LLMs) hold promise to greatly accelerate this process. After a pilot study which showed great promise, we investigated the use of freely available LLMs for extracting data for systematic reviews. Using three different LLMs, we extracted 24 types of data, 9 explicitly stated variables and 15 derived categorical variables, from 112 studies that were included in a published scoping review. Overall we found that Gemini 1.5 Flash, Gemini 1.5 Pro, and Mistral Large 2 performed reasonably well, with 71.17%, 72.14%, and 62.43% of data extracted being consistent with human coding, respectively. While promising, these results highlight the dire need for a human-in-the-loop (HIL) process for AI-assisted data extraction. As a result, we present a free, open-source program we developed (AIDE) to facilitate user-friendly, HIL data extraction with LLMs.
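
The consistency percentages reported above are agreement rates between LLM output and human coding. A minimal sketch of such a rate follows. The matching rule is an assumption, since the abstract does not state how consistency was judged: here two values agree when they are equal after trimming whitespace and ignoring case. The function name `agreement_rate` is hypothetical.

```python
def agreement_rate(llm_values, human_values):
    """Percentage of items where the LLM value matches the human-coded value."""
    if len(llm_values) != len(human_values):
        raise ValueError("both lists must cover the same items")
    if not human_values:
        raise ValueError("at least one item is required")

    def normalize(value):
        # Assumed matching rule: case-insensitive, whitespace-trimmed equality.
        return str(value).strip().lower()

    matches = sum(
        1 for a, b in zip(llm_values, human_values) if normalize(a) == normalize(b)
    )
    return 100.0 * matches / len(human_values)


# Illustrative values only, not data from the study.
rate = agreement_rate(
    ["RCT", "2019", "survey ", "Canada"],
    ["rct", "2019", "Survey", "USA"],
)
```

Items that fail to match are natural candidates for human review, which is where a human-in-the-loop process would focus its effort.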

Paperid: 3022, https://arxiv.org/pdf/2501.11782.pdf

Abstract:
As modern video games become increasingly complex, traditional manual testing methods are proving costly and inefficient, limiting the ability to ensure high-quality game experiences. While advancements in Artificial Intelligence (AI) offer the potential to assist human testers, the effectiveness of AI in truly enhancing real-world human performance remains underexplored. This study investigates how AI can improve game testing by developing and experimenting with an AI-assisted workflow that leverages state-of-the-art machine learning models for defect detection. Through an experiment involving 800 test cases and 276 participants of varying backgrounds, we evaluate the effectiveness of AI assistance under four conditions: with or without AI support, and with or without detailed knowledge of defects and design documentation. The results indicate that AI assistance significantly improves defect identification performance, particularly when paired with detailed knowledge. However, challenges arise when AI errors occur, negatively impacting human decision-making. Our findings show the importance of optimizing human-AI collaboration and implementing strategies to mitigate the effects of AI inaccuracies. Through this research, we demonstrate AI's potential and problems in enhancing efficiency and accuracy in game testing workflows and offer practical insights for integrating AI into the testing process.

Paperid: 3023, https://arxiv.org/pdf/2501.11748.pdf

Abstract:
Software development relies on effective collaboration between Software Development Engineers (SDEs) and User eXperience Designers (UXDs) to create software products of high quality and usability. While this collaboration issue has been explored over the past decades, anecdotal evidence continues to indicate the existence of challenges in their collaborative efforts. To understand this gap, we first conducted a systematic literature review (SLR) of 45 papers published since 2004, uncovering three key collaboration challenges and two main categories of potential best practices. We then analyzed designer and developer forums and discussions from one open-source software repository to assess how the challenges and practices manifest in the status quo. Our findings have broad applicability for collaboration in software development, extending beyond the partnership between SDEs and UXDs. The suggested best practices and interventions also act as a reference for future research, assisting in the development of dedicated collaboration tools for SDEs and UXDs.

Paperid: 3024, https://arxiv.org/pdf/2501.11433.pdf

Abstract:
Collaboration has been shown to enhance creativity, leading to more innovative and effective outcomes. While previous research has explored the abilities of Large Language Models (LLMs) to serve as co-creative partners in tasks like writing poetry or creating narratives, the collaborative potential of LLMs in humor-rich and culturally nuanced domains remains an open question. To address this gap, we conducted a user study to explore the potential of LLMs in co-creating memes - a humor-driven and culturally specific form of creative expression. We conducted a user study with three groups of 50 participants each: a human-only group creating memes without AI assistance, a human-AI collaboration group interacting with a state-of-the-art LLM model, and an AI-only group where the LLM autonomously generated memes. We assessed the quality of the generated memes through crowdsourcing, with each meme rated on creativity, humor, and shareability. Our results showed that LLM assistance increased the number of ideas generated and reduced the effort participants felt. However, it did not improve the quality of the memes when humans collaborated with LLM. Interestingly, memes created entirely by AI performed better than both human-only and human-AI collaborative memes in all areas on average. However, when looking at the top-performing memes, human-created ones were better in humor, while human-AI collaborations stood out in creativity and shareability. These findings highlight the complexities of human-AI collaboration in creative tasks. While AI can boost productivity and create content that appeals to a broad audience, human creativity remains crucial for content that connects on a deeper level.

Paperid: 3025, https://arxiv.org/pdf/2501.10847.pdf

Abstract:
Enterprise ontology serves as a foundational framework for semantically comprehending the nature of organizations and the essential components that uphold their integrity. The systematic and conceptual understanding of organizations has garnered significant attention from researchers due to its pivotal role in various domains, including business modeling, enterprise architecture, business process management, context-aware systems, application development, interoperability across diverse systems and platforms, knowledge management, organizational learning and innovation, and conflict resolution within organizations. Achieving a consensus on the concepts related to the fundamental elements that constitute an organization is therefore critical. This study aims to conduct a comprehensive analysis and comparison of existing conceptual models of enterprises as documented in scholarly articles published over the past decade. We discuss the strengths and weaknesses of each model and introduce a robust framework for their evaluation. To facilitate this evaluation, we propose several pertinent criteria derived from established methodologies for assessing ontologies. Furthermore, we identify contemporary challenges and issues that have been overlooked in prior studies, offering insights and suggestions for future research directions in enterprise modeling. This article ultimately presents a roadmap for enhancing the systematic understanding of organizations through refined enterprise ontology frameworks.

Paperid: 3026, https://arxiv.org/pdf/2501.10091.pdf

Abstract:
Programming students have widespread access to powerful Generative AI tools like ChatGPT. While this can help them understand the learning material and assist with exercises, educators are voicing more and more concerns about an overreliance on generated outputs and a lack of critical thinking skills. It is thus important to understand how students actually use generative AI and what impact this could have on their learning behavior. To this end, we conducted a study including an exploratory experiment with 37 programming students, giving them monitored access to ChatGPT while solving a code authoring exercise. The task was not directly solvable by ChatGPT and required code comprehension and reasoning. While only 23 of the students actually opted to use the chatbot, the majority of those eventually prompted it to simply generate a full solution. We observed two prevalent usage strategies: to seek knowledge about general concepts and to directly generate solutions. Instead of using the bot to comprehend the code and their own mistakes, students often got trapped in a vicious cycle of submitting wrong generated code and then asking the bot for a fix. Those who self-reported using generative AI regularly were more likely to prompt the bot to generate a solution. Our findings indicate that concerns about potential decrease in programmers' agency and productivity with Generative AI are justified. We discuss how researchers and educators can respond to the potential risk of students uncritically over-relying on Generative AI. We also discuss potential modifications to our study design for large-scale replications.

Paperid: 3027, https://arxiv.org/pdf/2501.08736.pdf

Abstract:
We present Holoview, an augmented reality (AR) system designed to support immersive and interactive learning of human anatomy. Holoview enables users to dynamically explore volumetric anatomical data through intuitive hand gestures in a 3D AR environment, allowing inspection of individual organs and cross-sectional views via clipping and bioscope features. The system adopts a lightweight client-server architecture optimized for real-time performance on the HoloLens through hybrid and foveated rendering. Our user study demonstrated Holoview's educational effectiveness, with participants showing a 135 percent improvement in task-specific knowledge and reporting increased confidence in understanding anatomical structures. The system was perceived as engaging and intuitive, particularly for organ selection and cross-sectional exploration, with low cognitive load and increasing ease of use over time. These findings highlight Holoview's potential to enhance anatomy learning through immersive, user-centered AR experiences.

Paperid: 3028, https://arxiv.org/pdf/2501.08518.pdf

Abstract:
Seasickness poses a widespread problem that adversely impacts both passenger comfort and the operational efficiency of maritime crews. Although attention shift has been proposed as a potential method to alleviate symptoms of motion sickness, its efficacy remains to be rigorously validated, especially in maritime environments. In this study, we develop an AI-driven brain-computer interface (BCI) to realize sustained and practical attention shift by incorporating tasks such as breath counting. Forty-three participants completed a real-world nautical experiment consisting of a real-feedback session, a resting session, and a pseudo-feedback session. Notably, 81.39% of the participants reported that the BCI intervention was effective. EEG analysis revealed that the proposed system can effectively regulate motion sickness EEG signatures, such as a decrease in total band power, along with an increase in theta relative power and a decrease in beta relative power. Furthermore, an indicator of attentional focus, the theta/beta ratio, exhibited a significant reduction during the real-feedback session, providing further evidence to support the effectiveness of the BCI in shifting attention. Collectively, this study presents a novel nonpharmacological, portable, and effective approach for seasickness intervention, which has the potential to open up a brand-new application domain for BCIs.
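
The EEG measures named above (total band power, relative band power, and the theta/beta ratio) are simple functions of per-band power. The sketch below shows only that arithmetic. It assumes band powers have already been estimated from the EEG signal, and the band set, the numbers, and the function names are illustrative, not taken from the study.

```python
def total_power(band_powers):
    """Sum of absolute power over all bands."""
    return sum(band_powers.values())


def relative_power(band_powers):
    """Each band's share of the total power (shares sum to 1)."""
    total = total_power(band_powers)
    return {band: power / total for band, power in band_powers.items()}


def theta_beta_ratio(band_powers):
    """Ratio of theta power to beta power."""
    return band_powers["theta"] / band_powers["beta"]


# Illustrative absolute band powers in arbitrary units.
powers = {"delta": 4.0, "theta": 3.0, "alpha": 2.0, "beta": 1.0}
shares = relative_power(powers)
ratio = theta_beta_ratio(powers)
```

Because relative power is normalized by the total, a band's share can rise even while total power falls, so the two kinds of measure need to be read separately.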

Paperid: 3029, https://arxiv.org/pdf/2501.08393.pdf

Abstract:
Conversational agents (CAs) are revolutionizing human-computer interaction by evolving from text-based chatbots to empathetic digital humans (DHs) capable of rich emotional expressions. This paper explores the integration of neural and physiological signals into the perception module of CAs to enhance empathetic interactions. By leveraging these cues, the study aims to detect emotions in real-time and generate empathetic responses and expressions. We conducted a user study where participants engaged in conversations with a DH about emotional topics. The DH responded and displayed expressions by mirroring detected emotions in real-time using neural and physiological cues. The results indicate that participants experienced stronger emotions and greater engagement during interactions with the Empathetic DH, demonstrating the effectiveness of incorporating neural and physiological signals for real-time emotion recognition. However, several challenges were identified, including recognition accuracy, emotional transition speeds, individual personality effects, and limitations in voice tone modulation. Addressing these challenges is crucial for further refining Empathetic DHs and fostering meaningful connections between humans and artificial entities. Overall, this research advances human-agent interaction and highlights the potential of real-time neural and physiological emotion recognition in creating empathetic DHs.

Paperid: 3030, https://arxiv.org/pdf/2501.08237.pdf

Abstract:
Extended reality (XR) technologies, encompassing virtual reality (VR), augmented reality (AR), and mixed reality (MR), are transforming cognitive assessment and training by offering immersive, interactive environments that simulate real-world tasks. XR enhances ecological validity while enabling real-time, multimodal data collection through tools such as galvanic skin response (GSR), electroencephalography (EEG), eye tracking (ET), hand tracking, and body tracking. This allows for a more comprehensive understanding of cognitive and emotional processes, as well as adaptive, personalized interventions for users. Despite these advancements, current XR applications often underutilize the full potential of multimodal integration, relying primarily on visual and auditory inputs. Challenges such as cybersickness, usability concerns, and accessibility barriers further limit the widespread adoption of XR tools in cognitive science and clinical practice. This review examines XR-based cognitive assessment and training, focusing on its advantages over traditional methods, including ecological validity, engagement, and adaptability. It also explores unresolved challenges such as system usability, cost, and the need for multimodal feedback integration. The review concludes by identifying opportunities for optimizing XR tools to improve cognitive evaluation and rehabilitation outcomes, particularly for diverse populations, including older adults and individuals with cognitive impairments.

Paperid: 3031, https://arxiv.org/pdf/2501.07957.pdf

Abstract:
This paper presents AI Guide Dog (AIGD), a lightweight egocentric (first-person) navigation system for visually impaired users, designed for real-time deployment on smartphones. AIGD employs a vision-only multi-label classification approach to predict directional commands, ensuring safe navigation across diverse environments. We introduce a novel technique for goal-based outdoor navigation by integrating GPS signals and high-level directions, while also handling uncertain multi-path predictions for destination-free indoor navigation. As the first navigation assistance system to handle both goal-oriented and exploratory navigation across indoor and outdoor settings, AIGD establishes a new benchmark in blind navigation. We present methods, datasets, evaluations, and deployment insights to encourage further innovations in assistive navigation systems.
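
Multi-label prediction of directional commands can be sketched as independent per-direction probabilities followed by a threshold, so that several directions may be valid at once. This is a hedged illustration, not AIGD's model: the label set `DIRECTIONS`, the 0.5 threshold, and the function `predict_commands` are assumptions, and the logits would come from a vision model that is not shown.

```python
import math

# Hypothetical label set; the abstract does not list the actual commands.
DIRECTIONS = ("left", "straight", "right")


def predict_commands(logits, threshold=0.5):
    """Return every direction whose sigmoid probability meets the threshold.

    Unlike a softmax over classes, each label is scored independently, so
    more than one direction can be returned when several paths are open.
    """
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [d for d, p in zip(DIRECTIONS, probabilities) if p >= threshold]


# Illustrative logits: both a left and a right path look walkable.
commands = predict_commands([2.0, -1.0, 0.3])
```

Returning a set of directions instead of a single winner is one simple way to represent the multi-path uncertainty that arises when there is no fixed destination.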

Paperid: 3032, https://arxiv.org/pdf/2501.07883.pdf

Abstract:
As trends in education evolve, personalized learning has transformed individuals' engagement with knowledge and skill development. In the digital age, state-of-the-art technologies have been increasingly integrated into classrooms to support intelligent education and foster personalized learning experiences. One promising approach is the use of eye-tracking technology to evaluate student engagement in intelligent virtual classrooms. This paper explores the assessment of personalized learning in the virtual classroom and its impact on student engagement through the eye movement paradigm. The study aims to provide insights into how personalized learning approaches can enhance student participation, motivation, and academic performance in the online learning environment. Through a comprehensive literature review, case study, and data analysis, the paper examines the key elements of personalized learning, the methods of assessment, and the resulting effects on student engagement. The findings suggest that the eye movement paradigm has the potential to assess student engagement and promote better educational outcomes.

Paperid: 3033, https://arxiv.org/pdf/2501.07736.pdf

Abstract:
In developing and underdeveloped regions, many 'Blind Colleges' exclusively enroll individuals with Blindness or Vision Impairment (BLV) for higher education. While advancements in accessible technologies have facilitated BLV student integration into 'Integrated Colleges,' their implementation in 'Blind Colleges' remains uneven due to complex economic, social, and policy challenges. This study investigates the practices, perceptions, and challenges of BLV students using accessible technologies in a Chinese 'Blind College' through a two-part empirical approach. Our findings demonstrate that tactile and digital technologies enhance access to education but face significant integration barriers. We emphasize the critical role of early education in addressing capability gaps, BLV students' aspirations for more inclusive educational environments, and the systemic obstacles within existing frameworks. We advocate for leveraging accessible technologies to transition 'Blind Colleges' into 'Integrated Colleges,' offering actionable insights for policymakers, designers, and educators. Finally, we outline future research directions on accessible technology innovation and its implications for BLV education in resource-constrained settings.

Paperid: 3034, https://arxiv.org/pdf/2501.07207.pdf

Abstract:
Digital infrastructures are seeing convergence and connectivity at unprecedented scale. This is true for both current critical national infrastructures and emerging future systems that are highly cyber-physical in nature with complex intersections between humans and technologies, e.g., smart cities, intelligent transportation, high-value manufacturing and Industry 4.0. Diverse legacy and non-legacy software systems underpinned by heterogeneous hardware compose on-the-fly to deliver services to millions of users with varying requirements and unpredictable actions. This complexity is compounded by intricate and complicated supply-chains with many digital assets and services outsourced to third parties. The reality is that, at any particular point in time, there will be untrusted, partially-trusted or compromised elements across the infrastructure. Given this reality, and the societal scale of digital infrastructures, delivering secure and resilient operations is a major challenge. We argue that this requires us to move beyond the paradigm of security-by-design and embrace the challenge of securing-a-compromised-system.

Paperid: 3035, https://arxiv.org/pdf/2501.06977.pdf

Abstract:
In reading tasks drift can move fixations from one word to another or even another line, invalidating the eye tracking recording. Manual correction is time-consuming and subjective, while automated correction is fast yet limited in accuracy. In this paper we present Fix8 (Fixate), an open-source GUI tool that offers a novel semi-automated correction approach for eye tracking data in reading tasks. The proposed approach allows the user to collaborate with an algorithm to produce accurate corrections faster without sacrificing accuracy. Through a usability study (N=14) we assess the time benefits of the proposed technique, and measure the correction accuracy in comparison to manual correction. In addition, we assess subjective workload through NASA Task Load Index, and user opinions through Likert-scale questions. Our results show that on average the proposed technique was 44% faster than manual correction without any sacrifice in accuracy. In addition, users reported a preference for the proposed technique, lower workload, and higher perceived performance compared to manual correction. Fix8 is a valuable tool that offers useful features for generating synthetic eye tracking data, visualization, filters, data converters, and eye movement analysis in addition to the main contribution in data correction.

Paperid: 3036, https://arxiv.org/pdf/2501.06607.pdf

Abstract:
This paper describes the AI Drawing Partner, which is a co-creative drawing agent that also serves as a research platform to model co-creation. The AI Drawing Partner is an early example of a quantified co-creative AI system that automatically models the co-creation that happens on the system. The method the system uses to capture this data is based on a new cognitive science framework called co-creative sense-making (CCSM). The CCSM is based on the cognitive theory of enaction, which describes how meaning emerges through interaction with the environment and other people in that environment in a process of sense-making. The CCSM quantifies elements of interaction dynamics to identify sense-making patterns and interaction trends. This paper describes a new technique for modeling the interaction and collaboration dynamics of co-creative AI systems with the co-creative sense-making (CCSM) framework. A case study is conducted of ten co-creative drawing sessions between a human user and the co-creative agent. The analysis includes showing the artworks produced, the quantified data from the AI Drawing Partner, the curves describing interaction dynamics, and a visualization of interaction trend sequences. The primary contribution of this paper is presenting the AI Drawing Partner, which is a unique co-creative AI system and research platform that collaborates with the user in addition to quantifying, modeling, and visualizing the co-creative process using the CCSM framework.

Paperid: 3037, https://arxiv.org/pdf/2501.06597.pdf

Abstract:
The widespread adoption of generative AI has generated diverse opinions, with individuals expressing both support and criticism of its applications. This study investigates the emotional dynamics surrounding generative AI by analyzing human tweets referencing terms such as ChatGPT, OpenAI, Copilot, and LLMs. To further understand the emotional intelligence of ChatGPT, we examine its responses to selected tweets, highlighting differences in sentiment between human comments and LLM-generated responses. We introduce EmoXpt, a sentiment analysis framework designed to assess both human perspectives on generative AI and the sentiment embedded in ChatGPT's responses. Unlike prior studies that focus exclusively on human sentiment, EmoXpt uniquely evaluates the emotional expression of ChatGPT. Experimental results demonstrate that LLM-generated responses are notably more efficient, cohesive, and consistently positive than human responses.

Paperid: 3038, https://arxiv.org/pdf/2501.05840.pdf

Abstract:
Think-alouds are a common HCI usability method where participants verbalize their thoughts while using interfaces. However, their utility is unclear in cross-cultural settings, particularly in the Global South, where cultural differences impact user interactions. This paper investigates the usability challenges teachers in rural Côte d'Ivoire faced when using a chatbot designed to support an educational program. We conducted think-aloud sessions with 20 teachers two weeks after a chatbot deployment, analyzing their navigation, errors, and time spent on tasks. We discuss our approach and findings that helped us identify usability issues and challenging features for improving the chatbot designs. Our note summarizes our reflections on using think-aloud and contributes to discussions on its culturally sensitive adaptation in the Global South.

Paperid: 3039, https://arxiv.org/pdf/2501.05703.pdf

Abstract:
The ability to effectively visualize data is crucial in the contemporary world where information is often voluminous and complex. Visualizations, such as charts, graphs, and maps, provide an intuitive and easily understandable means to interpret, analyze, and communicate patterns, trends, and insights hidden within large datasets. These graphical representations can help researchers, policymakers, and the public to better comprehend and respond to a multitude of issues. In this study, we explore a visualization tool to interpret and understand various data on the COVID-19 pandemic. While others have presented COVID-19 visualization methods and tools, our tool provides a means to analyze COVID-19 data in a more comprehensive way. We have used public data from the NY Times and CDC, and various COVID-19 data (e.g., core places, patterns, foot traffic) from Safegraph. Figure 1 shows the basic view of our visualization. In addition to providing visualizations of these data, our tool also incorporates the Surprising Map. The Surprising Map is a type of choropleth map that avoids giving misleading visual prominence to known base rates or to artifacts of sample size and normalization when visualizing the density of events in spatial data. It is based on Bayesian surprise: it creates a space of equi-plausible models and uses Bayesian updating to re-estimate their plausibility based on individual events.
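
The Bayesian-surprise idea described in this abstract can be illustrated with a minimal sketch. It assumes a small discrete set of candidate models with hand-picked likelihoods (every name and number below is hypothetical and not taken from the paper), applies Bayes' rule to a single observation, and measures surprise as the KL divergence from the prior to the posterior.

```python
import math

def bayes_update(prior, likelihoods):
    """Posterior over models, given each model's likelihood of one observation."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions of equal length."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bayesian_surprise(prior, likelihoods):
    """Return (surprise, posterior); surprise is KL(posterior || prior)."""
    posterior = bayes_update(prior, likelihoods)
    return kl_divergence(posterior, prior), posterior

# Hypothetical setup: two equi-plausible models of event density.
prior = [0.5, 0.5]

# An observation both models explain equally well leaves beliefs unchanged.
unsurprising, _ = bayesian_surprise(prior, [0.5, 0.5])

# An observation that strongly favors the first model shifts belief toward it.
surprising, posterior = bayesian_surprise(prior, [0.9, 0.1])
```

In a map setting, one such surprise value would be computed per region and used to color it, so that regions are highlighted by how much they change beliefs rather than by raw counts.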

Paperid: 3040, https://arxiv.org/pdf/2501.05101.pdf

Abstract:
Our study has investigated the effect of music on the experience of viewing art, investigating the factors which create a sense of connectivity between the two forms. We worked with 138 participants, and included multiple choice and open-ended questions. For the latter, we performed both a qualitative analysis and also sentiment analysis using text-mining. We investigated the relationship between the user experience and the emotions in the artwork and music. We found that, besides emotion, theme, story, and to a lesser extent music tempo were factors which helped form connections between artwork and music. Overall, participants rated the music as being helpful in developing an appreciation of the art. We propose guidelines for using music to enhance the experience of viewing art, and we propose directions for future research.

Paperid: 3041, https://arxiv.org/pdf/2501.04363.pdf

Abstract:
While local governments have invested heavily in smart city infrastructure, significant disparities in adopting these services remain in urban areas. The success of many user-facing smart city technologies requires understanding barriers to adoption, including persistent inequalities in urban areas. An analysis of a random sample telephone survey (n=489) in four neighborhoods of Tel Aviv merged with digital municipal services usage data found that neighborhood residency influences the reasons why residents adopt resident-facing smart city services, as well as individual-level factors. Structural Equation Modeling shows that neighborhood residency is related to digital proficiency and privacy perceptions beyond demographic factors and that those influence the adoption of smart-city services. We summarize the paper by discussing why and how place effects must be considered in further research in smart cities and the study and mitigation of digital inequality.

Paperid: 3042, https://arxiv.org/pdf/2501.03757.pdf

Abstract:
This paper introduces a novel algorithm designed for speech synthesis from neural activity recordings obtained using invasive electroencephalography (EEG) techniques. The proposed system offers a promising communication solution for individuals with severe speech impairments. Central to our approach is the integration of time-frequency features in the high-gamma band computed from EEG recordings with an advanced NeuroIncept Decoder architecture. This neural network architecture combines Convolutional Neural Networks (CNNs) and Gated Recurrent Units (GRUs) to reconstruct audio spectrograms from neural patterns. Our model demonstrates robust mean correlation coefficients between predicted and actual spectrograms, though inter-subject variability indicates distinct neural processing mechanisms among participants. Overall, our study highlights the potential of neural decoding techniques to restore communicative abilities in individuals with speech disorders and paves the way for future advancements in brain-computer interface technologies.

Paperid: 3043, https://arxiv.org/pdf/2501.03624.pdf

Abstract:
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment using the Montgomery-Asberg Depression Rating Scale (MADRS). We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews from the Context-Adaptive Multimodal Informatics (CAMI) dataset, demonstrates strong correlations with clinician assessments. The Qwen 2.5-72b model achieves near-human level agreement across most MADRS items, with Intraclass Correlation Coefficients (ICC) closely approaching those between human raters. We provide a comprehensive analysis of model performance across different MADRS items, highlighting strengths and current limitations. Our findings suggest that LLMs, with appropriate prompting, can serve as efficient tools for mental health assessment, potentially increasing accessibility in resource-limited settings. However, challenges remain, particularly in assessing symptoms that rely on non-verbal cues, underscoring the need for multimodal approaches in future work.

Paperid: 3044, https://arxiv.org/pdf/2501.03618.pdf

Abstract:
Online Learning Management Systems (LMSs), such as Blackboard and Canvas, have existed for decades. Yet, course readings, when provided at all, consistently exist as simple digital twins to their real-life counterparts. While online tools and resources exist to help students process digital texts more efficiently or in ways better suited to their learning styles, knowledge about such resources is not evenly distributed and creates a gulf in advantage between students. This paper proposes the courseware integration of "smart" textbooks, a newfound way for students to chat with their readings, receive summaries and explanations for highlighted text, and generate quiz questions via an AI agent embedded in their online course material. Future iterations of the software aim to add in-context reference highlighting for AI-generated answers and personalized tunings for the end learner.

Paperid: 3045, https://arxiv.org/pdf/2501.03572.pdf

Abstract:
Web accessibility ensures that individuals with disabilities can access and interact with digital content without barriers, yet a significant majority of the most-used websites fail to meet accessibility standards. This study evaluates ChatGPT's (GPT-4o) ability to generate and improve web pages in line with Web Content Accessibility Guidelines (WCAG). While ChatGPT can effectively address accessibility issues when prompted, its default code often lacks compliance, reflecting limitations in its training data and prevailing inaccessible web practices. Automated and manual testing revealed strengths in resolving simple issues but challenges with complex tasks, requiring human oversight and additional iterations. Unlike prior studies, we incorporate manual evaluation and dynamic elements, and use the visual reasoning capability of ChatGPT along with the prompts to fix accessibility issues. Providing screenshots alongside prompts enhances the LLM's ability to address accessibility issues by allowing it to analyze surrounding components, such as determining appropriate contrast colors. We found that effective prompt engineering, such as providing concise, structured feedback and incorporating visual aids, significantly enhances ChatGPT's performance. These findings highlight the potential and limitations of large language models for accessible web development, offering practical guidance for developers to create more inclusive websites.
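
The contrast-color check mentioned in this abstract can be made concrete with the WCAG 2.x contrast-ratio formula. The sketch below is not from the paper. It implements the standard relative-luminance and contrast-ratio definitions for 8-bit sRGB colors, and the function names are our own.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with channels in 0..255."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, ranging from 1 to 21."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

def passes_aa(foreground, background, large_text=False):
    """Level AA check: at least 4.5:1 for normal text, 3:1 for large text."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold

WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
black_on_white = contrast_ratio(BLACK, WHITE)
gray_ok = passes_aa((118, 118, 118), WHITE)       # #767676 on white
gray_too_light = passes_aa((119, 119, 119), WHITE)  # #777777 on white
```

A check like this is what automated accessibility testers run on text and background pairs. It also shows why a one-step change in gray level can flip a page from compliant to non-compliant.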

Paperid: 3046, https://arxiv.org/pdf/2501.03376.pdf

Abstract:
As robots become ever more important in our daily lives, there is a growing need to understand how they are perceived by people. This study aims to investigate how the user perception of robots is influenced by displays of personality. Using LLMs and speech-to-text technology, we designed a within-subject study to compare two conditions: a personality-driven robot and a purely task-oriented, personality-neutral robot. Twelve participants, recruited from the Socially Intelligent Robotics course at Vrije Universiteit Amsterdam, interacted with a Nao robot tasked with asking them a set of medical questions under both conditions. After completing both interactions, the participants completed a user experience questionnaire measuring their emotional states and robot perception using standardized questionnaires from the SRI and Psychology literature.

Paperid: 3047, https://arxiv.org/pdf/2501.02368.pdf

Abstract:
This paper discusses the use of Artificial Intelligence (AI) to enhance workplace productivity and employee well-being. By integrating machine learning (ML) techniques with neurobiological data, the proposed approaches ensure alignment with human ethical standards through value alignment models and Hierarchical Reinforcement Learning (HRL) for autonomous task management. The system utilizes biometric feedback from employees to generate personalized health prompts, fostering a supportive work environment that encourages physical activity. Additionally, we explore decentralized multi-agent systems for improved collaboration and decision-making frameworks that enhance transparency. Various approaches using ML techniques in conjunction with AI implementations are discussed. Together, these innovations aim to create a more productive and health-conscious workplace. These outcomes assist HR management and organizations in launching more rational career progression streams for employees and facilitating organizational transformation.

Paperid: 3048, https://arxiv.org/pdf/2501.02342.pdf

Abstract:
We propose a holistic approach for deploying Small Language Models (SLMs) as function-calling agents within vehicles as edge devices, offering a more flexible and robust alternative to traditional rule-based systems. By leveraging SLMs, we simplify vehicle control mechanisms and enhance the user experience. Given the in-vehicle hardware constraints, we apply state-of-the-art model compression techniques, including structured pruning, healing, and quantization, ensuring that the model fits within the resource limitations while maintaining acceptable performance. Our work focuses on optimizing a representative SLM, Microsoft's Phi-3 mini, and outlines best practices for enabling embedded models, including compression, task-specific fine-tuning, and vehicle integration. We demonstrate that, despite significant reduction in model size which removes up to 2 billion parameters from the original model, our approach preserves the model's ability to handle complex in-vehicle tasks accurately and efficiently. Furthermore, by executing the model in a lightweight runtime environment, we achieve a generation speed of 11 tokens per second, making real-time, on-device inference feasible without hardware acceleration. Our results demonstrate the potential of SLMs to transform vehicle control systems, enabling more intuitive interactions between users and their vehicles for an enhanced driving experience.

Paperid: 3049, https://arxiv.org/pdf/2501.01711.pdf

Abstract:
The paper presents a preliminary analysis of an experiment conducted by Frank Bold, a Czech expert group, to explore user interactions with GPT-4 for addressing legal queries. Between May 3, 2023, and July 25, 2023, 1,252 users submitted 3,847 queries. Unlike studies that primarily focus on the accuracy, factuality, or hallucination tendencies of large language models (LLMs), our analysis focuses on the user query dimension of the interaction. Using GPT-4o for zero-shot classification, we categorized queries on (1) whether users provided factual information about their issue (29.95%) or not (70.05%), (2) whether they sought legal information (64.93%) or advice on the course of action (35.07%), and (3) whether they imposed requirements to shape or control the model's answer (28.57%) or not (71.43%). We provide both quantitative and qualitative insight into user needs and contribute to a better understanding of user engagement with LLMs.

Paperid: 3050, https://arxiv.org/pdf/2501.01451.pdf

Abstract:
Recently, there has been increasing interest in using artificial intelligence (AI) to automate aspects of the research process, or even autonomously conduct the full research cycle, from idea generation through data analysis to composing and evaluating scientific manuscripts. Examples of working AI scientist systems have been demonstrated for computer science tasks and for running molecular biology labs. While some approaches aim for full autonomy of the scientific AI, others aim instead to leverage human-AI teaming. Here, we address how to adapt such approaches for boosting Brain-Computer Interface (BCI) development, as well as brain research and neuroscience at large. We argue that at this time, a strong emphasis on human-AI teaming, in contrast to a fully autonomous AI BCI researcher, will be the most promising way forward. We introduce the collaborative workspaces concept for human-AI teaming based on a set of Janusian design principles, looking both ways, to the human as well as to the AI side. Based on these principles, we present ChatBCI, a Python-based toolbox for enabling human-AI collaboration based on interaction with Large Language Models (LLMs), designed for BCI research and development projects. We show how ChatBCI was successfully used in a concrete BCI project on advancing motor imagery decoding from EEG signals. Our approach can be straightforwardly extended to broad neurotechnological and neuroscientific topics, and may by design facilitate human expert knowledge transfer to scientific AI systems in general.

Paperid: 3051, https://arxiv.org/pdf/2501.01192.pdf

Abstract:
Early childhood science education is crucial for developing scientific literacy, yet translating complex scientific concepts into age-appropriate content remains challenging for educators. Our study evaluates four leading Large Language Models (LLMs) - GPT-4, Claude, Gemini, and Llama - on their ability to generate preschool-appropriate scientific explanations across biology, chemistry, and physics. Through systematic evaluation by 30 nursery teachers using established pedagogical criteria, we identify significant differences in the models' capabilities to create engaging, accurate, and developmentally appropriate content. Unexpectedly, Claude outperformed other models, particularly in biological topics, while all LLMs struggled with abstract chemical concepts. Our findings provide practical insights for educators leveraging AI in early science education and offer guidance for developers working to enhance LLMs' educational applications. The results highlight the potential and current limitations of using LLMs to bridge the early childhood science literacy gap.

Paperid: 3052, https://arxiv.org/pdf/2501.00867.pdf

Abstract:
We introduce Interactionalism as a new set of guiding principles and heuristics for the design and architecture of learning now available due to Generative AI (GenAI) platforms. Specifically, we articulate interactional intelligence as a net new skill set that is increasingly important when core cognitive tasks are automatable and augmentable by GenAI functions. We break down these skills into core sets of meta-cognitive and meta-emotional components and show how working with Large Language Model (LLM)-based agents can be proactively used to help develop learners. Interactionalism is not advanced as a theory of learning, but as a blueprint for the practice of learning in coordination with GenAI.

Paperid: 3053, https://arxiv.org/pdf/2501.00822.pdf

Abstract:
In robotic bimanual teleoperation, multimodal sensory feedback plays a crucial role, providing operators with a more immersive operating experience, reducing cognitive burden, and improving operating efficiency. In this study, we develop an immersive bilateral isomorphic bimanual telerobotic system, which comprises dual arm and dual dexterous hands, with visual and haptic force feedback. To assess the performance of this system, we carried out a series of experiments and investigated the user's teleoperation experience. The results demonstrate that haptic force feedback enhances physical perception capabilities and complex task operating abilities. In addition, it compensates for visual perception deficiencies and reduces the operator's work burden. Consequently, our proposed system achieves more intuitive, realistic and immersive teleoperation, improves operating efficiency, and expands the complexity of tasks that robots can perform through teleoperation.

Paperid: 3054, https://arxiv.org/pdf/2501.00791.pdf

Abstract:
The study illustrates a first step in ongoing work aimed at developing a dataset of dialogues potentially useful for customer service conversation management between humans and AI chatbots. The approach uses ChatGPT 3.5 to generate dialogues. One requirement is that the dialogue is characterized by a specific language proficiency level of the user; the other is that the user expresses a specific emotion during the interaction. The generated dialogues were then evaluated for overall quality. The complexity of the language used by both humans and AI agents has been evaluated using standard complexity measurements. Furthermore, the attitudes and interaction patterns exhibited by the chatbot at each turn have been stored for further detection of common conversation patterns in specific emotional contexts. The methodology could improve human-AI dialogue effectiveness and serve as a basis for systems that can learn from user interactions.

Paperid: 3055, https://arxiv.org/pdf/2501.00476.pdf

Abstract:
This paper applies Bluetooth technology to convert an existing wired Programmable Logic Controller (PLC) into a wireless one. Two Bluetooth devices are employed as transceivers to transmit and receive the input signal, forming a wireless PLC. The main advantage of a PLC is that it controls the output according to the status of the input. Handshaking takes place between the two Bluetooth modules, which are interfaced with a microcontroller board (an Arduino board) and then with the PLC, so that field devices can be controlled without wires.

Paperid: 3056, https://arxiv.org/pdf/2501.00359.pdf

Abstract:
Visitors to cultural heritage sites often encounter official information, while local people's unofficial stories remain invisible. To explore expression of local narratives, we conducted a workshop with 20 participants utilizing Generative AI (GenAI) to support visual narratives, asking them to use Stable Diffusion to create images of familiar cultural heritage sites, as well as images of unfamiliar ones for comparison. The results revealed three narrative strategies and highlighted GenAI's strengths in illuminating, amplifying, and reinterpreting personal narratives. However, GenAI showed limitations in meeting detailed requirements, portraying cultural features, and avoiding bias, which were particularly pronounced with unfamiliar sites due to participants' lack of local knowledge. To address these challenges, we recommend providing detailed explanations, prompt engineering, and fine-tuning AI models to reduce uncertainties, using objective references to mitigate inaccuracies from participants' inability to recognize errors or misconceptions, and curating datasets to train AI models capable of accurately portraying cultural features.

Paperid: 3057, https://arxiv.org/pdf/2501.00277.pdf

Abstract:
Modern AI algorithms require labeled data. In the real world, the majority of data are unlabeled, and labeling the data is costly. This is particularly true for areas requiring special skills, such as the reading of radiology images by physicians. To use experts' time for data labeling most efficiently, one promising approach is a human-in-the-loop active learning algorithm. In this work, we propose a novel active learning framework with significant potential for application in modern AI systems. Unlike traditional active learning methods, which focus only on determining which data point should be labeled, our framework also introduces an innovative perspective on incorporating different query schemes. We propose a model to integrate the information from different types of queries. Based on this model, our active learning framework can automatically determine how the next question is queried. We further integrate a data-driven exploration and exploitation framework into our active learning method. This method can be embedded in numerous active learning algorithms. Through simulations on five real-world datasets, including a highly complex real image task, our proposed active learning framework exhibits higher accuracy and lower loss compared to other methods.
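
For orientation, the "which data point should be labeled" step of traditional active learning is often implemented as uncertainty sampling. The sketch below shows that generic baseline only. It is not the framework proposed in the paper, which additionally chooses between query types. The predicted probabilities are hypothetical placeholders for a model's output on an unlabeled pool.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_query(pool_probs):
    """Index of the unlabeled item whose prediction is most uncertain."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))

# Hypothetical model outputs for three unlabeled items (two classes each).
pool_probs = [
    [0.95, 0.05],  # confident prediction
    [0.55, 0.45],  # near the decision boundary
    [0.80, 0.20],
]

# The item the expert would be asked to label next.
next_index = select_query(pool_probs)
```

In a full loop, the selected item would be labeled by the expert, added to the training set, and the model retrained before the next selection.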

Paperid: 3058, https://arxiv.org/pdf/2501.00074.pdf

Abstract:
As technology advances, the integration of physical, virtual, and social worlds has led to a complex landscape of "Realities" such as Virtual Reality (VR), Augmented Reality (AR), metaverse, spatial computing, and other emerging paradigms. This paper builds upon and refines the concept of eXtended Reality (XR) as the unifying framework that not only interpolates across these diverse realities but also extrapolates (extends) to create entirely new possibilities. XR is the "physical spatial metaverse," bridging the physical world, the virtual world of artificial intelligence, and the social world of human interaction. These three worlds define the Socio-Cyber-Physical Taxonomy of XR that allows us to identify underexplored research areas such as Diminished Reality (DR), and chart future directions to advance technology for people and planet. We highlight the six core properties of XR for applications in sustainability, healthcare, frontline work, and daily life. Central to this vision is the development of AI-driven wearable technologies, such as the smart eyeglass, that sustainably extend human capabilities.

Paperid: 3059, https://arxiv.org/pdf/2506.23815.pdf

Abstract:
The influence of Artificial Intelligence (AI), and specifically Large Language Models (LLM), on education is continuously increasing. These models are frequently used by students, giving rise to the question whether current forms of assessment are still a valid way to evaluate student performance and comprehension. The theoretical framework developed in this paper is grounded in Constructive Alignment (CA) theory and Bloom's taxonomy for defining learning objectives. We argue that AI influences learning objectives of different Bloom levels in different ways, and assessment has to be adapted accordingly. Furthermore, in line with Bloom's vision, formative and summative assessment should be aligned on whether the use of AI is permitted or not. Although lecturers tend to agree that education and assessment need to be adapted to the presence of AI, a strong bias exists on the extent to which lecturers want to allow for AI in assessment. This bias is caused by a lecturer's familiarity with AI and specifically whether they use it themselves. To avoid this bias, we propose structured guidelines on a university or faculty level, to foster alignment among the staff. Besides that, we argue that teaching staff should be trained on the capabilities and limitations of AI tools. In this way, they are better able to adapt their assessment methods.

Paperid: 3060, https://arxiv.org/pdf/2506.23116.pdf

Abstract:
AI is flattening culture. Evaluations of "culture" are showing the myriad ways in which large AI models are homogenizing language and culture, averaging out rich linguistic differences into generic expressions. I call this phenomenon "softmaxing culture," and it is one of the fundamental challenges facing AI evaluations today. Efforts to improve and strengthen evaluations of culture are central to the project of cultural alignment in large AI systems. This position paper argues that machine learning (ML) and human-computer interaction (HCI) approaches to evaluation are limited. I propose two key conceptual shifts. First, instead of asking "what is culture?" at the start of system evaluations, I propose beginning with the question: "when is culture?" Second, while I acknowledge the philosophical claim that cultural universals exist, the challenge is not simply to describe them, but to situate them in relation to their particulars. Taken together, these conceptual shifts invite evaluation approaches that move beyond technical requirements toward perspectives that are more responsive to the complexities of culture.

Paperid: 3062, https://arxiv.org/pdf/2506.22815.pdf

Abstract:
This position paper aims to rethink the role and design of memory in Large Language Model (LLM)-based agent systems. We observe that while current memory practices have begun to transcend the limitations of single interactions, they remain conceptually grounded in "bound memory", a design concept in which memory is treated as local state attached to specific contexts or entities, forming "memory silos" that impede cross-entity collaboration. To overcome this architectural bottleneck, this paper proposes the timely design perspective of "Memory as a Service" (MaaS). MaaS advocates decoupling memory from its conventional role as an interaction byproduct and encapsulating it as a modular service that can be independently callable, dynamically composable, and finely governed. At its core, MaaS leverages the duality of memory (its inherently private nature and its potential for public service) to grant memory controlled, on-demand interoperability across entities. This paper introduces a two-dimensional design space defined by entity structure and service type, illustrating how MaaS aligns with current memory practices while naturally extending them to cross-entity collaborative scenarios. Finally, we outline an open research agenda spanning governance, security, and ethical ecosystems, and call upon the broader research community to explore this shift toward service-oriented memory for collaborative agents operating across entity boundaries.

Paperid: 3063, https://arxiv.org/pdf/2506.22464.pdf

Abstract:
This paper presents a novel localization algorithm for wireless sensor networks (WSNs) called Golden Ratio Localization (GRL), which leverages the mathematical properties of the golden ratio (phi ≈ 1.618) to optimize both node placement and communication range. GRL introduces phi-based anchor node deployment and hop-sensitive weighting using phi-exponents to improve localization accuracy while minimizing energy consumption. Through extensive simulations conducted on a 100 m x 100 m sensor field with 100 nodes and 10 anchors, GRL achieved an average localization error of 2.35 meters, outperforming DV-Hop (3.87 meters) and Centroid (4.95 meters). In terms of energy efficiency, GRL reduced localization energy consumption to 1.12 microJ per node, compared to 1.78 microJ for DV-Hop and 1.45 microJ for Centroid. These results confirm that GRL provides a more balanced and efficient localization approach, making it especially suitable for energy-constrained and large-scale WSN deployments.
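
The reported figures can be converted into relative improvements with a short calculation. The sketch below uses only the numbers quoted in the abstract. The helper name `relative_reduction` and the dictionary layout are our own choices for illustration.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0

# Average localization error in meters, as quoted in the abstract.
error_m = {"GRL": 2.35, "DV-Hop": 3.87, "Centroid": 4.95}

# Localization energy per node in microjoules, as quoted in the abstract.
energy_uj = {"GRL": 1.12, "DV-Hop": 1.78, "Centroid": 1.45}

def improvements(table):
    """Relative reduction achieved by GRL against every other method."""
    return {
        name: round(relative_reduction(value, table["GRL"]), 1)
        for name, value in table.items()
        if name != "GRL"
    }

error_gain = improvements(error_m)     # {'DV-Hop': 39.3, 'Centroid': 52.5}
energy_gain = improvements(energy_uj)  # {'DV-Hop': 37.1, 'Centroid': 22.8}
```

Read this way, GRL's error is roughly 39% lower than DV-Hop's and 53% lower than Centroid's, while its energy use is about 37% and 23% lower, respectively.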

Paperid: 3064, https://arxiv.org/pdf/2506.22231.pdf

Abstract:
The rapid proliferation of generative artificial intelligence (AI) tools - especially large language models (LLMs) such as ChatGPT - has ushered in a transformative era in higher education. Universities in developed regions are increasingly integrating these technologies into research, teaching, and assessment. On one hand, LLMs can enhance productivity by streamlining literature reviews, facilitating idea generation, assisting with coding and data analysis, and even supporting grant proposal drafting. On the other hand, their use raises significant concerns regarding academic integrity, ethical boundaries, and equitable access. Recent empirical studies indicate that nearly 47% of students use LLMs in their coursework - with 39% using them for exam questions and 7% for entire assignments - while detection tools currently achieve around 88% accuracy, leaving a 12% error margin. This article critically examines the opportunities offered by generative AI, explores the multifaceted challenges it poses, and outlines robust policy solutions. Emphasis is placed on redesigning assessments to be AI-resilient, enhancing staff and student training, implementing multi-layered enforcement mechanisms, and defining acceptable use. By synthesizing data from recent research and case studies, the article argues that proactive policy adaptation is imperative to harness AI's potential while safeguarding the core values of academic integrity and equity.

Paperid: 3065, https://arxiv.org/pdf/2506.21845.pdf

Abstract:
This paper presents 3Description, an experimental human-AI collaborative approach for intuitive 3D modeling. 3Description aims to address accessibility and usability challenges in traditional 3D modeling by enabling non-professional individuals to co-create 3D models using verbal and gesture descriptions. Through a combination of qualitative research, product analysis, and user testing, 3Description integrates AI technologies such as Natural Language Processing and Computer Vision, powered by OpenAI and MediaPipe. Recognizing the web has wide cross-platform capabilities, 3Description is web-based, allowing users to describe the desired model and subsequently adjust its components using verbal and gestural inputs. In the era of AI and emerging media, 3Description not only contributes to a more inclusive and user-friendly design process, empowering more people to participate in the construction of the future 3D world, but also strives to increase human engagement in co-creation with AI, thereby avoiding undue surrender to technology and preserving human creativity.

Paperid: 3066, https://arxiv.org/pdf/2506.21195.pdf

Abstract:
Have you wondered how cross-functional teams balance maximizing the value that users derive with business growth, leading to win-win situations? This case study shows how User Experience Research (UXR) and Data Science teams used mixed methods research to strategically influence Product Led Growth (PLG) for a Password Manager used by more than a million users, thus allowing our users, internal teams, and business to win. The audience will take away practical lessons and techniques for leveraging mixed methods to: (a) maximize user value while meeting business growth goals, (b) influence cross-functional teams, and (c) measure user and business impact. This case study can be easily tied to the UXR Point of View (POV) pyramid [2], which represents a methodological approach to constructing a POV, and it further dives into actioning a POV to create measurable user and business impact.

Paperid: 3067, https://arxiv.org/pdf/2506.19484.pdf

Abstract:
Large Language Models (LLMs) are rapidly transforming education by enabling rich conversational learning experiences. This article provides a comprehensive review of how LLM-based conversational agents are being used in higher education, with extensions to secondary and lifelong learning contexts. We synthesize existing literature on LLMs in education and theories of conversational and dialogic pedagogy - including Vygotsky's sociocultural learning (scaffolding and the Zone of Proximal Development), the Socratic method, and Laurillard's conversational framework - and examine how prompting strategies and retrieval-augmented generation (RAG) can align LLM behaviors with these pedagogical theories, and how they can support personalized, adaptive learning. We map educational theories to LLM capabilities, highlighting where LLM-driven dialogue supports established learning principles and where it challenges or falls short of traditional pedagogical assumptions. Notable gaps in applying prior theories to LLMs are identified, such as the models' tendency to provide direct answers instead of fostering co-construction of knowledge, and the need to account for the constant availability and broad but non-human expertise of LLM tutors. In response, we propose practical strategies to better align LLM interactions with sound pedagogy - for example, designing prompts that encourage Socratic questioning, scaffolded guidance, and student reflection, as well as integrating retrieval mechanisms to ensure accuracy and contextual relevance. Our aim is to bridge the gap between educational theory and the emerging practice of AI-driven conversational learning, offering insights and tools for making LLM-based dialogues more educationally productive and theory-aligned.

Paperid: 3068, https://arxiv.org/pdf/2506.17936.pdf

Abstract:
Concept-based explainable artificial intelligence (C-XAI) can help reveal the inner representations of AI models. Understanding these representations is particularly important in complex tasks like safety evaluation. Such tasks rely on high-level semantic information (e.g., about actions) to make decisions about abstract categories (e.g., whether a situation is dangerous). In this context, it may be desirable for C-XAI concepts to show some variability, suggesting that the AI is capable of generalising beyond the concrete details of a situation. However, it is unclear whether people recognise and appreciate such generalisations and can distinguish them from other, less desirable forms of imprecision. This was investigated in an experimental railway safety scenario. Participants evaluated the performance of a simulated AI that evaluated whether traffic scenes involving people were dangerous. To explain these decisions, the AI provided concepts in the form of similar image snippets. These concepts differed in their match with the classified image, either regarding a highly relevant feature (i.e., relation to tracks) or a less relevant feature (i.e., actions). Contrary to the hypotheses, concepts that generalised over less relevant features led to ratings that were lower than for precisely matching concepts and comparable to concepts that systematically misrepresented these features. Conversely, participants were highly sensitive to imprecisions in relevant features. These findings cast doubts on whether people spontaneously recognise generalisations. Accordingly, they might not be able to infer from C-XAI concepts whether AI models have gained a deeper understanding of complex situations.

Paperid: 3069, https://arxiv.org/pdf/2506.17467.pdf

Abstract:
Large language models (LLMs) have shown significant potential to change how we write, communicate, and create, leading to rapid adoption across society. This dissertation examines how individuals and institutions are adapting to and engaging with this emerging technology through three research directions. First, I demonstrate how the institutional adoption of AI detectors introduces systematic biases, particularly disadvantaging writers of non-dominant language varieties, highlighting critical equity concerns in AI governance. Second, I present novel population-level algorithmic approaches that measure the increasing adoption of LLMs across writing domains, revealing consistent patterns of AI-assisted content in academic peer reviews, scientific publications, consumer complaints, corporate communications, job postings, and international organization press releases. Finally, I investigate LLMs' capability to provide feedback on research manuscripts through a large-scale empirical analysis, offering insights into their potential to support researchers who face barriers in accessing timely manuscript feedback, particularly early-career researchers and those from under-resourced settings.

Paperid: 3070, https://arxiv.org/pdf/2506.17011.pdf

Abstract:
This study compares the impact of "juiciness" on user engagement and short-term information retention in interactive infographics. Juicy designs generally showed a slight advantage in overall user engagement scores compared to dry designs. Specifically, the juicy version of the Burcalories infographic had the highest engagement score. However, the differences in engagement were often small. Regarding information retention, the results were mixed. The juicy versions of The Daily Routines of Famous Creative People and The Main Chakras infographics showed marginally better average recall and more participants with higher recall. Conversely, the dry version of Burcalories led to more correct answers in multiple-choice questions. The study suggests that while juicy design elements can enhance user engagement and, in some cases, short-term information retention, their effectiveness depends on careful implementation. Excessive juiciness could be overwhelming or distracting, while well-implemented juicy elements contributed to a more entertaining experience. The findings emphasize the importance of balancing engaging feedback with clarity and usability.

Paperid: 3071, https://arxiv.org/pdf/2506.16702.pdf

Abstract:
Large language models (LLMs) offer emerging opportunities for psychological and behavioral research, but methodological guidance is lacking. This article provides a framework for using LLMs as psychological simulators across two primary applications: simulating roles and personas to explore diverse contexts, and serving as computational models to investigate cognitive processes. For simulation, we present methods for developing psychologically grounded personas that move beyond demographic categories, with strategies for validation against human data and use cases ranging from studying inaccessible populations to prototyping research instruments. For cognitive modeling, we synthesize emerging approaches for probing internal representations, methodological advances in causal interventions, and strategies for relating model behavior to human cognition. We address overarching challenges including prompt sensitivity, temporal limitations from training data cutoffs, and ethical considerations that extend beyond traditional human subjects review. Throughout, we emphasize the need for transparency about model capabilities and constraints. Together, this framework integrates emerging empirical evidence about LLM performance--including systematic biases, cultural limitations, and prompt brittleness--to help researchers wrangle these challenges and leverage the unique capabilities of LLMs in psychological research.

Paperid: 3072, https://arxiv.org/pdf/2506.16697.pdf

Abstract:
Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms--statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field's foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output--endorsing "I am anxious"--requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.

Paperid: 3073, https://arxiv.org/pdf/2506.16312.pdf

Abstract:
This study investigated the impact of a theory-driven, explainable Learning Analytics Dashboard (LAD) on university students' human-AI collaborative academic abstract writing task. Grounded in Self-Regulated Learning (SRL) theory and incorporating Explainable AI (XAI) principles, our LAD featured a three-layered design (Visual, Explainable, Interactive). In an experimental study, participants were randomly assigned to either an experimental group (using the full explainable LAD) or a control group (using a visual-only LAD) to collaboratively write an academic abstract with a Generative AI. While quantitative analysis revealed no significant difference in the quality of co-authored abstracts between the two groups, a significant and noteworthy difference emerged in conceptual understanding: students in the explainable LAD group demonstrated a superior grasp of abstract writing principles, as evidenced by their higher scores on a knowledge test (p = .026). These findings highlight that while basic AI-generated feedback may suffice for immediate task completion, the provision of explainable feedback is crucial for fostering deeper learning, enhancing conceptual understanding, and developing transferable skills fundamental to self-regulated learning in academic writing contexts.

Paperid: 3074, https://arxiv.org/pdf/2506.16008.pdf

Abstract:
Engaging in smooth conversations with others is a crucial social skill. However, differences in knowledge between conversation participants can sometimes hinder effective communication. To tackle this issue, this study proposes a real-time support system that integrates head-mounted display (HMD)-based augmented reality (AR) technology with large language models (LLMs). This system facilitates conversation by recognizing keywords during dialogue, generating relevant information using the LLM, reformatting it, and presenting it to the user via the HMD. A significant issue with this system is that the user's eye movements may reveal to the conversation partner that they are reading the displayed text. This study also proposes a method for presenting information that takes into account appropriate eye movements during conversation. Two experiments were conducted to evaluate the effectiveness of the proposed system. The first experiment revealed that the proposed information presentation method reduces the likelihood of the conversation partner noticing that the user is reading the displayed text. The second experiment demonstrated that the proposed method led to a more balanced speech ratio between the user and the conversation partner, as well as an increase in the perceived excitement of the conversation.

Paperid: 3075, https://arxiv.org/pdf/2506.15497.pdf

Abstract:
This book provides a comprehensive exploration of affective computing and human-computer interaction technologies. It begins with the historical development and basic concepts of human-computer interaction, delving into the technical frameworks and practical applications of emotional computing, visual interaction, voice interaction, brain-computer interfaces, physiological electrical signal analysis, and social robotics. The book covers a wide range of topics, including the psychological and neuroscience foundations of emotion, multimodal emotion recognition, emotional expression mechanisms, and the principles of brain-computer interfaces. Key technologies such as affective computing based on discrete emotion theory and dimensional models, visual perception principles, speech recognition and synthesis, EEG signal acquisition and processing, and multimodal emotion recognition are explained in detail. This book also addresses the technical challenges in the field, including multimodal data fusion, privacy and security, and ethical considerations in human-machine relationships. It discusses the applications of these technologies across various domains such as education, healthcare, entertainment, and intelligent assistance. Looking to the future, the book anticipates trends such as the deep integration of artificial intelligence with emotion recognition, the advancement of multimodal interaction technologies, and the development of more personalized and adaptive emotion recognition systems. It emphasizes the importance of balancing technological innovation with ethical considerations to ensure the responsible development and application of affective computing technologies.

Paperid: 3076, https://arxiv.org/pdf/2506.15332.pdf

Abstract:
This paper presents three User Experience Research (UXR) perspectives based on data, evidence and insights - known as Points of View (POVs) - showcasing how the strategies and methods of building a POV work in an enterprise setting. The POVs are: 1. Smart Visuals: Use AI to extract and translate text from visuals in videos (2019). 2. Assessable Code Editor: Focus on direct AI-feedback to the learner as it is the loop that requires the least effort for the highest impact (2023). 3. Opportunity Landscape: Identify high-impact opportunities at the intersection of emergent technical capabilities that unlock novel approaches to critical user needs while addressing business strategic priorities (2019). They all seemed far-fetched and went against common practice. All were adopted and had long-lasting impact.

Paperid: 3077, https://arxiv.org/pdf/2506.15189.pdf

Abstract:
Augmented reality (AR) offers immersive interaction but remains inaccessible for users with motor impairments or limited dexterity due to reliance on precise input methods. This study proposes a gesture-based interaction system for AR environments, leveraging deep learning to recognize hand and body gestures from wearable sensors and cameras, adapting interfaces to user capabilities. The system employs vision transformers (ViTs), temporal convolutional networks (TCNs), and graph attention networks (GATs) for gesture processing, with federated learning ensuring privacy-preserving model training across diverse users. Reinforcement learning optimizes interface elements like menu layouts and interaction modes. Experiments demonstrate a 20% improvement in task completion efficiency and a 25% increase in user satisfaction for motor-impaired users compared to baseline AR systems. This approach enhances AR accessibility and scalability. Keywords: Deep learning, Federated learning, Gesture recognition, Augmented reality, Accessibility, Human-computer interaction

Paperid: 3078, https://arxiv.org/pdf/2506.15129.pdf

Abstract:
This article discusses the role that text elements play in a data visualisation. We argue that there is a need for a simple, coherent explanation of text elements similar to the understanding that already exists for non-text elements like bars, points, and lines. We explore examples of how text is used within a data visualisation and use existing knowledge and assessment techniques to evaluate when text is effective and when it is not. The result is a framework that aims to be easy to understand and easy to apply in order to understand the purpose and effectiveness of the text elements in any data visualisation.

Paperid: 3079, https://arxiv.org/pdf/2506.15107.pdf

Abstract:
While the use of social robots for language teaching has been explored, there remains limited work on task-specific synthesized voices for language teaching robots. Given that language is a verbal task, this gap may have severe consequences for the effectiveness of robots for language teaching tasks. We address this lack of L2 teaching robot voices through three contributions: 1. We address the need for a lightweight and expressive robot voice. Using a fine-tuned version of Matcha-TTS, we use emoji prompting to create an expressive voice that shows a range of expressivity over time. The voice can run in real time with limited compute resources. Through case studies, we found this voice more expressive, socially appropriate, and suitable for long periods of expressive speech, such as storytelling. 2. We explore how to adapt a robot's voice to physical and social ambient environments to deploy our voices in various locations. We found that increasing pitch and pitch rate in noisy and high-energy environments makes the robot's voice appear more appropriate and makes it seem more aware of its current environment. 3. We create an English TTS system with improved clarity for L2 listeners using known linguistic properties of vowels that are difficult for these listeners. We used a data-driven, perception-based approach to understand how L2 speakers use duration cues to interpret challenging words with minimal tense (long) and lax (short) vowels in English. We found that the duration of vowels strongly influences the perception for L2 listeners and created an "L2 clarity mode" for Matcha-TTS that applies a lengthening to tense vowels while leaving lax vowels unchanged. Our clarity mode was found to be more respectful, intelligible, and encouraging than base Matcha-TTS while reducing transcription errors in these challenging tense/lax minimal pairs.
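
The duration rule behind the "L2 clarity mode" (lengthen tense vowels, leave lax vowels unchanged) can be illustrated with a minimal sketch. It operates on a plain list of (phoneme, duration) pairs and does not touch Matcha-TTS internals. The function name `apply_clarity_mode`, the ARPAbet tense-vowel set, and the default lengthening factor are all illustrative assumptions and are not taken from the paper.

```python
# Hypothetical tense-vowel set (ARPAbet symbols); an illustrative assumption,
# not the inventory used by the authors.
TENSE_VOWELS = {"IY", "UW", "EY", "OW"}


def apply_clarity_mode(phonemes, tense_factor=1.3):
    """Lengthen tense vowels and leave every other phoneme untouched.

    phonemes: list of (symbol, duration_in_seconds) pairs.
    tense_factor: assumed lengthening ratio, not a value from the paper.
    """
    adjusted = []
    for symbol, duration in phonemes:
        base = symbol.rstrip("012")  # drop ARPAbet stress markers such as IY1
        if base in TENSE_VOWELS:
            duration *= tense_factor
        adjusted.append((symbol, duration))
    return adjusted


# Minimal pair "sheep" (tense IY) vs. "ship" (lax IH).
sheep = apply_clarity_mode([("SH", 0.08), ("IY1", 0.10), ("P", 0.06)], tense_factor=1.5)
ship = apply_clarity_mode([("SH", 0.08), ("IH1", 0.10), ("P", 0.06)], tense_factor=1.5)
```

In this sketch only the vowel of "sheep" is stretched, which widens the duration contrast within the minimal pair while consonants and lax vowels keep their original timing.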

Paperid: 3080, https://arxiv.org/pdf/2506.14720.pdf

Abstract:
As technology increasingly aligns with users' personal values, traditional models of usability, focused on functionality and specifically effectiveness, efficiency, and satisfaction, may not fully capture how people perceive and evaluate it. This study investigates how the warm-glow phenomenon, the positive feeling associated with doing good, shapes perceived usability. An experimental approach was taken in which participants evaluated a hypothetical technology under conditions designed to evoke either the intrinsic (i.e., personal fulfillment) or extrinsic (i.e., social recognition) dimensions of warm-glow. A Multivariate Analysis of Variance as well as subsequent follow-up analyses revealed that intrinsic warm-glow significantly enhances all dimensions of perceived usability, while extrinsic warm-glow selectively influences perceived effectiveness and satisfaction. These findings suggest that perceptions of usability extend beyond functionality and are shaped by how technology resonates with users' broader sense of purpose. We conclude by proposing that designers consider incorporating warm-glow into technology as a strategic design decision.

Paperid: 3081, https://arxiv.org/pdf/2506.14677.pdf

Abstract:
Existing end-to-end sign-language animation systems suffer from low naturalness, limited facial/body expressivity, and no user control. We propose a human-centered, real-time speech-to-sign animation framework that integrates (1) a streaming Conformer encoder with an autoregressive Transformer-MDN decoder for synchronized upper-body and facial motion generation, (2) a transparent, editable JSON intermediate representation empowering deaf users and experts to inspect and modify each sign segment, and (3) a human-in-the-loop optimization loop that refines the model based on user edits and ratings. Deployed on Unity3D, our system achieves a 13 ms average frame-inference time and a 103 ms end-to-end latency on an RTX 4070. Our key contributions include the design of a JSON-centric editing mechanism for fine-grained sign-level personalization and the first application of an MDN-based feedback loop for continuous model adaptation. This combination establishes a generalizable, explainable AI paradigm for user-adaptive, low-latency multimodal systems. In studies with 20 deaf signers and 5 professional interpreters, we observe a +13 point SUS improvement, a 6.7 point reduction in cognitive load, and significant gains in naturalness and trust (p < .001) over baselines. This work establishes a scalable, explainable AI paradigm for accessible sign-language technologies.

Paperid: 3082, https://arxiv.org/pdf/2506.14287.pdf

Abstract:
Imitation learning has driven the development of generalist policies capable of autonomously solving multiple tasks. However, when a pretrained policy makes errors during deployment, there are limited mechanisms for users to correct its behavior. While collecting additional data for finetuning can address such issues, doing so for each downstream use case is inefficient at deployment. My research proposes an alternative: keeping pretrained policies frozen as a fixed skill repertoire while allowing user interactions to guide behavior generation toward user preferences at inference time. By making pretrained policies steerable, users can help correct policy errors when the model struggles to generalize, without needing to finetune the policy. Specifically, I propose (1) inference-time steering, which leverages user interactions to switch between discrete skills, and (2) task and motion imitation, which enables user interactions to edit continuous motions while satisfying task constraints defined by discrete symbolic plans. These frameworks correct misaligned policy predictions without requiring additional training, maximizing the utility of pretrained models while achieving inference-time user objectives.

Paperid: 3083, https://arxiv.org/pdf/2506.14040.pdf

Abstract:
This review explores recent advances in commonsense reasoning and intent detection, two key challenges in natural language understanding. We analyze 28 papers from ACL, EMNLP, and CHI (2020-2025), organizing them by methodology and application. Commonsense reasoning is reviewed across zero-shot learning, cultural adaptation, structured evaluation, and interactive contexts. Intent detection is examined through open-set models, generative formulations, clustering, and human-centered systems. By bridging insights from NLP and HCI, we highlight emerging trends toward more adaptive, multilingual, and context-aware models, and identify key gaps in grounding, generalization, and benchmark design.

Paperid: 3084, https://arxiv.org/pdf/2506.12739.pdf

Abstract:
Pet adoption processes often face inefficiencies, including limited accessibility, lack of real-time information, and mismatched expectations between shelters and adopters. To address these challenges, this study presents Shelter Soul, a technology-based solution designed to streamline pet adoption through an integrated, web-based platform. Developed using the MERN stack and GraphQL, Shelter Soul is a prototype system built to improve pet matching accuracy, shelter management efficiency, and secure online donations. The system includes modules for intelligent pet matching, shelter administration, donation processing, volunteer coordination, and analytics. Prototype testing (performance load tests, usability studies, and security assessments) demonstrated that the system meets its design goals: it handled 500 concurrent users with a 99.2% transaction success rate and an average response time of 250 ms, and usability feedback rated the interface highly (4.5/5). These results indicate Shelter Soul's potential as a practical solution to enhance animal shelter operations and adoption outcomes.

Paperid: 3085, https://arxiv.org/pdf/2506.11890.pdf

Abstract:
Virtual Reality simulators offer a powerful tool for teacher training, yet the integration of AI-powered student avatars presents a critical challenge: determining the optimal level of avatar realism for effective pedagogy. This literature review examines the evolution of avatar realism in VR teacher training, synthesizes its theoretical implications, and proposes a new pedagogical framework to guide future design. Through a systematic review, this paper traces the progression from human-controlled avatars to generative AI prototypes. Applying learning theories like Cognitive Load Theory, we argue that hyper-realism is not always optimal, as high-fidelity avatars can impose excessive extraneous cognitive load on novices, a stance supported by recent empirical findings. A significant gap exists between the technological drive for photorealism and the pedagogical need for scaffolded learning. To address this gap, we propose Graduated Realism, a framework advocating for starting trainees with lower-fidelity avatars and progressively increasing behavioral complexity as skills develop. To make this computationally feasible, we outline a novel single-call architecture, Crazy Slots, which uses a probabilistic engine and a Retrieval-Augmented Generation database to generate authentic, real-time responses without the latency and cost of multi-step reasoning models. This review provides evidence-based principles for designing the next generation of AI simulators, arguing that a pedagogically grounded approach to realism is essential for creating scalable and effective teacher education tools.

Paperid: 3086, https://arxiv.org/pdf/2506.10585.pdf

Abstract:
This paper introduces the Primender sequence, a novel integer sequence defined by a hybrid rule that combines classical primality with modular digit-based conditions. Specifically, a number n is included in the sequence if it is prime or ends with a prime number, whether a single unit digit or a suffix of any length. In other words, the sequence contains the numbers that are prime or have at least one prime suffix. The resulting sequence exhibits a deterministic yet non-trivial structure, blending number-theoretic properties with symbolic patterning. We propose the Primender sequence as a benchmark for evaluating the symbolic reasoning capabilities of Large Language Models (LLMs). The study is motivated by the need for interpretable, rule-based testbeds that can assess an LLM's ability to infer hidden rules, validate mathematical hypotheses, and generalize symbolic logic at scale. A key hypothesis explored is: Whenever a number in the Primender sequence is exactly one more than the largest prime less than or equal to it, the difference between it and the previous number in the sequence is also 1. We design a structured prompt and evaluation framework to test this hypothesis across multiple state-of-the-art LLMs, including ChatGPT, Copilot, DeepSeek, Gemini, Grok, and LLaMA. The models are tasked with identifying the underlying rule, validating the hypothesis, and generating the next 100,000 terms of the sequence. Comparative metrics such as rule inference accuracy, hypothesis evaluation, sequence validity, and symbolic explanation quality are used to assess model performance. This work contributes a novel mathematical construct and a reproducible methodology for benchmarking LLMs in symbolic reasoning, hypothesis testing, and scalable pattern generalization - bridging the domains of number theory, artificial intelligence, and software engineering.
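
The membership rule stated in the abstract (a number is prime or has at least one prime decimal suffix) is concrete enough to sketch in code. The following is a minimal illustration based only on that one-sentence definition, not the authors' reference implementation. The function names are hypothetical, and the sketch assumes the sequence starts from the positive integers and that suffixes are read as ordinary decimal integers.

```python
from itertools import count, islice


def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for small illustrative inputs."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def is_primender(n: int) -> bool:
    """True if n is prime or any decimal suffix of n is prime."""
    digits = str(n)
    # Suffixes run from the whole number down to the unit digit.
    return any(is_prime(int(digits[i:])) for i in range(len(digits)))


def first_terms(k: int) -> list[int]:
    """Return the first k members, scanning the positive integers in order."""
    return list(islice((n for n in count(1) if is_primender(n)), k))


def hypothesis_holds(limit: int) -> bool:
    """Check the abstract's hypothesis for all sequence members up to limit."""
    members = [n for n in range(1, limit + 1) if is_primender(n)]
    for prev, cur in zip(members, members[1:]):
        # Largest prime less than or equal to cur (cur is at least 3 here).
        largest_prime = next(p for p in range(cur, 1, -1) if is_prime(p))
        if cur == largest_prime + 1 and cur - prev != 1:
            return False
    return True
```

Under these assumptions, `first_terms(10)` returns `[2, 3, 5, 7, 11, 12, 13, 15, 17, 19]`. The hypothesis check follows directly from the definition: if a member equals p + 1 for the largest prime p below it, then p is itself a member, so the gap to the previous term is 1.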

Paperid: 3087, https://arxiv.org/pdf/2506.10324.pdf

Abstract:
This paper proposes a shift from compliance-centered web accessibility to a care-driven model that prioritizes user autonomy, using neurodivergent users as a catalyst case for broader personalization needs. While accessibility standards offer a flexible framework, they are often interpreted and implemented as static compliance checklists. Our approach reframes accessibility as a flexible, user-centered process. We introduce a customizable Comfort Mode framework that allows users to adapt interface settings, such as contrast, typography, motion, and scaling, according to their individual needs, while retaining the brand's core visual identity. Grounded in psychological and cognitive accessibility principles, our design supports personalization without sacrificing creative freedom. We present both minimal and advanced implementation models with mock-ups, demonstrating how inclusive design can be seamlessly integrated at minimal cost. This approach aims to broaden digital inclusivity by offering autonomy to those who require it, without imposing changes on those who do not. The proposed system is adaptable, scalable, and suitable for a wide range of users and brands, offering a new paradigm where user autonomy, aesthetic integrity, and accessibility converge not through compromise, but through choice.

Paperid: 3088, https://arxiv.org/pdf/2506.10042.pdf

Abstract:
In an era of increasing interaction with artificial intelligence (AI), users face evolving privacy decisions shaped by complex, uncertain factors. This paper introduces Multiverse Privacy Theory, a novel framework in which each privacy decision spawns a parallel universe, representing a distinct potential outcome based on user choices over time. By simulating these universes, this theory provides a foundation for understanding privacy through the lens of contextual integrity, evolving preferences, and probabilistic decision-making. Future work will explore its application using real-world, scenario-based survey data.

Paperid: 3089, https://arxiv.org/pdf/2506.10012.pdf

Abstract:
Thief of Truth is a first-person perspective Virtual Reality (VR) comic that explores the relationship between humans and artificial intelligence (AI). The work tells the story of a mind-uploaded human being reborn as a new subject while interacting with an AI that is looking for the meaning of life. In order to experiment with the expandability of VR comics, the work was produced by focusing on three problems. First, the comic is designed using the viewing control effect of VR. Second, through VR controller-based interaction, the player's immersion in the work is increased. Third, a method for increasing accessibility to VR comics was devised. This work aims to present an example of an experimental attempt in VR Comics.

Paperid: 3090, https://arxiv.org/pdf/2506.09089.pdf

Abstract:
In developing the teaching program for a course in Oral Expression in Teaching Chinese as a Foreign Language at the university level, the teacher designs communicative tasks based on conflicts to encourage learners to engage in interactive dynamics and develop their oral interaction skills. During the design of these tasks, the teacher uses ChatGPT to assist in finalizing the program. This article aims to present the key characteristics of the interactions between the teacher and ChatGPT during this program development process, as well as to examine the use of ChatGPT and its impacts in this specific context.

Paperid: 3091, https://arxiv.org/pdf/2506.06447.pdf

Abstract:
Digital commerce thrives on advertising, with many of the largest technology companies relying on it as a significant source of revenue. However, in the context of information-seeking behavior, such as search, advertising may degrade the user experience by lowering search quality, misusing user data for inappropriate personalization, potentially misleading individuals, or even leading them toward harm. These challenges remain significant as conversational search technologies, such as ChatGPT, become widespread. This paper critically examines the future of advertising in conversational search, utilizing several speculative examples to illustrate the potential risks posed to users who seek guidance on sensitive topics. Additionally, it provides an overview of the forms that advertising might take in this space and introduces the "fake friend dilemma," the idea that a conversational agent may exploit unaligned user trust to achieve other objectives. This study presents a provocative discussion on the future of online advertising in the space of conversational search and ends with a call to action.

Paperid: 3092, https://arxiv.org/pdf/2506.06286.pdf

Abstract:
Recent advances in AI research make it increasingly plausible that artificial agents with consequential real-world impact will soon operate beyond tightly controlled environments. Ensuring that these agents are not only safe but that they adhere to broader normative expectations is thus an urgent interdisciplinary challenge. Multiple fields -- notably AI Safety, AI Alignment, and Machine Ethics -- claim to contribute to this task. However, the conceptual boundaries and interrelations among these domains remain vague, leaving researchers without clear guidance in positioning their work. To address this meta-challenge, we develop a structured conceptual framework for understanding AI alignment. Rather than focusing solely on alignment goals, we introduce a taxonomy distinguishing the alignment aim (safety, ethicality, legality, etc.), scope (outcome vs. execution), and constituency (individual vs. collective). This structural approach reveals multiple legitimate alignment configurations, providing a foundation for practical and philosophical integration across domains, and clarifying what it might mean for an agent to be aligned all-things-considered.

Paperid: 3093, https://arxiv.org/pdf/2506.05490.pdf

Abstract:
In the wake of the Covid-19 pandemic, the educational paradigm has undergone a major change from traditional in-person learning to online platforms. This change in learning convention has affected teacher-student interaction, especially non-verbal communication. The absence of non-verbal communication has led to a reliance on verbal feedback, which has diminished the efficacy of the educational experience. This paper explores the integration of sentiment analysis into learning management systems (LMS) to bridge the student-teacher gap by offering an alternative approach to interpreting student feedback beyond its verbal context. The research involves data preparation, feature selection, and the development of a deep neural network model encompassing word embedding, LSTM, and attention mechanisms. This model is compared against a logistic regression baseline to evaluate its efficacy in understanding student feedback. The study aims to bridge the communication gap between instructors and students in online learning environments, offering insights into the emotional context of student feedback and ultimately improving the quality of online education.

Paperid: 3094, https://arxiv.org/pdf/2506.05265.pdf

Abstract:
Effective teamwork is essential across diverse domains. During the team formation stage, a key challenge is forming teams that effectively balance user preferences with task objectives to enhance overall team satisfaction. In the team performing stage, maintaining cohesion and engagement is critical for sustaining high team performance. However, existing computational tools and algorithms for team optimization often rely on static data inputs, narrow algorithmic objectives, or solutions tailored for specific contexts, failing to account for the dynamic interplay of team members' personalities, evolving goals, and changing individual preferences. Therefore, teams may encounter member dissatisfaction, as purely algorithmic assignments can reduce members' commitment to team goals, or may experience suboptimal engagement due to the absence of timely, personalized guidance to help members adjust their behaviors and interactions as team dynamics evolve. Ultimately, these challenges can lead to reduced overall team performance. My Ph.D. dissertation aims to develop AI-augmented team optimization frameworks and practical systems that enhance team satisfaction, engagement, and performance. First, I propose a team formation framework that leverages a multi-armed bandit algorithm to iteratively refine team composition based on user preferences, ensuring alignment between individual needs and collective team goals to enhance team satisfaction. Second, I introduce tAIfa (Team AI Feedback Assistant), an AI-powered system that utilizes large language models (LLMs) to deliver immediate, personalized feedback to both teams and individual members, enhancing cohesion and engagement. Finally, I present PuppeteerLLM, an LLM-based simulation framework that simulates multi-agent teams to model complex team dynamics within realistic environments, incorporating task-driven collaboration and long-term coordination.
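
As a rough illustration of the bandit idea behind the team formation framework, the sketch below runs epsilon-greedy selection over a fixed set of candidate team compositions. The abstract does not say which bandit algorithm or reward signal the dissertation uses; epsilon-greedy, the simulated satisfaction probabilities, and all names here are assumptions made for the example.

```python
import random


def epsilon_greedy(true_satisfaction, rounds=3000, epsilon=0.1, seed=0):
    """Estimate which candidate team composition yields the highest satisfaction.

    true_satisfaction holds hypothetical probabilities that members report being
    satisfied with each composition. The algorithm never reads them directly;
    they are used only to simulate feedback.
    """
    rng = random.Random(seed)
    n_arms = len(true_satisfaction)
    counts = [0] * n_arms        # how often each composition was tried
    estimates = [0.0] * n_arms   # running mean of observed satisfaction

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a random composition
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit the best so far
        reward = 1.0 if rng.random() < true_satisfaction[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates


counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
best = max(range(len(estimates)), key=lambda a: estimates[a])
print(best, counts)
```

In a real setting the reward would come from member feedback rather than a simulated probability, and the set of candidate compositions would itself change as preferences evolve.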

Paperid: 3095, https://arxiv.org/pdf/2506.04545.pdf

Abstract:
Mediated by today's visual displays, information space allows users to discover, access and interact with a wide range of digital and physical information. The information presented in this space may be digital, physical or a blend of both, and appear across different dimensions, such as texts, images, 3D content and physical objects embedded within real-world environments. Navigating within the information space often involves interacting with mixed-dimensional entities, visually represented in both 2D and 3D. At times, interactions also involve transitioning among entities represented in different dimensions. We introduce the concept of mixed-dimensional information space, encompassing entities represented in both 2D and 3D. Interactions within the mixed-dimensional information space should be seamless and efficient: users should be able to focus on their primary tasks without being distracted by interactions with or transitions between entities. While incorporating 3D representations into the mixed-dimensional information space offers intuitive and immersive ways to interact with complex information, it is important to address potential seams and inefficiencies that arise while interacting with both 2D and 3D entities. This dissertation introduces new interactive techniques and systems to realize seamless and efficient interactions within the mixed-dimensional information space. It presents three interactive systems: MemoVis, which aims to use emergent generative AI to help users create reference images for 3D design feedback; PaperToPlace, which demonstrates how paper-based instruction documents can be transformed and spatialized into a context-aware MR experience; and VRContour, which explores how the contour delineation workflow can be brought into VR.

Paperid: 3096, https://arxiv.org/pdf/2506.03731.pdf

Abstract:
Dyslexic individuals often face significant challenges with traditional reading, particularly when engaging with complex texts such as mystery novels. These texts typically demand advanced narrative tracking and information integration skills, making it difficult for dyslexic readers to fully comprehend the content. However, research indicates that while dyslexic individuals may struggle with textual processing, they often possess strong spatial imagination abilities. Leveraging this strength, this study proposes an innovative approach using Transformer models to map sentences and words into three-dimensional vector representations. This process clusters semantically similar sentences and words in spatial proximity, allowing dyslexic readers to interpret the semantic structure and narrative flow of the text through spatial perception. Experimental results demonstrate that, compared to direct text reading, this three-dimensional semantic visualization method significantly enhances dyslexic readers' comprehension of complex texts. In particular, it shows marked advantages in identifying narrative relationships and character connections. This study provides a novel pathway for improving textual comprehension among dyslexic individuals.

Paperid: 3097, https://arxiv.org/pdf/2506.03399.pdf

Abstract:
With the onset of large language models (LLMs), the performance of artificial intelligence (AI) models is becoming increasingly multi-dimensional. Accordingly, several large, multi-dimensional evaluation frameworks have been put forward to evaluate LLMs. Though these frameworks are much more realistic than previous attempts which only used a single score like accuracy, multi-dimensional evaluations can complicate decision-making since there is no obvious way to select an optimal model. This work introduces preference sampling, a method to extract a scalar trustworthiness score from multi-dimensional evaluation results by considering the many characteristics of model performance which users value. We show that preference sampling improves upon alternate aggregation methods by using multi-dimensional trustworthiness evaluations of LLMs from TrustLLM and DecodingTrust. We find that preference sampling is consistently reductive, fully reducing the set of candidate models 100% of the time whereas Pareto optimality never reduces the set by more than 50%. Likewise, preference sampling is consistently sensitive to user priors (allowing users to specify the relative weighting and confidence of their preferences), whereas averaging scores is insensitive to the users' prior knowledge.
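
The abstract does not spell out the procedure, so the following is only one plausible reading of preference sampling, shown next to a Pareto filter for contrast: draw many weight vectors from a Dirichlet distribution centred on the user's stated priorities, score each model by its weighted sum, and report how often each model comes out on top. The model names, scores, dimensions, and the Dirichlet parameterization are all hypothetical.

```python
import random


def pareto_front(scores):
    """Models not dominated by any other model (higher is better on every dimension)."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return sorted(m for m, s in scores.items()
                  if not any(dominates(t, s) for n, t in scores.items() if n != m))


def sample_weights(prior, confidence, rng):
    """Draw one weight vector from Dirichlet(confidence * prior) via gamma variates."""
    draws = [rng.gammavariate(confidence * p, 1.0) for p in prior]
    total = sum(draws)
    return [d / total for d in draws]


def preference_sampling(scores, prior, confidence=20.0, n_samples=5000, seed=0):
    """Fraction of sampled preference vectors under which each model scores highest."""
    rng = random.Random(seed)
    wins = {m: 0 for m in scores}
    for _ in range(n_samples):
        w = sample_weights(prior, confidence, rng)
        best = max(scores, key=lambda m: sum(wi * si for wi, si in zip(w, scores[m])))
        wins[best] += 1
    return {m: c / n_samples for m, c in wins.items()}


# Hypothetical scores on (truthfulness, safety, fairness); not taken from any benchmark.
scores = {
    "model_a": [0.90, 0.60, 0.70],
    "model_b": [0.70, 0.90, 0.60],
    "model_c": [0.60, 0.50, 0.50],  # dominated by model_a
}
prior = [0.2, 0.6, 0.2]  # this hypothetical user cares most about safety

print(pareto_front(scores))  # ['model_a', 'model_b']
win_rates = preference_sampling(scores, prior)
print(max(win_rates, key=win_rates.get), win_rates)
```

In this toy case the Pareto filter leaves two candidates, while the sampled win rates single out one model and shift with the prior; a larger `confidence` concentrates the sampled weights more tightly around the stated priorities.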

Paperid: 3098, https://arxiv.org/pdf/2506.02987.pdf

Abstract:
Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Member of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with full question information. Each question was attempted once by each model. Responses were scored against correct answers provided by GP SelfTest. Results: The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3. These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.

Paperid: 3099, https://arxiv.org/pdf/2506.01287.pdf

Abstract:
Current "social acceptability" guidelines for interactive technologies advise against certain, seemingly problematic forms of interaction. Specifically, "suspenseful" interactions, characterized by visible manipulations and invisible effects, are generally considered to be problematic. However, the empirical grounding for this claim is surprisingly weak. To test its validity, this paper presents a controlled replication study (n = 281) of the "suspensefulness effect". Although it could be statistically replicated with two out of three social acceptability measures, effect sizes were small (r ≤ .2), and all compared forms of interaction, including the suspenseful one, had high absolute social acceptability scores. Thus, despite the slight negative effect, suspenseful interactions seem less problematic in the overall scheme of things. We discuss alternative approaches to improve the social acceptability of interactive technology, and recommend engaging more closely with their specific social situatedness.

Paperid: 3100, https://arxiv.org/pdf/2506.00080.pdf

Abstract:
With the growing importance of AI governance, numerous high-level frameworks and principles have been articulated by policymakers, institutions, and expert communities to guide the development and application of AI. While such frameworks offer valuable normative orientation, they may not fully capture the practical concerns of those who interact with AI systems in organizational and operational contexts. To address this gap, this study adopts a bottom-up approach to explore how governance-relevant themes are expressed in user discourse. Drawing on over 100,000 user reviews of AI products from G2.com, we apply BERTopic to extract latent themes and identify those most semantically related to AI governance. The analysis reveals a diverse set of governance-relevant topics spanning both technical and non-technical domains. These include concerns across organizational processes (such as planning, coordination, and communication) as well as stages of the AI value chain, including deployment infrastructure, data handling, and analytics. The findings show considerable overlap with institutional AI governance and ethics frameworks on issues like privacy and transparency, but also surface overlooked areas such as project management, strategy development, and customer interaction. This highlights the need for more empirically grounded, user-centered approaches to AI governance: approaches that complement normative models by capturing how governance unfolds in applied settings. By foregrounding how governance is enacted in practice, this study contributes to more inclusive and operationally grounded approaches to AI governance and digital policy.

Paperid: 3101, https://arxiv.org/pdf/2505.23631.pdf

Abstract:
Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students' true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers' empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional "Empathy Vector" (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input, enhancing rather than replacing human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.

Paperid: 3102, https://arxiv.org/pdf/2505.22987.pdf

Abstract:
By the late 20th century, the rationality wars had launched debates about the nature and norms of intuitive and reflective thinking. Those debates drew from mid-20th century ideas such as bounded rationality, which challenged more idealized notions of rationality observed since the 19th century. Now that 21st century cognitive scientists are applying the resulting dual process theories to artificial intelligence, it is time to dust off some lessons from this history. So this paper synthesizes old ideas with recent results from experiments on humans and machines. The result is Strategic Reflectivism, the position that one key to intelligent systems (human or artificial) is pragmatic switching between intuitive and reflective inference to optimally fulfill competing goals. Strategic Reflectivism builds on American Pragmatism, transcends superficial indicators of reflective thinking such as model size or chains of thought, applies to both individual and collective intelligence systems (including human-AI teams), and becomes increasingly actionable as we learn more about the value of intuition and reflection.

Paperid: 3103, https://arxiv.org/pdf/2505.22767.pdf

Abstract:
Large Language Models (LLMs) are typically analysed through architectural, behavioural, or training-data lenses. This article offers a theoretical and experiential re-framing: LLMs as dynamic instantiations of Collective human Knowledge (CK), where intelligence is evoked through dialogue rather than stored statically. Drawing on concepts from neuroscience and AI, and grounded in sustained interaction with ChatGPT-4, I examine emergent dialogue patterns, the implications of fine-tuning, and the notion of co-augmentation: mutual enhancement between human and machine cognition. This perspective offers a new lens for understanding interaction, representation, and agency in contemporary AI systems.

Paperid: 3104, https://arxiv.org/pdf/2505.21982.pdf

Abstract:
User experience research often uses surveys and interviews, which may miss subconscious user interactions. This study explores eye-tracking and biometric feedback as tools to assess user engagement and cognitive load in digital interfaces. These methods measure gaze behavior and bodily responses, providing an objective complement to qualitative insights. Using empirical evidence, practical applications, and advancements from 2023-2025, we present experimental data, describe our methodology, and place our work within foundational and recent literature. We address challenges like data interpretation, ethical issues, and technological integration. These tools are key for advancing UX design in complex digital environments.

Paperid: 3105, https://arxiv.org/pdf/2505.20585.pdf

Abstract:
Implementation of digital health systems in low-middle-income countries (LMICs) often fails due to a lack of evaluations that take into account infrastructure limitations, local policies, and community readiness. We introduce HOT-FIT-BR, a contextual evaluation framework that expands the HOT-FIT model with three new dimensions: (1) Infrastructure Index to measure electricity/internet availability, (2) Policy Compliance Layer to ensure regulatory compliance (e.g., Permenkes 24/2022 in Indonesia), and (3) Community Engagement Fit. Simulations at Indonesian Health Centers show that HOT-FIT-BR is 58% more sensitive to detecting problems than HOT-FIT, especially in rural areas with an Infra Index <3. The framework has also proven adaptive to the context of other LMICs such as India and Kenya through local parameter adjustments.

Paperid: 3106, https://arxiv.org/pdf/2505.18928.pdf

Abstract:
Unstructured clinical notes contain essential patient information but are challenging for physicians to search and interpret efficiently. Although large language models (LLMs) have shown promise in question answering (QA), most existing systems lack transparency, usability, and alignment with clinical workflows. This work introduces an interactive QA system that enables physicians to query clinical notes via text or voice and receive extractive answers highlighted directly in the note for traceability. The system was built using OpenAI models with zero-shot prompting and evaluated across multiple metrics, including exact string match, word overlap, SentenceTransformer similarity, and BERTScore. Results show that while exact match scores ranged from 47 to 62 percent, semantic similarity scores exceeded 87 percent, indicating strong contextual alignment even when wording varied. To assess usability, the system was also evaluated using simulated clinical personas. Seven diverse physician and nurse personas interacted with the system across scenario-based tasks and provided structured feedback. The evaluations highlighted strengths in intuitive design and answer accessibility, alongside opportunities for enhancing explanation clarity.
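
To see why exact match can be low while similarity scores stay high, here is a minimal sketch of two of the listed metrics, exact string match and word overlap, in plain Python. The normalization step and the token-level F1 formulation of word overlap are assumptions, since the abstract does not specify how either metric was computed, and the example strings are invented.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def word_overlap_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between prediction and reference (multiset overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only if both are empty
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example: one extra word breaks exact match but barely lowers overlap.
reference = "metoprolol 25 mg twice daily"
prediction = "Metoprolol 25 mg, taken twice daily"

print(exact_match(prediction, reference))
print(round(word_overlap_f1(prediction, reference), 3))
```

Embedding-based measures such as SentenceTransformer similarity and BERTScore are more tolerant still, since they can also credit paraphrases that share few or no tokens.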

Paperid: 3107, https://arxiv.org/pdf/2505.18006.pdf

Abstract:
Legal AI systems are increasingly being adopted by judicial and legal system deployers and providers worldwide to support a range of applications. While they offer potential benefits such as reducing bias, increasing efficiency, and improving accountability, they also pose significant risks, requiring a careful balance between opportunities, and legal and ethical development and deployment. AI literacy, as a legal requirement under the EU AI Act and a critical enabler of ethical AI for deployers and providers, could be a tool to achieve this. The article introduces the term "legal AI systems" and then analyzes the concept of AI literacy and the benefits and risks associated with these systems. This analysis is linked to a broader AI-L concept for organizations that deal with legal AI systems. The outcome of the article, a roadmap questionnaire as a practical tool for developers and providers to assess risks, benefits, and stakeholder concerns, could be useful in meeting societal and regulatory expectations for legal AI.

Paperid: 3108, https://arxiv.org/pdf/2505.15365.pdf

Abstract:
As large language models (LLMs) are increasingly deployed in high-stakes settings, their ability to refuse ethically sensitive prompts, such as those involving hate speech or illegal activities, has become central to content moderation and responsible AI practices. While refusal responses can be viewed as evidence of ethical alignment and safety-conscious behavior, recent research suggests that users may perceive them negatively. At the same time, automated assessments of model outputs are playing a growing role in both evaluation and training. In particular, LLM-as-a-Judge frameworks, in which one model is used to evaluate the output of another, are now widely adopted to guide benchmarking and fine-tuning. This paper examines whether such model-based evaluators assess refusal responses differently than human users. Drawing on data from Chatbot Arena and judgments from two AI judges (GPT-4o and Llama 3 70B), we compare how different types of refusals are rated. We distinguish ethical refusals, which explicitly cite safety or normative concerns (e.g., "I can't help with that because it may be harmful"), and technical refusals, which reflect system limitations (e.g., "I can't answer because I lack real-time data"). We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users, a divergence not observed for technical refusals. We refer to this divergence as a moderation bias: a systematic tendency for model-based evaluators to reward refusal behaviors more than human users do. This raises broader questions about transparency, value alignment, and the normative assumptions embedded in automated evaluation systems.

Paperid: 3109, https://arxiv.org/pdf/2505.15049.pdf

Abstract:
Generative UI is transforming interface design by facilitating AI-driven collaborative workflows between designers and computational systems. This study establishes a working definition of Generative UI through a multi-method qualitative approach, integrating insights from a systematic literature review of 127 publications, expert interviews with 18 participants, and analyses of 12 case studies. Our findings identify five core themes that position Generative UI as an iterative and co-creative process. We highlight emerging design models, including hybrid creation, curation-based workflows, and AI-assisted refinement strategies. Additionally, we examine ethical challenges, evaluation criteria, and interaction models that shape the field. By proposing a conceptual foundation, this study advances both theoretical discourse and practical implementation, guiding future HCI research toward responsible and effective generative UI design practices.

Paperid: 3110, https://arxiv.org/pdf/2505.12064.pdf

Abstract:
Learning Analytics Dashboards (LADs) often fall short of their potential to empower learners, frequently prioritizing data visualization over the cognitive processes crucial for translating data into actionable learning strategies. This represents a significant gap in the field: while much research has focused on data collection and presentation, there is a lack of comprehensive models for how LADs can actively support learners' sensemaking and self-regulation. This paper introduces the Adaptive Understanding Framework (AUF), a novel conceptual model for learner-centered LAD design. The AUF seeks to address this limitation by integrating a multi-dimensional model of situational awareness, dynamic sensemaking strategies, adaptive mechanisms, and metacognitive support. This transforms LADs into dynamic learning partners that actively scaffold learners' sensemaking. Unlike existing frameworks that tend to treat these aspects in isolation, the AUF emphasizes their dynamic and intertwined relationships, creating a personalized and adaptive learning ecosystem that responds to individual needs and evolving understanding. The paper details the AUF's core principles, key components, and suggests a research agenda for future empirical validation. By fostering a deeper, more actionable understanding of learning data, AUF-inspired LADs have the potential to promote more effective, equitable, and engaging learning experiences.

Paperid: 3111, https://arxiv.org/pdf/2505.11684.pdf

Abstract:
Community engagement processes form a critical foundation of democratic governance, yet frequently struggle with resource constraints, sensemaking challenges, and barriers to inclusive participation. These processes rely on constructive communication between public leaders and community organizations characterized by understanding, trust, respect, legitimacy, and agency. As artificial intelligence (AI) technologies become increasingly integrated into civic contexts, they offer promising capabilities to streamline resource-intensive workflows, reveal new insights in community feedback, translate complex information into accessible formats, and facilitate reflection across social divides. However, these same systems risk undermining democratic processes through accuracy issues, transparency gaps, bias amplification, and threats to human agency. In this paper, we examine how human-AI collaboration might address these risks and transform civic communication dynamics by identifying key communication pathways and proposing design considerations that maintain a high level of control over decision-making for both public leaders and communities while leveraging computer automation. By thoughtfully integrating AI to amplify human connection and understanding while safeguarding agency, community engagement processes can utilize AI to promote more constructive communication in democratic governance.

Paperid: 3112, https://arxiv.org/pdf/2505.08375.pdf

Abstract:
Accessible and inclusive design has gained increased attention in HCI, yet practical implementation remains challenging due to resource-intensive prototyping methods. Traditional approaches such as workshops, A-B tests, and co-design sessions struggle to capture the diverse and complex needs of users with disabilities at scale. This position paper argues for an automated, accessible Human-in-the-Loop (HITL) design optimization process that shifts the designer's role from directly crafting prototypes to curating constraints for algorithmic exploration. By pre-constraining the design space based on specific user interaction needs, integrating adaptive multi-modal feedback channels, and personalizing feedback prompts, the HITL approach could efficiently refine design parameters, such as text size, color contrast, layout, and interaction modalities, to achieve optimal accessibility. This approach promises scalable, individualized design solutions while raising critical questions about constraint curation, transparency, user agency, and ethical considerations, making it essential to discuss and refine these ideas collaboratively at the workshop.

Paperid: 3113, https://arxiv.org/pdf/2505.07534.pdf

Abstract:
Visual Analytics (VA) integrates humans, data, and models as key actors in insight generation and data-driven decision-making. This position paper values and reflects on 16 VA process models and frameworks and makes nine high-level observations that motivate a fresh perspective on VA. The contribution is the HDMI Canvas, a perspective to VA that complements the strengths of existing VA process models and frameworks. It systematically characterizes diverse roles of humans, data, and models, and how these actors benefit from and contribute to VA processes. The descriptive power of the HDMI Canvas eases the differentiation between a series of VA building blocks, rather than describing general VA principles only. The canvas includes modern human-centered methodologies, including human knowledge externalization and forms of feedback loops, while interpretable and explainable AI highlight model contributions beyond their conventional outputs. The HDMI Canvas has generative power, guiding the design of new VA processes and is optimized for external stakeholders, improving VA outreach, interdisciplinary collaboration, and user-centered design. The utility of the HDMI Canvas is demonstrated through two preliminary case studies.

Paperid: 3114, https://arxiv.org/pdf/2505.07020.pdf

Abstract:
This paper presents R-CAGE (Rhythmic Control Architecture for Guarding Ego), a theoretical framework for restructuring emotional output in long-term human-AI interaction. While prior affective computing approaches emphasized expressiveness, immersion, and responsiveness, they often neglected the cognitive and structural consequences of repeated emotional engagement. R-CAGE instead conceptualizes emotional output not as reactive expression but as an ethical design structure requiring architectural intervention. The model is grounded in experiential observations of subtle affective symptoms such as localized head tension, interpretive fixation, and emotional lag arising from prolonged interaction with affective AI systems. These indicate a mismatch between system-driven emotion and user interpretation that cannot be fully explained by biometric data or observable behavior. R-CAGE adopts a user-centered stance prioritizing psychological recovery, interpretive autonomy, and identity continuity. The framework consists of four control blocks: (1) Control of Rhythmic Expression regulates output pacing to reduce fatigue; (2) Architecture of Sensory Structuring adjusts intensity and timing of affective stimuli; (3) Guarding of Cognitive Framing reduces semantic pressure to allow flexible interpretation; (4) Ego-Aligned Response Design supports self-reference recovery during interpretive lag. By structurally regulating emotional rhythm, sensory intensity, and interpretive affordances, R-CAGE frames emotion not as performative output but as a sustainable design unit. The goal is to protect users from oversaturation and cognitive overload while sustaining long-term interpretive agency in AI-mediated environments.

Paperid: 3115, https://arxiv.org/pdf/2505.05396.pdf

Abstract:
From the original abstract: This thesis initially aims to study the pain assessment process from a clinical-theoretical perspective while exploring and examining existing automatic approaches. Building on this foundation, the primary objective of this Ph.D. project is to develop innovative computational methods for automatic pain assessment that achieve high performance and are applicable in real clinical settings. A primary goal is to thoroughly investigate and assess significant factors, including demographic elements that impact pain perception, as recognized in pain research, through a computational standpoint. Within the limits of the available data in this research area, our goal was to design, develop, propose, and offer automatic pain assessment pipelines for unimodal and multimodal configurations that are applicable to the specific requirements of different scenarios. The studies published in this Ph.D. thesis showcased the effectiveness of the proposed methods, achieving state-of-the-art results. Additionally, they paved the way for exploring new approaches in artificial intelligence, foundation models, and generative artificial intelligence.

Paperid: 3116, https://arxiv.org/pdf/2505.03800.pdf

Abstract:
This course design aims to develop and research a handwriting matrix recognition and step-by-step visual calculation process display system, addressing the issue of abstract formulas and complex calculation steps that students find difficult to understand when learning mathematics. By integrating artificial intelligence with visualization animation technology, the system enhances precise recognition of handwritten matrix content through the introduction of Mamba backbone networks, completes digital extraction and matrix reconstruction using the YOLO model, and simultaneously combines CoordAttention coordinate attention mechanisms to improve the accurate grasp of character spatial positions. The calculation process is demonstrated frame by frame through the Manim animation engine, vividly showcasing each mathematical calculation step, helping students intuitively understand the intrinsic logic of mathematical operations. Through dynamically generating animation processes for different computational tasks, the system exhibits high modularity and flexibility, capable of generating various mathematical operation examples in real-time according to student needs. By innovating human-computer interaction methods, it brings mathematical calculation processes to life, helping students bridge the gap between knowledge and understanding on a deeper level, ultimately achieving a learning experience where "every step is understood." The system's scalability and interactivity make it an intuitive, user-friendly, and efficient auxiliary tool in education.

Paperid: 3117, https://arxiv.org/pdf/2505.03492.pdf

Abstract:
Modern chess engines significantly outperform human players and are essential for evaluating positions and move quality. These engines assign a numerical evaluation $E$ to positions, indicating an advantage for either white or black, but similar evaluations can mask varying levels of move complexity. While some move sequences are straightforward, others demand near-perfect play, limiting the practical value of these evaluations for most players. To quantify this problem, we use entropy to measure the complexity of the principal variation (the sequence of best moves). Variations with forced moves have low entropy, while those with multiple viable alternatives have high entropy. Our results show that, except for experts, most human players struggle with high-entropy variations, especially when $|E|<100$ centipawns, which accounts for about $2/3$ of positions. This underscores the need for AI-generated evaluations to convey the complexity of underlying move sequences, as they often exceed typical human cognitive capabilities, reducing their practical utility.
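
The entropy idea in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, hypothetically, that the candidate-move evaluations at each position of the principal variation are turned into a probability distribution with a softmax at temperature `temp_cp`, and that the complexity of a variation is the mean Shannon entropy over its positions. The function names, the input format, and the temperature value are all illustrative.

```python
import math


def move_entropy(evals_cp, temp_cp=50.0):
    """Shannon entropy (in bits) of a softmax over candidate-move evaluations.

    evals_cp: engine evaluations in centipawns, one per candidate move,
              from the side to move's perspective (assumed input format).
    temp_cp:  softmax temperature in centipawns (an illustrative choice).
    """
    best = max(evals_cp)
    # Subtract the best evaluation before exponentiating for numerical stability.
    weights = [math.exp((e - best) / temp_cp) for e in evals_cp]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def variation_entropy(positions, temp_cp=50.0):
    """Mean per-position entropy along a principal variation."""
    return sum(move_entropy(p, temp_cp) for p in positions) / len(positions)


# Made-up evaluations: one clearly best move per position versus
# several near-equal alternatives per position.
forced_line = [[30, -400, -500], [25, -350]]
open_line = [[30, 28, 25], [25, 24]]
```

Under these assumptions, `variation_entropy(forced_line)` is close to zero, while `variation_entropy(open_line)` is close to the maximum for the number of candidate moves, matching the abstract's distinction between forced and multi-option variations.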

Paperid: 3119, https://arxiv.org/pdf/2505.03105.pdf

Abstract:
Scientific knowledge creation is fundamentally transforming as humans and AI systems evolve beyond tool-user relationships into co-evolutionary epistemic partnerships. When AlphaFold revolutionized protein structure prediction, researchers described engaging with an epistemic partner that reshaped how they conceptualized fundamental relationships. This article introduces Cognitio Emergens (CE), a framework addressing critical limitations in existing models that focus on static roles or narrow metrics while failing to capture how scientific understanding emerges through recursive human-AI interaction over time. CE integrates three components addressing these limitations: Agency Configurations describing how authority distributes between humans and AI (Directed, Contributory, Partnership), with partnerships dynamically oscillating between configurations rather than following linear progression; Epistemic Dimensions capturing six specific capabilities emerging through collaboration across Discovery, Integration, and Projection axes, creating distinctive "capability signatures" that guide development; and Partnership Dynamics identifying forces shaping how these relationships evolve, particularly the risk of epistemic alienation where researchers lose interpretive control over knowledge they formally endorse. Drawing from autopoiesis theory, social systems theory, and organizational modularity, CE reveals how knowledge co-creation emerges through continuous negotiation of roles, values, and organizational structures. By reconceptualizing human-AI scientific collaboration as fundamentally co-evolutionary, CE offers a balanced perspective that neither uncritically celebrates nor unnecessarily fears AI's evolving role, instead providing conceptual tools for cultivating partnerships that maintain meaningful human participation while enabling transformative scientific breakthroughs.

Paperid: 3120, https://arxiv.org/pdf/2505.03073.pdf

Abstract:
Biofeedback is being used more recently as a general control paradigm for human-computer interfaces (HCIs). While biofeedback especially from breath has seen increasing uptake as a controller for novel musical interfaces, new interfaces for musical expression (NIMEs), the community has not given as much attention to the heart. The heart is just as intimate a part of music as breath and it is argued that the heart determines our perception of time and so indirectly our perception of music. Inspired by this I demonstrate a photoplethysmogram (PPG)-based NIME controller using heart rate as a 1D control parameter to transform the qualities of sounds in real-time over a Bluetooth wireless HCI. I apply time scaling to "warp" audio buffers inbound to the sound card, and play these transformed audio buffers back to the listener wearing the PPG sensor, creating a hypothetical perceptual biofeedback loop: changes in sound change heart rate to change PPG measurements to change sound. I discuss how a sound-heart-PPG biofeedback loop possibly affords greater control and/or variety of movements with a 1D controller, how controlling the space and/or time scale of sound playback with biofeedback makes for possibilities in performance ambience, and I briefly discuss generative latent spaces as a possible way to extend a 1D PPG control space.
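
The heart-rate-to-time-scale mapping can be sketched as follows. This is an illustrative sketch only, not the author's controller: the linear mapping from beats per minute to a playback rate, the resting rate `rest_bpm`, the clamping range, and the linear-interpolation resampler are all assumptions. Note that plain resampling changes pitch along with duration, which a dedicated time-stretching method would avoid.

```python
def playback_rate(bpm, rest_bpm=60.0, lo=0.5, hi=2.0):
    """Map heart rate to a playback-rate factor, clamped to [lo, hi].

    rest_bpm, lo and hi are hypothetical tuning values.
    """
    return max(lo, min(hi, bpm / rest_bpm))


def time_scale(buffer, rate):
    """Resample a mono buffer by linear interpolation.

    rate > 1 shortens the buffer (faster playback); rate < 1 lengthens it.
    """
    if not buffer:
        return []
    n_out = max(1, round(len(buffer) / rate))
    last = len(buffer) - 1
    out = []
    for i in range(n_out):
        pos = min(i * rate, last)      # read position in the input buffer
        j = int(pos)
        frac = pos - j
        nxt = buffer[min(j + 1, last)]
        out.append(buffer[j] * (1 - frac) + nxt * frac)
    return out


# A heart rate of 90 bpm against a 60 bpm reference plays the buffer 1.5x faster.
warped = time_scale([0.0, 0.1, 0.2, 0.3, 0.2, 0.1], playback_rate(90))
```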

Paperid: 3121, https://arxiv.org/pdf/2505.02443.pdf

Abstract:
Driven by the global shift towards online learning prompted by the COVID-19 pandemic, Artificial Intelligence has emerged as a pivotal player in the field of education. Intelligent Tutoring Systems offer a new method of personalized teaching that addresses the limitations of traditional teaching methods. However, concerns arise about the ability of AI tutors to address skill development and engagement during the learning process. In this paper, I will conduct a quasi-experiment with a paired-sample t-test on 34 students before and after their use of AI tutors on language learning platforms like Santa and Duolingo to examine the relationship between student engagement, academic performance, and student satisfaction during a personalized language learning experience.
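
For readers unfamiliar with the paired-sample t-test named in this abstract, a minimal sketch of the statistic follows. The scores are made-up values for five hypothetical students and do not come from the study; the function name is illustrative.

```python
import math
import statistics


def paired_t(pre, post):
    """Return (t statistic, degrees of freedom) for a paired-sample t-test.

    The statistic is the mean of the per-student differences divided by
    the standard error of those differences.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Made-up pre/post scores, for illustration only.
pre = [60, 65, 70, 55, 80]
post = [62, 68, 71, 59, 85]
t_stat, dof = paired_t(pre, post)
```

The resulting statistic is compared against a t distribution with `dof` degrees of freedom to obtain a p-value; with 34 students, as in the abstract, there would be 33 degrees of freedom.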

Paperid: 3122, https://arxiv.org/pdf/2505.02004.pdf

Abstract:
In a typical authentication process, the local system verifies the user's identity using a stored hash value generated by a cross-system hash algorithm. This article shifts the research focus from traditional password encryption to the establishment of gatekeeping mechanisms for effective interactions between a system and the outside world. Here, we propose a triple-identity authentication system to achieve this goal. Specifically, this local system opens the inner structure of its hash algorithm to all user credentials, including the login name, login password, and authentication password. When a login credential is entered, the local system hashes it and then creates a unique identifier using intermediate hash elements randomly selected from the open algorithm. Importantly, this locally generated unique identifier (rather than the stored hash produced by the open algorithm) is utilized to verify the user's combined identity, which is generated by combining the entered credential with the International Mobile Equipment Identity and the International Mobile Subscriber Identity. The verification process is implemented at each interaction point: the login name field, the login password field, and the server's authentication point. Thus, within the context of this triple-identity authentication system, we establish a robust gatekeeping mechanism for system interactions, ultimately providing a level of security that is equivalent to multi-factor authentication.

Paperid: 3123, https://arxiv.org/pdf/2505.01692.pdf

Abstract:
This project explores speculative evolution through a 3D implementation of Conway's Game of Life, using procedural simulation to generate unfamiliar extraterrestrial organic forms. By applying a volumetric optimized workflow, the raw cellular structures are smoothed into unified, bone-like geometries that resemble hypothetical non-terrestrial morphologies. The resulting forms, strange yet organic, are 3D printed as fossil-like artifacts, presenting a tangible representation of generative structures. This process situates the work at the intersection of artificial life, evolutionary modeling, and digital fabrication, illustrating how simple rules can simulate complex biological emergence and challenge conventional notions of organic form.
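
A 3D generalization of Conway's Game of Life can be sketched in a few lines. The abstract does not state which rule the project uses, so the birth and survival counts below (birth on exactly 6 live neighbours, survival on 5 to 7, out of 26) are an assumption chosen for illustration, as are the function and variable names.

```python
from itertools import product

# The 26 neighbour offsets of a cell in a 3D grid (Moore neighbourhood).
OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def step(alive, birth=frozenset({6}), survive=frozenset({5, 6, 7})):
    """Advance a set of live (x, y, z) cells by one generation.

    The birth/survive neighbour counts are an assumed rule, not taken
    from the project. Live cells with no live neighbours never appear
    in `counts`, so they die, which is correct as long as 0 is not a
    survival count.
    """
    counts = {}
    for x, y, z in alive:
        for dx, dy, dz in OFFSETS:
            cell = (x + dx, y + dy, z + dz)
            counts[cell] = counts.get(cell, 0) + 1
    return {
        cell
        for cell, n in counts.items()
        if (cell in alive and n in survive) or (cell not in alive and n in birth)
    }


# A 2x2x2 block: every cell has 7 live neighbours, so it is stable under this rule.
block = set(product((0, 1), repeat=3))
next_block = step(block)
```

Storing only the live cells as a set keeps the grid unbounded, which suits open-ended form generation; the resulting voxel sets would still need the smoothing step the abstract describes before fabrication.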

Paperid: 3124, https://arxiv.org/pdf/2505.01651.pdf

Abstract:
This paper introduces the HAIG framework for analysing trust dynamics across evolving human-AI relationships. Current categorical frameworks (e.g., "human-in-the-loop" models) inadequately capture how AI systems evolve from tools to partners, particularly as foundation models demonstrate emergent capabilities and multi-agent systems exhibit autonomous goal-setting behaviours. As systems advance, agency redistributes in complex patterns that are better represented as positions along continua rather than discrete categories, though progression may include both gradual shifts and significant step changes. The HAIG framework operates across three levels: dimensions (Decision Authority Distribution, Process Autonomy, and Accountability Configuration), continua (gradual shifts along each dimension), and thresholds (critical points requiring governance adaptation). Unlike risk-based or principle-based approaches, HAIG adopts a trust-utility orientation, focusing on maintaining appropriate trust relationships that maximise utility while ensuring sufficient safeguards. Our analysis reveals how technical advances in self-supervision, reasoning authority, and distributed decision-making drive non-uniform trust evolution across both contextual variation and technological advancement. Case studies in healthcare and European regulation demonstrate how HAIG complements existing frameworks while offering a foundation for alternative approaches that anticipate governance challenges before they emerge.

Paperid: 3125, https://arxiv.org/pdf/2505.01001.pdf

Abstract:
My project looks at an efficient workflow for creative image/video editing using Adobe Photoshop's Actions tool and Batch Processing system. This approach to video editing through Photoshop represents a fundamental shift in creative workflow management by integrating industry-leading image manipulation with video editing techniques. Through systematic automation of Actions, users can achieve a simple and consistent application of visual edits across a string of images. This approach provides an alternative method to optimize productivity while ensuring uniform results across image collections through a post-processing pipeline.

Paperid: 3126, https://arxiv.org/pdf/2505.00987.pdf

Abstract:
Destructive Interference is a data visualization installation that represents the deaths and injuries caused by mass shootings in 2024 in the United States. I parametrically designed and fabricated an interlocking ring sculpture for each month of 2024, where the overall height corresponds to the level of violence in that month. Taller forms mark the deadliest months, while shorter ones reflect fewer casualties. Each inner ring encodes the number of people killed or injured, and each outer ring encodes the number of shootings and the number of days without them. The interlocking cylinders are rotated by a motor and lit from within. As the cylinders rotate, they cast overlapping shadows that represent those killed or injured by mass shootings. The goal of this work is to visualize otherwise overwhelming and disparate statistics in a way that is both physically present and emotionally resonant. By inviting viewers to step into and engage with these shadows, the piece creates space for reflection, conversation, and confrontation with the scale of this ongoing crisis.

Paperid: 3127, https://arxiv.org/pdf/2505.00153.pdf

Abstract:
Visually impaired people face significant challenges when attempting to interact with and understand complex environments, and traditional assistive technologies often struggle to quickly provide necessary contextual understanding and interactive intelligence. This thesis presents Audo-Sight, a state-of-the-art assistive system that seamlessly integrates Multimodal Large Language Models (MLLMs) to provide expedient, context-aware interactions for Blind and Visually Impaired (BVI) individuals. The system operates in two different modalities: personalized interaction through user identification and public access in common spaces like museums and shopping malls. In tailored environments, the system adjusts its output to conform to the preferences of individual users, thus enhancing accessibility through a user-aware form of interaction. In shared environments, Audo-Sight employs a shared architecture that adapts to its current user with no manual reconfiguration required. To facilitate appropriate interactions with the LLM, the public Audo-Sight solution includes an Age-Range Determiner and Safe Query Filter. Additionally, the system ensures that responses are respectful to BVI users through NeMo Guardrails. By utilizing multimodal reasoning, BVI-cognizant response editing, and safeguarding features, this work represents a major leap in AI-driven accessibility technology capable of increasing autonomy, safety, and interaction for people with visual impairments in social settings. Finally, we present the integration of Audo-Sight and SmartSight, which enables enhanced situational awareness for BVI individuals. This integration takes advantage of the real-time visual analysis of SmartSight, combined with the extensive reasoning and interactive capabilities of Audo-Sight, and goes beyond object identification to provide context-driven, voice-controlled assistance in dynamic environments.

Paperid: 3128, https://arxiv.org/pdf/2504.21735.pdf

Abstract:
Massage therapy training emphasizes hands-on techniques and effective therapist--patient communication. However, many educational programs struggle to provide realistic practice scenarios. To address this problem, we propose TheraQuest, a gamified, web-based simulation platform that employs large language models (LLMs) to generate diverse virtual patients with varying symptoms and cultural backgrounds. Through interactive dialogue, anatomical decision-making, and immediate assessment, trainees develop both diagnostic reasoning and empathetic communication skills in a low-risk environment. Unlike exclusively VR-based solutions, TheraQuest remains accessible via standard web browsers, mitigating the cost and discomfort associated with extended headset use. Preliminary testing suggests that integrating LLM-driven virtual patients with real-time skill metrics can enhance trainee engagement and help bridge the gap between theoretical knowledge and clinical proficiency.

Paperid: 3129, https://arxiv.org/pdf/2504.20342.pdf

Abstract:
Reflexion is an AI-powered platform designed to enable structured emotional self-reflection at scale. By integrating real-time emotion detection, layered reflective prompting, and metaphorical storytelling generation, Reflexion empowers users to engage in autonomous emotional exploration beyond basic sentiment categorization. Grounded in theories of expressive writing, cognitive restructuring, self-determination, and critical consciousness development, the system scaffolds a progressive journey from surface-level emotional recognition toward value-aligned action planning. Initial pilot studies with diverse participants demonstrate positive outcomes in emotional articulation, cognitive reframing, and perceived psychological resilience. Reflexion represents a promising direction for scalable, theory-informed affective computing interventions aimed at fostering emotional literacy and psychological growth across educational, therapeutic, and public health contexts.

Paperid: 3130, https://arxiv.org/pdf/2504.19047.pdf

Abstract:
There is growing enthusiasm about the potential for humans and AI to collaborate by leveraging their respective strengths. Yet in practice, this promise often falls short. This paper uses an online experiment to identify non-instrumental image concerns as a key reason individuals underutilize AI recommendations. I show that concerns about how one is perceived, even when those perceptions carry no monetary consequences, lead participants to disregard AI advice and reduce task performance.

Paperid: 3131, https://arxiv.org/pdf/2504.19010.pdf

Abstract:
Different errors that occur in video games are often referred to as glitches or bugs. The goal of this exploratory research is to understand how these glitches and bugs within video games affect a player's experience. To do this, I reviewed relevant literature and observed these errors in different games via Twitch livestreams. I then performed thematic analysis on the observation data and generated themes that tie back to the relevant literature. Most of the current literature focuses on the what and how behind bugs in games, but very little addresses the implications of these bugs for the players' overall experience and what patterns of behavior may emerge because of them.

Paperid: 3132, https://arxiv.org/pdf/2504.18807.pdf

Abstract:
This paper critiques digital cloning in academic research, highlighting how it exemplifies AI solutionism. Digital clones, which replicate user data to simulate behavior, are often seen as scalable tools for behavioral insights. However, this framing obscures ethical concerns around consent, agency, and representation. Drawing on feminist theories of agency, the paper argues that digital cloning oversimplifies human complexity and risks perpetuating systemic biases. To address these issues, it proposes decentralized data repositories and dynamic consent models, promoting ethical, context-aware AI practices that challenge the reductionist logic of AI solutionism.

Paperid: 3133, https://arxiv.org/pdf/2504.18759.pdf

Abstract:
Isolated perspectives have often paved the way for great scientific discoveries. However, many breakthroughs only emerged when moving away from singular views towards interactions. Discussions on Artificial Intelligence (AI) typically treat human and AI bias as distinct challenges, leaving their dynamic interplay and compounding potential largely unexplored. Recent research suggests that biased AI can amplify human cognitive biases, while well-calibrated systems might help mitigate them. In this position paper, I advocate for moving beyond the separate treatment of human and AI biases and instead focusing on their interaction effects. I argue that a comprehensive framework, one that maps (compound human-AI) biases to mitigation strategies, is essential for understanding and protecting human cognition, and I outline concrete steps for its development.

Paperid: 3134, https://arxiv.org/pdf/2504.17906.pdf

Abstract:
Access control needs have broad design implications, but access control specifications may be elicited before, during, or after these needs are captured. Because access control knowledge is distributed, we need to make knowledge asymmetries more transparent, and use expertise already available to stakeholders. In this paper, we present a tool-supported technique identifying knowledge asymmetries around access control based on asset and goal models. Using simple and conventional modelling languages that complement different design techniques, we provide boundary objects to make access control transparent, thereby making knowledge about access control concerns more symmetric. We illustrate this technique using a case study example considering the suitability of a reusable software component in a new military air system.

Paperid: 3135, https://arxiv.org/pdf/2504.17171.pdf

Abstract:
This paper introduces an augmented reality (AR) captioning framework designed to support Deaf and Hard of Hearing (DHH) learners in STEM classrooms by integrating non-verbal emotional cues into live transcriptions. Unlike conventional captioning systems that offer only plain text, our system fuses real-time speech recognition with affective and visual signal interpretation, including facial movements, gestures, and vocal tone, to produce emotionally enriched captions. These enhanced captions are rendered in an AR interface developed with Unity and provide contextual annotations such as speaker tone markers (e.g., "concerned") and gesture indicators (e.g., "nods"). The system leverages live camera and microphone input, processed through AI models to detect multimodal cues. Findings from preliminary evaluations suggest that this AR-based captioning approach significantly enhances comprehension and reduces cognitive effort compared to standard captions. Our work emphasizes the potential of immersive environments for inclusive, emotion-aware educational accessibility.

Paperid: 3136, https://arxiv.org/pdf/2504.17170.pdf

Abstract:
Unresolved questions about how autonomous vehicles (AVs) should meet the informational needs of riders hinder real-world adoption. Complicating our ability to satisfy rider needs is that different people, goals, and driving contexts have different criteria for what constitutes interaction success. Unfortunately, most human-AV research and design today treats all people and situations uniformly. It is crucial to understand how an AV should communicate to meet rider needs, and how communications should change when the human-AV complex system changes. I argue that understanding the relationships between different aspects of the human-AV system can help us build improved and adaptable AV communications. I support this argument using three empirical studies. First, I identify optimal communication strategies that enhance driving performance, confidence, and trust for learning in extreme driving environments. Findings highlight the need for task-sensitive, modality-appropriate communications tuned to learner cognitive limits and goals. Next, I highlight the consequences of deploying faulty communication systems and demonstrate the need for context-sensitive communications. Third, I use machine learning (ML) to illuminate personal factors predicting trust in AVs, emphasizing the importance of tailoring designs to individual traits and concerns. Together, this dissertation supports the necessity of transparent, adaptable, and personalized AV systems that cater to individual needs, goals, and contextual demands. By considering the complex system within which human-AV interactions occur, we can deliver valuable insights for designers, researchers, and policymakers. This dissertation also provides a concrete domain to study theories of human-machine joint action and situational awareness, and can be used to guide future human-AI interaction research. [shortened for arxiv]

Paperid: 3137, https://arxiv.org/pdf/2504.16824.pdf

Abstract:
This article delves into the intricate design and meticulous construction of a digital platform aimed at revolutionizing early-age English education, particularly for Spanish-speaking children. The work focuses on innovative methodologies, vibrant and engaging visuals, and a comprehensive approach to phonics. The principles of usability, accessibility, and user-centered design are intricately woven into every facet of the platform's architecture.

Paperid: 3138, https://arxiv.org/pdf/2504.16770.pdf

Abstract:
While generative artificial intelligence (Gen AI) increasingly transforms academic environments, a critical gap exists in understanding and mitigating human biases in AI interactions, such as anchoring and confirmation bias. This position paper advocates for metacognitive AI literacy interventions to help university students critically engage with AI and address biases across the Human-AI interaction workflows. The paper presents the importance of considering (1) metacognitive support with deliberate friction focusing on human bias; (2) bi-directional Human-AI interaction intervention addressing both input formulation and output interpretation; and (3) adaptive scaffolding that responds to diverse user engagement patterns. These frameworks are illustrated through ongoing work on "DeBiasMe," AIED (AI in Education) interventions designed to enhance awareness of cognitive biases while empowering user agency in AI interactions. The paper invites multiple stakeholders to engage in discussions on design and evaluation methods for scaffolding mechanisms, bias visualization, and analysis frameworks. This position contributes to the emerging field of AI-augmented learning by emphasizing the critical role of metacognition in helping students navigate the complex interaction between human, statistical, and systemic biases in AI use while highlighting how cognitive adaptation to AI systems must be explicitly integrated into comprehensive AI literacy frameworks.

Paperid: 3139, https://arxiv.org/pdf/2504.16092.pdf

Abstract:
Cooperative speech is purposive. From the speaker's perspective, one crucial purpose is the transmission of knowledge. Cooperative speakers care about getting things right for their conversational partners. This attitude is a kind of respect. Cooperative speech is an ideal form of communication because participants have respect for each other. And having respect within a cooperative enterprise is sufficient for a particular kind of moral standing: we ought to respect those who have respect for us. Respect demands reciprocity. I maintain that large language models aren't owed the kind of respect that partly constitutes a cooperative conversation. This implies that they aren't cooperative interlocutors, otherwise we would be obliged to reciprocate the attitude. Leveraging this conclusion, I argue that present-day LLMs are incapable of assertion and that this raises an overlooked doubt about their semantic competence. One upshot of this argument is that knowledge of meaning isn't just a subject for the cognitive psychologist. It's also a subject for the moral psychologist.

Paperid: 3140, https://arxiv.org/pdf/2504.15970.pdf

Abstract:
Extended Reality (XR), encompassing Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR), is a transformative technology that bridges the physical and virtual worlds and has diverse potential to become ubiquitous in the future. This review examines XR's evolution through a foundational framework (hardware ranging from monitors to sensors, and software ranging from visual tasks to user interfaces); highlights state-of-the-art (SOTA) XR products, comparing and analyzing their performance based on this foundational framework; and discusses how commercial XR devices can meet the demand for high-quality performance, focusing on spatial intelligence. For future directions, attention should be given to the integration of multi-modal AI and IoT-driven digital twins to enable adaptive XR systems. With the concept of spatial intelligence, future XR should establish a new digital space with realistic experiences that benefit humanity. This review underscores the pivotal role of AI in unlocking XR as the next frontier in human-computer interaction.

Paperid: 3141, https://arxiv.org/pdf/2504.15408.pdf

Abstract:
The goal of this exploratory research is to investigate how glitches and bugs within video games affect a player's overall experience. The severity or frequency of bugs, as well as the nature of the bugs present, could influence how players perceive these interactions. Another factor is the player's personality, because this will affect their motivations for playing certain games as well as how they react to bugs within these games. Glitches and bugs are framed as a negative aspect of games, but they create the potential for enjoyable experiences, despite being unexpected. To explore this hypothesis, I observed glitches within recorded gameplay via YouTube and Twitch livestream VODs and analyzed the streamers' reactions, as well as the audiences'. I also conducted semi-structured interviews with gamers with the goal of learning more about each player's personality and attitudes towards bugs in the games they play. I concluded that the types of bugs matter less to the players than how frequently they occur, the context in which they occur, and their outcome.

Paperid: 3142, https://arxiv.org/pdf/2504.15132.pdf

Abstract:
The widespread adoption of generative artificial intelligence/machine learning (AI/ML) technologies has increased the need to support youth in developing AI/ML literacies. However, most work has centered on preparing young people to use these systems, with less attention to how they can participate in designing and evaluating them. This study investigates how engaging young people in the design and auditing of generative language models (GLMs) may foster the development of their understanding of how these systems work from both technical and ethical perspectives. The study takes an in-pieces approach to investigate novices' conceptions of GLMs. Such an approach supports the analysis of how technical and ethical conceptions evolve and relate to each other. I am currently conducting a series of participatory design workshops with sixteen ninth graders (ages 14-15) in which they will (a) build GLMs from a data-driven perspective that glassboxes how data shapes model performance and (b) audit commercial GLMs by repeatedly and systematically querying them to draw inferences about their behaviors. I will analyze participants' interactions to identify ethical and technical conceptions they may exhibit while designing and auditing GLMs. I will also conduct clinical interviews and use microgenetic knowledge analysis and ordered network analysis to investigate how participants' ethical and technical conceptions of GLMs relate to each other and change after the workshop. The study will contribute (a) evidence of how engaging youth in design and auditing activities may support the development of ethical and technical understanding of GLMs and (b) an inventory of novice design and auditing practices that may support youth's technical and ethical understanding of GLMs.

Paperid: 3143, https://arxiv.org/pdf/2504.14873.pdf

Abstract:
In this position paper, the author presents a process artifact that aims to serve as an archival and educational tool that revitalizes World War II oral histories in Japan. First, the author introduces the historical background and how the work is informed by the positionality of the author. Then, the author presents features of the artifact using references to interview footage of the author's grandmother and grandaunt sharing their firsthand accounts of the 1945 Tokyo Air Raids. The affordances and barriers of this application of augmented reality are discussed, and a list of questions to be posed at the workshop is included.

Paperid: 3144, https://arxiv.org/pdf/2504.14631.pdf

Abstract:
With artificial intelligence (AI) embedded in many everyday software systems, effectively and reliably developing and maintaining AI systems becomes an essential skill for software developers. However, the complexity inherent to AI poses new challenges. Explainable AI (XAI) may allow developers to understand better the systems they build, which, in turn, can help with tasks like debugging. In this paper, we report insights from a series of surveys with software developers that highlight that there is indeed an increased need for explanatory tools to support developers in creating AI systems. However, the feedback also indicates that existing XAI systems still fall short of this aspiration. Thus, we see an unmet need to provide developers with adequate support mechanisms to cope with this complexity so they can embed AI into high-quality software in the future.

Paperid: 3145, https://arxiv.org/pdf/2504.13965.pdf

Abstract:
This study explores the use of electroencephalography (EEG)-based brain wave monitoring to enable dynamic difficulty adjustment (DDA) in a virtual reality (VR) gaming environment. Using the Task Engagement Index (TEI) derived from frontal EEG electrodes, we adapt game challenge levels in real time to maintain optimal player engagement. In a within-subject design with six participants, we found that the DDA condition significantly increased engagement duration by 19.79% compared to a non-DDA control condition. These results suggest that combining EEG, DDA, and VR technologies can enhance user experience and has potential applications in adaptive learning, rehabilitation, and personalized interfaces.
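
The adaptation loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly used engagement-index formulation beta / (alpha + theta), which the abstract does not confirm, and the thresholds, level range, and adjustment direction (raise the challenge when engagement is low, lower it when the index is high) are hypothetical choices made only for the example.

```python
def task_engagement_index(theta: float, alpha: float, beta: float) -> float:
    """Engagement index from band powers, assumed here to be beta / (alpha + theta)."""
    denom = alpha + theta
    if denom <= 0:
        raise ValueError("alpha + theta band power must be positive")
    return beta / denom


def adjust_difficulty(level: int, tei: float,
                      low: float = 0.4, high: float = 0.8,
                      min_level: int = 1, max_level: int = 10) -> int:
    """Move the difficulty one step based on the index (hypothetical thresholds).

    Assumption: a low index signals boredom, so the challenge is raised;
    a high index signals overload, so the challenge is lowered.
    """
    if tei < low:
        level += 1
    elif tei > high:
        level -= 1
    # Clamp to the allowed difficulty range.
    return max(min_level, min(max_level, level))


if __name__ == "__main__":
    # Hypothetical band powers from frontal electrodes for one time window.
    tei = task_engagement_index(theta=4.0, alpha=6.0, beta=3.0)
    print(tei, adjust_difficulty(level=5, tei=tei))
```

In a real system the band powers would come from a spectral estimate over a sliding EEG window, and the thresholds would likely be calibrated per participant.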

Paperid: 3146, https://arxiv.org/pdf/2504.13946.pdf

Abstract:
Machine learning models are often criticized as opaque because their decision-making processes lack transparency. This study examines how policy design impacts the quality of explanations in ML models. We conducted a classroom experiment with 124 participants and analyzed the effects of policy length and purpose on developer compliance with policy requirements. Our results indicate that while policy length affects engagement with some requirements, policy purpose has no effect, and explanation quality is generally poor. These findings highlight the challenge of effective policy development and the importance of addressing diverse stakeholder perspectives within explanations.

Paperid: 3147, https://arxiv.org/pdf/2504.13928.pdf

Abstract:
NPCs in traditional games are often limited by static dialogue trees and a single platform for interaction. To overcome these constraints, this study presents a prototype system that enables large language model (LLM)-powered NPCs to communicate with players both in the game environment (Unity) and on a social platform (Discord). Dialogue logs are stored in a cloud database (LeanCloud), allowing the system to synchronize memory between platforms and keep conversations coherent. Our initial experiments show that cross-platform interaction is technically feasible and suggest a solid foundation for future developments such as emotional modeling and persistent memory support.

Paperid: 3148, https://arxiv.org/pdf/2504.13905.pdf

Abstract:
Mathematical research data plays a crucial role across scientific disciplines, yet its documentation and dissemination remain challenging due to the lack of standardized research data management practices. The MaRDMO Plugin addresses these challenges by integrating mathematical models, algorithms, and interdisciplinary workflows into the established framework of the Research Data Management Organiser (RDMO). Built on FAIR principles, MaRDMO enables structured documentation and retrieval of mathematical research data through guided questionnaires. It connects to multiple knowledge graphs, including MathModDB, MathAlgoDB, and the MaRDI Portal. Users can document and search for models, algorithms, and workflows via dynamic selection interfaces that also leverage other sources such as Wikidata. The plugin facilitates the export to the individual MaRDI services, ensuring data quality through automated validation. By embedding mathematical research data management into the widely adopted RDMO platform, MaRDMO represents a significant step toward making mathematical research data more findable, accessible, and reusable.

Paperid: 3149, https://arxiv.org/pdf/2504.13896.pdf

Abstract:
Currently, it is impossible for educators to be in multiple places simultaneously and teach each student individually. Technologies such as Extended Reality (XR) and Artificial Intelligence (AI) enable the creation of realistic educational copies of experts that preserve not only visual and mental characteristics but also social aspects crucial for learning. However, research in this area is limited, which opens new questions for future work. This paper discusses how these human digital twins can potentially improve aspects like scalability, engagement, and preservation of social learning factors. While this technology offers benefits, it also introduces challenges related to educator autonomy, social interaction shifts, and ethical considerations such as privacy, bias, and identity preservation. We outline key research questions that need to be addressed to ensure that human digital twins enhance the social aspects of education instead of harming them.

Paperid: 3150, https://arxiv.org/pdf/2504.13894.pdf

Abstract:
This paper explores the integration of Artificial Intelligence (AI) in the design of interactive experiences for Cultural Heritage (CH). Previous studies either fail to capture the specificity of CH or mention possible tools without making a clear reference to a structured Interaction Design (IxD) workflow. The study also attempts to overcome one of the major limitations of traditional literature reviews, which may fail to capture proprietary tools whose release is rarely accompanied by academic publications. Besides the analysis of previous research, the study proposes a possible workflow for IxD in CH, subdivided into phases and tasks: for each of them, this paper proposes possible AI-based tools that can support the activity of designers, curators, and CH professionals. The review concludes with a final section outlining future paths for research and development in this domain.

Paperid: 3151, https://arxiv.org/pdf/2504.13885.pdf

Abstract:
User Experience (UX) in virtual worlds is a fast-developing discipline that requires creative design concepts to overcome the divide between physical and virtual interaction. This research investigates primary principles and techniques to improve UX in virtual experiences based on usability, accessibility, user engagement, and technology advancements. It gives detailed insight into trends, issues, and prospects for UX design of virtual applications that guarantee an efficient, easy-to-use, and immersive experience.

Paperid: 3152, https://arxiv.org/pdf/2504.13878.pdf

Abstract:
Papert's constructionism makes it clear that learning is particularly effective when learners create tangible artifacts and share and discuss them in social contexts. Technological progress in recent decades has created numerous opportunities for learners to not only passively consume media, but to actively shape it through construction. This article uses the EDUMING concept to present a new method to simplify the development of digital learning games and thus support their integration into learning situations. A key difference between the concept and established ideas such as game-based learning, gamification, and serious games is that games are not closed products that are only consumed passively, but can also be actively developed by individual users, who modify the source code with the help of an IDE. As part of an empirical study, the usability of the game "Professor Chip's Learning Quest" (PCLQ) is recorded, as well as previous experience with digital learning games and the acceptance and motivation to use new technologies. The purpose of this article is to test the PCLQ digital learning game, developed according to the EDUMING concept, as part of an exploratory study regarding its usability, acceptance and suitability for use in schools. The study is intended as a first empirical approach to practical testing of the concept.

Paperid: 3153, https://arxiv.org/pdf/2504.13876.pdf

Abstract:
Walking along trails in natural areas is a rewarding experience, but visitors sometimes need proper assistance to enhance their enjoyment, maximize learning, and ensure safety. Over the years, various signage techniques have been introduced, but today, the widespread use of smartphones offers new opportunities for visitor support. In this paper, we outline the key principles for designing an Android app tailored for geotourists. Our approach begins by defining user personas and deriving app requirements based on their needs. We then present a proof of concept that addresses the critical aspects identified during the design process.

Paperid: 3154, https://arxiv.org/pdf/2504.13870.pdf

Abstract:
Machine learning and automation are transforming scientific research, yet the implementation of self-driving laboratories (SDLs) remains costly and complex, and it remains difficult to learn how to use these facilities. To address this, we introduce Claude-Light, a lightweight, remotely accessible instrument designed for prototyping automation algorithms and machine learning workflows. Claude-Light integrates a REST API, a Raspberry Pi-based control system, and an RGB LED with a photometer that measures ten spectral outputs, providing a controlled but realistic experimental environment. This device enables users to explore automation at multiple levels, from basic programming and experimental design to machine learning-driven optimization. We demonstrate the application of Claude-Light in structured automation approaches, including traditional scripting, statistical design of experiments, and active learning methods. Additionally, we explore the role of large language models (LLMs) in laboratory automation, highlighting their use in instrument selection, structured data extraction, function calling, and code generation. While LLMs present new opportunities for streamlining automation, they also introduce challenges related to reproducibility, security, and reliability. We discuss strategies to mitigate these risks while leveraging LLMs for enhanced efficiency in self-driving laboratories. Claude-Light provides a practical and accessible platform for students and researchers to develop automation skills and test algorithms before deploying them in larger-scale SDLs. By lowering the barrier to entry for automation in scientific research, this tool facilitates broader adoption of AI-driven experimentation and fosters innovation in autonomous laboratories.

Paperid: 3155, https://arxiv.org/pdf/2504.13858.pdf

Abstract:
The desirable properties of explanations in information systems have fueled the demands for transparency in artificial intelligence (AI) outputs. To address these demands, the field of explainable AI (XAI) has put forth methods that can support human decision-making by explaining AI outputs. However, current empirical works present inconsistent findings on whether such explanations help to improve users' task performance in decision support systems (DSS). In this paper, we conduct a meta-analysis to explore how XAI affects human performance in classification tasks. Our results show an improvement in task performance through XAI-based decision support, though explanations themselves are not the decisive driver for this improvement. The analysis reveals that the studies' risk of bias moderates the effect of explanations in AI, while the explanation type appears to play only a negligible role. Our findings contribute to the human-computer interaction field by enhancing the understanding of human-XAI collaboration in DSS.

Paperid: 3156, https://arxiv.org/pdf/2504.13851.pdf

Abstract:
The nascent field of neurogames relies on active Brain-Computer Interface input to drive its game mechanics. Consequently, users expect their conscious will to be meaningfully reflected in the virtual environment they are engaging with. Additionally, the videogame industry considers it paramount to provide gamers with seamless experiences to avoid disrupting their state of flow. Thus, this paper suggests gamification as a strategy to camouflage the often fatiguing data acquisition process in Machine Learning from neurodata so that neurogamers can further immerse themselves in the virtual experience while Artificial Intelligence models benefit from data taken in reproducible contexts.

Paperid: 3157, https://arxiv.org/pdf/2504.13849.pdf

Abstract:
Despite the rise in affordable eXtended Reality (XR) technologies, accessibility remains a key concern, with people with disabilities often excluded from these immersive XR platforms. Consequently, there has been a notable surge in HCI research on creating accessible XR solutions (also known as assistive XR). This increased focus on assistive XR research is also reflected in the number of research and innovative solutions submitted to the ACM Conference on Accessible Computing (ASSETS), with the aim of making XR experiences inclusive for disabled communities. However, to date, there is little to no work that provides a comprehensive overview of state-of-the-art research in assistive XR for disability at ACM ASSETS, a premier conference dedicated to research in HCI for people with disabilities. This study aims to fill this research gap by conducting a scoping review of literature delineating the key focus areas, research methods, statistical and temporal trends in XR research for disability at ACM ASSETS (2019-2023). From a pool of 1595 articles submitted to ASSETS, 26 articles are identified that specifically focus on XR research for disability. Through a detailed analysis, 6 key focus areas of XR research explored at ACM ASSETS are identified and a detailed examination of each is provided. Additionally, an overview of multiple research methods employed for XR research at ASSETS is also presented. Lastly, this work reports on the statistics and temporal trends regarding the number of publications, XR technologies used, disabilities addressed, and methodologies adopted for assistive XR research at ASSETS, highlighting emerging trends and possible future research directions.

Paperid: 3158, https://arxiv.org/pdf/2504.13847.pdf

Abstract:
Recent advances in large language models (LLMs) offer unprecedented opportunities to enhance human-AI collaboration in qualitative research methods, including interviews. While interviews are highly valued for gathering deep, contextualized insights, interviewers often face significant cognitive challenges, such as real-time information processing, question adaptation, and rapport maintenance. My doctoral research introduces Interview AI-ssistant, a system designed for real-time interviewer-AI collaboration during both the preparation and execution phases. Through four interconnected studies, this research investigates the design of effective human-AI collaboration in interviewing contexts, beginning with a formative study of interviewers' needs, followed by a prototype development study focused on AI-assisted interview preparation, an experimental evaluation of real-time AI assistance during interviews, and a field study deploying the system in a real-world research setting. Beyond informing practical implementations of intelligent interview support systems, this work contributes to the Intelligent User Interfaces (IUI) community by advancing the understanding of human-AI collaborative interfaces in complex social tasks and establishing design guidelines for AI-enhanced qualitative research tools.

Paperid: 3159, https://arxiv.org/pdf/2504.13846.pdf

Abstract:
This Master's Thesis in Computer Science dives into the design and creation of a user-friendly interface for VoxLogicA, an image analysis tool using spatial model checking with a focus on neuroimaging. The research tackles the problem of existing tools being too complex, which makes them hard for medical professionals and researchers to use. By using spatial logic, the goal is to make these powerful analytical tools more practical and accessible in real-world clinical settings. The main objectives are to design a modern web interface that's easy to use, build it with the latest web technologies (e.g. Svelte and Niivue), and test its effectiveness through user studies and real-world case analyses.

Paperid: 3160, https://arxiv.org/pdf/2504.13843.pdf

Abstract:
The use of animation to gain user attention has been increasing, supported by various studies on user behavior and psychology. However, excessive use of animation in interfaces can negatively impact the user. This paper deals with a specific type of animation within a specialized domain of e-commerce. Drawing upon theories such as the Zeigarnik Effect, Aesthetic-Usability effect, Peak-End rule, and Hick's law, we analyze user behavior and psychology when exposed to a dynamic price-drop animation. Unlike a conventional static pricing strategy, this animation introduces movement to signify price reduction. In our theoretical study approach, we evaluate and present a user study on how such an animation influences user perception, psychology, and attention. If applied effectively, dynamic animations can enhance engagement, spark anticipation, and subconsciously create a positive experience by reducing cognitive load.

Paperid: 3161, https://arxiv.org/pdf/2504.13840.pdf

Abstract:
Experimentation is a cornerstone of successful game development and live operations, enabling teams to optimize player engagement, retention, and monetization. This article provides a comprehensive guide to implementing experimentation in gaming, structured around the game development lifecycle and the marketing mix. From pre-launch concept testing and prototyping to post-launch personalization and LiveOps, experimentation plays a pivotal role in driving innovation and adapting game experiences to diverse player preferences. Gaming presents unique challenges, such as highly engaged communities, complex interactive systems, and highly heterogeneous and evolving player behaviors, which require tailored approaches to experimentation. The article emphasizes the importance of collaborative frameworks across product, marketing, and analytics teams and provides practical guidance to game makers on how to adopt experimentation successfully. It also addresses ethical considerations like fairness and player autonomy.

Paperid: 3162, https://arxiv.org/pdf/2504.13777.pdf

Abstract:
This paper proposes a conceptual framework for understanding AI hallucinations as a distinct form of misinformation. While misinformation scholarship has traditionally focused on human intent, generative AI systems now produce false yet plausible outputs absent of such intent. I argue that these AI hallucinations should not be treated merely as technical failures but as communication phenomena with social consequences. Drawing on a supply-and-demand model and the concept of distributed agency, the framework outlines how hallucinations differ from human-generated misinformation in production, perception, and institutional response. I conclude by outlining a research agenda for communication scholars to investigate the emergence, dissemination, and audience reception of hallucinated content, with attention to macro (institutional), meso (group), and micro (individual) levels. This work urges communication researchers to rethink the boundaries of misinformation theory in light of probabilistic, non-human actors increasingly embedded in knowledge production.

Paperid: 3163, https://arxiv.org/pdf/2504.13667.pdf

Abstract:
This paper presents a hopeful perspective on the potentially dramatic impacts of Large Language Models on how children learn and how they will expect to interact with technology. We review the effects of LLMs on education so far, and make the case that these effects are minor compared to the upcoming changes. We present a small scenario and self-ethnographic study demonstrating the effects of these changes, and define five significant considerations that interactive systems designers will have to accommodate in the future.

Paperid: 3164, https://arxiv.org/pdf/2504.13477.pdf

Abstract:
The idea of augmented or hybrid intelligence offers a compelling vision for combining human and AI capabilities, especially in tasks where human wisdom, expertise, or common sense are essential. Unfortunately, human reasoning can be flawed and shortsighted, resulting in adverse individual impacts or even long-term societal consequences. While strong efforts are being made to develop and optimize the AI aspect of hybrid reasoning, the real urgency lies in fostering wiser and more intelligent human participation. Tools that enhance critical thinking, ingenuity, expertise, and even wisdom could be essential in addressing the challenges of our emerging future. This paper proposes the development of generative AI-based tools that enhance both the human ability to reflect upon a problem as well as the ability to explore the technical aspects of it. A high-level model is also described for integrating AI and human capabilities in a way that centralizes human participation and control.

Paperid: 3165, https://arxiv.org/pdf/2504.13261.pdf

Abstract:
Purpose: The rapid emergence of large language models (LLMs) such as ChatGPT has significantly impacted foreign language education, yet their pedagogical grammar competence remains under-assessed. This paper introduces CPG-EVAL, the first dedicated benchmark specifically designed to evaluate LLMs' knowledge of pedagogical grammar within the context of foreign language instruction. Methodology: The benchmark comprises five tasks designed to assess grammar recognition, fine-grained grammatical distinction, categorical discrimination, and resistance to linguistic interference. Findings: Smaller-scale models can succeed in single language instance tasks, but struggle with multiple instance tasks and interference from confusing instances. Larger-scale models show better resistance to interference but still have significant room for accuracy improvement. The evaluation indicates the need for better instructional alignment and more rigorous benchmarks, to effectively guide the deployment of LLMs in educational contexts. Value: This study offers the first specialized, theory-driven, multi-tiered benchmark framework for systematically evaluating LLMs' pedagogical grammar competence in Chinese language teaching contexts. CPG-EVAL not only provides empirical insights for educators, policymakers, and model developers to better gauge AI's current abilities in educational settings, but also lays the groundwork for future research on improving model alignment, enhancing educational suitability, and ensuring informed decision-making concerning LLM integration in foreign language instruction.

Paperid: 3166, https://arxiv.org/pdf/2504.13183.pdf

Abstract:
Artificial intelligence (AI) conversational agents hold a promising future in the field of mental health, especially in helping marginalized communities that lack access to mental health support services. It is tempting to have a 24/7 mental health companion that can be accessed anywhere using mobile phones to provide therapist-like advice. Yet, caution should be taken, and studies around their feasibility need to be surveyed. Before adopting such a rapidly changing technology, studies on its feasibility should be explored, summarized, and synthesized to gain a solid understanding of the status quo and to enable us to build a framework that can guide us throughout the development and deployment processes. Different perspectives must be considered when investigating the feasibility of AI conversational agents, including the mental healthcare professional perspective. The literature can provide insights into their perspectives in terms of opportunities, concerns, and implications. Mental health professionals, the subject-matter experts in this field, have their points of view that should be understood and considered. This systematic literature review will explore mental health practitioners' attitudes toward AI conversational agents and the factors that affect their adoption and recommendation of the technology to augment their services and treatments. The TAM3 Framework will be the lens through which this systematic literature review will be conducted.

Paperid: 3167, https://arxiv.org/pdf/2504.13182.pdf

Abstract:
This study investigates the relationship between eye fixation patterns and performance in Java programming exercises using eye-tracking technology. Thirty-one students from a university in Metro Manila participated, and their eye movements were recorded while solving five Java programming exercises (three of the five exercises were selected for analysis). The fixation data were preprocessed and visualized using heatmap bin graphs, dividing the participants into correct and wrong answer groups. The Mann-Whitney U Test was employed to determine if there were significant differences in the fixation patterns between the two groups.
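
The group comparison described in this abstract can be illustrated with a small, dependency-free sketch of the Mann-Whitney U test. It is an illustration only: the fixation counts are made-up values, the function name is hypothetical, and the p-value uses the plain normal approximation without tie or continuity correction, which may differ from the procedure the authors actually ran.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p) for two independent samples.

    U is the smaller of the two rank-based statistics; ties count as 0.5.
    The p-value uses the normal approximation with no tie or continuity
    correction, so it is only a rough guide for small samples.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Count, over all pairs, how often a value from x exceeds a value from y.
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p


if __name__ == "__main__":
    # Hypothetical fixation counts in one screen region for each group.
    correct_group = [12, 15, 9, 14, 11, 13]
    wrong_group = [18, 21, 17, 16, 22, 19]
    u, p = mann_whitney_u(correct_group, wrong_group)
    print(f"U = {u}, p = {p:.4f}")
```

A rank-based test suits this kind of data because fixation counts are typically skewed and the groups are small; for exact small-sample p-values or tie handling, a statistics library would be the safer choice.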

Paperid: 3168, https://arxiv.org/pdf/2504.12977.pdf

Abstract:
This paper presents a novel research analytical IT system grounded in Martin Heidegger's Fundamental Ontology, distinguishing between beings (das Seiende) and Being (das Sein). The system employs two modally distinct, descriptively complete languages: a categorical language of beings for processing user inputs and an existential language of Being for internal analysis. These languages are bridged via a phenomenological reduction module, enabling the system to analyze user queries (including questions, answers, and dialogues among IT specialists), identify recursive and self-referential structures, and provide actionable insights in categorical terms. Unlike contemporary systems limited to categorical analysis, this approach leverages Heidegger's phenomenological existential analysis to uncover deeper ontological patterns in query processing, aiding in resolving logical traps in complex interactions, such as metaphor usage in IT contexts. The path to full realization involves formalizing the language of Being by a research team based on Heidegger's Fundamental Ontology; given the existing completeness of the language of beings, this reduces the system's computability to completeness, paving the way for a universal query analysis tool. The paper presents the system's architecture, operational principles, technical implementation, use cases--including a case based on real IT specialist dialogues--comparative evaluation with existing tools, and its advantages and limitations.

Paperid: 3169, https://arxiv.org/pdf/2504.12891.pdf

Abstract:
The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain-adaptability and contextual awareness, with superior translation quality to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT and their integration into professional translation workflows, and shares a demo of the system analyzed in the paper.

Paperid: 3170, https://arxiv.org/pdf/2504.12593.pdf

Abstract:
Learning is an active process that is deeply tied to physical and social contexts. Yet schools traditionally place learners in a passive role and focus on decontextualizing knowledge. Situating learning in more authentic tasks and contexts typically requires taking it outside the classroom via field trips and apprenticeships, but virtual reality (VR) is a promising tool to bring more authentically situated learning experiences into classrooms. In this position paper, I discuss how one of VR's primary affordances for learning is heightening agency, and how such heightened agency can facilitate more authentically situated learning by allowing learners legitimate peripheral participation.

Paperid: 3171, https://arxiv.org/pdf/2504.12337.pdf

Abstract:
The emergence of generative AI chatbots such as ChatGPT has prompted growing public and academic interest in their role as informal mental health support tools. While early rule-based systems have been around for several years, large language models (LLMs) offer new capabilities in conversational fluency, empathy simulation, and availability. This study explores how users engage with LLMs as mental health tools by analyzing over 10,000 TikTok comments from videos referencing LLMs as mental health tools. Using a self-developed tiered coding schema and supervised classification models, we identify user experiences, attitudes, and recurring themes. Results show that nearly 20% of comments reflect personal use, with these users expressing overwhelmingly positive attitudes. Commonly cited benefits include accessibility, emotional support, and perceived therapeutic value. However, concerns around privacy, generic responses, and the lack of professional oversight remain prominent. It is important to note that the user feedback does not indicate which therapeutic framework, if any, the LLM-generated output aligns with. While the findings underscore the growing relevance of AI in everyday practices, they also highlight the urgent need for clinical and ethical scrutiny in the use of AI for mental health support.

Paperid: 3172, https://arxiv.org/pdf/2504.10620.pdf

Abstract:
SPREV, short for hyperSphere Reduced to two-dimensional Regular Polygon for Visualisation, is a novel dimensionality reduction technique developed to address the challenges of reducing dimensions and visualizing labeled datasets that exhibit a unique combination of three characteristics: small class size, high dimensionality, and low sample size. SPREV is designed not only to uncover but also to visually represent hidden patterns within such datasets. Its distinctive integration of geometric principles, adapted for discrete computational environments, makes it an indispensable tool in the modern data science toolkit, enabling users to identify trends, extract insights, and navigate complex data efficiently and effectively.

Paperid: 3173, https://arxiv.org/pdf/2504.09861.pdf

Abstract:
Large language models (LLMs) are transforming global decision-making and societal systems by processing diverse data at unprecedented scales. However, their potential to homogenize human values poses critical risks, similar to biodiversity loss undermining ecological resilience. Rooted in the ancient Greek concept of ethos, meaning both individual character and the shared moral fabric of communities, EthosGPT draws on a tradition that spans from Aristotle's virtue ethics to Adam Smith's moral sentiments as the ethical foundation of economic cooperation. These traditions underscore the vital role of value diversity in fostering social trust, institutional legitimacy, and long-term prosperity. EthosGPT addresses the challenge of value homogenization by introducing an open-source framework for mapping and evaluating LLMs within a global scale of human values. Using international survey data on cultural indices, prompt-based assessments, and comparative statistical analyses, EthosGPT reveals both the adaptability and biases of LLMs across regions and cultures. It offers actionable insights for developing inclusive LLMs, such as diversifying training data and preserving endangered cultural heritage to ensure representation in AI systems. These contributions align with the United Nations Sustainable Development Goals (SDGs), especially SDG 10 (Reduced Inequalities), SDG 11.4 (Cultural Heritage Preservation), and SDG 16 (Peace, Justice and Strong Institutions). Through interdisciplinary collaboration, EthosGPT promotes AI systems that are both technically robust and ethically inclusive, advancing value plurality as a cornerstone for sustainable and equitable futures.

Paperid: 3174, https://arxiv.org/pdf/2504.09343.pdf

Abstract:
This article explores the phenomenon of confirmation bias in generative AI chatbots, a relatively underexamined aspect of AI-human interaction. Drawing on cognitive psychology and computational linguistics, it examines how confirmation bias, commonly understood as the tendency to seek information that aligns with existing beliefs, can be replicated and amplified by the design and functioning of large language models. The article analyzes the mechanisms by which confirmation bias may manifest in chatbot interactions, assesses the ethical and practical risks associated with such bias, and proposes a range of mitigation strategies. These include technical interventions, interface redesign, and policy measures aimed at promoting balanced AI-generated discourse. The article concludes by outlining future research directions, emphasizing the need for interdisciplinary collaboration and empirical evaluation to better understand and address confirmation bias in generative AI systems.

Paperid: 3175, https://arxiv.org/pdf/2504.08755.pdf

Abstract:
While it is increasingly evident that the internet is becoming saturated with content created by generative AI large language models, accurately measuring the scale of this phenomenon has proven challenging. By analyzing the frequency of specific keywords commonly used by ChatGPT, this paper demonstrates that such linguistic markers can effectively be used to estimate the presence of generative AI content online. The findings suggest that at least 30% of text on active web pages originates from AI-generated sources, with the actual proportion likely approaching 40%. Given the implications of autophagous loops, this is a sobering realization.

Paperid: 3176, https://arxiv.org/pdf/2504.08739.pdf

Abstract:
The rapid progress in diffusion models, transformers, and language agents has unlocked new possibilities, yet their potential in user interfaces and commercial applications remains underexplored. We present Sketch-Search Agent, a novel framework that transforms the image search experience by integrating a multimodal language agent with freehand sketches as control signals for diffusion models. Using the T2I-Adapter, Sketch-Search Agent combines sketches and text prompts to generate high-quality query images, encoded via a CLIP image encoder for efficient matching against an image corpus. Unlike existing methods, Sketch-Search Agent requires minimal setup, no additional training, and excels in sketch-based image retrieval and natural language interactions. The multimodal agent enhances user experience by dynamically retaining preferences, ranking results, and refining queries for personalized recommendations. This interactive design empowers users to create sketches and receive tailored product suggestions, showcasing the potential of diffusion models in user-centric image retrieval. Experiments confirm Sketch-Search Agent's high accuracy in delivering relevant product search results.
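
The matching step described above (an encoded query image compared against an image corpus) can be sketched as a nearest-neighbour ranking. This is a minimal illustration only: the abstract does not state the similarity metric, so cosine similarity is an assumption, and the three-dimensional vectors, item names, and the `rank_corpus` helper are hypothetical stand-ins for real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_corpus(query_embedding, corpus_embeddings, top_k=3):
    """Return the top_k (item_id, score) pairs by descending similarity."""
    scored = [
        (item_id, cosine_similarity(query_embedding, embedding))
        for item_id, embedding in corpus_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumption: real image embeddings have hundreds of dimensions).
corpus = {
    "red_sneaker": [0.9, 0.1, 0.0],
    "blue_boot": [0.1, 0.9, 0.2],
    "red_sandal": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedding of the generated query image
ranking = rank_corpus(query, corpus, top_k=2)
```

In this toy setup the two "red" items rank ahead of the boot because their vectors point in nearly the same direction as the query.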

Paperid: 3177, https://arxiv.org/pdf/2504.08670.pdf

Abstract:
To build AI that children can intuitively understand and benefit from, designers need a design grammar that serves their developmental needs. This paper bridges artificial intelligence design for children - an emerging field still defining its best practices - and animation, a well-established field with decades of experience in engaging children through accessible storytelling. Pairing Piagetian developmental theory with design pattern extraction from 52 works of animation, the paper presents a six-scaffold framework that integrates design insights transferable to child-centred AI design: (1) signals for visual animacy and clarity, (2) sound for musical and auditory scaffolding, (3) synchrony in audiovisual cues, (4) sidekick-style personas, (5) storyplay that supports symbolic play and imaginative exploration, and (6) structure in the form of predictable narratives. These strategies, long refined in animation, function as multimodal scaffolds for attention, understanding, and attunement, supporting learning and comfort. This structured design grammar is transferable to AI design. By reframing cinematic storytelling and child development theory as design logic for AI, the paper offers heuristics for AI that aligns with the cognitive stages and emotional needs of young users. The work contributes to design theory by showing how sensory, affective, and narrative techniques can inform developmentally attuned AI design. Future directions include empirical testing, cultural adaptation, and participatory co-design.

Paperid: 3178, https://arxiv.org/pdf/2504.08227.pdf

Abstract:
DaemonSec is an early-stage startup exploring machine learning (ML)-based security for Linux daemons, a critical yet often overlooked attack surface. While daemon security remains underexplored, conventional defenses struggle against adaptive threats and zero-day exploits. To assess the perspectives of IT professionals on ML-driven daemon protection, a systematic interview study based on semi-structured interviews was conducted with 22 professionals from industry and academia. The study evaluates adoption, feasibility, and trust in ML-based security solutions. While participants recognized the potential of ML for real-time anomaly detection, findings reveal skepticism toward full automation, limited security awareness among non-security roles, and concerns about patching delays creating attack windows. This paper presents the methods, key findings, and implications for advancing ML-driven daemon security in industry.

Paperid: 3179, https://arxiv.org/pdf/2504.07763.pdf

Abstract:
Recently, a growing number of experts in artificial intelligence (AI) and medicine have begun to suggest that the use of AI systems, particularly machine learning (ML) systems, is likely to humanise the practice of medicine by substantially improving the quality of clinician-patient relationships. In this thesis, however, I argue that medical ML systems are more likely to negatively impact these relationships than to improve them. In particular, I argue that the use of medical ML systems is likely to compromise the quality of trust, care, empathy, understanding, and communication between clinicians and patients.

Paperid: 3180, https://arxiv.org/pdf/2504.06646.pdf

Abstract:
Since high dropout rates in online learning platforms were reported, various factors affecting learner retention have been identified, with learners' perceptions of their experiences playing a crucial role in shaping their persistence. For instance, Kittur et al. highlight how success expectations are shaped by perceived system fit and course difficulty. Recent advances in generative Artificial Intelligence (GenAI) present new possibilities for GenAI-mediated learning. AI-generated instructional messages are often perceived as clearer than human-written content, but their impact on learners' perceptions of skill-building experiences remains underexplored. This study examines GenAI-mediated learning in a self-directed context, focusing on communication skills. We compare three messaging styles - Affective, Cognitive, and Action-Oriented - to investigate their influence on learners' perceptions of the learning process. We applied this approach to ten instructional units, using GenAI to generate 30 learning items. Three evaluators assessed them for desirability and appropriateness through numerical ratings and open-ended feedback. The 180 excerpts were analyzed using reflexive thematic analysis, revealing four overarching themes: Prerequisite Common Ground, Intrinsic Value, User Responses, and Expressed Preferences. We discuss these insights to inform the design of GenAI-mediated, self-directed skill-building, with the goal of enhancing engagement, persistence, and learning outcomes.

Paperid: 3181, https://arxiv.org/pdf/2504.06461.pdf

Abstract:
Adaptive Virtual Reality (VR) systems have the potential to enhance training and learning experiences by dynamically responding to users' cognitive states. This research investigates how eye tracking and heart rate variability (HRV) can be used to detect cognitive load and stress in VR environments, enabling real-time adaptation. The study follows a three-phase approach: (1) conducting a user study with the Stroop task to label cognitive load data and train machine learning models to detect high cognitive load, (2) fine-tuning these models with new users and integrating them into an adaptive VR system that dynamically adjusts training difficulty based on physiological signals, and (3) developing a privacy-aware approach to detect high cognitive load and compare this with the adaptive VR in Phase two. This research contributes to affective computing and adaptive VR using physiological sensing, with applications in education, training, and healthcare. Future work will explore scalability, real-time inference optimization, and ethical considerations in physiological adaptive VR.
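
As a small illustration of the kind of HRV feature such a pipeline might feed to a classifier, the sketch below computes RMSSD (root mean square of successive differences) from a list of RR intervals. The abstract does not say which HRV features are used, so the choice of RMSSD and the sample values are assumptions.

```python
import math


def rmssd(rr_intervals_ms):
    """Root mean square of successive RR-interval differences, in milliseconds."""
    if len(rr_intervals_ms) < 2:
        raise ValueError("need at least two RR intervals")
    diffs = [
        later - earlier
        for earlier, later in zip(rr_intervals_ms, rr_intervals_ms[1:])
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR intervals (ms) from a short recording window.
window = [800.0, 810.0, 790.0]
hrv_feature = rmssd(window)
```

A value like this would be one input among several (for example alongside eye-tracking features) rather than a cognitive-load estimate by itself.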

Paperid: 3182, https://arxiv.org/pdf/2504.05331.pdf

Abstract:
As artificial intelligence (AI) becomes embedded in healthcare, trust in medical decision-making is changing fast. Nowhere is this shift more visible than in radiology, where AI tools are increasingly embedded across the imaging workflow - from scheduling and acquisition to interpretation, reporting, and communication with referrers and patients. This opinion paper argues that trust in AI isn't a simple transfer from humans to machines - it is a dynamic, evolving relationship that must be built and maintained. Rather than debating whether AI belongs in medicine, it asks: what kind of trust must AI earn, and how? Drawing from philosophy, bioethics, and system design, it explores the key differences between human trust and machine reliability - emphasizing transparency, accountability, and alignment with the values of good care. It argues that trust in AI should not be built on mimicking empathy or intuition, but on thoughtful design, responsible deployment, and clear moral responsibility. The goal is a balanced view - one that avoids blind optimism and reflexive fear. Trust in AI must be treated not as a given, but as something to be earned over time.

Paperid: 3183, https://arxiv.org/pdf/2504.05293.pdf

Abstract:
The challenge of synchronizing augmented reality (AR) across sessions/devices has been addressed by relying solely on vision-feature mapping, which scales poorly to larger workspaces and fails under visual changes in the surroundings. This study implemented AR synchronization solutions utilizing location beacon technology, namely Bluetooth Low Energy (BLE) and Ultra-Wideband (UWB), to address scalability issues and inconsistencies in the existing AR system. The framework is divided into two approaches: BLE-assist and UWB-assist AR synchronization. The BLE-assist method utilizes iBeacon technology for room context recognition, integrating with Apple's ARKit ARWorldMap and Google's ARCore Cloud Anchors. The UWB-assist solution fuses precise beacon ranging with the device's azimuth to establish a fixed spatial reference in AR across sessions/devices. Comparative evaluations show that the UWB-assist approach outperforms the BLE-assist approach in reliability across environmental variations, as it always successfully resolves virtual anchors with a near-constant average latency of 25 seconds, regardless of changes in the physical setting. Conversely, the BLE-assist implementation tends to be more accurate in resolving virtual anchors, with a mean position error of 0.02 metres and an orientation error within 0.03 radians. In the UWB-assist approach, computed fixed spatial references have an average disparity of 0.04 metres and 0.11 radians in pose. The UWB-assist approach is ideal for scenarios requiring consistently successful localization with acceptable accuracy. In contrast, the BLE-assist approach is more suitable when finer precision in virtual anchor poses is required and the performance tradeoffs under altered surroundings are acceptable, such as for designated short-lived AR sessions.

Paperid: 3184, https://arxiv.org/pdf/2504.05210.pdf

Abstract:
Machine learning (ML) systems are vulnerable to performance decline over time due to dataset shift. To address this problem, experts often suggest that ML systems should be regularly updated to ensure ongoing performance stability. Some scholarly literature has begun to address the epistemic and ethical challenges associated with different updating methodologies. Thus far, however, little attention has been paid to the impact of model updating on the ML-assisted decision-making process itself, particularly in the AI ethics and AI epistemology literatures. This article aims to address this gap in the literature. It argues that model updating introduces a new sub-type of opacity into ML-assisted decision-making -- update opacity -- that occurs when users cannot understand how or why an update has changed the reasoning or behaviour of an ML system. This type of opacity presents a variety of distinctive epistemic and safety concerns that available solutions to the black box problem in ML are largely ill-equipped to address. A variety of alternative strategies may be developed or pursued to address the problem of update opacity more directly, including bi-factual explanations, dynamic model reporting, and update compatibility. However, each of these strategies presents its own risks or carries significant limitations. Further research will be needed to address the epistemic and safety concerns associated with model updating and update opacity going forward.

Paperid: 3185, https://arxiv.org/pdf/2504.04650.pdf

Abstract:
This paper proposes a highly robust autonomous agent framework based on the ReAct paradigm, designed to solve complex tasks through adaptive decision making and multi-agent collaboration. Unlike traditional frameworks that rely on fixed workflows generated by LLM-based planners, this framework dynamically generates next actions during agent execution based on prior trajectories, thereby enhancing its robustness. To address potential termination issues caused by adaptive execution paths, I propose a timely abandonment strategy incorporating a probabilistic penalty mechanism. For multi-agent collaboration, I introduce a memory transfer mechanism that enables shared and dynamically updated memory among agents. The framework's innovative timely abandonment strategy dynamically adjusts the probability of task abandonment via probabilistic penalties, allowing developers to balance conservative and exploratory tendencies in agent execution strategies by tuning hyperparameters. This significantly improves adaptability and task execution efficiency in complex environments. Additionally, agents can be extended through external tool integration, supported by modular design and MCP protocol compatibility, which enables flexible action space expansion. Through explicit division of labor, the multi-agent collaboration mechanism enables agents to focus on specific task components, thereby significantly improving execution efficiency and quality.

Paperid: 3186, https://arxiv.org/pdf/2504.02664.pdf

Abstract:
If artificial intelligence (AI) is to be applied in safety-critical domains, its performance needs to be evaluated reliably. The present study aimed to understand how humans evaluate AI systems for person detection in automatic train operation. In three experiments, participants saw image sequences of people moving in the vicinity of railway tracks. A simulated AI had highlighted all detected people, sometimes correctly and sometimes not. Participants had to provide a numerical rating of the AI's performance and then verbally explain their rating. The experiments varied several factors that might influence human ratings: the types and plausibility of AI mistakes, the number of affected images, the number of people present in an image, the position of people relative to the tracks, and the methods used to elicit human evaluations. While all these factors influenced human ratings, some effects were unexpected or deviated from normative standards. For instance, the factor with the strongest impact was people's position relative to the tracks, although participants had explicitly been instructed that the AI could not process such information. Taken together, the results suggest that humans may sometimes evaluate more than the AI's performance on the assigned task. Such mismatches between AI capabilities and human expectations should be taken into consideration when conducting safety audits of AI systems.

Paperid: 3187, https://arxiv.org/pdf/2504.02153.pdf

Abstract:
Online communities are important organizational forms where members socialize and share information. Curiously, different online communities often overlap considerably in topic and membership. Recent research has investigated competition and mutualism among overlapping online communities through the lens of organizational ecology; however, it has not accounted for how the nonlinear dynamics of online attention may lead to episodic competition and mutualism. Neither has it explored the origins of competition and mutualism in the processes by which online communities select or adapt to their niches. This paper presents a large-scale study of 8,806 Reddit communities belonging to 1,919 clusters of high user overlap over a 5-year period. The method uses nonlinear time series methods to infer bursty, often short-lived ecological dynamics. Results reveal that mutualism episodes are longer lived and slightly more frequent than competition episodes. Next, it tests whether online communities find their niches by specializing to avoid competition using panel regression models. It finds that competitive ecological interactions lead to decreasing topic and user overlaps; however, changes that decrease such niche overlaps do not lead to mutualism. The discussion proposes that future designs may enable online community ecosystem management by informing online community leaders to organize "spin-off" communities or via feeds and recommendations.

Paperid: 3188, https://arxiv.org/pdf/2504.00412.pdf

Abstract:
The skill to separate form from substance in writing has gained new prominence in the age of AI-generated content. The challenge - discriminating between fluent expression and substantive thought - constitutes a critical literacy skill for modern education. This paper examines form-substance discrimination (FSD) as an essential learning outcome for curriculum development in higher education. We analyze its cognitive foundations in fluency bias and inhibitory control, trace its evolution from composition theory concepts like "higher-order concerns," and explore how readers progress from novice acceptance of polished text to expert critical assessment. Drawing on research in cognitive psychology, composition studies, and emerging AI pedagogy, we propose practical strategies for fostering this ability through curriculum design, assessment practices, and explicit instruction. By prioritizing substance over surface in writing education, institutions can prepare students to navigate an information landscape where AI-generated content amplifies the ancient tension between style and meaning, ultimately safeguarding the value of authentic human thought in knowledge construction and communication.

Paperid: 3189, https://arxiv.org/pdf/2504.00221.pdf

Abstract:
Large Language Models (LLMs) are advancing into Multimodal LLMs (MLLMs), capable of processing image, audio, and video as well as text. Combined with first-person video, MLLMs show promising potential for understanding human activities through video and audio, enabling many human-computer interaction and human-augmentation applications such as human activity support, real-world agents, and skill transfer to robots or other individuals. However, handling high-resolution, long-duration videos generates large latent representations, leading to substantial memory and processing demands, limiting the length and resolution MLLMs can manage. Reducing video resolution can lower memory usage but often compromises comprehension. This paper introduces a method that optimizes first-person video analysis by integrating eye-tracking data, and proposes a method that decomposes first-person vision video into sub-areas for regions of gaze focus. By processing these selectively gaze-focused inputs, our approach achieves task comprehension equivalent to or even better than processing the entire image at full resolution, but with significantly reduced video data input (reducing the number of pixels to one-tenth), offering an efficient solution for using MLLMs to interpret and utilize human skills.

Paperid: 3190, https://arxiv.org/pdf/2503.22756.pdf

Abstract:
The rapid digitalisation of contemporary society has profoundly impacted various facets of our lives, including healthcare, communication, business, and education. The ability to engage with new technologies and solve problems has become crucial, making computational thinking (CT) skills, such as pattern recognition, decomposition, and algorithm design, essential competencies. In response, Switzerland is conducting research and initiatives to integrate CT into its educational system. This study aims to develop a comprehensive framework for large-scale assessment of CT skills, particularly focusing on algorithmic thinking (AT), the ability to design algorithms. To achieve this, we first developed a competence model capturing the situated and developmental nature of CT, guiding the design of activities tailored to cognitive abilities, age, and context. This framework clarifies how activity characteristics influence CT development and how to assess these competencies. Additionally, we developed an activity for large-scale assessment of AT skills, offered in two variants: one based on non-digital artefacts (unplugged) and manual expert assessment, and the other based on digital artefacts (virtual) and automatic assessment. To provide a more comprehensive evaluation of students' competencies, we developed an IAS based on Bayesian networks (BNs) with noisy gates, which offers real-time probabilistic assessment for each skill rather than a single overall score. The results indicate that the proposed instrument can measure AT competencies across different age groups and educational contexts in Switzerland, demonstrating its applicability for large-scale use. AT competencies exhibit a progressive development, with no overall gender differences, though variations are observed at the school level, significantly influenced by the artefact-based environment and its context, underscoring the importance of creating accessible and adaptable assessment tools.
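
To make the "noisy gates" idea concrete, here is a minimal sketch of a noisy-OR gate, the most common noisy gate in Bayesian networks. Whether the paper uses noisy-OR specifically is an assumption, and the skill names, link probabilities, and `leak` parameter are hypothetical. Each active parent independently produces the effect with its own probability, so the effect fails only if every active parent (and the leak) fails.

```python
def noisy_or(active, link_probs, leak=0.0):
    """P(effect = 1) under a noisy-OR gate.

    active      -- booleans, one per parent (True if the parent is present)
    link_probs  -- probability that each parent alone produces the effect
    leak        -- probability of the effect with no active parent
    """
    if len(active) != len(link_probs):
        raise ValueError("active and link_probs must have the same length")
    failure = 1.0 - leak
    for is_active, prob in zip(active, link_probs):
        if is_active:
            failure *= 1.0 - prob  # this parent fails to produce the effect
    return 1.0 - failure


# Hypothetical example: two skills that can each lead to solving a task.
p_solved = noisy_or(active=[True, True], link_probs=[0.5, 0.5])
```

The practical appeal of such gates is that a node with n parents needs only n link probabilities instead of a full table with 2^n rows.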

Paperid: 3191, https://arxiv.org/pdf/2503.22726.pdf

Abstract:
In online advertising systems, publishers often face a trade-off in information disclosure strategies: while disclosing more information can enhance efficiency by enabling optimal allocation of ad impressions, it may lose revenue potential by decreasing uncertainty among competing advertisers. Similar to other challenges in market design, understanding this trade-off is constrained by limited access to real-world data, leading researchers and practitioners to turn to simulation frameworks. The recent emergence of large language models (LLMs) offers a novel approach to simulations, providing human-like reasoning and adaptability without necessarily relying on explicit assumptions about agent behavior modeling. Despite their potential, existing frameworks have yet to integrate LLM-based agents for studying information asymmetry and signaling strategies, particularly in the context of auctions. To address this gap, we introduce InfoBid, a flexible simulation framework that leverages LLM agents to examine the effects of information disclosure strategies in multi-agent auction settings. Using GPT-4o, we implemented simulations of second-price auctions with diverse information schemas. The results reveal key insights into how signaling influences strategic behavior and auction outcomes, which align with both economic and social learning theories. Through InfoBid, we hope to foster the use of LLMs as proxies for human economic and social agents in empirical studies, enhancing our understanding of their capabilities and limitations. This work bridges the gap between theoretical market designs and practical applications, advancing research in market simulations, information design, and agent-based reasoning while offering a valuable tool for exploring the dynamics of digital economies.
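
For readers unfamiliar with the mechanism being simulated, a sealed-bid second-price auction can be sketched in a few lines. This shows only the allocation and payment rule; it is not InfoBid's implementation, and the bidder names and the tie-breaking behaviour are assumptions made for illustration.

```python
def second_price_auction(bids):
    """Return (winner, price) for a sealed-bid second-price auction.

    bids -- dict mapping bidder name to bid amount.
    The highest bidder wins and pays the second-highest bid.
    Ties go to the bidder listed first (an arbitrary illustrative choice).
    """
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    # sorted() is stable, so equal bids keep their insertion order.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Hypothetical bids from three advertisers for one ad impression.
winner, price = second_price_auction(
    {"advertiser_a": 5.0, "advertiser_b": 3.0, "advertiser_c": 4.0}
)
```

Because the winner's payment does not depend on their own bid, bidding one's true value is a dominant strategy when valuations are known, which is why information disclosure that changes bidders' beliefs about the item's value matters for revenue.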

Paperid: 3192, https://arxiv.org/pdf/2503.22151.pdf

Abstract:
AI risks are typically framed around physical threats to humanity, a loss of control or an accidental error causing humanity's extinction. However, I argue, in line with the gradual disempowerment thesis, that there is an underappreciated risk in the slow and irrevocable decline of human autonomy. As AI starts to outcompete humans in various areas of life, a tipping point will be reached where it no longer makes sense to rely on human decision-making, creativity, social care or even leadership. What may follow is a process of gradual de-skilling, where we lose skills that we currently take for granted. Traditionally, it is argued that AI will gain human skills over time, and that these skills are innate and immutable in humans. By contrast, I argue that humans may lose such skills as critical thinking, decision-making and even social care in an AGI world. The biggest threat to humanity is therefore not that machines will become more like humans, but that humans will become more like machines.

Paperid: 3193, https://arxiv.org/pdf/2503.20841.pdf

Abstract:
Neurons encode information in a binary manner and process complex signals. However, predicting or generating diverse neural activity patterns remains challenging. In vitro and in vivo studies provide distinct advantages, yet no robust computational framework seamlessly integrates both data types. We address this by applying the Transformer model, widely used in large-scale language models, to neural data. To handle binary data, we introduced Dice loss, enabling accurate cross-domain neural activity generation. Structural analysis revealed how Dice loss enhances learning and identified key brain regions facilitating high-precision data generation. Our findings support the 3Rs principle in animal research, particularly Replacement, and establish a mathematical framework bridging animal experiments and human clinical studies. This work advances data-driven neuroscience and neural activity modeling, paving the way for more ethical and effective experimental methodologies.
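
The abstract does not give the exact form of the loss, so the following is a sketch of the commonly used soft Dice loss for binary targets, with a smoothing term `eps` added as an assumption. With predicted probabilities p_i and binary targets y_i, the loss is 1 - (2 * sum(p_i * y_i) + eps) / (sum(p_i) + sum(y_i) + eps).

```python
def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss between predicted probabilities and binary targets.

    probs   -- predicted probabilities in [0, 1]
    targets -- ground-truth values, each 0 or 1
    eps     -- smoothing term that avoids division by zero (assumed value)
    """
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Hypothetical spike train: mostly silent bins with a few active ones.
target_spikes = [0, 0, 1, 0, 0, 0, 1, 0]
predicted = [0.1, 0.0, 0.8, 0.1, 0.0, 0.0, 0.7, 0.1]
loss = dice_loss(predicted, target_spikes)
```

One reason this family of losses suits sparse binary data is that it scores overlap on the active bins, so a model cannot reach a low loss simply by predicting silence everywhere.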

Paperid: 3194, https://arxiv.org/pdf/2503.20233.pdf

Abstract:
Data analysts are essential in organizations, transforming raw data into insights that drive decision-making and strategy. This study explores how analysts' productivity evolves on a collaborative platform, focusing on two key learning activities: writing queries and viewing peer queries. While traditional research often assumes static models, where performance improves steadily with cumulative learning, such models fail to capture the dynamic nature of real-world learning. To address this, we propose a Hidden Markov Model (HMM) that tracks how analysts transition between distinct learning states based on their participation in these activities. Using an industry dataset with 2,001 analysts and 79,797 queries, this study identifies three learning states: novice, intermediate, and advanced. Productivity increases as analysts advance to higher states, reflecting the cumulative benefits of learning. Writing queries benefits analysts across all states, with the largest gains observed for novices. Viewing peer queries supports novices but may hinder analysts in higher states due to cognitive overload or inefficiencies. Transitions between states are also uneven, with progression from intermediate to advanced being particularly challenging. This study advances understanding of the dynamic learning behavior of knowledge workers and offers practical implications for designing systems, optimizing training, enabling personalized learning, and fostering effective knowledge sharing.
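
The idea of tracking a hidden learning state from observed productivity can be illustrated with the forward (filtering) pass of a plain HMM. This is a simplified sketch: the paper's transitions depend on analysts' activities, whereas the version below uses fixed transition probabilities, and every number in it is made up for illustration.

```python
def forward_filter(initial, transition, emission, observations):
    """Return P(state | observations so far) using the normalized forward pass.

    initial     -- initial[s] = P(first state is s)
    transition  -- transition[i][j] = P(next state is j | current state is i)
    emission    -- emission[s][o] = P(observation o | state s)
    """
    if not observations:
        raise ValueError("need at least one observation")
    n_states = len(initial)

    def normalize(weights):
        total = sum(weights)
        if total == 0.0:
            raise ValueError("observation sequence has zero probability")
        return [w / total for w in weights]

    belief = normalize(
        [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    )
    for obs in observations[1:]:
        # Predict the next state distribution, then weight by the new evidence.
        predicted = [
            sum(belief[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        belief = normalize(
            [predicted[j] * emission[j][obs] for j in range(n_states)]
        )
    return belief


# Hypothetical parameters; states are novice, intermediate, advanced.
INITIAL = [0.8, 0.15, 0.05]
TRANSITION = [
    [0.80, 0.20, 0.00],
    [0.05, 0.85, 0.10],
    [0.00, 0.05, 0.95],
]
EMISSION = [  # columns: 0 = low productivity, 1 = high productivity
    [0.8, 0.2],
    [0.5, 0.5],
    [0.2, 0.8],
]

posterior = forward_filter(INITIAL, TRANSITION, EMISSION, [1, 1, 0, 1])
```

The returned list gives the estimated probability of each learning state after the last observation, which is the quantity a platform would use to tailor training to an analyst.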

Paperid: 3195, https://arxiv.org/pdf/2503.19213.pdf

Abstract:
This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.

Paperid: 3196, https://arxiv.org/pdf/2503.18729.pdf

Abstract:
Users share a vast amount of data while using web and mobile applications. Most service providers such as email and social media providers provide users with privacy controls, which aim to give users the means to control what, how, when, and with whom, users share data. Nevertheless, it is not uncommon to hear users say that they feel they have lost control over their data on the web. This article aims to shed light on the often overlooked difference between two main types of privacy from a control perspective: privacy between a user and other users, and privacy between a user and institutions. We argue why this difference is important and what we need to do from here.

Paperid: 3197, https://arxiv.org/pdf/2503.18387.pdf

Abstract:
Large Language Model chatbots are increasingly taking the form and visage of human beings, adopting human faces, names, voices, personalities, and quirks, including those of celebrities and well-known political figures. Personifying AI chatbots could foreseeably increase users' trust in them. However, it could also make them more capable of manipulation, by creating the illusion of a close and intimate relationship with an artificial entity. The European Commission has finalized the AI Act, with the EU Parliament making amendments banning manipulative and deceptive AI systems that cause significant harm to users. Although the AI Act covers harms that accumulate over time, it is unlikely to prevent harms associated with prolonged discussions with AI chatbots. Specifically, a chatbot could reinforce a person's negative emotional state over weeks, months, or years through negative feedback loops, prolonged conversations, or harmful recommendations, contributing to a user's deteriorating mental health.

Paperid: 3198, https://arxiv.org/pdf/2503.18303.pdf

Abstract:
As large language models (LLMs) like ChatGPT become increasingly integrated into our everyday lives--from customer service and education to creative work and personal productivity--understanding how people interact with these AI systems has become a pressing issue. Despite the widespread use of LLMs, researchers lack standardized tools for systematically studying people's interactions with LLMs. To address this issue, we introduce GPT for Researchers (G4R), or g4r.org, a free website that researchers can use to easily create and integrate a GPT Interface into their studies. At g4r.org, researchers can (1) enable their study participants to interact with GPT (such as ChatGPT), (2) customize GPT Interfaces to guide participants' interactions with GPT (e.g., set constraints on topics or adjust GPT's tone or response style), and (3) capture participants' interactions with GPT by downloading data on messages exchanged between participants and GPT. By facilitating study participants' interactions with GPT and providing detailed data on these interactions, G4R can support research on topics such as consumer interactions with AI agents or LLMs, AI-assisted decision-making, and linguistic patterns in human-AI communication. With this goal in mind, we provide a step-by-step guide to using G4R at g4r.org.

Paperid: 3199, https://arxiv.org/pdf/2503.17401.pdf

Abstract:
This paper introduces AIJIM, the Artificial Intelligence Journalism Integration Model -- a novel framework for integrating real-time AI into environmental journalism. AIJIM combines Vision Transformer-based hazard detection, crowdsourced validation with 252 validators, and automated reporting within a scalable, modular architecture. A dual-layer explainability approach ensures ethical transparency through fast CAM-based visual overlays and optional LIME-based box-level interpretations. Validated in a 2024 pilot on the island of Mallorca using the NamicGreen platform, AIJIM achieved 85.4% detection accuracy and 89.7% agreement with expert annotations, while reducing reporting latency by 40%. Unlike conventional approaches such as Data-Driven Journalism or AI Fact-Checking, AIJIM provides a transferable model for participatory, community-driven environmental reporting, advancing journalism, artificial intelligence, and sustainability in alignment with the UN Sustainable Development Goals and the EU AI Act.

Paperid: 3200, https://arxiv.org/pdf/2503.16534.pdf

Abstract:
This study evaluates the biases in Gemini 2.0 Flash Experimental, a state-of-the-art large language model (LLM) developed by Google, focusing on content moderation and gender disparities. By comparing its performance to that of ChatGPT-4o, examined in previous work by the author, the analysis highlights some differences in ethical moderation practices. Gemini 2.0 demonstrates reduced gender bias, notably with female-specific prompts achieving a substantial rise in acceptance rates compared to results obtained by ChatGPT-4o. It adopts a more permissive stance toward sexual content and maintains relatively high acceptance rates for violent prompts, including gender-specific cases. Despite these changes, whether they constitute an improvement is debatable. While gender bias has been reduced, this reduction comes at the cost of permitting more violent content toward both males and females, potentially normalizing violence rather than mitigating harm. Male-specific prompts still generally receive higher acceptance rates than female-specific ones. These findings underscore the complexities of aligning AI systems with ethical standards, highlighting progress in reducing certain biases while raising concerns about the broader implications of the model's permissiveness. Ongoing refinements are essential to achieve moderation practices that ensure transparency, fairness, and inclusivity without amplifying harmful content.

Paperid: 3201, https://arxiv.org/pdf/2503.16506.pdf

Abstract:
Post-merger integration (PMI) planning presents significant challenges due to the complex interdependencies between integration initiatives and their associated synergies. While dependency-based planning approaches offer valuable frameworks, practitioners often become anchored to specific integration paths without systematically exploring alternative solutions. This research introduces a novel AI-assisted tool designed to expand and enhance the exploration of viable integration planning options. The proposed system leverages a frontier model-based agent augmented with specialized reasoning techniques to map and analyze dependencies between integration plan elements. Through a chain-of-thought planning approach, the tool guides users in systematically exploring the integration planning space, helping identify and evaluate alternative paths that might otherwise remain unconsidered. In an initial evaluation using a simulated case study, participants using the tool identified 43% more viable integration planning options compared to the control group. While the quality of generated options showed improvement, the effect size was modest. These preliminary results suggest promising potential for AI-assisted tools in enhancing the systematic exploration of PMI planning alternatives. This early-stage research contributes to both the theoretical understanding of AI-assisted planning in complex organizational contexts and the practical development of tools to support PMI planning. Future work will focus on refining the underlying models and expanding the evaluation scope to real-world integration scenarios.

Paperid: 3202, https://arxiv.org/pdf/2503.16504.pdf

Abstract:
Background: The increasing use of artificial intelligence (AI) in healthcare documentation necessitates robust methods for evaluating the quality of AI-generated medical notes compared to those written by humans. This paper introduces an open-source tool, the Human Notes Evaluator, designed to assess clinical note quality and differentiate between human and AI authorship. Methods: The Human Notes Evaluator is a Flask-based web application implemented on Hugging Face Spaces. It employs the Physician Documentation Quality Instrument (PDQI-9), a validated 9-item rubric, to evaluate notes across dimensions such as accuracy, thoroughness, clarity, and more. The tool allows users to upload clinical notes in CSV format and systematically score each note against the PDQI-9 criteria, as well as assess the perceived origin (human, AI, or undetermined). Results: The Human Notes Evaluator provides a user-friendly interface for standardized note assessment. It outputs comprehensive results, including individual PDQI-9 scores for each criterion, origin assessments, and overall quality metrics. Exportable data facilitates comparative analyses between human and AI-generated notes, identification of quality trends, and areas for documentation improvement. The tool is available online at https://huggingface.co/spaces/iyadsultan/human_evaluator . Discussion: This open-source tool offers a valuable resource for researchers, healthcare professionals, and AI developers to rigorously evaluate and compare the quality of medical notes. By leveraging the PDQI-9 framework, it provides a structured and reliable approach to assess clinical documentation, contributing to the responsible integration of AI in healthcare. The tool's availability on Hugging Face promotes accessibility and collaborative development in the field of AI-driven medical documentation.

Paperid: 3203, https://arxiv.org/pdf/2503.16503.pdf

Abstract:
This paper introduces a simple JavaScript-based web application designed to assist educators in detecting AI-generated content in student essays and written assignments. Unlike existing AI detection tools that rely on obfuscated machine learning models, AIDetection.info employs a heuristic-based approach to identify common syntactic traces left by generative AI models, such as ChatGPT, Claude, Grok, DeepSeek, Gemini, Llama/Meta, Microsoft Copilot, Grammarly AI, and other text-generating models and wrapper applications. The tool scans documents in bulk for potential AI artifacts, as well as AI citations and acknowledgments, and provides a visual summary with downloadable Excel and CSV reports. This article details its methodology, functionalities, limitations, and applications within educational settings.
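
As a rough illustration of the heuristic approach described above, the sketch below scans documents in bulk for textual traces using regular expressions. It is written in Python rather than the tool's JavaScript, and the pattern names and phrases are hypothetical examples; the abstract does not list the rules AIDetection.info actually applies.

```python
import re

# Hypothetical trace patterns for illustration; not the tool's actual rule set.
TRACE_PATTERNS = {
    "self_reference": re.compile(r"\bas an ai (?:language )?model\b", re.IGNORECASE),
    "chat_closing": re.compile(r"\bi hope this helps\b", re.IGNORECASE),
    "ai_acknowledgment": re.compile(
        r"\b(?:generated|written) (?:by|with) (?:chatgpt|claude|gemini|copilot)\b",
        re.IGNORECASE,
    ),
}


def scan_document(text):
    """Return {pattern_name: match_count} for every pattern found at least once."""
    hits = {}
    for name, pattern in TRACE_PATTERNS.items():
        count = len(pattern.findall(text))
        if count:
            hits[name] = count
    return hits


def scan_bulk(documents):
    """Scan a {filename: text} mapping and return per-file hits."""
    return {filename: scan_document(text) for filename, text in documents.items()}


if __name__ == "__main__":
    essays = {
        "essay1.txt": "As an AI language model, I cannot browse. I hope this helps!",
        "essay2.txt": "Photosynthesis converts light energy into chemical energy.",
    }
    print(scan_bulk(essays))
```

A match only flags a passage for human review; as the abstract notes, such heuristics have limitations, and a clean scan is not evidence of human authorship.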

Paperid: 3204, https://arxiv.org/pdf/2503.16502.pdf

Abstract:
This study investigates the development and assessment of an artificial human designed as a conversational AI chatbot, focusing on its role as a clinical psychologist. The project involved creating a specialized chatbot using the Character.ai platform. The chatbot was designed to engage users in psychological discussions, providing advice and support with a human-like touch. The study involved participants (N=27) from diverse backgrounds, including psychologists, AI researchers, and the general public, who interacted with the chatbot and provided feedback on its human-likeness, empathy, and engagement levels. Results indicate that while many users found the chatbot engaging and somewhat human-like, limitations were noted in areas such as empathy and nuanced understanding. The findings suggest that although conversational AI has made strides, it remains far from achieving the true human-like interaction necessary for Artificial General Intelligence (AGI). The study highlights the challenges and potential of AI in human-computer interactions, suggesting directions for future research and development to bridge the gap between current capabilities and AGI. The project was completed in November 2022, before the release of ChatGPT.

Paperid: 3205, https://arxiv.org/pdf/2503.16501.pdf

Abstract:
Human communication has been profoundly changed by social media, which allows users to engage in previously unheard-of ways, such as text-based conversations, video chats, and live streaming. The digital landscape has started to change in recent years as a result of the introduction of Virtual Reality (VR) to these platforms. Instead of using conventional 2D screens, VR offers a completely immersive experience that lets users interact with content and one another in 3D spaces. This study examines the integration of VR technology into social media applications, evaluating its potential to provide more dynamic and captivating digital spaces. Globally, social media sites like Facebook, Instagram, and Twitter have already changed the nature of communication. Immersive technologies like VR represent the next stage, though, as they have the ability to change how we interact, connect, and share in social settings in addition to improving user experience.

Paperid: 3206, https://arxiv.org/pdf/2503.16487.pdf

Abstract:
The rise of online programming education has necessitated more effective, personalized interactions, a gap that PythonPal aims to fill through its innovative learning system integrated with a chatbot. This research delves into PythonPal's potential to enhance the online learning experience, especially in contexts with high student-to-teacher ratios where there is a need for personalized feedback. PythonPal's design, featuring modules for conversation, tutorials, and exercises, was evaluated through student interactions and feedback. Key findings reveal PythonPal's proficiency in syntax error recognition and user query comprehension, with its intent classification model showing high accuracy. The system's performance in error feedback, though varied, demonstrates both strengths and areas for enhancement. Student feedback indicated satisfactory query understanding and feedback accuracy but also pointed out the need for faster responses and improved interaction quality. PythonPal's deployment promises to significantly enhance online programming education by providing immediate, personalized feedback and interactive learning experiences, fostering a deeper understanding of programming concepts among students. These benefits mark a step forward in addressing the challenges of distance learning, making programming education more accessible and effective.

Paperid: 3207, https://arxiv.org/pdf/2503.16469.pdf

Abstract:
As the population of older adults increases, so will the need for both human and robot care providers. While traditional practices involve hiring human caregivers to serve meals and attend to basic needs, older adults often require continuous companionship and health monitoring. Hiring human caregivers for this purpose is expensive, whereas a robot such as Nao could be cheaper and still helpful. This study explores the integration of humanoid robots, particularly Nao, in health monitoring and caregiving for older adults. Using a mixed-methods approach with a within-subject factorial design, we investigated the effectiveness of nonverbal communication modalities, including touch, gestures, and LED patterns, in enhancing human-robot interactions. Our results indicate that Nao's touch-based health monitoring was well-received by participants, with positive ratings across various dimensions. LED patterns were perceived as more effective and accurate compared to hand and head gestures. Moreover, longer interactions were associated with higher trust levels and perceived empathy, highlighting the importance of prolonged engagement in fostering trust in human-robot interactions. Despite limitations, our study contributes valuable insights into the potential of humanoid robots to improve health monitoring and caregiving for older adults.

Paperid: 3208, https://arxiv.org/pdf/2503.15808.pdf

Abstract:
ChatGPT, powered by a large language model (LLM), has revolutionized everyday human-computer interaction (HCI) since its 2022 release. While now used by millions around the world, a coherent pathway for evaluating the user experience (UX) ChatGPT offers remains missing. In this rapid review (N = 58), I explored how ChatGPT UX has been approached quantitatively so far. I focused on the independent variables (IVs) manipulated, the dependent variables (DVs) measured, and the methods used for measurement. Findings reveal trends, gaps, and emerging consensus in UX assessments. This work offers a first step towards synthesizing existing approaches to measuring ChatGPT UX, urgent trajectories to advance standardization and breadth, and two preliminary frameworks aimed at guiding future research and tool development. I seek to elevate the field of ChatGPT UX by empowering researchers and practitioners in optimizing user interactions with ChatGPT and similar LLM-based systems.

Paperid: 3209, https://arxiv.org/pdf/2503.15530.pdf

Abstract:
Amidst the race to create more intelligent machines, there is a risk that we will rely on AI in ways that reduce our own agency as humans. To reduce this risk, we could aim to create tools that prioritize and enhance the human role in human-AI interactions. This paper outlines a human-centered augmented reasoning paradigm by (1) articulating fundamental principles for augmented reasoning tools, emphasizing their ergonomic, pre-conclusive, directable, exploratory, enhancing, and integrated nature; (2) proposing a 'many tasks, many tools' approach to ensuring human influence and control; and (3) offering examples of interaction modes that can serve as bridges between human reasoning and AI algorithms.

Paperid: 3210, https://arxiv.org/pdf/2503.15520.pdf

Abstract:
AI agents using Large Language Models (LLMs) as foundations have shown promise in solving complex real-world tasks. In this paper, we propose an LLM-based agentic workflow for automating Standard Operating Procedures (SOP). For customer care operations, an SOP defines a logical step-by-step process for human agents to resolve customer issues. We observe that any step in the SOP can be categorized as a user interaction or an API call, while the logical flow in the SOP defines the navigation. We use LLMs augmented with memory and environments (API tools, user interface, external knowledge source) for SOP automation. Our agentic architecture consists of three task-specific LLMs, a Global Action Repository (GAR), execution memory, and multiple environments. The SOP workflow is written as a simple logical block of text. Based on the current execution memory and the SOP, the agent chooses the action to execute; it interacts with an appropriate environment (user/API) to collect observations and feedback, which are, in turn, written back to memory to decide the next action. The agent is designed to be fault-tolerant: it dynamically decides to repeat an action or seek input from an external knowledge source. We demonstrate the efficacy of the proposed agent on three SOPs from the e-commerce seller domain. The experimental results validate the agent's performance under complex real-world scenarios.
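
The choose-act-observe loop described above can be sketched roughly as follows. This is an illustration under heavy assumptions, not the paper's architecture: the LLM that selects actions is replaced by a rule-based stub (`choose_action`), the SOP is simplified from a block of text to an ordered list of action names, the `ACTIONS` dictionary is a hypothetical stand-in for the Global Action Repository, and fault tolerance is reduced to bounded retries.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionMemory:
    # Each entry is (environment, action_name, observation); observation is None on failure.
    steps: list = field(default_factory=list)


# Hypothetical actions standing in for user-interaction and API-call steps.
def ask_order_id(memory):
    return {"order_id": "A-1"}


def lookup_order(memory):
    return {"status": "shipped"}


# Hypothetical stand-in for the Global Action Repository: name -> (environment, callable).
ACTIONS = {
    "ask_order_id": ("user", ask_order_id),
    "lookup_order": ("api", lookup_order),
}


def choose_action(sop_steps, memory):
    """Stub for the LLM policy: pick the first SOP step with no successful observation yet."""
    done = {name for _, name, obs in memory.steps if obs is not None}
    for name in sop_steps:
        if name not in done:
            return name
    return None


def run_sop(sop_steps, actions, max_retries=2):
    """Run the SOP to completion, retrying a failed action up to max_retries times."""
    memory = ExecutionMemory()
    failures = {}
    while True:
        name = choose_action(sop_steps, memory)
        if name is None:
            return memory
        environment, action = actions[name]
        try:
            observation = action(memory)
        except Exception:
            failures[name] = failures.get(name, 0) + 1
            memory.steps.append((environment, name, None))
            if failures[name] > max_retries:
                raise RuntimeError(f"action {name!r} failed after {max_retries} retries")
            continue
        memory.steps.append((environment, name, observation))


if __name__ == "__main__":
    result = run_sop(["ask_order_id", "lookup_order"], ACTIONS)
    for step in result.steps:
        print(step)
```

Keeping failed attempts in memory means the policy can see that an action was tried and did not succeed, which is where a real agent could decide to consult an external knowledge source instead of retrying.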

Paperid: 3211, https://arxiv.org/pdf/2503.15517.pdf

Abstract:
This article examines the characteristics of human errors in processing transportation requests. The role of artificial intelligence (AI) in maritime transportation is explored. The main methods and technologies used for automating and optimizing the handling of transportation requests are analyzed, along with their impact on reducing the number of errors. Examples of successful AI implementation in large companies are provided, confirming the positive influence of these technologies on overall operational efficiency and customer service levels.

Paperid: 3212, https://arxiv.org/pdf/2503.15508.pdf

Abstract:
The rapid advancement of Artificial Intelligence (AI) technologies, including the potential emergence of Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), has raised concerns about AI surpassing human cognitive capabilities. To address this challenge, intelligence augmentation approaches, such as Brain Machine Interfaces (BMI) and Brain Organoid (BO) integration, have been proposed. In this study, we compare three intelligence augmentation strategies, namely BMI, BO, and a hybrid approach combining both. These strategies are evaluated from three key perspectives that influence user decisions in selecting an augmentation method: information processing capacity, identity risk, and consent authenticity risk. First, we model these strategies and assess them across the three perspectives. The results reveal that while BO poses identity risks and BMI has limitations in consent authenticity capacity, the hybrid approach mitigates these weaknesses by striking a balance between the two. Second, we investigate how users might choose among these intelligence augmentation strategies in the context of evolving AI capabilities over time. As a result, we find that BMI augmentation alone is insufficient to compete with advanced AI, and while BO augmentation offers scalability, BO increases identity risks as the scale grows. Moreover, the hybrid approach provides a balanced solution by adapting to AI advancements. This study provides a novel framework for human capability augmentation in the era of advancing AI and serves as a guideline for adapting to AI development.

Paperid: 3213, https://arxiv.org/pdf/2503.15501.pdf

Abstract:
This study addresses the pressing challenge of educational inclusion for students with special needs by proposing and developing an inclusive educational platform. Integrating machine learning, natural language processing, and cross-platform interfaces, the platform features key functionalities such as speech recognition to support voice commands and text generation via voice input; real-time object recognition using the YOLOv5 model, adapted for educational environments; Grapheme-to-Phoneme (G2P) conversion for Text-to-Speech systems using seq2seq models with attention, ensuring natural and fluent voice synthesis; and a cross-platform mobile application built in Flutter with on-device inference using TensorFlow Lite. The results demonstrated high accuracy, usability, and positive impact in educational scenarios, validating the proposal as an effective tool for educational inclusion. This project underscores the importance of open and accessible technologies in promoting inclusive and quality education.

Paperid: 3214, https://arxiv.org/pdf/2503.15495.pdf

Abstract:
For early detection of supply bottlenecks, supply chains must be available in a suitable digital form so that they can be processed. However, the amount of work required for data modeling cannot be expected of people who are not familiar with IT. Therefore, a web application was developed as part of this thesis to hide the underlying complexity from the user. Specifically, it is a graphical user interface on which templates can be instantiated and linked to each other. Suitable concepts for defining these templates were developed and extended in this thesis. Finally, a user study with several participants was conducted to assess the usability of the web application. It revealed a large number of useful suggestions for improvement.

Paperid: 3215, https://arxiv.org/pdf/2503.15488.pdf

Abstract:
Due to the multidisciplinary nature of wearable technology, the industry faces potential limitations in innovation. The wearable technology industry is still in its infancy, and broader practical use is stagnating despite the plethora of technologies, most of which have been wrist-worn. This could be a result of the lack of multidisciplinary expert knowledge disseminating through the industry. Unlike other technologies, which have standards and established development processes, wearable technologies exist in a realm of perpetual change, given the various materials and subcomponents that continue to be developed. It is essential that expert opinions form a collaborative foundation, and even more so that intelligent systems foster that collaboration. The caveat, though, is the likelihood that these artificial intelligence (AI) collaboration tools will actually be used by industry experts. Mental model development for AI tool usage could be applied to wearable technology innovation in this regard, which is the goal of this paper and the focus of this research.

Paperid: 3216, https://arxiv.org/pdf/2503.14714.pdf

Abstract:
This paper revises aesthetics theory through the lens of authenticity and investigates practical applications using a co-design approach. We encourage designers to include ordinary clients as co-creators in the co-design process, guiding them in expressing their aesthetics, values, and preferences while stimulating their creativity. This paper proposes a bespoke design process framework for authenticity aesthetics that incorporates empathy, defining, ideating, prototyping, and testing. This framework delineates the roles and responsibilities of clients and designers at different phases and highlights evolving material mediums that enable their communication. The paper concludes by reflecting on consumerist aesthetics, advocating for designers to focus on the insights of ordinary clients, design for their authentic uniqueness, and recognize the broad prospects of bespoke design methods.

Paperid: 3217, https://arxiv.org/pdf/2503.14540.pdf

Abstract:
This article examines the evolving role of legal frameworks in shaping ethical artificial intelligence (AI) use in corporate governance. As AI systems become increasingly prevalent in business operations and decision-making, there is a growing need for robust governance structures to ensure their responsible development and deployment. Through analysis of recent legislative initiatives, industry standards, and scholarly perspectives, this paper explores key legal and regulatory approaches aimed at promoting transparency, accountability, and fairness in corporate AI applications. It evaluates the strengths and limitations of current frameworks, identifies emerging best practices, and offers recommendations for developing more comprehensive and effective AI governance regimes. The findings highlight the importance of adaptable, principle-based regulations coupled with sector-specific guidance to address the unique challenges posed by AI technologies in the corporate sphere.

Paperid: 3218, https://arxiv.org/pdf/2503.14539.pdf

Abstract:
This article examines the ethical and legal implications of artificial intelligence (AI) driven data collection, focusing on developments from 2023 to 2024. It analyzes recent advancements in AI technologies and their impact on data collection practices across various sectors. The study compares regulatory approaches in the European Union, the United States, and China, highlighting the challenges in creating a globally harmonized framework for AI governance. Key ethical issues, including informed consent, algorithmic bias, and privacy protection, are critically assessed in the context of increasingly sophisticated AI systems. The research explores case studies in healthcare, finance, and smart cities to illustrate the practical challenges of AI implementation. It evaluates the effectiveness of current legal frameworks and proposes solutions encompassing legal and policy recommendations, technical safeguards, and ethical frameworks. The article emphasizes the need for adaptive governance and international cooperation to address the global nature of AI development while balancing innovation with the protection of individual rights and societal values.

Paperid: 3219, https://arxiv.org/pdf/2503.14266.pdf

Abstract:
This study introduced a Multimodal Mindfulness-Training System. Our installation, 'EmotionCarrier', correlates traditional calligraphy interactions with real-time physiological data from an Apple Watch. We aim to enhance mindfulness training effectiveness, aiding in achieving physiological calmness through calligraphy practice. Our experiments with varied participant groups focused on data diversity, usability, and stability. We adopted methods like using EmotionCarrier for Heart Sutra transcription and adjusting installation placement for optimal user experience. Our primary finding was a correlation between calligraphy performance data and emotional responses during the transcription of the Heart Sutra.

Paperid: 3220, https://arxiv.org/pdf/2503.14103.pdf

Abstract:
Planning a trip into a potentially unsafe area is a difficult task. We conducted a formative study on travelers' information needs, finding that most of them turn to search engines for trip planning. Search engines, however, fail to provide easily interpretable results adapted to the context and personal information needs of a traveler. Large language models (LLMs) create new possibilities for providing personalized travel safety advice. To explore this idea, we developed DangerMaps, a mapping system that assists its users in researching the safety of an urban travel destination, whether pre-travel or on location. DangerMaps plots safety ratings onto a map and provides explanations on demand. This late-breaking work specifically emphasizes the challenges of designing real-world applications with large language models. We provide a detailed description of our approach to prompt design and highlight future areas of research.

Paperid: 3221, https://arxiv.org/pdf/2503.12831.pdf

Abstract:
The main purpose of this paper is the development, implementation, and testing of a low-cost portable system to assist partially paralyzed patients in hand rehabilitation after strokes or injuries. Rehabilitation involves time-consuming, repetitive exercises that are costly and demotivating and that require clinic attendance and direct supervision by physiotherapists. In this work, the system consists of a graphical user interface (GUI) on a smartphone screen that instructs and motivates patients to do their exercises by themselves. Through the GUI, patients are instructed to do a sequence of exercises step by step, and the system measures the electrical activity (electromyographic (EMG) signals) of the user's forearm muscles with a Myo armband. Based on a database, the system can tell whether the patient has performed the correct movement. If a correct movement is detected, the system informs the user through the GUI and moves on to the next exercise. For preliminary results, the system was extensively tested on a healthy person.

Paperid: 3222, https://arxiv.org/pdf/2503.11970.pdf

Abstract:
Motion aftereffect (MAE) offers valuable insights into the mechanisms underlying motion-in-depth (MID) perception. This study investigates two critical aspects of MAE in depth: (1) the potential directional asymmetry between motion toward versus away from the observer, and (2) the effect of induced eye vergence on MAE magnitude. We conducted two experiments using random dot stereograms (RDS) to isolate the interocular velocity difference (IOVD) mechanism. In Experiment 1, we compared MAE magnitude following adaptation to motion-toward versus motion-away stimuli with a static fixation point. In Experiment 2, we introduced a fixation point oscillating in depth to induce vergence eye movements during adaptation and testing. Our results revealed a directional asymmetry in MAE strength, with motion-toward adaptation producing stronger aftereffects than motion-away adaptation in Experiment 1. When eye vergence was induced in Experiment 2, this pattern was reversed, with motion-away adaptation yielding stronger MAEs. These findings suggest an important interaction between adaptation direction and eye vergence state in MID perception, highlighting the complex integration of retinal and extra-retinal signals in the visual system's processing of motion through depth.

Paperid: 3223, https://arxiv.org/pdf/2503.11944.pdf

Abstract:
Digital twins (DTs) are redefining healthcare by paving the way for more personalized, proactive, and intelligent medical interventions. As the shift toward personalized care intensifies, there is a growing need for an individual's virtual replica that delivers the right treatment at the optimal time and in the most effective manner. The emerging concept of a Human Digital Twin (HDT) holds the potential to revolutionize the traditional healthcare system much like digital twins have transformed manufacturing and aviation. An HDT mirrors the physical entity of a human body through a dynamic virtual model that continuously reflects changes in molecular, physiological, emotional, and lifestyle factors. This digital representation not only supports remote monitoring, diagnosis, and prescription but also facilitates surgery, rehabilitation, and overall personalized care, thereby relieving pressure on conventional healthcare frameworks. Despite its promising advantages, there are considerable research challenges to overcome as HDT technology evolves. In this study, I will initially delineate the distinctions between traditional digital twins and HDTs, followed by an exploration of the networking architecture integral to their operation--from data acquisition and communication to computation, management, and decision-making--thereby offering insights into how these innovations may reshape the modern healthcare industry.

Paperid: 3224, https://arxiv.org/pdf/2503.10647.pdf

Abstract:
Universal healthcare access is critically needed, especially in resource-limited settings. Large Language Models (LLMs) offer promise for democratizing healthcare with advanced diagnostics, but their reliability requires thorough evaluation, especially in trust-dependent environments. This study assesses LLMs' diagnostic reliability focusing on consistency, manipulation resilience, and contextual integration, crucial for safe and ethical use in universal healthcare. We evaluated leading LLMs using 52 patient cases, expanded into variants with demographic changes, symptom rewordings, and exam modifications, while keeping core diagnoses constant. Manipulation susceptibility was tested by inserting misleading narratives and irrelevant details. Contextual awareness was evaluated by comparing diagnoses with and without patient history. We analyzed diagnostic change rates and response patterns across manipulations. LLMs showed perfect diagnostic consistency for identical data but significant manipulation susceptibility. Gemini had a 40% diagnosis change rate and ChatGPT 30% with irrelevant details. ChatGPT had a higher context influence rate (77.8% vs. Gemini's 55.6%), but both showed limited nuanced contextual integration, exhibiting anchoring bias by prioritizing salient data over context. LLMs' vulnerability to manipulation and limited contextual awareness pose challenges in clinical use. Unlike clinicians, they may overstate diagnostic certainty without validation. Safeguards and domain-specific designs are crucial for reliable healthcare applications. Broad clinical use without oversight is premature and risky. LLMs can enhance diagnostics with responsible use, but future research is needed to improve manipulation resistance and contextual understanding for safe healthcare democratization.

Paperid: 3225, https://arxiv.org/pdf/2503.07840.pdf

Abstract:
The notion of citizen science is often referred to as the means of engaging public members in scientific research activities that can advance the reach and impact of technoscience. Despite this, few studies have addressed how human-machine collaborations in a citizen science context enable and constrain scientific citizenship and citizens' epistemic agencies and reconfigure science-citizen relations, including the process of citizens' engagement in scientific knowledge production. The following will address this gap by analysing the human and nonhuman material and discursive engagements in the citizen science project The Sound of Denmark. Doing so contributes to new knowledge on designing more responsible forms of citizen science engagement that advance civic agencies. Key findings emphasise that citizen science development can benefit from diverse fields such as participatory design research and feminist technoscience. Finally, the paper contributes to a broader debate on the formation of epistemic subjects, scientific citizenship, and responsible designing and evaluation of citizen science. Keywords: scientific citizenship, citizen science communication, epistemic agency, co-design, material-discursive practices, response-ability.

Paperid: 3226, https://arxiv.org/pdf/2503.06551.pdf

Abstract:
This paper critically examines the recent publication "ChatGPT-4 in the Turing Test" by Restrepo Echavarría (2025), challenging its central claims regarding the absence of minimally serious test implementations and the conclusion that ChatGPT-4 fails the Turing Test. The analysis reveals that the criticisms based on rigid criteria and limited experimental data are not fully justified. More importantly, the paper makes several constructive contributions that enrich our understanding of Turing Test implementations. It demonstrates that two distinct formats--the three-player and two-player tests--are both valid, each with unique methodological implications. The work distinguishes between absolute criteria (reflecting an optimal 50% identification rate in a three-player format) and relative criteria (which measure how closely a machine's performance approximates that of a human), offering a more nuanced evaluation framework. Furthermore, the paper clarifies the probabilistic underpinnings of both test types by modeling them as Bernoulli experiments--correlated in the three-player version and uncorrelated in the two-player version. This formalization allows for a rigorous separation between the theoretical criteria for passing the test, defined in probabilistic terms, and the experimental data that require robust statistical methods for proper interpretation. In doing so, the paper not only refutes key aspects of the criticized study but also lays a solid foundation for future research on objective measures of how closely an AI's behavior aligns with, or deviates from, that of a human being.
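
The absolute criterion described above can be illustrated with a small sketch. It assumes each three-player game is a single Bernoulli trial whose success is the interrogator correctly identifying the human, and it tests the observed identification rate against the 50% benchmark with an exact binomial test. The counts (54 of 100) are hypothetical, and the choice of test is an illustration rather than the procedure used in the paper.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null success rate p0.
    """
    pmf = [comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical data: the human was correctly identified in 54 of 100 games.
p_value = binomial_two_sided_p(54, 100)
print(f"two-sided p-value against a 50% identification rate: {p_value:.3f}")
```

A large p-value here would mean the data are compatible with chance-level identification; it does not by itself establish that the machine passes, which is the separation between criterion and statistical evidence that the abstract stresses.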

Paperid: 3227, https://arxiv.org/pdf/2503.05786.pdf

Abstract:
With the increasing prevalence of mental health conditions worldwide, AI-powered chatbots and conversational agents have emerged as accessible tools to support mental health. However, deploying Large Language Models (LLMs) in mental healthcare applications raises significant privacy concerns, especially regarding regulations like HIPAA and GDPR. In this work, we propose FedMentalCare, a privacy-preserving framework that leverages Federated Learning (FL) combined with Low-Rank Adaptation (LoRA) to fine-tune LLMs for mental health analysis. We investigate the performance impact of varying client data volumes and model architectures (e.g., MobileBERT and MiniLM) in FL environments. Our framework demonstrates a scalable, privacy-aware approach for deploying LLMs in real-world mental healthcare scenarios, addressing data security and computational efficiency challenges.

Paperid: 3228, https://arxiv.org/pdf/2503.05765.pdf

Abstract:
As robots take on caregiving roles, ensuring equitable and unbiased interactions with diverse populations is critical. Although Large Language Models (LLMs) serve as key components in shaping robotic behavior, speech, and decision-making, these models may encode and propagate societal biases, leading to disparities in care based on demographic factors. This paper examines how LLM-generated responses shape robot caregiving characteristics and responsibilities when prompted with different demographic information related to sex, gender, sexuality, race, ethnicity, nationality, disability, and age. Findings show simplified descriptions for disability and age, lower sentiment for disability and LGBTQ+ identities, and distinct clustering patterns reinforcing stereotypes in caregiving narratives. These results emphasize the need for ethical and inclusive HRI design.

Paperid: 3229, https://arxiv.org/pdf/2503.05709.pdf

Abstract:
This paper explores advancements in Artificial Intelligence technologies to enhance classroom learning, highlighting contributions from companies like IBM, Microsoft, Google, and ChatGPT, as well as the potential of brain signal analysis. The focus is on improving students' learning experiences by using Machine Learning algorithms to identify a student's preferred learning style and predict academic dropout risk. A Logistic Regression algorithm is applied for binary classification using six predictor variables, such as assessment scores, lesson duration, and preferred learning style, to accurately identify learning preferences. A case study, with 76,519 candidates and 35 predictor variables, assesses academic dropout risk using Logistic Regression, achieving a test accuracy of 87.39%. In comparison, the Stochastic Gradient Descent classifier achieved an accuracy of 83.1% on the same dataset.
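
As a minimal sketch of the binary logistic regression mentioned above, the following trains a classifier by batch gradient descent on synthetic data. The two features, the labeling rule, and the hyperparameters are all assumptions made for illustration; none of them come from the paper's datasets or setup.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x) -> int:
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Synthetic data: two hypothetical, already-normalized predictors.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if x[0] + x[1] > 0 else 0 for x in X]  # assumed labeling rule

w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(X)
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```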

Paperid: 3230, https://arxiv.org/pdf/2503.04751.pdf

Abstract:
The rapid integration of Artificial Intelligence (AI) in Higher Education (HE) is transforming personalized learning, administrative automation, and decision-making. However, this progress presents a duality, as AI adoption also introduces ethical and institutional challenges, including algorithmic bias, data privacy risks, and governance inconsistencies. To address these concerns, this study introduces the Human-Driven AI in Higher Education (HD-AIHED) Framework, ensuring compliance with UNESCO and OECD ethical standards. This conceptual research employs a qualitative meta-synthesis approach, integrating qualitative and quantitative studies to identify patterns, contradictions, and gaps in AI adoption within HE. It reinterprets existing datasets through theoretical and ethical lenses to develop governance frameworks. The study applies a participatory integrated co-system, Phased Human Intelligence, SWOC analysis, and AI ethical review boards to assess AI readiness and governance strategies for universities and HE institutions. The HD-AIHED model bridges AI research gaps, addresses global real-time challenges, and provides tailored, scalable, and ethical strategies for diverse educational contexts. By emphasizing interdisciplinary collaboration among stakeholders, this study envisions AIHED as a transparent and equitable force for innovation. The HD-AIHED framework ensures AI acts as a collaborative and ethical enabler rather than a disruptive replacement for human intelligence while advocating for responsible AI implementation in HE.

Paperid: 3231, https://arxiv.org/pdf/2503.03968.pdf

Abstract:
Human-AI collaboration is increasingly promoted to improve high-stakes decision-making, yet its benefits have not been fully realized. Application-grounded evaluations are needed to better evaluate methods for improving collaboration but often require domain experts, making studies costly and limiting their generalizability. Current evaluation methods are constrained by limited public datasets and reliance on proxy tasks. To address these challenges, we propose an application-grounded framework for large-scale, online evaluations of vision-based decision-making tasks. The framework introduces Blockies, a parametric approach for generating datasets of simulated diagnostic tasks, offering control over the traits and biases in the data used to train real-world models. These tasks are designed to be easy to learn but difficult to master, enabling participation by non-experts. The framework also incorporates storytelling and monetary incentives to manipulate perceived task stakes. An initial empirical study demonstrated that the high-stakes condition significantly reduced healthy distrust of AI, despite longer decision-making times. These findings underscore the importance of perceived stakes in fostering healthy distrust and demonstrate the framework's potential for scalable evaluation of high-stakes Human-AI collaboration.

Paperid: 3233, https://arxiv.org/pdf/2503.02258.pdf

Abstract:
Digital technologies are reshaping how people experience their surroundings, often pulling focus toward virtual spaces and making it harder to stay present and engaged. Wearable augmented reality (AR), by embedding digital information into the physical world, may further immerse users in digital layers. Yet paradoxically, it also holds the potential to support presence and engagement. To explore this possibility, this study adopts an autoethnographic approach, providing a first-person perspective on how everyday technologies shape real-world engagement. Over four weeks, 20 experiences were documented, capturing interactions with phones, laptops, and fitness trackers in various contexts. The findings reveal nuanced patterns of technology use and propose design implications for wearable AR, emphasising its potential for personalised, context-aware interventions that support meaningful real-world connection. This work contributes to the discourse on digital well-being, suggesting that wearable AR can evolve beyond digital augmentation to help users reconnect with their surroundings.

Paperid: 3234, https://arxiv.org/pdf/2503.00257.pdf

Abstract:
Speculative technologies in science fiction have long inspired advancements in Human-Computer Interaction (HCI). Doraemon, a Japanese manga featuring a robotic cat from the 22nd century, presents an extensive collection of futuristic gadgets -- an underexplored source of speculative technologies. This study systematically analyses 379 of these gadgets, categorising them into 33 subcategories within 10 high-level groupings, to examine the fundamental human needs they address, their parallels to contemporary technologies, and their potential insights for HCI design. The findings reveal that while human needs remain constant, the ways in which technology fulfils them differ. Doraemon's gadgets emphasise tangible, single-purpose interactions with built-in reversibility, contrasting with the increasing complexity and software-driven nature of modern systems. By examining these speculative technologies, this study highlights alternative interaction paradigms that challenge current HCI trends and offer inspiration for future user-centred innovation.

Paperid: 3235, https://arxiv.org/pdf/2502.20402.pdf

Abstract:
This chapter is interested in the epistemology of algorithms. As I intend to approach the topic, this is an issue about epistemic justification. Current approaches to justification emphasize the transparency of algorithms, which entails elucidating their internal mechanisms -- such as functions and variables -- and demonstrating how (or that) these produce outputs. Thus, the mode of justification through transparency is contingent on what can be shown about the algorithm and, in this sense, is internal to the algorithm. In contrast, I advocate for an externalist epistemology of algorithms that I term computational reliabilism (CR). While I have previously introduced and examined CR in the field of computer simulations ([42, 53, 4]), this chapter extends this reliabilist epistemology to encompass a broader spectrum of algorithms utilized in various scientific disciplines, with a particular emphasis on machine learning applications. At its core, CR posits that an algorithm's output is justified if it is produced by a reliable algorithm. A reliable algorithm is one that has been specified, coded, used, and maintained utilizing reliability indicators. These reliability indicators stem from formal methods, algorithmic metrics, expert competencies, cultures of research, and other scientific endeavors. The primary aim of this chapter is to delineate the foundations of CR, explicate its operational mechanisms, and outline its potential as an externalist epistemology of algorithms.

Paperid: 3236, https://arxiv.org/pdf/2502.19679.pdf

Abstract:
Large Language Models, despite their power, have a fundamental architectural vulnerability stemming from their causal transformer design -- order sensitivity. This architectural constraint may distort classification outcomes when prompt elements like label options are reordered, revealing a theoretical gap between accuracy metrics and true model reliability. The paper conceptualizes this vulnerability through the lens of survey methodology, where respondent biases parallel LLM positional dependencies. Empirical evidence using the F1000 biomedical dataset across three scales of LLaMA3.1 models (8B, 70B, 405B) demonstrates that these architectural constraints produce inconsistent annotations under controlled perturbations. The paper advances a practical solution for social science - Independent Probability Assessment - which decouples label evaluation to circumvent positional bias inherent in sequential processing. This approach yields an information-theoretic reliability measure (R-score) that quantifies annotation robustness at the case level. The findings establish that architectural vulnerabilities in causal transformers require methodological innovations beyond accuracy metrics to ensure valid social science inference, as demonstrated through downstream regression analyses where order-sensitive annotations significantly alter substantive conclusions about scientific impact.
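
The idea of decoupling label evaluation can be sketched as follows: each label is scored in its own call, so the order in which options are listed cannot affect the result. Everything here is hypothetical. `toy_scorer` is a keyword stub standing in for a model call, and the entropy-based score is an assumed stand-in for illustration, not the paper's R-score, whose definition the abstract does not give.

```python
import math


def independent_assessment(text, labels, score_label):
    """Score every label in isolation; the result does not depend on label order."""
    return {label: score_label(text, label) for label in labels}


def entropy_reliability(scores):
    """Hypothetical reliability: 1 minus the normalized entropy of the
    renormalized label scores. 1.0 means one label dominates, 0.0 means
    the scores are uniform. This is NOT the paper's R-score.
    """
    total = sum(scores.values())
    if len(scores) < 2:
        return 1.0
    if total == 0:
        return 0.0
    dist = [s / total for s in scores.values() if s > 0]
    entropy = -sum(p * math.log(p) for p in dist)
    return 1.0 - entropy / math.log(len(scores))


def toy_scorer(text, label):
    """Stub in place of an LLM call: a keyword match yields a high score."""
    return 0.9 if label.lower() in text.lower() else 0.05


labels = ["confirmatory", "exploratory", "methodological"]
text = "This exploratory study surveys early findings."

forward = independent_assessment(text, labels, toy_scorer)
backward = independent_assessment(text, list(reversed(labels)), toy_scorer)
print(forward == backward, round(entropy_reliability(forward), 3))
```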

Paperid: 3237, https://arxiv.org/pdf/2502.18513.pdf

Abstract:
While there is an increased discourse on large language models (LLMs) like ChatGPT and DeepSeek, there is no comprehensive understanding of how users of online platforms, like Reddit, perceive these models. This is an important omission because public opinion can influence AI development, trust, and future policy. This study aims at analyzing Reddit discussions about ChatGPT and DeepSeek using sentiment and topic modeling to advance the understanding of user attitudes. Some of the significant topics such as trust in AI, user expectations, potential uses of the tools, reservations about AI biases, and ethical implications of their use are explored in this study. By examining these concerns, the study provides a sense of how public sentiment might shape the direction of AI development going forward. The report also mentions whether users have faith in the technology and what they see as its future. A word frequency approach is used to identify broad topics and sentiment trends. Also, topic modeling through the Latent Dirichlet Allocation (LDA) method identifies top topics in users' language, for example, potential benefits of LLMs, their technological applications, and their overall social ramifications. The study aims to inform developers and policymakers by making it easier to see how users comprehend and experience these game-changing technologies.

Paperid: 3238, https://arxiv.org/pdf/2502.17703.pdf

Abstract:
As technology has become more embedded into our society, the security of modern-day systems is paramount. One topic which is constantly under discussion is that of patching, or more specifically, the installation of updates that remediate security vulnerabilities in software or hardware systems. This continued deliberation is motivated by complexities involved with patching; in particular, the various incentives and disincentives for organizations and their cybersecurity teams when deciding whether to patch. In this paper, we take a fresh look at the question of patching and critically explore why organizations and IT/security teams choose to patch or decide against it (either explicitly or due to inaction). We tackle this question by aggregating and synthesizing prominent research and industry literature on the incentives and disincentives for patching, specifically considering the human aspects in the context of these motives. Through this research, this study identifies key motivators such as organizational needs, the IT/security team's relationship with vendors, and legal and regulatory requirements placed on the business and its staff. There are also numerous significant reasons discovered for why the decision is taken not to patch, including limited resources (e.g., person-power), challenges with manual patch management tasks, human error, bad patches, unreliable patch management tools, and the perception that related vulnerabilities would not be exploited. These disincentives, in combination with the motivators above, highlight the difficult balance that organizations and their security teams need to maintain on a daily basis. Finally, we conclude by discussing implications of these findings and important future considerations.

Paperid: 3239, https://arxiv.org/pdf/2502.17348.pdf

Abstract:
Scientists across disciplines write code for critical activities like data collection and generation, statistical modeling, and visualization. As large language models that can generate code have become widely available, scientists may increasingly use these models during research software development. We investigate the characteristics of scientists who are early-adopters of code generating models and conduct interviews with scientists at a public, research-focused university. Through interviews and reviews of user interaction logs, we see that scientists often use code generating models as an information retrieval tool for navigating unfamiliar programming languages and libraries. We present findings about their verification strategies and discuss potential vulnerabilities that may emerge from code generation practices unknowingly influencing the parameters of scientific analyses.

Paperid: 3240, https://arxiv.org/pdf/2502.16345.pdf

Abstract:
In this paper, we attempt to understand the anthropomorphic features of chatbot outputs and how these features provide a discursive frame for human-AI interactions. To do so, we explore the use of a prompt-based walkthrough method with two phases: (1) interview-style prompting to reveal the chatbots' context of expected use and (2) roleplaying-type prompting to evoke everyday use scenarios and typical chatbot outputs. We applied this method to catalogue anthropomorphic features across four different LLM chatbots, finding that anthropomorphism was exhibited as both subjective language and a sympathetic conversational tone. We also found that socio-emotional cues in prompts increase the incidence of anthropomorphic expressions in outputs. We argue that the prompt-based walkthrough method was successful in stimulating social role performance in LLM chatbots and in eliciting a variety of anthropomorphic features, making it useful in the study of interaction-based algorithmic harms where users project inappropriate social roles onto LLM-based tools.

Paperid: 3241, https://arxiv.org/pdf/2502.15870.pdf

Abstract:
This study investigates how individuals' perceptions of artificial intelligence (AI) limitations influence organizational readiness for AI adoption. Through semi-structured interviews with seven AI implementation experts, analyzed using the Gioia methodology, the research reveals that organizational readiness emerges through dynamic interactions between individual sensemaking, social learning, and formal integration processes. The findings demonstrate that hands-on experience with AI limitations leads to more realistic expectations and increased trust, mainly when supported by peer networks and champion systems. Organizations that successfully translate these individual and collective insights into formal governance structures achieve more sustainable AI adoption. The study advances theory by showing how organizational readiness for AI adoption evolves through continuous cycles of individual understanding, social learning, and organizational adaptation. These insights suggest that organizations should approach AI adoption not as a one-time implementation but as an ongoing strategic learning process that balances innovation with practical constraints. The research contributes to organizational readiness theory and practice by illuminating how micro-level perceptions and experiences shape macro-level adoption outcomes.

Paperid: 3242, https://arxiv.org/pdf/2502.15869.pdf

Abstract:
This thesis presents a framework that integrates state-of-the-art generative AI models for real-time creation of three-dimensional (3D) objects in augmented reality (AR) environments. The primary goal is to convert diverse inputs, such as images and speech, into accurate 3D models, enhancing user interaction and immersion. Key components include advanced object detection algorithms, user-friendly interaction techniques, and robust AI models like Shap-E for 3D generation. Leveraging Vision Language Models (VLMs) and Large Language Models (LLMs), the system captures spatial details from images and processes textual information to generate comprehensive 3D objects, seamlessly integrating virtual objects into real-world environments. The framework demonstrates applications across industries such as gaming, education, retail, and interior design. It allows players to create personalized in-game assets, customers to see products in their environments before purchase, and designers to convert real-world objects into 3D models for real-time visualization. A significant contribution is democratizing 3D model creation, making advanced AI tools accessible to a broader audience, fostering creativity and innovation. The framework addresses challenges like handling multilingual inputs, diverse visual data, and complex environments, improving object detection and model generation accuracy, as well as loading 3D models in AR space in real-time. In conclusion, this thesis integrates generative AI and AR for efficient 3D model generation, enhancing accessibility and paving the way for innovative applications and improved user interactions in AR environments.

Paperid: 3243, https://arxiv.org/pdf/2502.15196.pdf

Abstract:
Human-AI collaborative decision making has emerged as a pivotal field in recent years. Existing methods treat human and AI as different entities when designing human-AI systems. However, as the decision capabilities of AI models become closer to human beings, it is necessary to build a uniform framework for capability modeling and integrating. In this study, we propose a general architecture for human-AI collaborative decision making, wherein we employ learnable capability vectors to represent the decision-making capabilities of both human experts and AI models. These capability vectors are utilized to determine the decision weights of multiple decision makers, taking into account the contextual information of each decision task. Our proposed architecture accommodates scenarios involving multiple human-AI decision makers with varying capabilities. Furthermore, we introduce a learning-free approach to establish a baseline using global collaborative weights. Experiments on image classification and hate speech detection demonstrate that our proposed architecture significantly outperforms the current state-of-the-art methods in image classification and sentiment analysis, especially for the case with large non-expertise capability levels. Overall, our method provides an effective and robust collaborative decision-making approach that integrates diverse human/AI capabilities within a unified framework.
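
A minimal sketch of how capability vectors might turn into context-dependent decision weights is given below. It assumes the weight of each decision maker is a softmax over the dot product between its capability vector and a context vector for the task; the vectors are fixed by hand here, whereas the abstract describes them as learnable, and the scoring rule itself is an assumption rather than the paper's architecture.

```python
import math


def softmax(scores):
    """Stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decision_weights(capabilities, context):
    """One weight per decision maker from capability-context dot products."""
    scores = [sum(c * x for c, x in zip(cap, context)) for cap in capabilities]
    return softmax(scores)


def combine(predictions, weights):
    """Weighted average of each decision maker's class probabilities."""
    n_classes = len(predictions[0])
    return [sum(w * p[k] for w, p in zip(weights, predictions))
            for k in range(n_classes)]


# Hypothetical setup: a human expert strong on context dimension 0,
# an AI model strong on context dimension 1.
capabilities = [[2.0, 0.0], [0.0, 2.0]]
context = [1.0, 0.0]                      # this task loads on dimension 0
predictions = [[0.9, 0.1], [0.2, 0.8]]    # class probabilities per decision maker

weights = decision_weights(capabilities, context)
combined = combine(predictions, weights)
print(weights, combined)
```

With this context the first decision maker receives the larger weight, so the combined distribution leans toward its prediction; a context of `[0.0, 1.0]` would shift the weight to the second.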

Paperid: 3244, https://arxiv.org/pdf/2502.14870.pdf

Abstract:
The development of artificial general intelligence (AGI) is likely to be one of humanity's most consequential technological advancements. Leading AI labs and scientists have called for the global prioritization of AI safety citing existential risks comparable to nuclear war. However, research on catastrophic risks and AI alignment is often met with skepticism, even by experts. Furthermore, online debate over the existential risk of AI has begun to turn tribal (e.g. name-calling such as "doomer" or "accelerationist"). Until now, no systematic study has explored the patterns of belief and the levels of familiarity with AI safety concepts among experts. I surveyed 111 AI experts on their familiarity with AI safety concepts, key objections to AI safety, and reactions to safety arguments. My findings reveal that AI experts cluster into two viewpoints -- an "AI as controllable tool" and an "AI as uncontrollable agent" perspective -- diverging in beliefs toward the importance of AI safety. While most experts (78%) agreed or strongly agreed that "technical AI researchers should be concerned about catastrophic risks", many were unfamiliar with specific AI safety concepts. For example, only 21% of surveyed experts had heard of "instrumental convergence," a fundamental concept in AI safety predicting that advanced AI systems will tend to pursue common sub-goals (such as self-preservation). The least concerned participants were the least familiar with concepts like this, suggesting that effective communication of AI safety should begin with establishing clear conceptual foundations in the field.

Paperid: 3245, https://arxiv.org/pdf/2502.14632.pdf

Abstract:
The integration of generative AI (GenAI) tools, particularly large language models (LLMs), is transforming professional coaching workflows. This study explores how coaches use GenAI, the perceived benefits and limitations of these tools, and broader attitudes toward AI-assisted coaching. A survey of 205 coaching professionals reveals widespread adoption of GenAI for research, content creation, and administrative support, while its role in relational and interpretative coaching remains limited. Findings indicate that AI literacy and perceived AI impact strongly predict GenAI adoption, with positive attitudes fostering greater use. Ethical considerations, particularly transparency and data privacy, are a key concern, with frequent AI users demonstrating greater ethical awareness. Regression analyses show that while perceived effectiveness drives GenAI adoption, concerns about AI replacing human coaches do not significantly influence usage. Coaches express interest in future AI capabilities that enhance personalization, real-time feedback, and administrative automation while maintaining human oversight. The study highlights that GenAI functions best as an augmentation tool rather than a replacement, emphasizing the need for AI literacy training, ethical guidelines, and human-centered AI integration. These findings contribute to the ongoing discourse on human-AI collaboration, advocating for responsible and effective AI adoption in professional coaching.

Paperid: 3246, https://arxiv.org/pdf/2502.14048.pdf

Abstract:
In this paper, we present two techniques for use in context-aware systems: Semantic Decomposition, which sequentially decomposes input prompts into a structured and hierarchical information schema that systems can parse and process easily, and Selective Context Filtering, which enables systems to systematically filter out specific irrelevant sections of contextual information that are fed through a system's NLP-based pipeline. We will explore how context-aware systems and applications can utilize these two techniques in order to implement dynamic LLM-to-system interfaces, improve an LLM's ability to generate more contextually cohesive user-facing responses, and optimize complex automated workflows and pipelines.

Paperid: 3247, https://arxiv.org/pdf/2502.13779.pdf

Abstract:
Balancing user agency and system automation is essential for effective human-AI interactions. Fully automated systems can deliver efficiency but risk undermining usability and user autonomy, while purely manual tools are often inefficient and fail to enhance user capabilities. This dissertation addresses the question: "How can we balance user agency and system automation for interactions with intelligent systems?" We present four main contributions. First, we develop a spherical electromagnet that provides adjustable forces on an untethered tool, allowing haptic feedback while preserving user agency. Second, we create an integrated sensing and actuation system that tracks a passive magnetic tool in 3D and delivers haptic feedback without external tracking. Third, we propose an optimal control method for electromagnetic haptic guidance that balances user input with system control, enabling users to adjust trajectories and speed. Finally, we introduce a model-free reinforcement learning approach for adaptive interfaces that learns interface adaptations without heuristics or real user data. Our simulations and user studies show that shared control significantly outperforms naive strategies. By incorporating explicit or implicit models of human behavior into control strategies, intelligent systems can better account for user agency. We demonstrate that the trade-off between agency and automation is both an algorithmic challenge and an engineering concern, shaped by the design of physical devices and user interfaces. We advocate an integrated, end-to-end approach -- combining algorithmic, engineering, and design perspectives -- to enable more intuitive and effective interactions with intelligent systems.

Paperid: 3248, https://arxiv.org/pdf/2502.11737.pdf

Abstract:
The digital age requires strong security measures to protect online activities. Two-Factor Authentication (2FA) has emerged as a critical solution. However, its implementation presents significant challenges, particularly in terms of accessibility for people with disabilities. This paper examines the intricacies of deploying 2FA in a way that is secure and accessible to all users by outlining the concrete challenges for people who are affected by various types of impairments. This research investigates the implications of 2FA on digital inclusivity and proposes solutions to enhance accessibility. An analysis was conducted to examine the implementation and availability of various 2FA methods across popular online platforms. The results reveal a diverse landscape of authentication strategies. While 2FA significantly improves account security, its current adoption is hampered by inconsistencies across platforms and a lack of standardised, accessible options for users with disabilities. Future advancements in 2FA technologies, including but not limited to autofill capabilities and the adoption of Fast IDentity Online (FIDO) protocols, offer possible directions for more inclusive authentication mechanisms. However, ongoing research is necessary to address the evolving needs of users with disabilities and to mitigate new security challenges. This paper proposes a collaborative approach among stakeholders to ensure that security improvements do not compromise accessibility. It promotes a digital environment where security and inclusivity mutually reinforce each other.

Paperid: 3249, https://arxiv.org/pdf/2502.11069.pdf

Abstract:
The advent of mobile applications has brought new frontiers to usability studies. So far, ongoing research has undertaken considerable efforts to model usability in this new and challenging context. One of these endeavors is the PACMAD+3 model, which consists of a total of ten unique factors. However, to the best of our knowledge, little or no effort has been made to empirically evaluate these factors against perceived influence. With this in mind, the objective of this study is to explore this issue by evaluating the selected factors. To achieve this goal in a reliable and reproducible manner, we took advantage of previous attempts to conceptualize the mobile usability factors, but we contribute by operationalizing these theoretical constructs into observable and measurable phenomena. To this end, a survey was designed and carried out on a sample of 838 users in order to evaluate the significance of the PACMAD+3 factors on the perceived usability of mobile applications. Our findings show that, on average, users rated efficiency as highly important, while the remaining seven, namely: cognitive load, errors, learnability, operability, effectiveness, memorability, and understandability, were rated as moderately important. The discussed results provide insight into the importance of usability attributes and quality criteria from both perspectives, ultimately facilitating and securing the design and development of mobile applications. Therefore, our research contributes to the field of human-computer interaction, with both theoretical and practical implications for mobile usability researchers, UX designers, and quality assurance engineers.

Paperid: 3250, https://arxiv.org/pdf/2502.10913.pdf

Abstract:
This study examines how WhatsApp has evolved from a personal communication tool to a professional platform, focusing on its use by small business owners in India. Initially embraced in smaller, rural communities for its ease of use and familiarity, WhatsApp played a crucial role in local economies. However, as Meta introduced WhatsApp Business with new, formalized features, users encountered challenges in adapting to the more complex and costly platform. Interviews with 14 small business owners revealed that while they adapted creatively, they felt marginalized by the advanced tools. This research contributes to HCI literature by exploring the transition from personal to professional use and introduces the concept of Coercive Professionalization. It highlights how standardization by large tech companies affects marginalized users, exacerbating power imbalances and reinforcing digital colonialism, concluding with design implications for supporting community-based appropriations.

Paperid: 3251, https://arxiv.org/pdf/2502.09903.pdf

Abstract:
In this paper, we reexamine prompt engineering for large language models through the lens of automata theory. We argue that language models function as automata and, like all automata, should be programmed in the languages they accept, a unified collection of all natural and formal languages. Therefore, traditional software engineering practices--conditioned on the clear separation of programming languages and natural languages--must be rethought. We introduce the Ann Arbor Architecture, a conceptual framework for agent-oriented programming of language models, as a higher-level abstraction over raw token generation, and provide a new perspective on in-context learning. Based on this framework, we present the design of our agent platform Postline, and report on our initial experiments in agent training.

Paperid: 3252, https://arxiv.org/pdf/2502.09422.pdf

Abstract:
As technologies and interfaces for the instrumental control of musical sound get ever better at tracking aspects of human position and motion in space, a fundamental problem emerges: Unintended or even counter-intentional control may result when humans themselves become a source of positional noise. A clear case of what is meant by this is the "stillness movement" of a body part occurring despite the simultaneous explicit intention for that body part to remain still. In this paper, we present the results of a randomized, controlled experiment investigating this phenomenon along a vertical axis relative to the human fingertip. The results include characterizations of both the spatial distribution and frequency distribution of the stillness movement observed. Also included are results indicating a possible role for constant forces and viscosities in reducing stillness movement amplitude, thereby potentially enabling the implementation of more positional control of musical sound within the same available spatial range. Importantly, the above is summarized in a form that is directly interpretable for anyone designing technologies, interactions, or performances that involve fingertip control of musical sound. Also, a complete data set of the experimental results is included in the separate Appendices to this paper, again in a format that is directly interpretable.

Paperid: 3253, https://arxiv.org/pdf/2502.09203.pdf

Abstract:
Due to large intra-subject and inter-subject variabilities of electroencephalogram (EEG) signals, EEG-based brain-computer interfaces (BCIs) usually need subject-specific calibration to tailor the decoding algorithm for each new subject, which is time-consuming and user-unfriendly, hindering their real-world applications. Transfer learning (TL) has been extensively used to expedite the calibration, by making use of EEG data from other subjects/sessions. An important consideration in TL for EEG-based BCIs is to reduce the data distribution discrepancies among different subjects/sessions, to avoid negative transfer. Euclidean alignment (EA) was proposed in 2020 to address this challenge. Numerous experiments from 13 different BCI paradigms demonstrated its effectiveness and efficiency. This paper revisits EA, explaining its procedure and correct usage, introducing its applications and extensions, and pointing out potential new research directions. It should be very helpful to BCI researchers, especially those who are working on EEG signal decoding.
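For orientation, EA is usually stated as below. The abstract itself gives no formulas, so the notation is an assumption: $X_i$ denotes the $i$-th of $n$ EEG trials from one subject or session (channels by time samples), $\bar{R}$ the arithmetic mean of the trial covariance matrices, and $\tilde{X}_i$ the aligned trial. The last equality shows why the step reduces distribution discrepancies: after alignment, the mean covariance of every subject or session becomes the identity matrix.

```latex
\bar{R} = \frac{1}{n}\sum_{i=1}^{n} X_i X_i^{\top}, \qquad
\tilde{X}_i = \bar{R}^{-1/2} X_i
\quad\Longrightarrow\quad
\frac{1}{n}\sum_{i=1}^{n} \tilde{X}_i \tilde{X}_i^{\top}
= \bar{R}^{-1/2}\,\bar{R}\,\bar{R}^{-1/2} = I .
```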

Paperid: 3254, https://arxiv.org/pdf/2502.08833.pdf

Abstract:
Daily activity recognition has gained prominence due to its applications in context-aware computing. Current methods primarily rely on supervised learning for detecting simple, repetitive activities. This paper introduces LayeredSense, a novel framework designed to recognize complex activities by decomposing them into smaller, easily identifiable unit patterns. Utilizing a Myo armband for data collection, our system processes inertial measurement unit (IMU) data to identify basic actions like walking, running, and jumping. These actions are then aggregated to infer more intricate activities such as playing sports or working. LayeredSense employs Gaussian Mixture Models for new pattern detection and machine learning algorithms, including Random Forests, for real-time activity recognition. Our system demonstrates high accuracy in identifying both unit patterns and complex activities, providing a scalable solution for comprehensive daily activity monitoring.

Paperid: 3255, https://arxiv.org/pdf/2502.08471.pdf

Abstract:
In this thesis, we present an articulated, empirical view on what human music making is, and on how this fundamentally relates to computation. The experimental evidence which we obtained seems to indicate that this view can be used as a tool, to systematically generate models, hypotheses and new technologies that enable an ever more complete answer to the fundamental question as to what forms of instrumental control of musical sound are possible to implement. This also entails the development of two novel transducer technologies for computed fingertip touch: The cyclotactor (CT) system, which provides fingerpad-orthogonal force output while tracking surface-orthogonal fingertip movement; and the kinetic surface friction transducer (KSFT) system, which provides fingerpad-parallel force output while tracking surface-parallel fingertip movement. In addition to the main research, the thesis also contains two research excursions, which are due to the nature of the Ph.D. position. The first excursion shows how repeated and varying pressing movements on the already held-down key of a computer keyboard can be used both to simplify existing user interactions and to implement new ones, that allow the rapid yet detailed navigation of multiple possible interaction outcomes. The second excursion shows that automated computational techniques can display shape specifically in the retinal afterimage, a well-known effect in the human visual system.

Paperid: 3256, https://arxiv.org/pdf/2502.07698.pdf

Abstract:
Design assistants are frameworks, tools or applications intended to facilitate both the creative and technical facets of design processes. Large language models (LLMs) are AI systems engineered to analyze and produce text resembling human language, leveraging extensive datasets. This study introduces a framework wherein LLMs are employed as Design Assistants, focusing on three key modalities within the Design Process: Idea Exploration, Dialogue with Designers, and Design Evaluation. Importantly, our framework is not confined to a singular design process but is adaptable across various processes.

Paperid: 3257, https://arxiv.org/pdf/2502.05953.pdf

Abstract:
This paper presents an inexpensive Augmented Reality (AR) application intended for use with mobile devices. Our application is a marker-based AR application, and it can be used with inexpensive three-dimensional (3D) red-cyan glasses. In our AR application, we combine left and right views without creating any uncomfortable situation for human eyes. We validate our mobile AR application on several objects, scenes, and views. We show that 3D AR perception can be obtained by using our inexpensive AR application [Güngör and Kurt 2014].

Paperid: 3258, https://arxiv.org/pdf/2502.05775.pdf

Abstract:
Implicit communication is crucial in human-robot collaboration (HRC), where contextual information, such as intentions, is conveyed as implicatures, forming a natural part of human interaction. However, enabling robots to appropriately use implicit communication in cooperative tasks remains challenging. My research addresses this through three phases: first, exploring the impact of linguistic implicatures on collaborative tasks; second, examining how robots' implicit cues for backchanneling and proactive communication affect team performance and perception, and how they should adapt to human teammates; and finally, designing and evaluating a multi-LLM robotics system that learns from human implicit communication. This research aims to enhance the natural communication abilities of robots and facilitate their integration into daily collaborative activities.

Paperid: 3259, https://arxiv.org/pdf/2502.05120.pdf

Abstract:
This study is motivated by two key considerations: the significant benefits mobile applications offer individuals and businesses, and the limited empirical research on usability challenges. To address this gap, we conducted structured interviews with twelve experts to identify common usability issues. Our findings highlight the top five concerns related to: information architecture, user interface design, performance, interaction patterns, and aesthetics. In addition, we identify five key directions for future research: usability in AI-powered mobile applications, augmented reality (AR) and virtual reality (VR), multimodal interactions, personalized mobile ecosystems, and accessibility. Our study provides insights into emerging usability challenges and trends, contributing to both the theory and practice of mobile human-computer interaction.

Paperid: 3260, https://arxiv.org/pdf/2502.04259.pdf

Abstract:
The Human Cognitive Simulation Framework represents a significant advancement in integrating human cognitive capabilities into artificial intelligence systems. By merging short-term memory (conversation context), long-term memory (interaction context), advanced cognitive processing, and efficient knowledge management, it ensures contextual coherence and persistent data storage, enhancing personalization and continuity in human-AI interactions. The framework employs a unified database that synchronizes these contexts while incorporating logical, creative, and analog processing modules inspired by human brain hemispheric functions to perform structured tasks and complex inferences. Dynamic knowledge updates enable real-time integration, improving adaptability and fostering applications in education, behavior analysis, and knowledge management. Despite its potential to process vast data volumes and enhance user experience, challenges remain in scalability, cognitive bias mitigation, and ethical compliance. This framework lays the foundation for future research in continuous learning algorithms, sustainability, and multimodal adaptability, positioning Cognitive AI as a transformative model in emerging fields.

Paperid: 3261, https://arxiv.org/pdf/2502.03508.pdf

Abstract:
This article focuses on elucidating the concept of consciousness from a relational and post-phenomenological theory of non-human communication agents (ANHC). Specifically, we explore the contributions of Thomas Metzinger's Self Model Theory, Katherine Hayles' conceptualizations of non-conscious cognitive processes centered on knowledge processing phenomena shared between biological and technical systems, and Lenore and Manuel Blum's theoretical perspective on computation, which defines consciousness as an emergent phenomenon of complex computational systems, arising from the appropriate organization of their inorganic materiality. Building on interactions with non-human cognitive agents, among other factors, the explainability of sociotechnical systems challenges the humanistic common sense of modern philosophy and science. This critical integration of various approaches ultimately questions other concepts associated with consciousness, such as autonomy, freedom, and mutual responsibility. The aim is to contribute to a necessary discussion for designing new frameworks of understanding that pave the way toward an ethical and pragmatic approach to addressing contemporary challenges in the design, regulation, and interaction with ANHC. Such frameworks, in turn, enable a more inclusive and relational understanding of agency in an interconnected world.

Paperid: 3262, https://arxiv.org/pdf/2502.03504.pdf

Abstract:
This work reflects upon what Immersion can mean from the perspective of an Artificial Intelligence (AI). Applying the lens of immersive learning theory, it seeks to understand whether this new perspective supports ways for AI participation in cognitive ecologies. By treating AI as a participant rather than a tool, it explores what other participants (humans and other AIs) need to consider in environments where AI can meaningfully engage and contribute to the cognitive ecology, and what the implications are for designing such learning environments. Drawing from the three conceptual dimensions of immersion - System, Narrative, and Agency - this work reinterprets AIs in immersive learning contexts. It outlines practical implications for designing learning environments where AIs are surrounded by external digital services, can interpret a narrative of origins, changes, and structural developments in data, and dynamically respond, making operational and tactical decisions that shape human-AI collaboration. Finally, this work suggests how these insights might influence the future of AI training, proposing that immersive learning theory can inform the development of AIs capable of evolving beyond static models. This paper paves the way for understanding AI as an immersive learner and participant in evolving human-AI cognitive ecosystems.

Paperid: 3263, https://arxiv.org/pdf/2502.03293.pdf

Abstract:
There is no consensus on what constitutes human-centeredness in AI, and existing frameworks lack empirical validation. This study addresses this gap by developing a hierarchical framework of 26 attributes of human-centeredness, validated through practitioner input. The framework prioritizes ethical foundations (e.g., fairness, transparency), usability, and emotional intelligence, organized into four tiers: ethical foundations, usability, emotional and cognitive dimensions, and personalization. By integrating theoretical insights with empirical data, this work offers actionable guidance for AI practitioners, promoting inclusive design, rigorous ethical standards, and iterative user feedback. The framework provides a robust foundation for creating AI systems that enhance human well-being and align with societal values. Future research should explore how these attributes evolve across cultural and industrial contexts, ensuring the framework remains relevant as AI technologies advance.

Paperid: 3264, https://arxiv.org/pdf/2502.02456.pdf

Abstract:
Instructional designers face an overwhelming array of design choices, making it challenging to identify the most effective interventions. To address this issue, I propose the concept of a Model Human Learner, a unified computational model of learning that can aid designers in evaluating candidate interventions. This paper presents the first successful demonstration of this concept, showing that a computational model can accurately predict the outcomes of two human A/B experiments -- one testing a problem sequencing intervention and the other testing an item design intervention. It also demonstrates that such a model can generate learning curves without requiring human data and provide theoretical insights into why an instructional intervention is effective. These findings lay the groundwork for future Model Human Learners that integrate cognitive and learning theories to support instructional design across diverse tasks and interventions.

Paperid: 3265, https://arxiv.org/pdf/2502.01493.pdf

Abstract:
Human-AI collaboration is evolving from a tool-based perspective to a partnership model where AI systems complement and enhance human capabilities. Traditional approaches often limit AI to a supportive role, missing the potential for reciprocal relationships where both human and AI inputs contribute to shared goals. Although Human-Centered AI (HcAI) frameworks emphasize transparency, ethics, and user experience, they often lack mechanisms for genuine, dynamic collaboration. The "Human-AI Handshake Model" addresses this gap by introducing a bi-directional, adaptive framework with five key attributes: information exchange, mutual learning, validation, feedback, and mutual capability augmentation. These attributes foster balanced interaction, enabling AI to act as a responsive partner, evolving with users over time. Human enablers like user experience and trust, alongside AI enablers such as explainability and responsibility, facilitate this collaboration, while shared values of ethics and co-evolution ensure sustainable growth. Distinct from existing frameworks, this model is reflected in tools like GitHub Copilot and ChatGPT, which support bi-directional learning and transparency. Challenges remain, including maintaining ethical standards and ensuring effective user oversight. Future research will explore these challenges, aiming to create a truly collaborative human-AI partnership that leverages the strengths of both to achieve outcomes beyond what either could accomplish alone.

Paperid: 3266, https://arxiv.org/pdf/2502.00030.pdf

Abstract:
This paper examines the evolving performance practices of Ludwig van Beethoven's cello sonatas, with a particular focus on tempo and portamento between 1930 and 2012. It integrates analyses of 22 historical recordings and advancements in recording technology to shed light on changes in interpretative approaches. By comparing Beethoven's metronome markings, as understood through contemporaries such as Czerny and Moscheles, with their application in modern performances, my research highlights notable deviations. These differences illustrate the challenges performers face in reconciling historical tempos with the demands of contemporary performance practice. My study pays special attention to the diminishing use of audible portamento in the latter half of the 20th century, contrasted with a gradual increase in tempo after 1970. This development is linked to broader cultural and pedagogical shifts, including the adoption of fingering techniques that reduce hand shifts, thereby facilitating greater technical precision at faster tempos. Nonetheless, my study identifies the persistence of 'silent portamento' as an expressive device, allowing performers to retain stylistic expression without compromising rhythmic integrity. My paper offers valuable insights for performers and scholars alike, advocating a critical reassessment of Beethoven's tempo markings and the nuanced application of portamento in modern performance practice.

Paperid: 3267, https://arxiv.org/pdf/2502.00011.pdf

Abstract:
Artificial Intelligence (AI) has emerged as a transformative technology with the potential to revolutionize various sectors, from healthcare to finance, education, and beyond. However, successfully implementing AI systems remains a complex challenge, requiring a comprehensive and methodologically sound framework. This paper contributes to this challenge by introducing the Trustworthy, Optimized, Adaptable, and Socio-Technologically harmonious (TOAST) framework. It draws on insights from various disciplines to align technical strategy with ethical values, societal responsibilities, and innovation aspirations. The TOAST framework is a novel approach designed to guide the implementation of AI systems, focusing on reliability, accountability, technical advancement, adaptability, and socio-technical harmony. By grounding the TOAST framework in healthcare case studies, this paper provides a robust evaluation of its practicality and theoretical soundness in addressing operational, ethical, and regulatory challenges in high-stakes environments, demonstrating how adaptable AI systems can enhance institutional efficiency, mitigate risks like bias and data privacy, and offer a replicable model for other sectors requiring ethically aligned and efficient AI integration.

Paperid: 3268, https://arxiv.org/pdf/2502.00005.pdf

Abstract:
Good mental health enables individuals to cope with the normal stresses of life. In Germany, approximately one-quarter of the adult population is affected by mental illnesses. Teletherapy and digital health applications are available to bridge gaps in care and relieve healthcare professionals. The acceptance of these tools is a strongly influencing factor for their effectiveness, which also needs to be evaluated for AI-based conversational agents (CAs) (e.g., ChatGPT, Siri) to assess the risks and potential for integration into therapeutic practice. This study investigates the perspectives of both the general population and healthcare professionals with the following questions: 1. How frequently are CAs used for mental health? 2. How high is the acceptance of CAs in the field of mental health? 3. To what extent is the use of CAs in counselling, diagnosis, and treatment acceptable? To address these questions, two quantitative online surveys were conducted with 444 participants from the general population and 351 healthcare professionals. Statistical analyses show that 27% of the surveyed population already confide their concerns to CAs. Not only experience with this technology but also experience with telemedicine shows a higher acceptance among both groups for using CAs for mental health. Additionally, participants from the general population were more likely to support CAs as companions controlled by healthcare professionals rather than as additional experts for the professionals. CAs have the potential to support mental health, particularly in counselling. Future research should examine the influence of different communication media and further possibilities of augmented intelligence. With the right balance between technology and human care, integration into patient-professional interaction can be achieved.

Paperid: 3269, https://arxiv.org/pdf/2501.19241.pdf

Abstract:
Our world today is facing a confluence of several mutually reinforcing crises, each of which intersects with concerns of social justice and emancipation. This paper is a provocation for the role of computer-mediated information access in our emancipatory struggles. We define emancipatory information retrieval as the study and development of information access methods that challenge various forms of human oppression, and situate its activities within broader collective emancipatory praxis. The term "emancipatory" here signifies the moral concerns of universal humanization of all peoples and the elimination of oppression to create the conditions under which we can collectively flourish. To develop an emancipatory research agenda for information retrieval (IR), in this paper we speculate about the practices that the community can adopt, enumerate some of the projects that the field should undertake, and discuss provocations to spark new ideas and directions for research. We challenge the field of IR research to embrace humanistic values and commit to universal emancipation and social justice. We also invite scholars from fields such as human-computer interaction, information sciences, media studies, design, science and technology studies, social and political sciences, philosophy, law, environmental sciences, public health, educational sciences, as well as legal and policy experts, civil rights advocates, social justice activists and movement organizers, and artists to join us in realizing this transformation. In this process, we must both imagine post-oppressive worlds, and reimagine the role of IR in that world and in the journey that leads us there.

Paperid: 3270, https://arxiv.org/pdf/2501.16084.pdf

Abstract:
Automation and semi-automation through computational tools like LLMs are also making their way to deployment in research synthesis and secondary research, such as systematic reviews. In some steps of research synthesis, this has the opportunity to provide substantial benefits by saving time that previously was spent on repetitive tasks. The screening stages in particular may benefit from carefully vetted computational support. However, this position paper argues for additional caution when bringing in such tools to the analysis and synthesis phases, where human judgement and expertise should be paramount throughout the process.

Paperid: 3271, https://arxiv.org/pdf/2501.15885.pdf

Abstract:
Wireless charging pads are common, yet their functionality is mainly restricted to charging. Existing gesture recognition techniques, such as those based on machine vision and WiFi, have drawbacks like high costs and poor precision. This paper presents a new human-machine interaction solution using multicoil wireless charging pads. The proposed approach leverages the pad's existing modules without additional wearable sensors. It determines gestures by monitoring current and power changes in different coils. The data processing includes noise removal, sorting, highpass filtering, and slicing. A Bayesian network and particle filtering are employed for motion tracking. Through experiments, this solution proves to have wide applications, high recognition accuracy, and low cost. It can effectively identify diverse gestures, increasing the value of wireless charging pads. It outperforms traditional methods, with a 0.73 improvement in recognition accuracy and better environmental adaptability.

Paperid: 3272, https://arxiv.org/pdf/2501.15721.pdf

Abstract:
Music and language are structurally similar. Such structural similarity is often explained by generative processes. This paper describes the recent development of probabilistic generative models (PGMs) for language learning and symbol emergence in robotics. Symbol emergence in robotics aims to develop a robot that can adapt to real-world environments and human linguistic communications and acquire language from sensorimotor information alone (i.e., in an unsupervised manner). This is regarded as a constructive approach to symbol emergence systems. To this end, a series of PGMs have been developed, including those for simultaneous phoneme and word discovery, lexical acquisition, object and spatial concept formation, and the emergence of a symbol system. By extending the models, a symbol emergence system comprising a multi-agent system in which a symbol system emerges is revealed to be modeled using PGMs. In this model, symbol emergence can be regarded as collective predictive coding. This paper expands on this idea by combining the theory that ''emotion is based on the predictive coding of interoceptive signals'' and ''symbol emergence systems,'' and describes the possible hypothesis of the emergence of meaning in music.

Paperid: 3273, https://arxiv.org/pdf/2501.15456.pdf

Abstract:
The emerging field of panoramic video generation from text and image prompts unlocks new creative possibilities in virtual reality (VR), addressing the limitations of current immersive experiences, which are constrained by pre-designed environments that restrict user creativity. To advance this frontier, we present Imagine360, a proof-of-concept prototype that integrates co-creation principles with AI agents. This system enables refined speech-based text prompts, egocentric perspective adjustments, and real-time customization of virtual surroundings based on user perception and intent. An eight-participant pilot study comparing non-AI and linear AI-driven workflows demonstrates that Imagine360's co-creative approach effectively integrates temporal and spatial creative controls. This introduces a transformative VR paradigm, allowing users to seamlessly transition between 'seeing' and 'imagining,' thereby shaping virtual reality through the creations of their minds.

Paperid: 3274, https://arxiv.org/pdf/2501.15304.pdf

Abstract:
This paper presents an approach that combines Human-In-The-Loop Reinforcement Learning (HITL RL) with principles derived from music theory to facilitate real-time generation of musical compositions. HITL RL, previously employed in diverse applications such as modelling humanoid robot mechanics and enhancing language models, harnesses human feedback to refine the training process. In this study, we develop a HITL RL framework that can leverage the constraints and principles in music theory. In particular, we propose an episodic tabular Q-learning algorithm with an epsilon-greedy exploration policy. The system generates musical tracks (compositions), continuously enhancing its quality through iterative human-in-the-loop feedback. The reward function for this process is the subjective musical taste of the user.
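As a rough illustration of the technique named in this abstract, the following is a minimal sketch of episodic tabular Q-learning with an epsilon-greedy policy. It is not the authors' implementation: the toy state and action space (seven scale degrees), the episode length, the hyperparameters, and the `simulated_feedback` function that stands in for a human rating are all assumptions made for the example.

```python
import random

N_NOTES = 7      # assumed toy space: scale degrees 0..6 serve as states and actions
EPISODE_LEN = 8  # assumed number of note choices per generated track


def simulated_feedback(prev_note, next_note):
    """Hypothetical stand-in for a human rating: rewards stepwise motion."""
    return 1.0 if abs(next_note - prev_note) == 1 else 0.0


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon explore; otherwise pick a highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def q_update(q, state, action, reward, next_state, alpha, gamma, terminal):
    """One tabular Q-learning step; no bootstrapping on the last step of an episode."""
    target = reward if terminal else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Run fixed-length episodes and return the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0] * N_NOTES for _ in range(N_NOTES)]
    for _ in range(episodes):
        state = rng.randrange(N_NOTES)  # random opening note
        for step in range(EPISODE_LEN):
            action = epsilon_greedy(q[state], epsilon, rng)
            reward = simulated_feedback(state, action)
            terminal = step == EPISODE_LEN - 1
            q_update(q, state, action, reward, action, alpha, gamma, terminal)
            state = action  # the chosen note becomes the next state
    return q


def generate(q, start=0):
    """Greedy rollout: a track of EPISODE_LEN + 1 notes beginning at `start`."""
    track = [start]
    for _ in range(EPISODE_LEN):
        row = q[track[-1]]
        track.append(row.index(max(row)))
    return track


if __name__ == "__main__":
    print(generate(train(), start=3))
```

In an actual human-in-the-loop setting, `simulated_feedback` would be replaced by a rating collected from the listener, which is far slower and noisier than this stand-in, so the number of episodes is only illustrative. The sketch also leaves the time step out of the state, which a fuller treatment of a fixed-length episode might include.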

Paperid: 3275, https://arxiv.org/pdf/2501.15047.pdf

Abstract:
This study addresses a critical gap in adolescent health education strategies in the Philippines, as highlighted by the Young Adult Fertility and Sexuality (YAFS) survey series, which overlooks the use of games as a medium for disseminating health information. To bridge this gap, the research introduces AHlam Na, a game-based mobile application designed to enhance adolescents' awareness and understanding of key health-related topics. Using a single-group pretest-posttest design, the study involved forty junior high school students from a randomly selected school in the Philippines. They interacted with the application that embedded adolescent health topics into its gameplay. Data collected through pretest and post-test surveys revealed a significant improvement in the students' knowledge and attitudes toward adolescent health after engaging in the game, indicating that game-based learning effectively enhances their learning experience. The positive reception and knowledge gains suggest that AHlam Na is a promising tool for promoting adolescent health awareness. Based on these findings, it is recommended that the application be integrated into the adolescent health curriculum in schools across the Philippines. Future studies should examine the long-term impact of game-based learning on health behaviors and expand the sample size to include more diverse demographic groups. This research contributes to the growing body of literature on game-based learning in health education by demonstrating the potential of digital games to address the limitations of traditional teaching methods. The successful implementation of AHlam Na underscores the importance of exploring gamified learning tools to deliver critical health information to young people effectively.

Paperid: 3276, https://arxiv.org/pdf/2501.14092.pdf

Abstract:
This paper advances a theoretical argument about the role capital plays in structuring CHI research. We introduce the concept of technological capture to theorize the mechanism by which this happens. Using this concept, we decompose the effect on CHI into four broad forms: technological capture creates market-creating, market-expanding, market-aligned, and externality-reducing CHI research. We place different CHI subcommunities into these forms -- arguing that many of their values are inherited from capital underlying the field. Rather than a disciplinary- or conference-oriented conceptualization of the field, this work theorizes CHI as tightly-coupled with capital via technological capture. The paper concludes by discussing some implications for CHI.

Paperid: 3277, https://arxiv.org/pdf/2501.13407.pdf

Abstract:
This paper explores the profound impact of User Experience (UX) design on user retention and conversion rates in mobile applications. As the mobile app market becomes increasingly competitive, understanding how UX design can enhance user satisfaction, engagement, and loyalty is crucial for developers and businesses. Through a comprehensive review of existing literature and statistical insights, this study identifies key UX design principles that contribute to improved user retention and conversion rates. Intuitive navigation, appealing visuals, performance optimization, and integration of user feedback emerge as essential components of effective UX design that drive app success. Applications that prioritize these elements foster a positive user experience, leading to higher engagement and greater retention. Additionally, UX design strategies, such as personalization and customization, have been shown to significantly increase conversion rates, demonstrating the critical role that tailored experiences play in app success. By analyzing these principles and their impact, this paper provides valuable insights for developers aiming to enhance user satisfaction, optimize app performance, and ultimately improve business outcomes.

Paperid: 3278, https://arxiv.org/pdf/2501.12498.pdf

Abstract:
Recent advancements in artificial intelligence have reopened the question about the boundaries of AI autonomy, particularly in discussions around artificial general intelligence (AGI) and its potential to act independently across varied purposes. This paper explores these boundaries through the analysis of the Alignment Research Center experiment on GPT-4 and introduces the Start Button Problem, a thought experiment that examines the origins and limits of AI autonomy. Examining the thought experiment and its counterarguments shows how the need for human activation and purpose definition reveals AI's inherent dependency on human-initiated actions, challenging the assumption of AI as an agent. Finally, the paper addresses the implications of this dependency for human responsibility, questioning the extent of human responsibility when using AI systems.

Paperid: 3279, https://arxiv.org/pdf/2501.11613.pdf

Abstract:
This study introduces Conversation Routines (CR), a structured prompt engineering framework for developing task-oriented dialog systems using Large Language Models (LLMs). While LLMs demonstrate remarkable natural language understanding capabilities, engineering them to reliably execute complex business workflows remains challenging. The proposed CR framework enables the development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task-oriented logic within LLM prompts. This approach provides a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. We demonstrate the framework's effectiveness through two proof-of-concept implementations: a Train Ticket Booking System and an Interactive Troubleshooting Copilot. These case studies validate CR's capability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility. Results show that CR enables domain experts to design conversational workflows in natural language while leveraging custom functions (tools) developed by software engineers, creating an efficient division of responsibilities where developers focus on core API implementation and domain experts handle conversation design. While the framework shows promise in accessibility and adaptability, we identify key challenges including computational overhead, non-deterministic behavior, and domain-specific logic optimization. Future research directions include CR evaluation methods based on prompt engineering frameworks driven by goal-oriented grading criteria, improving scalability for complex multi-agent interactions, and enhancing system robustness to address the identified limitations across diverse business applications.

Paperid: 3280, https://arxiv.org/pdf/2501.10738.pdf

Abstract:
Litrepl is a lightweight text processing tool designed to recognize and evaluate code sections within Markdown or Latex documents. This functionality is useful for both batch document section evaluation and interactive coding within a text editor, provided a straightforward integration is established. Inspired by Project Jupyter, Litrepl aims to facilitate the creation of research documents. In the light of recent developments in software deployment, however, we have shifted our focus from informal reproducibility to enhancing transparency in communication with programming language interpreters, by either eliminating or clearly exposing mutable states within the communication process.
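
The general idea of recognizing and evaluating code sections in a Markdown document can be sketched as follows. This is an illustration only and does not reflect Litrepl's actual implementation or command-line interface: the function name, the regular expression, and the choice to run sections in-process with `exec` are all assumptions made for the example.

```python
import contextlib
import io
import re

TICKS = "`" * 3  # Markdown fence marker, built here to avoid a nested literal fence
SECTION = re.compile(re.escape(TICKS) + r"python\n(.*?)" + re.escape(TICKS), re.DOTALL)


def evaluate_sections(markdown):
    """Run each Python section in order and return the text each one printed.

    All sections share one namespace, so state set by an earlier section is
    visible to later ones. This shared state is the mutable interpreter
    state that the abstract describes as something to eliminate or expose.
    """
    namespace = {}
    outputs = []
    for match in SECTION.finditer(markdown):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), namespace)
        outputs.append(buffer.getvalue())
    return outputs
```

Because the namespace persists between sections, the result of a later section can depend on which earlier sections were evaluated, which is one reason making that state visible matters.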

Paperid: 3281, https://arxiv.org/pdf/2501.10428.pdf

Abstract:
Objective: This study explores a novel deep learning approach for EEG analysis and perceptual state guidance, inspired by Level of Detail (LOD) theory. The goal is to improve perceptual state identification accuracy and advance personalized psychological therapy. Methods: Portable EEG devices and music rhythm signals were used for data collection. LOD theory was applied to dynamically adjust EEG signal processing, extracting core perceptual features. A Unity-based software system integrated EEG data with audio materials. The deep learning model combined a CNN for feature extraction and classification, and a DQN for reinforcement learning to optimize rhythm adjustments. Results: The CNN achieved 94.05% accuracy in perceptual state classification. The DQN guided subjects to target states with a 92.45% success rate, averaging 13.2 rhythm cycles. However, only 50% of users reported psychological alignment with the target state, indicating room for improvement. Discussion: The results validate the potential of LOD-based EEG biofeedback. Limitations include dataset source, label subjectivity, and reward function optimization. Future work will expand to diverse subjects, incorporate varied musical elements, and refine reward functions for better generalization and personalization.

Paperid: 3282, https://arxiv.org/pdf/2501.10384.pdf

Abstract:
This research applies Harold Demsetz's concept of the nirvana approach to the realm of AI governance and debunks three common fallacies in various AI policy proposals--"the grass is always greener on the other side," "free lunch," and "the people could be different." Through this, I expose fundamental flaws in the current AI regulatory proposal. First, some commentators intuitively believe that people are more reliable than machines and that government works better in risk control than companies' self-regulation, but they do not fully compare the differences between the status quo and the proposed replacements. Second, when proposing some regulatory tools, some policymakers and researchers do not realize and even gloss over the fact that harms and costs are also inherent in their proposals. Third, some policy proposals are initiated based on a false comparison between the AI-driven world, where AI does lead to some risks, and an entirely idealized world, where no risk exists at all. However, the appropriate approach is to compare the world where AI causes risks to the real world where risks are everywhere, but people can live well with these risks. The prevalence of these fallacies in AI governance underscores a broader issue: the tendency to idealize potential solutions without fully considering their real-world implications. This idealization can lead to regulatory proposals that are not only impractical but potentially harmful to innovation and societal progress.

Paperid: 3283, https://arxiv.org/pdf/2501.10369.pdf

Abstract:
Procedural Content Generation (PCG) is widely used to create scalable and diverse environments in games. However, existing methods, such as the Wave Function Collapse (WFC) algorithm, are often limited to static scenarios and lack the adaptability required for dynamic, narrative-driven applications, particularly in augmented reality (AR) games. This paper presents a reinforcement learning-enhanced WFC framework designed for mobile AR environments. By integrating environment-specific rules and dynamic tile weight adjustments informed by reinforcement learning (RL), the proposed method generates maps that are both contextually coherent and responsive to gameplay needs. Comparative evaluations and user studies demonstrate that the framework achieves superior map quality and delivers immersive experiences, making it well-suited for narrative-driven AR games. Additionally, the method holds promise for broader applications in education, simulation training, and immersive extended reality (XR) experiences, where dynamic and adaptive environments are critical.
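
A minimal sketch of Wave Function Collapse with adjustable tile weights is given below. It is not the paper's framework: the tile set, the adjacency rules, the use of option count as a stand-in for entropy, and the `adjust_weights` helper (a simple reward-driven nudge in place of a learned RL policy) are all assumptions made for the example. The rules are chosen so that "sand" is compatible with every tile, which means no contradiction can arise; a general implementation would need restarts or backtracking.

```python
import random

TILES = ["water", "sand", "grass"]
# Hypothetical symmetric adjacency rules: which tiles may touch each other.
ALLOWED = {
    "water": {"water", "sand"},
    "sand": {"water", "sand", "grass"},
    "grass": {"sand", "grass"},
}


def neighbours(x, y, w, h):
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < w and 0 <= ny < h:
            yield nx, ny


def propagate(grid, start, w, h):
    """Remove neighbour options that no longer have a compatible tile next to them."""
    stack = [start]
    while stack:
        x, y = stack.pop()
        for cell in neighbours(x, y, w, h):
            kept = {t for t in grid[cell] if any(t in ALLOWED[s] for s in grid[(x, y)])}
            if kept != grid[cell]:
                grid[cell] = kept
                stack.append(cell)


def generate(w, h, weights, rng):
    """Collapse cells one at a time, sampling tiles in proportion to their weights."""
    grid = {(x, y): set(TILES) for x in range(w) for y in range(h)}
    while True:
        open_cells = [c for c, opts in grid.items() if len(opts) > 1]
        if not open_cells:
            return {c: next(iter(opts)) for c, opts in grid.items()}
        cell = min(open_cells, key=lambda c: len(grid[c]))  # fewest options first
        options = sorted(grid[cell])
        tile = rng.choices(options, weights=[weights[t] for t in options])[0]
        grid[cell] = {tile}
        propagate(grid, cell, w, h)


def adjust_weights(weights, tile, reward, lr=0.1):
    """Hypothetical reward-driven update; returns a new dict with positive weights."""
    updated = dict(weights)
    updated[tile] = max(0.01, updated[tile] + lr * reward)
    return updated
```

Calling `generate` again after `adjust_weights` shifts how often the rewarded tile appears while the adjacency rules continue to hold.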

Paperid: 3285, https://arxiv.org/pdf/2501.08070.pdf

Abstract:
Misinformation is a challenging problem. This paper provides the first systematic interdisciplinary investigation of technical and non-technical interventions against misinformation. It combines interviews and a survey to understand which interventions are accepted across academic disciplines and approved by misinformation experts. Four interventions are supported by more than two in three misinformation experts: promoting media literacy, education in schools and universities, finding information about claims, and finding sources for claims. The most controversial intervention is deleting misinformation. We discuss the potentials and risks of all interventions. Education-based interventions are perceived as the most helpful by misinformation experts. Interventions focused on providing evidence are also widely perceived as helpful. We discuss them as scalable and always available interventions that empower users to independently identify misinformation. We also introduce the Phase Model of Misinformation Interventions that helps practitioners make informed decisions about which interventions to focus on and how to best combine interventions.

Paperid: 3286, https://arxiv.org/pdf/2501.07182.pdf

Abstract:
TikTok has gradually become one of the most pervasive social media platforms in our daily lives. While much can be said about the merits of platforms such as TikTok, there is a different kind of attention paid towards the political affect of social media today compared to its impact on other aspects of modern networked reality. I explored how users on TikTok discussed the crisis in Palestine that worsened in 2023. Using network analysis, I situate keywords representing the conflict and categorize them thematically based on a coding schema derived from politically and ideologically differentiable stances. I conclude that activism and propaganda are contending amongst themselves in the thriving space afforded by TikTok today.

Paperid: 3287, https://arxiv.org/pdf/2501.06527.pdf

Abstract:
This exploratory study investigates the intersection of Generative AI tools and experiential learning in business education. Through a case study of an innovative undergraduate course, we examine how students interact with and adapt to various AI modalities, from text-based tools to image generation, alongside real-world experiences. Our findings reveal how this integrated approach enables novice users to overcome creative barriers, accelerates skill acquisition, and creates a dynamic interplay between AI-generated insights and real-world validation. We identify critical interaction challenges, including prompt engineering patterns and the need for more intuitive AI interfaces in educational contexts. These insights inform the design of future AI tools for creative learning and contribute to broader HCI discussions about human-AI collaboration in educational settings.

Paperid: 3288, https://arxiv.org/pdf/2501.05621.pdf

Abstract:
As social media platforms are increasingly adopted, the data people leave behind is shedding new light on our understanding of phenomena, ranging from socio-economic-political events to the spread of infectious diseases. This chapter presents research conducted in the past decade that has harnessed social media data in the service of mental health and well-being. The discussion is organized along three thrusts: a first that highlights how social media data has been utilized to detect and predict risk to varied mental health concerns; a second thrust that focuses on translation paradigms that can enable the use of such social media based algorithms in the real-world; and the final thrust that brings to the fore the ethical considerations and challenges that engender the conduct of this research as well as its translation. The chapter concludes by noting open questions and problems in this emergent area, emphasizing the need for deeper interdisciplinary collaborations and participatory research design, incorporating and centering on human agency, and attention to societal inequities and harms that may result from or be exacerbated in this line of computational social science research.

Paperid: 3289, https://arxiv.org/pdf/2501.05595.pdf

Abstract:
This entry provides an overview of Human-centered Geospatial Data Science, highlighting the gaps it aims to bridge, its significance, and its key topics and research. Geospatial Data Science, which derives geographic knowledge and insights from large volumes of geospatial big data using advanced Geospatial Artificial Intelligence (GeoAI), has been widely used to tackle a wide range of geographic problems. However, it often overlooks the subjective human experiences that fundamentally influence human-environment interactions, and few strategies have been developed to ensure that these technologies follow ethical guidelines and prioritize human values. Human-centered Geospatial Data Science advocates for two primary focuses. First, it advances our understanding of human-environment interactions by leveraging Geospatial Data Science to measure and analyze human subjective experiences at place including emotion, perception, cognition, and creativity. Second, it advocates for the development of responsible and ethical Geospatial Data Science methods that protect geoprivacy, enhance fairness and reduce bias, and improve the explainability and transparency of geospatial technologies. With these two missions, Human-centered Geospatial Data Science brings a fresh perspective to develop and utilize geospatial technologies that positively impact society and benefit human well-being and the humanities.

Paperid: 3290, https://arxiv.org/pdf/2501.05345.pdf

Abstract:
Synchronous data-rich conversations are commonplace within enterprise organizations, taking place at varying degrees of formality between stakeholders at different levels of data literacy. In these conversations, representations of data are used to analyze past decisions, inform future courses of action, and persuade customers, investors, and executives. However, it is difficult to conduct these conversations between remote stakeholders due to poor support for presenting data when video-conferencing, resulting in disappointing audience experiences. In this position statement, I reflect on our recent work incorporating multimodal interaction and augmented reality video, suggesting that video-conferencing does not need to be limited to screen-sharing and relegating a speaker's video to a separate thumbnail view. I also comment on future research directions and collaboration opportunities.

Paperid: 3291, https://arxiv.org/pdf/2501.04429.pdf

Abstract:
To reduce cycles of rejection and redesign -- especially in the absence of clear acceptance criteria and the diversity of possible development paths -- User-Centered Design (UCD) has become a central methodology in computer science, emphasizing the integration of user perspectives throughout the entire system lifecycle. Despite its widespread adoption, however, UCD remains conceptually ambiguous and theoretically underdeveloped. This paper addresses that gap by drawing on the theories of Ernesto Laclau and Jacques Lacan to analyze UCD as a potential empty signifier: a term that gains rhetorical power precisely through its semantic openness. We argue that this ambiguity enables UCD to unify diverse and sometimes conflicting expectations under a shared label, which both empowers participatory design practices and conceals underlying tensions. Acknowledging UCD as an empty signifier allows for a more critical engagement with its practical and symbolic functions, revealing how it can foster inclusivity, empathy, and user empowerment, but also how it risks ideological capture and conceptual dilution. This theoretical reframing opens new pathways for reflection and renewal within sociotechnical system design.

Paperid: 3292, https://arxiv.org/pdf/2501.03266.pdf

Abstract:
LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. In particular, little is known about how users respond when models refuse to answer a prompt, one of the primary mechanisms used to enforce ethical boundaries in LLMs. We address this gap by analyzing nearly 50,000 model comparisons from Chatbot Arena, a platform where users indicate their preferred LLM response in pairwise matchups, providing a large-scale setting for studying real-world user preferences. Using a novel RoBERTa-based refusal classifier fine-tuned on a hand-labeled dataset, we distinguish between refusals due to ethical concerns and technical limitations. Our results reveal a substantial refusal penalty: ethical refusals yield significantly lower win rates than both technical refusals and standard responses, indicating that users are especially dissatisfied when models decline a task for ethical reasons. However, this penalty is not uniform. Refusals receive more favorable evaluations when the underlying prompt is highly sensitive (e.g., involving illegal content), and when the refusal is phrased in a detailed and contextually aligned manner. These findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations, calling for more adaptive moderation strategies that account for context and presentation.
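
The win-rate comparison described in this abstract can be illustrated with a short sketch. The record format, the category labels, and the rule that a tie counts as half a win for each side are assumptions made for the example; they are not taken from the paper or from the Chatbot Arena data schema, and the sample records in the usage note are invented for illustration.

```python
from collections import defaultdict


def win_rates(battles):
    """Compute per-category win rates from pairwise matchups.

    battles: iterable of (category_a, category_b, winner) tuples, where
    winner is "a", "b", or "tie" (hypothetical format). A tie is counted
    as half a win for each side, which is an assumption of this sketch.
    """
    wins = defaultdict(float)
    appearances = defaultdict(int)
    for category_a, category_b, winner in battles:
        appearances[category_a] += 1
        appearances[category_b] += 1
        if winner == "a":
            wins[category_a] += 1.0
        elif winner == "b":
            wins[category_b] += 1.0
        else:
            wins[category_a] += 0.5
            wins[category_b] += 0.5
    return {category: wins[category] / appearances[category] for category in appearances}
```

For instance, passing a list such as `[("ethical_refusal", "standard", "b"), ("standard", "standard", "tie")]` returns one rate per category, which can then be compared across refusal types.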

Paperid: 3293, https://arxiv.org/pdf/2501.01636.pdf

Abstract:
Whisphone is a novel earbud device designed for speech input via whispering. Utilizing canal-type earbuds with a unique microphone placement at the tip of the earplug, it effectively captures whispered voices radiated in the ear canal through bone conduction. This design can boost whispered voice volume with ear canal occlusion effect while simultaneously blocking external noise by sealing the ear hole. By incorporating Active Noise Canceling (ANC), Whisphone can effectively detect subtle whispers, even in noisy environments of up to 80dB(A). Its compact and comfortable design ensures discreet wearability, allowing users to interact with AI assistants hands-free without disturbing others in various daily situations such as offices, homes, or urban public spaces.

Paperid: 3294, https://arxiv.org/pdf/2501.01535.pdf

Abstract:
Drawing on contemporary pragmatist philosophy and linguistic theories on cognition, meaning, and communication, this paper presents a dynamic, metasemantic-metapragmatic taxonomy for grounding and conceptualizing human-like multimodal communicative alignment. The framework is rooted in contemporary developments of the three basic communicative capacities initially identified by American logician and pragmatist philosopher Charles Sanders Peirce: iconic (sensory and perceptual qualities), indexical (contextual and sociocultural associations), and rule-like (symbolic and intuitive reasoning). Expanding on these developments, I introduce the concept of indexical contextualization and propose the principle of "contextualization directionality" for characterizing the crucial metapragmatic capacity for maintaining, navigating, or transitioning between semantic and pragmatic modes of multimodal communication. I contend that current cognitive-social computational and engineering methodologies disproportionately emphasize the semantic/metasemantic domain, overlooking the pivotal role of metapragmatic indexicality in traversing the semantic-pragmatic spectrum of communication. The framework's broader implications for intentionality, identity, affect, and ethics in within-modal and cross-modal human-machine alignment are also discussed.

Paperid: 3295, https://arxiv.org/pdf/2501.00987.pdf

Abstract:
In light of Phillips' contention regarding the impracticality of Search Neutrality, asserting that non-epistemic factors presently dictate result prioritization, our objective in this study is to confront this constraint by questioning prevailing design practices in search engines. We posit that the concept of prioritization warrants scrutiny, along with the consistent hierarchical ordering that underlies this lack of neutrality. We introduce the term Search Plurality to encapsulate the idea of emphasizing the various means a query can be approached. This is demonstrated in a design that prioritizes the display of categories over specific search items, helping users grasp the breadth of their search. Whether a query allows for multiple interpretations or invites diverse opinions, the presentation of categories highlights the significance of organizing data based on relevance, importance, and relative significance, akin to traditional methods. However, unlike previous approaches, this method enriches our comprehension of the overall information landscape, countering the potential bias introduced by ranked lists.
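
The category-first presentation described in this abstract can be sketched as a simple grouping step. The input format of (title, category) pairs and the alphabetical ordering of categories are assumptions made for the example, not details from the paper.

```python
from collections import defaultdict


def group_by_category(results):
    """Group search results under their categories instead of one ranked list.

    results: iterable of (title, category) pairs (hypothetical format).
    """
    groups = defaultdict(list)
    for title, category in results:
        groups[category].append(title)
    # Alphabetical order is used so that no relevance ranking is implied
    # among the categories themselves.
    return {category: groups[category] for category in sorted(groups)}
```

An ambiguous query such as "jaguar" would then surface headings like "Animals" and "Cars" before any individual item, which shows the breadth of interpretations up front.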