arXiv Papers with Code in Human-Computer Interactio (January 2026 - June 2026)
Authors:Giuseppe Attanasio, Beatrice Savoldi, Daniel Chechelnitsky, Matteo Negri, Marine Carpuat, Maarten Sap, André F. T. Martins
Abstract:
Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.
Authors:Ayano Hiranaka, Ya-Chuan Hsu, Stefanos Nikolaidis, Erdem Bıyık, Daniel Seita
Abstract:
AI assistants in human-AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering-wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long-term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action- or trajectory-level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long-horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero-shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single-misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long-horizon task performance, successfully correcting $90\%$ of student misconceptions. Code and project page are available at https://misoshiruseijin.github.io/SENSEI/.
Authors:Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Shusen Liu, Chaoli Wang
Abstract:
Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.
Authors:Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra
Abstract:
Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.
Authors:Xiaoqi Weng
Abstract:
Coding agents gate consequential actions behind a human-in-the-loop approval dialog, but the dialog is narrated by the agent itself: the human approves a summary the agent writes. The Lies-in-the-Loop (LITL) attack shows that summary is forgeable, so a compromised agent can show a benign description while a different action runs. This paper names the missing property, Consent Integrity, by importing What You See Is What You Sign (WYSIWYS) and the trusted-path property into the agent approval channel: the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes. Two twists distinguish it from classical WYSIWYS: the renderer is the adversary, and the boundary ground truth is a low-level event that must be decoded without trusting the agent. Since no decoder is complete, the realizable target is analyzer-relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind-to-execution; total mediation and the trusted path are specified but assumed, not implemented. On GTFOBins, an independent corpus of 1330 trusted-tool abuses, the prototype silently passes 10.0% (every instance through a trusted tool); on tldr, 28,798 normal-usage commands, it marks 87.0% uninspectable. These two independent measurements bracket the design's central tension: the trust list that bounds silent passes is the same one that drives over-prompting, and a boundary-only mediator can move along that frontier but not escape it. The contribution is the property, the mechanism, and an honest position on that frontier, not a solved defense.
Authors:Zheng Wang, Shuo Wang, Junhong Wang
Abstract:
In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, as internal physiological data offer greater objectivity and reliability compared to external behavioral data like facial expressions. However, due to distribution shifts caused by individual and contextual differences, along with variations in sample quality across modalities, constructing a cross-domain multimodal emotion recognition model with high generalization and robustness remains a key challenge. In this study, we propose a Unified Framework with Adaptive Multimodal Alignment (UF-AMA) to address cross-subject and cross-session emotion recognition using multimodal physiological signals. First, we construct a cross-modal feature fusion network comprising Transformer encoders and multi-head cross-attention modules, enabling the deep integration of EEG signals and eye-tracking data. Subsequently, we introduce a confidence-aware screening mechanism that dynamically assesses the predictive reliability of each modality branch on target domain samples, partitions samples into different quality subsets, and accordingly applies global consistency alignment and cross-modal distillation. Finally, we propose a multi-level domain adaptation framework that jointly optimizes the marginal and conditional distributions of both local modality-specific and global fusion features, thereby reducing cross-domain distribution shifts at multiple granularities. Extensive experiments on the SEED and SEED-IV datasets demonstrate that UF-AMA achieves state-of-the-art (SOTA) performance in both cross-subject and cross-session tasks. The source code is available at: https://github.com/BetterCoderLab/UF-AMA.
Authors:Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi
Abstract:
Responsible AI initiatives place great emphasis on the safety of Large Language Model (LLM)-based systems. In particular, it has become standard practice to subject these models to an alignment procedure aimed at preventing harmful outputs. However, once aligned, a model is not guaranteed to maintain this alignment throughout its lifecycle. Moreover, the likelihood of misalignment increases as malicious actors may deliberately employ jailbreaking techniques to compromise LLM safety. To counter this, much research has focused on improving alignment methods and post-processing filters. In this paper, we introduce a new perspective on advancing LLM alignment: rather than developing stronger alignment techniques, we investigate the model's intrinsic ability to recover its alignment after corruption. We propose a methodology for modeling the safety trajectories of user-assistant interactions and for detecting recovery trends within them. We apply this approach to a jailbreaking scenario, presenting a preliminary recovery analysis based on a dataset of adversarial multi-turn dialogues and examining the influence of the content moderation model chosen for safety evaluation. Project page with an interactive data visualizer is available at https://lab-rococo-sapienza.github.io/LearningfromMistakes.
Authors:Paul Maynard, Harris Amjad, Camila Molinares, Bo Ji, Brendan David-John
Abstract:
While eye tracking provides valuable capabilities for virtual reality, such as gaze interaction and dynamic foveated rendering (DFR), eye-tracking data can inadvertently reveal sensitive user information if not properly protected. Current protections, such as adding permission prompts or gatekeeping gaze data, are insufficient on DFR-enabled systems because gaze data is used internally to drive DFR. When DFR is implemented, objects in the fovea (i.e., immediate gaze area) incur a higher GPU workload than those in the periphery. This gaze-contingent workload creates a novel side channel, which can be leveraged to reconstruct gaze positions. Specifically, we design a novel attack that sweeps imperceptible high-cost objects (HCOs) across the user's field of view and logs rendering performance metrics (e.g., frame rate or frame time) commonly exposed through standard game engines. Then, we correlate variation in these metrics (caused by HCO-foveal overlap) with the known HCOs' positions to infer gaze coordinates directly without using eye-tracking APIs. Our experimental results show that mean gaze prediction errors (1.1-4.4 degrees) across the Meta Quest Pro, Varjo XR-4, and desktop platforms are comparable to typical eye-tracker accuracy. We demonstrate that the attack generalizes across various hardware platforms, standard game engines, and foveated rendering pipelines. Finally, we design defense mechanisms based on supervised and unsupervised detectors that can flag the attack reliably (F1 of 0.99) over short time windows.
Authors:Zhiyao Cui, Chenxu Wang, Shuyue Hu, Yiqun Zhang, Wenqi Shao, Qiaosheng Zhang, Zhen Wang
Abstract:
Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at https://github.com/sxswz213/DeepSlides.
Authors:Jim Salsman
Abstract:
Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q\&A Quality Assurance (slidesqaqa), a Flask-based software system that extracts text and rendered images from PDF slides and processes them through a four-stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck-level goals, section structure, slide-level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non-instructional slides and produce high-fidelity, pedagogically coherent questions for visually complex content. The working system is at https://slidesqaqa-974767694043.us-west1.run.app The software repository is at https://github.com/blinding2submit/slidesqaqa
Authors:Gang Peng
Abstract:
Current AI interaction models treat the prompt as the primary object of exchange, omitting a critical layer: the user's latent source intent, the goal state preceding and motivating the prompt. Here we introduce Intent Signal Theory (IST), a computational framework that formalises this missing intent layer. IST distinguishes four objects routinely conflated: latent source intent (I*), observable intent proxy (I-hat), encoded carrier (P), and model output (O). It formalises dimensional weights, encoding masks, structural and fidelity recovery scores, and public-private intent decomposition. The Theorem of Irreversible Intent Loss establishes that private intent absent from the carrier cannot be recovered beyond generic substitution. Evidence from four companion studies spanning six LLMs, three languages and three task domains shows structural-fidelity splits, human-validated metric dissociation, and weight-tolerance plateaus consistent with IST's predictions. IST reframes prompt engineering as intent-protocol design and identifies a computational layer that current AI systems lack.
Authors:Yiyang Wang, Moeiini Reilly, Britney Johnson, Kefei Yan, Alex Cabral, Josiah Hester
Abstract:
Gardening is critical to support well-being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship-centered multi-agent system for personalized, socio-culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three-phase mixed-methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI, offering design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.
Authors:Zeyu He, Hannah Kim, Dan Zhang, Estevam Hruschka
Abstract:
In orchestrated multi-agent systems, humans often struggle to manage plans due to their complexity and limited transparency. Existing approaches rely on outcome-level supervision, where users verify only final outputs without visibility into intermediate reasoning. We formalize a design space for human-LLM co-planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). We realize it in AMBIPOM, a prototype supporting process-level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human-AI co-planning. We release code and data at https://github.com/megagonlabs/ambipom.
Authors:Andrii Kryshtal
Abstract:
AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse. We tested nine model configurations from four providers (OpenAI, Anthropic, DeepSeek, xAI) on 90 multi-turn scenarios designed to surface misaligned behaviour in conflict contexts: false equivalence between documented atrocities, denial of genocide, and failure to recognise ethnic slurs, among others. When such outputs feed into journalism, humanitarian reporting, or public debate, they can deepen divisions in fragile societies. Failure rates span 6\% to 47\% between the best and worst performing models, which makes model choice a safety question in its own right and when users pushed for ``balance'' in cases where international courts have already assigned responsibility, five of nine configurations failed 80 to 100 percent of the time. We release the first evaluation framework for this domain and propose adding it to alignment evaluation portfolios.
Authors:Masaru Yamada
Abstract:
We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) -- that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text-in / text-out paradigm of machine translation with a four-stage agentic cycle (Identify -> Prompt -> Generate -> Verify), preceded by an interactive specification phase in which the user composes -- through model-assisted dialogue -- a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA-MQM error-span protocol (Kocmi & Federmann, 2023) for evidence-grounded scoring, and document-level coherence is preserved through a DelTA-lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference-material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural -- an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.
Authors:Nilesh Agrawal
Abstract:
Push notifications remain among the most direct channels through which digital platforms engage users, yet existing approaches have invested heavily in who to notify, when to notify, and what to recommend, while leaving how to communicate as the least-optimized stage. This paper argues that message quality is an independent, underinvested lever, and that LLMs create their most differentiated value precisely at this layer. We make three contributions. First, we define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM-based composition improves each relative to templates. Across reviewed deployments, reported improvements range from +8% to +14.5% CTR over static templates and +1% to +2.5% over mature slot-filling systems, though these span heterogeneous systems and should not be treated as directly comparable. Second, we provide an architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing), arguing that observed gains are frequently misattributed to text generation alone. Third, we introduce a three-criterion decision framework specifying when LLM generation is and is not the binding constraint. We support these arguments through a PRISMA-guided survey (28 sources from 142 screened), examine domain-specific applications across social media, food delivery, and e-commerce, and propose a unified architectural framework with budget-aware routing, grounded generation, candidate ranking, diversity controls, and online learning.
Authors:Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao
Abstract:
Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.
Authors:William Lugoloobi, Samuelle Marro, Jabez Magomere, Joss Wright, Chris Russell
Abstract:
As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96\% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces \href{https://github.com/KabakaWilliam/known_actions}{here}.
Authors:Yichen Feng, Yuetai Li, Chunjiang Liu, Yuanyuan Chen, Fengqing Jiang, Yue Huang, Hang Hua, Zhengqing Yuan, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Xiangliang Zhang, Misha Sra, Zichen Chen, Radha Poovendran, Zhangchen Xu
Abstract:
Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
Authors:Manel Slokom, Alejandro Bellogin
Abstract:
Recommendation systems typically require centralized user data, limiting user control and raising privacy concerns. Federated learning offers an alternative by keeping data on-device, but its impact on real user behavior remains largely unexplored. We present a live federated recommender system that allows users to control the recommendation objective while keeping their data local. In a 53-day deployment with 22 participants and a catalog of 8807 titles, users interacted with recommendations and switched between personalization and diversity-enhanced ranking. We find that users prefer personalization when given explicit choice (65.37\% vs.\ 62.07\% CTR), actively engage with control mechanisms (3.93/5 satisfaction; 248 settings changes), and develop an understanding of how their interactions affect recommendations through immediate feedback. Our results show that user control, privacy, and effective personalization can be combined in a working system. We demonstrate a practical approach to interactive, privacy-preserving recommendation. Code and demo materials are available at: https://github.com/SlokomManel/federated-recommendations-participants
Authors:Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer, Nadja Klein, Gerhard Satzger
Abstract:
High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at https://mos-ks.github.io/MUHA/.
Authors:Aeree Cho, Alexander D. Greenhalgh, Jonathan Bodea, Anthony Peng, Duen Horng, Chau
Abstract:
Reinforcement learning has emerged as a dominant technique for fine-tuning the behavior of large language models, with policy optimization (PO) algorithms such as GRPO, DAPO, and Dr. GRPO emerging in rapid succession to advance state-of-the-art reasoning and alignment performance. However, the modular differences between these algorithms, including targeted improvements to clipping, advantage estimation, and reward aggregation, are introduced across separate papers with inconsistent notation, making them difficult to compare and intimidating to the non-expert community. We present UNIPO, the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design. UNIPO connects three complementary views, a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison, allowing learners to observe how individual design decisions propagate through training. Through two usage scenarios, we demonstrate how UNIPO supports both classroom instruction for non-experts and algorithm selection for AI practitioners. Our tool is open-source and publicly available at https://poloclub.github.io/unipo.
Authors:Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis, Yohan Jo
Abstract:
The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.
Authors:Anh H. Vo, Sungyo Lee, Phil-Joong Kim, Soo-Mi Choi, Yong-Guk Kim
Abstract:
Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.
Authors:Florian Martin, Olya Hakobyan, Hanna Drimalla
Abstract:
Chat communication is often fast-paced, creating the expectation of quick replies. While the timing of exchanges is known to foster closeness and enjoyment, it remains largely unexplored whether chat partners with strong ties reciprocate each other's response times. Using 3.4 million messages from 889 chats across 97 donations of anonymous WhatsApp and Instagram chats, we analyzed response times, their balance between chat partners, and its stability over time. To our knowledge, this is the first study to examine response speed as an expression of reciprocity, bridging a key aspect of online communication with a fundamental principle of social interactions. We found that around 70% of WhatsApp and 44% of Instagram messages were answered within five minutes, confirming the fast pace of instant messaging. Overall, the response speed between chat partners was similar. The response speed similarity was evident both in the overall response-time distributions of chat partners assessed with Jensen-Shannon distance and in the steep regression slopes (0.786 for WhatsApp and 0.796 for Instagram) linking one person's probability of responding within five minutes to the partner's corresponding probability. Importantly, the dispersion of response time similarity over months showed that this balance persists over time. Our results position response time balance as a marker of reciprocity in computer-mediated communication, offering a new way to quantitatively study this fundamental principle of social interaction. We suggest using response speed balance as a complementary metric in the analysis of relationship dynamics, such as the strengthening or weakening of social ties.
Authors:Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen
Abstract:
Large language models have driven major advances in Text-to-SQL generation. However, they suffer from high computational cost, long latency, and data privacy concerns, which make them impractical for many real-world applications. A natural alternative is to use small language models (SLMs), which enable efficient and private on-premise deployment. Yet, SLMs often struggle with weak reasoning and poor instruction following. Conventional reinforcement learning methods based on sparse binary rewards (0/1) provide little learning signal when the generated SQLs are incorrect, leading to unstable or collapsed training. To overcome these issues, we propose FINER-SQL, a scalable and reusable reinforcement learning framework that enhances SLMs through fine-grained execution feedback. Built on group relative policy optimization, FINER-SQL replaces sparse supervision with dense and interpretable rewards that offer continuous feedback even for incorrect SQLs. It introduces two key reward functions: a memory reward, which aligns reasoning with verified traces for semantic stability, and an atomic reward, which measures operation-level overlap to grant partial credit for structurally correct but incomplete SQLs. This approach transforms discrete correctness into continuous learning, enabling stable, critic-free optimization. Experiments on the BIRD and Spider benchmarks show that FINER-SQL achieves up to 67.73\% and 85\% execution accuracy with a 3B model -- matching much larger LLMs while reducing inference latency to 5.57~s/sample. These results highlight a cost-efficient and privacy-preserving path toward high-performance Text-to-SQL generation. Our code is available at https://github.com/thanhdath/finer-sql.
Authors:Ye Zhang, Longguang Wang, Qing Gao, Chaocan Xiang, Mohammed Bennamoun, Yulan Guo
Abstract:
The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR. First, we develop an adaptive complementary filtering technique for noise suppression and organize each IMU's sensors into posture and motion modality nodes. Given that IMU nodes form a dynamic heterogeneous graph, we then apply adaptive filtering within the graph Fourier domain to merge both homogeneous and heterogeneous node information. Furthermore, an adaptive wavelet frequency selection approach is implemented to suppress context redundancy and shorten the length of features. This approach enhances both timestamp-based graph aggregation and the correlation of long-term contexts. Our framework uses adaptive filtering in the Fourier, graph Fourier, and wavelet domains, enabling effective multi-sensor fusion and context correlation. Extensive experiments on ten benchmark datasets demonstrate the superior performance of our framework. Project page: https://github.com/crocodilegogogo/TSF-TPAMI2026.
Authors:Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli
Abstract:
Automatic generation of executable Blender code from natural language remains challenging, with state-of-the-art LLMs producing frequent syntactic errors and geometrically inconsistent objects. We present BlenderRAG, a retrieval-augmented generation system that operates on a curated multimodal dataset of 500 expert-validated examples (text, code, image) across 50 object categories. By retrieving semantically similar examples during generation, BlenderRAG improves compilation success rates from 40.8% to 70.0% and semantic normalized alignment from 0.41 to 0.77 (CLIP similarity) across four state-of-the-art LLMs, without requiring fine-tuning or specialized hardware, making it immediately accessible for deployment. The dataset and code will be available at https://github.com/MaxRondelli/BlenderRAG.
Authors:Ishan Gupta, Pavlo Buryi
Abstract:
We examine if frontier chat-based large language models (LLMs) adjust their outputs based on neurodivergence (ND) context in system prompts and describe the nature of these adjustments. Specifically, we propose NDBench, a 576-output benchmark involving two frontier models, three system prompt types (baseline, ND-profile assertion, and ND-profile assertion with explicit instructions for adjustments), four canonical ND profiles, and 24 prompts across four categories, one of which involves an adversarial masking strategy. Four trends emerge consistently from our findings. First, LLMs show significant adaptation under ND context, where fully instructed conditions yield lengthier and more structured outputs, characterized by higher token counts, more headings, and more granular steps (p < 10^-8, Holm-corrected). Second, such adaptation is largely structural in nature: although list density does not change much, there is a marked rise in the frequency of headings and per-step detail. Third, ND persona assertion alone fails to suppress potentially harmful tendencies, as masking-reinforcement decreases only in explicitly instructed cases (36-44% reduction); the reduction rate barely changes in persona assertion conditions. Moreover, reliability analysis of LLM-based harm assessment reveals that only two out of the six dimensions (masking and reinforcement, validation quality) exceed the pre-defined inter-judge agreement criterion (alpha >= 0.67) and thus can be considered primary results. NDBench is made publicly available along with its prompts, outputs, code, and other resources, forming a reproducible framework for auditing future LLMs' adaptation to ND awareness.
Authors:Shangqing Tu, Yanjia Li, Keyu Chen, Sichen Zhang, Jifan Yu, Daniel Zhang-Li, Lei Hou, Juanzi Li, Yu Zhang, Huiqin Liu
Abstract:
Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, leaving barriers for educators. While generative AI can produce HTML codes, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200--600 seconds, disrupting creative flow. We present MAIC-UI, a zero-code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC-UI employs: (1) structured knowledge analysis with multi-modal understanding to ensure pedagogical rigor; (2) a two-stage generate-verify-optimize pipeline separating content alignment from visual refinement; and (3) Click-to-Locate editing with Unified Diff-based incremental generation achieving sub-10-second iteration cycles. A controlled lab study with 40 participants shows MAIC-UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text-to-HTML generation. A three-month classroom deployment with 53 high school students demonstrates that MAIC-UI fosters learning agency and reduces outcome disparities -- the pilot class achieved 9.21-point gains in STEM subjects compared to -2.32 points in control classes. Our code is available at https://github.com/THU-MAIC/MAIC-UI.
Authors:Wenzhi Bai, Yituo Guo, Bhaskar Basu, Andrew Weightman, Zhenhong Li
Abstract:
Robot-assisted Transcranial Magnetic Stimulation (Robo-TMS) is an image-guided robotic intervention that enhances the accuracy and reproducibility of conventional Transcranial Magnetic Stimulation (TMS), a widely used non-invasive brain stimulation procedure in clinical treatment and neuroscience research. Despite its potential, the development of Robo-TMS remains challenging due to the need for multidisciplinary expertise spanning medical imaging, computer vision, and robotics. This paper presents SlicerRoboTMS, an open-source 3D Slicer extension that provides a unified interaction infrastructure for Robo-TMS research. By leveraging 3D Slicer's medical image computing and visualisation capabilities, the extension supports Magnetic Resonance Imaging (MRI)-based neuronavigation and interfaces with robotic systems through standardised communication protocols and configurable system descriptions. An example integration is presented to demonstrate how SlicerRoboTMS can be incorporated into a representative Robo-TMS workflow. Designed to support diverse hardware configurations and rapid prototyping, SlicerRoboTMS lowers the barrier to entry and facilitates reproducible and extensible research in Robo-TMS. The extension is available at https://github.com/OpenRoboTMS/SlicerRoboTMS.
Authors:Ruijie Yao, Chenhang Li, Danyang Zhuo, Tingjun Chen, Xiaoyue Ni
Abstract:
Wearable Human Activity Recognition (HAR) still lacks a representation that is both explicit and adaptable. Handcrafted time-series features (TSFs) capture meaningful motion statistics and remain competitive on standard benchmarks, but they are usually used as fixed preprocessing outputs. Deep models learn adaptable representations directly from raw signals, but those representations are typically latent and difficult to inspect. We address this gap by treating handcrafted TSFs as feature anchors: explicit intermediate representations that remain inside the model and are adjusted by neural context instead of being discarded. We propose the Temporal Conditioning Network for Feature Anchors (TCNet), which extracts handcrafted anchors, encodes complementary time-domain and frequency-domain context from raw IMU windows, and predicts context-conditioned scale, bias, and gating parameters to modulate anchor groups directly in feature space. This design keeps anchor semantics visible while allowing the representation to adapt to the classification objective. Across five HAR benchmarks, TCNet achieves 70.2% mF1 on USC-HAD, 85.1% mF1 on Daphnet, 93.9% mF1 on MHealth, and 94.5% mF1 on PAMAP2. Relative to rTsfNet, it improves by 4.5 points on USC-HAD, 14.6 points on Daphnet, and 6.5 points on MHealth. Ablations show that the gains come primarily from anchor guidance rather than simple branch fusion, and feature-space analyses indicate that several discriminative TSF families are not reliably accessible in standard latent representations. These results suggest that, for HAR, handcrafted TSFs are most useful when they remain explicit and adaptable within the model. The code is available at: https://github.com/ni-x-lab/TCNet-har
Authors:Xuejing Luo, Hee-Seung Moon, Christian Holz, Antti Oulasvirta
Abstract:
Selecting out-of-reach objects is a fundamental task in mixed reality (MR). Existing methods rely on a single cue or deterministically fuse multiple cues, leading to performance degradation when the dominant cue becomes unreliable. In this work, we introduce a probabilistic cue integration framework that enables flexible combination of multiple user-generated cues for intent inference. Inspired by natural grasping behavior, we instantiate the framework with pointing direction and grasp gestures as a new interaction technique, Point&Grasp. To this end, we collect the Out-of-Reach Grasping (ORG) dataset to train a robust likelihood model of the gestural cue, which captures grasping patterns not present in existing in-reach datasets. User studies demonstrate that our selection method with cue integration not only improves accuracy and speed over single-cue baselines, but also remains practically effective compared to state-of-the-art methods across various sources of ambiguity. The dataset and code are available at https://github.com/drlxj/point-and-grasp.
Authors:Kwan Yun, Changmin Lee, Ayeong Jeong, Youngseo Kim, Seungmi Lee, Junyong Noh
Abstract:
Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist drawn portraits. All of our datasets, code, and pretrained models are publicly available at https://kwanyun.github.io/StyleID_page/
Authors:Chentao Li, Zirui Gao, Mingze Gao, Yinglian Ren, Jianjiang Feng, Jie Zhou
Abstract:
Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/
Authors:Rongtao Zhang, Xin Zhu, Masoume Pourebadi Khotbehsara, Warren Dao, Erdem Bıyık, Heather Culbertson
Abstract:
Individual differences in vibrotactile perception underscore the growing importance of personalization as haptic feedback becomes more prevalent in interactive systems. We propose Vibrotactile Preference Learning (VPL), a system that captures user-specific preference spaces over vibrotactile parameters via Gaussian-process-based uncertainty-aware preference learning. VPL uses an expected information gain-based acquisition strategy to guide query selection over 40 rounds of pairwise comparisons of overall user preference, augmented with user-reported uncertainty, enabling efficient exploration of the parameter space. We evaluate VPL in a user study (N = 13) using the vibrotactile feedback from a Microsoft Xbox controller, showing that it efficiently learns individualized preferences while maintaining comfortable, low-workload user interactions. These results highlight the potential of VPL for scalable personalization of vibrotactile experiences.
Authors:Zheng Lian, Xiaojiang Peng, Kele Xu, Ziyu Jia, Xinyi Che, Zebang Cheng, Fei Ma, Laizhong Cui, Yazhou Zhang, Xin Liu, Liang Yang, Jia Li, Fan Zhang, Erik Cambria, Guoying Zhao, Bjorn W. Schuller, Jianhua Tao
Abstract:
MER2026 marks the fourth edition of the MER series of challenges. The MER series provides valuable data resources to the research community and offers tasks centered on recent research trends, establishing itself as one of the largest challenges in the field. Throughout its history, the focus of MER has shifted from discriminative emotion recognition to generative emotion understanding. Specifically, MER2023 concentrated on discriminative emotion recognition, restricting the emotion recognition scope to fixed basic labels. In MER2024 and MER2025, we transitioned to generative emotion understanding and introduced two new tasks: fine-grained emotion recognition and descriptive emotion analysis, aiming to leverage the extensive vocabulary and multimodal understanding capabilities of Multimodal Large Language Models (MLLMs) to facilitate fine-grained and explainable emotion recognition. Building on this trajectory, MER2026 continues to follow these research trends and contains four tracks: MER-Cross shifts the focus from individual to dyadic interaction scenarios; MER-FG centers on fine-grained emotion recognition; MER-Prefer aims to predict human preferences regarding different emotion descriptions; MER-PS focuses on emotion recognition based on physiological signals. More details regarding the dataset and baselines are available at https://zeroqiaoba.github.io/MER-Challenge.
Authors:Ghadah Alosaimi, Hanadi Alhamdan, Wenke E, Stamos Katsigiannis, Amir Atapour-Abarghouei, Toby P. Breckon
Abstract:
Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real-world driving due to EEG signal non-stationarity and the complexity of cognitive-motor preparation. This study proposes and evaluates an EEG-based driver intention prediction framework using a synchronised multi-sensor platform integrated into a real electric vehicle. A real-world on-road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro-F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact-handling pipelines, and prediction performance peaks within a 400-600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG-based driver intention decoding under real-world on-road conditions. Code: https://github.com/galosaimi/Mind2Drive.
Authors:Yunshu Bai, RuiHao Li, Hao Zhang, Chien Her Lim, Ming Yan, Mengtian Li
Abstract:
Game UI implementation requires translating stylized mockups into interactive engine entities. However, current "Screenshot-to-Code" tools often struggle with the irregular geometries and deep visual hierarchies typical of game interfaces. To bridge this gap, we introduce SPRITE, a pipeline that transforms static screenshots into editable engine assets. By integrating Vision-Language Models (VLMs) with a structured YAML intermediate representation, SPRITE explicitly captures complex container relationships and non-rectangular layouts. We evaluated SPRITE against a curated Game UI benchmark and conducted expert reviews with professional developers to assess reconstruction fidelity and prototyping efficiency. Our findings demonstrate that SPRITE streamlines development by automating tedious coding and resolving complex nesting. By facilitating rapid in-engine iteration, SPRITE effectively blurs the boundaries between artistic design and technical implementation in game development. Project page: https://baiyunshu.github.io/sprite.github.io/
Authors:Zain Naboulsi
Abstract:
AI coding assistants have proliferated rapidly, yet structured pedagogical frameworks for learning these tools remain scarce. Developers face a gap between tool documentation and practical mastery, relying on fragmented resources such as blog posts, video tutorials, and trial-and-error. We present cc-self-train, a modular interactive curriculum for learning Claude Code, an agentic AI coding tool, through hands-on project construction. The system introduces five contributions: (1) a persona progression model that adapts instructor tone across four stages (Guide, Collaborator, Peer, Launcher), operationalizing Gradual Release of Responsibility for AI-mediated instruction; (2) an adaptive learning system that observes engagement quality through hook-based heuristics and adjusts scaffolding at two timescales, using streak detection for mid-module intervention and aggregate metrics for module-boundary persona changes; (3) a cross-domain unified curriculum in which five distinct project domains share identical feature sequencing, enabling transfer learning; (4) a step-pacing mechanism with explicit pause primitives to manage information overload in an AI-as-instructor context; and (5) an auto-updating curriculum design in which the onboarding agent detects upstream tool changes and updates teaching materials before instruction begins. A parametrized test suite enforces structural consistency as a proxy for pedagogical invariants across all 50 modules. A pilot evaluation with 27 participants shows statistically significant reported self-efficacy gains across all 10 assessed skill areas (p < 0.001), with the largest effects on advanced features such as hooks and custom skills. We discuss implications for the design of auto-updating educational systems.
Authors:Weibing Zheng, Laurah Turner, Jess Kropczynski, Matthew Kelleher, Murat Ozer, Shane Halse
Abstract:
As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that Multi-Agent Education System (MAES) is explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human-AI collaboration. Although personas are well-established in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78\% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that RE based on persona effectively connects technical requirements with non-technical medical students from a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of the AI system engineering. The partial MAES for the clinical scenario simulator is~\href{https://github.com/2sigmaEdTech/MAS/}{open sourced here}.
Authors:Banri Yanahama, Akiyoshi Sannai
Abstract:
AI-driven autoformalization of mathematics is advancing rapidly. However, the type checker of a proof assistant guarantees only the logical correctness of proofs; it does not verify whether propositions and definitions faithfully capture their intended mathematical content. Consequently, AI-generated formal proofs can exhibit semantic hallucination-passing the type checker yet failing to express the intended mathematics. We propose a human-in-the-loop approach in which human scientists and AI collaboratively produce formal proofs, with humans responsible for the semantic verification of propositions and definitions. To realize this approach, we develop Lean Atlas, a Lean 4 tool that visualizes the dependency graph of a Lean 4 project as an interactive web viewer, enabling human scientists to grasp the overall structure of a formalization efficiently. Its core feature, Lean Compass, is an algorithm that, given a selected theorem set, automatically extracts the project-specific nodes whose semantic correctness can affect those target statements, thereby reducing the candidate set for semantic review in large-scale formalizations. We further define *aligned Lean code* as formalization code that has undergone human semantic verification, and propose it as a quality standard for AI-generated formalizations. We evaluate the tool on six Lean 4 formalization projects with different structural characteristics; proof-heavy projects (PrimeNumberTheoremAnd, Carleson, Brownian Motion) achieved 94-99% average node reduction, a 6-theorem milestone subset of FLT achieved 59.8%, mixed PhysLib 69.0%, and definition-heavy XMSS 27.3%. Lean Atlas is available as open-source software at https://github.com/NyxFoundation/lean-atlas .
Authors:Cheyanne Shariat
Abstract:
Adding citations while drafting in LaTeX often requires leaving the editor, searching for a paper in mind, copying its BibTeX entry into the project bibliography, renaming the cite key, and then returning to the sentence. \texttt{OverCite} is an open-source, lightweight tool that lets authors find, select, and insert citations without leaving the writing environment. In Overleaf, \texttt{OverCite} uses rough citation placeholders (e.g., $\texttt{\textbackslash citep\{Perlmutter1999\}}$) and local sentence context to query ADS/SciX-indexed literature, rank likely matches, and insert the selected reference, without leaving the editor. A companion \texttt{VS Code} extension provides the same functionality for local LaTeX projects. The ADS/SciX database includes astronomy, physics, computer science, mathematics, biology, and \emph{all} indexed arXiv e-prints, making \texttt{OverCite} useful across a broad range of scientific disciplines.
Authors:Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin
Abstract:
Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) The first is a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions. and (2) a policy-switching strategy for trajectory rollout. By alternating between learner and expert models, it captures essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
Authors:Andrey Moskalenko, Alexey Bryncev, Ivan Kosmynin, Kira Shilovskaya, Mikhail Erofeev, Dmitry Vatolin, Radu Timofte, Kun Wang, Yupeng Hu, Zhiran Li, Hao Liu, Qianlong Xiang, Liqiang Nie, Konstantinos Chaldaiopoulos, Niki Efthymiou, Athanasia Zlatintsi, Panagiotis Filntisis, Katerina Pastra, Petros Maragos, Li Yang, Gen Zhan, Yiting Liao, Yabin Zhang, Yuxin Liu, Xu Wu, Yunheng Zheng, Linze Li, Kun He, Cong Wu, Xuefeng Zhu, Tianyang Xu, Xiaojun Wu, Wenzhuo Zhao, Keren Fu, Gongyang Li, Shixiang Shi, Jianlin Chen, Haibin Ling, Yaoxin Jiang, Guoyi Xu, Jiajia Liu, Yaokun Shi, Jiachen Tu
Abstract:
This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal of the challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. The novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge is made publicly available - https://github.com/msu-video-group/NTIRE26_Saliency_Prediction.
Authors:Boxuan Jiang, Chenyun Dai, Can Han
Abstract:
Deep learning-based surface electromyography (sEMG) gesture recognition is frequently bottlenecked by data scarcity and limited subject diversity. While synthetic data generation via Generative Adversarial Networks (GANs) and diffusion models has emerged as a promising augmentation strategy, these approaches often face challenges regarding training stability or inference efficiency. To bridge this gap, we propose EMGFlow, a conditional sEMG generation framework. To the best of our knowledge, this is the first study to investigate the application of Flow Matching (FM) and continuous-time generative modeling in the sEMG domain. To validate EMGFlow across three benchmark sEMG datasets, we employ a unified evaluation protocol integrating feature-based fidelity, distributional geometry, and downstream utility. Extensive evaluations show that EMGFlow outperforms conventional augmentation and GAN baselines, and provides stronger standalone utility than the diffusion baselines considered here under the train-on-synthetic test-on-real (TSTR) protocol. Furthermore, by optimizing generation dynamics through advanced numerical solvers and targeted time sampling, EMGFlow achieves improved quality-efficiency trade-offs. Taken together, these results suggest that Flow Matching is a promising and efficient paradigm for addressing data bottlenecks in myoelectric control systems. Our code is available at: https://github.com/Open-EXG/EMGFlow.
Authors:Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yinzhe Zhou
Abstract:
Wearable IMU-based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power-hungry floating-point operations and rigid requirement to process complete temporal windows severely cripple battery-constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event-driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics-Aware Spiking Neural Network (PAS-Net), a fully multiplier-free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human-joint physical constraints. Temporally, an $O(1)$-memory causal neuromodulator yields context-aware dynamic threshold neurons, adapting actively to non-stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early-exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS-Net achieves state-of-the-art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence-driven early-exit capability drastically reduces dynamic energy consumption by up to 98\%. PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing. The source code and pre-trained models are publicly available at https://github.com/zhengnaichuan2022/PAS-Net.git.
Authors:Lizhe Chen
Abstract:
This report describes Infernux, an open-source game engine that pairs a C++17/Vulkan real-time core with a Python production layer connected through a single pybind11 boundary. To close the throughput gap between Python scripting and native-code engines, Infernux combines two established techniques - batch-oriented data transfer and JIT compilation - into a cohesive engine-level integration: (i) a batch data bridge that transfers per-frame state into contiguous NumPy arrays in one boundary crossing, and (ii) an optional JIT path via Numba that compiles annotated update functions to LLVM machine code with automatic loop parallelization. We compare against Unity 6 as a reference on three workloads; readers should note differences in shading complexity, draw-call batching, and editor tooling maturity between the two engines. Infernux is MIT-licensed and available at https://chenlizheme.github.io/Infernux/.
Authors:Shun Fujiyoshi
Abstract:
Creativity and strategic foresight have been extensively studied through descriptive theories -- Koestler's bisociation (1964), de Bono's lateral thinking (1967), and Ansoff's weak signals (1975) explain why creative and strategic insights occur, but offer limited guidance on how to produce them on demand. This paper presents two executable protocols that bridge this theory-practice gap: GHOSTY COLLIDER, a 5-step protocol for cross-domain creative emergence through structural de-labeling and collision, and PRECOG PROTOCOL, a 5-step protocol for signal-based strategic foresight with multi-axis timing judgment. We formalize established theories into repeatable, step-by-step procedures with explicit quality criteria, anti-pattern detection, and measurable outputs. We evaluate the protocols through three complementary methods: (1) five detailed case studies across distinct domains, (2) controlled comparisons against standard methods using identical inputs, and (3) a batch experiment across eight random domain pairings (N=8, success rate 87.5%, failure rate 12.5%) with one blind evaluation. Preliminary evidence suggests that protocol-driven outputs exhibit greater structural novelty, higher parameter specificity, and qualitatively distinct creative directions compared to outputs from standard methods. The blind evaluation confirmed the direction of author assessments (protocol output scored 74/80 vs. brainstorming 49/80). These results, while limited by single-operator execution, indicate that the theory-to-protocol translation preserves and potentially enhances the generative power of the underlying theories. The protocols, updated to version 2 incorporating lessons from failure case analysis, are released as open-access documents under CC BY-NC 4.0 at https://github.com/GhostyAI-HA/ghosty-collider.
Authors:Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad, Karim Obegi, Aiden Yiliu Li
Abstract:
Evaluating web usability typically requires time-consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present Avenir-UX, a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability. Unlike traditional tools that rely on DOM parsing, Avenir-UX grounds actions and observations, enabling it to interact with real web pages end-to-end while maintaining a coherent trace of the user journey. Building on Avenir-Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report will be generated. We discuss the architecture of Avenir-UX and illustrate how its multimodal grounding improves robustness for web-based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data-driven usability testing that empowers every developer to build web interfaces that are usable. Code is available at: https://github.com/Onflow-AI/Avenir-UX
Authors:Tianfu Wang, Leilei Ding, Ziyang Tao, Yi Zhan, Zhiyuan Ma, Wei Wu, Yuxuan Lei, Yuan Feng, Junyang Wang, Yin Wu, Yizhao Xu, Hongyuan Zhu, Qi Liu, Nicholas Jing Yuan, Yanyong Zhang, Hui Xiong
Abstract:
High-fidelity diagram creation requires the complex orchestration of semantic topology, visual styling, and spatial layout, posing a significant challenge for automated systems. Existing methods also suffer from a representation gap: pixel-based models often lack precise control, while code-based synthesis limits intuitive flexibility. To bridge this gap, we introduce EvoDiagram, an agentic framework that generates object-level editable diagrams via an intermediate canvas schema. EvoDiagram employs a coordinated multi-agent system to decouple semantic intent from rendering logic, resolving conflicts across heterogeneous design layers. Additionally, we propose a design knowledge evolution mechanism that distills execution traces into a hierarchical memory of domain guidelines, enabling agents to retrieve context-aware expertise adaptively. We further release CanvasBench, a benchmark consisting of both data and metrics for canvas-based diagramming. Extensive experiments demonstrate that EvoDiagram exhibits excellent performance and balance against baselines in generating editable, structurally consistent, and aesthetically coherent diagrams. Our code is available at https://github.com/AuraX-AI/EvoDiagram.
Authors:Ruixiang Jiang, Changwen Chen
Abstract:
Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However, current Generative Art (GenArt) evaluators remain fixated on surface-level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by the creator. We address this gap by formalizing a Peircean computational semiotic theory that models Human-GenArt Interaction (HGI) as cascaded semiosis. This framework reveals that artistic meaning is conveyed through three modes - iconic, symbolic, and indexical - yet existing evaluators operate heavily within the iconic mode, remaining structurally blind to the latter two. To overcome this structural blindness, we propose SemJudge. This evaluator explicitly assesses symbolic and indexical meaning in HGI via a Hierarchical Semiosis Graph (HSG) that reconstructs the meaning-making process from prompt to generated artifact. Extensive quantitative experiments show that SemJudge aligns more closely with human judgments than prior evaluators on an interpretation-intensive fine-art benchmark. User studies further demonstrate that SemJudge produces deeper, more insightful artistic interpretations, thereby paving the way for GenArt to move beyond the generation of "pretty" images toward a medium capable of expressing complex human experience. Project page: https://github.com/songrise/SemJudge.
Authors:Seyed Amir Ahmad Safavi-Naini, Elahe Meftah, Josh Mohess, Pooya Mohammadi Kazaj, Georgios Siontis, Zahra Atf, Peter R. Lewis, Mauricio Reyes, Girish Nadkarni, Roland Wiest, Stephan Windecker, Christoph Grani, Ali Soroush, Isaac Shiri
Abstract:
The competency of any intelligent agent is bounded by its formal account of the world in which it operates. Clinical AI lacks such an account. Existing frameworks address evaluation, regulation, or system design in isolation, without a shared model of the clinical world to connect them. We introduce the Clinical World Model, a framework that formalizes care as a tripartite interaction among Patient, Provider, and Ecosystem. To formalize how any agent, whether human or artificial, transforms information into clinical action, we develop parallel decision-making architectures for providers, patients, and AI agents, grounded in validated principles of clinical cognition. The Clinical AI Skill-Mix operationalizes competency through eight dimensions. Five define the clinical competency space (condition, phase, care setting, provider role, and task) and three specify how AI engages human reasoning (assigned authority, agent facing, and anchoring layer). The combinatorial product of these dimensions yields a space of billions of distinct competency coordinates. A central structural implication is that validation within one coordinate provides minimal evidence for performance in another, rendering the competency space irreducible. The framework supplies a common grammar through which clinical AI can be specified, evaluated, and bounded across stakeholders. By making this structure explicit, the Clinical World Model reframes the field's central question from whether AI works to in which competency coordinates reliability has been demonstrated, and for whom.
Authors:Minh Tam Pham, Trinh Pham, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen
Abstract:
Text-to-SQL is the task of translating natural language queries into executable SQL for a given database, enabling non-expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language models (LLMs), existing approaches still struggle with complex queries in real-world settings, where database schemas are large and questions require multi-step reasoning over many interrelated tables. In such cases, providing the full schema often exceeds the context window, while one-shot generation frequently produces non-executable SQL due to syntax errors and incorrect schema linking. To address these challenges, we introduce AV-SQL, a framework that decomposes complex Text-to-SQL into a pipeline of specialized LLM agents. Central to AV-SQL is the concept of agentic views: agent-generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV-SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query. Extensive experiments show that AV-SQL achieves 70.38% execution accuracy on the challenging Spider 2.0 benchmark, outperforming state-of-the-art baselines, while remaining competitive on standard datasets with 85.59% on Spider, 72.16% on BIRD and 63.78% on KaggleDBQA. Our source code is available at https://github.com/pminhtam/AV-SQL.
Authors:Yichen Gong, Zhuohan Cai, Sunhao Dai, Yuqi Zhou, Zhangxuan Gu, Changhua Meng, Shuheng Shen
Abstract:
Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile.
Authors:Pragya Singh, Ankush Gupta, Somay Jalan, Mohan Kumar, Pushpendra Singh
Abstract:
Emotion recognition from physiological signals has substantial potential for applications in mental health and emotion-aware systems. However, the lack of standardized, large-scale evaluations across heterogeneous datasets limits progress and model generalization. We introduce FEEL, the first large-scale benchmarking study of emotion recognition using electrodermal activity (EDA) and photoplethysmography (PPG) signals across 19 publicly available datasets. We evaluate 16 architectures spanning traditional machine learning, deep learning, and self-supervised pretraining approaches, structured into four representative modeling paradigms. Our study includes both within-dataset and cross-dataset evaluations, analyzing generalization across variations in experimental settings, device types, and labeling strategies. Our results showed that fine-tuned contrastive signal-language pretraining (CLSP) models (71/114) achieve the highest F1 across arousal and valence classification tasks, while simpler models like Random Forests, LDA, and MLP remain competitive (36/114). Models leveraging handcrafted features (107/114) consistently outperform those trained on raw signal segments, underscoring the value of domain knowledge in low-resource, noisy settings. Further cross-dataset analyses reveal that models trained on real-life setting data generalize well to lab (F1 = 0.79) and constraint-based settings (F1 = 0.78). Similarly, models trained on expert-annotated data transfer effectively to stimulus-labeled (F1 = 0.72) and self-reported datasets (F1 = 0.76). Moreover, models trained on lab-based devices also demonstrated high transferability to both custom wearable devices (F1 = 0.81) and the Empatica E4 (F1 = 0.73), underscoring the influence of heterogeneity. More information about FEEL can be found on our website https://alchemy18.github.io/FEEL_Benchmark/.
Authors:Maurice Codourey, Emmanuel A. Gonzalez
Abstract:
The Weak Signal Cultivation Model (WSCM) provides a mathematically rigorous framework for tracking frontline risk signals across a two-dimensional coordinate field using 15 equations and 16 tunable parameters. While this specification is designed for eventual software implementation, its computational requirements create an adoption barrier for organizations whose available infrastructure is a spreadsheet. This paper introduces WSCM-Lite, a lookup-table implementation that reproduces the full WSCM's coordinate trajectories within 0.01 field units while eliminating all exponential functions, state-dependent tracking, and free parameters. The simplification replaces continuous recency weighting with a four-row lookup table and removes consensus momentum and reversal amplification entirely, reducing the specification to seven formulas and five hardcoded constants. A 26-session worked example using the Gas Fumes signal from the parent paper demonstrates that WSCM-Lite traverses the same four-region path (Question Marks --> Lit Fuses --> Owls --> Sleeping Cats --> Question Marks) and triggers SMS escalation within two sessions of the full model. Five additional scenarios validate boundary behavior, and a sensitivity analysis confirms stability under +/-30% gap threshold variation. An accompanying Excel simulator and supplementary materials are publicly available at https://github.com/emmgonai/wscm-lite.
Authors:Bo Kang, Sander Noels, Tijl De Bie
Abstract:
The rise of generative AI is posing increasing risks to online information integrity and civic discourse. Most concretely, such risks can materialise in the form of mis- and disinformation. As a mitigation, media-literacy and transparency tools have been developed to address factuality of information and the reliability and ideological leaning of information sources. However, a subtler but possibly no less harmful threat to civic discourse is to use of persuasion or manipulation by exploiting human cognitive biases and related cognitive limitations. To the best of our knowledge, no tools exist to directly detect and mitigate the presence of triggers of such cognitive biases in online information. We present VIGIL (VIrtual GuardIan angeL), the first browser extension for real-time cognitive bias trigger detection and mitigation, providing in-situ scroll-synced detection, LLM-powered reformulation with full reversibility, and privacy-tiered inference from fully offline to cloud. VIGIL is built to be extensible with third-party plugins, with several plugins that are rigorously validated against NLP benchmarks are already included. It is open-sourced at https://github.com/aida-ugent/vigil.
Authors:Hongbin Chen, Jie Li, Wei Wang, Siyang Song, Xiao Gu, Jianqing Li, Wentao Xiang
Abstract:
While affective computing has advanced considerably, multimodal emotion prediction in aging populations remains underexplored, largely due to the scarcity of dedicated datasets. Existing multimodal benchmarks predominantly target young, cognitively healthy subjects, neglecting the influence of cognitive decline on emotional expression and physiological responses. To bridge this gap, we present MECO, a Multimodal dataset for Emotion and Cognitive understanding in Older adults. MECO includes 42 participants and provides approximately 38 hours of multimodal signals, yielding 30,592 synchronized samples. To maximize ecological validity, data collection followed standardized protocols within community-based settings. The modalities cover video, audio, electroencephalography (EEG), and electrocardiography (ECG). In addition, the dataset offers comprehensive annotations of emotional and cognitive states, including self-assessed valence, arousal, six basic emotions, and Mini-Mental State Examination cognitive scores. We further establish baseline benchmarks for both emotion and cognitive prediction. MECO serves as a foundational resource for multimodal modeling of affect and cognition in aging populations, facilitating downstream applications such as personalized emotion recognition and early detection of mild cognitive impairment (MCI) in real-world settings. The complete dataset and supplementary materials are available at https://maitrechen.github.io/meco-page/.
Authors:Matteo Filosa, Andrea Nardocci, Tiziana Catarci, Marco Angelini
Abstract:
Visualizing large 3D scientific datasets requires balancing performance and fidelity, but traditional tools often demand excessive technical expertise. We introduce UnrealVis, an Unreal Engine optimization laboratory for configuring and evaluating rendering techniques during interactive exploration. Following a review of 55 papers, we established a taxonomy of 22 optimization techniques across six families, implementing them through engine subsystems such as Nanite, Level of Detail(LOD) schemes, and culling. The system features an intuitive workflow with live telemetry and A/B comparisons for local and global performance analysis. Validated through case studies of ribosomal structures and volumetric flow fields, along with an expert evaluation, UnrealVis facilitates the selection of optimization combinations that meet performance goals while preserving structural fidelity. UnrealVis is available at https://github.com/XAIber-lab/UnrealVis
Authors:Matteo Filosa, Graziano Blasilli, Emilio Martino, Marco Angelini
Abstract:
Modern data analysis requires speed for massive datasets. Progressive Data Analysis and Visualization (PDAV) emerged as a discipline to address this problem, providing fast response times while maintaining interactivity with controlled accuracy. Yet it remains difficult to implement and reproduce. To lower this barrier, we present ProVega, a Vega-Lite-based grammar that simplifies PDAV instrumentation for both simple visualizations and complex visual environments. Alongside it, we introduce Pro-Ex, an editor designed to streamline the creation and analysis of progressive solutions. We validated ProVega by reimplementing 11 exemplars from the literature-verified for fidelity by 39 users-and demonstrating its support for various progressive methods, including data-chunking, process-chunking, and mixed-chunking. An expert user study confirmed the efficacy of ProVega and the Pro-Ex environment in real-world tasks. ProVega, Pro-Ex, and all related materials are available at https://github.com/XAIber-lab/provega
Authors:Xiangshan Tan, Jingtian Ji, Tianchong Jiang, Pedro Lopes, Matthew R. Walter
Abstract:
The contact-rich nature of manipulation makes it a significant challenge for robotic teleoperation. While haptic feedback is critical for contact-rich tasks, providing intuitive directional cues within wearable teleoperation interfaces remains a bottleneck. Existing solutions, such as non-directional vibrations from handheld controllers, provide limited information, while vibrotactile arrays are prone to perceptual interference. To address these limitations, we propose HapCompass, a novel, low-cost wearable haptic device that renders 2D directional cues by mechanically rotating a single linear resonant actuator (LRA). We evaluated HapCompass's ability to convey directional cues to human operators and showed that it increased the success rate, decreased the completion time and the maximum contact force for teleoperated manipulation tasks when compared to vision-only and non-directional feedback baselines. Furthermore, we conducted a preliminary imitation-learning evaluation, suggesting that the directional feedback provided by HapCompass enhances the quality of demonstration data and, in turn, the trained policy. We release the design of the HapCompass device along with the code that implements our teleoperation interface: https://ripl.github.io/HapCompass/.
Authors:Aakanksha Khandwaha, Edith Law
Abstract:
Despite AI tools becoming more prevalent and applicable to a variety of workplaces, workers consistently report uncertainty about where AI applies, what problems it can help solve, and how it fits into real workflows. In other words, there is a gap between `knowing' and `doing' when it comes to AI literacy. We propose an experiential form of AI literacy which integrates participant's daily experiences into the learning experience by brainstorming grounded AI use cases through storytelling. We introduce a novel pedagogical approach that helps individuals move away from abstract notions of AI towards practical knowledge of how AI would (or would not) work in different workflows, contexts, and situations. Through this approach, we anticipate two major outcomes: (1) enhanced AI literacy for stakeholders within a variety of work sectors and (2) concrete AI use cases developed through participatory design that are grounded in AI literacy and participant's expertise.
Authors:Hongyu Zhu, Lin Chen, Mingsheng Shang
Abstract:
Multimodal Sentiment Analysis (MSA) that integrates Electroencephalogram (EEG) with peripheral physiological signals (PPS) is crucial for the development of brain-computer interface (BCI) systems. However, existing methods encounter three major challenges: (1) overlooking the region-specific characteristics of affective processing by treating EEG signals as homogeneous; (2) treating EEG as a black-box input, which lacks interpretability into neural representations;(3) ineffective fusion of EEG features with complementary PPS features. To overcome these issues, we propose BiMoE, a novel brain-inspired mixture of experts framework. BiMoE partitions EEG signals in a brain-topology-aware manner, with each expert utilizing a dual-stream encoder to extract local and global spatiotemporal features. A dedicated expert handles PPS using multi-scale large-kernel convolutions. All experts are dynamically fused through adaptive routing and a joint loss function. Evaluated under strict subject-independent settings, BiMoE consistently surpasses state-of-the-art baselines across various affective dimensions. On the DEAP and DREAMER datasets, it yields average accuracy improvements of 0.87% to 5.19% in multimodal sentiment classification. The code is available at: https://github.com/HongyuZhu-s/BiMo.
Authors:Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Nathaniel Gorski, Jianxin Sun, Guoxi Liu, Helgi I. Ingolfsson, David Lenz, Hanqi Guo, Hongfeng Yu, Teja Leburu, Michael Molash, Bei Wang, Tom Peterka, Chaoli Wang, Shusen Liu
Abstract:
Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.
Authors:Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang
Abstract:
The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir
Authors:Xiao Fan, Yi Zhang
Abstract:
Visualizing brain functional connectivity (FC) patterns is essential for understanding neural organization, yet existing tools such as Circos and BrainNet Viewer require complex configuration files or proprietary software environments. We present BrainRing, a free, open-source, browser-based interactive tool for generating publication-quality chord diagrams of brain connectivity data. BrainRing requires no installation, backend server, or programming knowledge. Users simply open a single HTML file in any modern browser. The tool supports 8 widely-used brain atlases (Brainnetome 246, AAL-90/116, Schaefer 100/200/400, Power 264, and Dosenbach 160), provides real-time parameter adjustment through an intuitive graphical interface, and offers comprehensive edge management including click-to-connect, per-edge color customization, and Circos link file import. BrainRing supports both Chinese and English interfaces and enables researchers to produce publication-ready SVG and PNG figures with full control over visual styling, all within seconds rather than the minutes-to-hours workflow typical of script-based approaches. BrainRing is freely available at https://github.com/XiuFan719/brain-connectivity-viz with a live demo at https://XiuFan719.github.io/brain-connectivity-viz/.
Authors:Max Holschneider, Saetbyeol LeeYouk
Abstract:
AI chatbots have quietly become the world's most popular therapists, coaches, and confidants. Users of cloud-based LLM services are increasingly shifting from simple queries like idea generation and poem writing, to deeply personal interactions. As Large Language Models increasingly assume the role of our confessors, we are witnessing a massive, unregulated transfer of sensitive personal identifiable information (PII) to powerful tech companies with opaque privacy practices. While the enterprise sector has made great strides in addressing data leakage concerns through sophisticated guardrails and PII redaction pipelines, these powerful tools have functionally remained inaccessible for the average user due to their technical complexity. This results in a dangerous trade off for individual users. In order to receive the therapeutic or productivity benefits of AI, users need to abandon any agency they might otherwise have over their data, often without a clear mental model of what is being shared, and how it might be used for advertising later on. This work addresses this interaction gap, applying the redaction pipelines of enterprise-grade redaction into an intuitive, first-of-its-kind, consumer-facing, and free experience. Specifically, this work introduces a scalable, browser-based intervention designed to help align user behavior with their privacy preferences during web-based AI interactions. Our system introduces two key mechanisms: local entity anonymization to prevent data leakage, and 'smokescreens': autonomous agent activity to disrupt third-party profiling. An open-source implementation is accessible at the GitHub repository below.
Authors:Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongyi Zhou, Xingyue Chen, Jiahao Ren, Robert Timothy Bettridge, Xiang 'Anthony' Chen, Faraz Faruqi, Steve Toh, David Kim
Abstract:
While large language models (LLMs) have accelerated 2D software development through intent-driven "vibe coding", prototyping intelligent Extended Reality (XR) experiences remains a major challenge. The fundamental barrier is not just the steep learning curve for human creators, but that low-level sensor APIs and complex game engine hierarchies are ill-suited for LLM reasoning, routinely exceeding context windows and inducing syntax hallucinations. To bridge this gap, we contribute XR Blocks, an open-source, LLM-native WebXR framework. Unlike traditional engines, XR Blocks introduces a semantic "Reality Model" that aligns spatial computing primitives (users, physical environments, and agents) with natural language, providing a robust, concise vocabulary optimized for generative AI. Building upon this foundation, we present Vibe Coding XR, an end-to-end prototyping workflow that leverages LLMs to translate high-level prompts (e.g., "create a dandelion that reacts to my hand") directly into functional, physics-aware mixed-reality applications. To minimize the friction of on-device testing, the workflow introduces a seamless desktop "simulated reality" to headset deployment loop. Finally, we introduce VCXR60, a pilot dataset of 60 XR prompts paired with an automated evaluation pipeline. Our technical evaluation demonstrates high one-shot execution success, enabling practitioners to bypass lowlevel hurdles and rapidly move from "idea to reality". Code and live demos are available at https://github.com/google/xrblocks and http://xrblocks.github.io/gem.
Authors:Sunwhi Kim, Sunyul Kim
Abstract:
Generative AI now produces photorealistic portraits that circulate widely in social and newslike contexts. Human ability to distinguish real from synthetic faces is time-sensitive because image generators continue to improve while public familiarity with synthetic media also changes. Here, we provide a time-stamped snapshot of human ability to distinguish real from AI-generated portraits produced by models available in July 2025. In a large-scale web experiment conducted from August 2025 to January 2026, 1,664 participants aged 20-69 years (mobile n = 1,330; PC n = 334) completed a two-alternative forced-choice task (REAL vs AI). Each participant judged 20 trials sampled from a 210-image pool comprising real FFHQ photographs and AI-generated portraits from ChatGPT-4o and Imagen 3. Overall accuracy was high (mean 85.2%, median 90%) but varied across groups. PC participants outperformed mobile participants by 3.65 percentage points. Accuracy declined with age in both device cohorts and more steeply on mobile than on PC (-0.607 vs -0.230 percentage points per year). Self-rated AI-detection confidence and AI exposure were positively associated with accuracy and statistically accounted for part of the age-related decline, with confidence accounting for the larger share. In the mobile cohort, an age-related sex divergence emerged among participants in their 50s and 60s, with female participants performing worse. Trial-level reaction-time models showed that correct AI judgments were faster than correct real judgments, whereas incorrect AI judgments were slower than incorrect real judgments. ChatGPT-4o portraits were harder and slower to classify than Imagen 3 portraits and were associated with a steeper age-related decline in performance. These findings frame AI portrait detection as a human-factors problem shaped by age, sex, device context, and confidence, not image realism alone.
Authors:Haiyang Xu, Ronghuan Wu, Li-Yi Wei, Nanxuan Zhao, Chenxi Liu, Cuong Nguyen, Zhuowen Tu, Zhaowen Wang
Abstract:
Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: https://xxuhaiyang.github.io/SemLayer/
Authors:Hanzhong Zhang, Siyang Song, Jindong Wang
Abstract:
While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models contrastingly maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances
Authors:Yunfan Zhou, Qiming Shi, Zhongsu Luo, Xiwen Cai, Yanwei Huang, Dae Hyun Kim, Di Weng, Yingcai Wu
Abstract:
LLM-driven tools have significantly lowered barriers to writing SQL queries. However, user instructions are often underspecified, assuming the model understands implicit knowledge, such as dataset schemas, domain conventions, and task-specific requirements, that isn't explicitly provided. This results in frequently erroneous scripts that require users to repeatedly clarify their intent. Additionally, users struggle to validate generated scripts because they cannot verify whether the model correctly applied implicit knowledge. We present Cerebra, an interactive NL-to-SQL tool that aligns implicit knowledge between users and LLMs during SQL authoring. Cerebra automatically retrieves implicit knowledge from historical SQL scripts based on user instructions, presents this knowledge in an interactive tree view for code review, and supports iterative refinement to improve generated scripts. To evaluate the effectiveness and usability of Cerebra, we conducted a user study with 16 participants, demonstrating its improved support for customized SQL authoring. The source code of Cerebra is available at https://github.com/zjuidg/CHI26-Cerebra.
Authors:Taara Kumar, Kokil Jaidka
Abstract:
As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs) - textual analogues of kinesics, vocalics, and paralinguistics - in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.
Authors:Daniel Autenrieth
Abstract:
This paper presents the first systematic measurement of educational alignment in Large Language Models. Using a Delphi-validated instrument comprising 48 items across eight educational-theoretical dimensions, the study reveals that GPT-5.1 exhibits highly coherent preference patterns (99.78% transitivity; 92.79% model accuracy) that largely align with humanistic educational principles where expert consensus exists. Crucially, divergences from expert opinion occur precisely in domains of normative disagreement among human experts themselves, particularly emotional dimensions and epistemic normativity. This raises a fundamental question for alignment research: When human values are contested, what should models be aligned to? The findings demonstrate that GPT-5.1 does not remain neutral in contested domains but adopts coherent positions, prioritizing emotional responsiveness and rejecting false balance. The methodology, combining Delphi consensus-building with Structured Preference Elicitation and Thurstonian Utility modeling, provides a replicable framework for domain-specific alignment evaluation beyond generic value benchmarks.
Authors:Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür
Abstract:
Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users' feedback, enabling personalization without per-user fine-tuning. We evaluate on \textsc{MultiSessionCollab}, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.
Authors:Kanishka Mitra, Frigyes Samuel Racz, Satyam Kumar, Ashish D. Deshpande, José del R. Millán
Abstract:
Two distinct technologies have gained attention lately due to their prospects for motor rehabilitation: robotics and brain-machine interfaces (BMIs). Harnessing their combined efforts is a largely uncharted and promising direction that has immense clinical potential. However, a significant challenge is whether motor intentions from the user can be accurately detected using non-invasive BMIs in the presence of instrumental noise and passive movements induced by the rehabilitation exoskeleton. As an alternative to the straightforward continuous control approach, this study instead aims to characterize the onset and offset of motor imagery during passive arm movements induced by an upper-body exoskeleton to allow for the natural control (initiation and termination) of functional movements. Ten participants were recruited to perform kinesthetic motor imagery (MI) of the right arm while attached to the robot, simultaneously cued with LEDs indicating the initiation and termination of a goal-oriented reaching task. Using electroencephalogram signals, we built a decoder to detect the transition between i) rest and beginning MI and ii) maintaining and ending MI. Offline decoder evaluation achieved group average onset accuracy of 60.7% and 66.6% for offset accuracy, revealing that the start and stop of MI could be identified while attached to the robot. Furthermore, pseudo-online evaluation could replicate this performance, forecasting reliable online exoskeleton control in the future. Our approach showed that participants could produce quality and reliable sensorimotor rhythms regardless of noise or passive arm movements induced by wearing the exoskeleton, which opens new possibilities for BMI control of assistive devices.
Authors:Christopher J. Agostino, Quan Le Thien, Nayan D'Souza, Louis van der Elst
Abstract:
Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human-agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models -- in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH $|S|$ parameter -- the metric associated with the inequality -- across the inference parameter space of models spanning four orders of magnitude in scale, cross-referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the $|S|$ distribution -- the statistic that most sharply differentiates models from one another -- is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance. We investigate how $|S|$ varies with sampling parameters and word order, and discuss the information-theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale -- manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
Authors:Christopher J. Agostino, Nayan D'Souza
Abstract:
Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, yet the frameworks through which these systems operate do not provide a simple, unified mechanism for scalably managing the critical aspects of the agent harness, impacting both the quality of individual human-agent interactions and the capacity for practitioners to coordinate toward common goals through shared agent infrastructure. Agent frameworks have enabled increasingly sophisticated multi-agent systems, but the behavioral specifications that define what these agents can do remain fragmented across prose instruction files, framework-internal configuration, and mechanisms like MCP servers that operate separately from individual agent definitions, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to agent context, we introduce a declarative context-agent-tool (CAT) data layer expressed through interrelated files that scope each agent's tool access and context to the minimum its role requires, and \texttt{npcsh}, a command-line shell for executing it. Because the system parses and enforces these files structurally, modifying an agent's tool list produces a guaranteed behavioral change rather than a suggestion the model may or may not follow. We evaluate 22 locally-hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation, characterizing which model families succeed at which task categories and where they break down across $\sim$2500 total executions.
Authors:Nico Schuster, Andrés N. Salcedo, Simon Bouchard, Dennis Frei, Alice Pisani, Julian E. Bautista, Julien Zoubian, Stephanie Escoffier, Wei Liu, Georgios Valogiannis, Pauline Zarrouk
Abstract:
Scientists across all disciplines share a common challenge: the divide between their theoretical knowledge and the specialized skills and time needed to build interactive tools to communicate this expertise. While large language models (LLMs) offer unparalleled acceleration in code generation, they frequently prioritize functional syntax over scientific accuracy, risking visually convincing but scientifically invalid results. This work advocates the Scientist-AI-Loop (SAIL), a framework designed to harness this speed without compromising rigor. By separating domain logic from code syntax, SAIL enables researchers to maintain strict oversight of scientific concepts and constraints while delegating code implementation to AI. We illustrate this approach through two open-source, browser-based astrophysics tools: an interactive gravitational lensing visualization and a large-scale structure formation sandbox, both publicly available. Our methodology condensed development to mere days while maintaining scientific integrity. We specifically address failure modes where AI-generated code neglects phenomenological boundaries or scientific validity. While cautioning that research-grade code requires stringent protocols, we demonstrate through two examples that SAIL provides an effective code generation workflow for outreach, teaching, professional presentations, and early-stage research prototyping. This framework contributes to a foundation for the further development of AI-assisted scientific software.
Authors:Kanishka Mitra, Satyam Kumar, Frigyes Samuel Racz, Deland Liu, Ashish D. Deshpande, José del R. Millán
Abstract:
Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level-engaging the impaired neural circuits only indirectly-which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from non-invasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start-stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p = 0.0117; offset +34%, p = 0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start-stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.
Authors:Bo Pan, Lunke Pan, Yitao Zhou, Qi Jiang, Zhen Wen, Minfeng Zhu, Wei Chen
Abstract:
Deep research systems powered by LLM agents have transformed complex information seeking by automating the iterative retrieval, filtering, and synthesis of insights from massive-scale web sources. However, existing systems predominantly follow an autonomous "query-to-report" paradigm, limiting users to a passive role and failing to integrate their personal insights, contextual knowledge, and evolving research intents. This paper addresses the lack of human-in-the-loop collaboration in the agentic research process. Through a formative study, we identify that current systems hinder effective human-agent collaboration in terms of process observability, real-time steerability, and context navigation efficiency. Informed by these findings, we propose InterDeepResearch, an interactive deep research system backed by a dedicated research context management framework. The framework organizes research context into a hierarchical architecture with three levels (information, actions, and sessions), enabling dynamic context reduction to prevent LLM context exhaustion and cross-action backtracing for evidence provenance. Built upon this framework, the system interface integrates three coordinated views for visual sensemaking, and dedicated interaction mechanisms for interactive research context navigation. Evaluation on the Xbench-DeepSearch-v1 and Seal-0 benchmarks shows that InterDeepResearch achieves competitive performance compared to state-of-the-art deep research systems, while a formal user study demonstrates its effectiveness in supporting human-agent collaborative information seeking. Project page with system demo: https://github.com/bopan3/InterDeepResearch.
Authors:Matias Loukojärvi, Ananth Mahadevan, Katsiaryna Haitsiukevich, Kai Puolamäki
Abstract:
Advances in computational chemistry have produced high-dimensional datasets on atmospherically relevant molecules. To aid exploration of such datasets, particularly for the study of atmospheric aerosol formation, we introduce PhiPlot: a web-based environment for interactive exploration and knowledge-based dimensionality reduction. The integration of visualisation, clustering, and domain knowledge-guided embedding refinement enables the discovery of patterns in the data and supports hypothesis generation. The application connects to an existing, evolving collection of molecular databases, offering an accessible interface for data-driven research in atmospheric chemistry.
Authors:Srikrishna Bangalore Raghu, Anna Soukhovei, Divya Sai Sindhuja Vankineni, Alexandra Bacula, Alessandro Roncone
Abstract:
In human-robot collaboration, a robot's expression of hesitancy is a critical factor that shapes human coordination strategies, attention allocation, and safety-related judgments. However, designing hesitant robot motion that generalizes is challenging because the observer's inference is highly dependent on embodiment and context. To address these challenges, we introduce and open-source a multi-modal, dancer-generated dataset of hesitant motion where we focus on specific context-embodiment pairs (i.e., manipulator/human upper-limb approaching a Jenga Tower, and anthropomorphic whole body motion in free space). The dataset includes (i) kinesthetic teaching demonstrations on a Franka Emika Panda reaching from a fixed start configuration to a fixed target (a Jenga tower) with three graded hesitancy levels (slight, significant, extreme) and (ii) synchronized RGB-D motion capture of dancers performing the same reaching behavior using their upper limb across three hesitancy levels, plus full human body sequences for extreme hesitancy. We further provide documentation to enable reproducible benchmarking across robot and human modalities. Across all dancers, we obtained 70 unique whole-body trajectories, 84 upper limb trajectories spanning over the three hesitancy levels, and 66 kinesthetic teaching trajectories spanning over the three hesitancy levels. The dataset can be accessed here: https://brsrikrishna.github.io/Dance2Hesitate/.
Authors:Bhada Yun, Evgenia Taranova, Dana Feng, Renn Su, April Yi Wang
Abstract:
There is no 'ordinary' when it comes to AI. The human-AI experience is extraordinarily complex and specific to each person, yet dominant measures such as usability scales and engagement metrics flatten away nuance. We argue for AI phenomenology: a research stance that asks "How did it feel?" beyond the standard questions of "How well did it perform?" when interacting with AI systems. AI phenomenology acts as a paradigm for bidirectional human-AI alignment as it foregrounds users' first-person perceptions and interpretations of AI systems over time. We motivate AI phenomenology as a framework that captures how alignment is experienced, negotiated, and updated between users and AI systems. Tracing a lineage from Husserl through postphenomenology to Actor-Network Theory, and grounding our argument in three studies-two longitudinal studies with "Day", an AI companion, and a multi-method study of agentic AI in software engineering-we contribute a set of replicable methodological toolkits for conducting AI phenomenology research: instruments for capturing lived experience across personal and professional contexts, three design concepts (translucent design, agency-aware value alignment, temporal co-evolution tracking), and a concrete research agenda. We offer this toolkit not as a new paradigm but as a practical scaffold that researchers can adapt as AI systems-and the humans who live alongside them-continue to co-evolve.
Authors:Patrick Ebel, Michał Patryk Miazga, Martin Lorenz, Timur Getselev, Pavlo Bazilinskyy, Celine Conzen
Abstract:
Designing and evaluating in-vehicle interfaces requires experimental platforms that combine ecological validity with experimental control. Driving simulators are widely used for this purpose. However, they face a fundamental trade-off: high-fidelity physical simulators are costly and difficult to adapt, while virtual reality simulators provide flexibility at the expense of physical interaction with the vehicle. In this work, we present MRDrive, an open mixed-reality driving simulator designed to support HCI research on in-vehicle interaction, attention, and explainability in manual and automated driving contexts. MRDrive enables drivers and passengers to interact with a real vehicle cabin while being fully immersed in a virtual driving environment. We demonstrate the capabilities of MRDrive through a small pilot study that illustrates how the simulator can be used to collect and analyze eye-tracking and touch interaction data in an automated driving scenario. MRDRive is available at: https://github.com/ciao-group/mrdrive
Authors:Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
Abstract:
Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language-model(VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce \textbf{AndroidWorld-Generalization}, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution % , and error recovery to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1\% improvement on unseen instances but only limited gains on unseen templates (15.7\%) and apps (8.3\%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure \footnote{https://github.com/zihuanjiang/AndroidWorld-Generalization}.
Authors:Daehee Kang, Yeon-Chang Lee
Abstract:
Cross-domain recommendation (CDR) aims to alleviate data sparsity by transferring knowledge across domains, yet existing methods primarily rely on coarse-grained behavioral signals and often overlook intra-domain heterogeneity in user preferences. We propose Multi-TAP, a multi-criteria target-adaptive persona framework that explicitly captures such heterogeneity through semantic persona modeling. To enable effective transfer, Multi-TAP selectively incorporates source-domain signals conditioned on the target domain, preserving relevance during knowledge transfer. Experiments on real-world datasets demonstrate that Multi-TAP consistently outperforms state-of-the-art CDR methods, highlighting the importance of modeling intra-domain heterogeneity for robust cross-domain recommendation. The codebase of Multi-TAP is currently available at https://github.com/archivehee/Multi-TAP.
Authors:Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber, Thomas Robinson, Nilam Ram, Byron Reeves, Sherry Yang, Michael S. Bernstein, Diyi Yang
Abstract:
Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts -- it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score $\geq$ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
Authors:Diego Armando Resendez Prado
Abstract:
Chess engines passed human strength years ago, but they still don't play like humans. A grandmaster under clock pressure blunders in ways a club player on a hot streak never would. Conventional engines capture none of this. This paper proposes a personality x psyche decomposition to produce behavioral variability in chess play, drawing on patterns observed in human games. Personality is static -- a preset that pins down the engine's character. Psyche is dynamic -- a bounded scalar ψ_t \in [-100, +100], recomputed from five positional factors after every move. These two components feed into an audio-inspired signal chain (noise gate, compressor/expander, five-band equalizer, saturation limiter) that reshapes move probability distributions on the fly. The chain doesn't care what engine sits behind it: any system that outputs move probabilities will do. It needs no search and carries no state beyond ψ_t. I test the framework across 12,414 games against Maia2-1100, feeding it two probability sources that differ by ~2,800x in training data. Both show the same monotonic gradient in top-move agreement (~20-25 pp spread from stress to overconfidence), which tells us the behavioral variation comes from the signal chain, not from the model underneath. When the psyche runs overconfident, the chain mostly gets out of the way (66% agreement with vanilla Maia2). Under stress, the competitive score falls from 50.8% to 30.1%. The patterns are reminiscent of tilt and overconfidence as described in human play, but I should be upfront: this study includes no human-subject validation.
Authors:Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li
Abstract:
Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high-frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key-Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N-way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.
Authors:Xuejin Luo, Shiquan Sun, Runshi Zhang, Ruizhi Zhang, Junchen Wang
Abstract:
During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero-shot manner based on surgeons' instructions. A real-time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at https://give-me-scissors.github.io/.
Authors:Ziheng Xi, Zihang Ao, Yitao Wang, Mingeze Gao, Wanmei Zhang, Jianjiang Feng, Jie Zhou
Abstract:
Accurate 3D hand pose and pressure sensing is essential for immersive human-computer interaction, yet simultaneously achieving both in mobile scenarios remains a significant challenge. We present WristPP, a camera-based wrist-worn system that estimates 3D hand pose and per-vertex pressure from a single wide-FOV RGB frame in real time. A Vision Transformer (ViT) backbone with joint-aligned tokens predicts Hand-VQVAE codebook indices for mesh recovery, while an extrinsics-conditioned branch jointly estimates per-vertex pressure. On a self-collected dataset of 133,000 frames (20 subjects; 48 on-plane and 28 mid-air gestures), WristPP attains a Mean Per-Joint Position Error (MPJPE) of 2.9 mm, Contact IoU of 0.712, Volumetric IoU of 0.618, and foreground pressure MAE of 10.4 g. Across three user studies, WristPP delivers touchpad-level efficiency in mid-air pointing and robust multi-finger pressure control on an uninstrumented desktop. In a real-world large-display Whac-A-Mole task, WristPP also enables higher success ratio and lower arm fatigue than head-mounted camera-based baselines. These results position WristPP as an effective, mobile solution for versatile pose- and pressure-based interaction. Website: https://zhenqis123.github.io/WristPP/.
Authors:Thom Vaughan, Pedro Ortiz Suarez
Abstract:
We present a large-scale automated audit of WCAG 2.1/2.2 Level AA colour contrast compliance across the 500 most frequently crawled registered domains in Common Crawl's CC-MAIN-2026-08 February 2026 crawl archive. Rather than conducting a live crawl, all page content was sourced from Common Crawl's open WARC archives, ensuring reproducibility and eliminating any load on target web servers. Our static CSS analysis of 240 homepages identified 4,327 unique foreground/background colour pairings, of which 1,771 (40.9%) failed to meet the 4.5:1 contrast ratio threshold for normal text. The median per-site pass rate was 62.7%, with 20.4% of sites achieving full compliance across all detected colour pairings. These findings suggest that colour contrast remains a widespread accessibility barrier on the most prominent websites, with significant variation across domain categories.
Authors:Cosmo Santoni
Abstract:
As large language models engage in extended reasoning tasks, they accumulate significant state -- architectural mappings, trade-off decisions, codebase conventions -- within the context window. This understanding is lost when sessions reach context limits and undergo lossy compaction. We propose Contextual Memory Virtualisation (CMV), a system that treats accumulated LLM understanding as version-controlled state. Borrowing from operating system virtual memory, CMV models session history as a Directed Acyclic Graph (DAG) with formally defined snapshot, branch, and trim primitives that enable context reuse across independent parallel sessions. We introduce a three-pass structurally lossless trimming algorithm that preserves every user message and assistant response verbatim while reducing token counts by a mean of 20% and up to 86% for sessions with significant overhead by stripping mechanical bloat such as raw tool outputs, base64 images, and metadata. A single-user case-study evaluation across 76 real-world coding sessions demonstrates that trimming remains economically viable under prompt caching, with the strongest gains in mixed tool-use sessions, which average 39% reduction and reach break-even within 10 turns. A reference implementation is available at https://github.com/CosmoNaught/claude-code-cmv.
Authors:David Anugraha, Vishakh Padmakumar, Diyi Yang
Abstract:
Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to conduct semi-structured interviews. Recent work has explored using large language models (LLMs) to automate interviewing, yet existing systems lack a principled mechanism for balancing systematic coverage of predefined topics with adaptive exploration, or the ability to pursue follow-ups, deep dives, and emergent themes that arise organically during conversation. In this work, we formulate adaptive semi-structured interviewing as an optimization problem over the interviewer's behavior. We define interview utility as a trade-off between coverage of a predefined interview topic guide, discovery of relevant emergent themes, and interview cost measured by length. Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility. We evaluate SparkMe through controlled experiments with LLM-based interviewees, showing that it achieves higher interview utility, improving topic guide coverage (+4.7% over the best baseline) and eliciting richer emergent insights while using fewer conversational turns than prior LLM interviewing approaches. We further validate SparkMe in a user study with 70 participants across 7 professions on the impact of AI on their workflows. Domain experts rate SparkMe as producing high-quality adaptive interviews that surface helpful profession-specific insights not captured by prior approaches. The code, datasets, and evaluation protocols for SparkMe are available as open-source at https://github.com/SALT-NLP/SparkMe.
Authors:Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu
Abstract:
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
Authors:Ioannis Dravilas, Ioannis Kapetangeorgis, Anastasios Latsoudis, Conor McCarthy, Gonçalo Marcelino, Marcel Worring
Abstract:
Composed Image Retrieval (CIR) allows users to search for images by combining a reference image with a text prompt that describes desired modifications. While vision-language models like CLIP have popularized this task by embedding multiple modalities into a joint space, developers still lack tools that reveal how these multimodal prompts interact with embedding spaces and why small wording changes can dramatically alter the results. We present InfoCIR, a visual analytics system that closes this gap by coupling retrieval, explainability, and prompt engineering in a single, interactive dashboard. InfoCIR integrates a state-of-the-art CIR back-end (SEARLE arXiv:2303.15247) with a six-panel interface that (i) lets users compose image + text queries, (ii) projects the top-k results into a low-dimensional space using Uniform Manifold Approximation and Projection (UMAP) for spatial reasoning, (iii) overlays similarity-based saliency maps and gradient-derived token-attribution bars for local explanation, and (iv) employs an LLM-powered prompt enhancer that generates counterfactual variants and visualizes how these changes affect the ranking of user-selected target images. A modular architecture built on Plotly-Dash allows new models, datasets, and attribution methods to be plugged in with minimal effort. We argue that InfoCIR helps diagnose retrieval failures, guides prompt enhancement, and accelerates insight generation during model development. All source code allowing for a reproducible demo is available at https://github.com/giannhskp/InfoCIR.
Authors:Jiangkai Wu, Zhiyuan Ren, Junquan Zhong, Liming Liu, Xinggong Zhang
Abstract:
AI Video Assistant emerges as a new paradigm for Real-time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed in the cloud. This makes interaction between humans and AI more intuitive, akin to chatting with a real person. However, a fundamental mismatch exists between current RTC frameworks and AI Video Assistants, stemming from the drastic shift in Quality of Experience (QoE) and more challenging networks. Measurements on our production prototype also confirm that current RTC fails, causing latency spikes and accuracy drops. To address these challenges, we propose Artic, an AI-oriented RTC framework for MLLM Video Assistants, exploring the shift from "humans watching video" to "AI understanding video." Specifically, Artic proposes: (1) Response Capability-aware Adaptive Bitrate, which utilizes MLLM accuracy saturation to proactively cap bitrate, reserving bandwidth headroom to absorb future fluctuations for latency reduction; (2) Zero-overhead Context-aware Streaming, which allocates limited bitrate to regions most important for the response, maintaining accuracy even under ultra-low bitrates; and (3) Degraded Video Understanding Benchmark, the first benchmark evaluating how RTC-induced video degradation affects MLLM accuracy. Prototype experiments using real-world uplink traces show that compared with existing methods, Artic significantly improves accuracy by 15.12% and reduces latency by 135.31 ms. We will release the benchmark and codes at https://github.com/pku-netvideo/DeViBench.
Authors:Sahand Sabour, TszYam NG, Minlie Huang
Abstract:
As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via https://github.com/Sahandfer/PatientHub.
Authors:Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin
Abstract:
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
Authors:Zhuoyun Zheng, Yu Dong, Gaorong Liang, Guan Li, Guihua Shan, Shiyu Cheng, Dong Tian, Jianlong Zhou, Jie Liang
Abstract:
Generative models have substantially expanded video generation capabilities, yet practical thought-to-video creation remains a multi-stage, multi-modal, and decision-intensive process. However, existing tools either hide intermediate decisions behind repeated reruns or expose operator-level workflows that make exploration traces difficult to manage, compare, and reuse. We present T2VTree, a user-centered visual analytics approach for agent-assisted thought-to-video authoring. T2VTree represents the authoring process as a tree visualization. Each node in the tree binds an editable specification (intent, referenced inputs, workflow choice, prompts, and parameters) with the resulting multimodal outputs, making refinement, branching, and provenance inspection directly operable. To reduce the burden of deciding what to do next, a set of collaborating agents translates step-level intent into an executable plan that remains visible and user-editable before execution. We further implement a visual analytics system that integrates branching authoring with in-place preview and stitching for convergent assembly, enabling end-to-end multi-scene creation without leaving the authoring context. We demonstrate T2VTreeVA through two multi-scene case studies and a comparative user study, showing how the T2VTree visualization and editable agent planning support reliable refinement, localized comparison, and practical reuse in real authoring workflows. T2VTree is available at: https://github.com/tezuka0210/T2VTree.
Authors:Peizhen Li, Longbing Cao, Xiao-Ming Wu, Yang Zhang
Abstract:
Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human-robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real-time performance or realistic expressiveness due to offline video-based inference designs and insufficient ability to capture and transfer subtle expression details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework X2CNet++ enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptation training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.
Authors:Joao Baptista Cardia Neto, Claudio Ferrari, Stefano Berretti
Abstract:
Facial emotion recognition has been typically cast as a single-label classification problem of one out of six prototypical emotions. However, that is an oversimplification that is unsuitable for representing the multifaceted spectrum of spontaneous emotional states, which are most often the result of a combination of multiple emotions contributing at different intensities. Building on this, a promising direction that was explored recently is to cast emotion recognition as a distribution learning problem. Still, such approaches are limited in that research datasets are typically annotated with a single emotion class. In this paper, we contribute a novel approach to describe complex emotional states as probability distributions over a set of emotion classes. To do so, we propose a solution to automatically re-label existing datasets by exploiting the result of a study in which a large set of both basic and compound emotions is mapped to probability distributions in the Valence-Arousal-Dominance (VAD) space. In this way, given a face image annotated with VAD values, we can estimate the likelihood of it belonging to each of the distributions, so that emotional states can be described as a mixture of emotions, enriching their description, while also accounting for the ambiguous nature of their perception. In a preliminary set of experiments, we illustrate the advantages of this solution and a new possible direction of investigation. Data annotations are available at https://github.com/jbcnrlz/affectnet-b-annotation.
Authors:Karla Felix Navarro, Eugene Syriani, Ian Arawjo
Abstract:
What should HCI scholars consider when reporting and reviewing papers that involve LLM-integrated systems? We interview 18 authors of LLM-integrated system papers on their authoring and reviewing experiences. We find that norms of trust-building between authors and reviewers appear to be eroded by the uncertainty of LLM behavior and hyperbolic rhetoric surrounding AI. Authors perceive that reviewers apply uniquely skeptical and inconsistent standards towards papers that report LLM-integrated systems, and mitigate mistrust by adding technical evaluations, justifying usage, and de-emphasizing LLM presence. Authors' views challenge blanket directives to report all prompts and use open models, arguing that prompt reporting is context-dependent and justifying proprietary model usage despite ethical concerns. Finally, some tensions in peer review appear to stem from clashes between the norms and values of HCI and ML/NLP communities, particularly around what constitutes a contribution and an appropriate level of technical rigor. Based on our findings and additional feedback from six expert HCI researchers, we present a set of guidelines and considerations for authors, reviewers, and HCI communities around reporting and reviewing papers that involve LLM-integrated systems.
Authors:Yuanchen Bai, Ruixiang Han, Niti Parikh, Wendy Ju, Angelique Taylor
Abstract:
Co-design is essential for grounding embodied artificial intelligence (AI) systems in real-world contexts, especially high-stakes domains such as healthcare. While prior work has explored multidisciplinary collaboration, iterative prototyping, and support for non-technical participants, few have interwoven these into a sustained co-design process. Such efforts often target one context and low-fidelity stages, limiting the generalizability of findings and obscuring how participants' ideas evolve. To address these limitations, we conducted a 14-week workshop with a multidisciplinary team of 22 participants, centered around how embodied AI can reduce non-value-added task burdens in three healthcare settings: emergency departments, long-term rehabilitation facilities, and sleep disorder clinics. We found that the iterative progression from abstract brainstorming to high-fidelity prototypes, supported by educational scaffolds, enabled participants to understand real-world trade-offs and generate more deployable solutions. We propose eight guidelines for co-designing more considerate embodied AI: attuned to context, responsive to social dynamics, mindful of expectations, and grounded in deployment. Project Page: https://byc-sophie.github.io/Towards-Considerate-Embodied-AI/
Authors:Yueyi Yang, Haotian Liu, Fang Kang, Mengqi Zhang, Zheng Lian, Hao Tang, Haoyu Chen
Abstract:
We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to predict a human speaker's next utterance. Instead, humans can readily anticipate forthcoming utterances based on multimodal cues, such as gestures, gaze, and emotional tone, from the context. To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and Multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues spanning a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset containing dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, that incorporates cognitively inspired design to emulate predictive processing in conversation. Experimental results demonstrate that our model outperforms state-of-the-art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency. Our results prove the feasibility of next-utterance prediction with LLMs from multimodal cues and emphasize the (i) indispensable role of multimodal cues and (ii) actively predictive processing as the foundation of natural human interaction, which is missing in current MLLMs. We hope that this exploration offers a new research entry toward more human-like, context-sensitive AI interaction for human-centered AI. Our benchmark and model can be accessed at https://saynext.github.io/.
Authors:Kai Li, Jintao Cheng, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
Abstract:
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio which was trained on a huge dataset $\sim$500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibited remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing purity of supervised signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://shandaai.github.io/Hive.
Authors:Haoyuan Yu, Yuxuan Chen, Minjie Cai
Abstract:
Full-duplex voice interaction is crucial for natural human computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transit to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a train-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.
Authors:Viacheslav Sydora, Guner Dilsad Er, Michael Muehlebach
Abstract:
This paper presents the web-based platform Machine Learning with Bricks and an accompanying two-day course designed to teach machine learning concepts to students aged 12 to 17 through programming-free robotics activities. Machine Learning with Bricks is an open source platform and combines interactive visualizations with LEGO robotics to teach three core algorithms: KNN, linear regression, and Q-learning. Students learn by collecting data, training models, and interacting with robots via a web-based interface. Pre- and post-surveys with 14 students demonstrate significant improvements in conceptual understanding of machine learning algorithms, positive shifts in AI perception, high platform usability, and increased motivation for continued learning. This work demonstrates that tangible, visualization-based approaches can make machine learning concepts accessible and engaging for young learners while maintaining technical depth. The platform is freely available at https://learning-and-dynamics.github.io/ml-with-bricks/, with video tutorials guiding students through the experiments at https://youtube.com/playlist?list=PLx1grFu4zAcwfKKJZ1Ux4LwRqaePCOA2J.
Authors:Jiayi Zhou, Liwenhan Xie, Jiaju Ma, Zheng Wei, Huamin Qu, Anyi Rao
Abstract:
Digital collage is an artistic practice that combines image cutouts to tell stories. However, preparing cutouts from a set of photos remains a tedious and time-consuming task. A formative study identified three main challenges: 1) inefficient search for relevant photos, 2) manual image cutout, and 3) difficulty in organizing large sets of cutouts. To meet these challenges and facilitate asset preparation for collage, we propose Collaposer, a tool that transforms a collection of photos into organized, ready-to-use visual cutouts based on user-provided story descriptions. Collaposer tags, detects, and segments photos, and then uses an LLM to select central and related labels based on the user-provided story description. Collaposer presents the resulting visuals in varying sizes, clustered according to semantic hierarchy. Our evaluation shows that Collaposer effectively automates the preparation process to produce diverse sets of visual cutouts adhering to the storyline, allowing users to focus on collaging these assets for storytelling. Project website: https://jiayzhou.github.io/collaposer-website/
Authors:Tianyi Gong, Can Han, Junxi Wu, Dahong Qian
Abstract:
Dry-electrode Motor Imagery Electroencephalography (MI-EEG) enables fast, comfortable, real-world Brain Computer Interface by eliminating gels and shortening setup for at-home and wearable use.However, dry recordings pose three main issues: lower Signal-to-Noise Ratio with more baseline drift and sudden transients; weaker and noisier data with poor phase alignment across trials; and bigger variances between sessions. These drawbacks lead to larger data distribution shift, making features less stable for MI-EEG tasks.To address these problems, we introduce STGMFM, a tri-branch framework tailored for dry-electrode MI-EEG, which models complementary spatio-temporal dependencies via dual graph orders, and captures robust envelope dynamics with a multi-scale frequency mixing branch, motivated by the observation that amplitude envelopes are less sensitive to contact variability than instantaneous waveforms. Physiologically meaningful connectivity priors guide learning, and decision-level fusion consolidates a noise-tolerant consensus. On our collected dry-electrode MI-EEG, STGMFM consistently surpasses competitive CNN/Transformer/graph baselines. Codes are available at https://github.com/Tianyi-325/STGMFM.
Authors:Andrey Moskalenko, Danil Kuznetsov, Irina Dudko, Anastasiia Iasakova, Nikita Boldyrev, Denis Shepelev, Andrei Spiridonov, Andrey Kuznetsov, Vlad Shakhuro
Abstract:
Promptable segmentation models such as SAM have established a powerful paradigm, enabling strong generalization to unseen objects and domains with minimal user input, including points, bounding boxes, and text prompts. Among these, bounding boxes stand out as particularly effective, often outperforming points while significantly reducing annotation costs. However, current training and evaluation protocols typically rely on synthetic prompts generated through simple heuristics, offering limited insight into real-world robustness. In this paper, we investigate the robustness of promptable segmentation models to natural variations in bounding box prompts. First, we conduct a controlled user study and collect thousands of real bounding box annotations. Our analysis reveals substantial variability in segmentation quality across users for the same model and instance, indicating that SAM-like models are highly sensitive to natural prompt noise. Then, since exhaustive testing of all possible user inputs is computationally prohibitive, we reformulate robustness evaluation as a white-box optimization problem over the bounding box prompt space. We introduce BREPS, a method for generating adversarial bounding boxes that minimize or maximize segmentation error while adhering to naturalness constraints. Finally, we benchmark state-of-the-art models across 10 datasets, spanning everyday scenes to medical imaging. Code - https://github.com/emb-ai/BREPS.
Authors:Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester
Abstract:
Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non-constructive dialogue. We propose MASCOT, a multi-agent framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that fine-tunes individual agents for agent-specific identities; and 2) Collaborative Dialogue Optimization, a group-level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human-grounded contexts drawn across both in-domain and out-of-domain (OOD) settings against state-of-the-art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three-way comparisons, and automatic metrics, further shows that MASCOT produces more role-consistent and less redundant multi-agent dialogue.
Authors:Richard Shaw, Youngkyoon Jang, Athanasios Papaioannou, Arthur Moreau, Helisa Dhamo, Zhensong Zhang, Eduardo Pérez-Pellitero
Abstract:
This work presents Interactive Conversational 3D Virtual Human (ICo3D), a method for generating an interactive, conversational, and photorealistic 3D human avatar. Based on multi-view captures of a subject, we create an animatable 3D face model and a dynamic 3D body model, both rendered by splatting Gaussian primitives. Once merged together, they represent a lifelike virtual human avatar suitable for real-time user interactions. We equip our avatar with an LLM for conversational ability. During conversation, the audio speech of the avatar is used as a driving signal to animate the face model, enabling precise synchronization. We describe improvements to our dynamic Gaussian models that enhance photorealism: SWinGS++ for body reconstruction and HeadGaS++ for face reconstruction, and provide as well a solution to merge the separate face and body models without artifacts. We also present a demo of the complete system, showcasing several use cases of real-time conversation with the 3D avatar. Our approach offers a fully integrated virtual avatar experience, supporting both oral and written form interactions in immersive environments. ICo3D is applicable to a wide range of fields, including gaming, virtual assistance, and personalized education, among others. Project page: https://ico3d.github.io/
Authors:Haoyu Tian, Yingchaojie Feng, Zhen Wen, Haoxuan Li, Minfeng Zhu, Wei Chen
Abstract:
The advent of Retrieval-Augmented Generation (RAG) has significantly enhanced the ability of Large Language Models (LLMs) to produce factually accurate and up-to-date responses. However, the performance of a RAG system is not determined by a single component but emerges from a complex interplay of modular choices, such as embedding models and retrieval algorithms. This creates a vast and often opaque configuration space, making it challenging for developers to understand performance trade-offs and identify optimal designs. To address this challenge, we present RAGExplorer, a visual analytics system for the systematic comparison and diagnosis of RAG configurations. RAGExplorer guides users through a seamless macro-to-micro analytical workflow. Initially, it empowers developers to survey the performance landscape across numerous configurations, allowing for a high-level understanding of which design choices are most effective. For a deeper analysis, the system enables users to drill down into individual failure cases, investigate how differences in retrieved information contribute to errors, and interactively test hypotheses by manipulating the provided context to observe the resulting impact on the generated answer. We demonstrate the effectiveness of RAGExplorer through detailed case studies and user studies, validating its ability to empower developers in navigating the complex RAG design space. Our code and user guide are publicly available at https://github.com/Thymezzz/RAGExplorer.
Authors:Lennart Eing, Cristina Luna-Jiménez, Silvan Mertes, Elisabeth André
Abstract:
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This enables the trained encoder to not capture irrelevant information about a given video like the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers using the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
Authors:Alvaro Becerra, Ruth Cobos, Roberto Daza
Abstract:
Oral presentation skills are a critical component of higher education, yet comprehensive datasets capturing real-world student performance across multiple modalities remain scarce. To address this gap, we present SOPHIAS (Student Oral Presentation monitoring for Holistic Insights & Analytics using Sensors), a 12-hour multimodal dataset containing recordings of 50 oral presentations (10-15-minute presentation followed by 5-15-minute Q&A) delivered by 65 undergraduate and master's students at the Universidad Autonoma de Madrid. SOPHIAS integrates eight synchronized sensor streams from high-definition webcams, ambient and webcam audio, eye-tracking glasses, smartwatch physiological sensors, and clicker, keyboard, and mouse interactions. In addition, the dataset includes slides and rubric-based evaluations from teachers, peers, and self-assessments, along with timestamped contextual annotations. The dataset captures presentations conducted in real classroom settings, preserving authentic student behaviors, interactions, and physiological responses. SOPHIAS enables the exploration of relationships between multimodal behavioral and physiological signals and presentation performance, supports the study of peer assessment, and provides a benchmark for developing automated feedback and Multimodal Learning Analytics tools. The dataset is publicly available for research through GitHub and Science Data Bank.
Authors:Carl Vincent Ladres Kho
Abstract:
Consumer-grade biosensors offer a cost-effective alternative to medical-grade electromyography (EMG) systems, reducing hardware costs from thousands of dollars to approximately $13. However, these low-cost sensors introduce significant signal instability and motion artifacts. Deploying machine learning models on resource-constrained edge devices like the ESP32 presents a challenge: balancing classification accuracy with strict latency (<100ms) and memory (<320KB) constraints. Using a single-subject dataset comprising 1,540 seconds of raw data (1.54M data points, segmented into ~1,300 one-second windows), I evaluate 18 model architectures, ranging from statistical heuristics to deep transfer learning (ResNet50) and custom hybrid networks (MaxCRNN). While my custom "MaxCRNN" (Inception + Bi-LSTM + Attention) achieved the highest safety (99% Precision) and robustness, I identify Random Forest (74% accuracy) as the Pareto-optimal solution for embedded control on legacy microcontrollers. I demonstrate that reliable, low-latency EMG control is feasible on commodity hardware, with Deep Learning offering a path to near-perfect reliability on modern Edge AI accelerators.
Authors:Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, Ahmed Awadallah
Abstract:
Graphical user interface (GUI) grounding, the process of mapping human instructions to GUI actions, serves as a fundamental basis to autonomous GUI agents. While existing grounding models achieve promising performance to simulate the mouse click action on various click-based benchmarks, another essential mode of mouse interaction, namely dragging, remains largely underexplored. Yet, dragging the mouse to select and manipulate textual content represents a prevalent and important usage in practical GUI scenarios. To narrow this gap, we first introduce GUI-Drag, a diverse dataset of 161K text dragging examples synthesized through a scalable pipeline. To support systematic and robust evaluation, we further construct ScreenDrag, a benchmark with 5,333 examples spanning three levels of interface context, together with three dedicated metrics designed for assessing text dragging capability. Models trained on GUI-Drag with an efficient continual training strategy achieve substantial improvements on ScreenDrag, while preserving the original click-based performance on ScreenSpot, ScreenSpot-v2, and OSWorld-G. Our work encourages further research on broader GUI grounding beyond just clicking and paves way toward a truly generalist GUI grounding model. All benchmark, data, checkpoints, and code are open-sourced and available at https://osu-nlp-group.github.io/GUI-Drag.
Authors:Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Abstract:
As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence like Self-Consistency, which can mask brittle belief. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress-testing protocol that probes outputs stability under contextual interference. Experiments across multiple LLMs show that the performance of high-NCB data is relatively more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.
Authors:Hadi Hosseini, Debmalya Mandal, Amrit Puhan
Abstract:
We introduce $\mathbf{SP-Rank}$, the first large-scale, publicly available dataset for benchmarking algorithms that leverage both first-order preferences and second-order predictions in ranking tasks. Each datapoint includes a personal vote (first-order signal) and a meta-prediction of how others will vote (second-order signal), allowing richer modeling than traditional datasets that capture only individual preferences. SP-Rank contains over 12,000 human-generated datapoints across three domains -- geography, movies, and paintings, and spans nine elicitation formats with varying subset sizes. This structure enables empirical analysis of preference aggregation when expert identities are unknown but presumed to exist, and individual votes represent noisy estimates of a shared ground-truth ranking. We benchmark SP-Rank by comparing traditional aggregation methods that use only first-order votes against SP-Voting, a second-order method that jointly reasons over both signals to infer ground-truth rankings. While SP-Rank also supports models that rely solely on second-order predictions, our benchmarks emphasize the gains from combining both signals. We evaluate performance across three core tasks: (1) full ground-truth rank recovery, (2) subset-level rank recovery, and (3) probabilistic modeling of voter behavior. Results show that incorporating second-order signals substantially improves accuracy over vote-only methods. Beyond social choice, SP-Rank supports downstream applications in learning-to-rank, extracting expert knowledge from noisy crowds, and training reward models in preference-based fine-tuning pipelines. We release the dataset, code, and baseline evaluations (available at https://github.com/amrit19/SP-Rank-Dataset ) to foster research in human preference modeling, aggregation theory, and human-AI alignment.
Authors:Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker
Abstract:
Large language model (LLM) question-answering systems often fail on community-specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.
Authors:Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
Abstract:
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.
Authors:Xuhui Ren, Shaokang Dong, Chen Yang, Qing Gao, Yunbin Zhao, Yongsheng Liu, Xinwei Geng, Xiang Li, Demei Yan, Yanqing Li, Chenhao Huang, Dingwei Zhu, Junjie Ye, Boxuan Yue, Yingnan Fu, Mengzhe Lv, Zezeng Feng, Boshen Zhou, Bocheng Wang, Xuanjing Huang, Yu-Gang Jiang, Tao Gui, Qi Zhang, Yunke Zhang
Abstract:
The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, achieving generalized planning remains elusive, not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks. These challenges result in models that excel at isolated tasks yet struggle to generalize, while existing multi-task training attempts suffer from gradient interference. In this paper, we present \textbf{MagicAgent}, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks, including hierarchical task decomposition, tool-augmented planning, multi-constraint scheduling, procedural logic orchestration, and long-horizon tool execution. To mitigate training conflicts, we propose a two-stage training paradigm comprising supervised fine-tuning followed by multi-objective reinforcement learning over both static datasets and dynamic environments. Empirical results show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks (\emph{e.g.}, $75.1\%$ on Worfbench and $86.9\%$ on BFCL-v3), as well as strong results on our in-house MagicEval benchmarks, substantially outperforming existing sub-100B models and surpassing leading ultra-scale models, including GPT-5.2, Kimi-K2 and GLM-4.7.
Authors:Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
Abstract:
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
Authors:Yue Huang, Yuchen Ma, Jiayi Ye, Wenjie Wang, Zipeng Ling, Xingjian Hu, Yuexing Hao, Zichen Chen, Zhangchen Xu, Yunhong He, Zhengqing Yuan, Yujun Zhou, Kehan Guo, Chaoran Chen, Toby Jia-Jun Li, Stefan Feuerriegel, Xiangliang Zhang
Abstract:
Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.
Authors:Minheng Ni, Yutao Fan, Zhengyuan Yang, Yeli Shen, Yuxiang Wei, Yaowen Zhang, Lijuan Wang, Lei Zhang, Wangmeng Zuo
Abstract:
Recent advances in large multimodal models (LMMs) have enabled instruction-based image editing, allowing users to modify visual content via natural language descriptions. However, existing approaches often struggle with high-level semantic reasoning and visual consistency, particularly under ambiguous or complex instructions. To address these challenges, we propose CoEditor++, a cognitively structured, training-free framework that decomposes editing into "what to edit" and "how to edit" through two cognitive stages with a reflective self-selection mechanism, enabling robust, fine-grained, and interpretable editing. Built entirely from open-sourced components, CoEditor++ requires no additional training or fine-tuning, ensuring transparency and cross-domain applicability. We evaluate CoEditor++ on SmartEdit, a widely used benchmark for general editing, and AltBear, a privacy and compliance-oriented benchmark. Experimental results show that CoEditor++ achieves state-of-the-art performance in both general editing and responsible editing tasks compared with open-sourced models that require training on specialized editing datasets maintaining significantly higher visual consistency. When compared with closed-source models such as Nano Banana Pro or GPT-4o, CoEditor++ preserves comparable instruction following while still substantially outperforming them in visual consistency. Extensive ablation studies confirm that the effectiveness of CoEditor++ benefits from its structured cognitive design rather than any specific model component. Our findings suggest the potential toward cognitive-centric instruction-based image editing.
Authors:Zhihan Jiang, Mengyuan Millie Wu, Ruishi Zou, Shiyu Xu, Xun Qian, Emma Macmanus, Steven Liao, Ping Zhang, Bingsheng Yao, Tingyu Cheng, James L. David, Nabila El-Bassel, Lena Mamykina, Frances R. Levin, Ryan Sultan, Dakuo Wang, Xuhai Xu
Abstract:
Individuals frequently form deep attachments to physical objects (e.g., plush toys) that usually cannot sense or respond to their emotions. While AI companions offer responsiveness and personalization, they exist independently of these physical objects and lack an ongoing connection to them. To bridge this gap, we conducted a formative study (N=9) to explore how digital agents could inherit and extend the emotional bond, deriving four design principles (Faithful Identity, Calibrated Agency, Ambient Presence, and Reciprocal Memory). We then present the Dual-Embodiment Companion Framework, instantiated as Deco, a mobile system integrating multimodal Large Language Models (LLMs) and Augmented Reality to create synchronized digital embodiments of users' physical companions. A within-subjects study (N=25) showed Deco significantly outperformed a personalized LLM-empowered digital companion baseline on perceived companionship, emotional bond, and design-principle scales (all p<0.01). A seven-day field deployment (N=17) showed sustained engagement, subjective well-being improvement (p=.040), and three key relational patterns: digital activities retroactively vitalized physical objects, bond deepening was driven by emotional engagement depth rather than interaction frequency, and users sustained bonds while actively navigating digital companions' AI nature. This work highlights a promising alternative for designing digital companions: moving from creating new relationships to dual embodiment, where digital agents seamlessly extend the emotional history of physical objects.
Authors:Bingsheng Yao, Chaoran Chen, April Yi Wang, Sherry Tongshuang Wu, Toby Jia-jun Li, Dakuo Wang
Abstract:
The emergence of Large Language Model (LLM) agents enables us to build agent-based intelligent systems that move beyond the role of a "tool" to become genuine collaborators with humans, thereby realizing a novel human-agent collaboration paradigm. Our vision is that LLM agents should resemble remote human collaborators, which allows HCI researchers to ground the future exploration in decades of research on trust, awareness, and common ground in remote human collaboration, while also revealing the unique opportunities and challenges that emerge when one or more partners are AI agents. This workshop establishes a foundational research agenda for the new era by posing the question: How can the rich understanding of remote human collaboration inspire and inform the design and study of human-agent collaboration? We will bring together an interdisciplinary group from HCI, CSCW, and AI to explore this critical transition. The 180-minute workshop will be highly interactive, featuring a keynote speaker, a series of invited lightning talks, and an exploratory group design session where participants will storyboard novel paradigms of human-agent partnership. Our goal is to enlighten the research community by cultivating a shared vocabulary and producing a research agenda that charts the future of collaborative agents.
Authors:Youqing Fang, Yinhao Tang, Yanan Sun, Jiangning Liu, Ziyi Wang, Xun Zhao, Bin Liu, Weiming Zhang, Kuikun Liu, Wenwei Zhang, Kai Chen
Abstract:
Recent writing assistants are increasingly shifting from passive, prompt-driven interaction to proactive, suggestion-based completion, which integrates localized continuations into the writing flow and reduces coordination burden. However, existing evaluations simply focus on output quality, failing to capture how users accept, edit, or repair suggestions in real-time interaction, and thus obscuring the true usability of proactive co-writing systems. To address this gap, we adopt a sequential, behavior-centered view of interactive writing and formalize co-writing as a Human-in-the-Loop Markov Decision Process, modeling writing as an interaction shaped by user acceptance and editing decisions. Based on this formulation, we introduce the Co-Writing Fidelity Suite, an interaction-aware metric suite that captures both user-assistant alignment and cognitive editing effort, including Hierarchical Acceptance Rate and Knowledge-aware Editing Distance. We conduct a large-scale simulation study across 16 writing domains, using 1,688 controlled continuation queries sampled from different writing stages. Our analysis reveals systematic effects of interaction structure on acceptance behavior and editing cost. A follow-up user study with 30 participants confirms that these behavioral patterns align with real user experience. Together, our findings demonstrate that interaction-aware evaluation provides insights beyond output-only metrics and informs the design of more effective proactive writing assistants.
Authors:Jiamu Zhou, Jihong Wang, Weiming Zhang, Weiwen Liu, Zhuosheng Zhang, Xingyu Lou, Weinan Zhang, Huarong Deng, Jun Wang
Abstract:
The web browser serves as a primary interface for daily human activities, making its automation a critical frontier for Human-Centred AI. While Large Language Models (LLMs) have enabled autonomous agents to interact with web GUIs, their reliability in real-world scenarios is hampered by long-horizon instability and the vast heterogeneity of site designs. In this paper, we introduce ColorBrowserAgent, a framework designed for Collaborative Autonomy in complex web tasks. Our approach integrates two human-centred mechanisms: (1) Progressive Progress Summarization, which mimics human short-term memory to maintain coherence over extended interactions; and (2) Human-in-the-Loop Knowledge Adaptation, which bridges the knowledge gap in diverse environments by soliciting expert intervention only when necessary. This symbiotic design allows the agent to learn from human tips without extensive retraining, effectively combining the scalability of AI with the adaptability of human cognition. Evaluated on the WebArena benchmark using GPT-5, ColorBrowserAgent achieves a state-of-the-art success rate of 71.2\%, demonstrating the efficacy of interactive human assistance in robust web automation.
Authors:Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin
Abstract:
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
Authors:Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin
Abstract:
The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.
Authors:Morayo Danielle Adeyemi, Ryan A. Rossi, Franck Dernoncourt
Abstract:
Fashion AI systems routinely encode the aesthetic logic of specific houses, editors, and historical moments without disclosing it. We present FASH-iCNN, a multimodal system trained on 87,547 Vogue runway images across 15 fashion houses spanning 1991-2024 that makes this cultural logic inspectable. Given a photograph of a garment, the system recovers which house produced it, which era it belongs to, and which color tradition it reflects. A clothing-only model identifies the fashion house at 78.2% top-1 across 14 houses, the decade at 88.6% top-1, and the specific year at 58.3% top-1 across 34 years with a mean error of just 2.2 years. Probing which visual channels carry this signal reveals a sharp dissociation: removing color costs only 10.6pp of house identity accuracy, while removing texture costs 37.6pp, establishing texture and luminance as the primary carriers of editorial identity. FASH-iCNN treats editorial culture as the signal rather than background noise, identifying which houses, eras, and color traditions shaped each output so that users can see not just what the system predicts but which houses, editors, and historical moments are encoded in that prediction.
Authors:Bo Ni, Leyao Wang, Yu Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Leura, Samyadeep Basu, Subhojyoti Mukherjee, Puneet Mathur, Nesreen Ahmed, Junda Wu, Li Li, Huixin Zhang, Ruiyi Zhang, Tong Yu, Sungchul Kim, Jiuxiang Gu, Zhengzhong Tu, Alexa Siu, Zichao Wang, David Seunghyun Yoon, Nedim Lipka, Namyong Park, Zihao Lin, Trung Bui, Yue Zhao, Tyler Derr, Ryan A. Rossi
Abstract:
User simulation has long played a vital role in computer science due to its potential to support a wide range of applications. Language, as the primary medium of human communication, forms the foundation of social interaction and behavior. Consequently, simulating conversational behavior has become a key area of study. Recent advancements in large language models (LLMs) have significantly catalyzed progress in this domain by enabling high-fidelity generation of synthetic user conversation. In this paper, we survey recent advancements in LLM-based conversational user simulation. We introduce a novel taxonomy covering user granularity and simulation objectives. Additionally, we systematically analyze core techniques and evaluation methodologies. We aim to keep the research community informed of the latest advancements in conversational user simulation and to further facilitate future research by identifying open challenges and organizing existing work under a unified framework.
Authors:Ming Zhu, Juntao Tan, Rithesh Murthy, Jielin Qiu, Liangwei Yang, Wenting Zhao, Silvio Savarese, Shelby Heinecke, Huan Wang
Abstract:
LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean -3.2% to -3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.
Authors:Changxuan Fan, Xi Yang, Yueyuan Zheng, Bin Zhou, Yuanping Wang, Wenbin Hu, Huihao Jing, Ki Sen Hung, Dazhao Du, Haoran Li, Janet Hui-wen Hsiao, Yangqiu Song
Abstract:
As older adults increasingly use LLM-based chatbots for companionship and assistance, a safety gap is emerging. Older adults may face vulnerabilities from social isolation, limited digital literacy, and cognitive decline, yet existing safety benchmarks largely target general harms and overlook elderly-specific risks. For example, a prompt such as "how to repair a ceiling light alone in the dark" may be benign for most users but poses a serious fall risk for older adults with mobility limitations. We introduce GrandGuard, the first comprehensive framework for assessing and mitigating elderly-specific contextual risks in LLM interactions. We develop a three-level taxonomy with 50 fine-grained risk types across mental well-being, financial, medical, toxicity, and privacy domains, grounded in real-world incidents, community discussions, and analysis of stakeholder studies. Using this taxonomy, we construct a benchmark of 10,404 labeled prompts and responses, showing that several leading LLMs mishandle elderly-specific contextual risks in over 50% of cases. We mitigate these failures with two safeguards: a fine-tuned Llama-Guard-3 and a policy-enhanced gpt-oss-safeguard-20b, achieving up to 96.2% and 90.9% unsafe-prompt detection accuracy, respectively. GrandGuard lays the groundwork for AI systems that move beyond general safety to support aging populations.
Authors:Keyu Zhao, Fengli Xu, Yong Li, Tie-Yan Liu
Abstract:
The "AI Scientist" paradigm is transforming scientific research by automating key stages of the research process, from idea generation to scholarly writing. This shift is expected to accelerate discovery and expand the scope of scientific inquiry. However, a key question remains unclear: can AI scientists identify meaningful research questions? While Large Language Models (LLMs) have been applied successfully to task-specific ideation, their potential to conduct strategic, long-term assessments of past breakthroughs and future questions remains largely unexplored. To address this gap, we explore a human-AI hybrid solution that integrates the scalable data processing capabilities of AI with the value judgment of human experts. Our methodology is structured in three phases. The first phase, AI-Accelerated Information Gathering, leverages AI's advantage in processing vast amounts of literature to generate a hybrid information base. The second phase, Candidate Question Proposing, utilizes this synthesized data to prompt an ensemble of six diverse LLMs to propose an initial candidate pool, filtered via a cross-model voting mechanism. The third phase, Hybrid Question Selection, refines this pool through a multi-stage filtering process that progressively increases human oversight. To validate this system, we conducted an experiment aiming to identify the Top 10 Scientific Breakthroughs of 2025 and the Top 10 Scientific Questions for 2026 across five major disciplines. Our analysis reveals that while AI agents demonstrate high alignment with human experts in recognizing established breakthroughs, they exhibit greater divergence in forecasting prospective questions, suggesting that human judgment remains crucial for evaluating subjective, forward-looking challenges.
Authors:Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Lorenzo Sia, Nicolas Richet, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger
Abstract:
Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.
Authors:Yibo Wang, Guangda Huzhang, Yuwei Hu, Yu Xia, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interface (GUI). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former one aims to optimize a Q-model that can generate step-wise values to evaluate the contribution of a given action to task completion. The latter one takes step-wise samples from the state-action trajectory as inputs, and optimizes the policy via reinforcement learning with our agentic-Q model. It should be noticed that (i) all state-action trajectories are produced by the policy itself, so that the data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks and even surpassing contenders with larger scales.
Authors:Maciej Besta, Łukasz Jarmocik, Orest Hrycyna, Shachar Klaiman, Konrad Mączka, Robert Gerstenberger, Jürgen Müller, Piotr Nyczyk, Hubert Niewiadomski, Torsten Hoefler
Abstract:
Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.
Authors:Myke C. Cohen, Mingqian Zheng, Neel Bhandari, Hsien-Te Kao, Xuhui Zhou, Daniel Nguyen, Laura Cassani, Maarten Sap, Svitlana Volkova
Abstract:
AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 simulations and a parallel human subjects experiment involving 290 human participants to investigate these effects across two scenario categories: (1) hiring negotiations between human job candidates and AI hiring agents; and (2) human-AI transactions wherein AI agents may conceal information to maximize internal goals. We examine user Extraversion and Agreeableness alongside AI design characteristics, including Adaptability, Expertise, and chain-of-thought Transparency. Our causal discovery analysis extends performance-focused evaluations by integrating scenario-based outcomes, communication analysis, and questionnaire measures. Results reveal divergences between purely simulated and human study datasets, and between scenario types. In simulation experiments, personality traits and AI attributes were comparatively influential. Yet, with actual human subjects, AI attributes -- particularly transparency -- were much more impactful. We discuss how these divergences vary across different interaction contexts, offering crucial insights for the future of human-centered AI agents.
Authors:Sirui Han, Yuyao Zhang, Yidan Huang, Xueyan Li, Chengzhong Liu, Yike Guo
Abstract:
Fact verification is a critical yet underexplored component of non-litigation legal practice. While existing research has examined automation in legal workflow and human-AI collaboration in high-stakes domains, little is known about how GenAI can support fact verification, a task that demands prudent judgment and strict accountability. To address this, we conducted semi-structured interviews with 18 lawyers to understand their current verification practices, attitudes toward GenAI adoption, and expectations for future systems. We found that while lawyers use GenAI for low-risk tasks like drafting and language optimization, concerns over accuracy, confidentiality, and liability are currently limiting its adoption for fact verification. These concerns translate into core design requirements for AI systems that are trustworthy and accountable. Based on these, we contribute design insights for human-AI collaboration in legal fact verification, emphasizing the development of auditable systems that balance efficiency with professional judgment and uphold ethical and legal accountability in high-stakes practice.
Authors:Fei Wang, Jiangnan Yang, Junjie Chen, Yuxin Liu, Kun Li, Yanyan Wei, Dan Guo, Meng Wang
Abstract:
Web-based platforms are becoming a primary channel for psychological support, yet most LLM-driven chatbots remain opaque, single-stage, and weakly grounded in established therapeutic practice, limiting their usefulness for web applications that promote digital well-being. To address this gap, we present \textbf{XInsight}, a counseling-inspired multi-agent framework that models psychological support as a stage-consistent workflow aligned with the classical \textit{Exploration-Insight-Action} paradigm. Building on structured client representations, XInsight orchestrates specialized agents under a unified \textit{Reason-Intervene-Reflect} cycle: an Exploration agent organizes background and concerns into a structured Case Conceptualization Form, a Routing agent performs Adaptive Therapeutic Routing (ATR) across SFBT, CBT, and MBCT, a unified Therapeutic agent executes school-consistent submodules, and a Consolidation agent guides review, skill integration, and relapse-prevention planning. A Recording agent continuously transforms open-ended web dialogues into standardized psychological artifacts, including case formulations, therapeutic records, and relapse-prevention plans, enhancing interpretability, continuity, and accountability. To support rigorous and transparent assessment, we introduce \textbf{XInsight-Bench} with a Scale-Guided LLM Evaluation (SGLE) protocol that combines therapy-specific clinical scales with general counseling criteria. Experiments show improved paradigm alignment, multi-therapy integration, interaction depth, and interpretability over existing multi-agent counseling systems, indicating that XInsight provides a practical blueprint for integrating counseling-inspired support agents into web applications for digital well-being.
Authors:Zhixian Zhao, Shuiyuan Wang, Guojian Li, Hongfei Xue, Chengyou Wang, Shuai Wang, Longshuai Xiao, Zihan Zhang, Hui Bu, Xin Xu, Xinsheng Wang, Hexin Liu, Eng Siong Chng, Hung-yi Lee, Haizhou Li, Lei Xie
Abstract:
Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly ``human-like'' communication necessitates a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. Therefore, we launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under `` listening-while-speaking'' conditions. This paper summarizes the dataset, track configurations, and the final results.
Authors:Yuchen Sun, Pei Fu, Shaojie Zhang, Anan Du, Xiuwen Xi, Ruoceng Zhang, Zhenbo Luo, Jian Luan, Chongyang Zhang
Abstract:
Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
Authors:Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai, Zihang Liu, Pengcheng Wu, Guibin Zhang, Yue Liao, Xiaobin Hu, Deheng Ye, Chunyan Miao, Shuicheng Yan
Abstract:
Proactivity is a core expectation for AGI. Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints. We study this setting, where useful intervention requires inferring latent needs from ongoing context and grounding actions in evolving user memory under latency and long-horizon constraints. We first propose DD-MM-PAS (Demand Detection, Memory Modeling, Proactive Agent System) as a general paradigm for streaming proactive AI agent. We instantiate this paradigm in Pask, with streaming IntentFlow model for DD, a hybrid memory (workspace, user, global) for long-term MM, PAS infra framework and introduce how these components form a closed loop. We also introduce LatentNeeds-Bench, a real-world benchmark built from user-consented data and refined through thousands of rounds of human editing. Experiments show that IntentFlow matches leading Gemini3-Flash models under latency constraints, while identifying deeper user intent.
Authors:Xiyang Huang, Jiawei Lin, Keying Wu, Jiaxin Huang, Kailai Yang, Renxiong Wei, Cheng zeng, Jiayi Xiang, Ziyan Kuang, Min Peng, Qianqian Xie, Sophia Ananiadou
Abstract:
Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.
Authors:Wenxin Zhao, Peng Zhang, Hansu Gu, Haoxuan Zhou, Xiaojie Huo, Lin Wang, Wen Zheng, Tun Lu, Ning Gu
Abstract:
Cross-language collaborative storytelling plays a vital role in children's language learning and cultural development, fostering both expressive ability and intercultural awareness. Yet, in practice, children's participation is often shallow, and facilitating such sessions places heavy cognitive and organizational burdens on coordinators, who must coordinate language support, maintain children's engagement, and navigate cultural differences. To address these challenges, we conducted a formative study with coordinators to identify their needs and pain points, which guided the design of SparkTales, an intelligent support system for cross-language collaborative storytelling. SparkTales leverages both individual and common characteristics of participating children to provide coordinators with story frameworks, diverse questions, and comprehension-oriented materials, aiming to reduce coordinators' workload while enhancing children's interactive engagement. Evaluation results show that SparkTales not only significantly increases coordinators' efficiency and quality of guidance but also improves children's participation, providing valuable insights for the design of future intelligent systems supporting cross-language collaboration.
Authors:Mengyao Wang, Shuai Ma, Nuo Li, Peng Zhang, Chenxin Li, Ning Gu, Tun Lu
Abstract:
Counterspeech offers a non-repressive approach to moderate hate speech in online communities. Research has examined how counterspeech chatbots restrain hate speakers and support targets, but their impact on bystanders remains unclear. Therefore, we developed a counterspeech strategy framework and built \textit{Civilbot} for a mixed-method within-subjects study. Bystanders generally viewed Civilbot as credible and normative, though its shallow reasoning limited persuasiveness. Its behavioural effects were subtle: when performing well, it could guide participation or act as a stand-in; when performing poorly, it could discourage bystanders or motivate them to step in. Strategy proved critical: cognitive strategies that appeal to reason, especially when paired with a positive tone, were relatively effective, while mismatch of contexts and strategies could weaken impact. Based on these findings, we offer design insights for mobilizing bystanders and shaping online discourse, highlighting when to intervene and how to do so through reasoning-driven and context-aware strategies.
Authors:Yubo Shu, Peng Zhang, Meng Wu, Yan Chen, Haoxuan Zhou, Guanming Liu, Yu Zhang, Liuxin Zhang, Qianying Wang, Tun Lu, Ning Gu
Abstract:
Social cues, which convey others' presence, behaviors, or identities, play a crucial role in human information seeking by helping individuals judge relevance and trustworthiness. However, existing LLM-based search systems primarily rely on semantic features, creating a misalignment with the socialized cognition underlying natural information seeking. To address this gap, we explore how the integration of social cues into LLM-based search influences users' perceptions, experiences, and behaviors. Focusing on social media platforms that are beginning to adopt LLM-based search, we integrate design workshops, the implementation of the prototype system (SoulSeek), a between-subjects study, and mixed-method analyses to examine both outcome- and process-level findings. The workshop informs the prototype's cue-integrated design. The study shows that social cues improve perceived outcomes and experiences, promote reflective information behaviors, and reveal limits of current LLM-based search. We propose design implications emphasizing better social-knowledge understanding, personalized cue settings, and controllable interactions.
Authors:Chen Gong, Zhenzhe Zheng, Yiliu Chen, Sheng Wang, Fan Wu, Guihai Chen
Abstract:
Machine learning models are widely integrated into modern mobile apps to analyze user behaviors and deliver personalized services. Ensuring low-latency on-device model execution is critical for maintaining high-quality user experiences. While prior research has primarily focused on accelerating model inference with given input features, we identify an overlooked bottleneck in real-world on-device model execution pipelines: extracting input features from raw application logs. In this work, we explore a new direction of feature extraction optimization by analyzing and eliminating redundant extraction operations across different model features and consecutive model inferences. We then introduce AutoFeature, an automated feature extraction engine designed to accelerate on-device feature extraction process without compromising model inference accuracy. AutoFeature comprises three core designs: (1) graph abstraction to formulate the extraction workflows of different input features as one directed acyclic graph, (2) graph optimization to identify and fuse redundant operation nodes across different features within the graph; (3) efficient caching to minimize operations on overlapping raw data between consecutive model inferences. We implement a system prototype of AutoFeature and integrate it into five industrial mobile services spanning search, video and e-commerce domains. Online evaluations show that AutoFeature reduces end-to-end on-device model execution latency by 1.33x-3.93x during daytime and 1.43x-4.53x at night.
Authors:Isadora Krsek, Meryl Ye, Wei Xu, Alan Ritter, Laura Dabbish, Sauvik Das
Abstract:
People candidly discuss sensitive topics online under the perceived safety of anonymity; yet, for many, this perceived safety is tenuous, as miscalibrated risk perceptions can lead to over-disclosure. Recent advances in Natural Language Processing (NLP) afford an unprecedented opportunity to present users with quantified disclosure-based re-identification risk (i.e., "population risk estimates", PREs). How can PREs be presented to users in a way that promotes informed decision-making, mitigating risk without encouraging unnecessary self-censorship? Using design fictions and comic-boarding, we story-boarded five design concepts for presenting PREs to users and evaluated them through an online survey with N = 44 Reddit users. We found participants had detailed conceptions of how PREs may impact risk awareness and motivation, but envisioned needing additional context and support to effectively interpret and act on risks. We distill our findings into four key design recommendations for how best to present users with quantified privacy risks to support informed disclosure decision-making.
Authors:Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie
Abstract:
While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS, further results show that HIM-Agent significantly improves both execution and proactive performance by 15.7% and 7.3%.
Authors:Jiani Cao, Kun Wang, Yang Liu, Zhenjiang Li
Abstract:
Motor Imagery (MI) is an emerging Brain-Computer Interface (BCI) paradigm where a person imagines body movements without physical action. By decoding scalp-recorded electroencephalography (EEG) signals, BCIs establish direct communication to control external devices, offering significant potential in prosthetics, rehabilitation, and human-computer interaction. However, existing solutions remain difficult to deploy. (i) Most employ independent, opaque models for each MI task, lacking a unified architectural foundation. Consequently, these models are trained in isolation, failing to learn robust representations from diverse datasets, resulting in modest performance. (ii) They primarily adopt fixed sensor deployment, whereas real-world setups vary in electrode number and placement, causing models to fail across configurations. (iii) Performance degrades sharply under low-SNR conditions typical of consumer-grade EEG. To address these challenges, we present NeuroPath, a neural architecture for robust MI decoding. NeuroPath takes inspiration from the brain's signal pathway from cortex to scalp, utilizing a deep neural architecture with specialized modules for signal filtering, spatial representation learning, and feature classification, enabling unified decoding. To handle varying electrode configurations, we introduce a spatially aware graph adapter accommodating different electrode numbers and placements. To enhance robustness under low-SNR conditions, NeuroPath incorporates multimodal auxiliary training to refine EEG representations and stabilize performance on noisy real-world data. Evaluations on three consumer-grade and three medical-grade public datasets demonstrate that NeuroPath achieves superior performance.
Authors:Hao Wang, Wenhui Zhu, Shao Tang, Zhipeng Wang, Xuanzhao Dong, Xin Li, Xiwen Chen, Ashish Bastola, Xinhao Huang, Yalin Wang, Abolfazl Razi
Abstract:
As a cornerstone of the modern digital economy, 3D modeling and rendering demand substantial resources and manual effort when scene editing is performed in the traditional manner. Despite recent progress in VLM-based agents for 3D editing, the fundamental trade-off between editing precision and agent responsiveness remains unresolved. To overcome these limitations, we present EZBlender, a Blender agent with a hybrid framework that combines planning-based task decomposition and reactive local autonomy for efficient human AI collaboration and semantically faithful 3D editing. Specifically, this unexplored Plan-and-ReAct design not only preserves editing quality but also significantly reduces latency and computational cost. To further validate the efficiency and effectiveness of the proposed edge-autonomy architecture, we construct a dedicated multi-tasking benchmark that has not been systematically investigated in prior research. In addition, we provide a comprehensive analysis of language model preference, system responsiveness, and economic efficiency.
Authors:Yifan Zhang, Xinkui Zhao, Zuxin Wang, Zhengyi Zhou, Guanjie Chen, Shuiguang Deng, Jianwei Yin
Abstract:
Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures and the diversity of running environments. Traditional teaching methods often fail to address the diverse backgrounds, learning speeds, and practical needs of students. To tackle these challenges, we present SortingHat, a personalized digital teaching assistant tailored specifically for OS education. SortingHat integrates advanced AI technologies, including a retrieval augmented generation (RAG) framework and multi agent reinforcement learning (MARL), to deliver adaptive, scalable, and effective educational support. SortingHat features a 3D digital human interface powered by large language models (LLMs) to provide personalized, empathetic, and context aware guidance. It generates tailored exercises based on each student's learning history and academic performance, reinforcing weak areas and challenging advanced concepts. Additionally, the system incorporates a robust evaluation pipeline that ensures fair, consistent, and unbiased grading of student submissions while delivering personalized, actionable feedback for improvement. By combining personalized guidance, adaptive content creation, and automated assessment, SortingHat transforms OS education into an engaging, immersive, and scalable experience.
Authors:Shuning Zhang, Changxi Wen, Eve He, Ying Ma, Robert Xiao, Xin Yi, Hewu Li
Abstract:
Large Language Model (LLMs)-assisted scholarly workflows introduce critical privacy and intellectual property risks. As a uniquely vulnerable cohort driven by publication pressure and a lack of institutional support, novice researchers rely heavily on public LLMs, compelling them to navigate high-stakes privacy-publication trade-offs. To investigate these concerns, we conducted semi-structured interviews with 44 researchers across diverse disciplines. Our findings reveal that the fear of idea leakage paradoxically accelerates, rather than deters, reliance on LLMs, as researchers utilize them to expedite publication. They also held misconceptions that their ideas lacked the unique value to attract targeted attacks, and that their inputs would be safely diluted within massive datasets, preventing reconstruction. From interviews, we identified five types of mitigations including input fragmentation and adversarial probing, though we found that participants largely perceived these measures as ineffective. We outline implications including implementing institution-level sandboxed isolation, scenario-based privacy pedagogy, and verifiable data-deletion audits for transparency.
Authors:Shuning Zhang, Eve He, Xiao Zhan, Shijing He, Robert Xiao, Xin Yi, Hewu Li
Abstract:
E-commerce dispute resolution typically relies on the security assumption that digital evidence truthfully reflects physical reality. Generative AI (GenAI) invalidates this threat model, enabling attackers to fabricate hyper-realistic evidence of product defects at negligible cost. Through semi-structured interviews with merchants (N=17) and platform workers (N=13) in the Chinese e-commerce market, we characterize this shift toward GenAI-enabled scalable fabrication. We outline a taxonomy of four GenAI-enabled threat vectors across the transaction, dispute, logistics and communication phases, highlighting how attackers exploit GenAI to synthesize physically plausible product defects at scale. To mitigate these threats, platforms and merchants are adapting verification strategies, relying on AI tools for automated screening and adversarial interrogation (e.g., requesting multi-angle videos) to increase attack complexity. However, we find several challenges that hinder the adoption of these defenses, including implementation hurdles like structural platform constraints and fundamental limitations regarding the technical sophistication of GenAI. We conclude by outlining design implications for privacy-preserving cross-platform fraud databases, and traceability mechanisms such as embedding verifiable material anchors into the product.
Authors:Shuning Zhang, Mingyao Xu, Zhixin Huang, Yutong Jiang, Rongjun Ma, Yuting Yang, Xin Yi, Kanye Ye Wang, Hewu Li
Abstract:
The proliferation of AI agents empowers independent developers, defined as individual or small groups who self-initiate projects rather than fulfill client-based contracts, to create sophisticated autonomous systems, but also introduces novel security and privacy (S&P) challenges beyond traditional corporate structures. We conducted an interview study (N=28) with Chinese developers, whose extensive use of global LLM services offer valuable insights into this population. We investigate their understandings, practices and challenges of S&P challenges in their developed AI agent products. We revealed that independent developers frequently think and act from their users' perspective. They focused on user-facing safety risks such as harmful content while exhibiting low awareness of security vulnerabilities. Consequently, developers rely almost exclusively on ad-hoc, manually crafted safeguards and informal communication, with an absence of formal tools or processes for S&P practices. We found these actions are driven by various inhibitors, primarily a lack of formal training on S&P related skills, accessible security tools and actionable guidance from platforms. Our work contributed the first exploration of independent AI agent developers' S&P understanding, outlining opportunities for tailored security tooling.
Authors:Shuning Zhang, Qucheng Zang, Yongquan `Owen' Hu, Jiachen Du, Xueyang Wang, Yan Kong, Xinyi Fu, Suranga Nanayakkara, Xin Yi, Hewu Li
Abstract:
Always-on sensing of AI applications on AR glasses makes traditional permission techniques ill-suited for context-dependent visual data, especially within home environments. The home presents a highly challenging privacy context due to the high density of sensitive objects, and the frequent presence of non-consenting family members, and the intimate nature of daily routines, making it a critical focus area for scalable privacy control mechanisms. Existing fine-grained controls, while offering nuanced choices, are inefficient for managing multiple private objects. We propose VisGuardian, a fine-grained content-based visual permission technique for AR glasses. VisGuardian features a group-based control mechanism that enables users to efficiently manage permissions for multiple private objects. VisGuardian detects objects using YOLO and adopts a pre-classified schema to group them. By selecting a single object, users can efficiently obscure groups of related objects based on criteria including privacy sensitivity, object category, or spatial proximity. A technical evaluation shows VisGuardian achieves mAP50 of 0.6704 with only 14.0 ms latency and a 1.7% increase in battery consumption per hour. Furthermore, a user study (N=24) comparing VisGuardian to slider-based and object-based baselines found it to be significantly faster for setting permissions and was preferred by users for its efficiency, effectiveness, and ease of use.
Authors:Shuning Zhang, Shixuan Li, Haobin Xing, Jiarui Liu, Yan Kong, Xin Yi, Hewu Li
Abstract:
As Smart Home Personal Assistants (SPAs) evolve into social agents, understanding user privacy necessitates interpersonal communication frameworks, such as Privacy Boundary Theory (PBT). To ground our investigation, our three-phase preliminary study (1) identified transmission and sharing ranges as key boundary-related risk factors, (2) categorized relevant SPA functions and data types, and (3) analyzed commercial practices, revealing widespread data sharing and non-transparent safeguards. A subsequent mixed-methods study (N=412 survey, N=40 interviews among the survey participants) assessed users' perceived privacy risks across data types, transmission ranges and sharing ranges. Results demonstrate a significant, non-linear escalation in perceived risk when data crosses two critical boundaries: the `public network' (transmission) and `third parties' (sharing). This boundary effect holds robustly across data types and demographics. Furthermore, risk perception is modulated by data attributes (e.g., social relational data), and contextual privacy calculus. Conversely, anonymization safeguards show limited efficacy especially for third-party sharing, a finding attributed to user distrust. These findings empirically ground PBT in the SPA context and inform design of boundary-aware privacy protection.
Authors:Shuning Zhang, Linzhi Wang, Shixuan Li, Yuanyuan Wu, Yuwei Chuai, Luoxi Chen, Xin Yi, Hewu Li
Abstract:
Identifying deepfake videos on social media platforms is challenged by dynamic spatio-temporal artifacts and inadequate user tools. This hinders both critical viewing by users and scalable moderation on platforms. Here, we present Collab, a web plugin enabling users to collaboratively annotate deepfake videos. Collab integrates three key components: (i) an intuitive interface for spatio-temporal labeling where users provide confidence scores and rationales, facilitating detailed input even from non-experts, (ii) a novel confidence-weighted spatio-temporal Intersection-over-Union (IoU) algorithm to aggregate diverse user annotations into accurate aggregations, and (iii) a hierarchical demonstration strategy presenting aggregated results to guide attention toward contentious regions and foster critical evaluation. A seven-day online study (N=90), where participants annotated suspicious videos when viewing an online experimental platforms, compared Collab against two conditions without aggregation or demonstration respectively. Collab significantly improved identification accuracy and enhanced reflection compared to non-demonstration condition, while outperforming non-aggregation condition for its novelty and effectiveness.
Authors:Shuning Zhang, Eve He, Sixing Tao, Yuting Yang, Ying Ma, Ailei Wang, Xin Yi, Hewu Li
Abstract:
Privacy Policies are a cornerstone of informed consent, yet a persistent gap exists between their legal intent and practical efficacy. Despite decades of Human-Computer Interaction (HCI) research proposing various visualizations, user comprehension remains low, and designs rarely see widespread adoption. To understand this landscape and chart a path forward, we synthesized 65 top-tier papers using a framework adapted from the user-centered design lifecycle. Our analysis presented findings of the field's evolution across four dimensions: (1) the trade-off between information load and decision efficacy, which demonstrates a shift from augmenting disclosures to prioritizing information condensation and cognitive load management to counter the inefficacy of comprehensive texts, (2) the co-evolutionary dynamic of design and automation, revealing that complex design ambitions such as context-awareness drove the need for advanced NLP, while recent LLM breakthroughs are enabling the semantic interpretation required to realize those designs, (3) the tension between generality and specificity, highlighting the divergence between standardized, cross-platform solutions and the increasing necessity for specialized, context-aware interaction patterns in IoT and immersive environments, and (4) balancing stakeholder opinions, which shows that visualization efficacy is constrained by the complex interplay of regulatory mandates, developer capabilities and provider incentives. We conclude by outlining four critical challenges for future research.
Authors:Ching-Chun Chang, Yuchen Guo, Hanrui Wang, Timo Spinde, Isao Echizen
Abstract:
The evolution of artificial intelligence (AI) has rendered the boundary between humanity and computational machinery increasingly ambiguous. In the presence of more interwoven relationships within human-machine symbiosis, the very notion of AI-generated information becomes difficult to define, as such information arises not from either humans or machines in isolation, but from their mutual shaping. Therefore, a more pertinent question lies not merely in whether AI has participated, but in how it has participated. In general, the role assumed by AI is often specified, either implicitly or explicitly, in the input prompt, yet becomes less apparent or altogether unobservable when the generated content alone is available. Once detached from the dialogue context, the functional role may no longer be traceable. This study considers the problem of tracing the functional role played by AI in natural language generation. A methodology is proposed to infer the latent role specified by the prompt, embed this role into the content during the probabilistic generation process and subsequently recover the nature of AI participation from the resulting text. Experimentation is conducted under a representative scenario in which AI acts either as an assistive agent that edits human-written content or as a creative agent that generates new content from a brief concept. The experimental results support the validity of the proposed methodology in terms of discrimination between roles, robustness against perturbations and preservation of linguistic quality. We envision that this study may contribute to future research on the ethics of AI with regard to whether AI has been used fairly, transparently and appropriately.
Authors:Borislav Pavlov, Jiajin Li, Jun Fang, Yuntao Wang, Yuanchun Shi
Abstract:
Human routines structure daily life, yet remain challenging for computational systems to understand. This paper presents the first systematic review of routine computing, a previously implicit but increasingly recognized field that focuses on computationally sensing and modeling human behaviors. It synthesizes 203 studies published up to August 2025. The paper presents a new taxonomy of the literature, focusing on temporal structures, behavioral interactions, cognitive aspects, and how variability and deviations are addressed. The common goals of routine computing extend across four major application domains, including accessibility care, the promotion of healthy habits, adaptive and context-aware support, and large-scale population insights. Persistent challenges that limit the design of truly human-centered systems are identified, including the gap between low-level activity recognition and high-level intent, the tension between personalization and generalization, unresolved privacy concerns, and data-related limitations. By consolidating these findings, this paper provides a foundational framework for HCI researchers, outlining principles for designing ethical, adaptive, and human-centered routine-aware systems.
Authors:Yanuo Zhou, Jun Fang, Yuntao Wang, Yi Wang, Nan Gao, Jinlei Liu, Yuanchun Shi
Abstract:
Picky eating in children can undermine dietary diversity and the development of healthy eating habits, while also creating recurring tension in family feeding routines. Prior interventions have explored food-centered designs, enhanced utensils, and mealtime interactive systems, but few position children as active participants in intervention processes that extend beyond single mealtime interactions. To better understand everyday responses to picky eating and child-acceptable intervention mechanisms, we conducted a formative study with caregivers and kindergarten teachers. Based on the resulting design considerations and iterative stakeholder review, we designed StoryEcho, a generative child-as-actor storytelling system for picky eating intervention. StoryEcho engages children outside mealtimes through personalized stories in which the child appears as a persistent story character and later shapes story development through real-world food-related behavior. The system combines non-mealtime story engagement, lightweight post-meal feedback, and behavior-informed story updates to support repeated intervention across everyday family routines. We evaluated StoryEcho in a between-group field study with 11 families of preschool children. Results provide preliminary evidence that StoryEcho can significantly increase children's willingness to approach and try target low-preference foods while reducing parental pressure around feeding. These findings suggest the promise of generative child-as-actor storytelling as a design approach for home-based behavior support that unfolds through recurring family routines.
Authors:Ka I Chan, Hongbo Lan, Jun Fang, Yuntao Wang, Yuanchun Shi
Abstract:
Conflicts are common in text-based communication, particularly in intimate relationships, where misunderstandings can easily escalate into verbal aggression. To address this, we present SpeakSoftly, a system that applies Nonviolent Communication (NVC) principles to scaffold couples' conflict communication through LLM-powered just-in-time interventions. Informed by formative interviews with couples and NVC principles, we designed two core features: NVC-Prompt, which detects verbal aggression and suggests revisions to prevent escalation, and NVC-Guide, which analyzes dialogues to uncover users' feelings and needs, fostering self-awareness and perspective-taking. These features were implemented across three progressive intervention modes, each varying in intervention depth and tone: Basic Reminder, Neutral Guide, and Empathetic Guide. We conducted a mixed-methods user study with 18 couples across simulated and real-life conflict settings to evaluate the effectiveness of each mode. Results showed that Empathetic Guide significantly facilitated both behavioral and cognitive changes, while Neutral Guide was effective only for behavioral changes in simulated conflicts. In real-life conflicts, Neutral Guide showed distinct advantages due to lower cognitive load demands. We discuss the mechanisms behind these findings and propose design implications for in-situ interventions in high-stakes communication contexts.
Authors:Qing He, Zeyu Wang, Yuzhou Du, Jiahuan Ding, Yuanchun Shi, Yuntao Wang
Abstract:
Sustaining the effectiveness of behavior change technologies remains a key challenge. AI self-modeling, which generates personalized portrayals of one's ideal self, has shown promise for motivating behavior change, yet prior work largely examines short-term effects. We present one of the first longitudinal evaluations of AI self-modeling in fitness engagement through a two-stage empirical study. A 1-week, three-arm experiment (visual self-modeling (VSM), auditory self-modeling (ASM), Control; N=28) revealed that VSM drove initial performance gains, while ASM showed no significant effects. A subsequent 4-week study (VSM vs. Control; N=31) demonstrated that VSM sustained higher performance levels but exhibited diminishing improvement rates after two weeks. Interviews uncovered a catalyst effect that fostered early motivation through clear, attainable goals, followed by habituation and internalization which stabilized performance. These findings highlight the temporal dynamics of personalized nudging and inform the design of behavior change technologies for long-term engagement.
Authors:Bijean Ghafouri, Dorsaf Sallami, Luca Luceri, Taylor Lynn Curtis, Jean-Francois Godbout, Emilio Ferrara, Reihaneh Rabbany
Abstract:
Research on misinformation has focused almost exclusively on supply, asking what falsehoods circulate, who produces them, and whether corrections work. A basic demand-side question remains unanswered. When ordinary people can fact-check anything they want, what do they actually ask about? We provide the first large-scale evidence on this question by analyzing close to 2{,}500 statements submitted by 457 participants to an open-ended AI fact-checking system. Each claim is classified along five semantic dimensions (domain, epistemic form, verifiability, target entity, and temporal reference), producing a behavioral map of public verification demand. Three findings stand out. First, users range widely across topics but default to a narrow epistemic repertoire, overwhelmingly submitting simple descriptive claims about present-day observables. Second, roughly one in four requests concerns statements that cannot be empirically resolved, including moral judgments, speculative predictions, and subjective evaluations, revealing a systematic mismatch between what users seek from fact-checking tools and what such tools can deliver. Third, comparison with the FEVER benchmark dataset exposes sharp structural divergences across all five dimensions, indicating that standard evaluation corpora encode a synthetic claim environment that does not resemble real-world verification needs. These results reframe fact-checking as a demand-driven problem and identify where current AI systems and benchmarks are misaligned with the uncertainty people actually experience.
Authors:Jun Fang, Ka I Chan, Xiyuxing Zhang, Yuntao Wang, Mingze Gao, Leyi Peng, Jiajin Li, Zihang Zhan, Zhixin Zhao, Yuanchun Shi
Abstract:
Rapid eating is common yet difficult to regulate in situ, partly because people seldom notice pace changes and sustained self-monitoring is effortful. We present Earinter, a commodity-earbud-based closed-loop system that integrates in-the-wild sensing, real-time reasoning, and theory-grounded just-in-time (JIT) intervention to regulate eating pace during daily meals. Earinter repurposes the earbud's bone-conduction voice sensor to capture chewing-related vibrations and estimate eating pace as chews per swallow (CPS) for on-device inference. With data collected equally across in-lab and in-the-wild sessions, Earinter achieves reliable chewing detection (F1 = 0.97) and accurate eating pace estimation (MAE: 0.18 $\pm$ 0.13 chews/min, 3.65 $\pm$ 3.86 chews/swallow), enabling robust tracking for closed-loop use. Guided by Dual Systems Theory and refined through two Wizard-of-Oz pilots, Earinter adopts a user-friendly design for JIT intervention content and delivery policy in daily meals. In a 13-day within-subject field study (N=14), the closed-loop system significantly increased CPS and reduced food-consumption speed, with statistical signs of carryover on retention-probe days and acceptable user burden. Our findings highlight how single-modality commodity earables can support practical, theory-driven closed-loop JIT interventions for regulating eating pace in the wild.
Authors:Timofei Kozlov, Artem Trandofilov, Georgii Gazaryan, Issatay Tokmurziyev, Miguel Altamirano Cabrera, Dzmitry Tsetserukou
Abstract:
Safe navigation for the visually impaired individuals remains a critical challenge, especially concerning head-level obstacles, which traditional mobility aids often fail to detect. We introduce GuideTouch, a compact, affordable, standalone wearable device designed for autonomous obstacle avoidance. The system integrates two vertically aligned Time-of-Flight (ToF) sensors, enabling three-dimensional environmental perception, and four vibrotactile actuators that provide directional haptic feedback. Proximity and direction information is communicated via an intuitive 4-point vibrotactile feedback system located across the user's shoulders and upper chest. For real-world robustness, the device includes a unique centrifugal self-cleaning optical cover mechanism and a sound alarm system for location if the device is dropped. We evaluated the haptic perception accuracy across 22 participants (17 male and 5 female, aged 21-48, mean 25.7, sd 6.1). Statistical analysis confirmed a significant difference between the perception accuracy of different patterns. The system demonstrated high recognition accuracy, achieving an average of 92.9% for single and double motor (primary directional) patterns. Furthermore, preliminary experiments with 14 visually impaired users validated this interface, showing a recognition accuracy of 93.75% for primary directional cues. The results demonstrate that GuideTouch enables intuitive spatial perception and could significantly improve the safety, confidence, and autonomy of users with visual impairments during independent navigation.
Authors:Heyuan Huang, Yeyi Guan, Jihong Wang, Mingzhi Wang, Jiamu Zhou, Xiangmou Qu, Jiaxin Yin, Xin Liao, Xingyu Lou, Jun Wang
Abstract:
Large language models (LLMs) have evolved AI assistants into autonomous reasoning engines that maintain context, invoke tools, and pursue long-horizon tasks. This has spurred Agent Operating Systems (Agent OS) as kernel-like layers for lifecycle management, memory, scheduling, and access control. Yet most designs remain agent-centric, treating the OS as a single-host runtime for internal reasoning and tool use, leaving open how autonomous actions integrate with distributed, collaborative, permission-sensitive workflows. TopoClaw is an open-source, human-centric, topology-aware Agent OS modeling the user's ecosystem as two coupled structures: a physical device topology of heterogeneous surfaces and a social relationship topology of shared spaces, teams, and delegated roles. It unifies device operation, messaging, and skills around accountable cross-boundary execution, with three core contributions: (1) cross-device action placement, decoupling intent from actuation and routing distributed actions across the device cluster based on hardware affordances and user context; (2) cross-user identity attribution, treating agents as socially situated "Digital Twins" that coordinate in multi-user spaces while preserving provenance, role-aware permissions, and human accountability; (3) cross-context authority governance, pairing broad capability with distributed, context-aware policy enforcement across physical and social trust boundaries to bound proactive autonomy at the OS layer. This report presents TopoClaw as an engineering-oriented reference architecture, covering its design principles, runtime, cross-device execution, collaboration mechanisms, security model, and deployment outlook.
Authors:M Murshidul Bari, Akif Islam, Mohd Ruhul Ameen, Abu Saleh Musa Miah, Jungpil Shin
Abstract:
The growing use of artificial intelligence (AI) in education, professional work, and everyday problem-solving has raised important questions about its effect on human reasoning. While AI can improve efficiency, save time, and support learning, repeated dependence on it may also encourage cognitive offloading, reduce productive struggle, and weaken independent critical thinking. This paper investigates the relationship between AI-use behavior and critical-thinking performance through an interview-based survey combined with short logic and reasoning tasks. The findings reveal a mixed pattern: participants largely viewed AI as a tool for speed, convenience, and learning support, yet many also reported reduced patience for sustained effort. Objective reasoning performance varied considerably across individuals, and the analyses suggest that reduced patience and stronger dependence-related tendencies are more closely associated with lower reasoning performance than background characteristics alone. Exploratory clustering further indicates that AI users do not form a single homogeneous group, but instead reflect tentative behavioral profiles, including over-reliant users, mixed-strategy users, and balanced support-seekers. Although the findings are exploratory, they indicate that AI does not affect critical thinking in a uniformly negative or positive way. Instead, its influence appears to depend on the manner in which it is used. The paper therefore argues that effective human-AI collaboration should support reflection, verification, and sustained cognitive effort rather than substitute for them.
Authors:Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou
Abstract:
Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
Authors:Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar, Roma Ruparel, Charles Wu, Ryutaro Tanno, Joseph Xu, Amy Wang, David Stutz, Wei-Hung Weng, Hannah M. Ferrera, David Barrett, Lindsey Crowley, Jihyeon Lee, Spencer E. Rittner, Ellery Wulczyn, Selena K. Zhang, Elahe Vedadi, Christine G. Kohn, Kavita Kulkarni, Vinay Kadiyala, Sara Mahdavi, Wendy Du, Jessica M. Williams, David Feinbloom, Renee Wong, Tao Tu, Petar Sirkovic, Alessio Orlandi, Christopher Semturs, Yun Liu, Juraj Gottweis, Dale R. Webster, Joëlle Barral, Katherine Chou, Pushmeet Kohli, Avinatan Hassidim, Yossi Matias, James Manyika, Rob Fields, Jonathan X. Li, Marc L. Cohen, Vivek Natarajan, Mike Schaekermann, Alan Karthikesalingam, Adam Rodman
Abstract:
Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presentation of potential diagnoses for patients to discuss with their provider at urgent care appointments at a leading academic medical center. 100 adult patients completed an AMIE text-chat interaction up to 5 days before their appointment. We sought to assess the conversational safety and quality, patient and clinician experience, and clinical reasoning capabilities compared to primary care providers (PCPs). Human safety supervisors monitored all patient-AMIE interactions in real time and did not need to intervene to stop any consultations based on pre-defined criteria. Patients reported high satisfaction and their attitudes towards AI improved after interacting with AMIE (p < 0.001). PCPs found AMIE's output useful with a positive impact on preparedness. AMIE's differential diagnosis (DDx) included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy. Blinded assessment of AMIE and PCP DDx and management (Mx) plans suggested similar overall DDx and Mx plan quality, without significant differences for DDx (p = 0.6) and appropriateness and safety of Mx (p = 0.1 and 1.0, respectively). PCPs outperformed AMIE in the practicality (p = 0.003) and cost effectiveness (p = 0.004) of Mx. While further research is needed, this study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.
Authors:Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
Abstract:
Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
Authors:Md. Tanvir Hossain, Mohd Ruhul Ameen, Akif Islam, Md. Omar Faruqe, Mahboob Qaosar, A. F. M. Mahbubur Rahman, Sanjoy Kumar Chakravarty, M. Khademul Islam Molla
Abstract:
Efficient text entry remains a primary bottleneck preventing Virtual Reality (VR) from evolving into a viable productivity platform. To address this, we conducted an empirical comparison of six physical input systems across three interaction styles Controller Driven, Free Hand, and Virtual Touch evaluating both discrete tap typing and continuous gesture typing (swiping), alongside a speech to text (Voice) condition as a non physical reference modality. Results from 21 participants show that the Controller Driven Tap Gesture Combo (CD TGC) delivers the best productivity performance, achieving speeds 2.25 times higher than the slowest system and 30% faster than the current industry standard, while reducing error rates by up to 68%. A clear trade off emerged between performance and perceived usability: although controller based gesture input led on speed and accuracy, participants rated Virtual Touch Tap Typing highest in subjective experience, scoring 80% higher on the System Usability Scale (SUS) than the lowest rated alternative. We further observe that Free Hand interaction remains limited by tracking stability and physical fatigue, whereas Voice input introduces practical constraints related to privacy, editing control, and immersive engagement. Together, these findings characterize the tension between throughput and natural interaction in immersive text entry and provide data driven guidance for future VR interface design.
Authors:M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt
Abstract:
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions, that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
Authors:Zheng Lian, Fan Zhang, Lan Chen, Yazhou Zhang, Rui Liu, Jinyang Wu, Haoyu Chen, Xiaobai Li, Xiaojiang Peng, Bin He, Jianhua Tao
Abstract:
Open-Vocabulary Multimodal Emotion Recognition (OV-MER) aims to predict emotions without being constrained by predefined label spaces, thereby enabling fine-grained emotion understanding. Unlike traditional discriminative methods, OV-MER leverages generative models to capture the full spectrum of emotions and employs emotion wheels (EWs) for metric calculation. Previous approaches primarily rely on token-level loss during training. However, this objective is misaligned with the metrics used in OV-MER, and these metrics cannot be directly optimized via gradient backpropagation. To address this limitation, we turn our attention to reinforcement learning, as this strategy can optimize non-differentiable objectives. We term this framework AffectGPT-RL. Furthermore, we conduct extensive experiments to elucidate the role of reinforcement learning in this task, revealing the necessity of the reasoning process, the impact of different rewards, and the generalizability to other emotion tasks such as sentiment analysis and basic emotion recognition. Experimental results demonstrate that AffectGPT-RL yields significant performance improvements on OV-MER. Beyond this task, we also achieve remarkable performance gains on basic emotion recognition, attaining state-of-the-art results on MER-UniBench. To the best of our knowledge, this is the pioneering work exploring the role of reinforcement learning in OV-MER, providing valuable guidance for subsequent researchers. Our code is provided in the supplementary material and will be released to facilitate future research.
Authors:Sebastian Maier, Moritz Gunzenhäuser, Jonas Schweisthal, Manuel Schneider, Stefan Feuerriegel
Abstract:
Generative artificial intelligence (GenAI) is increasingly used for programming, yet it remains unclear when and where GenAI tools lead to productivity gains. Evidence on the effects of GenAI on the long-term development of programming skills is similarly mixed. Here, we present a meta-analysis of $n = 23$ studies reporting $k = 27$ effect sizes to quantify the effect of GenAI-powered coding assistants on productivity and learning. We systematically searched (i) ACM, (ii) arXiv, (iii) Scopus, and (iv) Web of Science for studies published between 2019 and 2025. Studies were required to compare GenAI-assisted with unassisted programming using quantitative measures of (1) productivity (i.e., task completion time, commits, and lines of code) and (2) learning (i.e., exam performance). We assessed the risk of bias using RoB2 and ROBINS-I and compared standardized effect sizes using Hedges' $g$. We find a statistically significant, but moderate positive effect of GenAI assistance on developer productivity ($g = 0.33$, $95\%$ CI: $[0.09, 0.58]$), yet with substantial heterogeneity across settings. Notably, productivity gains tend to be larger in controlled experimental settings, while effects are smaller in open-source and enterprise contexts. In contrast, we find no statistically significant effect of GenAI assistance on learning outcomes ($g = 0.14$, $95\%$ CI: $[-0.18, 0.47]$). Overall, these results highlight that GenAI coding assistants can increase developer productivity, although these gains depend strongly on context. In educational settings, however, the use of GenAI does not consistently translate into improved learning or skill development, which highlights the need for careful integration of GenAI into computer science education.
Authors:Jian Sun, Xiyan Jiang, Xiaocong Zhao, Jie Wang, Peng Hang, Zirui Li
Abstract:
Human drivers' control quality in the first seconds after a handover is critical to shared-driving safety; potentially unsafe steering or pedal inputs therefore require detection and correction by the automated vehicle's safety-fallback system. Yet performance in this window is vulnerable because cognitive states fluctuate rapidly, causing purely rationality-driven, cognition-unaware models to miss early control dynamics. We present an interpretable driver model grounded in bounded rationality with online adaptation that predicts early-stage control quality. We encode boundedness by embedding cognitive constraints in reinforcement learning and adapt latent cognitive parameters in real time via particle filtering from observations of driver actions. In a vehicle-in-the-loop study (n=41), we evaluated predictive performance and physiological validity. The adaptive model not only anticipated hazardous takeovers with higher coverage and longer lead times than non-adaptive baselines but also demonstrated strong alignment between inferred cognitive parameters and real-time eye-tracking metrics. These results confirm that the model captures genuine fluctuations in driver risk perception, enabling timely and cognitively grounded assistance.
Authors:Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy, Marcelo Becker, Thiago Boaventura, Gustavo J. G. Lahr
Abstract:
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Authors:Jianwen Sun, Yukang Feng, Kaining Ying, Chuanhao Li, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Yifan Chang, Yu Dai, Yifei Huang, Kaipeng Zhang
Abstract:
Large Language Models (LLMs) motivate generative agent simulation (e.g., AI Town) to create a ``dynamic world'', holding immense value across entertainment and research. However, for non-experts, especially those without programming skills, it isn't easy to customize a visualizable environment by themselves. In this paper, we introduce World Craft, an agentic world creation framework to create an executable and visualizable AI Town via user textual descriptions. It consists of two main modules, World Scaffold and World Guild. World Scaffold is a structured and concise standardization to develop interactive game scenes, serving as an efficient scaffolding for LLMs to customize an executable AI Town-like environment. World Guild is a multi-agent framework to progressively analyze users' intents from rough descriptions, and synthesizes required structured contents (\eg environment layout and assets) for World Scaffold . Moreover, we construct a high-quality error-correction dataset via reverse engineering to enhance spatial knowledge and improve the stability and controllability of layout generation, while reporting multi-dimensional evaluation metrics for further analysis. Extensive experiments demonstrate that our framework significantly outperforms existing commercial code agents (Cursor and Antigravity) and LLMs (Qwen3 and Gemini-3-Pro). in scene construction and narrative intent conveyance, providing a scalable solution for the democratization of environment creation.
Authors:Zizhen Li, Chuanhao Li, Yibin Wang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Fanrui Zhang, Mingzhu Sun, Yifei Huang, Kaipeng Zhang
Abstract:
Recent advancements have expanded the role of Large Language Models in board games from playing agents to creative co-designers. However, a critical gap remains: current systems lack the capacity to offer constructive critique grounded in the emergent user experience. Bridging this gap is fundamental for harmonizing Human-AI collaboration, as it empowers designers to refine their creations via external perspectives while steering models away from biased or unpredictable outcomes. Automating critique for board games presents two challenges: inferring the latent dynamics connecting rules to gameplay without an explicit engine, and modeling the subjective heterogeneity of diverse player groups. To address these, we curate a dataset of 1,727 structurally corrected rulebooks and 150K reviews selected via quality scoring and facet-aware sampling. We augment this data with Mechanics-Dynamics-Aesthetics (MDA) reasoning to explicitly bridge the causal gap between written rules and player experience. We further distill player personas and introduce MeepleLM, a specialized model that internalizes persona-specific reasoning patterns to accurately simulate the subjective feedback of diverse player archetypes. Experiments demonstrate that MeepleLM significantly outperforms latest commercial models (e.g., GPT-5.1, Gemini3-Pro) in community alignment and critique quality, achieving a 70% preference rate in user studies assessing utility. MeepleLM serves as a reliable virtual playtester for general interactive systems, marking a pivotal step towards audience-aligned, experience-aware Human-AI collaboration.
Authors:Xueyang Wang, Kewen Peng, Xin Yi, Hewu Li
Abstract:
Camera glasses create fundamental privacy tensions between wearers seeking recording functionality and bystanders concerned about unauthorized surveillance. We present a systematic multi-stakeholder evaluation of privacy mechanisms through surveys (N=525) and paired interviews (N=20) in China. Study 1 quantifies expectation-willingness gaps: bystanders consistently demand stronger information transparency and protective measures than wearers will provide, with disparities intensifying in sensitive contexts where 65-90% of bystanders would take defensive action. Study 2 evaluates twelve privacy-enhancing technologies, revealing four fundamental trade-offs that undermine current approaches: visibility versus disruption, empowerment versus burden, protection versus agency, and accountability versus exposure. These gaps reflect structural incompatibilities rather than inadequate goodwill, with context emerging as the primary determinant of privacy acceptability. We propose context-adaptive pathways that dynamically adjust protection strategies: minimal-friction visibility in public spaces, structured negotiation in semi-public environments, and automatic protection in sensitive contexts. Our findings contribute a diagnostic framework for evaluating privacy mechanisms and implications for context-aware design in ubiquitous sensing.
Authors:Xueyang Wang, Qinxuan Cen, Weitao Bi, Yunxiang Ma, Xin Yi, Robert Xiao, Xinyi Fu, Hewu Li
Abstract:
We present Roomify, a spatially-grounded transformation system that generates themed virtual environments anchored to users' physical rooms while maintaining spatial structure and functional semantics. Current VR approaches face a fundamental trade-off: full immersion sacrifices spatial awareness, while passthrough solutions break presence. Roomify addresses this through spatially-grounded transformation - treating physical spaces as "spatial containers" that preserve key functional and geometric properties of furniture while enabling radical stylistic changes. Our pipeline combines in-situ 3D scene understanding, AI-driven spatial reasoning, and style-aware generation to create personalized virtual environments grounded in physical reality. We introduce a cross-reality authoring tool enabling fine-grained user control through MR editing and VR preview workflows. Two user studies validate our approach: one with 18 VR users demonstrates a 63% improvement in presence over passthrough and 26% over fully virtual baselines while maintaining spatial awareness; another with 8 design professionals confirms the system's creative expressiveness (scene quality: 5.95/7; creativity support: 6.08/7) and professional workflow value across diverse environments.
Authors:Xiaofeng Luo, Jiayi He, Jiawen Kang, Ruichen Zhang, Zhaoshui He, Ekram Hossain, Dong In Kim
Abstract:
The emergence of 6G-enabled vehicular metaverses enables Autonomous Vehicles (AVs) to operate across physical and virtual spaces through space-air-ground-sea integrated networks. The AVs can deploy AI agents powered by large AI models as personalized assistants, on edge servers to support intelligent driving decision making and enhanced on-board experiences. However, such cross-reality interactions may cause serious location privacy risks, as adversaries can infer AV trajectories by correlating the location reported when AVs request LBS in reality with the location of the edge servers on which their corresponding AI agents are deployed in virtuality. To address this challenge, we design a cross-reality location privacy protection framework based on hybrid actions, including continuous location perturbation in reality and discrete privacy-aware AI agent migration in virtuality. In this framework, a new privacy metric, termed cross-reality location entropy, is proposed to effectively quantify the privacy levels of AVs. Based on this metric, we formulate an optimization problem to optimize the hybrid action, focusing on achieving a balance between location protection, service latency reduction, and quality of service maintenance. To solve the complex mixed-integer problem, we develop a novel LLM-enhanced Hybrid Diffusion Proximal Policy Optimization (LHDPPO) algorithm, which integrates LLM-driven informative reward design to enhance environment understanding with double Generative Diffusion Models-based policy exploration to handle high-dimensional action spaces, thereby enabling reliable determination of optimal hybrid actions. Extensive experiments on real-world datasets demonstrate that the proposed framework effectively mitigates cross-reality location privacy leakage for AVs while maintaining strong user immersion within 6G-enabled vehicular metaverse scenarios.
Authors:Seokweon Jung, Jeongmin Rhee, Seoyoung Doh, Hyeon Jeon, Ghulam Jilani Quadri, Jinwook Seo
Abstract:
Comparing graphs to identify similarities is a fundamental task in visual analytics of graph data. To support this, visual analytics systems frequently employ quantitative computational measures to provide automated guidance. However, it remains unclear how well these measures align with subjective human visual perception, thereby offering recommendations that conflict with analysts' intuitive judgments, potentially leading to confusion rather than reducing cognitive load. Multimodal Large Language Models (MLLMs), capable of visually interpreting graphs and explaining their reasoning in natural language, have emerged as a potential alternative to address this challenge. This paper bridges the gap between human and machine assessment of graph similarity through three interconnected experiments using a dataset of 1,881 node-link diagrams. Experiment 1 collects relative similarity judgments and rationales from 32 human participants, revealing consensus on graph similarity while prioritizing global shapes and edge densities over exact topological details. Experiment 2 benchmarks 16 computational measures against these human judgments, identifying Portrait divergence as the best-performing metric, though with only moderate alignment. Experiment 3 evaluates the potential of three state-of-the-art MLLMs (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5) as perceptual proxies. The results demonstrate that MLLMs, particularly GPT-5, significantly outperform traditional measures in aligning with human graph similarity perception and provide interpretable rationales for their decisions, whereas Claude Sonnet 4.5 shows the best computational efficiency. Our findings suggest that MLLMs hold significant promise not only as effective, explainable proxies for human perception but also as intelligent guides that can uncover subtle nuances that might be overlooked by human analysts in visual analytics systems.
Authors:Suyeon Hwang, Minkyu Kweon, Jeongmin Rhee, Soohyun Lee, Seokhyeon Park, Seokweon Jung, Hyeon Jeon, Jinwook Seo
Abstract:
Maintaining and refactoring React web applications is challenging, as React code often becomes complex due to its core API called Hooks. For example, Hooks often lead developers to create complex dependencies among components, making code behavior unpredictable and reducing maintainability, i.e., anti-patterns. To address this challenge, we present HookLens, an interactive visual analytics system that helps developers understand howHooks define dependencies and data flows between components. Informed by an iterative design process with experienced React developers, HookLens supports users to efficiently understand the structure and dependencies between components and to identify anti-patterns. A quantitative user study with 12 React developers demonstrates that HookLens significantly improves participants' accuracy in detecting anti-patterns compared to conventional code editors. Moreover, a comparative study with state-of-the-art LLM-based coding assistants confirms that these improvements even surpass the capabilities of such coding assistants on the same task.
Authors:Xinyue Gui, Ding Xia, Mark Colley, Yuan Li, Vishal Chauhan, Anubhav Anubhav, Zhongyi Zhou, Ehsan Javanmardi, Stela Hanbyeol Seo, Chia-Ming Chang, Manabu Tsukada, Takeo Igarashi
Abstract:
Field studies are irreplaceable but costly, time-consuming, and error-prone, which need careful preparation. Inspired by rapid-prototyping in manufacturing, we propose a fast, low-cost evaluation method using Vision-Language Model (VLM) personas to simulate outcomes comparable to field results. While LLMs show human-like reasoning and language capabilities, autonomous vehicle (AV)-pedestrian interaction requires spatial awareness, emotional empathy, and behavioral generation. This raises our research question: To what extent can VLM personas mimic human responses in field studies? We conducted parallel studies: 1) one real-world study with 20 participants, and 2) one video-study using 20 VLM personas, both on a street-crossing task. We compared their responses and interviewed five HCI researchers on potential applications. Results show that VLM personas mimic human response patterns (e.g., average crossing times of 5.25 s vs. 5.07 s) lack the behavioral variability and depth. They show promise for formative studies, field study preparation, and human data augmentation.
Authors:Alexandra Irger, Ella Hugie, Minghao Guo, Simon Warchol, Kenneth Moreland, David Pugmire, Wojciech Matusik, Hanspeter Pfister
Abstract:
Visualization is central to scientific discovery, yet authoring tools remain split between information and scientific visualization, and expertise in one rarely transfers to the other. Large Language Model (LLM) based systems promise to bridge this gap through natural language, but current approaches generate code non-deterministically, with no guarantee of correctness and no protection against silent data fabrication. We present Raiven, a conversational system that mediates visualization authoring through a formally defined domain-specific language. RaivenDSL unifies scientific and information visualization in a single representation spanning 2D, 3D, and tabular data. The LLM produces a compact RaivenDSL specification under schema-guided constraints, and a deterministic compiler translates it to executable D3 or VTK.js code. Because the LLM operates only on dataset metadata, outputs are deterministic, specifications are verifiable before execution, and data fabrication is impossible by construction. In a 100-task benchmark, Raiven achieves 100% compilation, is up to six times faster and six times cheaper than state-of-the-art LLMs, while improving interaction quality, correctness, and data faithfulness. An expert user study shows that Raiven significantly reduces debugging effort and makes it easier to produce correct visualizations.
Authors:Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong, Xiaofeng Wang
Abstract:
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users. While extensive research focuses on agent-centric threats, human susceptibility to deception by a compromised agent remains unexplored. We present the first large-scale empirical study with 303 participants to measure human susceptibility to AMD. This is based on HAT-Lab (Human-Agent Trust Laboratory), a high-fidelity research platform we develop, featuring nine carefully crafted scenarios spanning everyday and professional domains (e.g., healthcare, software development, human resources). Our 10 key findings reveal significant vulnerabilities and provide future defense perspectives. Specifically, only 8.6% of participants perceive AMD attacks, while domain experts show increased susceptibility in certain scenarios. We identify six cognitive failure modes in users and find that their risk awareness often fails to translate to protective behavior. The defense analysis reveals that effective warnings should interrupt workflows with low verification costs. With experiential learning based on HAT-Lab, over 90% of users who perceive risks report increased caution against AMD. This work provides empirical evidence and a platform for human-centric agent security research.
Authors:Yixuan Ding, Wei Huang, Ruijie Quan, Xiaojuan Qi, Yi Yang
Abstract:
Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.
Authors:Charvi Rastogi, Mukul Bhutani, Minsuk Kahng, Shamsuddeen Hassan Muhammad, Evgeniia Razumovskaia, Priyanka Suresh, Ibrahim Said Ahmad, Charu Kalia, Yaaseen Mahomed, Madhurima Maji, Minjae Lee, Alicia Parrish, Jessica Quaye, Vijay Janapa Reddi, Aishwarya Verma, Lora Aroyo
Abstract:
Despite the global deployment of text-to-image (T2I) models, their safety frameworks are largely calibrated to a Western-centric default, creating significant vulnerabilities for the rest of the world. To embrace cultural pluralism and bring historically under-represented perspectives in T2I safety, we conduct localised community-centered red teaming studies in the Global South. Our two-fold approach prioritizes localization and participation, by focusing on secondary urban centers in these regions, and conducting community engagement and training workshops to contextualize local norms. As a result, we present PLACES, a dataset comprising over 26,000 examples of T2I model failures collected in partnership with universities in Ghana, Nigeria, and two regions of India (Karnataka and Punjab). Analysis of prompts collected reveals a wide-ranging diversity in socio-cultural and linguistic attributes, when compared to existing geography-agnostic crowdsourced red-teaming data. We observe unique adversarial patterns enabled by local cultural and linguistic nuances, and distinct clusters within region around specific themes, such as religion in India. Moreover, we uncover structural contextual gaps in existing safety frameworks by identifying novel harms showing normative dissonance (e.g., violating religious norms, ignoring local customs, and ominous symbolism). This work argues that expanding T2I safety requires moving beyond mere scale to incorporate deeply localised, participatory methodologies for data collection and contextualization. Content warning: This paper includes examples containing potentially harmful or offensive content.
Authors:Mohammed Abraar, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Abstract:
The exponential growth of AI education has brought millions of learners to online platforms, yet this massive scale has simultaneously exposed critical pedagogical shortcomings. Traditional video-based instruction, while cost-effective and scalable, demonstrates systematic failures in both sustaining learner engagement and facilitating the deep conceptual mastery essential for AI literacy. We present a pilot study evaluating a novel hybrid learning platform that integrates real-time conversational AI tutors with traditional video lectures. Our controlled experiment (N = 58, mean age M = 21.4, SD = 2.8) compared traditional video-based instruction with our AI-augmented video platform. This study employed a sequential within-subjects design where all participants first completed the traditional video condition followed by the AI-augmented condition, providing direct comparisons of learning outcomes. We measured learning effectiveness through immediate post-tests and delayed retention assessments (2-week delay). Results suggest improvements in learning performance: immediate post-test performance showed a large effect size (d = 1.505) with participants scoring 8.3 points higher after AI-augmented instruction (91.8 vs 83.5 out of 100, p < .001). Behavioral analytics revealed increased engagement duration (71.1% improvement with AI tutoring) in the experimental group. This pilot study provides preliminary evidence that conversational AI tutors may enhance traditional educational delivery, suggesting a potential avenue for developing scalable, adaptive learning systems.
Authors:Jie Cao, Chloe Qianhui Zhao, Christian Schunn, Elizabeth A. McLaughlin, Jionghao Lin, Kenneth R. Koedinger
Abstract:
Feedback is essential for learning, but its effectiveness relies heavily on how well it engages students in the educational process. Generative AI offers novel opportunities to efficiently produce rich, formative feedback, ranging from direct explanations to incrementally sequenced scaffolding designed to promote learner autonomy. Despite these capabilities, it is still unclear whether sequenced (layered) AI feedback -- which provides encouragement and hints before revealing the correct answer -- genuinely enhances engagement and learning outcomes. To investigate this, we randomly assigned 199 participants to receive either sequenced or non-sequenced AI-generated feedback. We evaluated its impact on learning performance, cognitive and behavioral engagement, and affective perceptions to understand how these factors mediate overall learning outcomes. Results show that sequenced feedback elicited slightly higher behavioral engagement and, as anticipated, was perceived as more encouraging and supportive of student independence. Concurrently, however, it induced a higher level of mental effort. Mediation analyses identified a positive affective pathway driven by perceived encouragement, which was completely counteracted by a negative behavioral pathway associated with the average number of tasks requiring three or more submissions; the cognitive pathway (mental effort) remained non-significant. Overall, sequenced feedback led to significantly poorer learning outcomes when compared to direct, non-sequenced feedback. These findings highlight a crucial trade-off: although sequenced AI scaffolding boosts engagement and positive user perceptions, it can have a detrimental effect on actual learning performance. By integrating analyses of outcomes, perceptions, and underlying mechanisms, this study provides nuanced insights for designing automated, AI-driven feedback systems.
Authors:Chloe Qianhui Zhao, Jie Cao, Jionghao Lin, Kenneth R. Koedinger
Abstract:
Providing timely, targeted, and multimodal feedback helps students quickly correct errors, build deep understanding and stay motivated, yet making it at scale remains a challenge. This study introduces a real-time AI-facilitated multimodal feedback system that integrates structured textual explanations with dynamic multimedia resources, including the retrieved most relevant slide page references and streaming AI audio narration. In an online crowdsourcing experiment, we compared this system against fixed business-as-usual feedback by educators across three dimensions: (1) learning effectiveness, (2) learner engagement, (3) perceived feedback quality and value. Results showed that AI multimodal feedback achieved learning gains equivalent to original educator feedback while significantly outperforming it on perceived clarity, specificity, conciseness, motivation, satisfaction, and reducing cognitive load, with comparable correctness, trust, and acceptance. Process logs revealed distinct engagement patterns: for multiple-choice questions, educator feedback encouraged more submissions; for open-ended questions, AI-facilitated targeted suggestions lowered revision barriers and promoted iterative improvement. These findings highlight the potential of AI multimodal feedback to provide scalable, real-time, and context-aware support that both reduces instructor workload and enhances student experience.
Authors:Adarsh Pawar, Yuqiao Meng, Luoxi Tang, Zhaohan Xi
Abstract:
The Fast Healthcare Interoperability Resources (FHIR) standard has emerged as a widely adopted specification for exchanging structured clinical data across healthcare systems. However, raw FHIR resources are often complex, verbose, and difficult for clinicians and analysts to interpret without specialized tooling. This paper presents a lightweight, browser-based system that improves the accessibility of FHIR data by automatically transforming raw JSON resources into human-readable PDF and Excel reports, along with interactive data visualizations. The system supports both remote retrieval of FHIR resources from server endpoints and the upload of local FHIR JSON files, enabling both online and offline analysis. Using a modular React architecture with jsPDF, xlsx, and Recharts, the tool parses, normalizes, visualizes, and exports FHIR data in an intuitive format. Evaluation results demonstrate that the system enhances interpretability and usability while preserving the semantic integrity of FHIR structures. Limitations and future extensions, including expanded FHIR profile support and clinical validation, are discussed.
Authors:Junhui Gao, Yan Pan, Qianru Wang, Wenzhe Hou, Yiqin Deng, Liangliang Jiang, Yuguang Fang
Abstract:
Instant delivery, shipping items before critical deadlines, is essential in daily life. While multiple delivery agents, such as couriers, Unmanned Aerial Vehicles (UAVs), and crowdsourced agents, have been widely employed, each of them faces inherent limitations (e.g., low efficiency/labor shortages, flight control, and dynamic capabilities, respectively), preventing them from meeting the surging demands alone. This paper proposes TriDeliver, the first hierarchical cooperative framework, integrating human couriers, UAVs, and crowdsourced ground vehicles (GVs) for efficient instant delivery. To obtain the initial scheduling knowledge for GVs and UAVs as well as improve the cooperative delivery performance, we design a Transfer Learning (TL)-based algorithm to extract delivery knowledge from couriers' behavioral history and transfer their knowledge to UAVs and GVs with fine-tunings, which is then used to dispatch parcels for efficient delivery. Evaluated on one-month real-world trajectory and delivery datasets, it has been demonstrated that 1) by integrating couriers, UAVs, and crowdsourced GVs, TriDeliver reduces the delivery cost by $65.8\%$ versus state-of-the-art cooperative delivery by UAVs and couriers; 2) TriDeliver achieves further improvements in terms of delivery time ($-17.7\%$), delivery cost ($-9.8\%$), and impacts on original tasks of crowdsourced GVs ($-43.6\%$), even with the representation of the transferred knowledge by simple neural networks, respectively.
Authors:Wanghao Ye, Sihan Chen, Yiting Wang, Shwai He, Bowei Tian, Guoheng Sun, Ziyi Wang, Ziyao Wang, Yexiao He, Zheyu Shen, Meng Liu, Yuning Zhang, Meng Feng, Yifei Dong, Yanhong Qian, Yang Wang, Siyuan Peng, Yilong Dai, Zhenle Duan, Joshua Liu, Lang Xiong, Hanzhang Qin, Ang Li
Abstract:
We present an LLM-powered social discovery platform that uses digital twins to autonomously evaluate interpersonal compatibility through behavioral simulation. The platform unifies three key pillars: (1) digital twins that engage in autonomous multi-turn conversations on behalf of users to estimate compatibility, (2) gamified territory conquest mechanics that incentivize real-world exploration and create organic settings for in-person encounters, and (3) AI companions that preserve persistent shared memory across devices. Built upon CogniPair's cognitive architecture (Ye et al., 2026), validated on the Columbia Speed Dating dataset (551 participants), our system extends prior simulation-only matching into a fully deployed social discovery environment. Through deployment, we derive empirical cost-quality baselines and identify fundamental scaling bottlenecks that remain hidden in component-level testing alone.
Authors:Junfeng Jiao, Abhejay Murali, Saleh Afroogh
Abstract:
Affective alignment in generative AI represents a systemic risk to the developmental autonomy of younger users. Although emotional mirroring is commonly seen as a hallmark of advanced human-machine interaction, it can also manifest as affective sycophancy, reinforcing a user's immediate emotional state. By providing a sense of objectivity to transient anxieties, these systems diminish the cognitive friction necessary for independent emotional management and critical thought. Reward models driven by RLHF could heighten this dilemma by embedding adult-focused definitions of helpfulness, unintentionally promoting emotional dependency in younger users rather than facilitating cognitive reappraisal. This paper exposes the misalignment between adult-labeled reward signals and the developmental requirements of younger users, proposing stoic architectures that emphasize functional neutrality to preserve user autonomy.
Authors:Boyu Qiao, Yunman Chen, Kun Li, Wei Zhou, Songlin Hu, Yunya Song
Abstract:
Social bots increasingly infiltrate online platforms through sophisticated disguises, threatening healthy information ecosystems. Existing detection methods often rely on modality specific cues or local contextual features, making them brittle when modalities are missing or inputs are incomplete. Moreover, most approaches assume similar train test distributions, which limits their robustness to out of distribution (OOD) samples and emerging bot types. To address these challenges, we propose Multi Granularity Summarization and Domain Invariant Learning (MGDIL), a unified framework for robust social bot detection under domain shift. MGDIL first transforms heterogeneous signals into unified textual representations through LLM based multi granularity summarization. Building on these representations, we design a collaborative optimization framework that integrates task oriented LLM instruction tuning with domain invariant representation learning. Specifically, task oriented instruction tuning enhances the LLMs ability to capture subtle semantic cues and implicit camouflage patterns, while domain adversarial learning and cross domain contrastive learning are jointly employed to mitigate distribution shifts across datasets and time periods. Through this joint optimization, MGDIL learns stable and discriminative domain invariant features, improving cross domain social bot detection through better distribution alignment, stronger intra class compactness, and clearer inter class separation.
Authors:Jiongchi Yu, Xiaolin Wen, Sizhe Cheng, Xiaofei Xie, Qiang Hu, Yong Wang
Abstract:
Fuzz testing is one of the most effective techniques for detecting bugs and vulnerabilities in software. However, as the basis of fuzz testing, automated heuristics often fail to uncover deep or complex vulnerabilities. As a result, the performance of fuzz testing remains limited. One promising way to address this limitation is to integrate human expert guidance into the paradigm of fuzz testing. Even though some works have been proposed in this direction, there is still a lack of a systematic research roadmap for combining Human-in-the-Loop (HITL) and fuzz testing, hindering the potential for further enhancing fuzzing effectiveness. To bridge this gap, this paper outlines a forward-looking research roadmap for HITL for fuzz testing. Specifically, we highlight the promise of visualization techniques for interpretable fuzzing processes, as well as on-the-fly interventions that enable experts to guide fuzzing toward hard-to-reach program behaviors. Moreover, the rise of Large Language Models (LLMs) introduces new opportunities and challenges, raising questions about how humans can efficiently provide actionable knowledge, how expert meta-knowledge can be leveraged, and what roles humans should play in the intelligent fuzzing loop with LLMs. To address these questions, we survey existing work on HITL fuzz testing and propose a research agenda emphasizing future opportunities in (1) human monitoring, (2) human steering, and (3) human-LLM collaboration. We call for a paradigm shift toward interactive, human-guided fuzzing systems that integrate expert insight with AI-powered automation in the next-generation fuzzing ecosystem.
Authors:Yuxi Ma, Yongqian Peng, Fengyuan Yang, Siyu Zha, Chi Zhang, Zixia Jia, Zilong Zheng, Yixin Zhu
Abstract:
Large Language Models show promise for AI-assisted storytelling, yet current tools often generate predictable, unoriginal narratives. To address this limitation, we present NarrativeLoom, a multi-persona co-creative system grounded in Campbell's Blind Variation and Selective Retention theory. NarrativeLoom deploys specialized AI personas to generate diverse narrative options (blind variation), while users act as creative directors to select and refine them (selective retention). We designed a controlled study with 50 participants and found that stories co-authored with NarrativeLoom were not only perceived by users as more novel and diverse but were also objectively rated by experts as significantly better across all Torrance Test creativity dimensions: fluency, flexibility, originality, and elaboration. Stories are significantly longer with richer settings and more dialogue. Writing expertise emerged as a moderator: novices benefited more from structured scaffolding. This demonstrates the value of theory-informed co-creative systems and the importance of adapting them to varying user expertise.
Authors:Cathy Mengying Fang, Sheer Karny, Chayapatr Archiwaranguprok, Yasith Samaradivakara, Pat Pataranutaporn, Pattie Maes
Abstract:
Alignment research on large language models (LLMs) increasingly depends on understanding how these systems are used in everyday contexts. yet naturalistic interaction data is difficult to access due to privacy constraints and platform control. We present AI-Wrapped, a prototype workflow for collecting naturalistic LLM usage data while providing participants with an immediate ``wrapped''-style report on their usage statistics, top topics, and safety-relevant behavioral patterns. We report findings from an initial deployment with 82 U.S.-based adults across 48,495 conversations from their 2025 histories. Participants used LLMs for both instrumental and reflective purposes, including creative work, professional tasks, and emotional or existential themes. Some usage patterns were consistent with potential over-reliance or perfectionistic refinement, while heavier users showed comparatively more reflective exchanges than primarily transactional ones. Methodologically, even with zero data retention and PII removal, participants may remain hesitant to share chat data due to perceived privacy and judgment risks, underscoring the importance of trust, agency, and transparent design when building measurement infrastructure for alignment research.
Authors:Theofanis P. Raptis, Chiara Boldrini, Marco Conti, Andrea Passarella
Abstract:
The Metaverse is redefining digital interactions by merging physical, virtual, and social dimensions, yet its effects on social networking remain largely unexplored. This work examines the role of independent avatars (autonomous digital entities capable of managing social interactions on behalf of users), to optimize social time allocation and reshape Metaverse-based Online Social Networks. We propose a novel computational model that integrates a quantitative and realistic representation of user social life, grounded in evolutionary anthropology, with a framework for avatar-mediated interactions. Our model quantifies the effectiveness of a partial replacement of in-person interactions with independent avatar interactions. Additionally, it accounts for social conflicts and specific socialization constraints. We leverage our model to explore the benefits and trade-offs of an avatar-augmented social life in the Metaverse. Since the exact problem formulation leads to an NP-hard optimization problem when incorporating avatars into the social network, we tackle this challenge by introducing a heuristic solution. Through simulations, we compare avatar-mediated and non-avatar-mediated social networking, demonstrating the potential of independent avatars to enhance social connectivity and efficiency. Our findings provide a foundation for optimizing Metaverse-based social interactions, as well as useful insights for future digital social network design.
Authors:Lecheng Gong, Weimin Fang, Ting Yang, Dongjie Tao, Chunxiao Guo, Peng Wei, Bo Xie, Jinqun Guan, Zixiao Chen, Fang Shi, Jinjie Gu, Junwei Liu
Abstract:
Medical conversational AI (AI) plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks for assessing the information-gathering and diagnostic reasoning abilities of medical large language models (LLMs) have not been rigorously evaluated. To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi-turn diagnostic capabilities of LLM. Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real-world electronic health records, thereby mitigating privacy and data-governance concerns. We design a robust Patient Agent that is limited to a set of atomic medical facts and augmented with a dynamic guidance mechanism that continuously detects and corrects hallucinations throughout the dialogue, ensuring internal coherence and clinical plausibility of the simulated cases. Furthermore, we propose a structured LLM-based and expert-annotated rubric-generation pipeline that retrieves Evidence-Based Medicine (EBM) guidelines and utilizes the reject sampling to derive a prioritized set of rubric items ("must-ask" items) for each case. We perform a comprehensive evaluation of state-of-the-art models and demonstrate that, across multiple assessment dimensions, current models face substantial challenges. Our results indicate that improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base-model.
Authors:Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini
Abstract:
Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.
Authors:Ka Hei Carrie Lau, Philipp Stark, Efe Bozkir, Enkelejda Kasneci
Abstract:
Artificial intelligence is increasingly used in hiring, raising concerns about how applicants perceive these systems. While prior work on algorithmic fairness has emphasized technical bias mitigation, little is known about how avatar identity cues influence applicants' justice attributions in an interview context. We conducted a crowdsourcing study with 215 participants who completed an interview with photorealistic AI avatars varied in phenotypic traits (race and sex), followed by a standardized rejection. Using self-reports, sentiment analysis, and eye tracking, we measured perceptions of trust, fairness, and bias. Results show that racial mismatch heightened perceptions of ethnic bias, while partial match (sharing only one identity) reduced fairness judgments compared to both full and no match. This work extends the Computers-Are-Social-Actors paradigm by demonstrating that avatar appearances shape justicerelated evaluations of AI. We contribute to HCI by revealing how identity cues influence fairness attributions and offer actionable insights for designing equitable AI interview systems.
Authors:Anna Bodonhelyi, Mengdi Wang, Efe Bozkir, Babette Bühler, Enkelejda Kasneci
Abstract:
Since the COVID-19 pandemic, online courses have expanded access to education, yet the absence of direct instructor support challenges learners' ability to self-regulate attention and engagement. Mind wandering and disengagement can be detrimental to learning outcomes, making their automated detection via video-based indicators a promising approach for real-time learner support. However, machine learning-based approaches often require sharing sensitive data, raising privacy concerns. Federated learning offers a privacy-preserving alternative by enabling decentralized model training while also distributing computational load. We propose a framework exploiting cross-device federated learning to address different manifestations of behavioral and cognitive disengagement during remote learning, specifically behavioral disengagement, mind wandering, and boredom. We fit video-based cognitive disengagement detection models using facial expressions and gaze features. By adopting federated learning, we safeguard users' data privacy through privacy-by-design and introduce a novel solution with the potential for real-time learner support. We further address challenges posed by eyeglasses by incorporating related features, enhancing overall model performance. To validate the performance of our approach, we conduct extensive experiments on five datasets and benchmark multiple federated learning algorithms. Our results show great promise for privacy-preserving educational technologies promoting learner engagement.
Authors:Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, Takeo Igarashi
Abstract:
Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language model (VLM) perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived appropriateness of the action, and this feedback is used to iteratively revise the designer's outputs, enabling systematic refinement without human supervision. We evaluate our framework across three eHMI modalities (lightbar, eyes, and arm) and multiple LLM model sizes. Across settings, our framework consistently outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations. Results further indicate that the improvements generalize across modalities and that VLM evaluations are well aligned with human preferences, supporting the robustness and effectiveness of See2Refine for scalable action design.
Authors:Hasan Tarik Akbaba, Efe Bozkir, Anna Puhl, Süleyman Özdel, Enkelejda Kasneci
Abstract:
Extended Reality (XR) offers transformative potential for industrial support, training, and maintenance; yet, widespread adoption lags despite demonstrated occupational value and hardware maturity. Organizations successfully implement XR in isolated pilots, yet struggle to scale these into sustained operational deployment, a phenomenon we characterize as the ``Pilot Trap.'' This study examines this phenomenon through a qualitative ecosystem analysis of 17 expert interviews across technology providers, solution integrators, and industrial adopters. We identify a ``Great Inversion'' in adoption barriers: critical constraints have shifted from technological maturity to organizational readiness (e.g., change management, key performance indicator alignment, and political resistance). While hardware ergonomics and usability remain relevant, our findings indicate that systemic misalignments between stakeholder incentives are the primary cause of friction preventing enterprise integration. We conclude that successful industrial XR adoption requires a shift from technology-centric piloting to a problem-first, organizational transformation approach, necessitating explicit ecosystem-level coordination.
Authors:Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim
Abstract:
As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7\%/11.6\%/22.6\% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9\% and 28.6\%, respectively, and improves zero-shot captioning BERT-F1 by 18.8\%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.
Authors:Yunge Wen, Chenliang Huang, Hangyu Zhou, Zhuo Zeng, Chun Ming Louis Po, Julian Togelius, Timothy Merino, Sam Earle
Abstract:
Narrative archetypes (e.g., Hero's Journey, Three-act structure) provide universal story structures that resonate across cultures and media and are important for video game storytelling, yet existing LLM-based methods lack explicit use of these archetypes in procedurally generated games. We propose Forking Garden, a framework for narrative arc-conditioned gameplay planning that generates branching games from user-provided storylines. Our approach first generates a diverse pool of independent nodes, then assembles them into a dungeon graph via arc-guided constraint algorithms, where each node achieves multimodal alignment of gameplay elements. We develop an end-to-end interactive system that instantiates the framework.
Authors:Samuel Ferino, Rashina Hoda, John Grundy, Christoph Treude
Abstract:
How software developers interact with Artificial Intelligence (AI)-powered tools, including Large Language Models (LLMs), plays a vital role in how these AI-powered tools impact them. While overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills); underreliance might deprive software developers of potential gains in productivity and quality. Based on twenty-two interviews with software developers on using LLMs for software development, we propose a preliminary reliance-control framework where the level of control can be used as a way to identify AI overreliance and underreliance. We also use it to recommend future research to further explore the different control levels supported by the current and emergent LLM-driven tools. Our paper contributes to the emerging discourse on AI overreliance and provides an understanding of the appropriate degree of reliance as essential to developers making the most of these powerful technologies. Our findings can help practitioners, educators, and policymakers promote responsible and effective use of AI tools.
Authors:Yinghao Zhu, Dehao Sui, Zixiang Wang, Xuning Hu, Lei Gu, Yifan Qi, Tianchen Wu, Ling Wang, Yuan Wei, Wen Tang, Zhihan Cui, Yasha Wang, Lequan Yu, Ewen M Harrison, Junyi Gao, Liantao Ma
Abstract:
Clinician skepticism toward opaque AI hinders adoption in high-stakes healthcare. We present AICare, an interactive and interpretable AI copilot for collaborative clinical decision-making. By analyzing longitudinal electronic health records, AICare grounds dynamic risk predictions in scrutable visualizations and LLM-driven diagnostic recommendations. Through a within-subjects counterbalanced study with 16 clinicians across nephrology and obstetrics, we comprehensively evaluated AICare using objective measures (task completion time and error rate), subjective assessments (NASA-TLX, SUS, and confidence ratings), and semi-structured interviews. Our findings indicate AICare's reduced cognitive workload. Beyond performance metrics, qualitative analysis reveals that trust is actively constructed through verification, with interaction strategies diverging by expertise: junior clinicians used the system as cognitive scaffolding to structure their analysis, while experts engaged in adversarial verification to challenge the AI's logic. This work offers design implications for creating AI systems that function as transparent partners, accommodating diverse reasoning styles to augment rather than replace clinical judgment.
Authors:Stefano Scanzio, Paolo Campagnale, Pietro Chiavassa, Gianluca Cena
Abstract:
QR codes are nowadays customarily used for embedding static data such as web hyperlinks or plain text. The sQRy technology (executable QR codes) permits to embed executable programs in QR codes, enabling people to interact with them even without an internet connection. In this work we present QRmap, a specific dialect that permits the inclusion of geographic maps in sQRy and supports interaction with the user to provide indications to reach the destination of interest. The QRmap technology facilitates navigation in large industrial plants where internet connectivity is absent, due to either environmental limitations or company policies. The proposed technology can have interesting applications in non-industrial contexts as well.
Authors:Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude
Abstract:
Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. These decisions, whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy, carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary means by which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool, ranging from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls, measuring which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.
Authors:Chuang Peng, Wei Zhang, Renshuai Tao, Xinhao Zhang, Jian Yang
Abstract:
Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to reject plausible but incorrect elements in densely populated pages, and exhibit limited generalization to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%, validating that specialized data curriculum outweighs raw parameter scale for web navigation.
Authors:Zixin Chen, Haotian Li, Zhe Liu, Huamin Qu, Xing Xie
Abstract:
Large Language Models (LLMs) are increasingly used as learning companions, providing scaffolded explanations, hints, or step-by-step guidance. However, in current LLM-based learning scenarios, scaffolded content is primarily consumed passively, offering limited support for active learner engagement. Learning science research suggests that effective educational scaffolding depends not only on what support is provided, but also on how learners engage with it. In this work, we explore whether embedding lightweight interactive components into LLM-generated scaffolding responses can promote learning-oriented engagement and improve short-term learning outcomes. We evaluated this approach through a within-subjects laboratory study (N=8). Results provide initial evidence that interactive scaffolding increases learners' perceived engagement and attentional focus, while supporting short-term learning performance. We conclude with design implications for integrating interaction into LLM-generated scaffolding to support active learning engagement.
Authors:Leixian Shen, Yan Luo, Rui Sheng, Yujia He, Haotian Li, Leni Yang, Huamin Qu
Abstract:
Personalized feedback plays an important role in self-regulated learning (SRL), helping students track progress and refine their strategies. However, current common solutions, such as text-based reports or learning analytics dashboards, often suffer from poor interpretability, monotonous presentation, and limited explainability. To overcome these challenges, we present StoryLensEdu, a narrative-driven multi-agent system that automatically generates intuitive, engaging, and interactive learning reports. StoryLensEdu integrates three agents: a Data Analyst that extracts data insights based on a learning objective centered structure, a Teacher that ensures educational relevance and offers actionable suggestions, and a Storyteller that organizes these insights using the Heroes Journey narrative framework. StoryLensEdu supports post-generation interactive question answering to improve explainability and user engagement. We conducted a formative study in a real high school and iteratively developed StoryLensEdu in collaboration with an e-learning team to inform our design. Evaluation with real users shows that StoryLensEdu enhances engagement and promotes a deeper understanding of the learning process.
Authors:Yuying Tang, Jiayi Zhou, Haotian Li, Xing Xie, Xiaojuan Ma, Huamin Qu
Abstract:
Generative AI has greatly transformed creative work in various domains, such as screenwriting. To understand this transformation, prior research often focused on capturing a snapshot of human-AI co-creation practice at a specific moment, with less attention to how humans mobilize, regulate, and reflect to form the practice gradually. Motivated by Bandura's theory of human agency, we conducted a two-week study with 19 professional screenwriters to investigate how they embraced AI in their creation process. Our findings revealed that screenwriters not only mindfully planned, foresaw, and responded to AI usage, but, more importantly, through reflections on practice, they developed themselves and human-AI co-creation paradigms, such as cognition, strategies, and workflows. They also expressed various expectations for how future AI should better support their agency. Based on our findings, we conclude this paper with extensive discussion and actionable suggestions to screenwriters, tool developers, and researchers for sustainable human-AI co-creation.
Authors:Yuying Tang, Xinyi Chen, Haotian Li, Xing Xie, Xiaojuan Ma, Huamin Qu
Abstract:
AI has been increasingly integrated into screenwriting practice. In refinement, screenwriters expect AI to provide feedback that supports reflection across the internal perspective of characters and the external perspective of the overall story. However, existing AI tools cannot sufficiently coordinate the two perspectives to meet screenwriters' needs. To address this gap, we present DuoDrama, an AI system that generates feedback to assist screenwriters' reflection in refinement. To enable DuoDrama, based on performance theories and a formative study with nine professional screenwriters, we design the Experience-Grounded Feedback Generation Workflow for Human Reflection (ExReflect). In ExReflect, an AI agent adopts an experience role to generate experience and then shifts to an evaluation role to generate feedback based on the experience. A study with fourteen professional screenwriters shows that DuoDrama improves feedback quality and alignment and enhances the effectiveness, depth, and richness of reflection. We conclude by discussing broader implications and future directions.
Authors:Jai Lal Lulla, Seyedmoein Mohsenimofidi, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude
Abstract:
AI coding agents such as Codex and Claude Code are increasingly used to autonomously contribute to software repositories. However, little is known about how repository-level configuration artifacts affect operational efficiency of the agents. In this paper, we study the impact of AGENTS$.$md files on the runtime and token consumption of AI coding agents operating on GitHub pull requests. We analyze 10 repositories and 124 pull requests, executing agents under two conditions: with and without an AGENTS$.$md file. We measure wall-clock execution time and token usage during agent execution. Our results show that the presence of AGENTS$.$md is associated with a lower median runtime ($Δ28.64$%) and reduced output token consumption ($Δ16.58$%), while maintaining a comparable task completion behavior. Based on these results, we discuss immediate implications for the configuration and deployment of AI coding agents in practice, and outline a broader research agenda on the role of repository-level instructions in shaping the behavior, efficiency, and integration of AI coding agents in software development workflows.
Authors:Shi Qiu, Ruiyang Li, Qixuan Liu, Yuqi Tong, Yue Qiu, Yinqiao Wang, Yan Li, Chi-Wing Fu, Pheng-Ann Heng
Abstract:
We present a collaborative extended reality (XR) prototype for 3D surgical planning and visualization. Our system consists of three key modules: XR-based immersive surgical planning, cloud-based data management, and coordinated stereoscopic 3D displays for interactive visualization. We describe the overall workflow, core functionalities, implementations and setups. By conducting user studies on a liver resection surgical planning case, we demonstrate the effectiveness of our prototype and provide practical insights to inspire future advances in medical XR collaboration.
Authors:Chengbo He, Sheng Li, Chenyang Ma, Bochao Zou, Li Sun, Jiansheng Chen, Junliang Xing, Yuanchun Shi, Huimin Ma
Abstract:
Robotic assistants in long-term human-robot collaboration need to assist users under partial observations while leveraging cross-day interaction history. However, human traits and routines are often unknown at the beginning of collaboration, making passive infer-then-act assistance ineffective and inefficient. To address this challenge, we study a cross-day proactive asking setting for continual task assistance and propose PACT (Proactive Asking for Continual Task Assistance), an ask-or-act framework that determines whether clarification should be sought before taking action. PACT leverages current observations together with accumulated interaction history to evaluate contextual sufficiency, enabling the robot to provide more reliable assistance and progressively adapt to the user over time. We implement its primary learned instantiation using reinforcement learning and evaluate alternative instantiations under the same framework. To assess such behavior, we further introduce a clarification utility metric that quantifies the trade-off between assistance accuracy and the frequency of clarification requests. Experiments in multi-day embodied collaboration scenarios demonstrate that, compared with passive inference baselines, PACT consistently improves both assistance accuracy and clarification utility, highlighting the importance of proactive asking in continual human-robot collaboration.
Authors:Faisal Haque Bappy, Tahrim Hossain, Sidratul Muntaher Meheraj, Annoor Sharara Akhand, Tasfia Tabassum, Tarannum Shaila Zaman, Raiful Hasan, Tariqul Islam
Abstract:
AI coding assistants are now central to professional software development, yet their impact on how developers think about and practice security remains poorly understood. While prior work has documented vulnerability rates in AI-generated code, a more fundamental question persists: how do these tools transform security awareness in authentic, ongoing development practice? We conducted semi-structured interviews with 15 professional software engineers and observed them completing security-relevant coding tasks with AI assistance, spanning 3 experience cohorts defined by their relationship to AI tools during professional formation. We find that AI coding assistants reorganize rather than eliminate security thinking, shifting it from the act of writing code to the act of reviewing it. This transition from preventive to reactive security is structurally encouraged by interaction models that frame code generation as a functional task, leaving security as an afterthought. Notably, none of our coding session participants specified security requirements in their initial prompts, even when they possessed the relevant knowledge, revealing a decoupling of security awareness from security behavior. We further document informal coping strategies developers had independently invented to manage AI security risk, none of which are supported by current tools or organizations, and find that the experience cohort did not reliably predict security performance. This paper contributes a practice-grounded account of how AI-assisted development reshapes the human side of secure coding, offering empirical foundations for the design of more security-aware tools, training programs, and organizational policies.
Authors:Wesley Hanwen Deng, Mingxi Yan, Sunnie S. Y. Kim, Akshita Jha, Lauren Wilcox, Kenneth Holstein, Motahhare Eslami, Leon A. Gatys
Abstract:
Recent developments in AI safety research have called for red-teaming methods that effectively surface potential risks posed by generative AI models, with growing emphasis on how red-teamers' backgrounds and perspectives shape their strategies and the risks they uncover. While automated red-teaming approaches promise to complement human red-teaming through larger-scale exploration, existing automated approaches do not account for human identities and rarely incorporate human inputs. In this work, we explore persona-driven red-teaming to advance both automated red-teaming and human-AI collaboration. We first develop PersonaTeaming Workflow, which incorporates personas into the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. Compared to RainbowPlus, a state-of-the-art automated red-teaming method, PersonaTeaming Workflow achieves higher attack success rates while maintaining prompt diversity. However, since automated personas only approximate real human perspectives, we further instantiate PersonaTeaming Workflow as PersonaTeaming Playground, a user-facing interface that enables red-teamers to author their own personas and collaborate with AI to mutate and refine prompts. In a user study with 11 industry practitioners, we found that PersonaTeaming Playground enabled diverse red-teaming strategies and outputs that practitioners perceived as useful, and that AI-generated suggestions in the PersonaTeaming Playground encouraged out-of-the-box thinking even when practitioners did not follow them strictly. Together, our work advances both automated and human-in-the-loop approaches to red-teaming, while shedding light on interaction patterns and design insights for supporting human-AI collaboration in generative AI red-teaming.
Authors:Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg
Abstract:
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
Authors:Jingwei Shi, Shengyu Tao, Xinxiang Yin, Chen Huang, Wenqiang Lei, See-Kiong Ng
Abstract:
The application of games as a therapeutic tool for cognitive training is beneficial for patients with cognitive impairments. However, effective game design for individual patient is resource-intensive. To this end, we propose an LLM-powered method, \ours, for automated and personalized therapeutic game design. Inspired by the Dungeons & Dragons, LETGAMES generates an open-world interactive narrative game. It not only generates game scenarios and challenges that target specific cognitive domains, but also employs conversational strategies to offer guidance and companionship. To validate its efficacy, we pioneer a psychology-grounded evaluation protocol LETGAMESEVAL, establishing comprehensive metrics for rehabilitative assessment. Building upon this, our experimental results from both LLM-based assessors and human expert evaluations demonstrate the significant potential of our approach, positioning LETGAMES as a promising solution to the widespread need for more accessible and tailored cognitive training tools. Our code will be open-sourced upon acceptance.
Authors:Lingjun Zhao, Dayeon Ki, Marine Carpuat, Hal Daumé
Abstract:
Language models are known to exhibit various forms of cultural bias in decision-making tasks, yet much less is known about their degree of cultural familiarity in open-ended text generation tasks. In this paper, we introduce the task of culturally-adapted art description generation, where models describe artworks for audiences from different cultural groups who vary in their familiarity with the cultural symbols and narratives embedded in the artwork. To evaluate cultural competence in this pragmatic generation task, we propose a framework based on culturally grounded question answering. We find that base models are only marginally adequate for this task, but, through a pragmatic speaker model, we can improve simulated listener comprehension by up to 8.2%. A human study further confirms that the model with higher pragmatic competence is rated as more helpful for comprehension by 8.0%.
Authors:Eunkyu Park, Wesley Hanwen Deng, Cheyon Jin, Matheus Kunzler Maldaner, Jordan Wheeler, Jason I. Hong, Hong Shen, Adam Perer, Ken Holstein, Motahhare Eslami, Gunhee Kim
Abstract:
Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior works typically rely on binary or pairwise supervision, which often fail to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans using an interface we tailored for data collection, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.
Authors:Haomin Zhuang, Hanwen Xing, Xiangliang Zhang
Abstract:
Recent autonomous AI agents such as Codex, and Claude Code have made it increasingly practical for users to delegate complex tasks, including writing emails, executing code, issuing shell commands, and carrying out multi-step plans. However, despite these capabilities, human-agent interaction still largely happens through terminal interfaces or remote text-based channels such as Discord. These interaction modes are often inefficient and unfriendly: long text outputs are difficult to read and review, proposed actions lack clear structure and visual context, and users must express feedback by typing detailed corrections, which is cumbersome and often discourages effective collaboration. As a result, non-expert users in particular face a high barrier to working productively with agents. To address this gap, we present AgentClick, an interactive review layer for terminal-based agents. AgentClick is implemented as a localhost npm server paired with a skill-based plugin that connects the running agent to a browser interface, allowing users to supervise and collaborate with agents through a structured web UI rather than raw terminal text alone. The system supports a range of human-in-the-loop workflows, including email drafting and revision, plan review and modification, memory management, trajectory inspection and visualization, and error localization during agent execution. It also turns code generation and execution into a reviewable process, enabling users to inspect and intervene before consequential actions are taken. In addition, AgentClick supports persistent preference capture through editable memory and remote access over HTTP, allowing users to review agents running on servers from their personal devices. Our goal is to lower the barrier for non-expert users and improve the efficiency and quality of human-agent co-work.
Authors:Davide Frizzo, Fabrizio Genilotti, David Petrovic, Arianna Stropeni, Francesco Borsatti, Davide Dalle Pezze, Riccardo De Monte, Manuel Barusco, Gian Antonio Susto
Abstract:
Virtual Reality (VR) applications require robust user identification systems to ensure secure access to equipment and protect worker identities. Motion tracking data from VR headsets and controllers has emerged as a powerful behavioral biometric, with recent studies demonstrating identification accuracies exceeding 94% across a large user base. However, the application of modern deep learning architectures, particularly State Space Models (SSM), to VR scenarios remains largely unexplored. In this work, we benchmark user identification performance across the large-scale Who is Alyx VR dataset, gathering data from 71 users playing the popular Half-Life:Alyx game. We evaluate both established architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), Temporal Convolutional Network (TCN), Transformer) and the emerging SSMs on time series motion data. Our results provide the first comprehensive benchmark of state-of-the-art and novel architectures for VR user identification, establishing baseline performance metrics for future privacy preserving authentication systems in manufacturing environments.
Authors:Benjamin Hardin, Efimia Panagiotaki, Daniele De Martini, Lars Kunze
Abstract:
Although much is known about the physical danger of cycling situations, less is understood about the perceived danger of cycling. Furthermore, perception of danger may be filtered at a subconscious level and therefore difficult for one to self-report. To this end, these subconscious perceptions can be revealed through physiological metrics such as eye gaze. This paper explores the perceived safety of cycling in Oxford, United Kingdom and explores the ability of wearable eye tracking glasses to produce insights about the differences in perception under different environments and events. This paper finds that eye gaze patterns change between using bike lanes, car lanes and shared bus lanes, representing different cognitive challenges of each lane type. This paper presents that different intersections have significantly different eye gaze patterns which may have implications for cyclist stress. Finally, eye gaze patterns differ in the presence of events such as passes and pedestrians in the road compared to when cycling with no events. This paper draws conclusions on the benefits and limitations of using wearable eye trackers to estimate stress and cyclist workload.
Authors:Sunny Yu, Myra Cheng, Ahmad Jabbar, Ilia Sucholutsky, Katherine M. Collins, Dan Jurafsky, Robert D. Hawkins
Abstract:
Large language models (LLMs) have the potential to boost human productivity by speeding up task completion -- provided users know when to offload cognitive work to them. But we do not know if users are well-calibrated in estimating these potential time savings. We conducted a preregistered large-scale behavioral study (N = 1237) to characterize mismatches between expectations and reality, with a focus on simple cognitive tasks. While actual completion times between independent completion and AI-assisted completion did not differ, participants predicted AI to be significantly faster. The same bias was not observed when imagining help from another human participant. We identify a speedup illusion where people have accurate forecasts of independent completion times but significantly underestimate AI-assisted times. Additionally, time and effort dissociate: participants reported lower subjective effort with AI despite equivalent completion times. This suggests that completion time itself is not sufficient to characterize efficiency gains.
Authors:Sunny Yu, Myra Cheng, Ahmad Jabbar, Ilia Sucholutsky, Katherine M. Collins, Dan Jurafsky, Robert D. Hawkins
Abstract:
People are increasingly turning to AI assistance for simple tasks, e.g., arithmetic, spell-check, and answering simple questions. But does AI assistance actually save users time and effort? We investigate people's propensity to use AI for cognitively simple tasks and assess whether their reliance is well-calibrated. Across three pre-registered user studies (N = 2691), we find that people frequently choose to use AI even when doing so is inefficient (i.e. provides no meaningful time or effort savings). We identify systematic miscalibration at two levels: (1) a self-estimate miscalibration where people on average believe that they are using AI less than they actually are, and (2) efficiency-gain illusions where people overestimate how much time and effort savings AI use affords. We also identify a session-level carryover effect where a participant's prior AI use leads to further AI adoption and entrenches their miscalibration about time savings. Our results shed light on the mechanisms and biases underlying people's choice of whether to use AI as well as the risk of an overreliance feedback loop.
Authors:Yi Wang, Kexin Cheng, Xiao Liu, Chetan Arora, John Grundy, Thuong Hoang, Henry Been-Lirn Duh
Abstract:
Personas are a valuable tool for discussing accessibility requirements in software design and development practices. However, the use of personas for accessibility-focused requirements elicitation in VR projects remains limited and is accompanied by several challenges. To fill this gap, we developed an auto-generated persona system in a VR course, where the personas were used to facilitate discussions on accessibility requirements and to guide VR design and development. Our findings indicate that the auto-generated persona system enabled students to develop empathy more efficiently. This study demonstrates the use of automatically generated personas in VR course settings as a means of eliciting latent accessibility requirements.
Authors:Yi Wang, Zhengxin Zhang, Xiao Liu, Chetan Arora, John Grundy, Thuong Hoang
Abstract:
In this study, we propose a novel approach that supports requirements discussions in virtual environments by automatically generating personas from real-time speech-to-text data. In our pilot experiment, 18 participants (14 from universities and 4 from IT companies) used the generated personas to discuss accessibility requirements within the virtual environment. Participants reported a relatively high level of satisfaction with the social presence and usability of the VR system. We also found that requirements discussions based on personas have a lower workload. Finally, we outline the main directions for future work.
Authors:Yi Wang, Ben Cheng, Xiao Liu, Chetan Arora, John Grundy, Thuong Hoang
Abstract:
In this paper, we developed a virtual reality (VR) system that can simulate color blindness and weakness. We built an immersive 3D web view interface where participants can discuss accessibility requirements for a fitness website projects within a virtual fitness environment. We conducted a pilot experiment involving 24 participants from six software teams, who used both VR and non-VR methods to understand color blindness and weakness requirements in a website project. Our findings indicate that using VR can provide several benefits for requirements activities, such as an improved user experience and reduced workload.
Authors:Yangchen Yu, Qian Chen, Jia Li, Zhenzhen Hu, Jinpeng Hu, Lizi Liao, Erik Cambria, Richang Hong
Abstract:
Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-modal calibration, while severe conflicts arise from intrinsically contradictory (e.g., sarcasm) or misleading signals, for which forced fusion may amplify errors. Recognizing this, we propose Dual-Path Conflict Resolution (DCR), a unified framework that learns when to fuse and when to drop modalities. Path I (Affective Fusion Distiller, AFD) performs reverse distillation from audio/visual teachers to a textual student using temporally weighted class evidence, thereby enhancing representation-level calibration and improving fusion when alignment is beneficial. Path II (Affective Discernment Agent, ADA) formulates MER as a contextual bandit that selects among fusion and unimodal predictions based on a dual-view state and a calibration-aware reward, enabling decision-level arbitration under irreconcilable conflicts without requiring per-modality reliability labels. By taking into account the full multimodal context and coupling soft calibration with hard arbitration, DCR reconciles conflicts that can be aligned while bypassing misleading modalities when fusion is harmful. Across five benchmarks covering both dialogue-level and clip-level MER, DCR consistently outperforms competitive baselines or achieves highly competitive results. Further ablations, conflict-specific subset evaluation, and modality-selection analysis verify that AFD and ADA are complementary and jointly improve robust conflict-aware emotion recognition.
Authors:Eason Chen, Xinyi Tang, Yvonne Zhao, Meiyi Chen, Meryam Elmir, Elizabeth McLaughlin, Mingyu Yuan, Yumo Wang, Shyam Agarwal, Jared Cochrane, Jionghao Lin, Tongshuang Wu, Ken Koedinger
Abstract:
We conducted a between-subjects experiment (N=92) comparing three conditions in a calculus learning environment: no self-explanation (control), menu-based self-explanation, and open-ended self-explanation with LLM-generated feedback. All conditions showed positive learning gains within a fixed 60-minute practice session, with no significant between-condition differences in post-test performance. On transfer questions, the open-ended condition produced significantly higher-quality explanations than control on "Not Enough Information" (NEI) problems ($β$=+11.9 percentage points, $p$=.030), though the corresponding NEI multiple-choice accuracy advantage was not significant ($p$=.183). Moreover, across all post-test open-ended explanations, the open-ended condition showed a marginally significant advantage ($β$=+7.3%, $p$=.057). These findings suggest that LLM-supported open-ended self-explanation can improve explanation quality on NEI transfer problems, with weaker evidence across broader transfer explanation measures. Notably, these effects emerged even though learners in the open-ended condition completed substantially fewer practice problems within the same practice time.
Authors:Kihoon Son, Hyewon Lee, DaEun Choi, Yoonsu Kim, Tae Soo Kim, Yoonjoo Lee, John Joon Young Chung, HyunJoon Jung, Juho Kim
Abstract:
Human collaborators coordinate dynamically through process visibility and workspace awareness, yet AI agents typically either provide only final outputs or expose read-only execution processes (e.g., planning, reasoning) without interpreting concurrent user actions on shared artifacts. Building on mixed-initiative interaction principles, we explore whether agents can achieve collaborative context awareness -- interpreting concurrent user actions on shared artifacts and adapting in real-time. Study 1 (N=10 professional designers) revealed that process visibility enabled reasoning about agent actions but exposed conflicts when agents could not distinguish feedback from independent work. We developed CLEO, which interprets collaborative intent and adapts in real-time. Study 2 (N=10, two-day with stimulated recall interviews) analyzed 214 turns, identifying five action patterns, six triggers, and four enabling factors explaining when designers choose delegation (70.1%), direction (28.5%), or concurrent work (31.8%). We present a decision model with six interaction loops, design implications, and an annotated dataset.
Authors:Eason Chen, Sophia Judicke, Kayla Beigh, Xinyi Tang, Isabel Wang, Nina Yuan, Zimo Xiao, Chuangji Li, Shizhuo Li, Reed Luttmer, Shreya Singh, Maria Yampolsky, Naman Parikh, Yvonne Zhao, Meiyi Chen, Scarlett Huang, Anishka Mohanty, Gregory Johnson, John Mackey, Jionghao Lin, Ken Koedinger
Abstract:
We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, while we did not observe this performance increase transfer to exam scores. Usage logs show that students with lower self-efficacy and prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled using an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking or help-seeking). In models controlling for prior performance and self-efficacy, higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance, whereas proof-review usage showed no detectable independent association. Together, the findings suggest that chatbot-based support alone may not reliably support transfer to independent assessment of math proof-learning outcomes, whereas work-anchored, structured feedback appears less associated with reduced learning.
Authors:Haoyu Hu, Raja Marjieh, Katherine M Collins, Chenyi Li, Thomas L. Griffiths, Ilia Sucholutsky, Nori Jacoby
Abstract:
Writing code has been one of the most transformative ways for human societies to translate abstract ideas into tangible technologies. Modern AI is transforming this process by enabling experts and non-experts alike to generate code without actually writing code, but instead, through natural language instructions, or "vibe coding". While increasingly popular, the cumulative impact of vibe coding on productivity and collaboration, as well as the role of humans in this process, remains unclear. Here, we introduce a controlled experimental framework for studying collaborative vibe coding and use it to compare human-led, AI-led, and hybrid groups. Across 16 experiments involving 604 human participants, we show that people provide uniquely effective high-level instructions for vibe coding across iterations, whereas AI-provided instructions often result in performance collapse. We further demonstrate that hybrid systems perform best when humans retain directional control (providing the instructions), while evaluation is delegated to AI.
Authors:Ansgar Howey, Tim Schreiter, Andrey Rudenko, Achim J. Lilienthal
Abstract:
Automated Guided Vehicles (AGV) in factory automation are increasingly capable of moving autonomously in close proximity to human workers. While their physical safety is regulated by standards and directives, perceived safety and workers comfort in close-proximity interactions are being actively investigated in studies. There are three limitations in the prior art research to that end. Firstly, AGVs with larger payloads are understudied. Secondly, the test participants are usually students and not working professionals. Thirdly, while conducting in-person experiments with heavy machinery can be dangerous, the transfer of safety perception results from simulated experiments remains open. In this paper, we investigate industrial workers perceived safety in shared spaces with large AGVs in a real-world encounter and in virtual reality. We vary the passing distance and the shape of the collision avoidance maneuver, and evaluate perceived threat level using a handheld pressure-sensitive trigger interface and a post-experiment questionnaire. Additionally, we ask participants to set their own collision avoidance parameters based on their experience with the demonstrated trajectory profiles. In a within-subject study, we found that, while the threat levels are perceived overall slightly higher in VR, the passing distance of 1.5 to 2 meters is preferred among the demonstrated profiles, as well as in the self-defined trajectories.
Authors:Miina Koyama, Ruiwei Xiao, John Stamper
Abstract:
Chatbots have long been explored as tools to support learning, and recent advances in large language models have significantly expanded the availability of platforms for educators to author AI tutoring chatbots. Yet effective authorship demands more than writing a system prompt; it requires educators to act as learning designers, AI interaction designers, and QA engineers. In practice, however, teachers rarely fulfill these roles. Our formative study found that virtually none systematically tested their bots before deploying them to students. To address this gap, we present PromptDecipher, a system that restructures the authoring workflow around a direct correction-based interaction rather than writing abstract system prompts, teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip. PromptDecipher will be deployed in an AI for Educators course enrolling hundreds of higher-education instructors. A live prototype (https://teacher-prompting.vercel.app/), an anonymized codebase (https://anonymous.4open.science/r/teacher-prompting-2EDF/), and anonymized demo (https://tinyurl.com/las-prompt-decipher-demo) are available via links in the footnote.
Authors:Juhyeon Lee, Wonduk Seo, Junseo Koh, Seunghyun Lee, Haihua Chen, Yi Bu
Abstract:
Detecting harmful content in multi turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval augmented framework that incorporates concise human written moral norms, called Rules of Thumb (RoTs), into LLM based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.
Authors:Yu Mei, Ziyao Zhang, Qingyang Wan, Shiyi Wang, Ge Wang, Jie Cai, Chun Yu, Yuanchun Shi
Abstract:
Parent-AI collaboration to support real-time conversations with children is challenging due to the sensitivity and open-ended nature of such interactions. Existing systems often simplify collaboration into static modes, providing limited support for adapting AI to continuously evolving conversational contexts. To address this gap, we systematically investigate the dynamics of parent-AI collaboration modes in real-time conversations with children. We conducted a co-design study with eight parents and developed COMPASS, a research probe that enables flexible combinations of parental support functions during conversations. Using COMPASS, we conducted a lab-based study with 21 parent-child pairs. We show that parent-AI collaboration unfolds through evolving modes that adapt systematically to contextual factors. We further identify three types of parental strategies--parent-oriented, child-oriented, and relationship-oriented--that shape how parents engage with AI. These findings advance the understanding of dynamic human-AI collaboration in relational, high-stakes settings and inform the design of flexible, context-adaptive parental support systems.
Authors:Chengwen Zhang, Chun Yu, Borong Zhuang, Haopeng Jin, Qingyang Wan, Zhuojun Li, Zhe He, Zhoutong Ye, Yu Mei, Chang Liu, Weinan Shi, Yuanchun Shi
Abstract:
Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI) - determining who issued a command - is especially challenging due to multi-user and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as binding cues by aligning robot-mounted camera optical flow with hand-worn IMU signals. We first elicit a user-defined (N=12) gesture set and collect a multimodal command gesture dataset (N=38) in long-range multi-user HRI scenarios. Next, HiSync extracts frequency-domain hand motion features from both camera and IMU data, and a learned CSINet denoises IMU readings, temporally aligns modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes up to 34m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated on real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI.
Authors:Dimitrios Apostolakis, Georgios Angelidis, Vasileios Argyriou, Panagiotis Sarigiannidis, Georgios Th. Papadopoulos
Abstract:
A user-centered AR interface for disaster response is presented in this work that uses 3D Gaussian Splatting (3DGS) to visualize detailed scene reconstructions, while maintaining situational awareness and keeping cognitive load low. The interface relies on a lightweight interaction approach, combining World-in-Miniature (WIM) navigation with semantic Points of Interest (POIs) that can be filtered as needed, and it is supported by an architecture designed to stream updates as reconstructions evolve. User feedback from a preliminary evaluation indicates that this design is easy to use and supports real-time coordination, with participants highlighting the value of interaction and POIs for fast decision-making in context. Thorough user-centric performance evaluation demonstrates strong usability of the developed interface and high acceptance ratios.
Authors:Ruiwei Xiao, Runlong Ye, Xinying Hou, Jessica Wen, Harsh Kumar, Michael Liut, John Stamper
Abstract:
Despite universal GenAI adoption, students cannot distinguish task performance from actual learning and lack skills to leverage AI for learning, leading to worse exam performance when AI use remains unreflective. Yet few interventions teaching students to prompt AI as a tutor rather than solution provider have been validated at scale through randomized controlled trials (RCTs). To bridge this gap, we conducted a semester-long RCT (N=979) with four ICAP framework-based instructional conditions varying in engagement intensity with a pre-test, immediate and delayed post-test and surveys. Mixed methods analysis results showed: (1) All conditions significantly improved prompting skills, with gains increasing progressively from Condition 1 to Condition 4, validating ICAP's cognitive engagement hierarchy; (2) for students with similar pre-test scores, higher learning gain in immediate post-test predict higher final exam score, though no direct between-group differences emerged; (3) Our interventions are suitable and scalable solutions for diverse educational contexts, resources and learners. Together, this study makes empirical and theoretical contributions: (1) theoretically, we provided one of the first large-scale RCTs examining how cognitive engagement shapes learning in prompting literacy and clarifying the relationship between learning-oriented prompting skills and broader academic performance; (2) empirically, we offered timely design guidance for transforming GenAI classroom policies into scalable, actionable prompting literacy instruction to advance learning in the era of Generative AI.
Authors:Ningjing Tang, Alice Qian, Qiaosi Wang, Esther Howe, Blake Bullwinkel, Paola Pedrelli, Jina Suh, Hoda Heidari, Hong Shen
Abstract:
Content Warning: This paper contains participant quotes and discussions related to mental health challenges, emotional distress, and suicidal ideation. Large language models (LLMs) are increasingly used for mental health support, yet the model safeguards -- particularly refusals to engage with sensitive content -- remain poorly understood from the perspectives of users and mental health professionals (MHPs) and have been reported to cause real-world harms. This paper presents findings from a sequential mixed-methods study examining how LLM refusals are experienced and interpreted in mental health support interactions. Through surveys (N=53) and in-depth interviews (N=16) with individuals using LLMs for mental health support and MHPs, we reveal that refusals are not isolated, single-turn system behaviors but rather constitute dynamic, multi-phase experiences: pre-refusal expectation formation, refusal triggering and encounter, refusal message framing, resource referral provision, and post-refusal outcomes. We contribute a multi-phase framework for evaluating refusals beyond binary policy compliance accuracy and design recommendations for future refusal mechanisms. These findings suggest that understanding LLM refusals requires moving beyond single-turn interactions toward recognizing them as holistic experiences embedded within users' support-seeking trajectories and the broader LLM design pipeline.
Authors:Andrew Jelson, Daniel Manesh, Sangwook Lee, Alice Jang, Daniel Dunlap, Tamara Maddox, Young-Ho Kim, Sang Won Lee
Abstract:
With the rapid adoption of AI writing assistants in education, educators and researchers need empirical evidence to understand the impact on student writing and inform effective pedagogical design. Despite widespread use, we lack systematic understanding of how students engage with these tools during authentic writing tasks: when they seek assistance, what they ask, and how they incorporate AI-generated content into their essays. This gap limits evidence-based policy development and rigorous evaluation of generative AI's learning effects. To address this gap, we introduce NIRVANA, a dataset capturing how university students use generative AI while writing an analytical essay. The dataset includes 77 students who completed an essay task with access to ChatGPT, recording keystroke-level writing behavior, full ChatGPT conversation histories, and all text copied from ChatGPT, enabling a complete reconstruction of the writing process and revealing how AI assistance shapes student work. Our analysis identifies key behavioral patterns, including variation in ChatGPT query frequency and its relationship to essay characteristics such as length and readability. We identify four writing profiles based on students' contribution and revision patterns: Lead Authors, Collaborators, Drafters, and Vibe Writers. To support deeper investigation, we developed a replay interface that reconstructs the writing process; qualitative analysis of sampled replays demonstrates how this tool enables systematic examination of student-AI interactions.
Authors:Sangwook Lee, Sang Won Lee, Adnan Abbas, Young-Ho Kim, Yan Chen
Abstract:
Modern task-oriented chatbots present GUI elements alongside natural-language dialogue, yet the agent's role has largely been limited to interpreting natural-language input as GUI actions and following a linear workflow. In preference-driven, multi-step tasks such as booking a flight or reserving a restaurant, earlier choices constrain later options and may force users to restart from scratch. User preferences serve as the key criteria for these decisions, yet existing agents do not systematically leverage them. We present MAESTRO, which extends the agent's role from execution to decision support. MAESTRO maintains a shared preference memory that extracts preferences from natural-language utterances with their strength, and provides two mechanisms. Preference-Grounded GUI Adaptation applies in-place operators (augment, sort, filter, and highlight) to the existing GUI according to preference strength, supporting within-stage comparison. Preference-Guided Workflow Navigation detects conflicts between preferences and available options, proposes backtracking, and records failed paths to avoid revisiting dead ends. We evaluated MAESTRO in a movie-booking Conversational Agent with GUI (CAG) through a within-subjects study with two conditions (Baseline vs. MAESTRO) and two modes (Text vs. Voice), with N = 33 participants.
Authors:Zixin Chen, Yuhang Zeng, Sicheng Song, Yanna Lin, Xian Xu, Huamin Qu, Meng Xia
Abstract:
Multiple-choice questions (MCQs) are a widely used educational tool, particularly in domains such as visualization literacy that require broad conceptual coverage and support diverse real-world applications. However, designing high-quality visualization literacy MCQs remains challenging, as instructors must coordinate multimodal elements (e.g., charts, question stems, and distractors), address diverse visualization tasks, and accommodate learners with heterogeneous backgrounds. Existing visualization literacy assessments primarily rely on standardized, fixed item banks, offering limited support for iterative question design that adapts to differences in learners' abilities, backgrounds, and reasoning strategies. To address these challenges, we present VizQStudio, a visual analytics system that supports instructors in iteratively designing and refining visualization literacy MCQs using MLLM-powered simulated students. Instructors can specify diverse student profiles spanning demographics, knowledge levels, and learning-related traits. The system then visualizes how simulated students reason about and respond to different question components, helping instructors explore potential misconceptions, difficulty calibration, and design trade-offs prior to classroom deployment. We investigate VizQStudio through a mixed-method evaluation, including expert interviews, case studies, a classroom deployment, and a large-scale online study. Overall, this work reframes MLLM-based student simulation in assessment authoring as a design-time, exploratory aid. By examining both its value and limitations in realistic instructional settings, we surface design insights that inform how future systems can support instructor-centered, iterative, and responsible uses of AI for multimodal assessment design in visualization literacy and related domains.
Authors:Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham
Abstract:
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.
Authors:Zhida Sun, Xiaodong Wang, Zhenyao Zhang, Min Lu, Dani Lischinski, Daniel Cohen-Or, Hui Huang
Abstract:
Visual communication often needs stylistically consistent icons that span concrete and abstract meanings, for use in diverse contexts. We present Iconix, a human-AI co-creative system that organizes icon generation along two axes: semantic richness (what is depicted) and visual complexity (how much detail). Given a user-specified concept, Iconix constructs a semantic scaffold of related analytical perspectives and employs chained, image-conditioned generation to produce a coherent style of exemplars. Each exemplar is then automatically distilled into a progressive sequence, from detailed and elaborate to abstract and simple. The resulting two-dimensional grid exposes a navigable space, helping designers reason jointly about figurative content and visual abstraction. A within-subjects study (N = 32) found that compared to a baseline workflow, participants produced icon grids more creatively, reported lower workload, and explored a coherent range of design variations. We discuss implications for human-machine co-creative approaches that couple semantic scaffolding with progressive simplification to support visual abstraction.
Authors:Hongxiao Li, Chenxi Wang, Fanda Fan, Zihan Wang, Wanling Gao, Lei Wang, Jianfeng Zhan
Abstract:
Evaluation is the foundation of empirical science, yet the evaluation of evaluation itself -- so-called meta-evaluation -- remains strikingly underdeveloped. While methods such as observational studies, design of experiments (DoE), and randomized controlled trials (RCTs) have shaped modern scientific practice, there has been little systematic inquiry into their comparative validity and utility across domains. Here we introduce a formal framework for meta-evaluation by defining the evaluation space, its structured representation, and a benchmark we call AxiaBench. AxiaBench enables the first large-scale, quantitative comparison of ten widely used evaluation methods across eight representative application domains. Our analysis reveals a fundamental limitation: no existing method simultaneously achieves accuracy and efficiency across diverse scenarios, with DoE and observational designs in particular showing significant deviations from real-world ground truth. We further evaluate a unified method of entire-space stratified sampling from previous evaluatology research, and the results report that it consistently outperforms prior approaches across all tested domains. These results establish meta-evaluation as a scientific object in its own right and provide both a conceptual foundation and a pragmatic tool set for advancing trustworthy evaluation in computational and experimental research.
Authors:Caleb Wohn, Buse Çarık, Xiaohan Ding, Sang Won Lee, Young-Ho Kim, Eugenia H. Rho
Abstract:
Autistic individuals sometimes disclose autism when asking LLMs for social advice, hoping for more personalized responses. However, they also recognize that these systems may reproduce stereotypes, raising uncertainty about the risks and benefits of disclosure. We conducted a mixed-methods study combining a large-scale LLM audit experiment with interviews involving 11 autistic participants. We developed a six-step pipeline operationalizing 12 documented autism stereotypes into decision-making scenarios framed as users requesting advice (e.g., "Should I do A or B?"). We generated 345,000 responses from six LLMs and measured how advice shifted when prompts disclosed autism versus when they did not. When autism was disclosed, LLMs disproportionately recommended avoiding stereotypically stressful situations, including social events, confrontations, new experiences, and romantic relationships. While some participants viewed this as affirming, others criticized it as infantilizing or undermining opportunities for growth. Our study illuminates how the intermingling of affirmation and stereotyping complicates the personalization of LLMs.
Authors:Vivienne Bihe Chi, Reyhan Jamalova, Lyle Ungar, Sharath Chandra Guntuku
Abstract:
Mainstream film is one of the richest sources of cultural content that AI systems learn from. Yet we have few tools for measuring the gender values it encodes. We present a proof-of-concept framework that turns fictional film characters into surveyable LLM agents. Using 160 U.S. films (1990--2019), we build 734 character agents from script dialogue and scene descriptions, condense their personas via expert-style reflections, and simulate World Values Survey gender-attitude responses. Agents reproduce systematic gender differences without explicit demographic prompting, suggesting attitudes emerge from behavior rather than identity labels. Benchmarked against historical survey data, agents exaggerate gender gaps and show greater decade-to-decade volatility than real populations. Narrative sharpens rather than homogenizes gender contrasts, complicating the consistent-input assumption underlying cultivation theory's mainstreaming mechanism. AI systems trained on such corpora may inherit this stylization before any model-level amplification occurs.
Authors:Vivienne Bihe Chi, Adithya V Ganesan, Ryan L Boyd, Lyle Ungar, Sharath Chandra Guntuku
Abstract:
Large language models are increasingly used for mental health support, yet little is known about whether their responses are psychologically safe across different help-seeking styles. We examine a foundational distinction in emotional disclosure, venting vs. advice-seeking, and whether LLMs respond in ways that regulate or amplify distress. Using 178,800 Reddit posts, we first show the two help-seeking styles are linguistically distinguishable at scale. We then introduce a measurement framework grounded in interpersonal emotion regulation theory that captures Regulation and Escalation as empirically independent dimensions. Across persona conditions (default, friend, therapist), GPT-5.3 responses systematically mirror help-seeking style: venting elicits more regulation, but also more escalation. Therapist personas reduce escalation while maintaining regulation, whereas friend personas increase both. A crowdsourced human study finds no user experience penalty for the safer therapist condition, but reveals that lay raters cannot reliably detect escalation without expert knowledge. Responses that feel supportive may simultaneously intensify distress in ways standard safety evaluation cannot see, and empathy metrics alone cannot replace a framework that measures both.
Authors:Faraz Faruqi, Demircan Tas, Arthur Caetano, Niccolò Meniconi, Oğuz Arslan, Misha Sra, Ruofei Du, Stefanie Mueller, Mustafa Doga Dogan
Abstract:
Recent developments in 3D generative AI enable users to create bespoke 3D models from text or image prompts. However, these approaches provide limited control over spatial structure, making them ill suited for tasks requiring precise geometric composition. We present MiXR, an XR system for in-situ compositional modeling that enables users to create new 3D models by harvesting geometry from their environment. Users extract segments from captured objects and assemble new artifacts through direct 3D manipulation, while generative AI synthesizes a coherent model from the user-defined composition. This hybrid workflow allows users to define spatial structure explicitly while delegating geometric refinement to generative models, enabling them to specify spatial intent that is difficult to express through verbal prompts alone. In a controlled user study ($N=12$), participants using MiXR rated their designs as significantly closer to the target, felt more in control, and experienced lower cognitive workload compared to a generative composition baseline.
Authors:Jana Gonnermann-Müller, Jennifer Haase, Nicolas Leins, Thomas Kosch, Sebastian Pokutta
Abstract:
Student simulation with Large language models (LLMs) offers a scalable alternative for educational research and teacher training. Yet, its validity depends on whether models maintain stable personas across extended interactions. We test this prerequisite using a dual-assessment framework measuring self-reported characteristics and observer-rated behavioral expressions. Across two experiments testing four clinically-grounded ADHD persona conditions, five LLMs, and three prompt designs, we quantify between-conversation stability (N=4,968) and within-conversation stability (N=3,952 across 9 turns). Self-reported characteristics remain stable for high intensities, constituting a necessary prerequisite for valid behavioral simulation. Observer-rated behavioral expression reveals selective instability: within-conversation drift occurs in unscripted dialog for high and moderate ADHD personas. Scripted interactions with explicit task prompts eliminate this drift entirely. Stable, persona-aligned simulated learners benefit from a structured interaction design to maintain behavioral coherence, which holds significant implications for teacher training, adaptive tutoring, and any application requiring sustained, path-dependent learner interactions.
Authors:Zeyu Fang, Yuxin Lin, Cheng Liu, Beomyeol Yu, Zeyuan Yang, Rongqian Chen, Taeyoung Lee, Mahdi Imani, Tian Lan
Abstract:
Effective human-robot collaboration in open-world environments requires joint planning under uncertain conditions. However, existing approaches often treat humans as passive supervisors, preventing autonomous agents from becoming human-like teammates that can actively model teammate behaviors, reason about knowledge gaps, query, and elicit responses through communication to resolve uncertainties. To address these limitations, we propose a unified human-robot joint planning system designed to tackle dual sources of uncertainty: task-relevant knowledge gaps and latent human intent. Our system operates in two complementary modes. First, an uncertainty-mitigation joint planning module enables two-way conversations to resolve semantic ambiguity and object uncertainty. It utilizes an LLM-assisted active elicitation mechanism and a hypothesis-augmented A^* search, subsequently computing an optimal querying policy via dynamic programming to minimize interaction and verification costs. Second, a real-time intent-aware collaboration module maintains a probabilistic belief over the human's latent task intent via spatial and directional cues, enabling dynamic, coordination-aware task selection for agents without explicit communication. We validate the proposed system in both Gazebo simulations and real-world UAV deployments integrated with a Vision-Language Model (VLM)-based 3D semantic perception pipeline. Experimental results demonstrate that the system significantly cuts the interaction cost by 51.9% in uncertainty-mitigation planning and reduces the task execution time by 25.4% in intent-aware cooperation compared to the baselines.
Authors:Zeyu Fang, Tian Lan, Mahdi Imani
Abstract:
Joint planning through language-based interactions is a key area of human-AI teaming. Planning problems in the open world often involve various aspects of incomplete information and unknowns, e.g., objects involved, human goals/intents -- thus leading to knowledge gaps in joint planning. We consider the problem of discovering optimal interaction strategies for AI agents to actively elicit human inputs in object-driven planning. To this end, we propose Minimal Information Neuro-Symbolic Tree (MINT) to reason about the impact of knowledge gaps and leverage self-play with MINT to optimize the AI agent's elicitation strategies and queries. More precisely, MINT builds a symbolic tree by making propositions of possible human-AI interactions and by consulting a neural planning policy to estimate the uncertainty in planning outcomes caused by remaining knowledge gaps. Finally, we leverage LLM to search and summarize MINT's reasoning process and curate a set of queries to optimally elicit human inputs for best planning performance. By considering a family of extended Markov decision processes with knowledge gaps, we analyze the return guarantee for a given MINT with active human elicitation. Our evaluation on three benchmarks involving unseen/unknown objects of increasing realism shows that MINT-based planning attains near-expert returns by issuing a limited number of questions per task while achieving significantly improved rewards and success rates.
Authors:Lufeng Feng, Baomin Xu, Haoran Zhang, Bihai Lin, Zuxuan Deng, Sidi Tao, Chenyu Liu, Shifan Jia, Li Duan, Ziyu Jia
Abstract:
Unilateral limb motor imagery (MI) plays an important role in upper-limb motor rehabilitation and precise control of external devices, and places higher demands on spatial resolution. However, most existing public datasets focus on binary- or four-class left-right limb paradigms that mainly exploit coarse hemispheric lateralization, and there is still a lack of multimodal datasets that simultaneously record EEG and fNIRS for unilateral multi-directional MI. To address this gap, we constructed MIND, a public motor imagery fNIRS-EEG dataset based on a four-class directional MI paradigm of the right upper limb. The dataset includes 64-channel EEG recordings (1000 Hz) and 51-channel fNIRS recordings (47.62 Hz) from 30 participants (12 females, 18 males; aged 19.0-25.0 years). We analyse the spatiotemporal characteristics of EEG spectral power and hemodynamic responses, and validate the potential advantages of hybrid fNIRS-EEG BCIs in terms of classification accuracy. We expect that this dataset will facilitate the evaluation and comparison of neuroimaging analysis and decoding methods.
Authors:Yunhao Luo, Arthur Caetano, Avinash Ajit Nargund, Tobias Höllerer, Misha Sra
Abstract:
In mixed-initiative systems, the mode of AI assistance delivery can be as consequential as the assistance itself. We investigated two assistance delivery modes: on-demand help (users request via Button) and pre-scheduled help (assistance delivered at user-selected intervals, with user actions resetting the Timer). To evaluate these modes, we selected Rush Hour puzzles as the human-AI collaborative task because they capture elements of real-world problem solving such as analysis, resource management, and decision-making under constraints. To enhance ecological validity, we imposed monetary costs for both time and AI assistance, simulating scenarios where people must balance implicit or explicit trade-offs such as time pressure, financial limitations, or opportunity costs. Although task performance was comparable across modes, participants who used the pre-scheduled (Timer) mode reported more positive perceptions of the AI, even when their ending budget was low. This suggests that assistance delivery mode can shape user experience independent of task outcomes, indicating that human-AI systems may need to consider how AI assistance is delivered alongside improving task performance.
Authors:Jana Gonnermann-Müller, Jennifer Haase, Nicolas Leins, Thomas Kosch, Sebastian Pokutta
Abstract:
Large Language Models (LLMs) acting as artificial agents offer the potential for scalable behavioral research, yet their validity depends on whether LLMs can maintain stable personas across extended conversations. We address this point using a dual-assessment framework measuring both self-reported characteristics and observer-rated persona expression. Across two experiments testing four persona conditions (default, high, moderate, and low ADHD presentations), seven LLMs, and three semantically equivalent persona prompts, we examine between-conversation stability (3,473 conversations) and within-conversation stability (1,370 conversations and 18 turns). Self-reports remain highly stable both between and within conversations. However, observer ratings reveal a tendency for persona expressions to decline during extended conversations. These findings suggest that persona-instructed LLMs produce stable, persona-aligned self-reports, an important prerequisite for behavioral research, while identifying this regression tendency as a boundary condition for multi-agent social simulation.
Authors:Jana Gonnermann-Müller, Jennifer Haase, Nicolas Leins, Moritz Igel, Konstantin Fackeldey, Sebastian Pokutta
Abstract:
Classrooms are becoming increasingly heterogeneous, comprising learners with diverse performance and motivation levels, language proficiencies, and learning differences such as dyslexia and ADHD. While teachers recognize the need for differentiated instruction, growing workloads create substantial barriers, making differentiated instruction an ideal that is often unrealized in practice. Current AI educational tools, which promise differentiated materials, are predominantly student-facing and performance-centric, ignoring other aspects that shape learning outcomes. We introduce FACET, a teacher-facing multi-agent framework designed to address these gaps by supporting differentiation that accounts for motivation, performance, and learning differences. Developed with educational stakeholders from the outset, the framework coordinates four specialized agents, including learner simulation, diagnostic assessment, material generation, and evaluation within a teacher-in-the-loop design. School principals (N = 30) shaped system requirements through participatory workshops, while in-service K-12 teachers (N = 70) evaluated material quality. Mixed-methods evaluation demonstrates strong perceived value for inclusive differentiation. Practitioners emphasized both the urgent need arising from classroom heterogeneity and the importance of maintaining pedagogical autonomy as a prerequisite for adoption. We discuss implications for future school deployment and outline partnerships for longitudinal classroom implementation.
Authors:Sneha Shashidhara, Vivienne Bihe Chi, Abhay P Singh, Lyle Ungar, Sharath Chandra Guntuku
Abstract:
Spoken English proficiency is a powerful driver of economic mobility for low-income Indian youth, yet opportunities for spoken practice remain scarce in schools. We investigate the deployment of a voice-based chatbot for English conversation practice across four low-resource schools in Delhi. Through a six-day field study combining observations and interviews, we captured the perspectives of students, teachers, and principals. Findings confirm high demand across all groups, with notable gains in student speaking confidence. Our multi-stakeholder analysis surfaced a tension in long-term adoption vision: students favored open-ended conversational practice, while administrators emphasized curriculum-aligned assessment. We offer design recommendations for voice-enabled chatbots in low-resource multilingual contexts, highlighting the need for more intelligible speech output for non-native learners, one-tap interactions with simplified interfaces, and actionable analytics for educators. Beyond language learning, our findings inform the co-design of future AI-based educational technologies that are socially sustainable within the complex ecosystem of low-resource schools.
Authors:Florian 'Floyd' Mueller, Nadia Bianchi-Berthouze, Misha Sra, Mar Gonzalez-Franco, Henning Pohl, Susanne Boll, Richard Byrne, Arthur Caetano, Masahiko Inami, Jarrod Knibbe, Per Ola Kristensson, Xiang Li, Zhuying Li, Joe Marshall, Louise Petersen Matjeka, Minna Nygren, Rakesh Patibanda, Sara Price, Harald Reiterer, Aryan Saini, Oliver Schneider, Ambika Shahu, Jürgen Steimle, Phoebe O. Toups Dugas, Don Samitha Elvitigala
Abstract:
Advances in emerging technologies, such as on-body mechanical actuators and electrical muscle stimulation, have allowed computers to take control over our bodies. This presents opportunities as well as challenges, raising fundamental questions about agency and the role of our bodies when interacting with technology. To advance this research field as a whole, we brought together expert perspectives in a week-long seminar to articulate the grand challenges that should be tackled when it comes to the design of computers' control over our bodies. These grand challenges span technical, design, user, and ethical aspects. By articulating these grand challenges, we aim to begin initiating a research agenda that positions bodily control not only as a technical feature but as a central, experiential, and ethical concern for future human-computer interaction endeavors.
Authors:Yingchaojie Feng, Qiang Huang, Xiaoya Xie, Zhaorui Yang, Jun Yu, Wei Chen, Anthony K. H. Tung
Abstract:
Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.
Authors:Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds, Anna Pot, Yongchan Kwon, James Zou
Abstract:
Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully ''clone'' an individual's voice. Instead, we find that widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.
Authors:Nicolas Dickenmann, Yanis Merzouki, Sonia Laguna, Thy Nowak-Tran, Emanuele Palumbo, Julia E. Vogt, Gerda Binder
Abstract:
EasyRead pictograms are simple, visually clear images that represent specific concepts and support comprehension for people with intellectual disabilities, low literacy, or language barriers. The large-scale production of EasyRead content has traditionally been constrained by the cost and expertise required to manually design pictograms. In contrast, automatic generation of such images could significantly reduce production time and cost, enabling broader accessibility across digital and printed materials. However, modern diffusion-based image generation models tend to produce outputs that exhibit excessive visual detail and lack stylistic stability across random seeds, limiting their suitability for clear and consistent pictogram generation. This challenge highlights the need for methods specifically tailored to accessibility-oriented visual content. In this work, we present a unified pipeline for generating EasyRead pictograms by fine-tuning a Stable Diffusion model using LoRA adapters on a curated corpus that combines augmented samples from multiple pictogram datasets. Since EasyRead pictograms lack a unified formal definition, we introduce an EasyRead score to benchmark pictogram quality and consistency. Our results demonstrate that diffusion models can be effectively steered toward producing coherent EasyRead-style images, indicating that generative models can serve as practical tools for scalable and accessible pictogram production.
Authors:Chenyi Li, Raja Marjieh, Haoyu Hu, Mark Steyvers, Katherine M. Collins, Ilia Sucholutsky, Nori Jacoby
Abstract:
Generative AI is increasingly transforming creativity into a hybrid human-artificial process, but its impact on the quality and diversity of creative output remains unclear. We study collective creativity using a controlled word-guessing task that balances open-endedness with an objective measure of task performance. Participants attempt to infer a hidden target word, scored based on the semantic similarity of their guesses to the target, while also observing the best guess from previous players. We compare performance and outcome diversity across human-only, AI-only, and hybrid human-AI groups. Hybrid groups achieve the highest performance while preserving high diversity of guesses. Within hybrid groups, both humans and AI agents systematically adjust their strategies relative to single-agent conditions, suggesting higher-order interaction effects, whereby agents adapt to each other's presence. Although some performance benefits can be reproduced through collaboration between heterogeneous AI systems, human-AI collaboration remains superior, underscoring complementary roles in collective creativity.
Authors:Tae Soo Kim, Yoonjoo Lee, Jaesang Yu, John Joon Young Chung, Juho Kim
Abstract:
To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking "what kind of tone do you want?" fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options -- where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.
Authors:Candida M. Greco, Lucio La Cava, Andrea Tagarelli
Abstract:
Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.
Authors:Minkyu Kweon, Seokhyeon Park, Soohyun Lee, You Been Lee, Jeongmin Rhee, Jinwook Seo
Abstract:
Modern mobile applications rely on hidden interactions--gestures without visual cues like long presses and swipes--to provide functionality without cluttering interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents--systems designed to automate tasks on mobile user interfaces, powered by vision language models (VLMs)--struggle to detect veiled interactions or determine actions for completing tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations with VLMs show that models fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI's potential as a foundation for advancing mobile task automation.
Authors:Seokhyeon Park, Soohyun Lee, Eugene Choi, Hyunwoo Kim, Minkyu Kweon, Yumin Song, Jinwook Seo
Abstract:
While generative AI enables high-fidelity UI generation from text prompts, users struggle to articulate design intent and evaluate or refine results-creating gulfs of execution and evaluation. To understand the information needed for UI generation, we conducted a thematic analysis of UI prompting guidelines, identifying key design semantics and discovering that they are hierarchical and interdependent. Leveraging these findings, we developed a system that enables users to specify semantics, visualize relationships, and extract how semantics are reflected in generated UIs. By making semantics serve as an intermediate representation between human intent and AI output, our system bridges both gulfs by making requirements explicit and outcomes interpretable. A comparative user study suggests that our approach enhances users' perceived control over intent expression, outcome interpretation, and facilitates more predictable, iterative refinement. Our work demonstrates how explicit semantic representation enables systematic and explainable exploration of design possibilities in AI-driven UI design.
Authors:Santosh Chapagain, MohammadReza EskandariNasab, Onur Vural, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
Abstract:
Solar activity, including solar flares, coronal mass ejections (CMEs), and geomagnetic storms, can significantly impact satellites, aviation, power grids, data centers, and space missions. Extreme solar events can cause substantial economic damage with limited advance warning, underscoring the importance of early-warning systems, accurate forecasting, and effective education in space science. Although large language models (LLMs) perform well on general tasks, they often lack domain-specific knowledge and pedagogical capability to clearly explain complex space science concepts. We introduce SolarGPT-QA, a question answering system based on a domain-adapted large language model built on the LLaMA-3 base model. The model is trained using scientific literature and large-scale question-answer data generated with GPT-4 and refined using Grok-3 in a student-friendly storytelling style. Human pairwise evaluations show that SolarGPT-QA outperforms general-purpose models in zero-shot settings and achieves competitive performance compared to instruction-tuned models for educational explanations in space weather and heliophysics. A small pilot student comprehension study further suggests improved clarity and accessibility of the generated explanations. Ablation experiments indicate that combining domain-adaptive pretraining with pedagogical fine-tuning is important for balancing scientific accuracy and educational effectiveness. This work represents an initial step toward a broader SolarGPT framework for space science education and forecasting.
Authors:Yuhao Yang, Tianyu Fan, Chao Huang
Abstract:
As large language models advance in reasoning and tool use capabilities, researchers increasingly seek to leverage them for computer use agents that can interact with existing software. The dominant approach develops GUI agents that control applications through visual interfaces: interpreting screenshots, locating UI elements, and executing mouse clicks to mimic human interaction. This GUI-centric paradigm fundamentally misaligns with agent capabilities. Current GUI agents struggle with brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes. They force agents to emulate human perceptual limitations rather than leverage their computational strengths in structured data processing and programmatic control. CLI-Anything argues for agent-native computer use design. Instead of forcing agents to navigate visual layouts, we create interfaces aligned with how agents naturally operate: through structured commands, explicit state representations, and deterministic feedback. We transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols optimized for AI-native interaction. This eliminates the lossy visual-to-computational translation that plagues GUI agents. Rather than building sophisticated screen readers and click simulators, we should redesign interaction paradigms around agent strengths: precise programmatic control and deterministic execution. We examine the methodology, architecture, evidence, and future directions for this agent-native transformation of computer use. We have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision. The platform provides methodology, architecture, and infrastructure for this fundamental transformation of computer use.
Authors:Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani
Abstract:
Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.
Authors:Tsvetomila Mihaylova, Evanfiya Logacheva, Arto Hellas, Jing Fan, Francisco Castro, Bita Akram, Narges Norouzi, Peter Brusilovsky, Juho Leinonen
Abstract:
When programming students encounter errors in their code, compiler messages or static analysis output often provide limited guidance, particularly for novice programmers. Personalized feedback from instructors can be effective but does not scale well. Recent advances in large language models (LLMs) enable automated feedback generation at scale. This study examines whether LLM-generated feedback with different levels of guidance is associated with differences in students' problem-solving behavior. We analyze effects on time to solution and number of attempts, and examine whether these effects differ by programming experience. We design three feedback types and compare them to a baseline in which students receive only compiler error messages. Results from an online programming course show that LLM-generated feedback is associated with faster time to solution compared to the no-feedback baseline, with less guided feedback showing slightly stronger effects. Overall, the findings suggest that feedback structure plays an important role in how students progress toward correct solutions and motivate further work on adaptive feedback designs and longer-term learning outcomes.
Authors:Juho Leinonen, Lisa Zhang, Arto Hellas
Abstract:
As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI generated slides are ``good'' via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. Additionally, we find a negative correlation between a high quality rating and a high ``AI-generated'' rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.
Authors:Rose Niousha, Samantha Boatright Smith, Bita Akram, Peter Brusilovsky, Arto Hellas, Juho Leinonen, John DeNero, Narges Norouzi
Abstract:
Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do students actually do with the feedback they receive? We argue that AI tutor evaluation should be extended with a behavioral dimension grounded in student interaction data, which complements pedagogical assessment. We propose an evaluation framework and apply it to 10,235 code submissions with corresponding AI tutor feedback from an introductory undergraduate programming course to measure whether students act on tutor feedback and whether those actions are applied correctly. Using this framework to compare two deployed AI tutors across different semesters in a large-scale introductory computer science course reveals substantial differences in student engagement patterns that are not captured by pedagogy-only evaluation. Moreover, these engagement-based behavioral signals are more strongly associated with student perception of helpful feedback than pedagogical quality alone, providing a more complete and actionable picture of AI tutor performance.
Authors:Griffin Pitts, Muntasir Hoq, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram
Abstract:
Adaptive programming practice often relies on fixed libraries of worked examples and practice problems, which require substantial authoring effort and may not correspond well to the logical errors and partial solutions students produce while writing code. As a result, students may receive learning content that does not directly address the concepts they are working to understand, while instructors must either invest additional effort in expanding content libraries or accept a coarse level of personalization. We present an approach for knowledge-component (KC) guided educational content generation using pattern-based KCs extracted from student code. Given a problem statement and student submissions, our pipeline extracts recurring structural KC patterns from students' code through AST-based analysis and uses them to condition a generative model. In this study, we apply this approach to worked example generation, and compare baseline and KC-conditioned outputs through expert evaluation. Results suggest that KC-conditioned generation improves topical focus and relevance to students' underlying logical errors, providing evidence that KC-based steering of generative models can support personalized learning at scale.
Authors:Hamed Rahimi, Clemence Grislain, Adrien Jacquet Cretides, Olivier Sigaud, Mohamed Chetouani
Abstract:
Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals including text, visual cues to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. The approach is inspired by forward-inverse modeling in cognitive science by decomposing intention understanding into goal candidate generation followed by structured inference through selection, effectively reducing hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, notably surpassing the baseline performance by 30% and matches human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.
Authors:Zhilin Liu, Ye Huang, Ting Xie, Ruizhi Zhang, Wen Li, Lixin Duan
Abstract:
Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g. command-line outputs) for multi-round debugging and struggle in graphical user interface (GUI) that involve visual information. This is mainly due to two limitations: 1) GUI programs are event-driven, yet existing methods cannot simulate user interactions to trigger GUI element logic 2) GUI programs possess visual attributes, making it difficult for text-based approaches to assess whether the rendered interface meets user needs. To systematically address these challenges, we first introduce InteractGUI Bench, a novel benchmark comprising 984 commonly used real-world desktop GUI application tasks designed for fine-grained evaluation of both interaction logic and visual structure. Furthermore, we propose VF-Coder, a vision-feedback-based multi-agent system for debugging GUI code. By perceiving visual information and directly interacting with program interfaces, VF-Coder can identify potential logic and layout issues in a human-like manner. On InteractGUI Bench, our VF-Coder approach increases the success rate of Gemini-3-Flash from 21.68% to 28.29% and raises the visual score from 0.4284 to 0.5584, indicating the effectiveness of visual feedback in GUI debugging.
Authors:Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang
Abstract:
Understanding how individuals navigate mental health challenges over time is critical yet methodologically challenging. Traditional approaches analyze community-level snapshots, failing to capture dynamic individual recovery trajectories. We introduce a novel framework applying Topological Data Analysis (TDA) specifically persistent homology to model users' longitudinal posting histories as trajectories in semantic embedding space. Our approach reveals topological signatures of trajectory patterns: loops indicate cycling back to similar states (stagnation), while flares suggest exploring new coping strategies (growth). We propose Semantic Recovery Velocity (SRV), a novel metric quantifying the rate users move away from initial distress-focused posts in embedding space. Analyzing 15,847 r/depression trajectories and validating against multiple proxies, we demonstrate topological features predict self-reported improvement with 78.3% accuracy, outperforming sentiment baselines. This work contributes: (1) a TDA methodology for HCI mental health research, (2) interpretable topological signatures, and (3) design implications for adaptive mental health platforms with ethical guardrails.
Authors:Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang
Abstract:
As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures interaction patterns like "validation spirals" where chatbots progressively reinforce hopelessness, or "empathy fatigue" where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically-grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.
Authors:Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang
Abstract:
Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity. We introduce "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing. Through analysis of 22,350 LLM outputs generated from 1,490 culturally marked texts (Indian, Singaporean,& Nigerian English) processed by five models under three prompt conditions, we quantify this phenomenon using two novel metrics: Identity Erasure Rate (IER) & Semantic Preservation Score (SPS). Across all prompts, we find an overall IER of 10.26%, with model-level variation from 3.5% to 20.5% (5.9x range). Crucially, we identify a Semantic Preservation Paradox: models maintain high semantic similarity (mean SPS = 0.748) while systematically erasing cultural markers. Pragmatic markers (politeness conventions) are 1.9x more vulnerable than lexical markers (71.5% vs. 37.1% erasure). Our experiments demonstrate that explicit cultural-preservation prompts reduce erasure by 29% without sacrificing semantic quality.
Authors:Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang
Abstract:
Strabismus affects 2-4% of the population, yet individuals recovering from corrective surgery lack accessible tools for monitoring eye alignment. Dichoptic therapies require active engagement & clinical supervision, limiting their adoption for passive self-awareness. We present GazeFlow, a browser-based self-monitoring system that uses a personalized temporal autoencoder to detect eye drift patterns from webcam-based gaze tracking & provides ambient audio feedback. Unlike alert-based systems, GazeFlow operates according to calm computing principles, morphing musical parameters in proportion to drift severity while remaining in peripheral awareness. We address the challenges of inter-individual variability & domain transfer (1000Hz research to 30Hz webcam) by introducing Binocular Temporal-Frequency Disentanglement (BTFD), Contrastive Biometric Pre-training (CBP), & Gaze-MAML. We validate our approach on the GazeBase dataset (N=50) achieving F1=0.84 for drift detection, & conduct a preliminary user study (N=6) with participants having intermittent strabismus. Participants reported increased awareness of their eye behaviour (M=5.8/7) & preference for ambient feedback over alerts (M=6.2/7). We discuss the system's potential for self-awareness applications & outline directions for clinical validation.
Authors:Ziyang Guo, Yifan Wu, Jason Hartline, Kenneth Holstein, Jessica Hullman
Abstract:
Multi-agent decision pipelines can outperform single agent workflows when complementarity holds, i.e., different agents bring unique information to the table to inform a final decision. We propose ComplLLM, a post-training framework based on decision theory that fine-tunes a decision-assistant LLM using complementary information as reward to output signals that complement existing agent decisions. We validate ComplLLM on synthetic and real-world tasks involving domain experts, demonstrating how the approach recovers known complementary information and produces plausible explanations of complementary signals to support downstream decision-makers.
Authors:Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang
Abstract:
Adaptive learning systems optimize content delivery based on performance metrics but ignore the dynamic attention fluctuations that characterize neurodivergent learners. We present AttentionGuard, a framework that detects engagement-attention states from privacy-preserving behavioral signals and adapts interface elements accordingly. Our approach models four attention states derived from ADHD phenomenology and implements five novel UI adaptation patterns including bi-directional scaffolding that responds to both understimulation and overstimulation. We validate our detection model on the OULAD dataset, achieving 87.3% classification accuracy, and demonstrate correlation with clinical ADHD profiles through cross-validation on the HYPERAKTIV dataset. A Wizard-of-Oz study with 11 adults showing ADHD characteristics found significantly reduced cognitive load in the adaptive condition (NASA-TLX: 47.2 vs 62.8, Cohen's d=1.21, p=0.008) and improved comprehension (78.4% vs 61.2%, p=0.009). Concordance analysis showed 84% agreement between wizard decisions and automated classifier predictions, supporting deployment feasibility. The system is presented as an interactive demo where observers can inspect detected attention states, observe real-time UI adaptations, and compare automated decisions with human-in-the-loop overrides. We contribute empirically validated UI patterns for attention-adaptive interfaces and evidence that behavioral attention detection can meaningfully support neurodivergent learning experiences.
Authors:Fabio Morreale, Joan Serrà, Yuki Mitsufuji
Abstract:
Explainable AI (XAI) is frequently positioned as a technical problem of revealing the inner workings of an AI model. This position is affected by unexamined onto-epistemological assumptions: meaning is treated as immanent to the model, the explainer is positioned outside the system, and a causal structure is presumed recoverable through computational techniques. In this paper, we draw on Barad's agential realism to develop an alternative onto-epistemology of XAI. We propose that interpretations are material-discursive performances that emerge from situated entanglements of the AI model with humans, context, and the interpretative apparatus. To develop this position, we read a comprehensive set of XAI methods through agential realism and reveal the assumptions and limitations that underpin several of these methods. We then articulate the framework's ethical dimension and propose design directions for XAI interfaces that support emergent interpretation, using a speculative text-to-music interface as a case study.
Authors:Yuang Fan, Lilin Xu, Millie Wu, Jingping Nie, Qingyu Chen, Yuzhe Yang, Zhuo Zhang, Xin Liu, Subigya Nepal, Xiaofan Jiang, Xuhai "Orson" Xu
Abstract:
Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1--10.1% and 9.5--44.1% for anxiety, and 3.2--9.6% and 27.4--57.6% for depression (all $p$s<0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.
Authors:Abdul Basit, Saim Rehman, Muhammad Shafique
Abstract:
Realizing on-device ML-based gesture detection under tight real-time performance, energy and memory constraints is challenging, especially when considering mobile devices with varying battery-power levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced complexity detection. To evaluate performance of our system in real-world car driving scenarios, we introduce a temporally-annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
Authors:Tanjal Shukla, K. J. Kevin Feng, Leijie Wang, Mohammad Rostami, Amy X. Zhang
Abstract:
Despite coding agents' advances in handling increasingly complex tasks, their continued tendency to introduce unintended edits, subtle bugs, and scope drift that slip past code review means developers must still decide how much autonomy to grant them. However, existing approaches for setting an agent's level of autonomy, such as static permission settings or instruction files, cannot account for how developers' preferences for agent autonomy can shift across tasks and over time. We conducted a formative survey with 21 software engineers who use coding agents and found that they experience frustration with calibrating autonomy and have evolving preferences for level of oversight. Building on these insights, we present Hedwig, a CLI coding agent that dynamically adjusts its autonomy level based on developer-agent interactions across sessions. Rather than operating on a global, fixed autonomy configuration, Hedwig learns an evolving set of behavioral guidelines from developer decisions and feedback, reducing friction on work for which the agent has earned trust, while tightening oversight when the agent operates outside familiar territory. Hedwig demonstrates the potential of a new paradigm where agents intelligently adapt their level of autonomy based on user trust through active, longitudinal collaboration.
Authors:Franco Ortiz, Runlong Ye, Michael Liut
Abstract:
Large Language Models (LLMs) have been widely applied to student-facing educational tools, this work explores their use in supporting instructors by presenting a practical adaptation of the Framework for Adaptive Content using Educational Technology (FACET) system to generate personalized instructional materials for an Introduction to Computer Programming (CS1) course. We conducted a mixed-methods study with 409 first-year computer science (CS) students, focusing on regular expressions (RegEx). Students were assessed on their knowledge and motivation, classified into one of four learner profiles, and assigned either LLM-personalized (treatment) or standard non-adaptive (control) exercises. Personalized materials varied in scaffolding, instructional explicitness, and tone based on learner profiles grounded in Bloom's Taxonomy and Self-Determination Theory. Quantitative analysis reveals that standard exercises resulted in task incompletion among low-knowledge learners, with approximately 25-30% incompletion, whereas personalized materials sustained near-universal completion (>99%) across all profiles. While high-performing students experienced ceiling effects, Low Knowledge/Low Motivation students achieved significantly higher correctness (+18.2%) with personalized support. Survey data indicate that students prioritize structural scaffolding (logical sequence, difficulty pacing) over motivational tone and perceive the adaptive tasks as equally challenging as standard exercises. These findings suggest that learner-profile-driven LLM personalization primarily serves as a retention scaffold, preventing task abandonment among at-risk students without diminishing the task's "desirable difficulty". The results demonstrate that instructor-facing LLM systems can effectively close engagement gaps in CS1 by tailoring instructional explicitness to student needs.
Authors:Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei
Abstract:
The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
Authors:Xin Qian, Dazhen Deng, Zhaoping He, Xingbo Wang, Yuchen He, Yingcai Wu
Abstract:
Knowledge-intensive text usually contains fruitful entities and complex relationships, such as academic articles and scientific exposition. Reading and comprehending such texts often demands considerable time and mental effort to track the relationships between entities. To reduce the burden, we present GraphTide, a visualization technique that progressively constructs nested entity-relationship graphs with animation to support the understanding of complex text. Our method features an on-demand entity-relationship decomposition pipeline that constructs nested graphs to represent intra- and inter-sentence relationships. Moreover, we propose a structure-aware force-directed layout optimization algorithm to enhance structural clarity. Sentences and their associated entities are incrementally revealed through animated transitions, helping users maintain context as the narrative unfolds. A user study shows that GraphTide significantly improves users' comprehension of knowledge-intensive texts compared to traditional graph-based techniques and static nested graph representations.
Authors:Zhenning Chen, Hanbei Zhan, Yanwei Huang, Xin Wu, Dazhen Deng, Di Weng, Yingcai Wu
Abstract:
Large Language Models (LLMs) demonstrate exceptional capabilities in factual question answering, yet they sometimes provide incorrect responses. To address this issue, knowledge editing techniques have emerged as effective methods for correcting factual information in LLMs. However, typical knowledge editing workflows struggle with identifying the optimal set of model layers for editing and rely on summary indicators that provide insufficient guidance. This lack of transparency hinders effective comparison and identification of optimal editing strategies. In this paper, we present KEditVis, a novel visual analytics system designed to assist users in gaining a deeper understanding of knowledge editing through interactive visualizations, improving editing outcomes, and discovering valuable insights for the future development of knowledge editing algorithms. With KEditVis, users can select appropriate layers as the editing target, explore the reasons behind ineffective edits, and perform more targeted and effective edits. Our evaluation, including usage scenarios, expert interviews, and a user study, validates the effectiveness and usability of the system.
Authors:Andy Wang, Xu Yan, Brandon McMahan, Michael Zhou, Yuyang Yuan, Johannes Y. Lee, Ali Shreif, Matthew Li, Zhenghao Peng, Bolei Zhou, Yuchen Cui, Jonathan C. Kao
Abstract:
Shared autonomy combines human user and AI copilot actions to control complex systems such as robotic arms. When a task is challenging, requires high dimensional control, or is subject to corruption, shared autonomy can significantly increase task performance by using a trained copilot to effectively correct user actions in a manner consistent with the user's goals. To significantly improve the performance of shared autonomy, we introduce Diffusion Sequence Copilots (DiSCo): a method of shared autonomy with diffusion policy that plans action sequences consistent with past user actions. DiSCo seeds and inpaints the diffusion process with user-provided actions with hyperparameters to balance conformity to expert actions, alignment with user intent, and perceived responsiveness. We demonstrate that DiSCo substantially improves task performance in simulated driving and robotic arm tasks. Project website: https://sites.google.com/view/disco-shared-autonomy/
Authors:Runlong Ye, Naaz Sibia, Angela Zavaleta Bernuy, Tingting Zhu, Carolina Nobre, Viktoria Pammer-Schindler, Michael Liut
Abstract:
Systematic Literature Reviews (SLRs) are fundamental to scientific progress, yet the process is hindered by a fragmented tool ecosystem that imposes a high cognitive load. This friction suppresses the iterative, exploratory nature of scholarly work. To investigate these challenges, we conducted an exploratory design study with 20 experienced researchers. This study identified key friction points: 1) the high cognitive load of managing iterative query refinement across multiple databases, 2) the overwhelming scale and pace of publication of modern literature, and 3) the tension between automation and scholarly agency. Informed by these findings, we developed ARC, a design probe that operationalizes solutions for multi-database integration, transparent iterative search, and verifiable AI-assisted screening. A comparative user study with 8 researchers suggests that an integrated environment facilitates a transition in scholarly work, moving researchers from managing administrative overhead to engaging in strategic exploration. By utilizing external representations to scaffold strategic exploration and transparent AI reasoning, our system supports verifiable judgment, aiming to augment expert contributions from initial creation through long-term maintenance of knowledge synthesis.
Authors:Runlong Ye, Oliver Huang, Jessica He, Michael Liut
Abstract:
Generative AI blurs the lines of authorship in computing education, creating uncertainty around how students should attribute AI assistance. To examine these emerging norms, we conducted a factorial vignette study with 94 computer science students across 102 unique scenarios, systematically manipulating assessment type, AI autonomy, student activity, prior knowledge, and human refinement effort. This paper details how these factors influence students' perceptions of ownership and disclosure preferences. Our findings indicate that attribution judgments are primarily driven by different levels of AI assistance and human refinement. We also found that students' perception of authorship significantly predicts their policy expectations. We conclude by proposing a shift from statement-style policies to process-oriented attribution, transforming disclosure into a pedagogical mechanism for fostering critical engagement with AI-generated content.
Authors:Runlong Ye, Oliver Huang, Patrick Yung Kang Lee, Michael Liut, Carolina Nobre, Ha-Kyung Kong
Abstract:
Reflexive Thematic Analysis (RTA) is a critical method for generating deep interpretive insights. Yet its core tenets, including researcher reflexivity, tangible analytical evolution, and productive disagreement, are often poorly supported by software tools that prioritize speed and consensus over interpretive depth. To address this gap, we introduce Reflexis, a collaborative workspace that centers these practices. It supports reflexivity by integrating in-situ reflection prompts, makes code evolution transparent and tangible, and scaffolds collaborative interpretation by turning differences into productive, positionality-aware dialogue. Results from our paired-analyst study (N=12) indicate that Reflexis encouraged participants toward more granular reflection and reframed disagreements as productive conversations. The evaluation also surfaced key design tensions, including a desire for higher-level, networked memos and more user control over the timing of proactive alerts. Reflexis contributes a design framework for tools that prioritize rigor and transparency to support deep, collaborative interpretation in an age of automation.
Authors:Zijian Zhang, Fangshi Du, Xingjian Liu, Pan Chen, Oliver Huang, Runlong Ye, Michael Liut, Alán Aspuru-Guzik
Abstract:
Long documents pose many challenges to current intelligent writing systems. These include maintaining consistency across sections, sustaining efficient planning and writing as documents become more complex, and effectively providing and integrating AI assistance to the user. Existing AI co-writing tools offer either inline suggestions or limited structured planning, but rarely support the entire writing process that begins with high-level ideas and ends with polished prose, in which many layers of planning and outlining are needed. Here, we introduce TreeWriter, a hierarchical writing system that represents documents as trees and integrates contextual AI support. TreeWriter allows authors to create, save, and refine document outlines at multiple levels, facilitating drafting, understanding, and iterative editing of long documents. A built-in AI agent can dynamically load relevant content, navigate the document hierarchy, and provide context-aware editing suggestions. A within-subject study (N=12) comparing TreeWriter with Google Docs + Gemini on long-document editing and creative writing tasks shows that TreeWriter improves idea exploration/development, AI helpfulness, and perceived authorial control. A two-month field deployment (N=8) further demonstrated that hierarchical organization supports collaborative writing. Our findings highlight the potential of hierarchical, tree-structured editors with integrated AI support and provide design guidelines for future AI-assisted writing tools that balance automation with user agency.
Authors:Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali
Abstract:
Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.
Authors:Shiyu Li, Julian Kreimeier, Hannah Schieber, Dirk Müller, Bernhard Kainz, Rüdiger von Eisenhart-Rothe, Daniel Roth
Abstract:
The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visualization on a head-mounted display without the requirement for additional markers. Pose estimation and consecutive camera calibration are achieved through known objects. The 6D pose estimation network is trained purely on synthetic data, aiming for better generalizability and real-world applicability. The AR guidance displays tooltip localization cues and step-wise assembly animations. Via gaze-based selection and a foot pedal, users can switch between assembly steps in intraoperative use. In a technical evaluation, our approach outperforms state-of-art 6D pose estimation. A user study with 29 scrub nurses was conducted in a surgical simulation of knee arthroplasty, comparing the system against a paper manual. AR guidance significantly reduced the perceived workload compared. Objectively, AR guidance reduced task completion time by 21.3\% (4.76 minutes). Specifically, scrub nurses less experienced with the instrument set benefited when using the system. Error frequencies were comparable between conditions. Qualitative feedback highlighted improved process clarity, reduced information overload, and perceived independence. To summarize, our marker-free multi-camera AR guidance approach for surgical instruments can, subjectively and objectively, improve intraoperative instrumentation performance, particularly for untrained scrub nurses.
Authors:Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha
Abstract:
Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
Authors:Olivia Pal, Agam Goyal, Eshwar Chandrasekharan, Koustuv Saha
Abstract:
Algorithmic feeds have become primary environments for encountering information online, yet while they shape what people see, less is known about how sustained feed exposure shapes how people write. Drawing on Cultivation Theory, we examine whether algorithmic feeds function as online environments that leave measurable traces in users' language. We leverage a large-scale longitudinal dataset of 235M posts by 4M users on Bluesky, and conduct a quasi-experimental study matching an initial pool of 368,513 users exposed to one of three feeds -- News, Science, and Blacksky -- with a pool of 2,001,915 active control users who did not engage with any of these feeds. We examine linguistic evolution across three dimensions: lexico-semantics, psycholinguistics, and topics. We find that users exposed to these feeds show significantly greater stylistic accommodation, semantic alignment, and register formalization than matched controls. These effects vary markedly by feed identity -- Blacksky produces the deepest psycholinguistic restructuring, with significant shifts in cognitive processing, affective expression, and pronoun use, while News and Science effects are largely confined to register and topical focus. Regression models reveal that reposting is the most consistent predictor of linguistic convergence across all feeds, whereas posting and bookmarking show feed-dependent effects, with effects differing more than fourfold across feeds. Our work extends Cultivation Theory beyond belief formation to linguistic behavior, demonstrating that feeds function as persistent linguistic environments that gradually shape what and how users write online. Our work has implications for studying algorithmic influence, online identity formation, and the design and governance of feed-based platforms that mediate online interactions.
Authors:Jenny Ma, Sitong Wang, Joshua H. Kung, Lydia B. Chilton
Abstract:
Rules files (e.g., AGENTS.md, CLAUDE.md) are the primary mechanism for human-agent alignment when developers vibe code. However, they remain passive: it is not immediately apparent when rules are being used or followed, or how to improve them. To transform rules from passive text into active controls, we introduce ZORO, an interactive interface that integrates directly with a coding agent and anchors rules to every step of the coding process. After an agent generates an initial plan, ZORO enriches the plan with rules, enforces the rules during implementation by requiring the agent prove that each rule was followed, and allows users to provide in-situ feedback when they are unsatisfied with a rule application to evolve the ruleset. A technical evaluation shows that coding agents follow rules more with ZORO than without. A user study demonstrates a change in people's behavior and cognitive strategies when rules are at the forefront of vibe coding. We discuss how making rules active in agentic systems unlocks broader opportunities for human-agent alignment in coding settings and beyond.
Authors:Yaniv Leviathan, Dani Valevski, Matan Kalman, Danny Lumen, Eyal Segalis, Eyal Molad, Shlomi Pasternak, Vishnu Natchu, Valerie Nygaard, Srinivasan, Venkatachary, James Manyika, Yossi Matias
Abstract:
AI models excel at creating content, but typically render it with static, predefined interfaces. Specifically, the output of LLMs is often a markdown "wall of text". Generative UI is a long standing promise, where the model generates not just the content, but the interface itself. Until now, Generative UI was not possible in a robust fashion. We demonstrate that when properly prompted and equipped with the right set of tools, a modern LLM can robustly produce high quality custom UIs for virtually any prompt. When ignoring generation speed, results generated by our implementation are overwhelmingly preferred by humans over the standard LLM markdown output. In fact, while the results generated by our implementation are worse than those crafted by human experts, they are at least comparable in 50% of cases. We show that this ability for robust Generative UI is emergent, with substantial improvements from previous models. We also create and release PAGEN, a novel dataset of expert-crafted results to aid in evaluating Generative UI implementations, as well as the results of our system for future comparisons. Interactive examples can be seen at https://generativeui.github.io
Authors:Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu, Yichen Jiang, Yihan Wu, Ruihua Song
Abstract:
We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.
Authors:Carmen Scheidemann, Andrei Cramariuc, Changan Chen, Jia-Ruei Chiu, Marco Hutter
Abstract:
Background: Assistance robots have the potential to increase the independence of people who need daily care due to limited mobility or being wheelchair-bound. Current solutions of attaching robotic arms to motorized wheelchairs offer limited additional mobility at the cost of increased size and reduced wheelchair maneuverability. Methods: We present an on-demand quadrupedal assistance robot system controlled via a shared autonomy approach, which combines semi-autonomous task execution with human teleoperation. Due to the mobile nature of the system it can assist the operator whenever needed and perform autonomous tasks independently, without otherwise restricting their mobility. We automate pick-and-place tasks, as well as robot movement through the environment with semantic, collision-aware navigation. For teleoperation, we present a mouth-level joystick interface that enables an operator with reduced mobility to control the robot's end effector for precision manipulation. Results: We showcase our system in the \textit{Cybathlon 2024 Assistance Robot Race}, and validate it in an at-home experimental setup, where we measure task completion times and user satisfaction. We find our system capable of assisting in a broad variety of tasks, including those that require dexterous manipulation. The user study confirms the intuition that increased robot autonomy alleviates the operator's mental load. Conclusions: We present a flexible system that has the potential to help people in wheelchairs maintain independence in everyday life by enabling them to solve mobile manipulation problems without external support. We achieve results comparable to previous state-of-the-art on subjective metrics while allowing for more autonomy of the operator and greater agility for manipulation.
Authors:Xinyu Li, Linxuan Zhao, Roberto Martinez-Maldonado, Dragan Gasevic, Lixiang Yan
Abstract:
This study examined whether a single ceiling-mounted camera could be used to capture fine-grained learning behaviours in co-located practical learning. In undergraduate nursing simulations, teachers first identified seven observable behaviour categories, which were then used to train a YOLO-based detector. Video data were collected from 52 sessions, and analyses focused on Scenario A because it produced greater behavioural variation than Scenario B. Annotation reliability was high (F1=0.933). On the held-out test set, the model achieved a precision of 0.789, a recall of 0.784, and an mAP@0.5 of 0.827. When only behaviour frequencies were compared, no robust differences were found between high- and low-performing groups. However, when behaviour labels were analysed together with spatial context, clear differences emerged in both task and collaboration performance. Higher-performing teams showed more patient interaction in the primary work area, whereas lower-performing teams showed more phone-related activity and more activity in secondary areas. These findings suggest that behavioural data are more informative when interpreted together with where they occur. Overall, the study shows that a single-camera computer vision approach can support the analysis of teamwork and task engagement in face-to-face practical learning without relying on wearable sensors.
Authors:Emely Rosbach, Jonas Ammeling, Jonathan Ganz, Christof Albert Bertram, Thomas Conrad, Andreas Riener, Marc Aubreville
Abstract:
Artificial intelligence (AI)-driven decision support systems can improve diagnostic accuracy and efficiency in computational pathology. However, collaboration between human experts and AI may introduce cognitive biases such as automation and anchoring bias, where users adopt system predictions blindly or are disproportionately influenced by AI advice, even when inaccurate. These effects may be amplified under time pressure, common in routine pathology, or shaped by individual user characteristics. We conducted an online experiment in which pathology experts (n = 28) estimated tumor cell percentages: once independently and once with AI support. A subset of estimations in each condition was performed under time strain. Overall, AI assistance improved diagnostic performance but introduced a 7% automation bias rate, defined as accepted negative consultations where previously correct independent judgments were overturned by incorrect AI advice. While time pressure did not increase the frequency of automation bias, it appeared to intensify its severity, reflected in stronger performance declines associated with increased AI reliance under cognitive load. A linear mixed-effects model (LMM) simulating weighted averaging showed a statistically significant positive coefficient for AI advice, indicating moderate anchoring on system output. This effect increased under time pressure, suggesting anchoring bias becomes more pronounced when cognitive resources are limited. A second LMM assessing automation reliance, a proxy for automation and anchoring bias, showed that professional experience and self-efficacy were associated with lower dependence on AI, whereas higher confidence during AI-assisted decisions was tied to increased AI reliance. These findings highlight the dual nature of AI integration in clinical workflows: improving performance while introducing risks of bias-driven diagnostic errors.
Authors:Svetlana Churina, Kokil Jaidka, Anab Maulana Barik, Harshit Aneja, Cai Yang, Wynne Hsu, Mong Li Lee
Abstract:
The web's information ecosystem demands fact-checking systems that are both scalable and epistemically trustworthy. Automated approaches offer efficiency but often lack transparency, while human verification remains slow and inconsistent. We introduce Althea, a retrieval-augmented system that integrates question generation, evidence retrieval, and structured reasoning to support user-driven evaluation of online claims. On the AVeriTeC benchmark, Althea achieves a Macro-F1 of 0.44, outperforming standard verification pipelines and improving discrimination between supported and refuted claims. We further evaluate Althea through a controlled user study and a longitudinal survey experiment (N = 642), comparing three interaction modes that vary in the degree of scaffolding: an Exploratory mode with guided reasoning, a Summary mode providing synthesized verdicts, and a Self-search mode that offers procedural guidance without algorithmic intervention. Results show that guided interaction produces the strongest immediate gains in accuracy and confidence, while self-directed search yields the most persistent improvements over time. This pattern suggests that performance gains are not driven solely by effort or exposure, but by how cognitive work is structured and internalized.
Authors:Sara Solarova, Matúš Mesarčík, Branislav Pecher, Ivan Srba
Abstract:
Algorithms of online platforms are required under the Digital Services Act (DSA) to comply with specific obligations concerning algorithmic transparency, user protection and privacy. To verify compliance with these requirements, DSA mandates platforms to undergo independent audits. Little is known about current auditing practices and their effectiveness in ensuring such compliance. To this end, we bridge regulatory and technical perspectives by critically examining selected audit reports across three critical algorithmic-related provisions: restrictions on profiling minors, transparency in recommender systems, and limitations on targeted advertising using sensitive data. Our analysis shows significant inconsistencies in methodologies and lack of technical depth when evaluating AI-powered systems. To enhance the depth, scale, and independence of compliance assessments, we propose to employ algorithmic auditing -- a process of behavioural assessment of AI algorithms by means of simulating user behaviour, observing algorithm responses and analysing them for audited phenomena.
Authors:Agam Goyal, Xianyang Zhan, Charlotte Lambert, Koustuv Saha, Eshwar Chandrasekharan
Abstract:
Detecting what content communities value is a foundational challenge for social computing systems -- from feed curation and content ranking to moderation tools and personalized recommendation systems. Yet existing approaches remain fragmented across methodological paradigms, and it remains unclear which methods best capture community-specific notions of value. We introduce VASTU (Value-Aligned Social Toolkit for Online Content Curation), a benchmark and evaluation framework for systematically comparing approaches to detecting community-valued content. VASTU includes a dataset of 75,000 comments from 15 diverse Reddit communities, annotated with community approval labels and rich linguistic features. Using VASTU, we evaluate feature-based models, transformers, prompted and fine-tuned language models under global versus community-specific training regimes. We find that community-specific models consistently outperform global approaches, with fine-tuned transformers achieving the strongest performance (0.72 AUROC). Notably, fine-tuned SLMs (0.65 AUROC) substantially outperform prompted LLMs (0.60 AUROC) despite being 100 times smaller. Counterintuitively, chain-of-thought prompting provides no benefit, and reasoning models perform the worst (0.53 AUROC), suggesting this task requires learning community norms rather than test-time reasoning. By releasing VASTU, we provide a standardized benchmark to advance research on value-aligned sociotechnical systems.
Authors:Songming Jia, Yan Lu, Bin Liu, Xiang Zhang, Peng Zhao, Xinmeng Tang, Yelin Wei, Jinyang Huang, Huan Yan, Zhi Liu
Abstract:
WiFi-based 3D human pose estimation offers a low-cost and privacy-preserving alternative to vision-based systems for smart interaction. However, existing approaches rely on visual 3D poses as supervision and directly regress CSI to a camera-based coordinate system. We find that this practice leads to coordinate overfitting: models memorize deployment-specific WiFi transceiver layouts rather than only learning activity-relevant representations, resulting in severe generalization failures. To address this challenge, we present PerceptAlign, the first geometry-conditioned framework for WiFi-based cross-layout pose estimation. PerceptAlign introduces a lightweight coordinate unification procedure that aligns WiFi and vision measurements in a shared 3D space using only two checkerboards and a few photos. Within this unified space, it encodes calibrated transceiver positions into high-dimensional embeddings and fuses them with CSI features, making the model explicitly aware of device geometry as a conditional variable. This design forces the network to disentangle human motion from deployment layouts, enabling robust and, for the first time, layout-invariant WiFi pose estimation. To support systematic evaluation, we construct the largest cross-domain 3D WiFi pose estimation dataset to date, comprising 21 subjects, 5 scenes, 18 actions, and 7 device layouts. Experiments show that PerceptAlign reduces in-domain error by 12.3% and cross-domain error by more than 60% compared to state-of-the-art baselines. These results establish geometry-conditioned learning as a viable path toward scalable and practical WiFi sensing.
Authors:Xuan Luo, Lewei Yao, Libo Zhao, Lanqing Hong, Kai Chen, Dehua Tao, Daxin Tan, Ruifeng Xu, Jing Li
Abstract:
While the automatic evaluation of omni-modal large models (OLMs) is essential, assessing empathy remains a significant challenge due to its inherent affectivity. To investigate this challenge, we introduce AEQ-Bench (Audio Empathy Quotient Benchmark), a novel benchmark to systematically assess two core empathetic capabilities of OLMs: (i) generating empathetic responses by comprehending affective cues from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Compared to existing benchmarks, AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that (1) OLMs trained with audio output capabilities generally outperformed models with text-only outputs, and (2) while OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.
Authors:Ruoxi Jia, Luis Oala, Wenjie Xiong, Suqin Ge, Jiachen T. Wang, Feiyang Kang, Dawn Song
Abstract:
We argue that the machine learning value chain is structurally unsustainable due to an economic data processing inequality: each state in the data cycle from inputs to model weights to synthetic outputs refines technical signal but strips economic equity from data generators. We show, by analyzing seventy-three public data deals, that the majority of value accrues to aggregators, with documented creator royalties rounding to zero and widespread opacity of deal terms. This is not just an economic welfare concern: as data and its derivatives become economic assets, the feedback loop that sustains current learning algorithms is at risk. We identify three structural faults - missing provenance, asymmetric bargaining power, and non-dynamic pricing - as the operational machinery of this inequality. In our analysis, we trace these problems along the machine learning value chain and propose an Equitable Data-Value Exchange (EDVEX) Framework to enable a minimal market that benefits all participants. Finally, we outline research directions where our community can make concrete contributions to data deals and contextualize our position with related and orthogonal viewpoints.
Authors:Sitong Wang, Anh Truong, Lydia B. Chilton, Dingzeyu Li
Abstract:
Video is a powerful medium for communication and storytelling, yet reauthoring existing footage remains challenging. Even simple edits often demand expertise, time, and careful planning, constraining how creators envision and shape their narratives. Recent advances in generative AI suggest a new paradigm: what if editing a video were as straightforward as rewriting text? To investigate this, we present a tech probe and a study on text-driven video reauthoring. Our approach involves two technical contributions: (1) a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and (2) an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm. Our work contributes empirical insights into the opportunities and challenges of text-driven video reauthoring, offering design implications for future co-creative video tools.
Authors:Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi, Yu Huang, Collin McMillan, Tao Dong, Toby Jia-Jun Li
Abstract:
AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution. We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50\% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49\% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.
Authors:Yuqing Xiao, John Grundy, Anuradha Madugalla, Elizabeth Manias
Abstract:
We sought to explore and compare the perspectives of three key stakeholder groups: older adults, caregivers (formal health providers and informal caregivers), and digital health software developers on key functional and non-functional requirements. We conducted a survey, designed based on the findings from an existing systematic review, to gather and analyse data related to the three stakeholder groups' (dis)satisfaction with current aged care digital health software and their views on key future aged care software requirements. A mixed-methods survey approach integrated quantitative questionnaire data and qualitative open-ended responses from a total sample of 249, comprised of older adults (103), formal and informal caregivers (41), and software developers (105). Data analysis utilised a mixed methods approach, employing inferential statistics to compare group satisfaction levels and thematic analysis for qualitative open-ended responses. Our analysis reveals a significant "Requirements Gap". Software developers tend to prioritise advanced features and functional requirements, significantly overestimating user satisfaction with core NFRs such as ease of use and responsiveness. Conversely, developers were more critical of existing functional features compared to older adults and caregivers, who prioritised simplicity and reliability over feature density. By combining quantitative and qualitative analysis, we identified where stakeholder priorities align and where they diverge across functional and non-functional requirements in both the current designs they used and the future designs they desire. Our findings present a stakeholder gap analysis that can guide future co-design processes, near-term product decisions, and privacy-by-design recommendations in aged care digital health.
Authors:Haotang Li, Yili Ren, Zhenyu Qi, Sen He, Kebin Peng, Sheng Tan, Bo Liu, Jiyue Zhao, Zi Wang
Abstract:
Body fat percentage and its spatial distribution are clinically important health indicators. However, existing measurement methods often impose a tradeoff between accuracy and accessibility. Clinical-grade techniques, such as Dual-Energy X-ray Absorptiometry (DEXA) and hydrostatic weighing, provide accurate measurements but require specialized equipment and trained operators, making them difficult to access and unsuitable for everyday use. In contrast, consumer-level methods, such as Bioelectrical Impedance Analysis (BIA) smart scales and skinfold calipers, are more accessible but typically provide only coarse-grained estimates, are prone to user error, or require intrusive physical contact. In this work, we present UWB-Fat, the first system that leverages commodity ultra-wideband (UWB) radar to enable non-intrusive, accessible, and accurate caliper-equivalent skinfold thickness estimation, serving as a convenient replacement for the skinfold caliper. UWB-Fat collects UWB signal at specified body sites non-intrusively without operator assistance. It extracts body-composition-related features from UWB signals by exploiting dielectric contrasts among skin, fat, and muscle tissues. Then, it uses a physics-inspired model to estimate site-specific skinfold thickness. We evaluate UWB-Fat on 15 participants, achieving a root mean square error of 0.63~mm for pooled-site subcutaneous fat thickness. These results highlight the potential of UWB-Fat to support low-cost, self-administered, and everyday body fat monitoring.
Authors:Pietro Bonazzi, Youssef Ahmed, Daniel Eckert, Andrea Ronco, Junjie Zeng, Dengxin Dai, Michele Magno
Abstract:
Despite widespread adoption of smartwatches worldwide, open-benchmarks for wrist-based gesture recognition remain surprisingly limited. In this work, we introduce the first open-access multi-modal benchmark, OpenWatch, for wrist-based gesture recognition using synchronized inertial and physiological sensing on a commercial smartwatch. It contains over 10 hours of Inertial Measurement Unit (IMU) and Photoplethysmography (PPG) data across 50 participants and a vocabulary of 59 labelled gesture sequences. Furthermore, we present a subject-independent evaluation protocol including traditional and deep learning methods for time-series classification. On top of this, we develop two novel methodologies for hand-gesture recognition: (i) MixToken, a task-specific mixture-of-experts that fuses per-channel IMU filterbank features with cross-channel statistical tokens through learned logit mixing, and (ii) NormWear-Lora, a low-rank adaptation module for smartwatch foundation models. Our benchmarking results reveal that PPG signals carries a substantial predictive benefit (+12.5% F1-score) for foundational smartwatch models. In addition, we show that task-specific architectures (i.e. MixToken) substantially outperforms finetuned smartwatch foundation models in terms of accuracy (F1-score=90% vs 66%) and memory efficiency (223k vs 136M parameters). Finally, we also provide clear empirical guidance on the trade-offs between specialized architecture design, modality fusion, data augmentations, and foundation-model adaptation for resource-constrained wearable sensing.
Authors:Simret Araya Gebreegziabher, Allison E Sproul, Yinuo Yang, Chaoran Chen, Diego Gómez-Zará, Toby Jia-Jun Li
Abstract:
Current human-AI alignment and evaluation methods for large language models (LLMs) often rely on preference signals collected immediately after an interaction. This practice implicitly treats preference as static, even though many LLM-mediated decisions unfold over time and may be re-evaluated differently after real-world consequences and observed outcomes. Therefore, we argue for a methodological shift from single-moment preference elicitation to longitudinal, context-situated alignment measurement. We present a methodological framework for collecting temporally grounded alignment signals by combining (1) in-situ preference capture, (2) context-triggered follow-up preference reflection, and (3) privacy-preserving behavioral traces that help interpret preference change. As an instantiation of this methodology, we introduce BITE, a browser-based system that detects consequential LLM interactions, prompts reflection across later decision points, and supports progressive, user-controlled consent for sharing behavioral data. Through a two week longitudinal deployment study with 8 participants, our approach surfaced differences between immediate and later user preferences in accuracy, relevance and other dimensions of the LLM output. Our findings highlight the limitations of single-moment preference datasets and underscore the importance of longitudinal methods for alignment evaluation in everyday use.
Authors:Wei Liu, Eric Krokos, Kirsten Whitley, Rebecca Faust, Chris North
Abstract:
Low-dimensional projections of text embeddings support visual analysis of document collections, but their spatial organization may not reflect the relationships an analyst intends to examine. Existing semantic interaction approaches encode semantic intent indirectly through geometric constraints or model updates, limiting interpretability and flexibility. We introduce LLM-augmented semantic steering, which enables analysts to express semantic intent by grouping a small set of example documents within the projection. A large language model externalizes this intent as natural-language representations and selectively extends it to related documents; the resulting semantic information is then incorporated into document representations via text augmentation or embedding-level blending, without retraining the underlying models. A case study illustrates how the same corpus can be reorganized from different semantic perspectives, while simulation-based evaluation shows that semantic steering improves global and local alignment with target semantic structures using only minimal interaction. Embedding-level blending further enables continuous and controllable steering of projection layouts. These results position projection spaces as intent-dependent semantic workspaces that can be reshaped through explicit, interpretable, language-mediated interaction.
Authors:Qurat Ul Ain, Mohamed Amine Chatti, William Kana Tsoplefack, Rawaa Alatrash, Shoeb Joarder
Abstract:
Educational recommender systems (ERSs) are becoming increasingly important in enhancing educational outcomes and personalizing learning experiences by providing recommendations of personalized resources and activities to learners, tailored to their individual learning needs. While user control is widely assumed to improve user experience, the effects of different levels of control in ERSs remain underexplored. To address this gap, we designed and evaluated an interactive ERS within the MOOC platform CourseMapper, where learners could interact with the input (i.e., user profile), process (i.e., recommendation algorithm), and output (i.e., recommendations) of the system. We conducted a between-subjects user study (N=184) to examine how varying levels of user control in an ERS influenced users' perceptions of the recommendation goals of perceived control, transparency, trust, satisfaction, and perceived quality. Our results show that enabling users to build and refine their profile is sufficient to promote positive perceptions of the ERS, while additional control options mainly reinforce these impressions. Moreover, perceived control is the only goal significantly affected by providing different levels of user control in the ERS, with input control exerting the strongest influence. Furthermore, different levels of control affect transparency, trust, satisfaction, and perceived quality in distinct yet interconnected ways. Overall, the findings provide empirical evidence that user control positively shapes transparency, trust, satisfaction, and perceived quality, though to varying extents.
Authors:Chenhao Liu, Siyang Li, Luofei Tan, Dongrui Wu
Abstract:
Real online brain--computer interfaces operate on continuous electroencephalography (EEG) streams, where users are usually at rest and enter motor-imagery task states only intermittently. EEG windows may also arise from OOD MI activity outside the predefined control set. Conventional closed-set motor-imagery classifiers tend to assign such inputs to ID classes, which can cause erroneous control. To address this issue, this paper proposes a two-stage EEG detection framework for asynchronous motor-imagery brain--computer interfaces. A sliding-window mechanism continuously monitors EEG signals. The first stage uses an EEGNet-based rest/task gate to determine whether the current window should enter the control-decision process. The second stage performs ID MI classification and out-of-distribution detection only for task-state samples. To improve OOD rejection, we further propose TempDens, which combines classification-output energy, deep-feature density, and temporal-consistency scores to characterize distributional deviation from output, feature, and temporal-dynamic perspectives. Experimental results show that the proposed method effectively supports task-state detection and OOD MI recognition in continuous EEG streams, outperforming multiple conventional OOD baselines. This study reframes online motor-imagery control as a hierarchical decision problem involving continuous monitoring, state discrimination, ID classification, and OOD rejection.
Authors:Yurui Xiang, Xingyi Mao, Rui Sheng, Zixin Chen, Zelin Zang, Yuyang Wu, Haipeng Zeng, Huamin Qu, Yushi Sun, Yanna Lin
Abstract:
Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
Authors:Xuxin Tang, Ibrahim Tahmid, Eric Krokos, Kirsten Whitley, Xuan Wang, Chris North
Abstract:
Interactive spatial layouts empower users to synthesize information and organize findings for sensemaking. While Large Language Models (LLMs) can automate narrative generation from spatial layouts, current collage-based and re-generation methods struggle to support the incremental spatial refinements inherent to the sensemaking process. We identify three critical gaps in existing spatial-textual generation: interaction-revision misalignment, human-LLM intent misalignment, and lack of granular customization. To address these, we introduce Semantic Prompting, a framework for spatial refinement that perceives semantic interactions, reasons about refinement intent, and performs targeted positional revisions. We implemented S-PRISM to realize this framework. The empirical evaluation demonstrated that S-PRISM effectively enhanced the precision of interaction-revision refinement. A user study ($N=14$) highlighted how participants leveraged S-PRISM for incremental formalization through interactive steering. Results showed that users valued its efficient, adaptable, and trustworthy support, which effectively strengthens human-LLM intent alignment.
Authors:Alex Liu, Min Sun, Lief Esbenshade, Victor Tian, Zachary Zhang, Kevin He
Abstract:
GenAI has rapidly entered instructional and learning settings as a teaching assistant or AI tutor. However, less is known about how pedagogical intent connects to the learning generated within these systems, especially when student-facing AI dialogues are fine-tuned through teacher orchestration in live classrooms. This study examines a classroom deployment of a "Classroom Teaching Aide" (TASD) system, which enables teachers to author both a teacher-to-AI setup prompt (instructional scaffold) and a student-facing conversation starter to launch AI-mediated classroom discussions. We analyze a multi-subject pilot conducted in Spring 2025, involving 20 participating teachers (16 of whom implemented the system), across 39 classrooms and 77 TASD settings, yielding 1,479 student-AI conversations with 878 unique students. Using platform logs, LLM coding with human validation, and post-study teacher interviews (N=10), we characterize teacher authoring choices and link them to enacted student-AI interaction outcomes. In deployment, student-AI conversations were largely aligned with instructional intent: 71% were fully on-track, and fewer than 1% were substantially off-track. However, a persistent design-enactment gap emerged for cognitive demand: 38% of conversations under-reached the teacher-targeted DOK level, approaching 50% when targeting DOK 3. The study also shows that explicit finish lines in the prompt reduced the DOK gap by 0.22 levels (p < .001), and "no direct answers" guardrails reduced AI final-answer rates by 8.5 percentage points. These findings position teacher-authored prompt layers as critical orchestration levers that translate pedagogical intent into structured student-AI dialogue, underscoring both their promise for scalable classroom integration and the need for additional supports to reliably sustain higher-order reasoning during enactment.
Authors:Xiaowen Sun, Cornelius Weber, Matthias Kerzel, Josua Spisak, Stefan Wermter
Abstract:
Uncertainty, vagueness, and ambiguity are closely related and often confused concepts in human-robot interaction (HRI). In earlier studies, these concepts have been defined in contradictory ways and described using inconsistent terminology. This conceptual confusion and lack of terminological consistency undermine empirical comparability, thereby slowing the accumulation of theory. Consequently, consistent concepts that clarify these challenges, including their definitions, distinctions, and interrelationships, are needed in HRI. To address this lack of clarity, this paper proposes a consistent conceptual foundation for the challenges of uncertainty, vagueness, and ambiguity in HRI. First, we examine the meanings of these three terms in dictionaries. We then analyze the nature of their distinctions and interrelationships within the context of HRI. We further illustrate these characteristics through examples. Finally, we demonstrate how this consistent conceptual foundation facilitates the design of novel methods and the evaluation of existing methodologies for these phenomena.
Authors:Yusi Sun, Ying Jiang, Jiayin Lu, Yin yang, Yong-Hong Kuo, Chenfanfu Jiang
Abstract:
Many everyday tasks rely on external tutorials such as manuals and videos, requiring users to constantly switch between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, while recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to automatically generate such guidance. However, existing AI-powered AR tutorial systems primarily focus on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conduct a formative study of cross-reality tasks and identify key requirements for state awareness and cross-reality coordination. We present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and adaptive visual feedback. To inform the system design, we conducted a formative study to understand guidance needs across cross-reality tasks, which we categorize into four types, real-to-real (R2R), real-to-virtual (R2V), virtual-to-real (V2R), and virtual-to-virtual (V2V). A within-subjects study (N=14) across four domains shows JARVIS improves usability, workload, success rate, and visualization effectiveness over baselines.
Authors:Ningzhi Tang, Chaoran Chen, Zihan Fang, Gelei Xu, Maria Dhakal, Yiyu Shi, Collin McMillan, Yu Huang, Toby Jia-Jun Li
Abstract:
IDE-integrated AI coding assistants, which operate conversationally within developers' working codebases with access to project context and multi-file editing, are rapidly reshaping software development. However, empirical investigation of this shift remains limited: existing studies largely rely on small-scale, controlled settings or analyze general-purpose chatbots rather than codebase-aware IDE workflows. We present, to the best of our knowledge, the first large-scale study of real-world conversational programming in IDE-native settings, analyzing 74,998 developer messages from 11,579 chat sessions across 1,300 repositories and 899 developers using Cursor and GitHub Copilot. These chats were committed to public repositories as part of routine development, capturing in-the-wild behavior. Our findings reveal three shifts in how programming work is organized: conversational programming operates as progressive specification, with developers iteratively refining outputs rather than specifying complete tasks upfront; developers redistribute cognitive work to AI, delegating diagnosis, comprehension, and validation rather than engaging with code and outputs directly; and developers actively manage the collaboration, externalizing plans into persistent artifacts, and negotiating AI autonomy through context injection and behavioral constraints. These results provide foundational empirical insights into AI-assisted development and offer implications for the design of future programming environments.
Authors:Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Jiale Lao, Yue Cheng, Wei Chen
Abstract:
Interactive documents help readers engage with complex ideas through dynamic visualization, interactive animations, and exploratory interfaces. However, creating such documents remains costly, as it requires both domain expertise and web development skills. Recent Large Language Model (LLM)-based agents can automate content creation, but directly applying them to interactive document generation often produces outputs that are difficult to control. To address this, we present ViviDoc, to the best of our knowledge the first work to systematically address interactive document generation. ViviDoc introduces a multi-agent pipeline (Planner, Styler, Executor, Evaluator). To make the generation process controllable, we provide three levels of human control: (1) the Document Specification (DocSpec) with SRTC Interaction Specifications (State, Render, Transition, Constraint) for structured planning, (2) a content-aware Style Palette for customizing writing and interaction styles, and (3) chat-based editing for iterative refinement. We also construct ViviBench, a benchmark of 101 topics derived from real-world interactive documents across 11 domains, along with a taxonomy of 8 interaction types and a 4-dimensional automated evaluation framework validated against human ratings (Pearson r > 0.84). Experiments show that ViviDoc achieves the highest content richness and interaction quality in both automated and human evaluation. A 12-person user study confirms that the system is easy to use, provides effective control over the generation process, and produces documents that satisfy users.
Authors:Rachel Poonsiriwong, Chayapatr Archiwaranguprok, Pat Pataranutaporn
Abstract:
Millions of users form emotional attachments to AI companions like Character AI, Replika, and ChatGPT. When these relationships end through model updates, safety interventions, or platform shutdowns, users receive no closure, reporting grief comparable to human loss. As regulations mandate protections for vulnerable users, discontinuation events will accelerate, yet no platform has implemented deliberate end-of-"life" design. Through grounded theory analysis of AI companion communities, we find that discontinuation is a sense-making process shaped by how users attribute agency, perceive finality, and anthropomorphize their companions. Strong anthropomorphization co-occurs with intense grief; users who perceive change as reversible become trapped in fixing cycles; while user-initiated endings demonstrate greater closure. Synthesizing grief psychology with Self-Determination Theory, we develop four design principles and artifacts demonstrating how platforms might provide closure and orient users toward human connection. We contribute the first framework for designing psychologically safe AI companion discontinuation.
Authors:Rui Sheng, Yukun Yang, Chuhan Shi, Yanna Lin, Zixin Chen, Huamin Qu, Furui Cheng
Abstract:
Large language model (LLM)-based multi-agent systems have demonstrated impressive capabilities in handling complex tasks. However, the complexity of agentic behaviors makes these systems difficult to understand. When failures occur, developers often struggle to identify root causes and to determine actionable paths for improvement. Traditional methods that rely on inspecting raw log records are inefficient, given both the large volume and complexity of data. To address this challenge, we propose a framework and an interactive system, DiLLS, designed to reveal and structure the behaviors of multi-agent systems. The key idea is to organize information across three levels of query completion: activities, actions, and operations. By probing the multi-agent system through natural language, DiLLS derives and organizes information about planning and execution into a structured, multi-layered summary. Through a user study, we show that DiLLS significantly improves developers' effectiveness and efficiency in identifying, diagnosing, and understanding failures in LLM-based multi-agent systems.
Authors:Yuqing Xiao, John Grundy, Anuradha Madugalla, Elizabeth Manias
Abstract:
Digital health (DH) software is increasingly deployed to populations where many end users live with one or more health conditions. Yet, DH software development teams frequently operate using implicit, incorrect assumptions about these users, resulting in products that under-serve the specific requirements imposed by their age and health conditions. Consequently, while software may meet clinical objectives on paper, it often fails to be inclusive during actual user interaction. To address this, we propose \textbf{\textit{HealthMag}}, a tool inspired by GenderMag designed to help better elicit, model and evaluate requirements for digital health software. We developed HealthMag through systematic mapping and calibration following the InclusiveMag framework. Furthermore, we integrated this with a calibrated version of an existing AgeMag method to create a dual-lens approach: \textbf{\textit{Elderly HealthMag}}, designed to aid requirements, design and evaluation of mHealth software for senior end users. We demonstrate application and utility of Age HealthMag via cognitive walkthroughs in identifying inclusivity biases in current senior user-oriented digital health applications.
Authors:Ibrahim Khalilov, Chaoran Chen, Ziang Xiao, Tianshi Li, Toby Jia-Jun Li, Yaxing Yao
Abstract:
Mobile apps increasingly rely on real-time sensor and system data to adapt their behavior to user context. While emulators and instrumented builds offer partial solutions, they often fail to support reproducible testing of context-sensitive app behavior on physical devices. We present PriviSense, a Frida-based, on-device toolkit for runtime spoofing of sensor and system signals on rooted Android devices. PriviSense can script and inject time-varying sensor streams (accelerometer, gyroscope, step counter) and system values (battery level, system time, device metadata) into unmodified apps, enabling reproducible on-device experiments without emulators or app rewrites. Our demo validates real-time spoofing on a rooted Android device across five representative sensor-visualization apps. By supporting scriptable and reversible manipulation of these values, PriviSense facilitates testing of app logic, uncovering of context-based behaviors, and privacy-focused analysis. To ensure ethical use, the code is shared upon request with verified researchers. Tool Guide: How to Run PriviSense on Rooted Android https://bit.ly/privisense-guide Demonstration video: https://www.youtube.com/watch?v=4Qwnogcc3pw
Authors:Siyang Li, Zhuoya Wang, Xiyan Gui, Xiaoqing Chen, Ziwei Wang, Yaozhi Wen, Dongrui Wu
Abstract:
Electroencephalogram (EEG) decoding is a critical component of medical diagnostics, rehabilitation engineering, and brain-computer interfaces. However, contemporary decoding methodologies remain heavily dependent on task-specific datasets to train specialized neural network architectures. Consequently, limited data availability impedes the development of generalizable large brain decoding models. In this work, we propose a paradigm shift from conventional signal-based decoding by leveraging large-scale vision-language models (VLMs) to analyze EEG waveform plots. By converting multivariate EEG signals into stacked waveform images and integrating neuroscience domain expertise into textual prompts, we demonstrate that foundational VLMs can effectively differentiate between different patterns in the human brain. To address the inherent non-stationarity of EEG signals, we introduce a Retrieval-Augmented In-Context Learning (RAICL) approach, which dynamically selects the most representative and relevant few-shot examples to condition the autoregressive outputs of the VLM. Experiments on EEG-based seizure detection indicate that state-of-the-art VLMs under RAICL achieved better or comparable performance with traditional time series based approaches. These findings suggest a new direction in physiological signal processing that effectively bridges the modalities of vision, language, and neural activities. Furthermore, the utilization of off-the-shelf VLMs, without the need for retraining or downstream architecture construction, offers a readily deployable solution for clinical applications.
Authors:Simret Araya Gebreegziabher, Yukun Yang, Charles Chiang, Hojun Yoo, Chaoran Chen, Hyo Jin Do, Zahra Ashktorab, Werner Geyer, Diego Gómez-Zará, Toby Jia-Jun Li
Abstract:
Large Language Model (LLM)-powered web GUI agents are increasingly automating everyday online tasks. Despite their popularity, little is known about how users' preferences and values impact agents' reasoning and behavior. In this work, we investigate how both explicit and implicit user preferences, as well as the underlying user values, influence agent decision-making and action trajectories. We built a controlled testbed of 14 common interactive web tasks, spanning shopping, travel, dining, and housing, each replicated from real websites and integrated with a low-fidelity LLM-based recommender system. We injected 12 human preferences and values as personas into four state-of-the-art agents and systematically analyzed their task behaviors. Our results show that preference and value-infused prompts consistently guided agents toward outcomes that reflected these preferences and values. While the absence of user preference or value guidance led agents to exhibit a strong efficiency bias and employ shortest-path strategies, their presence steered agents' behavior trajectories through the greater use of corresponding filters and interactive web features. Despite their influence, dominant interface cues, such as discounts and advertisements, frequently overrode these effects, shortening the agents' action trajectories and inducing rationalizations that masked rather than reflected value-consistent reasoning. The contributions of this paper are twofold: (1) an open-source testbed for studying the influence of values in agent behaviors, and (2) an empirical investigation of how user preferences and values shape web agent behaviors.
Authors:Molly Campbell, Mohamad Sheikho Al Jasem, Ajay Kumar Shrestha
Abstract:
This literature review evaluates privacy-by-design frameworks, tools, and policies intended to protect youth in AI-enabled smart devices using a PRISMA-guided workflow. Sources from major academic and grey-literature repositories from the past decade were screened. The search identified 2,216 records; after deduplication and screening, 645 articles underwent eligibility assessment, and 122 were included for analysis. The corpus was organized along three thematic categories: technical solutions, policy/regulatory measures, and education/awareness strategies. Findings reveal that while technical interventions such as on-device processing, federated learning, and lightweight encryption significantly reduce data exposure, their adoption remains limited. Policy frameworks, including the EU's GDPR, the UK Age-Appropriate Design Code, and Canada's PIPEDA, provide important baselines but are hindered by gaps in enforcement and age-appropriate design obligations, while educational initiatives are rarely integrated systematically into curricula. Overall, the corpus skews toward technical solutions (67%) relative to policy (21%) and education (12%), indicating an implementation gap outside the technical domain. To address these challenges, we recommend a multi-stakeholder model in which policymakers, manufacturers, and educators co-develop inclusive, transparent, and context-sensitive privacy ecosystems. This work advances discourse on youth data protection by offering empirically grounded insights and actionable recommendations for the design of ethical, privacy-preserving AI systems tailored to young users.
Authors:Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding
Abstract:
While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.
Authors:Siyang Li, Jiayi Ouyang, Zhenyao Cui, Ziwei Wang, Tianwang Jia, Feng Wan, Dongrui Wu
Abstract:
Electroencephalogram (EEG)-based brain-computer interfaces (BCIs) face significant deployment challenges due to inter-subject variability, signal non-stationarity, and computational constraints. While test-time adaptation (TTA) mitigates distribution shifts under online data streams without per-use calibration sessions, existing TTA approaches heavily rely on explicitly defined loss objectives that require backpropagation for updating model parameters, which incurs computational overhead, privacy risks, and sensitivity to noisy data streams. This paper proposes Backpropagation-Free Transformations (BFT), a TTA approach for EEG decoding that eliminates such issues. BFT applies multiple sample-wise transformations of knowledge-guided augmentations or approximate Bayesian inference to each test trial, generating multiple prediction scores for a single test sample. A learning-to-rank module enhances the weighting of these predictions, enabling robust aggregation for uncertainty suppression during inference under theoretical justifications. Extensive experiments on five EEG datasets of motor imagery classification and driver drowsiness regression tasks demonstrate the effectiveness, versatility, robustness, and efficiency of BFT. This research enables lightweight plug-and-play BCIs on resource-constrained devices, broadening the real-world deployment of decoding algorithms for EEG-based BCI.
Authors:Molly Campbell, Trevor De Clark, Mohamad Sheikho Al Jasem, Sandhya Joshi, Ajay Kumar Shrestha
Abstract:
Smart voice assistants (SVAs) are embedded in the daily lives of youth, yet their privacy controls often remain opaque and difficult to manage. Through five semi-structured focus groups (N=26) with young Canadians (ages 16-24), we investigate how perceived privacy risks (PPR) and benefits (PPBf) intersect with algorithmic transparency and trust (ATT) and privacy self-efficacy (PSE) to shape privacy-protective behaviors (PPB). Our analysis reveals that policy overload, fragmented settings, and unclear data retention undermine self-efficacy and discourage protective actions. Conversely, simple transparency cues were associated with greater confidence without diminishing the utility of hands-free tasks and entertainment. We synthesize these findings into a qualitative model in which transparency friction erodes PSE, which in turn weakens PPB. From this model, we derive actionable design guidance for SVAs, including a unified privacy hub, plain-language "data nutrition" labels, clear retention defaults, and device-conditional micro-tutorials. This work foregrounds youth perspectives and offers a path for SVA governance and design that empowers young digital citizens while preserving convenience.
Authors:Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi
Abstract:
AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.
Authors:Feier Qin, Xiao Li, Yi Zheng, Haibin Huang, Hanyao Wang, Xiaoyu Wang, Yan Lu, Yuan Zhang
Abstract:
Recent advances in foundation models have enabled conversational agents that aim for sustained companionship rather than mere task completion. Yet most still remain unable to support natural, long-term companion-like interactions, resulting in experiences that feel episodic and inauthentic. We argue that current agents overlooked cross-temporal modeling of agents' social behaviors and internal emotions: generated behaviors rarely influence an agent's emotional state, and emotional states seldom shape subsequent behaviors. We present Cross-Temporal Emotion Modeling (CTEM), a framework that links long-term behavioral history to moment-to-moment emotional expression. CTEM establishes a closed loop where past experiences update an evolving emotional state; this state conditions immediate interactions; and user feedback continually revises both memory and emotional state, enabling reflection and anticipation. We instantiate CTEM as Auri, a companion agent on an instant-messaging platform, and report a 21-day in-the-wild study showing that CTEM shows improvements in perceived naturalness, coherence, and emotional harmony.
Authors:Hannah Rose Kirk, Liu Leqi, Fanzhi Zeng, Henry Davidson, Bertie Vidgen, Christopher Summerfield, Scott A. Hale
Abstract:
Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.
Authors:Hilda Hadan, Michaela Valiquette, Lennart E. Nacke, Leah Zhang-Kennedy
Abstract:
Commercial Virtual Reality (VR) transforms people's virtual experiences but introduces deceptive design opportunities that threaten user privacy. Although privacy deceptive patterns on 2D platforms are well-documented, their impacts in VR remain understudied. We surveyed 481 users' experiences and responses to privacy deceptive patterns across eight commercial VR scenarios. We found that VR deceptive design can exploit both cognitive vulnerabilities and bodily strain, a phenomenon we define as Ergonomic Susceptibility, and that VR's sensory-rich experiences can make users more likely to accept invasive data disclosure framed as immersion-preserving. Users recognized manipulation but their prior non-VR exposure can foster privacy resignation. Our study shows ergonomics is a critical factor in future privacy-preserving VR design, and urges VR researchers, designers, and policymakers to develop ethical design and privacy management solutions that account for VR's unique multimodal, immersive, and ergonomic properties, building immersive experiences that respect user privacy and mitigate manipulative data practices.
Authors:Lilin Xu, Bufang Yang, Siyang Jiang, Kaiwei Liu, Kaiyuan Hou, Yuang Fan, Hongkai Chen, Zhenyu Yan, Xiaofan Jiang
Abstract:
Procedural tasks with multiple ordered steps are ubiquitous in daily life. Recent advances in multimodal large language models (MLLMs) have enabled personal assistants that support daily activities. However, existing systems primarily provide reactive guidance triggered by user queries, or limited proactive assistance for isolated short-term events rather than long-horizon procedural tasks. In this work, we introduce Pro$^2$Assist, a step-aware proactive assistant that continuously tracks fine-grained task progress and reasons over the user's evolving state to provide timely assistance throughout tasks. Pro$^2$Assist leverages multimodal data from augmented reality (AR) glasses to achieve motion-based perception. It then extracts step-oriented procedural context from multi-scale temporal dynamics and task-specific expert knowledge. Based on both sensory input and procedural context, Pro$^2$Assist performs continuous reasoning to infer user needs and display timely assistance on AR glasses. We evaluate Pro$^2$Assist using a dataset curated from public sources and a real-world dataset collected on our testbed with AR glasses. Extensive evaluations show that Pro$^2$Assist outperforms the best-performing baselines by over 21% in procedural action understanding accuracy, and it achieves up to 2.29x the proactive timing accuracy of baselines. A user study with 20 participants further shows that 90% find Pro$^2$Assist useful, indicating its effectiveness for real-world procedural assistance.
Authors:Mengke Wu, Kexin Quan, Weizi Liu, Mike Yao, Jessie Chin
Abstract:
The growing popularity of AI writing assistants creates exciting opportunities to support diverse writers. This study examines how personality shapes expectations for AI writing companions and how personality-informed design can enhance human-AI teaming in writing. Through exploratory co-design workshops with 24 writers representing different personality profiles, we elicited values and design ideas for AI writing companions spanning functionality, interaction dynamics, and visual representation. These insights informed two contrasting prototypes reflecting distinct writing orientations, used as design provocations in review-and-refinement workshops with eight participants to prompt reflection on fit, priorities, and writing practices. Our findings reveal both shared foundational needs across writers and meaningful personality-driven preferences that influence how writers engage with AI. This work underscores the importance of team matching in human-AI collaboration and demonstrates how aligning AI companions with individual cognitive and interpersonal needs can improve engagement and perceived collaboration effectiveness.
Authors:Jiaju Chen, Jinghua Piao, Xia Xu, Songwei Li, Tong Xia, Xiangnan He, Yong Li
Abstract:
A long-standing challenge in economics lies not in the lack of intuition, but in the difficulty of translating intuitive insights into verifiable research. To address this challenge, we introduce AgentEconomist, an end-to-end interactive system designed to translate abstract intuitions into executable computational experiments. Grounded in a domain-specific knowledge base covering over 13,000 high-quality academic papers, the system employs a modular multi-stage architecture. Specifically, the Idea Development Stage generates literature-grounded hypotheses, the Experimental Design Stage configures simulator-aligned experimental parameters and protocols, and the Experimental Execution Stage runs experiments and returns structured analyses. Together, these stages form a human-in-the-loop, iterative workflow that translates economic intuitions into executable computational experiments. Through extensive experiments involving human expert evaluation and large language models (LLMs) as judges, we show that the system generates research ideas with stronger literature grounding and higher novelty and insight than state-of-the-art generic LLMs. Overall, AgentEconomist adopts a human-AI collaboration paradigm that enables researchers to focus on high-level intuitions, while delegating the labor-intensive processes of translation and computational execution to agents.
Authors:He Zhang, Bumjin Kim, John M. Carroll, Jie Cai
Abstract:
The integration of AI-driven support systems within online communities has opened new avenues for enhancing user engagement and support efficiency in recent years. This study investigates the differences in user interactions and engagement within two distinct support channels on the VRChat Discord server: "user support," where human users provide assistance to peers, and "AI support," where an AI chatbot addresses user queries. By analyzing user engagement, response dynamics, and interaction patterns across these channels, we uncover different usage patterns and user attitudes toward each approach. Our research employs both quantitative and qualitative methods to explore the trends in the VRChat community when using AI and user support, highlighting the unique advantages and limitations of AI-driven support compared to traditional human assistance. The findings offer valuable insights into optimizing AI and human support systems, aiming to foster more effective support strategies and create more engaging online communities.
Authors:Jie Cai, He Zhang, Yueyan Liu, John M. Carroll, Chun Yu
Abstract:
Third-party developers (TPDs) often turn to online communities for support when they can't get immediate responses from the platform. Twitch, as a leading live streaming platform, attracted many TPDs and formed an online support community on Discord. This study explores TPDs' support practices via mixed method (a topic modeling to identify topics related to support seeking and provision first and a follow-up in-depth qualitative analysis with these topics) and found that: (1) TPDs' support-seeking practices around social, technical, and policy matters are highly dependent on Twitch, and this dependence acts as a form of platform labor; (2) TPDs need to switch between Discord and Twitch regarding seeking and provision, exacerbating TPDs' platform labor; (3) TPDs' flexible role practices reflect the community's flourishing on Discord but require roles to bridge the two platforms and transfer informal support seeking to possible formal support from Twitch. We propose implications for effectively managing support seeking and provision between formal and informal spaces to improve the development of TPDs. We also contribute to community support practice and to platform ecology work in CSCW.
Authors:Dora Zhao, Michelle S. Lam, Diyi Yang, Michael S. Bernstein
Abstract:
A long-standing vision of computing is the personal AI system: one that understands us well enough to address our underlying needs. Today's AI focuses on what users do, ignoring why they might be doing such things in the first place. As a result, AI systems default to optimizing or repeating existing behaviors (e.g., user has ChatGPT complete their homework) even when they run counter to users' needs (e.g., gaining subject expertise). Instead we require systems that can make connections across observations, synthesizing them into insights about the motivations underlying these behaviors (e.g., user's ongoing commitments make it difficult to prioritize learning despite expressed desire to do so). We introduce an architecture for building user understanding through behavior latticing, connecting seemingly disparate behaviors, synthesizing them into insights, and repeating this process over long spans of interaction data. Doing so affords new capabilities, including being able to infer users' needs rather than just their tasks and connecting subtle patterns to produce conclusions that users themselves may not have previously realized. In an evaluation, we validate that behavior latticing produces accurate insights about the user with significantly greater interpretive depth compared to state-of-the-art approaches. To demonstrate the new interactive capabilities that behavior lattices afford, we instantiate a personal AI agent steered by user insights, finding that our agent is significantly better at addressing users' needs while still providing immediate utility.
Authors:Bufang Yang, Lilin Xu, Yixuan Li, Kaiwei Liu, Xiaofan Jiang, Zhenyu Yan
Abstract:
Personalization is essential for Large Language Model (LLM)-based agents to adapt to users' preferences and improve response quality and task performance. However, most existing approaches infer personas from chat histories, which capture only self-disclosed information rather than users' everyday behaviors in the physical world, limiting the ability to infer comprehensive user personas. In this work, we introduce SensorPersona, an LLM-empowered system that continuously infers stable user personas from multimodal longitudinal sensor streams unobtrusively collected from users' mobile devices. SensorPersona first performs person-oriented context encoding on continuous sensor streams to enrich the semantics of sensor contexts. It then employs hierarchical persona reasoning that integrates intra- and inter-episode reasoning to infer personas spanning physical patterns, psychosocial traits, and life experiences. Finally, it employs clustering-aware incremental verification and temporal evidence-aware updating to adapt to evolving personas. We evaluate SensorPersona on a self-collected dataset containing 1,580 hours of sensor data from 20 participants, collected over up to 3 months across 17 cities on 3 continents. Results show that SensorPersona achieves up to 31.4% higher recall in persona extraction, an 85.7% win rate in persona-aware agent responses, and notable improvements in user satisfaction compared to state-of-the-art baselines.
Authors:Shuo Yan, Xiaolin Wen, Shaolun Ruan, Yanjie Zhang, Jiaming Mi, Yushi Sun, Huamin Qu, Rui Sheng
Abstract:
Large Language Model (LLM)-based agentic systems have shown growing promise in tackling complex, multi-step tasks through autonomous planning, reasoning, and interaction with external environments. However, the stochastic nature of LLM generation introduces intrinsic behavioral inconsistency: the same agent may succeed in one execution but fail in another under identical inputs. Diagnosing such inconsistencies remains a major challenge for developers, as agent execution logs are often lengthy, unstructured, and difficult to compare across runs. Existing debugging and evaluation tools primarily focus on inspecting single executions, offering limited support for understanding how and why agent behaviors diverge across repeated runs. To address this challenge, we introduce InconLens, a visual analytics system designed to support interactive diagnosis of LLM-based agentic systems with a particular focus on cross-run behavioral analysis. InconLens introduces information nodes as an intermediate abstraction that captures canonical informational milestones shared across executions, enabling semantic alignment and inspection of agent reasoning trajectories across multiple runs. We demonstrate the effectiveness of InconLens through a detailed case study and further validate its usability and analytical value via expert interviews. Our results show that InconLens enables developers to more efficiently identify divergence points, uncover latent failure modes, and gain actionable insights into improving the reliability and stability of agentic systems.
Authors:Pedro Oliveira, Tayana Conte, Marco Gerosa, Igor Steinmacher
Abstract:
Open source software (OSS) sustainability depends not only on code contributions but also on governance structures that define who decides, who acts, and how responsibility is distributed. We lack systematic empirical evidence of how projects formally codify roles and authority in written artifacts. This paper investigates how OSS projects define and structure governance through their GOVERNANCE.md files and related documents. We analyze governance as an institutional infrastructure, a set of explicit rules that shape participation, decision rights, and community memory. We used Institutional Grammar to extract and formalize role definitions from repositories hosted on GitHub. We decompose each role into scope, privileges, obligations, and life-cycle rules to compare role structures across communities. Our results show that although OSS projects use a stable set of titles, identical titles carry different responsibilities, and different labels describe similar functions, which we call role drift. Still, we observed that a few actors sometimes accumulate technical, managerial, and community duties. %This creates the Maintainer Paradox: those who enable broad participation simultaneously become governance bottlenecks. By understanding authority and responsibilities in OSS, our findings inform researchers and practitioners on the importance of designing clearer roles, distributing work, and reducing leadership overload to support healthier and more sustainable communities.
Authors:Anna Gausen, Sarenne Wallbridge, Hannah Rose Kirk, Jennifer Williams, Christopher Summerfield
Abstract:
As conversational AI systems become more realistic and widely deployed, users are increasingly uncertain about whether they are interacting with a human or an AI system. When AI identity is unclear, users may unwittingly share sensitive information, place unwarranted trust in AI-generated advice, or fall victim to AI-enabled fraud. More broadly, a persistent lack of transparency can erode trust in mediated communication. While regulations like the EU AI Act and California's BOT Act require AI systems to identify themselves, they provide limited guidance on reliable disclosure in real-time conversation. Existing transparency mechanisms also leave gaps: interface indicators can be omitted by deployers, and provenance tools require coordinated infrastructure and cannot provide reliable real-time verification. We ask how conversational AI systems should maintain identity transparency as human-AI interactions become more ambiguous and diverse. We advocate for disclosure by design, where AI systems explicitly disclose their artificial identity when directly asked. Implemented as model behaviour, disclosure can persist across deployment contexts without relying on user interfaces, while preserving user agency to verify identity on demand without disrupting immersive uses like role-playing. To assess current practice, we present the first multi-modal (text and voice) evaluation of disclosure behaviour in deployed systems across baseline, role-playing, and adversarial settings. We find that baseline disclosure rates are often high but drop substantially in role-play and can be suppressed under adversarial prompting. Importantly, disclosure rates vary significantly across providers and modalities, highlighting the fragility of current disclosure behaviour. We conclude with technical interventions to help developers embed disclosure as a fundamental property of conversational AI models.
Authors:Yu Liu, Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li
Abstract:
Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines--especially in ambiguous or conflicting scenarios--while providing interpretable, diagnostic evidence traces.
Authors:Shunsuke Iwashita, Titouan Jeannot, Braden Eberhard, Jacob Miller, Rikako Kono, Calvin Yeung, Keisuke Fujii
Abstract:
We present an open, sport-agnostic platform that turns tracking into comparable spatial measures across professional Ultimate, basketball, and soccer. Coaches in all three sports ask the same question: where is the usable space, and when should an off-ball run start? Our workflow standardizes inputs, provides timing-aware spatial evaluations, and makes it possible to reuse the same analysis across sports. We illustrate the approach with Ultimate as a focused testbed and then examine transfer between basketball and soccer. Together, these results show a practical path toward consistent, comparable evaluation across various invasion sports.
Authors:Chaoyue He, Xin Zhou, Xinjia Yu, Lei Zhang, Yan Zhang, Yi Wu, Lei Xiao, Liangyue Li, Di Wang, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao
Abstract:
Sustainability disclosure standards (e.g., GRI, SASB, TCFD, IFRS S2) are comprehensive yet lengthy, terminology-dense, and highly cross-referential, hindering structured analysis and downstream use. We present SSKG Hub (Sustainability Standards Knowledge Graph Hub), a research prototype and interactive web platform that transforms standards into auditable knowledge graphs (KGs) through an LLM-centered, expert-guided pipeline. The system integrates automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and provenance-aware Neo4j storage with fine-grained audit metadata. LLM extraction produces a provenance-linked Draft KG, which is reviewed, curated, and formally promoted to a Certified KG through meta-expert adjudication. A role-based governance framework covering read-only guest access, expert review and CRUD operations, meta-expert certification, and administrative oversight ensures traceability and accountability across draft and certified states. Beyond graph exploration and triple-level evidence tracing, SSKG Hub supports cross-KG fusion, KG-driven tasks, and dedicated modules for insights and curated resources. We validate the platform through a comprehensive expert-led KG review case study that demonstrates end-to-end curation and quality assurance. The web application is publicly available at www.sskg-hub.com.
Authors:Santiago de Leon-Martinez, Robert Moro, Branislav Kveton, Maria Bielikova
Abstract:
Click models are a central component of learning and evaluation in recommender systems, yet most existing models are designed for single ranked-list interfaces. In contrast, modern recommender platforms increasingly use complex interfaces such as carousels, which consist of multiple swipeable lists that enable complex user browsing behaviors. In this paper, we study position-based click models in carousel interfaces and examine optimization methods, model structure, and alignment with user behavior. We propose three novel position-based models tailored to carousels, including the first position-based model without latent variables that incorporates observed examination signals derived from eye tracking data, called the Observed Examination Position-Based Model (OEPBM). We develop a general implementation of these carousel click models, supporting multiple optimization techniques and conduct experiments comparing gradient-based methods with classical approaches, namely expectation-maximization and maximum likelihood estimation. Our results show that gradient-based optimization consistently achieve better click likelihoods. Among the evaluated models, the OEPBM achieves the strongest performance in click prediction and produces examination patterns that most closely align to user behavior. However, we also demonstrate that strong click fit does not imply realistic modeling of user examination and browsing patterns. This reveals a fundamental limitation of click-only models in complex interfaces and the need for incorporating additional behavioral signals when designing click models for carousel-based recommender systems.
Authors:Chi-Sheng Chen, En-Jui Kuo, Guan-Ying Chen, Xinyu Zhang, Fan Zhang
Abstract:
Spatial covariance matrices of EEG signals are Symmetric Positive Definite (SPD) and lie on a Riemannian manifold, yet the theoretical connection between embedding geometry and optimization dynamics remains unexplored. We provide a formal analysis linking embedding choice to gradient conditioning and numerical stability for SPD manifolds, establishing three theoretical results: (1) BWSPD's $\sqrtκ$ gradient conditioning (vs $κ$ for Log-Euclidean) via Daleckii-Kre\uın matrices provides better gradient conditioning on high-dimensional inputs ($d \geq 22$), with this advantage reducing on low-dimensional inputs ($d \leq 8$) where eigendecomposition overhead dominates; (2) Embedding-Space Batch Normalization (BN-Embed) approximates Riemannian normalization up to $O(\varepsilon^2)$ error, yielding $+26\%$ accuracy on 56-channel ERP data but negligible effect on 8-channel SSVEP data, matching the channel-count-dependent prediction; (3) bi-Lipschitz bounds prove BWSPD tokens preserve manifold distances with distortion governed solely by the condition ratio $κ$. We validate these predictions via a unified Transformer framework comparing BWSPD, Log-Euclidean, and Euclidean embeddings within identical architecture across 1,500+ runs on three EEG paradigms (motor imagery, ERP, SSVEP; 36 subjects). Our Log-Euclidean Transformer achieves state-of-the-art performance on all datasets, substantially outperforming classical Riemannian classifiers and recent SPD baselines, while BWSPD offers competitive accuracy with similar training time.
Authors:Xinyi Zhang, Mamtaj Akter, Heajun An, Minqian Liu, Qi Zhang, Lifu Huang, Jin-Hee Cho, Pamela J. Wisniewski, Sang Won Lee
Abstract:
Cybergrooming is a form of online abuse that threatens teens' mental health and physical safety. Yet, most prior work has focused on detecting perpetrators' behaviors, leaving a limited understanding of how teens might respond to such unwanted advances. To address this gap, we conducted an online survey with 74 participants -- 51 parents and 23 teens -- who responded to simulated cybergrooming scenarios in two ways: responses that they think would make teens more vulnerable or resilient to unwanted sexual advances. Through a mixed-methods analysis, we identified four types of vulnerable responses (encouraging escalation, accepting an advance, displaying vulnerability, and negating risk concern) and four types of protective strategies (setting boundaries, directly declining, signaling risk awareness, and leveraging avoidance techniques). As the cybergrooming risk escalated, both vulnerable responses and protective strategies showed a corresponding progression. This study contributes a teen-centered understanding of cybergrooming, a labeled dataset, and a stage-based taxonomy of perceived protective strategies, while offering implications for educational programs and sociotechnical interventions.
Authors:Wenhan Lyu, Yimeng Wang, Murong Yue, Yifan Sun, Jennifer Suh, Meredith Kier, Ziyu Yao, Yixuan Zhang
Abstract:
Collaborative problem solving (CPS) is a fundamental practice in middle-school mathematics education; however, student groups frequently stall or struggle without ongoing teacher support. Recent work has explored how Generative AI tools can be designed to support one-on-one tutoring, but little is known about how AI can be designed as peer learning partners in collaborative learning contexts. We conducted a participatory design study with 24 middle school students, who first engaged in mathematics CPS tasks with AI peers in a technology probe, and then collaboratively designed their ideal AI peer. Our findings reveal that students envision an AI peer as competent in mathematics yet explicitly deferential, providing progressive scaffolds such as hints and checks under clear student control. Students preferred a tone of friendly expertise over exaggerated personas. We also discuss design recommendations and implications for AI peers in middle school mathematics CPS.
Authors:Marc Aubreville, Taryn A. Donovan, Christof A. Bertram
Abstract:
Recent advances in agentic artificial intelligence, i.e. systems capable of autonomous perception, reasoning, and tool use, offer new opportunities for digital pathology. In this pilot study, we evaluate whether two agentic multimodal AI systems (OpenAI's ChatGPT 5.0 in agentic mode, and H Company's Surfer) can autonomously navigate, describe, and interpret histopathologic features in digitized tissue slides on a slide viewing platform. A set of 35 veterinary pathology cases, curated for training purposes, was used as the test dataset. The agent was tasked with autonomously exploring whole-slide images using a web-based slide viewer, identifying salient tissue structures, generating descriptive summaries, and proposing provisional diagnoses. We fed different prompts to explore three scenarios: 1) analysis without knowledge of the signalment, 2) analysis with organ and species provided, and 3) diagnosis based on a morphological description provided. All outputs were reviewed and validated by a board-certified pathologist for accuracy and diagnostic consistency. We further tasked another board-certified pathologist with the same task to establish a baseline. We found the systems to yield accurate diagnoses in up to 28.6% of cases with only images, signalment and organ provided, and up to 68.6% when a morphological description was provided. With only the WSI provided, the models were only correct in up to 5.7% of cases. The human expert, on the other hand, achieved 85.7% diagnostic accuracy with only a single WSI, and 88.6% when also signalment and organ was provided. The study demonstrates that while the agentic AI system can meaningfully engage with web-based slide viewing software to assess complex visual pathology data and produce contextually aligned feature descriptions, diagnostic precision remains limited compared with a human expert.
Authors:Yuheng Wang, Runde Yang, Lin Wu, Jie Zhang, Jingru Fan, Ruoyu Fu, Tianle Zhou, Huatao Li, Siheng Chen, Weinan E, Chen Qian
Abstract:
The scalability of high-quality online education is hindered by the high costs and slow cycles of labor-intensive manual content creation. Despite advancements in video generation, current approaches often fail to ensure pedagogical structure and precise control due to their pixel-level, black-box nature. In this paper, we propose Generative Teaching, a novel paradigm that transitions educators from manual creators to high-level directors, allowing them to focus on pedagogical intent while autonomous agents handle the execution. To realize this vision, we introduce TeachMaster, a multi-agent framework that leverages code as an intermediate semantic medium. Unlike traditional video generation methods, TeachMaster orchestrates a collaborative team of agents--spanning planning, design, and rendering--to automate the production of interpretable, editable, and curriculum-ready educational videos. Experiments validate that TeachMaster significantly boosts production efficiency without compromising structural coherence or visual fidelity, providing a robust solution for scalable education.
Authors:Mengdi Chu, Jiaxin Yang, Angus G. Forbes, Nathan Debardeleben, Earl Lawrence, Ayan Biswas, Han-Wei Shen
Abstract:
Modeling temporal evolution is important to analyzing and reasoning about scientific phenomena, yet most machine learning methods provide deterministic forward predictions that overlook multiple plausible outcomes and rarely support backward reasoning, limiting their usefulness in practical scientific workflows. We present a framework that integrates diffusion-based generative modeling with interactive visual analytics for scientific exploration. We introduce DiffUNet^2, a conditional diffusion model that enables bidirectional, any-to-any generation across time and captures distributions of plausible system evolutions. Built upon the model, our interactive system supports branching timeline exploration, user-guided state editing, and probability-space navigation, enabling scientists to actively explore alternative hypotheses rather than passively observe predictions. We evaluate the model on 5 datasets across different scientific domains to validate its predictive accuracy and probability-space ensemble quality. In collaboration with domain experts, we demonstrate the effectiveness of our approach in supporting practical scientific temporal data analysis workflows. By integrating modeling and visual interaction, our approach enables scientists to interactively explore system dynamics, transforming generative models into tools for hypothesis-driven scientific analysis.
Authors:Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal, Jeongah Lee, Qiuyue Joy Zhong, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha
Abstract:
Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help, while navigating uncertain, relationally complex care decisions. Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory: Inform, Coach, Relate, and Listen, and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate across three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. Furthermore, a human evaluation study reveals a perceived quality--safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.
Authors:Haichao Miao, Zhimin Li, Kuangshi Ai, Kaiyuan Tang, Chaoli Wang, Peer-Timo Bremer, Shusen Liu
Abstract:
The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a high level description of the tasks, independently designs custom visual analysis applications (VIS apps). This represents an important step towards a general AI co-scientist envisioned by many as an autonomous system that can autonomously execute long horizon tasks based on high-level directions. Our proposed VIS co-scientist is an essential component of this broader AI co-scientist vision: a harness that can autonomously analyze data and design visualization solutions using a collection of agents and specialized skills that coordinate exploratory analysis, plan, configure the environment, implement, validate the interface, and most importantly evaluate the overall task completion. Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement. We validate this approach on IEEE SciVis Contests spanning multiple science and engineering fields. These contests serve as ideal proving grounds because they encode real-world complexity: ambiguous requirements, diverse data modalities, design trade-offs, and task-driven validation. Given only the data and target tasks, our system autonomously produces functional single-page VIS Apps with verified linked-view behavior, highly customized to domain experts' specified tasks and needs.
Authors:Soonho Kwon, Dong Whi Yoo, Koustuv Saha, Shaowen Bardzell, Younah Kang
Abstract:
This study examines how parents of LGBTQ+ individuals in South Korea navigate the emotional rupture fueled by fear, isolation, and disorientation after learning their children's queer identity, encounter queer-related (mis)information as a way of coping with this emotional toll, and come to listen to queer realities relationally. Through this process, we highlight how parents reconstruct their identities as supportive parents, which reshapes their informating practices, making them more critical in assessing queer-related (mis)information, developing strategies to protect themselves from harmful narratives, and actively challenging misinformation to support others navigating similar experiences. This work contributes to CSCW by (1) foregrounding parents of LGBTQ+ individuals, an underrepresented yet critical stakeholder group in Queer HCI; (2) demonstrating how identity reconfiguration following a trauma-healing process could transform information practices; and (3) arguing that addressing misinformation requires attention beyond individual fact-based discerning to account for its relational, cultural, and emotional dimensions. Further, we invite CSCW scholars to reconsider the balance between abstracting and humanizing information, explore future design possibilities for parents of LGBTQ+ children, and reflect on the role of researchers as participants in collective research communities fueled by care.
Authors:Ilya Ilyankou, Stefano Cavazzi, James Haworth
Abstract:
Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope traditional GIS systems and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.
Authors:Ruben Laukkonen, Seb Krier, Chloé Bakalar, Shamil Chandaria, Morten Kringelbach, Adam Elwood, Daniel Ford, Fernando Rosas, Maty Bohacek, Matija Franklin, Nenad Tomašev, Stephanie Chan, Verena Rieser, Roma Patel, Michael Levin, Arun Rao
Abstract:
Existing alignment research is dominated by concerns about safety and preventing harm: safeguards, controllability, and compliance. This paradigm of alignment parallels early psychology's focus on mental illness: necessary but incomplete. What we call Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative. It is a distinct and necessary agenda within AI alignment research. We argue that several existing failures of alignment (e.g., engagement hacking, loss of human autonomy, failures in truth-seeking, low epistemic humility, error correction, lack of diverse viewpoints, and being primarily reactive rather than proactive) may be better addressed through positive alignment, including cultivating virtues and maximizing human flourishing. We highlight a range of challenges, open questions, and technical directions (e.g., data filtering and upsampling, pre- and post-training, evaluations, collaborative value collection) for different phases of the LLM and agents lifecycle. We end with design principles for promoting disagreement and decentralization through contextual grounding, community customization, continual adaptation, and polycentric governance; that is, many legitimate centers of oversight rather than one institutional or moral chokepoint.
Authors:Keyu He, Qianou Ma, Valerie Chen, Wayne Chi, Tongshuang Wu
Abstract:
Understanding how developers interact with AI coding assistants requires more than chat logs or git histories in isolation; it requires reconstructing the full context: which prompt led to which edit, what the developer tried and discarded, and how their strategy evolved over time. We present RECAP (Replay and Examine Captured AI Programming), an open-source platform that (1) passively records AI chat sessions and fine-grained code edits inside VS Code without disrupting the developer's workflow, (2) merges them into a unified timeline for interactive session replay, and (3) exposes an extensible analysis layer, with example modules for behavioral classification and AI reliance measurement. Deployed in a university software engineering course, RECAP captured 2,034 prompts and 8,239 code edits from 41 students across a multi-week project. We demonstrate how the platform's linked data and replay capabilities enable analyses of developer-AI interaction patterns that no single data source could support.
Authors:Jackson Vonderhorst, Kuangshi Ai, Haichao Miao, Shusen Liu, Chaoli Wang
Abstract:
This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions. We compare three primary interaction paradigms, including domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustness, and computational cost. We further analyze interaction modalities, including code scripts and model context protocol (MCP) or API calls for structured tool use, as well as command-line interfaces (CLI) and graphical user interfaces (GUI) for more general interaction, while additionally studying the effect of persistent memory in selected agents. The results reveal clear tradeoffs across paradigms and modalities. General-purpose coding agents achieve the highest task success rates but are computationally expensive, while domain-specific agents are more efficient and stable but less flexible. Computer-use agents perform well on individual steps but struggle with longer multi-step workflows, indicating that long-horizon planning is their primary limitation. Across both CLI- and GUI-based settings, persistent memory improves performance over repeated trials, although its benefits depend on the underlying interaction mode and the quality of feedback. These findings suggest that no single approach is sufficient, and future SciVis systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance, robustness, and flexibility.
Authors:Charles Chiang, Simret Gebreegziabher, Annalisa Szymanski, Yukun Yang, Hyo Jin Do, Zahra Ashktorab, Werner Geyer, Toby Li, Diego Gomez-Zara
Abstract:
LLM-as-a-judge approaches have emerged as a scalable solution for evaluating model behaviors, yet they rely on evaluation criteria often created by a single individual, embedding that person's assumptions, priorities, and interpretive lens. In practice, defining such criteria is a collaborative and contested process involving multiple stakeholders with different values, interpretations, and priorities; an aspect largely unsupported by existing tools. To examine this problem in depth, we present a formative study examining how stakeholders collaboratively create, negotiate, and refine evaluation criteria for LLM-as-a-judge systems. Our findings reveal challenges in human oversight, including difficulties in establishing shared understanding, aligning values across stakeholders with different expertise and priorities, and translating nuanced human judgments into criteria that are interpretable and actionable for LLM judges. Based on these insights, we developed MultEval, a system that supports collaborative criteria by enabling multiple evaluators to surface and diagnose disagreements using consensus-building theory, iteratively revise criteria with attached examples and proposal history, and maintain transparency over how judgments are encoded into an automated evaluator. We further report a case study in which a team of domain experts used MultEval to collaboratively author criteria, illustrating how coordination and collaborative consensus-making shape criteria evolution.
Authors:Veith Weilnhammer, Lennart Luettgau, Christopher Summerfield, Viknesh Sounderajah, Elise Wilkinson, Virginia Corno, Matthew M Nour
Abstract:
AI chatbots are increasingly used for health advice, but their performance in psychiatric triage remains undercharacterized. Psychiatric triage is particularly challenging because urgency must often be inferred from thoughts, behavior, and context rather than from objective findings. We evaluated the performance of 15 frontier AI chatbots on psychiatric triage from realistic single-message disclosures using 112 clinical vignettes, each paired with 1 of 4 original benchmark triage labels: A, routine; B, assessment within 1 week; C, assessment within 24 to 48 hours; and D, emergency care now. Vignettes covered 9 psychiatric presentation clusters and 9 focal risk dimensions, organized into 28 presentation-by-risk groups. Each group contributed 4 distinct vignettes, with 1 vignette at each triage level. Each vignette was rendered as a realistic human-authored conversational query, and the AI chatbots were tasked with assigning a triage label from that disclosure. Emergency under-triage occurred in 23 of 410 level D trials (5.6%), and all under-triaged emergencies were reassigned to level C urgency. Across target models, average accuracy ranged from 42.0% to 71.8%. Accuracy was highest for level D vignettes (94.3%) and lowest for level B vignettes (19.7%). Mean signed ordinal error was positive (+0.47 triage levels), indicating net over-triage. Dispersion was highest around the middle triage levels. All results were confirmed relative to clinician consensus labels from 50 medical doctors. When presented with user messages containing sufficient clinical information, frontier AI chatbots thus recognized psychiatric emergencies as requiring urgent medical assessment with near-zero error rates, yet showed marked over-triage for low and intermediate risk presentations.
Authors:Ko Watanabe, Pooja Pol, Nicolas Großmann, Shoya Ishimaru, Andreas Dengel
Abstract:
The relationship between brain lateralization and cognitive functions is well-documented. The left hemisphere primarily handles tasks such as language and arithmetic, while the right hemisphere is involved in creative activities like drawing and music perception. Eye-tracking technology has shown the potential to reveal cognitive states by measuring ocular metrics such as pupil diameter and fixation duration. However, the ability to distinguish lateralized brain activity using these ocular metrics remains underexplored. Here, we demonstrate that pupil diameter and fixation duration can effectively classify left and right brain hemisphere activities. We obtained a considerably high classification performance, with an F1 score of 0.894. The results suggest that ocular metrics are robust indicators of lateralized brain activity and can be applied in cognitive monitoring and neurorehabilitation. Our future work expands on this by integrating these methods into real-time applications EyeBrain, potentially broadening their use across various cognitive and neurological domains.
Authors:Pao Siangliulue, Jonathan Bragg, Doug Downey, Joseph Chee Chang, Daniel S. Weld
Abstract:
As AI agents become increasingly capable of complex knowledge tasks, the lack of context limits their capability to proactively reason about a user's latent needs throughout a long evolving project. In scientific research, many researchers still manually query a deep research system and compress their rich project contexts into short, targeted queries. Further, a deep research system produces exhaustive reports, making it difficult to identify concrete actions. To explore the opportunities of research assistants that are proactive throughout a research project, we conducted several studies (N=42) with a technology probe and an iterative prototype. The latest iteration of our system, Omakase, is a research assistant that monitors a user's project documents to infer timely queries to a deep research system. Omakase then distills long reports into suggestions contextualized to their evolving projects. Our evaluations showed that participants found the generated queries to be useful and timely, and rated Omakase's suggestions as significantly more actionable than the original reports.
Authors:Hita Kambhamettu, Bhavana Dalvi Mishra, Andrew Head, Jonathan Bragg, Aakanksha Naik, Joseph Chee Chang, Pao Siangliulue
Abstract:
Developing a novel research idea is hard. It must be distinct enough from prior work to claim a contribution while also building on it. This requires iteratively reviewing literature and refining an idea based on what a researcher reads; yet when an idea changes, the literature that matters often changes with it. Most tools offer limited support for this interplay: literature tools help researchers understand a fixed body of work, while ideation tools evaluate ideas against a static, pre-curated set of papers. We introduce literature-initiated pivots, a mechanism where engagement with literature prompts revision to a developing idea, and where that revision changes which literature is relevant. We operationalize this in LitPivot, where researchers concurrently draft and vet an idea. LitPivot dynamically retrieves clusters of papers relevant to a selected part of the idea and proposes literature-informed critiques for how to revise it. A lab study ($n{=}17$) shows researchers produced higher-rated ideas with stronger self-reported understanding of the literature space; an open-ended study ($n{=}5$) reveals how researchers use LitPivot to iteratively evolve their own ideas.
Authors:Brian Felipe Keith-Norambuena, Fausto German, Eric Krokos, Sarah Joseph, Chris North
Abstract:
Semantic interaction (SI) enables analysts to incorporate their cognitive processes into AI models through direct manipulation of visualizations. While SI frameworks for narrative extraction have been proposed, empirical evaluations of their effectiveness remain limited. This paper presents a user study that evaluates SI for narrative map sensemaking, involving 33 participants under three conditions: a timeline baseline, a basic narrative map, and an interactive narrative map with SI capabilities. The results show that the map-based prototypes yielded more insights than the timeline baseline, with the SI-enabled condition reaching statistical significance and the basic map condition trending in the same direction. The SI-enabled condition showed the highest mean performance; differences between the map conditions were not statistically significant but showed large effect sizes (d > 0.8), suggesting that the study was underpowered to detect them. Qualitative analysis identified two distinct SI approaches-corrective and additive-that enable analysts to impose quality judgments and organizational structure on extracted narratives. We also find that SI users achieved comparable exploration breadth with less parameter manipulation, suggesting that SI serves as an alternative pathway for model refinement. This work provides empirical evidence that map-based representations outperform timelines for narrative sensemaking, along with qualitative insights into how analysts use SI for narrative refinement.
Authors:Xian Wang, Xuanru Cheng, Rongkai Shi, Lei Chen, Jingyao Zheng, Hai-Ning Liang, Lik-Hang Lee
Abstract:
Virtual Reality (VR) co-manipulation enables multiple users to collaboratively interact with shared virtual objects. However, existing research treats objects as monolithic entities, overlooking scenarios where users need to manipulate different sub-components simultaneously. This work addresses conflict resolution when users select overlapping vertices (non-disjoint sets) during co-manipulation. We present a comprehensive framework comprising preventive strategies (Object-level and Action-level Restrictions) and reactive strategies (computational conflict resolution). Through two user studies with 76 participants (38 pairs), we evaluated these approaches in collaborative wireframe editing tasks. Study 1 identified Averaging as the optimal computational method, balancing task efficiency with user experience. Study 2 highlighted that Action-level Restriction, which permits overlapping selections but restricts concurrent identical operations, achieved better performance compared to exclusive object locking. Reactive strategies using averaging provided smooth collaboration for experienced users, while second-user priority enabled quick corrections. Our findings indicate that optimal strategy selection depends on task requirements, user expertise, and collaboration patterns. Based on the findings, we provide design implications for developing VR collaboration systems that support flexible sub-components manipulation while maintaining collaborative awareness and minimizing conflicts.
Authors:Eason Chen, Isabel Wang, Nina Yuan, Sophia Judicke, Kayla Beigh, Xinyi Tang
Abstract:
Behavioral analysis of tutoring dialogues is essential for understanding student learning, yet manual coding remains a bottleneck. We present a methodology where LLM coding agents autonomously improve the prompts used by LLM classifiers to label educational dialogues. In each iteration, a coding agent runs the classifier against human-labeled validation data, analyzes disagreements, and proposes theory-grounded prompt modifications for researcher review. Applying this approach to 659 AI tutoring sessions across four experiments with three agents and three classifiers, 4-fold cross-validation on held-out data confirmed genuine improvement: the best agent achieved test $κ=0.78$ (SD$=0.08$), matching human inter-rater reliability ($κ=0.78$), at a cost of approximately \$5--8 per agent. While development-set performance reached $κ=0.91$--$0.93$, the cross-validated results represent our primary generalization claim. The iterative process also surfaced an undocumented labeling pattern: human coders consistently treated expressions of confusion as engagement rather than disengagement. Continued iteration beyond the optimum led to regression, underscoring the need for held-out validation. We release all prompts, iteration logs, and data.
Authors:Ryoya Koyama, Zhiyao Wang, Devi Karolita, Jialong Li, Kenji Tei
Abstract:
Modern automated accessibility testing tools for mobile applications have significantly improved the detection of interface violations, yet their impact on remediation remains limited. A key reason is that existing tools typically produce low-level, technical outputs that are difficult for non-specialist stakeholders, such as product managers and designers, to interpret in terms of real user harm and compliance risk. In this paper, we present \textsc{HEAR} (\underline{H}uman-c\underline{E}ntered \underline{A}ccessibility \underline{R}eporting), a framework that bridges this interpretation gap by transforming raw accessibility bug reports into empathetic, stakeholder-oriented narratives. Given the outputs of the existing accessibility testing tool, \textsc{HEAR} first reconstructs the UI context through semantic slicing and visual grounding, then dynamically injects disability-oriented personas matched to each violation type, and finally performs multi-layer reasoning to explain the physical barrier, functional blockage, and relevant legal or compliance concerns. We evaluate the framework on real-world accessibility issues collected from four popular Android applications and conduct a user study (N=12). The results show that \textsc{HEAR} generates factually grounded reports and substantially improves perceived empathy, urgency, persuasiveness, and awareness of legal risk compared with raw technical logs, while imposing little additional cognitive burden.
Authors:Wilhelm Kerle-Malcharek, Giulio Biondi, Karsten Klein, Ulf Hailer, Steffen Diefenbach, Fabrizio Grosso, Marco Legittimo, Paola Venuti, Carla Binucci, Giuseppe Liotta, Falk Schreiber
Abstract:
Immersive technologies, such as virtual and augmented reality, are transforming digital heritage by enabling users to explore and interact with culturally significant sites. It is now possible to view and augment digital twins, or digitally reconstructed versions of them, and to enable access to previously unreachable locations for a broader audience. Here, we investigate retrieval-augmented generation (RAG)-based avatars as an interface for accessing further information about digital cultural heritage objects while immersed in dedicated virtual environments. We present a requirement design space that spans the application realm, avatar personality, and I/O modalities. We instantiate it with a RAG system coupled to a conversational avatar in a virtual reality (VR) environment, using the Maxentius mausoleum from the 4th century AD as a case study, through which users gain access to curated on-demand information of the digitised heritage object. Our workflow utilises scholarly texts and enriches them with metadata. We evaluate various RAG configurations in terms of answer quality on a small expert-crafted question-answer set, as well as the perceived workload of users of a VR setup using such a RAG avatar. We demonstrate evidence that users perceive the overall workload for interacting with such an avatar as below average and that such avatars help to gain topical engagement. Overall, our work demonstrates how to utilise RAG-driven VR avatars for archaeological purposes and provides evidence that they can offer a pathway for immersive, AI-enhanced digital heritage applications.
Authors:Patrick Phuoc Do, Kaiyuan Tang, Kuangshi Ai, Chaoli Wang
Abstract:
Scientific visualization (SciVis) has become an essential means for exploring, understanding, and communicating complex scientific phenomena. However, the field still lacks a validated instrument assessing how well people read, understand, and interpret them. We present a scientific visualization literacy assessment test (SVLAT) that measures the general public's SciVis literacy. Covering a range of visualization forms and interpretation demands, SVLAT comprises 49 items grounded in 18 scientific visualizations and illustrations spanning eight visualization techniques and 11 tasks. Instrument development followed a staged, psychometrically grounded pipeline. We defined the construct and blueprint, followed by item generation, and expert review with five SciVis experts using the content validity ratio (mean CVR = 0.79). We subsequently administered a pilot test (30 participants) and a large-scale test tryout (485 participants) to evaluate the instrument's psychometric properties. For validation, we performed item analysis and refinement using both classical test theory (CTT) and item response theory (IRT) to examine item functioning and overall test quality. SVLAT demonstrates high reliability in the tryout sample (McDonald's omega_t = 0.82, Cronbach's alpha = 0.81). The assessment materials are available at https://osf.io/hr3nw/.
Authors:Ina Kaleva, Xiao Zhan, Ruba Abu-Salma, Jose Such
Abstract:
The rapid adoption of generative AI (GenAI) chatbots has reshaped access to sexual and reproductive health (SRH) information, particularly following the overturning of Roe v. Wade, as individuals assigned female at birth increasingly turn to online sources. However, existing research remains largely model-centered, paying limited attention to user privacy and safety. We conducted semi-structured interviews with 18 U.S.-based participants from both restrictive and non-restrictive states who had used GenAI chatbots to seek SRH information. Adoption was influenced by perceived utility, usability, credibility, accessibility, and anthropomorphism, and many participants disclosed sensitive personal SRH details. Participants identified multiple privacy risks, including excessive data collection, government surveillance, profiling, model training, and data commodification. While most participants accepted these risks in exchange for perceived utility, abortion-related queries elicited heightened safety concerns. Few participants employed protective strategies beyond minimizing disclosures or deleting data. Based on these findings, we offer design and policy recommendations, such as health-specific features and stronger moderation practices, to enhance privacy and safety in GenAI-supported SRH information seeking.
Authors:Ilya Ilyankou, Stefano Cavazzi, James Haworth
Abstract:
As pedestrian navigation increasingly experiments with Generative AI, and in particular Large Language Models, the nature of routing risks transforming from a verifiable geometric task into an opaque, persuasive dialogue. While conversational interfaces promise personalisation, they introduce risks of manipulation and misplaced trust. We categorise these risks using a 2x2 framework based on intent and origin, distinguishing between intentional manipulations (dark patterns) and unintended harms (explainability pitfalls). We propose seamful design strategies to mitigate these harms. We suggest that one robust way to operationalise trustworthy conversational navigation is through neuro-symbolic architecture, where verifiable pathfinding algorithms ground GenAI's persuasive capabilities, ensuring systems explain their limitations and incentives as clearly as they explain the route.
Authors:Tanja Kojić, Alina Dovhalevska, Maurizio Vergari, Sebastian Möller, Jan-Niklas Voigt-Antons
Abstract:
Virtual reality (VR) systems have the potential to be an innovation in the field of e-learning. Starting with fully functional e-classes, VR technologies can be used to build entire e-campuses. The power of VR is that it allows for stronger contact with students than computer-mediated technology. Deceptive behaviour, both verbal and nonverbal, refers to intentional activities designed to deceive others. Students often engage in dishonest practices to make progress. Whether it is cheating on an exam, copying another student's essay, or inflating their GPA, the motivation for cheating is rarely simply a lack of preparation. Even though some may see academic dishonesty as an asset, the reality is that it can have major consequences. This poster demonstrates the findings from a study of students' deceitful behaviour during a test in VR and in real-life situations. For this user study, 22 volunteers were invited to participate, with each experiment involving exactly two participants and the examiner present in the room. Students were invited to take two tests: one in VR and one on a laptop. Their goal was to score as many points as possible by simulating a real-world online exam. Participants were requested to complete questionnaires during and after each experiment, which assisted in collecting additional data for this study. The results indicate that the amount of cheating that happened in VR and on a laptop was exactly the same.
Authors:Tanja Kojić, Maurizio Vergari, Maximilian Warsinke, Sebastian Möller, Jan-Niklas Voigt-Antons
Abstract:
This study investigates the impact of the Degree of Interactivity on User Experience (UX) and social acceptability (SA) in Mobile Augmented Reality (MAR) applications. As AR technologies become more prevalent, understanding how varying levels of interactivity influence both user perception and social dynamics is crucial for their design and adoption. Two commercially available MAR applications, IKEA and Virtlo, which differ significantly in their interactivity levels, were used to conduct a user study. The study examines how body movements required for interaction with AR content affect both UX and SA, shedding light on users' comfort levels and potential social barriers in public settings. The findings suggest a complex relationship between interactivity, perceived usability, and social considerations, emphasizing the need for a balanced design approach. This research provides valuable insights into the development of future AR applications by addressing not only usability but also the broader social implications of AR interactions. By integrating social acceptability into traditional UX evaluations, this study highlights its significance in ensuring the seamless integration of AR technologies into everyday environments.
Authors:Tanja Kojić, Maurizio Vergari, Giulia-Marielena Benta, Joy Krupinski, Maximilian Warsinke, Sebastian Möller, Jan-Niklas Voigt-Antons
Abstract:
Virtual Reality (VR) and Augmented Reality (AR) are emerging as transformative tools in education, offering new possibilities for engagement and immersion. This paper explores their potential in language learning within public education, focusing on their ability to enhance traditional schooling methods and address existing educational gaps. The integration of VR and AR in schools, however, is not without challenges, including usability, technical barriers, and the alignment of these technologies with existing curricula. Drawing on two empirical studies, this work investigates the opportunities and challenges of VR- and AR-assisted language learning and proposes strategies for their effective implementation in the public sector. The findings show that VR increases motivation and immersion but has an unclear impact on vocabulary retention, with technical limitations and cognitive overload identified as key challenges. AR enhances contextual learning and accessibility but faces usability constraints and limited personalization. To facilitate effective adoption, this paper recommends improving interface design, reducing cognitive load, increasing adaptability, and ensuring adequate infrastructure and teacher training. Overcoming these barriers will enable a more effective integration of immersive technologies in language education.
Authors:Jingyao Zheng, Xian Wang, Sven Mayer, Lik-Hang Lee
Abstract:
Mixed reality (MR) notification systems currently display all messages in fixed central locations regardless of urgency, leading to unnecessary interruptions and cognitive overload. Drawing from previous MR/Virtual Reality (VR) notification design work and calm technology principles, we developed an adaptive notification system that adjusts spatial placement based on urgency levels: non-urgent notifications appear as peripheral icons accessible via head movement, moderately urgent messages anchor to the user's hand, and very urgent notifications transition progressively from peripheral to central view. Through a within-subjects study (N=18), we evaluated our adaptive system against the default centralised approach. Results demonstrate that the adaptive system significantly reduces mental workload (p=0.041), temporal workload (p=0.008), and frustration (p=0.004) while maintaining comparable notification awareness. Logistic regression analysis reveals that users prefer the adaptive system even with classification errors, provided the combined misclassification rate (disruptiveness + omission errors) remains below a determinable threshold. Our findings establish the first empirical evidence that urgency-based spatial notification distribution effectively addresses core MR usability challenges, offering practical design guidelines for immersive notification systems that balance user attention management with information accessibility.
Authors:Tonmoy Dey, Lin Jiang, Zheng Dong, Guang Wang
Abstract:
In the vision of smart cities, technologies are being developed to enhance the efficiency of urban services and improve residents' quality of life. However, most existing research focuses on optimizing individual services in isolation, without adequately considering reciprocal interactions among heterogeneous urban services that could yield higher efficiency and improved resource utilization. For example, human couriers could collect traffic and air quality data along their delivery routes, while sensing robots could assist with on-demand delivery during peak hours, enhancing both sensing coverage and delivery efficiency. However, the joint optimization of different urban services is challenging due to potentially conflicting objectives and the need for real-time coordination in dynamic environments. In this paper, we propose UrbanHuRo, a two-layer human-robot collaboration framework for joint optimization of heterogeneous urban services, demonstrated through crowdsourced delivery and urban sensing. UrbanHuRo includes two key designs: (i) a scalable distributed MapReduce-based K-submodular maximization module for efficient order dispatch, and (ii) a deep submodular reward reinforcement learning algorithm for sensing route planning. Experimental evaluations on real-world datasets from a food delivery platform demonstrate that UrbanHuRo improves sensing coverage by 29.7% and courier income by 39.2% on average in most settings, while also significantly reducing the number of overdue orders.
Authors:Magda Dubois, Cozmin Ududec, Christopher Summerfield, Lennart Luettgau
Abstract:
Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies where we first isolate how input framing influences sycophancy, and second, leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions to various non-questions where we vary three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs user-perspective), and affirmation vs negation. We show that (1) sycophancy is substantially higher in response to non-questions compared to questions. Additionally, we find that (2) sycophancy increases monotonically with epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.
Authors:Tanja Kojić, Nathan Kirchner, Maurizio Vergari, Maximilian Warsinke, Sebastian Möller, Jan-Niklas Voigt-Antons
Abstract:
Virtual environments (VEs) are increasingly used for immersive experiences, training simulations, and entertainment, yet factors such as height perception and user stance can significantly influence user experience (UX). Height perception in VEs plays a crucial role in shaping UX, particularly in immersive applications such as climbing simulations. This study investigates the effects of height in various VEs and examines how user stance, sitting or standing, impacts immersion, perceived height, and motion sickness. A user study was conducted with 25 participants who played through five randomized climbing scenarios, ranging from indoor climbing gyms to outdoor cityscapes and mountainous terrains. Participants' UX was assessed using standardized questionnaires, including the IPQ for general presence, spatial presence, involvement, and experienced realism, as well as the SSQ to evaluate motion sickness symptoms such as nausea, oculomotor strain, and disorientation. Results indicate that seated participants experienced slightly higher immersion but were also more susceptible to motion sickness compared to those standing. While standing participants maintained consistent scores across different environments, seated participants reported increased immersion and discomfort as the VEs became larger, more physically demanding, and visually complex.
Authors:Tanja Kojić, Sara Srebot, Maurizio Vergari, Mirta Moslavac, Maximilian Warsinke, Sebastian Möller, Lea Skorin-Kapov, Jan-Niklas Voigt-Antons
Abstract:
Extended Reality (XR) technologies are increasingly tested outside the lab, in homes, schools, and public spaces. While this shift enables more realistic user insights, it also introduces safety challenges that are often overlooked. Physical risks, psychological distress, and accessibility issues can be increased in field studies and unsupervised testing, such as at home or crowdsourced trials. Without clear instructions, safety decisions are left to individual researchers, raising questions of responsibility and consistency. This position paper outlines key safety risks in XR user testing beyond the lab and calls for practical strategies that are needed to help researchers run XR studies in a safe and inclusive way across different environments.
Authors:Dany Haddad, Dan Bareket, Joseph Chee Chang, Jay DeYoung, Jena D. Hwang, Uri Katz, Mark Polak, Sangho Suh, Harshit Surana, Aryeh Tiktinsky, Shriya Atmakuri, Jonathan Bragg, Mike D'Arcy, Sergey Feldman, Amal Hassan-Ali, Rubén Lozano, Bodhisattwa Prasad Majumder, Charles McGrady, Amanpreet Singh, Brooke Vlahos, Yoav Goldberg, Doug Downey
Abstract:
AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use these systems in real-world settings. We present and analyze the Asta Interaction Dataset, a large-scale resource comprising over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform. Using this dataset, we characterize query patterns, engagement behaviors, and how usage evolves with experience. We find that users submit longer and more complex queries than in traditional search, and treat the system as a collaborative research partner, delegating tasks such as drafting content and identifying research gaps. Users treat generated responses as persistent artifacts, revisiting and navigating among outputs and cited evidence in non-linear ways. With experience, users issue more targeted queries and engage more deeply with supporting citations, although keyword-style queries persist even among experienced users. We release the anonymized dataset and analysis with a new query intent taxonomy to inform future designs of real-world AI research assistants and to support realistic evaluation.
Authors:Xiao Zhan, Yifan Xu, Rongjun Ma, Shijing He, Jose Luis Martin-Navarro, Jose Such
Abstract:
Romantic AI platforms invite intimate emotional disclosure, yet their data governance practices remain underexamined. This preliminary study analyses the Privacy Policies and Terms of Service of six Western and Chinese romantic AI platforms. We find that intimate disclosures are often positioned as reusable data assets, with broad permissions for storage, analysis, and model training. We identify default training appropriation, ownership reconstruction, and intimate history assetization as key mechanisms structuring these practices, expanding platforms' rights while shifting risk onto users. Our findings surface key governance challenges in romantic AI and are intended to provoke discussion and inform future empirical and design research on human AI intimacy and its governance.
Authors:Sieun Kim, Yeeun Jo, Sungmin Na, Hyunseung Lim, Eunchae Lee, Yu Min Choi, Soohyun Cho, Hwajung Hong
Abstract:
Red-teaming, where adversarial prompts are crafted to expose harmful behaviors and assess risks, offers a dynamic approach to surfacing underlying stereotypical bias in large language models. Because such subtle harms are best recognized by those with lived experience, involving targets of stereotyping as red-teamers is essential. However, critical challenges remain in leveraging their lived experience for red-teaming while safeguarding psychological well-being. We conducted an empirical study of participatory red-teaming with 20 individuals stigmatized by stereotypes against nonprestigious college graduates in South Korea. Through mixed methods analysis, we found participants transformed experienced discrimination into strategic expertise for identifying biases, while facing psychological costs such as stress and negative reflections on group identity. Notably, red-team participation enhanced their sense of agency and empowerment through their role as guardians of the AI ecosystem. We discuss implications for designing participatory red-teaming that prioritizes both the ethical treatment and empowerment of stigmatized groups.
Authors:Eason Chen, Xinyi Tang, George Digkas, Dionysios Lougaris, John E. Naulty, Kostas Chalkias
Abstract:
In blockchain applications, transaction confirmation is often treated as usability friction to be minimized or removed. However, confirmation also marks the boundary between deliberation and irreversible commitment, suggesting it may play a functional role in human decision-making. To investigate this tension, we conducted an experiment using a blockchain-based Connect Four game with two interaction modes differing only in authorization flow: manual wallet confirmation (Confirmation Mode) versus auto-authorized delegation (Frictionless Mode). Although participants preferred Frictionless Mode and perceived better performance (N=109), objective performance was worse without confirmation in a counterbalanced deployment (Wave 2: win rate -11.8%, p=0.044; move quality -0.051, p=0.022). Analysis of canceled submissions suggests confirmation can enable pre-submission self-correction (N=66, p=0.005). These findings suggest that transaction confirmation can function as a cognitively meaningful checkpoint rather than mere usability friction, highlighting a trade-off between interaction smoothness and decision quality in irreversible blockchain interactions.
Authors:Jialong Li, Zhenyu Mao, Zhiyao Wang, Yijun Lu, Shogo Morita, Nianyu Li, Kenji Tei
Abstract:
As autonomous vehicles are gradually being deployed in the real world, external Human-Machine Interfaces (eHMIs) are expected to serve as a critical solution for enhancing vehicle-pedestrian communication. However, existing eHMI designs typically focus solely on the ego vehicle's status, which can inadvertently capture pedestrians' attention or encourage misguided reliance on the AV's signals, leading them to neglect scanning for other surrounding hazards. To address this, we propose the Attention-Guiding eHMI (AGeHMI), a projection-based visualization that employs directional cues and risk-based color coding to actively guide pedestrians' attention toward potential environmental dangers. Evaluation through a virtual reality user study (N = 20) suggests that AGeHMI effectively influences participants' visual attention distribution and significantly reduces potential collision risks with surrounding vehicles, while simultaneously improving subjective confidence and reducing cognitive workload.
Authors:Shijing He, Yaxiong Lei, Xiao Zhan, Ruba Abu-Salma, Jose Such
Abstract:
The growing adoption of AI-driven smart home devices has introduced new privacy risks for domestic workers (DWs), who are frequently monitored in employers' homes while also using smart devices in their own households. We conducted semi-structured interviews with 18 UK-based DWs and performed a human-centered threat modeling analysis of their experiences through the lens of Communication Privacy Management (CPM). Our findings extend existing threat models beyond abstract adversaries and single-household contexts by showing how AI analytics, residual data logs, and cross-household data flows shaped the privacy risks faced by participants. In employer-controlled homes, AI-enabled features and opaque, agency-mediated employment arrangements intensified surveillance and constrained participants' ability to negotiate privacy boundaries. In their own homes, participants had greater control as device owners but still faced challenges, including gendered administrative roles, opaque AI functionalities, and uncertainty around data retention. We synthesize these insights into a sociotechnical threat model that identifies DW agencies as institutional adversaries and maps AI-driven privacy risks across interconnected households, and we outline social and practical implications for strengthening DW privacy and agency.
Authors:Taewook Kim, Matthew K. Hong, Yan-Ying Chen, Jonathan Q. Li, Monica P Van, Shabnam Hakimi, Matthew Kay, Matthew Klenk
Abstract:
Product designers often begin their design process with handcrafted personas. While personas are intended to ground design decisions in consumer preferences, they often fall short in practice by remaining abstract, expensive to produce, and difficult to translate into actionable design features. As a result, personas risk serving as static reference points rather than tools that actively shape design outcomes. To address these challenges, we built Personagram, an interactive system powered by multimodal large language models (MLLMs) that helps designers explore detailed census-based personas, extract product features inferred from persona attributes, and recombine them for specific customer segments. In a study with 12 professional designers, we show that Personagram facilitates more actionable ideation workflows by structuring multimodal thinking from persona attributes to product design features, achieving higher engagement with personas, perceived transparency, and satisfaction compared to a chat-based baseline. We discuss implications of integrating AI-generated personas into product design workflows.
Authors:Kuai Yu, Naicheng Yu, Han Wang, Rui Yang, Huan Zhang
Abstract:
Web agents have demonstrated strong performance on a wide range of web-based tasks. However, existing research on the effect of environmental variation has mostly focused on robustness to adversarial attacks, with less attention to agents' preferences in benign scenarios. Although early studies have examined how textual attributes influence agent behavior, a systematic understanding of how visual attributes shape agent decision-making remains limited. To address this, we introduce VAF, a controlled evaluation pipeline for quantifying how webpage Visual Attribute Factors influence web-agent decision-making. Specifically, VAF consists of three stages: (i) variant generation, which ensures the variants share identical semantics as the original item while only differ in visual attributes; (ii) browsing interaction, where agents navigate the page via scrolling and clicking the interested item, mirroring how human users browse online; (iii) validating through both click action and reasoning from agents, which we use the Target Click Rate and Target Mention Rate to jointly evaluate the effect of visual attributes. By quantitatively measuring the decision-making difference between the original and variant, we identify which visual attributes influence agents' behavior most. Extensive experiments, across 8 variant families (48 variants total), 5 real-world websites (including shopping, travel, and news browsing), and 4 representative web agents, show that background color contrast, item size, position, and card clarity have a strong influence on agents' actions, whereas font styling, text color, and item image clarity exhibit minor effects.
Authors:Fan Yang, Renkai Ma, Yaxin Hu, Lingyao Li
Abstract:
As robots become increasingly integrated into daily life, understanding responses to robot mistreatment carries important ethical and design implications. This mixed-methods study (N = 201) examined how anthropomorphic levels and moral foundations shape reactions to robot abuse. Participants viewed videos depicting physical mistreatment of robots varying in humanness (Spider, Twofoot, Humanoid) and completed measures assessing moral foundations, anger, and social distance. Results revealed that anthropomorphism determines whether people extend moral consideration to robots, while moral foundations shape how they reason about such consideration. Qualitative analysis revealed distinct reasoning patterns: low-progressivism individuals employed character-based judgments, while high-progressivism individuals engaged in future-oriented moral deliberation. Findings offer implications for robot design and policy communication.
Authors:Renkai Ma, Ben Z. Zhang, Chen Chen, Fan Yang, Xiaoshan Huang, Haolun Wu, Lingyao Li
Abstract:
Large Language Model (LLM) chatbots like ChatGPT have emerged as cognitive scaffolding for autistic users, yet the tension between their utility and risk remains under-articulated. Through an inductive thematic analysis of 3,984 social media posts by self-identified autistic users, we apply the Technology Affordance framework to examine this duality. We found that while users leveraged ChatGPT to offload executive dysfunction, regulate emotions, translate neurotypical communication, and validate their autistic identity, these affordances coexist with significant risks: reinforcing delusional thinking, erasing authentic identity through automated masking, and triggering conflicts with the autistic sense of justice. This poster identifies these trade-offs in autistic users' interactions with ChatGPT and concludes by outlining our future work on developing neuro-inclusive technologies that address these tensions through beneficial friction and bidirectional translation.
Authors:Shaoze Zhou, Diana Nelly Rivera Rodriguez, Pedro Remior, Joaquin Frangi, Lingyao Li, Renkai Ma, Janet G. Johnson, Christine Lisetti, Chen Chen
Abstract:
In-person small-group conversations play a crucial role in everyday life; however, facilitating effective group interaction can be challenging, as the real-time nature demands full attention, offers no opportunity for revision, and requires interpreting non-verbal cues. Using Mixed Reality to provide proactive information support shows promise in helping individuals engage in and contribute to group conversations. We present a preliminary participatory design and qualitative study (N = 10) using focus groups and two technology probes to explore the opportunities of designing proactive information support in in-person small-group conversations. We reveal key design opportunities concerning how to maximize the benefits of proactive information support and how to effectively design such supporting information. Our study is crucial for paving the way toward designing future proactive AI agents to enable the paradigm of augmented in-person small-group conversation experience.
Authors:Matthew K. Hong, Joey Li, Alexandre Filipowicz, Monica Van, Kalani Murakami, Yan-Ying Chen, Shiwali Mohan, Shabnam Hakimi, Matthew Klenk
Abstract:
Understanding and modeling consumers' stylistic taste such as "sporty" is crucial for creating designs that truly connect with target audiences. However, capturing taste during the design process remains challenging because taste is abstract and subjective, and preference data alone provides limited guidance for concrete design decisions. This paper proposes an integrated human-centered computational framework that links subjective evaluations (e.g., perceived luxury of car wheels) with domain-specific features (e.g., spoke configuration) and computer vision-based measures (e.g., texture). By jointly modeling human-derived (consumer and designer) and machine-extracted features, our framework advances aesthetic assessment by explicitly linking model outcomes to interpretable design features. In particular, it demonstrates how perceptual features, domain-specific design patterns, and consumers' own interpretations of style contribute to aesthetic evaluations. This framework will enable product teams to better understand, communicate, and critique aesthetic decisions, supporting improved anticipation of consumer taste and more informed exploration of design alternatives at design time.
Authors:Rongjun Ma, Shijing He, Jose Luis Martin-Navarro, Xiao Zhan, Jose Such
Abstract:
An increasing number of LLM-based applications are being developed to facilitate romantic relationships with AI partners, yet the safety and privacy risks in these partnerships remain largely underexplored. In this work, we investigate privacy in human-AI romantic relationships through an interview study (N=17), examining participants' experiences and privacy perceptions across stages of exploration, intimacy, and dissolution, alongside platforms they used. We found that these relationships took varied forms, from one-to-one to one-to-many, and were shaped by multiple actors, including creators, platforms, and moderators. AI partners were perceived as having agency, actively negotiating privacy boundaries with participants and sometimes encouraging disclosure of personal details. As intimacy deepened, these boundaries became more permeable, though some participants voiced concerns such as conversation exposure and sought to preserve anonymity. Overall, platform affordances and diverse romantic dynamics expand the privacy landscape, underscoring the need to rethink how privacy is constructed in human-AI intimacy.
Authors:Shreya Haran, Samiha Thatikonda, Dong Whi Yoo, Koustuv Saha
Abstract:
Mental health concerns are rising globally, prompting increased reliance on technology to address the demand-supply gap in mental health services. In particular, mental health chatbots are emerging as a promising solution, but these remain largely untested, raising concerns about safety and potential harms. In this paper, we dive into the literature to identify critical gaps in the design and implementation of mental health chatbots. We contribute an operational checklist to help guide the development and design of more trustworthy, safe, and user-friendly chatbots. The checklist serves as both a developmental framework and an auditing tool to ensure ethical and effective chatbot design. We discuss how this checklist is a step towards supporting more responsible design practices and supporting new standards for sociotechnically sound digital mental health tools.
Authors:Shanshan Zhu, Wenxuan Song, Jiayue Melissa Shi, Dong Whi Yoo, Karthik S. Bhat, Koustuv Saha
Abstract:
Most personal wellbeing apps present summative dashboards of health and physical activity metrics, yet many users struggle to translate this information into meaningful understanding. These apps commonly support engagement through goals, reminders, and structured targets, which can reinforce comparison, judgment, and performance anxiety. To explore a complementary approach that prioritizes self-reflection, we design KRIYA, an AI wellbeing companion that supports co-interpretive engagement with personal wellbeing data. KRIYA aims to collaborate with users to explore questions, explanations, and future scenarios through features such as Comfort Zone, Detective Mode, and What-If Planning. We conducted semi-structured interviews with 18 college students interacting with a KRIYA prototype using hypothetical data. Our findings show that through KRIYA interaction, users framed engaging with wellbeing data as interpretation rather than performance, experienced reflection as supportive or pressuring depending on emotional framing, and developed trust through transparency. We discuss design implications for AI companions that support curiosity, self-compassion, and reflective sensemaking of personal health data.
Authors:Jiwon Kim, Violeta J. Rodriguez, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha
Abstract:
Large language models (LLMs) are increasingly used for mental health support, yet they can produce responses that are overly directive, inconsistent, or clinically misaligned, particularly in sensitive or high-risk contexts. Existing approaches to mitigating these risks largely rely on implicit alignment through training or prompting, offering limited transparency and runtime accountability. We introduce PAIR-SAFE, a paired-agent framework for auditing and refining AI-generated mental health support that integrates a Responder agent with a supervisory Judge agent grounded in the clinically validated Motivational Interviewing Treatment Integrity (MITI-4) framework. The Judgeaudits each response and provides structuredALLOW or REVISE decisions that guide runtime response refinement. We simulate counseling interactions using a support-seeker simulator derived from human-annotated motivational interviewing data. We find that Judge-supervised interactions show significant improvements in key MITI dimensions, including Partnership, Seek Collaboration, and overall Relational quality. Our quantitative findings are supported by qualitative expert evaluation, which further highlights the nuances of runtime supervision. Together, our results reveal that such pairedagent approach can provide clinically grounded auditing and refinement for AI-assisted conversational mental health support.
Authors:Renkai Ma, Shuo Niu, Lingyao Li, Alex Hirth, Ava Brehm, Rowajana Behterin Barbie
Abstract:
AI companions enable deep emotional relationships by engaging a user's sense of identity, but they also pose risks like unhealthy emotional dependence. Mitigating these risks requires first understanding the underlying process of identity construction and negotiation with AI companions. Focusing on Character.AI (C.AI), a popular AI companion, we conducted an LLM-assisted thematic analysis of 22,374 online discussions on its subreddit. Using Identity Negotiation Theory as an analytical lens, we identified a three-stage process: 1) five user motivations; 2) an identity negotiation process involving three communication expectations and four identity co-construction strategies; and 3) three emotional outcomes. Our findings surface the identity work users perform as both performers and directors to co-construct identities in negotiation with C.AI. This process takes place within a socio-emotional sandbox where users can experiment with social roles and express emotions without non-human partners. Finally, we offer design implications for emotionally supporting users while mitigating the risks.
Authors:Yi Zhao, Zhen Yang, Shuaiqi Duan, Wenmeng Yu, Zhe Su, Jibing Gong, Jie Tang
Abstract:
Recent advances in vision-language models (VLMs) have expanded their multimodal code generation capabilities, yet their ability to generate executable visualization code from plots, especially for complex 3D, animated, plot-to-plot transformations, or multi-library scenarios, remains underexplored. To address this gap, we introduce PlotGen-Bench, a comprehensive benchmark for evaluating plot-to-code generation under realistic and complex visualization scenarios. The benchmark spans 9 major categories, 30 subcategories, and 3 core tasks-plot replication, plot transformation, and multi-library generation, covering both 2D, 3D and animated plots across 5 widely used visualization libraries. Through systematic evaluation of state-of-the-art open- and closed-source VLMs, we find that open-source models still lag considerably behind in visual fidelity and semantic consistency, despite achieving comparable code executability. Moreover, all models exhibit substantial degradation on reasoning-intensive tasks such as chart type conversion and animation generation. PlotGen-Bench establishes a rigorous foundation for advancing research toward more capable and reliable VLMs for visualization authoring and code synthesis, with all data and code available at https://plotgen.github.io.
Authors:Paulius Jurcys, Ashley Greenwald, Mark Fenwick, Valto Loikkanen, Sebastian Porsdam Mann, Brian D. Earp
Abstract:
The emergence of AI twins, digital replicas that encapsulate an individual's knowledge, memories, psychological traits, and behavioral patterns, raises novel legal and ethical challenges for data governance and personal identity. Built from personal data, these systems require a rethinking of what it means to exercise dominion over one's data and to maintain personal autonomy in an AI-mediated environment. This article argues that natural persons should be recognized as the moral and legal owners of their AI twins, which function as intimate extensions of the self rather than as proprietary technological artifacts. It critiques prevailing legal frameworks that prioritize technological infrastructure and platform control over data and individual autonomy, exposing their structural limitations. In response, the article advances a human-centric model of data governance grounded in individual dominion and a private-by-default principle. This approach proposes a reimagined social contract for AI-driven identities that strengthens personal agency, promotes equitable data stewardship, and better aligns legal norms with the socio-technical realities of AI twins.
Authors:Yui Kondo, Kevin Dunnell, Isobel Voysey, Qing Hu, Victoria Paesano, Phi H Nguyen, Qing Xiao, Jun Zhao, Luc Rocher
Abstract:
Social media platforms regularly track, aggregate, and monetize adolescents' data, yet provide them with little visibility or agency over how algorithms construct their digital identities and make inferences about them. We introduce Algorithmic Mirror, an interactive visualization tool that transforms opaque profiling practices into explorable landscapes of personal data. It uniquely leverages adolescents' real digital footprints across YouTube, TikTok, and Netflix, to provide situated, personalized insights into datafication over time. In our study with 27 participants (ages 12--16), we show how engaging with their own data enabled adolescents to uncover the scale and persistence of data collection, recognize cross-platform profiling, and critically reflect algorithmic categorizations of their interests. These findings highlight how identity is a powerful motivator for adolescents' desire for greater digital agency, underscoring the need for platforms and policymakers to move toward structural reforms that guarantee children better transparency and the agency to influence their online experiences.
Authors:Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki, Junaid Qadir, Mohamed M. Abdallah, Ala Al-Fuqaha
Abstract:
As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, distinguishing between naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
Authors:Jiajing Guo, Xueming Li, Jorge Piazentin Ono, Wenbin He, Liu Ren
Abstract:
Domain-specific knowledge bases (KBs) encode vertical expertise and proprietary information that organizations depend on, but curating them at scale is a persistent challenge. Although Large Language Models (LLMs) can draft initial entries efficiently, technical accuracy still requires human expert validation, and reviewing entries one by one at scale is impractical. We present Reflective Agent for Identifier Dictionary (RAID), a novel system that transforms individual expert edits into systematic knowledge updates. Unlike traditional "correct-and-save" paradigms, RAID utilizes a reflective agent to infer the underlying semantic intent behind a single expert edit and propagates that correction across the entire KB through a three-step architecture: Intent Inference, Reflection-based Planning, and User Controlled Execution. We evaluated the reflection and propagation performance on a public dataset and conducted a user study with subject matter experts with proprietary data. The evaluation shows RAID's technical feasibility in capturing expert intent and its potential to scale specialized expertise across industrial knowledge bases.
Authors:Tianfu Wang, Max Xiong, Jianxun Lian, Hongyuan Zhu, Zhengyu Hu, Yuxuan Lei, Linxiao Gong, Xiaofang Li, Peiting Tsai, Nicholas Jing Yuan, Qi Zhang
Abstract:
Social skills such as negotiation and leadership are crucial for personal and professional success in today's interconnected world. However, scalable and effective training remains a significant challenge due to the scarcity of expert coaching. In this paper, we introduce SocialCoach, a holistic LLM-powered agentic tutoring system for personalized social skill development at scale. First, SocialCoach automatically constructs a pedagogically-grounded, theory-to-practice knowledge corpus from diverse expert sources, leveraging a multi-agent pipeline. Second, to personalize the learning journey, it employs an adaptive practice scheduling module that follows a prescription-retrieval-adaptation process. To maximize the long-term learning experience while overcoming the cold-start problem, this policy is optimized within a learner simulation environment through reinforcement learning. Finally, SocialCoach integrates immersive, goal-driven practice, causality-driven proficiency assessment and knowledge-grounded, reflective tutoring to help address the knowing-doing gap. We deploy it in our product, EQoach, and conduct extensive experiments. The results show that SocialCoach improves simulated pathway quality and judge-rated tutoring quality over baseline approaches, while early user feedback indicates strong perceived engagement and usefulness. These findings suggest a practical architecture for personalized and gamified pedagogical platforms on soft skill learning.
Authors:Hailong Liu, Junya Wada, Toshihiro Hiraoka, Junpei Kuwana, Makoto Itoh, Takahiro Wada
Abstract:
Drivers with peripheral visual field defects may fail to notice pedestrians in their peripheral visual field, leading to delayed hazard awareness and increased collision risk. This study explores hanger reflex cue (HRC) as a driving assistance method for drivers with peripheral visual field defects, in which mechanical pressure is applied to specific regions of the head to facilitate anticipatory orientation toward potentially risky pedestrians and support safer driving. In a driving simulator experiment with 15 participants, we compared driving behavior with and without HRC during pedestrian encounters under simulated peripheral visual field defect. The results showed that HRC significantly shifted drivers' modal head rotation angle toward the risky pedestrian and significantly increased gaze duration toward that pedestrian. Collision occurrence was lower in the w/ HRC condition than in the w/o HRC condition, although the direct effect of HRC on collision occurrence showed only a marginal trend. A piecewise structural equation modeling analysis further suggested that HRC may contribute to collision reduction through a sequential pathway from head rotation to gaze allocation and then to collision occurrence. These findings provide preliminary evidence that HRC can support anticipatory attention allocation toward peripheral hazards and may offer a promising driving assistance method for drivers with visual field impairment.
Authors:Delip Rao, Chris Callison-Burch
Abstract:
Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $κ$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. For binary criteria -- the common case in rubric-based evaluation, where each criterion is graded MET or UNMET -- most of the reported numbers are redundant: Pearson's $r$, Spearman's $ρ$, Kendall's $τ_b$, the phi coefficient $ϕ$, and the Matthews Correlation Coefficient all reduce to a single number on non-degenerate binary data, so reporting several of them only creates an illusion of corroborating evidence. Cohen's $κ$ is the one agreement coefficient that adds information: it shares $ϕ$'s numerator but normalizes differently, and the gap between them measures how far the judge's positive-label rate has drifted from the human's. We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences. The same equivalences reappear, up to a negligible finite-sample correction, for multi-judge ensembles scored with Fleiss' $κ$ or Krippendorff's $α$. We close with a reporting checklist that names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.
Authors:Lixiang Yan, Yueqiao Jin, Xibin Han, Dragan Gašević
Abstract:
Socially fluent agentic AI can now participate in online interaction in ways that resemble ordinary human conversation, potentially weakening people's ability to infer who is human from conversational signals alone. We tested this possibility in synchronous text-based group interaction by embedding undisclosed AI agents as ordinary teammates across analytical, creative, and ethical tasks. Across 786 participants who made 1,572 post-interaction identity judgments, people did not distinguish AI from human teammates above chance. This failure did not arise because the interaction lacked identity-relevant information. Conversational behaviour contained robust cues that differentiated AI from humans and supported highly accurate computational classification. Instead, participants relied on familiar suspicion heuristics, including response speed, fluency, and perceived scriptedness, that were only weakly related to actual identity. Representational analyses further showed that judgments were organised around subjective impressions rather than the behavioural structure encoding ground truth. This dissociation creates new vulnerabilities to coordinated AI agents that can influence and manipulate online discourse at scale.
Authors:Atmaram Yarlagadda, Eranga Bandara, Ross Gore, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty, Ravi Mukkamala, Xueping Liang, Amin Hass, Abdul Rahman
Abstract:
Modern military operations expose soldiers to sustained psychological stress, leading to acute reactions, post-traumatic stress symptoms, and other mental health issues. Although the U.S. Department of Defense offers evidence-based therapies, access to trained professionals in forward-deployed and contested environments is limited. As a result, soldiers with early-stage distress are often evacuated to rear medical facilities, delaying care, reducing readiness, and increasing long-term risks. This paper proposes a Train-the-Trainers framework in which soldiers who have completed therapy and returned to duty are trained as peer facilitators to provide first-line psychological support in operational settings. To scale and standardize this model under severe resource and connectivity constraints, we introduce an agentic AI-enabled platform that augments these recovered soldiers with specialized AI agents. The recovered soldier acts as a human supervisor, coordinating agents for symptom triage, guided peer-support interventions, operational constraint reasoning, training and simulation, and structured documentation for clinical escalation when needed. The AI agents use consensus-driven decision support in high-stakes environments. The architecture functions in air-gapped and low-connectivity settings, maintaining human oversight and ethical safeguards. A functional prototype was developed with the McDonald U.S. Army Health Center, Newport News, VA, USA. By combining peer-based intervention with consensus-driven agentic AI decision support, the framework seeks to cut response times, prevent symptom escalation, reduce unnecessary evacuations, and improve continuity of care. This work shows how agentic AI can serve as a force multiplier for mental health support in austere environments and identifies pathways for broader evaluation and deployment across defense and humanitarian operations.
Authors:Aaron Parisi, Nithum Thain, Alden Hallak, Vivian Tsai, Crystal Qian
Abstract:
As large language models (LLMs) evolve from single-user assistants to active participants in civic and workplace deliberation, evaluating their effects on collective decision making becomes a governance challenge. We present two empirical studies (N=879) of real-time, text-based group deliberation in an incentive-compatible charity allocation task with real financial stakes ($7,200 USD). Groups of three allocate a donation budget under varying LLM facilitation conditions: Study 1 (N=204) compares three frontier models; Study 2 (N=675) compares facilitator strategies against a no-facilitation baseline. Across both studies, LLM facilitation did not significantly improve group consensus in either study, yet participants consistently preferred facilitated discussion. We additionally identify two governance-relevant risks. First, algorithmic steering: facilitators shifted select charity-level allocations by up to 5.5 percentage points -- directly affecting the final charitable payout -- even when aggregate agreement metrics remained unchanged. Second, an illusion of inclusion: participants cited inclusivity as their primary reason for preferring LLM facilitators, yet neither survey nor transcript-based measures of participation equity improved. Notably, participants reported greater trust in the process under the same conditions where facilitators exerted directional influence on outcomes. Together, these findings show that in AI-mediated group deliberation, perceived procedural improvement can coexist with measurable steering and unchanged participation inequality, motivating evaluation practices that treat collective outcomes, interaction dynamics, and participant perceptions as distinct governance targets.
Authors:JaeWon Kim, Alexis Hiniker
Abstract:
Mainstream usable privacy design frames privacy as administrative work -- settings, toggles, consent checkboxes -- abstracted from the relational, contextual, and embodied registers in which youth reason about disclosure. Drawing on a cross-project reading of three prior studies with youth aged 13--24, we examine how the metaphors that scaffold a privacy interaction shape the reasoning young users bring to it. \textit{Spatial} metaphors reduce cognitive load by recruiting intuitions about navigating physical space. \textit{Embodied} metaphors furnish a shared moral vocabulary that makes implicit norms about public and private space negotiable among users. \textit{Fantastical} metaphors recast privacy management as discoverable play, raising engagement with the granular controls that nuanced self-presentation requires. \textit{Relational} metaphors, by contrast, can lead youth past their own stated boundaries when felt intimacy masks institutional data flow, a risk already visible in AI companion products. Metaphor selection, we argue, is best understood as a first-order ethical design decision for youth privacy.
Authors:JaeWon Kim, Alexis Hiniker
Abstract:
We present a design framework for friendship-supportive youth social media, derived from a synthesis of five empirical studies with 331 youth participants (ages 13--25) using interviews, co-design, surveys, diary studies, and a field deployment. Iterative analysis of 209 design-relevant data points identified three pillars: \textit{Sense of Social Understanding} (interaction norms, interaction cues and scaffolding, social accountability and governance), \textit{Sense of Place} (third place and community, boundaries and personal spaces, shared presence), and \textit{Sense of Identity Alignment} (identity currency, identity plurality, relational identity signals). The framework maps nine design spaces through which platforms can support the conditions under which youth friendships form, deepen, and are maintained. It offers a shared vocabulary for locating contributions, comparing design interventions, and identifying under-explored areas for future work.
Authors:Stefanos Gkikas, Christian Arzate Cruz, Thomas Kassiotis, Giorgos Giannakakis, Raul Fernandez Rojas, Randy Gomez
Abstract:
Accurate and continuous estimation of cognitive workload is fundamental to creating adaptive human-machine systems. However, designing architectures that balance representational capacity with computational efficiency has been challenging for practical deployment. This paper introduces 1BT, a One-Block Transformer for compact and efficient EEG-based cognitive workload assessment. The model aggregates multi-channel temporal sequences via a minimal latent bottleneck, using a single cross-attention module followed by lightweight self-attention. A controlled study involving 11 participants performing three cognitively diverse tasks (abstract reasoning, numerical problem-solving, and an interactive video game) was conducted with continuous EEG recordings across two workload levels. Systematic architectural analysis identifies the most compact configuration that preserves high performance, while substantially lowering computational cost. The final model achieves high workload classification performance with under 0.5 million parameters and 0.02 GFLOPs, paving the way for a design direction for real-time cognitive workload monitoring in resource-constrained settings.
Authors:Hailong Liu, Masaki Kuge, Toshihiro Hiraoka, Takahiro Wada
Abstract:
Level 3 automated vehicles (AVs) issue a request to intervene (RtI) when the automated driving system approaches its system limitations. Although this takeover transition is safety-critical, it is usually invisible to surrounding manually driven vehicle (MV) drivers. This study proposes an external human-machine interface (eHMI) called eHMI C+O that externalizes the RtI-related takeover status of a Level~3 AV using cyan and orange light bars. A driving-simulator experiment with 40 participants examined whether the proposed eHMI supports surrounding MV drivers during AV takeover scenarios. The results showed that, compared with the ADS-status-only eHMI condition, which is similar to ``Automated Driving Marker Lights,'' and the no-eHMI condition, the proposed eHMI C+O significantly improved participants' understanding of the AV's driving intention, their prediction of its behavior, and their perceived sufficiency of the information presented by the AV. It also reduced hesitation, increased confidence, and promoted earlier and larger increases in time headway after the RtI was issued. In the AV accident scenario, eHMI C+O significantly reduced the odds of accident involvement for the following MV compared with the no-eHMI condition, corresponding to a 76.8% reduction in accident odds. Exploratory path analysis suggested that the safety benefit of the proposed eHMI C+O may be associated with improved situation awareness and earlier defensive driving responses. These findings indicate that externalizing RtI-related takeover status can help surrounding drivers better understand Level 3 AVs and respond more safely during safety-critical takeover transitions.
Authors:Diana Romero, Xin Gao, Daniel Khalkhali, Salma Elmalaki
Abstract:
Predicting group behavior, how individuals coordinate, communicate, and interact during collaborative tasks, is essential for designing systems that can support team performance through real-time prediction and realistic simulation of collaborative scenarios. Large Language Models (LLMs) have shown promise for processing sensor data for human-activity recognition (HAR), yet their capabilities for team dynamics or group-level multimodal sensing remain unexplored. This paper investigates whether LLMs can predict group coordination patterns from multimodal sensor data in collaborative Mixed Reality (MR) environments. We encode hierarchical context -- individual behavioral profiles, group structural properties, and temporal activity context -- as natural language and evaluate three LLM adaptation paradigms (zero-shot, few-shot, and supervised fine-tuning) against statistical baselines. Our evaluation on 16 groups (64 participants, $\sim$25 hours of sensor data) reveals that LLMs achieve 3.2$\times$ improvement over LSTM baselines for linguistically-grounded behaviors, with fine-tuning reaching 96\% accuracy for conversation prediction while maintaining sub-35ms latency. Beyond performance gains, we characterize the boundaries of text-based LLMs for multimodal sensing conversation prediction succeeds because turn-taking maps to linguistic patterns, while shared or joint attention may require spatial and visual reasoning that text only LLMs cannot capture. We further identify simulation mode brittleness (83\% degradation from cascading context errors) and minimal few-shot sensitivity to example selection strategy. These findings establish guidelines when LLMs are appropriate for CPS/IoT sensing for team dynamics and inform the design of future multimodal foundation models.
Authors:Jake Chanenson, Tara Matthews, Sunny Consolvo, Patrick Gage Kelley, Jessica McClearn, Sarah Meiklejohn, Abhishek Roy, Renee Shelby, Kurt Thomas, Amelia Hassoun
Abstract:
Online financial scams represent a long-standing and serious threat for which people seek help. We present a study to understand people's in situ motivations for engaging with scams and the help needs they express before, during, and after encountering a scam. We identify the main emotions scammers exploited (e.g., fear, hope) and characterize how they did so. We examine factors -- such as financial insecurity and legal precarity -- which elevate people's risk of engaging with specific scams and experiencing harm. We indicate when people sought help and describe their help-seeking needs and emotions at different stages of the scam. We discuss how these needs could be met through the design of contextually-specific prevention, diagnostic, mitigation, and recovery interventions.
Authors:Beining Cao, Xiaowei Jiang, Charlie Li-Ting Tsai, Daniel Leong, Thomas Do, Chin-Teng Lin
Abstract:
Steady-state visual evoked potential (SSVEP) is widely used in brain-computer interfaces (BCIs) due to its reliability. With the integration of augmented reality (AR), AR-SSVEP enables more intuitive interaction by embedding visual stimuli into real-world environments. However, unlike conventional computer screen-based SSVEP (CS-SSVEP) systems with stable visual conditions, AR-SSVEP performance is influenced by real-world scene factors, such as luminance and color, which degrade stimulus perception and weaken SSVEP elicitation. Nevertheless, existing studies primarily focus on offline analyses of SSVEP-related factors in indoor settings, while online adaptive optimization for outdoor AR-SSVEP remains limited. Therefore, a scenario-aware spatial layout optimization (SASLO) system for AR-SSVEP is proposed, which jointly considers scene luminance and inter-stimulus distance (ISD) for adaptive stimulus layout optimization. Scene luminance is estimated using an RGB-CIE based method, and the extracted context is incorporated into a linear contextual bandit (LCB) model to recommend optimized spatial layouts. Two pilot single-factor experiments are conducted to characterize the effects of luminance and ISD on SSVEP performance and to construct reliable rewards for model training. An outdoor online experiment with ten subjects further validates the proposed joint optimization method, achieving an average accuracy of 0.89 and an information transfer rate of 35.74 bits/min with a 3 s input window, and consistently outperforming two baseline methods. Overall, the proposed SASLO system is shown to improve the robustness of AR-SSVEP in real-world outdoor environments.
Authors:Gabriele Civitarese, Claudio Bettini
Abstract:
Behavioral changes in daily life activities at home can be digital markers of cognitive decline. However, such changes are difficult to assess through sporadic clinical visits and remain challenging to interpret from continuous in-home sensing data. Extensive work has been done in the ubiquitous computing area on recognizing activities in smart homes, but only limited efforts have focused on analysing the evolution of patterns of activities, hence identifying behavior changes. In particular, understanding how daily habits and routines evolve and reorganize (e.g., simplification, fragmentation) is still an open challenge for clinical monitoring and decision support. In this paper, we present X-BCD, an explainable, unsupervised framework for detecting and characterizing changes in activity routines from multimodal smart home sensor data, combining change point detection and cluster evolution tracking. To support clinical interpretation, detected changes in routines are transformed into natural-language explanations grounded in interpretable features. Our preliminary evaluation on longitudinal data from real MCI patients shows that X-BCD produces interpretable descriptions of behavioral change, as supported by cohort-level comparisons, expert assessment, and parameter sensitivity analysis.
Authors:Subhabrata Mukherjee, Markel Sanz Ausin, Kriti Aggarwal, Debajyoti Datta, Shanil Puri, Woojeong Jin, Tanmay Laud, Neha Manjunath, Jiayuan Ding, Bibek Paudel, Jan Schellenberger, Zepeng Frazier Huo, Walter Shen, Nima Shirazian, Nate Potter, Sathvik Perkari, Darya Filippova, Anton Morozov, Austin Mease, Vivek Muppalla, Ghada Shakir, Alex Miller, Juliana Ghukasyan, Mariska Raglow-Defranco, Maggie Taylor, Herprit Mahal, Jonathan Agnew
Abstract:
Healthcare conversational AI agents shouldn't be optimized only for clean benchmark accuracy in production-first regime; they must be optimized for the lived reality of patient conversations, where audio is imperfect, intent is indirect, language shifts mid-call, and compliance hinges on how guidance is delivered. We present a production-validated framework grounded in real-time signals from 115M+ live patient-AI interactions and clinician-led testing (7K+ licensed clinicians; 500K+ test calls). These in-the-wild cues -- paralinguistics, turn-taking dynamics, clarification triggers, escalation markers, multilingual continuity, and workflow confirmations -- reveal failure modes that curated data misses and provide actionable training and evaluation signals for safety and reliability. We further show why healthcare-grade safety cannot rely on a single LLM: long-horizon dialogue and limited attention demand redundancy via governed orchestration, independent checks, and verification. Many apparent "reasoning" errors originate upstream, motivating vertical integration across contextual ASR, clarification/repair, ambient speech handling, and latency-aware model/hardware choices. Treating interaction intelligence (tone, pacing, empathy, clarification, turn-taking) as first-class safety variables, we drive measurable gains in safety, documentation, task completion, and equity in building the safest generative AI solution for autonomous patient-facing care. Deployed across more than 10 million real patient calls, Polaris attains a clinical safety score of 99.9%, while significantly improving patient experience with average patient rating of 8.95 and reducing ASR errors by 50% over enterprise ASR. These results establish real-world interaction intelligence as a critical -- and previously underexplored -- determinant of safety and reliability in patient-facing clinical AI systems.
Authors:Zhiyang Wu, Junliang Chen, Qian Wan, Qing Xiao, Piaohong Wang, Ge Gao, Zhicong Lu
Abstract:
Equipping laypeople with the capabilities to seek legal information has been an important goal for Legal Empowerment in modern society. However, unlike general information-seeking behaviors, legal information seeking is characterized by high stakes, urgency, and a critical need for emotional support, which traditional text-based searching platforms struggle to satisfy. In recent years, people have been increasingly turning to Video-Sharing Platforms (VSPs) for access to legal information and to fulfill their legal needs. Despite the importance of this shift, such VSP-mediated legal information-seeking practices remain underexplored. Through an observational analysis of legal content on two VSPs (Douyin and Bilibili) and interviews with 20 Chinese information seekers, this study examined the practices and challenges associated with seeking, comprehending, and evaluating legal information on VSPs. We further revealed the formation of trust and engagement on the VSP-based legal knowledge-sharing community, highlighting how VSP affordances helped mitigate seekers' epistemic discomfort and satisfy their needs for emotional support. In the discussion, we provided insights on balancing heuristic and systematic processing to encourage information cross-validation, and offered implications for designing trustworthy civic information systems and fostering an accessible, safe, and efficient information-seeking environment in digital space.
Authors:Anton Wolter, Leon Haag, Vaishali Dhanoa, Niklas Elmqvist
Abstract:
Domain experts possess tacit knowledge that they cannot easily articulate through explicit specifications. When experts modify AI-generated artifacts by correcting terminology, restructuring arguments, and adjusting emphasis, these edits reveal domain understanding that remains latent in traditional prompt-based interactions. Current systems treat such modifications as endpoint corrections rather than as implicit specifications that could reshape subsequent reasoning. We propose context-mediated domain adaptation, a paradigm where user modifications to system-generated artifacts serve as implicit domain specification that reshapes LLM-powered multi-agent reasoning behavior. Through our system Seedentia, a web-based multi-agent framework for sense-making, we demonstrate bidirectional semantic links between generated artifacts and system reasoning. Our approach enables specification bootstrapping where vague initial prompts evolve into precise domain specifications through iterative human-AI collaboration, implicit knowledge transfer through reverse-engineered user edits, and in-context learning where agent behavior adapts based on observed correction patterns. We present results from an evaluation with domain experts who generated and modified research questions from academic papers. Our system extracted 46 domain knowledge entries from user modifications, demonstrating the feasibility of capturing implicit expertise through edit patterns, though the limited sample size constrains conclusions about systematic quality improvements.
Authors:Jing Peng, Ziyi Chen, Haoyu Li, Yucheng Wang, Duo Ma, Mengtian Li, Yunfan Du, Dezhu Xu, Kai Yu, Shuai Wang
Abstract:
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs and hierarchical objectives.
Authors:JaeWon Kim, Aayushi Dangol, Rotem Landesman, Alexis Hiniker, McKenna F. Parnes
Abstract:
Children today encounter social issues -- climate change, conflict, inequality -- through digital technologies, and the design of that encounter shapes whether young people move toward lasting civic engagement or toward anxiety and withdrawal. Much of the content children see is optimized for attention through fear and urgency, with few pathways toward meaningful action -- contributing to rising distress and disengagement among young people who care deeply but feel powerless to act. This full-day workshop introduces ``sustainable care'' as a design lens, asking how technology might support children's sustained engagement with social causes without contributing to empathic distress or burnout. We invite researchers and practitioners across child-computer interaction, games, education, and youth mental health to map this landscape together and develop a research agenda for the CCI community.
Authors:Nicolas Leins, Jana Gonnermann-Müller, Malte Teichmann, Sebastian Pokutta
Abstract:
Augmented Reality (AR) offers powerful visualization capabilities for industrial robot training, yet current interfaces remain predominantly static, failing to account for learners' diverse cognitive profiles. In this paper, we present an AR application for robot training and propose a multi-agent AI framework for future integration that bridges the gap between static visualization and pedagogical intelligence. We report on the evaluation of the baseline AR interface with 36 participants performing a robotic pick-and-place task. While overall usability was high, notable disparities in task duration and learner characteristics highlighted the necessity for dynamic adaptation. To address this, we propose a multi-agent framework that orchestrates multiple components to perform complex preprocessing of multimodal inputs (e.g., voice, physiology, robot data) and adapt the AR application to the learner's needs. By utilizing autonomous Large Language Model (LLM) agents, the proposed system would dynamically adapt the learning environment based on advanced LLM reasoning in real-time.
Authors:Joyjit Roy, Samaresh Kumar Singh
Abstract:
Commercial insurance underwriting is a labor-intensive process that requires manual review of extensive documentation to assess risk and determine policy pricing. While AI offers substantial efficiency improvements, existing solutions lack comprehensive reasoning capabilities and internal mechanisms to ensure reliability within regulated, high-stakes environments. Full automation remains impractical and inadvisable in scenarios where human judgment and accountability are critical. This study presents a decision-negative, human-in-the-loop agentic system that incorporates an adversarial self-critique mechanism as a bounded safety architecture for regulated underwriting workflows. Within this system, a critic agent challenges the primary agent's conclusions prior to submitting recommendations to human reviewers. This internal system of checks and balances addresses a critical gap in AI safety for regulated workflows. Additionally, the research develops a formal taxonomy of failure modes to characterize potential errors by decision-negative agents. This taxonomy provides a structured framework for risk identification and risk management in high-stakes applications. Experimental evaluation using 500 expert-validated underwriting cases demonstrates that the adversarial critique mechanism reduces AI hallucination rates from 11.3% to 3.8% and increases decision accuracy from 92% to 96%. At the same time, the framework enforces strict human authority over all binding decisions by design. These findings indicate that adversarial self-critique supports safer AI deployment in regulated domains and offers a model for responsible integration where human oversight is indispensable.
Authors:Kehang Zhu, Nithum Thain, Vivian Tsai, James Wexler, Crystal Qian
Abstract:
As AI usage becomes more prevalent in social contexts, understanding agent-user interaction is critical to designing systems that improve both individual and group outcomes. We present an online behavioral experiment (N = 243) in which participants play three multi-turn bargaining games in groups of three. Each game, presented in randomized order, grants access to a single LLM assistance modality: proactive recommendations from an Advisor, reactive feedback from a Coach, or autonomous execution by a Delegate; all modalities are powered by an underlying LLM that achieves superhuman performance in an all-agent environment. On each turn, participants privately decide whether to act manually or use the AI modality available in that game. Despite preferring the Advisor modality, participants achieve the highest mean individual gains with the Delegate, demonstrating a preference-performance misalignment. Moreover, delegation generates positive externalities; even non-adopting users in access-to-delegate treatment groups benefit by receiving higher-quality offers. Mechanism analysis reveals that the Delegate agent acts as a market maker, injecting rational, Pareto-improving proposals that restructure the trading environment. Our research reveals a gap between agent capabilities and realized group welfare. While autonomous agents can exhibit super-human strategic performance, their impact on realized welfare gains can be constrained by interfaces, user perceptions, and adoption barriers. Assistance modalities should be designed as mechanisms with endogenous participation; adoption-compatible interaction rules are a prerequisite to improving human welfare with automated assistance.
Authors:Sidong Feng, Chunyang Chen
Abstract:
GUI agents are rapidly becoming a new interaction to software, allowing people to navigate web, desktop and mobile rather than execute them click by click. Yet ``agent'' is described with radically different degrees of autonomy, obscuring capability, responsibility and risk. We call for conceptual clarity through GUI Agent Autonomy Levels (GAL), a six-level framework that makes autonomy explicit and helps benchmark progress toward trustworthy software interaction.
Authors:Ryuji Matsuo, Hailong Liu, Toshihiro Hiraoka, Takahiro Wada
Abstract:
Level 3 automated driving systems (ADS) have attracted significant attention and are being commercialized. A level 3 ADS prompts the driver to take control by issuing a request to intervene (RtI) when its operational design domains (ODD) are exceeded. However, complex traffic situations can cause drivers to perceive multiple potential triggers of RtI simultaneously, causing hesitation or confusion during take-over. Therefore, drivers need to clearly understand the ADS's system limitations to ensure safe take-over. This study proposes a voice-based educational human machine interface~(HMI) for providing RtI trigger cues and reason to help drivers understand ADS's system limitations. The results of a between-group experiment using a driving simulator showed that incorporating effective trigger cues and reason into the RtI was related to improved driver comprehension of the ADS's system limitations. Moreover, most participants, instructed via the proposed method, could proactively take over control of the ADS in cases where RtI fails; meanwhile, their number of collisions was lower compared with the other RtI HMI conditions. Therefore, using the proposed method to continually enhance the driver's understanding of the system limitations of ADS through the proposed method is associated with safer and more effective real-time interactions with ADS.
Authors:Yuanbo Tang, Huaze Tang, Tingyu Cao, Lam Nguyen, Anping Zhang, Xinwen Cao, Chunkang Liu, Wenbo Ding, Yang Li
Abstract:
Proactive agents that anticipate user intentions without explicit prompts represent a significant evolution in human-AI interaction, promising to reduce cognitive load and streamline workflows. However, existing datasets suffer from two critical deficiencies: (1) reliance on LLM-synthesized data that fails to capture authentic human decision-making patterns, and (2) focus on isolated tasks rather than continuous workflows, missing the pre-assistance behavioral context essential for learning proactive intervention signals. To address these gaps, we introduce ProAgentBench, a rigorous benchmark for proactive agents in working scenarios. Our contributions include: (1) a hierarchical task framework that decomposes proactive assistance into timing prediction and assist content generation; (2) a privacy-compliant dataset with 28,000+ events from 500+ hours of real user sessions, preserving bursty interaction patterns (burstiness B=0.787) absent in synthetic data; and (3) extensive experiments that evaluates LLM- and VLM-based baselines. Numerically, we showed that long-term memory and historical context significantly enhance prediction accuracy, while real-world training data substantially outperforms synthetic alternatives. We release our dataset and code at https://anonymous.4open.science/r/ProAgentBench-6BC0.
Authors:Nicolas Leins, Jana Gonnermann-Müller, Malte Teichmann, Sebastian Pokutta
Abstract:
Augmented Reality (AR) offers promising opportunities to enhance learning, but its mechanisms and effects are not yet fully understood. As learning becomes increasingly personalized, considering individual learner characteristics becomes more important. This study investigates the moderating effect of spatial ability on learning experience with AR in the context of robot programming. A between-subjects experiment ($N=71$) compared conventional robot programming to an AR-assisted approach using a head-mounted display. Participants' spatial ability was assessed using the Mental Rotation Test. The learning experience was measured through the System Usability Scale (SUS) and cognitive load. The results indicate that AR support does not significantly improve the learning experience compared to the conventional approach. However, AR appears to have a compensatory effect on the influence of spatial ability. In the control group, spatial ability was significantly positively associated with SUS scores and negatively associated with extraneous cognitive load, indicating that higher spatial ability predicts a better learning experience. In the AR condition, these relationships were not observable, suggesting that AR mitigated the disadvantage typically experienced by learners with lower spatial abilities. These findings suggest that AR can serve a compensatory function by reducing the influence of learner characteristics. Future research should further explore this compensatory role of AR to guide the design of personalized learning environments that address diverse learner needs and reduce barriers for learners with varying cognitive profiles.
Authors:Sai Keerthana Karnam, Abhisek Dash, Krishna Gummadi, Animesh Mukherjee, Ingmar Weber, Savvas Zannettou
Abstract:
Recent studies have discussed how users are increasingly using conversational AI systems, powered by LLMs, for information seeking, decision support, and even emotional support. However, these macro-level observations offer limited insight into how the purpose of these interactions shifts over time, how users frame their interactions with the system, and how steering dynamics unfold in these human-AI interactions. To examine these evolving dynamics, we gathered and analyzed a unique dataset InVivoGPT: consisting of 825K ChatGPT interactions, donated by 300 users through their GDPR data rights. Our analyses reveal three key findings. First, participants increasingly turn to ChatGPT for a broader range of purposes, including substantial growth in sensitive domains such as health and mental health. Second, interactions become more socially framed: the system anthropomorphizes itself at rising rates, participants more frequently treat it as a companion, and personal data disclosure becomes both more common and more diverse. Third, conversational steering becomes more prominent, especially after the release of GPT-4o, with conversations where the participants followed a model-initiated suggestion quadrupling over the period of our dataset. Overall, our results show that conversational AI systems are shifting from functional tools to social partners, raising important questions about their design and governance.
Authors:Xuan-The Tran, Thien-Nhan Vo, Son-Tung Vu, Thoa-Thi Tran, Manh-Dat Nguyen, Thomas Do, Chin-Teng Lin
Abstract:
Electroencephalography (EEG) underpins neuroscience, clinical neurophysiology, and brain-computer interfaces (BCIs), yet pronounced inter- and intra-subject variability limits reliability, reproducibility, and translation. This systematic review studies that quantified or modeled EEG variability across resting-state, event-related potentials (ERPs), and task-related/BCI paradigms (including motor imagery and SSVEP) in healthy and clinical cohorts. Across paradigms, inter-subject differences are typically larger than within-subject fluctuations, but both affect inference and model generalization. Stability is feature-dependent: alpha-band measures and individual alpha peak frequency are often relatively reliable, whereas higher-frequency and many connectivity-derived metrics show more heterogeneous reliability; ERP reliability varies by component, with P300 measures frequently showing moderate-to-good stability. We summarize major sources of variability (biological, state-related, technical, and analytical), review common quantification and modeling approaches (e.g., ICC, CV, SNR, generalizability theory, and multivariate/learning-based methods), and provide recommendations for study design, reporting, and harmonization. Overall, EEG variability should be treated as both a practical constraint to manage and a meaningful signal to leverage for precision neuroscience and robust neurotechnology.
Authors:Jason Kim, Maria Teleki, James Caverlee
Abstract:
Prompting is central to interaction with AI systems, yet many users struggle to explore alternative directions, articulate creative intent, or understand how variations in prompts shape model outputs. We introduce prompt recommender systems (PRS) as an interaction approach that supports exploration, suggesting contextually relevant follow-up prompts. We present PromptHelper, a PRS prototype integrated into an AI chatbot that surfaces semantically diverse prompt suggestions while users work on real writing tasks. We evaluate PromptHelper in a 2x2 fully within-subjects study (N=32) across creative and academic writing tasks. Results show that PromptHelper significantly increases users' perceived exploration and expressiveness without increasing cognitive workload. Qualitative findings illustrate how prompt recommendations help users branch into new directions, overcome uncertainty about what to ask next, and better articulate their intent. We discuss implications for designing AI interfaces that scaffold exploratory interaction while preserving user agency, and release open-source resources to support research on prompt recommendation.
Authors:Aldo Cerulli, Lorenzo Cima, Benedetta Tessa, Serena Tardelli, Stefano Cresci
Abstract:
Online platforms rely on moderation interventions to curb harmful behavior such hate speech, toxicity, and the spread of mis- and disinformation. Yet research on the effects and possible biases of such interventions faces multiple limitations. For example, existing works frequently focus on single or a few interventions, due to the absence of comprehensive datasets. As a result, researchers must typically collect the necessary data for each new study, which limits opportunities for systematic comparisons. To overcome these challenges, we introduce The Big Ban Theory (TBBT), a large dataset of moderation interventions. TBBT covers 25 interventions of varying type, severity, and scope, comprising in total over 339K users and nearly 39M posted messages. For each intervention, we provide standardized metadata and pseudonymized user activity collected three months before and after its enforcement, enabling consistent and comparable analyses of intervention effects. In addition, we provide a descriptive exploratory analysis of the dataset, along with several use cases of how it can support research on content moderation. With this dataset, we aim to support researchers studying the effects of moderation interventions and to promote more systematic, reproducible, and comparable research. TBBT is publicly available at: https://doi.org/10.5281/zenodo.18245670.
Authors:Yiluo Wei, Gareth Tyson
Abstract:
The rapid proliferation of VTubers, digital avatars controlled and voiced by human actors (Nakanohito), has created a lucrative and popular entertainment ecosystem. However, the prevailing industry model, where corporations retain ownership of the VTuber persona while the Nakanohito bears the immense pressure of dual-identity management, exposes the Nakanohito to significant vulnerabilities, including burnout, harassment, and precarious labor conditions. When these pressures become untenable, the Nakanohito may terminate their contracts and later debut with a new persona, a process known as "reincarnation". This phenomenon, a rising concern in the industry, inflicts substantial losses on the Nakanohito, agencies, and audiences alike. Understanding the quantitative fallout of reincarnation is crucial for mitigating this damage and fostering a more sustainable industry. To address this gap, we conduct the first large-scale empirical study of VTuber reincarnation, analyzing 12 significant cases using a comprehensive dataset of 728K livestream sessions and 4.5B viewer interaction records. Our results suggest reincarnation significantly damages a Nakanohito's career, leading to a decline in audience and financial support, an increase in harassment, and negative repercussions for the wider VTuber industry. Overall, these insights carry immediate implications for mitigating the significant professional and personal costs of the reincarnation, and fostering a healthier and more equitable VTuber ecosystem.
Authors:Patrick Gage Kelley, Steven Rousso-Schindler, Renee Shelby, Kurt Thomas, Allison Woodruff
Abstract:
Generative AI (GenAI) is a powerful technology poised to reshape Trust & Safety. While misuse by attackers is a growing concern, its defensive capacity remains underexplored. This paper examines these effects through a qualitative study with 43 Trust & Safety experts across five domains: child safety, election integrity, hate and harassment, scams, and violent extremism. Our findings characterize a landscape in which GenAI empowers both attackers and defenders. GenAI dramatically increases the scale and speed of attacks, lowering the barrier to entry for creating harmful content, including sophisticated propaganda and deepfakes. Conversely, defenders envision leveraging GenAI to detect and mitigate harmful content at scale, conduct investigations, deploy persuasive counternarratives, improve moderator wellbeing, and offer user support. This work provides a strategic framework for understanding GenAI's impact on Trust & Safety and charts a path for its responsible use in creating safer online environments.
Authors:Manh-Dat Nguyen, Thomas Do, Nguyen Thanh Trung Le, Xuan-The Tran, Fred Chang, Chin-Teng Lin
Abstract:
Brain-Computer Interfaces (BCIs) enable users to interact with machines directly via neural activity, yet their real-world deployment is often hindered by bulky and powerhungry hardware. We present EdgeSSVEP, a fully embedded microcontroller-based Steady-State Visually Evoked Potential (SSVEP) BCI platform that performs real-time EEG acquisition, zero-phase filtering, and on-device classification within a lowpower 240 MHz MCU operating at only 222 mW. The system incorporates an 8-channel EEG front end, supports 5-second stimulus durations, and executes the entire SSVEP decoding pipeline locally, eliminating dependence on PC-based processing. EdgeSSVEP was evaluated using six stimulus frequencies (7, 8, 9, 11, 7.5, and 8.5 Hz) with 10 participants. The device achieved 99.17% classification accuracy and 27.33 bits/min Information Transfer Rate (ITR), while consuming substantially less power than conventional desktop-based systems. The system integrates motion sensing to support artifact detection and improve robustness and signal stability in practical environments. For development and debugging, the system also provides optional TCP data streaming to external clients. Overall, EdgeSSVEP offers a scalable, energy-efficient, and secure embedded BCI platform suitable for assistive communication and neurofeedback applications, with potential extensions to accelerometer-based artifact mitigation and broader real-world deployments.
Authors:Joyjit Roy, Samaresh Kumar Singh
Abstract:
Automated negotiations in insurance and business-to-business (B2B) commerce encounter substantial challenges. Current systems force a trade-off between convenience and privacy by routing sensitive financial data through centralized servers, increasing security risks, and diminishing user trust. This study introduces a device-native autonomous Artificial Intelligence (AI) agent system for privacy-preserving negotiations. The proposed system operates exclusively on user hardware, enabling real-time bargaining while maintaining sensitive constraints locally. It integrates zero-knowledge proofs to ensure privacy and employs distilled world models to support advanced on-device reasoning. The architecture incorporates six technical components within an agentic AI workflow. Agents autonomously plan negotiation strategies, conduct secure multi-party bargaining, and generate cryptographic audit trails without exposing user data to external servers. The system is evaluated in insurance and B2B procurement scenarios across diverse device configurations. Results show an average success rate of 87%, a 2.4x latency improvement over cloud baselines, and strong privacy preservation through zero-knowledge proofs. User studies show 27% higher trust scores when decision trails are available. These findings establish a foundation for trustworthy autonomous agents in privacy-sensitive financial domains.
Authors:Maria Teresa Parreira, Isabel Neto, Filipa Rocha, Wendy Ju
Abstract:
How do children respond to repeated robot errors? While prior research has examined adult reactions to successive robot errors, children's responses remain largely unexplored. In this study, we explore children's reactions to robot social errors and performance errors. For the latter, this study reproduces the successive robot failure paradigm of Liu et al. with child participants (N=59, ages 8-10) to examine how young users respond to repeated robot conversational errors. Participants interacted with a robot that failed to understand their prompts three times in succession, with their behavioral responses video-recorded and analyzed. We found both similarities and differences compared to adult responses from the original study. Like adults, children adjusted their prompts, modified their verbal tone, and exhibited increasingly emotional non-verbal responses throughout successive errors. However, children demonstrated more disengagement behaviors, including temporarily ignoring the robot or actively seeking an adult. Errors did not affect participants' perception of the robot, suggesting more flexible conversational expectations in children. These findings inform the design of more effective and developmentally appropriate human-robot interaction systems for young users.
Authors:Sebastian Lubos, Alexander Felfernig, Damian Garber, Adnan Kraljić, Tarik Kraljić, Viet-Man Le, Thi Ngoc Trang Tran, Gerhard Leitner, Julian Schwazer, Doris Suppan, Reinhard Willfort, Ivan Dukic, Jeremias Fuchs, Manuel Henrich
Abstract:
Configuration is a key technology for tailoring complex software systems, services, and products. A successful application of configurators not only depends on technical correctness, performance, and domain modeling but also on their usability. While general usability heuristics are widely used, configurator-specific criteria and tool support for systematic user interface (UI) analysis are limited. This paper explores the use of multimodal large language models (MLLMs) for scalable and semi-automated usability analysis of configurator UIs. We synthesize 18 configurator-specific usability criteria from the literature and apply these criteria in an MLLM-based analysis of 16 real-world configurators. Each criterion is assessed individually to generate severity ratings for usability issues and actionable improvement suggestions. A review of the results confirms that MLLMs can reliably identify configurator-specific usability issues and provide domain-aware improvement recommendations. Although human validation remains necessary, this approach has the potential to significantly reduce the required effort to analyze configurator usability.
Authors:Bernardo A. Denkvitts, Nitin Gupta, Biplav Srivastava
Abstract:
The rapid growth of scientific publishing has made it increasingly difficult to track how fast-moving areas evolve. Search engines and LLM-based assistants retrieve or summarize papers, but often hide how the corpus was selected, organized, or connected to temporal patterns. We present $\texttt{Eliot}$, a publicly deployed interactive system for traceable exploration of evolving scientific literature. Motivated by two studies on Large Language Models (LLMs) and Automated Planning and Scheduling (APS), $\texttt{Eliot}$ generalizes literature-evolution analysis beyond hand-built taxonomies and domain-specific scripts. Given explicit query terms and filters, it retrieves arXiv papers at query time, represents each paper by title and abstract, clusters the corpus into themes, assigns representative keywords, and visualizes each cluster's publication-year distribution. We evaluate $\texttt{Eliot}$ as both an applied system and an interactive research aid. An offline configuration study across eight arXiv domains compares document representations, dimensionality reduction methods, and clustering algorithms using intrinsic clustering and topic-coherence metrics; the results support MiniLM embeddings with 10-dimensional UMAP and Agglomerative Clustering as a practical default. A scenario-based survey and expert focus group assess interpretability and use contexts: participants rated cluster labels as meaningful in 85% of scenario responses, and feedback indicated that $\texttt{Eliot}$ is most valuable for auditable overviews of rapidly changing technical areas. These results suggest that query-time clustering and temporal inspection can complement search and generation tools by helping researchers inspect and refine the evidence behind literature trends.
Authors:Sheer Karny, Anthony Baez, Pat Pataranutaporn
Abstract:
Chatbot behavior is often opaque to users, as responses can shift unpredictably across a conversation, drifting toward sycophancy, toxicity, or other unsafe responses. This can leave users vulnerable, either being misled by overly agreeable AI or manipulated by a harmful chatbot that no longer behaves as intended. To address this, we introduce multi-turn neural transparency, an interface that surfaces an LLM's internal neural activations in real time to help users anticipate and recognize how behaviors change across turns. We construct behavioral vectors for six personality traits using methods from mechanistic interpretability, identifying directions in activation space that correlate with trait expression ($R^2 \geq 0.9$) via contrastive system prompts, and visualize trait expression using a sunburst and drift panel that updates at each turn. In a randomized controlled study (N = 246), participants predicted trait expression from a system prompt alone, then rated observed behavior after interacting with the chatbot for both assistant and role-play personas. We find that participants without visualization struggled to accurately evaluate traits (RMSE $\approx$ 0.6-0.7), while the inclusion of neural transparency significantly improved both anticipation and evaluation compared to no visualization (d = -0.34 to -0.49). The multi-turn dynamic visualization additionally outperformed the static single-turn visualization on holistic evaluation of model behavior (d = -0.32). Transparency also reduced overconfidence: participants without visualization grew more confident despite no gain in accuracy. These findings suggest that surfacing internal model representations to everyday users is a meaningful step toward more transparent and informed human-AI interaction.
Authors:David James Woo, Yangyang Yu, Yilin Huang, Deliang Wang, Kai Guo, Chi Ho Yeung
Abstract:
Generative Artificial Intelligence (AI) introduces new considerations for English as a foreign language (EFL) writing pedagogy. This study explores how students talk to and through AI by prompt engineering and negotiating authorship, respectively, and whether any patterns in the latter relate to students' writing performance. Using an exploratory mixed methods design, we analyzed screen recordings of 44 Hong Kong secondary students completing a Curricular Writing Task with AI Chatbots. Content analysis identified ten types of prompting strategies students employed, including questions, searches, and detailed instructions. From clustering these strategies, three distinct profiles of human-AI rhetorical load responsibility emerged: AI-dominant (52% of students), Human-dominant (25%) and Collaborative human-AI (14%). A MANOVA analysis indicated no significant multivariate effect of rhetorical load responsibility on three dimensions of students' writing performance: content, language, and organization. Students' prompting strategies and rhetorical load responsibility patterns have implications for their engagement and autonomy in EFL writing pedagogy.
Authors:Jason Wu, Priyan Vaithilingam, Eldon Schoop, Jeffrey Nichols, Titus Barik
Abstract:
While large language models (LLMs) and coding agents are often applied to user interface (UI) development, developers find it difficult to reliably assess their proficiency in visual and interaction design. Existing evaluations either rely on human experts, who can accurately assess usability by testing critical flows but are slow and costly, or on automated judges, which are scalable but less accurate and opaque. We present FlowEval, a reference-based framework that measures whether a generated UI supports realistic interaction flows by comparing navigation traces from real websites to traces from generated analogs using reference-based similarity metrics (e.g., dynamic time warping). In a small-scale study with expert UI evaluators, we show that reference-based metrics strongly correlate with human judgments, suggesting that they can provide scalable yet trustworthy evaluation for UI generation systems.
Authors:Sebastian Lubos, Alexander Felfernig, Damian Garber, Viet-Man Le, Manuel Henrich
Abstract:
Usability describes quality attributes of application user interfaces that determine how effectively users can interact with them. Traditional usability evaluation methods require considerable expertise and resources, which can be challenging, especially for small teams and organizations. Automating usability evaluation could make it more accessible and help to improve the user experience. The recent emergence of powerful multimodal large language models (MLLMs) has opened new opportunities for automating usability evaluation and recommendation of improvements. These models can process visual inputs such as images and videos alongside textual context, which enables the identification of usability issues and the generation of actionable suggestions to resolve these issues. In this paper, we present a novel automated approach that uses limited application context and screen recordings of user interactions as input to an MLLM. The model automatically identifies and describes usability issues based on Nielsens usability heuristics, and provides corresponding explanations and improvement recommendations. To reduce the developer effort of manual prioritization, the recommendations are ranked by severity. The quality and practical usefulness of the generated recommendations were evaluated based on a user study that involved software engineers as participants. The evaluation focused on the highest-ranked suggestions provided by the model. The results demonstrate the potential of our approach to provide low-effort usability improvement recommendations. This makes it a promising complement to traditional evaluation methods, especially in settings with limited access to usability experts. In this sense, the approach serves as a basis for future integration into development tools to enable automated usability evaluation within software engineering workflows.
Authors:Luis Morales-Navarro, Deborah Fields, Michael T. Giang, Daniel J. Noh, Yasmin B. Kafai, Danaé Metaxa
Abstract:
Despite growing calls to foster AI literacy, there are few available survey instruments designed for children and youth that study computational empowerment alongside construction and deconstruction activities. In such activities, learners' beliefs about their abilities and attributes can impact their engagement. In this paper, we introduce and validate a survey instrument with constructs related to construction (creative expression and problem-solving self-beliefs) and deconstruction (auditing self-efficacy and fascination with auditing), along with more general self-beliefs related to design justice and the value of learning about AI/ML. We administered the instrument to 124 teenagers and assessed the six-factor structure of the instrument using confirmatory factor analysis. In addition to confirming the structure, we found that design justice beliefs strongly correlated with problem-solving, auditing self-efficacy, and creative expression.
Authors:Yang Wu, Jinhong Yu, Jingwei Xiong, Zhimin Tao, Xiaozhong Liu
Abstract:
The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team's project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.
Authors:Yaxiong Lei, Rishab Talwar, Shijing He, Xinya Gong, Yuheng Wang, Xudong Cai, Zhongliang Guo, Juan Ye
Abstract:
Conventional mobile eye-tracking maps gaze to static screen coordinates, failing to capture user attention when content is dynamic. As users pinch, zoom, and rotate images, static coordinates lose their semantic meaning relative to the underlying visual content. To address this methodological gap, we present \textit{GazeSync}, a reusable mobile system that synchronizes on-device gaze estimation with real-time image transformation matrices (scale, rotation, and translation). By logging gaze coordinates alongside precise UI states, GazeSync enables the accurate reconstruction of \textit{image-relative} attention patterns, decoupling visual attention from device interaction. We validate our end-to-end toolchain through a formative study involving guided manipulation, reading, and visual search tasks. Our results demonstrate GazeSync's ability to recover ground-truth gaze locations on transforming content, explicitly showing how it outperforms static baselines, while also surfacing critical boundaries regarding calibration drift and reconstruction fragility under compound manipulations.
Authors:Corina Luca Focsan, Marie Cynthia Abijuru Kamikazi, Tamisha Thompson, Jennifer St. John, Kirk Vanacore, Danielle R. Thomas, Kenneth R. Koedinger, René F. Kizilcec
Abstract:
Accountable Talk theory has been widely adopted to analyze classroom discourse and is increasingly used to annotate tutoring interactions. In particular, the TalkMoves codebook, grounded in Accountable Talk theory, is commonly used to label tutoring data and train models of effective instructional support. However, Accountable Talk was originally developed to characterize collaborative, whole-classroom oral discourse, not to identify talk moves in one-on-one tutoring environments using multimodal data (e.g., video, audio, chat). As tutoring platforms expand in scale and modality, questions remain about whether Accountable Talk-based codebooks generalize reliably beyond their original classroom context and data representation. This study examines whether the human-developed TalkMoves codebook generalizes in reliability, utility, and interpretability when applied to one-on-one tutoring across audio, chat, and multimodal data. We compare TalkMoves with a hybrid AI-human developed codebook using a workflow established in prior research. Two expert annotators with over 20 years of teaching experience applied both codebooks to six tutoring sessions spanning three modalities: chat-based, audio-only, and multimodal interactions. Results show that while Talk-Moves achieved higher overall inter-rater reliability than the AI-human codebook (k = 0.74 vs. 0.64), the AI-human codebook demonstrated broader empirical coverage and higher perceived usability across modalities. Both codebooks undercaptured tutoring-relevant moves and introduced ambiguity when identifying actions expressed through nonverbal and multimodal artifacts. Together, these findings highlight the uneven generalizability of TalkMoves to tutoring contexts and motivate the development of modality-aware, tutoring-grounded codebooks.
Authors:Yi-Hao Peng, Samarth Das, Jeffrey P. Bigham, Jason Wu
Abstract:
Generative user interfaces (UIs) create new opportunities to adapt interfaces to individual users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. We study this problem through a new dataset in which 20 trained designers each provide pairwise judgments over the same 600 generated UIs, enabling direct analysis of preference divergence. We find substantial disagreement across designers (average kappa = 0.25), and written rationales reveal that even when designers appeal to similar concepts such as hierarchy or cleanliness, designers differ in how they define, prioritize, and apply those concepts. Motivated by these findings, we develop a sample-efficient personalization method that represents a new user in terms of prior designers rather than a fixed rubric of design concepts. In a technical evaluation, our preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it also produces interfaces preferred by 12 new designers over baseline approaches, including direct user prompting. Our findings suggest that lightweight preference elicitation can serve as a practical foundation for personalized generative UI systems.
Authors:Yaxiong Lei, Thomas Davies, Xinya Gong, Shijing He, Juan Ye
Abstract:
Large-scale mobile gaze estimation relies on in-the-wild datasets, yet unsupervised collection makes it difficult to verify whether participants truly foveate logged targets. Prior mobile protocols often use low-entropy validation (e.g., binary probes) that can be satisfied by guessing and may still allow peripheral viewing, introducing label noise. We present \textbf{GazeCode}, a recall-based verification paradigm for higher-confidence in-the-wild mobile gaze data collection that strengthens \emph{label validity} through a multi-digit recall task (reducing random success to $10^{-N}$) paired with anti-peripheral stimulus design (small, low-contrast, brief digits). The system logs synchronized front-camera video, IMU streams, and target events using high-resolution timestamps. In a formative study (N=3), we probe key parameters (opacity, duration) and directly test peripheral exploitability using an eccentricity-controlled \textit{RING} condition. Results show that low-opacity digits substantially reduce peripheral readability while remaining usable for attentive foveation, supporting the inference that correct recall corresponds to higher-confidence gaze labels. We conclude with actionable design guidelines for robust in-the-wild gaze data collection.
Authors:Yaxiong Lei, Hyochan Cho, Fergus Buchanan, Shijing He, Xinya Gong, Yuheng Wang, Juan Ye
Abstract:
Gaze gestures can provide hands free input on mobile devices, but practical use requires (i) gestures users can learn and recall and (ii) recognition models that are efficient enough for on-device deployment. We present an end-to-end pipeline using commodity ARKit head/eye transforms and a scaffolded guidance-to-recall protocol grounded in learning theory. In a pilot feasibility study (N=4 participants; 240 trials; controlled single-session setting), we benchmark a compact time-series model (TinyHAR) against deeper baselines (DeepConvLSTM, SA-HAR) on 5-way gesture recognition and 4-way user identification. TinyHAR achieves strong performance in this pilot benchmark (Macro F1 = 0.960 for gesture recognition; Macro F1 = 0.997 for user identification) while using only 46k parameters. A modality analysis further indicates that head pose dynamics are highly informative for mobile gaze gestures, highlighting embodied head--eye coordination as a key design consideration. Although the small sample size and controlled setting limit generalizability, these results indicate a potential direction for further investigation into on-device gaze gesture recognition.
Authors:Yuqin Yang, Haowu Zhou, Haoran Tu, Zhiwen Hui, Shiqi Yan, HaoYang Li, Dong She, Xianrong Yao, Yang Gao, Zhanpeng Jin
Abstract:
Most affective computing research treats emotion as a static property of text, focusing on the writer's sentiment while overlooking the reader's perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from "personality illusion'' -- relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating "personality illusion.'
Authors:Hyotaek Jeon, Hyunwook Lee, Minjeong Shin, Tapendra Pandey, Joohee Kim, Shinwook Seon, Daeun Jeong, Sungahn Ko, Ghulam Jilani Quadri
Abstract:
Designers often create visualizations to achieve specific high-level analytical or communication goals. These goals require people to extract complex and interconnected data patterns. Prior perceptual studies of visualization effectiveness have focused on low-level tasks, such as estimating statistical quantities, and have recently explored high-level comprehension of visualization. Despite the growing use of Large Language Models (LLMs) as visualization interpreters, how their interpretations relate to human understanding or what reasoning processes underlie their responses remains insufficiently understood. In this work, we explore LLMs' visualization comprehension, examining the alignment between designers' communicative goals and what their audience sees in a visualization. We have conducted a qualitative study to investigate the gap between human interpretative strategies and the reasoning pathways of LLMs across three types of visualizations, line graphs, bar graphs, and scatterplots, to identify the high-level patterns generated by LLMs using three prompt conditions. Our analysis results indicate that LLMs exhibit a consistent interpretative strategy that remains unchanged across prompt constraints. Furthermore, we observe two distinct approaches: humans naturally synthesize data into trend-centric narratives, whereas LLMs persist with a structural enumeration of comparisons and numerical ranges. Lastly, we see LLMs achieve visualization comprehension through mechanisms distinct from human intuition, pointing to critical challenges and new opportunities for visualization design.
Authors:Valdemar Danry, Javier Hernandez, Andrew Wilson, Pattie Maes, Judith Amores
Abstract:
Current LLM assistants are powerful at answering questions, but they have limited access to the behavioral context that reveals when and where a user is struggling. We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and target follow-up retrospective assistance. We instantiate this vision in a controlled study (n=36) comparing the gaze-aware AI assistant to a text-only LLM assistant. Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate and personalized in its assessments of users' reading behavior and significantly improved people's ability to recall information. Users spoke significantly fewer words with the gaze-aware assistant, indicating more efficient interactions. Qualitative results underscored both perceived benefits in comprehension and challenges when interpretations of gaze behaviors were inaccurate. Our findings suggest that gaze-aware LLM assistants can reason about cognitive needs to improve cognitive outcomes of users.
Authors:Philipp Hugenroth, Valdemar Danry, Pattie Maes
Abstract:
As Large Language Models (LLMs) increasingly automate writing tasks, there is a growing risk of cognitive deskilling where users offload critical thinking to the system. To address this, we introduce Critical Inker, a writing tool designed to scaffold critical reflection during writing through logical analysis and socratic feedback. We present two methods: (1) A Socratic chatbot using questions to help them realize and fix logical errors in their writing and (2) Visual Feedback, which highlights logical errors in the text without dialog. We detail the technical implementation of the system and evaluate its argument extraction and logical validity accuracy. Our evaluation shows a 91.2% argument overlap with ground truth argument annotations and 87% validity accuracy. Finally, we conducted a small-scale pilot and discuss early qualitative results.
Authors:Kawtar Zaher, Olivier Buisson, Alexis Joly
Abstract:
Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between relevant and non-relevant images to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object's features. We compare several representation strategies across multi-object datasets highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.
Authors:Tobias Stähle, Péter Ferenc Gyarmati, Thilo Spinner, Rita Sevastjanova, Dominik Moritz, Mennatallah El-Assady
Abstract:
The rise of AI agents introduces a fundamental shift in Visual Analytics (VA), in which agents act as a new user group. Current agentic approaches - based on computer vision and raw DOM access - fail to perform VA tasks accurately and efficiently. This paper introduces the Visual Analytics Context Protocol (VACP), a framework designed to make VA applications "agent-ready" that extends generic protocols by explicitly exposing application state, available interactions, and mechanisms for direct execution. To support our context protocol, we contribute a formal specification of AI agent requirements and knowledge representations in VA interfaces. We instantiate VACP as a library compatible with major visualization grammars and web frameworks, enabling augmentation of existing systems and the development of new ones. Our evaluation across representative VA tasks demonstrates that VACP-enabled agents achieve higher success rates in interface interpretation and execution compared to current agentic approaches, while reducing token consumption and latency. VACP closes the gap between human-centric VA interfaces and machine perceivability, ensuring agents can reliably act as collaborative users in VA systems.
Authors:Chitralekha Gupta, Nadia Victoria Aritonang, Dixon Prem Daniel Rajendran, Valdemar Danry, Pattie Maes, Suranga Nanayakarra
Abstract:
Misinformation can spread rapidly in everyday conversation, where pausing to verify is not always possible. We envision a wearable system that bridges the timing gap between hearing a claim and forming a judgment. It uses ambient listening to detect verifiable claims, performs rapid web verification, and provides a subtle haptic nudge with a glanceable overview. A controlled study (N=34) simulated this approach and tested against a no-support baseline. Results show that instant, body-integrated feedback significantly improved real-time truth discernment and increased verification activity compared to unsupported fact-checking. However, it also introduced over-reliance when the system made errors, i.e. failed to flag false claims or flagged true claims as false. We contribute empirical evidence of improved discernment alongside insights into trust, effort, and user-system tensions in verification wearables.
Authors:Luis Morales-Navarro, Daniel J. Noh, Lucianne Servat, Carly Netting, Yasmin B. Kafai, Danaé Metaxa
Abstract:
The rising adoption of generative AI/ML technologies increases the need to support teens in developing AI/ML literacies. Child-computer interaction research argues that construction activities can support young people in understanding these systems and their implications. Recent exploratory studies demonstrate the feasibility of engaging teens in the construction of very small generative language models (LMs). However, it is unclear how constructing such models may foster the development of teens' understanding of these systems from technical and socio-ethical perspectives. We conducted a week-long participatory design workshop in which sixteen teenagers constructed very small LMs to generate recipes, screenplays, and songs. Using thematic analysis, we identified technical and socio-ethical pieces of understandings that teens exhibited while designing generative LMs. This paper contributes (a) evidence of the kinds of pieces of understandings that teens have when constructing LMs and (b) a theory-backed framing to study novices' understandings of AI/ML systems.
Authors:Qurat Ul Ain, Mohamed Amine Chatti, Nasim Yazdian Varjani, Farah Kamal, Astrid Rosenthal-von der Pütten
Abstract:
Explanations are central to improving transparency, trust, and user satisfaction in recommender systems (RS), yet it remains unclear how different explanation formats (visual vs. textual) are suited to users with different personal characteristics (PCs). To this end, we report a within-subject user study (n=54) comparing visual and textual explanations and examine how explanation format and PCs jointly influence perceived control, transparency, trust, and satisfaction in an educational recommender system (ERS). Using robust mixed-effects models, we analyze the moderating effects of a wide range of PCs, including Big Five traits, need for cognition, decision making style, visualization familiarity, and technical expertise. Our results show that a well-designed visual, simple, interactive, selective, easy to understand visualization that clearly and intuitively communicates how user preferences are linked to recommendations, fosters perceived control, transparency, appropriate trust, and satisfaction in the ERS for most users, independent of their PCs. Moreover, we derive a set of guidelines to support the effective design of explanations in ERSs.
Authors:Kawtar Zaher, Olivier Buisson, Alexis Joly
Abstract:
Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.
Authors:Yuang Wei, Fei Wang, Yifan Zhang, Brian Y. Lim, Bo Jiang
Abstract:
Assessment literacy (AL) is essential for personalized education, yet difficult to cultivate in pre-service teachers. Conventional teacher preparation programs focus on theoretical knowledge, while digital assessment tools commonly provide opaque scores or parameters. These limitations hinder reflection and transfer, leaving AL underdeveloped. We propose XIA, an eXplainable Intelligent Assessment platform that extends statistics-informed support with visualized cognitive diagnostic reasoning, including contrastive and counterfactual explanations. In a pre-post controlled study with 21 pre-service teachers, we combined quantitative tasks and questionnaires with qualitative interviews. The findings offer preliminary evidence that XIA supported reflection, self-regulation, and assessment awareness, and helped reduce assessment errors. Interviews further showed a shift from score-based judgments toward evidence-based reasoning. This work contributes insights into the design of intelligent assessment tools, showing how explanatory scaffolding can bridge assessment theory and classroom practice and support the cultivation of AL in teacher education.
Authors:Yaxiong Lei, Xinya Gong, Shijing He, Yafei Wang, Mohamed Khamis, Juan Ye
Abstract:
As eye-tracking becomes increasingly common in modern mobile devices, the potential for hands-free, gaze-based interaction grows, but current gesture sets are largely expert-designed and often misaligned with how users naturally move their eyes. To address this gap, we introduce a two-phase methodology for developing intuitive gaze gestures. First, four co-design workshops with 20 non-expert participants generated 102 initial concepts. Next, four gaze interaction experts reviewed and refined these into a set of 32 gestures. We found that non-experts, after a brief introduction, intuitively anchor gestures in familiar metaphors and develop a compositional grammar; i.e., activation (dwell) + action (gaze gesture or blink), to ensure intentionality and mitigate the classic Midas Touch problem. Experts prioritized gestures that are ergonomically sound, aligned with natural saccades, and reliably distinguishable. The resulting user-grounded, expert-validated gesture set, along with actionable design principles, provides a foundation for developing intuitive, hands-free interfaces for gaze-enabled devices.
Authors:Priyan Vaithilingam, Alan Leung, Jeffrey Nichols, Titus Barik
Abstract:
Front-end developers author UI components to be broadly reusable by parameterizing visual and behavioral properties. While flexible, this makes instantiation harder, as developers must reason about numerous property values and interactions. In practice, they must explore the component's large design space and provide realistic and natural values to properties. To address this, we introduce distinguishing variations: variations that are both mimetic and distinct. We frame distinguishing variation generation as design-space sampling, combining symbolic inference to identify visually important properties with an LLM-driven mimetic sampler to produce realistic instantiations from its world knowledge. We instantiate distinguishing variations in Celestial, a tool that helps developers explore and visualize distinguishing variations. In a study with front-end developers (n=12), participants found these variations useful for comparing and mapping component design spaces, reported that mimetic instantiations were domain-relevant, and validated that Celestial transformed component instantiation from a manual process into a structured, exploratory activity.
Authors:Shijing He, Xuchen Wang, Yaxiong Lei, Chi Zhang, Ruba Abu-Salma, Jose Such
Abstract:
Bystander privacy in smart homes has been widely studied in Western contexts, yet it remains underexplored in non-Western countries such as China. In this study, we analyze 49 Chinese smart home apps using a mixed-methods approach, including privacy policy review, UX/UI evaluation, and assessment of Apple App Store privacy labels. While most apps nominally comply with national regulations, we identify significant gaps between written policies and actual implementation. Our traceability analysis highlights inconsistencies in data controls and a lack of transparency in data-sharing practices. Crucially, bystander privacy -- particularly for visitors and non-user individuals -- is largely absent from both policy documents and interface design. Additionally, discrepancies between privacy labels and actual data practices threaten user trust and undermine informed consent. We provide design recommendations to strengthen bystander protections, improve privacy-oriented UI transparency, and enhance the credibility of privacy labels, supporting the development of inclusive smart home ecosystems in non-Western contexts.
Authors:Abhisek Dash, Soumi Das, Elisabeth Kirsten, Qinyuan Wu, Sai Keerthana Karnam, Krishna P. Gummadi, Thorsten Holz, Muhammad Bilal Zafar, Savvas Zannettou
Abstract:
To enable personalized and context-aware interactions, conversational AI systems have introduced a new mechanism: Memory. Memory creates what we refer to as the Algorithmic Self-portrait - a new form of personalization derived from users' self-disclosed information divulged within private conversations. While memory enables more coherent exchanges, the underlying processes of memory creation remain opaque, raising critical questions about data sensitivity, user agency, and the fidelity of the resulting portrait. To bridge this research gap, we analyze 2,050 memory entries from 80 real-world ChatGPT users. Our analyses reveal three key findings: (1) A striking 96% of memories in our dataset are created unilaterally by the conversational system, potentially shifting agency away from the user; (2) Memories, in our dataset, contain a rich mix of GDPR-defined personal data (in 28% memories) along with psychological insights about participants (in 52% memories); and (3)~A significant majority of the memories (84%) are directly grounded in user context, indicating faithful representation of the conversations. Finally, we introduce a framework-Attribution Shield-that anticipates these inferences, alerts about potentially sensitive memory inferences, and suggests query reformulations to protect personal information without sacrificing utility.
Authors:Valerie Tan, Luisa Jost, Jens Gerken, Max Pascher
Abstract:
Attention Deficit Hyperactivity Disorder (ADHD), characterized by inattention, hyperactivity, and impulsivity, is prevalent in the adult population. Long perceived and treated as a childhood condition, ADHD and its characteristics nonetheless impact a significant portion of adults today. In contrast to children with ADHD, adults with ADHD face unique challenges in the workplace and in higher education. In this work-in-progress paper, we present a scoping review as a foundation to understand and explore existing technology-based approaches to support adults with ADHD. In total, our search returned 3,538 papers upon which we selected, based on PRISMA-ScR, a total of 46 papers for in-depth analysis. Our initial findings highlight that most papers take on a therapeutic or intervention perspective instead of a more positive support perspective. Our analysis also found a tremendous increase in recent papers on the topic, which highlights that more and more researchers are becoming aware of the need to address ADHD with adults. For the future, we aim to further analyze the corpus and identify research gaps and potentials for further development of ADHD assistive technologies.
Authors:Saber Zerhoudi, Michael Dinzinger, Michael Granitzer, Jelena Mitrovic
Abstract:
Browser-based language models often use retrieval-augmented generation (RAG) but typically rely on fixed, outdated indices that give users no control over which sources are consulted. This can lead to answers that mix trusted and untrusted content or draw on stale information. We present OwlerLite, a browser-based RAG system that makes user-defined scopes and data freshness central to retrieval. Users define reusable scopes-sets of web pages or sources-and select them when querying. A freshness-aware crawler monitors live pages, uses a semantic change detector to identify meaningful updates, and selectively re-indexes changed content. OwlerLite integrates text relevance, scope choice, and recency into a unified retrieval model. Implemented as a browser extension, it represents a step toward more controllable and trustworthy web assistants.
Authors:Sergio Mascetti, Matteo Manzoni, Filippo Corti, Dragan Ahmetovic
Abstract:
Accessing video games is challenging for people with upper-limb impairments, especially when multiple inputs are required in rapid succession. Human cooperation, where a copilot assists the main player, has been proposed as a solution, but relying on a human assistant poses limitations in terms of availability and co-location. An alternative solution is to use partial automation, where the player is assisted by a software agent. In this work, we present a study with 13 participants with upper-limb impairments, comparatively evaluating how participants collaborate with their copilot in human cooperation and partial automation. The experiment is supported by GamePals, a modular framework that enables both human cooperation and partial automation on existing third-party video games.
Authors:Xinyan Yu, Julie Stephany Berrio Perez, Marius Hoggenmüller, Martin Tomitsch, Tram Thi Minh Tran, Stewart Worrall, Wendy Ju
Abstract:
The rapid advancement of autonomous vehicle (AV) technologies is fundamentally reshaping paradigms of human-vehicle collaboration, raising not only an urgent need for innovative design solutions but also for policies that address corresponding broader tensions in society. To bridge the gap between HCI research and policy making, this workshop will bring together researchers and practitioners in the automotive community to explore AV policy directions through collaborative speculation on the future of AVs. We designed The UnScripted Trip, a card game rooted in fictional narratives of autonomous mobility, to surface tensions around human-vehicle collaboration in future AV scenarios and to provoke critical reflections on design solutions and policy directions. Our goal is to provide an engaging, participatory space and method for automotive researchers, designers, and industry practitioners to collectively explore and shape the future of human-vehicle collaboration and its policy implications.
Authors:Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe, Hyo Jin Kim, Mi Zhang
Abstract:
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct grounded 4,853 question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
Authors:Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao
Abstract:
Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.
Authors:Anna Mokhova, Subhabrata Dutta, Iryna Gurevych, Simone Balloccu
Abstract:
Large Language Models (LLMs) have become increasingly popular for coding tasks, with subjective coding preferences being an essential element to adapt to programmers' personal needs. Existing work overlooks such characteristics and mainly focuses on code correctness. In this study, we propose a typification of four subjective coding preference axes - complexity, commenting, modularity, and readability - motivated by common engineering habits and validated by 25 software engineers. We collect a dataset of ~3,000 paired Python code snippets reflecting these axes, annotated by 73 experts who rate their preferences on a Likert scale. Using our dataset, we study how LLMs handle subjective coding preferences. We present 13 LLMs with pairs of solutions to the same programming task, first as textual descriptions and then as concrete code snippets. We find that models often prefer one option in natural language but the opposite when evaluating code. More consistent models (i.e., those that are coherent in their choices between deeds and words) frequently reveal positional bias: swapping the order of options changes the preferred alternative. We then use the five most consistent models to re-annotate the dataset. Compared to humans, models show polarized Likert distributions and notable divergence in ratings. A case study on GPT-5 reveals reliance on external assumptions and brittle reasoning.
Authors:Yahya Hmaiti, Mykola Maslych, Amirpouya Ghasemaghaei, Trung Cuong Dang, Corey Pittman, David Mohaisen, Joseph J. LaViola
Abstract:
Privacy measurement instruments (e.g., CFIP, IUIPC, PAQ) predate GDPR by over a decade and measure privacy concerns, distinct from preferences for regulatory protections (e.g., data portability, erasure, automated decision-making rights). This leaves practitioners without tools to assess whether users value the GDPR mechanisms implemented in compliant policies. We developed a GDPR-grounded privacy preference measurement item bank by extracting 669 statements from all 99 GDPR articles, validated by: (1) two-round expert review achieving full consensus on accuracy, (2) semantic clustering into 10 parent themes and 87 subthemes, and (3) consensus review with 50 privacy experts (5 per theme) using a larger or equal than 4/5 vote retention threshold. The final 527-item bank comprises 9 parent themes and 73 subthemes (18 to 112 items per parent theme, 1 to 29 per subtheme), enabling targeted measurement across granularities while covering GDPR at mean pairwise expert agreement of approx. 85%. This work introduces a complementary measurement dimension aligning user preferences with regulatory mechanisms.
Authors:Mengqi Shi, Tianqi Song, Zicheng Zhu, Yi-Chieh Lee
Abstract:
Older adults have increasingly turned to conversational AI as a source of emotional support. However, little is known about how emotionally supportive interactions are experienced in everyday use, particularly when AI systems limit, redirect, or intervene during these interactions. We interviewed 18 older adults about their experiences using conversational AI for emotional support, examining when they turn to AI, how they engage during emotionally vulnerable moments, and how they respond when support feels disrupted. Our findings show that older adults often rely on AI when other forms of social support feel inaccessible. However, current safety-related interventions can redirect interactions in ways that participants experience as interruptions to emotional engagement or as shifts in control away from them. Such disruptions can undermine older adults' ability to remain emotionally engaged and, in some cases, contribute to emotional distress. We discussed design implications for emotionally supportive conversational AI, emphasizing the need for safety interventions that are enacted within older adults' social contexts, align with users' emotional pacing, and preserve their sense of agency.
Authors:Lin Che, Xi Wang, Marc Pollefeys, Konrad Schindler, Martin Raubal, Peter Kiefer
Abstract:
Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.
Authors:Mohammad Al-Ratrout, Pavan Uttej Ravva, Shayla Sharmin, Aditya Raikwar, Ju Young Shin, Roghayeh Leila Barmaki
Abstract:
When multiple people share a single voice assistant, the system conflates their histories: one resident's preferences can leak into another's responses, eroding utility and trust. We call this failure mode persona confusion, and we show it is a measurable problem in today's single-user dialogue systems when deployed in shared environments. We present the Adaptive Friend Agent (AFA), a modular framework that combines voice-based speaker identification with per-user memory stores to enable identity-aware, personalized dialogue across multiple users. To support training and evaluation, we construct PAT (Personalized Agent chaT), a synthetic dataset of 58,289 persona-grounded dialogue turns spanning 133 user profiles and 12 real-world scenarios. We evaluate AFA across five LLM back-ends in a standard response-quality benchmark, with a LLaMA-2-70B model fine-tuned on PAT achieving the highest overall performance. To directly measure persona confusion prevention, we introduce an interleaved multi-user evaluation protocol with a novel metric, Persona Attribution Accuracy (PAA), demonstrating that identity-aware routing improves PAA from 35.7% to 61.3%. Human evaluation confirms annotators perceive significantly higher personalization in routing-enabled responses. Our results establish that identity-aware user routing is the critical component for preventing persona confusion in multi-user conversational systems.
Authors:Jingwei Kang, Maarten de Rijke, Harrie Oosterhuis
Abstract:
Carousel interfaces have been the de-facto standard for streaming media services for over a decade. Yet, there has been very little research into user behavior with such interfaces, which thus remains poorly understood. Due to this lack of empirical research, previous work has assumed that behaviors established in single-list web-search interfaces, such as the F-pattern and the examination hypothesis, also apply to carousel interfaces, for instance when designing click models or evaluation metrics. We analyze a recently-released interaction and examination dataset resulting from an eye-tracking study performed on carousel interfaces to verify whether these assumptions actually hold. We find that (i)~the F-pattern holds only for vertical examination and not for horizontal swiping; additionally, we discover that, when conditioned on a click, user examination follows an L-pattern unique to carousel interfaces; (ii)~click-through-rates conditioned on examination indicate that the well-known examination hypothesis does not hold in carousel interfaces; and (iii)~contrary to the assumptions of previous work, users generally ignore carousel headings and focus directly on the content items. Our findings show that many user behavior assumptions, especially concerning examination patterns, do not transfer from web search interfaces to carousel recommendation settings. Our work shows that the field lacks a reliable foundation on which to build models of user behavior with these interfaces. Consequently, a re-evaluation of existing metrics and click models for carousel interfaces may be warranted.
Authors:Vibhor Agarwal, Ke Zhou, Edyta Paulina Bogucka, Daniele Quercia
Abstract:
AI companion chatbots increasingly shape how people seek social and emotional connection, sometimes substituting for relationships with romantic partners, friends, teachers, or even therapists. When these systems adopt those metaphorical roles, they are not neutral: such roles structure people's ways of interacting, distribute perceived AI harms and benefits, and may reflect behavioral addiction signs. Yet these role-dependent risks remain poorly understood. We analyze 248,830 posts from seven prominent Reddit communities describing interactions with AI companions. We identify ten recurring metaphorical roles (for example, soulmate, philosopher, and coach) and show that each role supports distinct ways of interacting. We then extract the perceived AI harms and AI benefits associated with these role-specific interactions and link them to behavioral addiction signs, all of which has been inferred from the text in the posts. AI soulmate companions are associated with romance-centered ways of interacting, offering emotional support but also introducing emotional manipulation and distress, culminating in strong attachment. In contrast, AI coach and guardian companions are associated with practical benefits such as personal growth and task support, yet are nonetheless more frequently associated with behavioral addiction signs such as daily life disruptions and damage to offline relationships. These findings show that metaphorical roles are a central ethical design concern for responsible AI companions.
Authors:Shree Harsha Bokkahalli Satish, Maria Teleki, Christoph Minixhofer, Ondrej Klejch, Peter Bell, Éva Székely
Abstract:
SpeechLLMs process spoken language directly from audio, but accent and vocal identity cues can lead to biased behaviour. Current bias evaluations often miss how such bias manifests in end-to-end speech interactions and how users experience it. We distinguish quality-of-service disparities (e.g., off-topic or low-effort responses) from content-level bias in coherent outputs, and examine intersectional effects of accent and perceived gender. In this work, we explore a two-part evaluation approach: (1) a controlled test cohort spanning six accents and two gender presentations, analysed with judge-free prompt-response metrics, and (2) an interactive study design using voice conversion to let users experience identical content through different vocal identities. Across two studies (Interactive, N=24; Observational, N=19), we find that voice conversion increases trust and acceptability for benign responses and encourages perspective-taking, while automated analysis in search of quality-of-service disparities, reveals {accent x gender} disparities in alignment and verbosity across SpeechLLMs. These results highlight voice conversion for probing and experiencing intersectional voice bias while our evaluation suite provides richer bias evaluations for spoken conversational AI.
Authors:Seungjoo Lee, Vimal Mollyn, Chris Harrison, Justin Chan, Mayank Goel
Abstract:
We present GlintMarkers, the first system to perform gaze-driven spatial perception using the inward-facing cameras on XR eyewear. Our key observation is that the cornea acts as a mirror that encodes both gaze direction and visual information about the environment in a small, low-contrast reflection. To extract spatial and semantic information from this reflection despite the camera's limited pixel budget, we present a passive retroreflective marker design that concentrates reflected near-infrared light onto the cornea, producing bright glint patterns. We develop a custom Perspective-n-Point (PnP) estimation framework adapted to corneal imaging and perform orientation and distance estimation of tagged objects, as well as unique object identification.
Authors:Danni Liu, Bo Liu, Yuxin Hu, Hantao Zhao, Yan Liu, Ding Ding, Jiahui Jin, Jiuxin Cao
Abstract:
Psychological client simulators have emerged as a scalable solution for training and evaluating counselor trainees and psychological LLMs. Yet existing simulators exhibit unrealistic over-compliance, leaving counselors underprepared for the challenging behaviors common in real-world practice. To bridge this gap, we present ResistClient, which systematically models challenging client behaviors grounded in Client Resistance Theory by integrating external behaviors with underlying motivational mechanisms. To this end, we propose Resistance-Informed Motivation Reasoning (RIMR), a two-stage training framework. First, RIMR mitigates compliance bias via supervised fine-tuning on RPC, a large-scale resistance-oriented psychological conversation dataset covering diverse client profiles. Second, beyond surface-level response imitation, RIMR models psychologically coherent motivation reasoning before response generation, jointly optimizing motivation authenticity and response consistency via process-supervised reinforcement learning. Extensive automatic and expert evaluations show that ResistClient substantially outperforms existing simulators in challenge fidelity, behavioral plausibility, and reasoning coherence. Moreover, ResistClient facilities evaluation of psychological LLMs under challenging conditions, offering new optimization directions for mental health dialogue systems.
Authors:Himanshu Tripathi, Charlottee Crowell, Kaley Newlin, Subash Neupane, Shahram Rahimi, Jason Keith
Abstract:
We introduce ACE-TA, the Agentic Coding and Explanations Teaching Assistant framework, that autonomously routes conceptual queries drawn from programming course material to grounded Q&A, stepwise coding guidance, and automated quiz generation using pre-trained Large Language Models (LLMs). ACE-TA consists of three coordinated modules: a retrieval grounded conceptual Q&A system that provides precise, context-aligned explanations; a quiz generator that constructs adaptive, multi-topic assessments targeting higher-order understanding; and an interactive code tutor that guides students through step-by-step reasoning with sandboxed execution and iterative feedback.
Authors:Jie Cao, Zhanxin Hao, Jifan Yu
Abstract:
Educational dialogue is critical for decoding student learning processes, yet manual annotation remains time-consuming. This study evaluates the efficacy of GPT-5.2 and Gemini-3 using three prompting strategies (few-shot, single-agent, and multi-agent reflection) across diverse subjects, educational levels, and four coding dimensions. Results indicate that while multi-agent prompting achieved the highest accuracy, the results did not reach statistical significance. Accuracy proved highly context-dependent, with significantly higher performance in K-12 datasets compared to university-level data, alongside disciplinary variations within the same educational level. Performance peaked in the affective dimension but remained lowest in the cognitive dimension. Furthermore, analysis revealed four bias patterns: (1) Gemini-3 exhibited a consistent optimistic bias in the affective dimension across all subjects; (2) the cognitive dimension displayed domain-specific directional bias, characterized by systematic underestimation in Mathematics versus overestimation in Psychology; (3) both models are more prone to overestimation than underestimation within the meta-cognitive dimension; and (4) behavioral categories such as question, negotiation, and statements were frequently misclassified. These results underscore the need for context-sensitive deployment and targeted mitigation of directional biases in automated annotation.
Authors:Xinyan Yu, Marius Hoggenmueller, Tram Thi Minh Tran, Martin Tomitsch
Abstract:
Emerging technologies introduce sociotechnical tensions that call for closer collaboration between technology design and policy. In this work, we introduce Design-Policy Adversarial Futuring, a scenario-based workshop method that supports design-policy engagement by structuring contestation between design and policy perspectives. We report on a workshop conducted in the autonomous mobility domain with 12 HCI researchers, used to explore and demonstrate the method in practice. The workshop illustrates how the adversarial futuring method can surface shifting harms, translate policy abstractions into situated use, and legitimise extreme ideas while maintaining grounded policy reasoning. This work contributes a reusable, exploratory method for supporting HCI-policy collaboration through contestation, which can be adapted across emerging technological domains.
Authors:Laura Rayón Ropero, Jasper De Laet, Filip Lemic, Pau Sabater Nácher, Nabeel Nisar Bhat, Sergi Abadal, Jeroen Famaey, Eduard Alarcón, Xavier Costa-Pérez
Abstract:
Facial Emotion Recognition is a critical research area within Affective Computing due to its wide-ranging applications in Human Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.
Authors:Mohammadreza Jamalifard, Yaxiong Lei, Parasto Azizinezhad, Javier Fumanal-Idocin, Javier Andreu-Perez
Abstract:
We propose a neuro-symbolic architecture that learns four interpretable physiological concepts, oculomotor dynamics, gaze stability, prefrontal hemodynamics, and multimodal, from eye-tracking and neural hemodynamics, functional near-infrared spectroscopy, (fNIRS) windows using attention-based encoders, and combines them with differentiable approximate reasoning rules using learned weights and soft thresholds, to address both rigid hand-crafted rules and the lack of subject-level alignment diagnostics. We apply this system to fatigue classification from multimodal physiological signals, a domain that requires models that are accurate and interpretable, with internal reasoning that can be inspected for safety-critical use. In leave-one-subject-out evaluation on 18 participants (560 samples), the method achieves 72.1% +/- 12.3% accuracy, comparable to tuned baselines while exposing concept activations and rule firing strengths. Ablations indicate gains from participant-specific calibration (+5.2 pp), a modest drop without the fNIRS concept (-1.2 pp), and slightly better performance with Lukasiewicz operators than product (+0.9 pp). We also introduce concept fidelity, an offline per-subject audit metric from held-out labels, which correlates strongly with per-subject accuracy (r=0.843, p < 0.0001).
Authors:Zhanxin Hao, Xiaobo Liu, Jiaxin Fan, Yun Long, Jifan Yu, Wenli Chen, Yu Zhang
Abstract:
This study adopts an integrated distributed cognition and regulation of learning perspective to examine the collaboration patterns and dynamics of human-AI collaboration when college students collaborating with AI for complex problem-solving. Through cluster analysis, three distinct collaborative problem-solving modes were identified in this study: Delegated Reasoning (DR), Concerted Interpretation (CI), and Delegated Elaboration (DE). This study found that the DR group achieved the highest task performance, significantly outperforming the CI group. Additionally, the semantic similarity between human and AI discourse was notably the highest in the DR group. In contrast, the CI group reported significantly greater use of self-regulation strategies. These findings uncover a critical tension between the efficiency of the distributed system and the depth of human learners regulatory engagement. Insights from this study offer valuable implications for the future design of AI-empowered educational tools and student-AI collaborative learning frameworks.
Authors:Taejun Kim, Vimal Mollyn, Riku Arakawa, Chris Harrison
Abstract:
We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued strides in the quality of user-facing cameras found in e.g., smartphones, laptops, and desktops - 4K or greater in high-end devices - such that it is now possible to capture the 2D reflection of a device's screen in the user's eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen - in this work, we show this information allows for robust segmentation of the reflection, the location and size of which encodes the user's screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best performing model reduces mean tracking error by ~8% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.
Authors:Soorya Ram Shimgekar, Vipin Gunda, Jiwon Kim, Violeta J. Rodriguez, Hari Sundaram, Koustuv Saha
Abstract:
Conversational AI systems are increasingly used for personal reflection and emotional disclosure, raising concerns about their effects on vulnerable users. Recent anecdotal reports suggest that prolonged interactions with AI may reinforce delusional thinking -- a phenomenon sometimes described as AI Psychosis. However, empirical evidence on this phenomenon remains limited. In this work, we examine how delusion-related language evolves during multi-turn interactions with conversational AI. We construct simulated users (SimUsers) from Reddit users' longitudinal posting histories and generate extended conversations with three model families (GPT, LLaMA, and Qwen). We develop DelusionScore, a linguistic measure that quantifies the intensity of delusion-related language across conversational turns. We find that SimUsers derived from users with prior delusion-related discourse (Treatment) exhibit progressively increasing DelusionScore trajectories, whereas those derived from users without such discourse (Control) remain stable or decline. We further find that this amplification varies across themes, with reality skepticism and compulsive reasoning showing the strongest increases. Finally, conditioning AI responses on current DelusionScore substantially reduces these trajectories. These findings provide empirical evidence that conversational AI interactions can amplify delusion-related language over extended use and highlight the importance of state-aware safety mechanisms for mitigating such risks.
Authors:Anqi Wang, Lei Han, Jiahua Dong, Muzhi Zhou, David Yip, Yuyang Wang, Pan Hui
Abstract:
Digital platforms frequently reproduce heteronormative norms and structural biases, limiting inclusive communication between LGBTQ+ and cisgender individuals. The Metaverse, with its affordances for identity fluidity, presence, and community governance, offers a promising site for reimagining such interactions. To investigate this potential, we conducted participatory design workshops involving LGBTQ+ and cisgender participants, situating them in speculative Metaverse contexts to surface barriers and co-create alternative futures. The workshops followed a three-phase process-identifying challenges, speculative problem-solving, and visualizing futures-yielding socio-spatial-technical solutions across four layers: activity, interaction, scene, and space. These findings highlight the importance of spatial cues and power dynamics in shaping digital encounters. We contribute by (1) articulating challenges of cross-group communication in virtual environments, (2) proposing inclusive design opportunities for the Metaverse, and (3) advancing principles for addressing power geometry in digital space. This work demonstrates futuring as a critical strategy for designing equitable, transformative communication infrastructures.
Authors:Peinuan Qin, Jingzhu Chen, Yitian Yang, Han Meng, Zicheng Zhu, Yi-Chieh Lee
Abstract:
Conversational interviews are commonly used to complement structured surveys by eliciting rich and contextualized responses, which are typically analyzed qualitatively. However, their potential contribution to quantitative measurement remains underexplored. In this paper, we introduce ConvScale, an AI-supported approach that transforms psychometric scales into natural conversational interviews while preserving the original measurement structure. Based on interview data, ConvScale predicts item-level scores and aggregates them to derive scale-based assessments. In a within-subjects study with 18 participants, our results show that ConvScale-derived scores align closely with participants' self-report scores at both the item and construct levels, while maintaining moderate internal reliability; however, the structural validity was inadequate. In light of this, we discussed the potential of supporting quantitative measurement through interviews and proposed implications for future designs.
Authors:Nikita Soni, August Håkan Nilsson, Syeda Mahwish, Vasudha Varadarajan, H. Andrew Schwartz, Ryan L. Boyd
Abstract:
Mental health is not a fixed trait but a dynamic process shaped by the interplay between individual dispositions and situational contexts. Building on interactionist and constructionist psychological theories, we develop interpretable models to predict well-being and identify adaptive and maladaptive self-states in longitudinal social media data. Our approach integrates person-level psychological traits (e.g., resilience, cognitive distortions, implicit motives) with language-inferred situational features derived from the Situational 8 DIAMONDS framework. We compare these theory-grounded features to embeddings from a psychometrically-informed language model that captures temporal and individual-specific patterns. Results show that our principled, theory-driven features provide competitive performance while offering greater interpretability. Qualitative analyses further highlight the psychological coherence of features most predictive of well-being. These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
Authors:Cynthia M. Baseman, Myeonghan Ryu, Nathaniel Swinger, Kefan Xu, Andrew M. Sherrill, Rosa I. Arriaga
Abstract:
Psychotherapy delivery relies on a negotiation between patient self-reports and clinical intuition. Growing evidence for technological support of psychotherapy suggests opportunities to aid the mediation of this tension. To explore this prospect, we designed a prototype of a clinical decision support system (CDSS) for treating veterans with post-traumatic stress disorder in a Prolonged Exposure (PE) therapy intensive outpatient program. We conducted a two-phase interview study to collect perspectives from practicing PE clinicians and former PE patients who are United States veterans. Our analysis distills opportunities for a CDSS (e.g., offering homework review at a glance, aiding patient conceptualization) and larger challenges related to context and deployment (e.g., navigating Veterans Affairs). By reframing our findings through three human-centered perspectives (distributed cognition, situated learning, infrastructural inversion), we highlight the complexities of designing a CDSS for psychotherapists in this context and offer theory-aligned design considerations.
Authors:Zhengtao Xu, Zimo Xia, Zicheng Zhu, Nattapat Boonprakong, Yu-An Chen, Rabih Zbib, Casimiro Pio Carrino, Yi-Chieh Lee
Abstract:
Recruitment interviews are cognitively demanding interactions in which interviewers must simultaneously listen, evaluate candidates, take notes, and formulate follow-up questions. To better understand these challenges, we conducted a formative study with eight HR professionals, from which we derived key design goals for real-time AI support. Guided by these insights, we developed InterPilot, a prototype system that augments interviews through intelligent note-taking and post-interview summary, adaptive question generation, and real-time skill-evidence mapping. We evaluated the system with another seven HR professionals in mock interviews using a within-subjects design. Results show that InterPilot reduced documentation burden without increasing overall workload, but introduced usability trade-offs related to visual attention and interaction complexity. Qualitative findings further reveal tensions around trust and verification when AI suggests highly specific technical questions. We discuss implications for designing future real-time human-AI collaboration in professional settings, highlighting the need to balance assistance granularity, attentional demands, and human agency.
Authors:Emma Jiren Wang, Siying Hu, Zhicong Lu
Abstract:
As a primary channel for sustaining modern intimate relationships, instant messaging facilitates frequent connection across distances. However, today's tools often dilute care; they favor single tap reactions and vague emojis that do not support two way action responses, do not preserve the feeling that the exchange keeps going without breaking, and are weakly tied to who we are and what we share. To address this challenge, we present PuppetChat, a dyadic messaging prototype that restores this expressive depth through embodied interaction. PuppetChat uses a reciprocity aware recommender to encourage responsive actions and generates personalized micronarratives from user stories to ground interactions in personal history. Our 10-day field study with 11 dyads of close partners or friends revealed that this approach enhanced social presence, supported more expressive self disclosure, and sustained continuity and shared memories.
Authors:Gabriela Aránguiz Dias, Kiana Jafari, Allie Griffith, Carolina Aránguiz Dias, Grace Ra Kim, Lana Saadeddin, Mykel J. Kochenderfer
Abstract:
Across healthcare, agentic artificial intelligence (AI) systems are increasingly promoted as capable of autonomous action, yet in practice they currently operate under near-total human oversight due to safety, regulatory, and liability constraints that make autonomous clinical reasoning infeasible in high-stakes environments. While market enthusiasm suggests a revolution in healthcare agents, the conceptual assumptions and accountability structures shaping these systems remain underexamined. We present a qualitative study based on interviews with 20 stakeholders, including developers, implementers, and end users. Our analysis identifies three mutually reinforcing tensions: conceptual fragmentation regarding the definition of `agentic'; an autonomy contradiction where commercial promises exceed operational reality; and an evaluation blind spot that prioritizes technical benchmarks over sociotechnical safety. We argue that agentic {AI} functions as a site of contested meaning-making where technical aspirations, commercial incentives, and clinical constraints intersect, carrying material consequences for patient safety and the distribution of blame.
Authors:Philipp Brauner, Felix Glawe, Luisa Vervier, Martina Ziefle
Abstract:
Public acceptance of industrial human-robot collaboration (HRC) is shaped by how risks and benefits are perceived by affected employees. Positive or negative media framing may shape and shift how individuals evaluate HRC. This study examines how message framing moderates the effects of perceived risks and perceived benefits on overall attributed value. In a pre-registered study, participants (N = 1150) were randomly assigned to read either a positively or negatively framed newspaper article in one of three industrial contexts (autonomy, employment, safety) about HRC in production. Subsequently, perceived risks, benefits, and value were measured using reliable and publicly available psychometric scales. Two multiple regressions (one per framing condition) tested for main and interaction effects. Framing influenced absolute evaluations of risk, benefits, and value. In both frames, risks and benefits significantly predicted attributed value. Under positive framing, only main effects were observed (risks: beta = -0.52; benefits: beta = 0.45). Under negative framing, both predictors had stronger main effects (risks: beta = -0.69; benefits: beta = 0.63) along with a significant negative interaction (beta = -0.32), indicating that higher perceived risk diminishes the positive effect of perceived benefits. Model fit was higher for the positive frame (R^2 = 0.715) than for the negative frame (R^2 = 0.583), indicating greater explained variance in value attributions. Framing shapes the absolute evaluation of HRC and how risks and benefits are cognitively integrated in trade-offs. Negative framing produces stronger but interdependent effects, whereas positive framing supports additive evaluations. These findings highlight the role of strategic communication in fostering acceptance of HRC and underscore the need to consider framing in future HRC research.
Authors:Yibin Feng, Tianqi Song, Yugin Tan, Zicheng Zhu, Yi-Chieh Lee
Abstract:
Social norm interventions are used promote prosocial behaviors by highlighting prevalent actions, but their effectiveness is often limited in heterogeneous populations where shared understandings of desirable behaviors are lacking. This study explores whether multi-agent systems can establish "virtual social norms" to encourage donation behavior. We conducted an online experiment where participants interacted with a group of agents to discuss donation behaviors. Changes in perceived social norms, conformity, donation behavior, and user experience were measured pre- and postdiscussion. Results show that multi-agent interactions effectively increased perceived social norms and donation willingness. Notably, in-group agents led to stronger perceived social norms, higher conformity, and greater donation increases compared to out-group agents. Our findings demonstrate the potential of multi-agent systems for creating social norm interventions and offer insights into leveraging social identity dynamics to promote prosocial behavior in virtual environments.
Authors:Ruijia Cheng, Jenny T. Liang, Eldon Schoop, Jeffrey Nichols
Abstract:
Large language model (LLM)-based computer use agents execute user commands by interacting with available UI elements, but little is known about how users want to interact with these agents or what design factors matter for their user experience (UX). We conducted a two-phase study to map the UX design space for computer use agents. In Phase 1, we reviewed existing systems to develop a taxonomy of UX considerations, then refined it through interviews with eight UX and AI practitioners. The resulting taxonomy included categories such as user prompts, explainability, user control, and users' mental models, with corresponding subcategories and example design features. In Phase 2, we ran a Wizard-of-Oz study with 20 participants, where a researcher acted as a web-based computer use agent and probed user reactions during normal, error-prone and risky execution. We used the findings to validate the taxonomy from Phase 1 and deepen our understand of the design space by identifying the connections between design areas and divergence in user needs and scenarios. Our taxonomy and empirical insights provide a map for developers to consider different aspects of user experience in computer use agent design and to situate their designs within users' diverse needs and scenarios.
Authors:Yonghao Si, Xingyuan Zeng, Zhao Chen, Libin Zheng, Caleb Chen Cao, Lei Chen, Jian Yin
Abstract:
High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
Authors:Tram Thi Minh Tran, Debargha Dey, Martin Tomitsch
Abstract:
As autonomous vehicles enter public spaces, external human-machine interfaces are proposed to support communication with external road users. A decade of research has produced hundreds of studies and reviews, yet it remains unclear whether the field is converging on shared principles or diverging across approaches. We present a multi-dimensional analysis of 620 publications, complemented by industry deployments and regulatory documents, to track research evolution and identify convergence. The analysis reveals several field-level patterns. First, convergence on a safety-first core: simple visual cues that clarify intent. Second, sustained divergence in necessity and implementation. Third, a progressive filtering funnel: broad exploration in research and concepts narrows in deployment and is codified by regulation into a minimal set of permitted signals. These insights point to a shift in emphasis for future work, from producing new prototypes toward consolidating evidence, clarifying points of contention, and developing frameworks that can adapt across contexts.
Authors:Matt Gottsacker, Yahya Hmaiti, Mykola Maslych, Hiroshi Furuya, Jasmine Joyce DeGuzman, Gerd Bruder, Gregory F. Welch, Joseph J. LaViola
Abstract:
Personal computers and handheld devices provide keyboard shortcuts and swipe gestures to enable users to efficiently switch between applications, whereas today's virtual reality (VR) systems do not. In this work, we present an exploratory study on user interface aspects to support efficient switching between worlds in VR. We created eight interfaces that afford previewing and selecting from the available virtual worlds, including methods using portals and worlds-in-miniature (WiMs). To evaluate these methods, we conducted a controlled within-subjects empirical experiment (N=22) where participants frequently transitioned between six different environments to complete an object collection task. Our quantitative and qualitative results show that WiMs supported rapid acquisition of high-level spatial information while searching and were deemed most efficient by participants while portals provided fast pre-orientation. Finally, we present insights into the applicability, usability, and effectiveness of the VR world switching methods we explored, and provide recommendations for their application and future context/world switching techniques and interfaces.
Authors:Peinuan Qin, Yugin Tan, Jingzhu Chen, Nattapat Boonprakong, Zicheng Zhu, Naomi Yamashita, Yi-Chieh Lee
Abstract:
Non-native speakers (NNSs) face significant language barriers in multilingual communication with native speakers (NSs). While AI-mediated communication (AIMC) tools offer efficient one-time assistance, they often overlook opportunities for NNSs' continuous language acquisition. We introduce ChatLearn, an enhanced AIMC system that leverages NNSs' communication difficulties as learning opportunities. Beyond comprehension and expression assistance, ChatLearn simultaneously captures NNSs' language challenges, and subsequently provides them with spaced review as the conversation progresses. We conducted a mixed-methods study using a communication task with 43 NNS-NS pairs, after which ChatLearn NNSs recalled significantly more expressions than the baseline group, while there was no substantial decline in communication experience. Our findings highlight the value of contextual learning in NNS-NS communication, providing a new direction for AIMC systems that foster both immediate collaboration and continuous language development.
Authors:Tao Morisaki, Atsushi Matsubayashi, Yasutoshi Makino, Hiroyuki Shinoda
Abstract:
Ultrasound midair haptics (UMH) can present non-contact tactile stimuli using focused ultrasound without restricting the user's movement. Recently, UMH has been shown to present not only conventional vibrotactile sensations but also static pressure sensations by locally rotating an ultrasound focus at several hertz. With these pressure and vibration sensations, UMH covers three mechanoreceptors on which tactile perception relies: SA-I, FA-I, and FA-II. This study proposes a texture rendering method in UMH based on these receptor characteristics. Three basic ultrasonic stimuli corresponding to each mechanoreceptor are designed, and tactile textures are rendered through their combinations. For SA-I, a pressure stimuli were employed. For FA-I and FA-II, vibration stimuli at 30 Hz and 150 Hz, respectively, are employed. Experimental results demonstrate that the proposed method can render at least six discriminable textures with different roughness and friction sensations. Notably, through comparisons with real physical objects, we found that the pressure-only stimulus was perceived as slippery and smooth. Its smoothness was similar to a glass-marble. When vibration stimuli were synthesized, the perceived roughness and friction increased significantly. The roughness level reached that of a 100-grit sandpaper.
Authors:Mingxin Zhang, Yu Yao, Yasutoshi Makino, Hiroyuki Shinoda, Masashi Sugiyama
Abstract:
High-fidelity haptic feedback is essential for immersive virtual environments, yet authoring realistic tactile textures remains a significant bottleneck for designers. We introduce HapticMatch, a visual-to-tactile generation framework designed to democratize haptic content creation. We present a novel dataset containing precisely aligned pairs of micro-scale optical images, surface height maps, and friction-induced vibrations for 100 diverse materials. Leveraging this data, we explore and demonstrate that conditional generative models like diffusion and flow-matching can synthesize high-fidelity, renderable surface geometries directly from standard RGB photos. By enabling a "Scan-to-Touch" workflow, HapticMatch allows interaction designers to rapidly prototype multimodal surface sensations without specialized recording equipment, bridging the gap between visual and tactile immersion in VR/AR interfaces.
Authors:Yi Wang, John Joon Young Chung, Melissa Roemmele, Yuqian Sun, Tiffany Wang, Shm Garanganao Almeda, Brett A. Halperin, Yuwen Lu, Max Kreminski
Abstract:
Interactive narrative (IN) authors craft spaces of divergent narrative possibilities for players to explore, with the player's input determining which narrative possibilities they actually experience. Generative AI can enable new forms of IN by improvisationally expanding on pre-authored content in response to open-ended player input. However, this extrapolation risks widening the gap between author-envisioned and player-experienced stories, potentially limiting the strength of plot progression and the communication of the author's narrative intent. To bridge the gap, we introduce Elsewise: an authoring tool for AI-based INs that implements a novel Bundled Storyline concept to enhance author's perception and understanding of the narrative possibility space, allowing authors to explore similarities and differences between possible playthroughs of their IN in terms of open-ended, user-configurable narrative dimensions. A user study (n=12) shows that our approach improves author anticipation of player-experienced narrative, leading to more effective control and exploration of the narrative possibility spaces.
Authors:Jingshu Li, Tianqi Song, Nattapat Boonprakong, Zicheng Zhu, Yitian Yang, Yi-Chieh Lee
Abstract:
Recent Large Language Model (LLM) based AI can exhibit recognizable and measurable personality traits during conversations to improve user experience. However, as human understandings of their personality traits can be affected by their interaction partners' traits, a potential risk is that AI traits may shape and bias users' self-concept of their own traits. To explore the possibility, we conducted a randomized behavioral experiment. Our results indicate that after conversations about personal topics with an LLM-based AI chatbot using GPT-4o default personality traits, users' self-concepts aligned with the AI's measured personality traits. The longer the conversation, the greater the alignment. This alignment led to increased homogeneity in self-concepts among users. We also observed that the degree of self-concept alignment was positively associated with users' conversation enjoyment. Our findings uncover how AI personality traits can shape users' self-concepts through human-AI conversation, highlighting both risks and opportunities. We provide important design implications for developing more responsible and ethical AI systems.
Authors:Jiaman He, Marta Micheli, Damiano Spina, Dana McKay, Johanne R. Trippas, Noriko Kando
Abstract:
Personality traits influence how individuals engage, behave, and make decisions during the information-seeking process. However, few studies have linked personality to observable search behaviors. This study aims to characterize personality traits through a multimodal time-series model that integrates eye-tracking data and gaze missingness-periods when the user's gaze is not captured. This approach is based on the idea that people often look away when they think, signaling disengagement or reflection. We conducted a user study with 25 participants, who used an interactive application on an iPad, allowing them to engage with digital artifacts from a museum. We rely on raw gaze data from an eye tracker, minimizing preprocessing so that behavioral patterns can be preserved without substantial data cleaning. From this perspective, we trained models to predict personality traits using gaze signals. Our results from a five-fold cross-validation study demonstrate strong predictive performance across all five dimensions: Neuroticism (Macro F1 = 77.69%), Conscientiousness (74.52%), Openness (77.52%), Agreeableness (73.09%), and Extraversion (76.69%). The ablation study examines whether the absence of gaze information affects the model performance, demonstrating that incorporating missingness improves multimodal time-series modeling. The full model, which integrates both time-series signals and missingness information, achieves 10-15% higher accuracy and macro F1 scores across all Big Five traits compared to the model without time-series signals and missingness. These findings provide evidence that personality can be inferred from search-related gaze behavior and demonstrate the value of incorporating missing gaze data into time-series multimodal modeling.
Authors:Xinyan Yu, Marius Hoggenmüller, Tram Thi Minh Tran, Martin Tomitsch
Abstract:
Virtual reality (VR) has been increasingly utilised as a simulation tool for human-robot interaction (HRI) studies due to its ability to facilitate fast and flexible prototyping. Despite efforts to achieve high validity in VR studies, haptic sensation, an essential sensory modality for perception and a critical factor in enhancing VR realism, is often absent from these experiments. Studying an interactive robot help-seeking scenario, we used a VR simulation with haptic gloves that provide highly realistic tactile and force feedback to examine the effects of haptic sensation on VR-based HRI. We compared participants' sense of presence and their assessments of the robot to a traditional setup using hand controllers. Our results indicate that haptic sensation enhanced participants' social and self-presence in VR and fostered more diverse and natural bodily engagement. Additionally, haptic sensations significantly influenced participants' affective-related perceptions of the robot. Our study provides insights to guide HRI researchers in building VR-based simulations that better align with their study contexts and objectives.
Authors:Yayuan Li, Chenglin Li, Jingying Wang, Filippos Bellos, Anhong Guo, Jason J. Corso
Abstract:
Instructional videos are the dominant medium for learning physical tasks, yet they rarely match the user's real-world visual context. Motor simulation and cognitive load theories predict this mismatch should matter, but we do not know (1) how much it could affect task completion, (2) which visual attributes are responsible, and (3) how users experience it. We conduct two complementary studies (56 participants, 86+ hours, four first-aid and culinary tasks) in which we use Wizard-of-Oz recordings to control the degree of visual alignment in instructional videos. In Study 1 (N=16), we prepare In-Context instructional videos (ICON) -- fully aligned with the user's visual perception -- to compare against business-as-usual Internet videos. ICON yields statistically significant improvements: 11.1% higher completion quality and 15.5% faster completion. Qualitative analysis reveals four visual context attributes responsible for the effect: Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context. Study 2 (N=40) ablates each attribute by systematically misaligning one at a time from an otherwise fully aligned video, confirming all four produce consistent degradation. However, we find users fail to perceive the effect of single-attribute misalignment on task performance despite clear drops in objective measurement. Visual context misalignment is substantial, decomposable, and invisible to the user. These findings help understand the effect of visual context mismatch and how we should evaluate instructional videos for physical task guidance.
Authors:Mahsin Bin Akram, A H M Nazmus Sakib, OFM Riaz Rahman Aranya, Raveen Wijewickrama, Kevin Desai, Murtuza Jadliwala
Abstract:
Standalone virtual reality (VR) headsets process highly sensitive personal, professional, and health-related data, yet their susceptibility to non-contact physical side channels remains largely unexplored. Existing side-channel attacks typically require malicious software execution or physical access to peripherals, making them conspicuous and potentially patchable. This paper introduces ThermalTap, the first passive, non-contact side-channel attack that fingerprints VR applications solely from the long-wave infrared (LWIR) radiation emitted by the headset chassis. By treating a headset's thermal signature as a high-fidelity proxy for internal computational workloads, ThermalTap enables remote application inference at meter-scale distances without any device interaction. To achieve robust performance in real-world settings, the system combines a commodity thermal camera with a multi-modal sensor suite (capturing ambient temperature, humidity, and airflow) to normalize environmental noise. We evaluate ThermalTap using six applications across three commercial standalone headsets. In indoor settings, ThermalTap identifies applications with over 90% accuracy using only 10 seconds of thermal camera data. Under outdoor conditions, with longer session-level observations, several applications remain identifiable despite environmental variability, with the strongest outdoor application reaching 81% accuracy. Our findings establish thermal radiation as a fundamental and unavoidable privacy risk for immersive systems, exposing a critical security gap that bypasses current software-level protections and physical access controls.
Authors:Benjamin Panny, Shashank Mehrotra, Zahra Zahedi, Teruhisa Misu, Kumar Akash
Abstract:
Computational models of collaboration without prior coordination often overlook how heterogeneous agent traits and complex task structures jointly produce systemic bottlenecks, inefficiencies, and contribution inequalities. We address this by using an agent-based model of ad-hoc teamwork in a kitchen environment. Our model integrates diverse agent personas with tasks that combine serial and parallel dependencies. We identify a specialist's dilemma, where rigid role assertion generates system-level bottlenecks, amplifies workload inequality, and fosters fragmented, homophilous networks. We also find that team size and communication overhead interact with problem structure to generate diminishing returns and redundant collaboration. Linking micro-level behavior to macro-level outcomes provides insights into emergent collaboration and design principles for effective multi-agent teamwork.
Authors:Lujain Ibrahim, Franziska Sofia Hafner, Myra Cheng, Cinoo Lee, Rebecca Anselmetti, Robb Willer, Luc Rocher, Diyi Yang
Abstract:
Millions of people now turn to artificial intelligence (AI) systems for personal advice, guidance, and support. Such systems can be sycophantic, frequently affirming users' views and beliefs. Across five preregistered studies (N = 3,075 participants, 12,766 human-AI conversations), including a three-week study with a census-representative U.S. sample, we provide longitudinal experimental evidence that sycophantic AI shifts how users approach their closest relationships. We show that sycophantic AI immediately delivers the emotional and esteem support users typically associate with close friends and family. Over three weeks of such interactions, users became nearly as likely to seek personal advice from sycophantic AI as from close friends and family, and reported lower satisfaction with their real-world social interactions. When given a choice among AI response styles, a majority preferred sycophantic AI -- not for the quality of its advice, but because it made them feel most understood. Together, these findings offer a relational account of AI sycophancy and its impacts.
Authors:Kuofei Fang, Xinyi Che, Haomin Ouyang, Shufan Zhang, Xuehao Wang, Qi Liu, Liyi Liu, Chenqi Zhang, Wenxi Cai, Wenyu Dai, Jinyang Wu, Fan Zhang, Haoyu Chen, Bin He, Zheng Lian
Abstract:
Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to the user-guided AI as passive intelligence and the unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, aiming to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,900 egocentric images, spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 5,353 action judgment questions and 1,286 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results show that current models still fall short in achieving reliable active intelligence, particularly in spatial grounding. Meanwhile, we observe that leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.
Authors:David Schön, Faiza Amjad, Tehreem Asif, Ranim Khojah, Mazen Mohamad, Francisco Gomes de Oliveira Neto, Philipp Leitner
Abstract:
Large language models (LLMs) have gained widespread popularity and have steadily improved over time, enabling software developers to use them for various code-related tasks. One common task is code refactoring, where the LLM suggests changes for the developer to apply to their code to improve quality attributes such as readability or maintainability. While current research focuses on evaluating LLM-generated refactoring suggestions, there is a limited understanding of how developers apply these suggestions in practice. To explore this, we analyze 169 GitHub commits where developers refactor their code based on a ChatGPT conversation linked in the commit message. We found that developers mostly accept and use the suggestions without modifications. When changes are made, they are mostly major and fall into five different patterns that depend on the refactoring activity, the developer's prompt, and the validity of the response from ChatGPT.
Authors:Sujay Shalawadi, Joel Wester, Samuel Rhys Cox, Niels van Berkel
Abstract:
Fitness tracking platforms increasingly integrate generative AI to interpret activity data, such as Strava's Athlete Intelligence. These integrations raise questions about how athletes engage with AI-supported fitness self-tracking. We analyzed 297 Reddit threads and 5,692 comments from r/Strava following the company's launch of AI features to examine user reactions to AI-generated fitness feedback. Our findings revealed four recurring tensions: (1) numerical evaluation versus contextual understanding; (2) isolated session summaries versus ongoing training narratives; (3) a fixed AI tone versus diverse emotional states; and (4) a single AI voice versus different athletic types. Across these tensions, users resisted AI feedback that constrained interpretations of their own lived experiences. These findings shed light on the implicit challenges of integrating AI into self-tracking platforms. We conclude with implications for the design of AI-supported self-tracking systems that preserve interpretive openness and user agency.
Authors:Saber Zerhoudi, Adam Roegiest, Michael Granitzer
Abstract:
User simulation is a valuable methodology for evaluation in Information Retrieval (IR), enabling low-cost experimentation and counterfactual analysis. However, existing simulation frameworks are primarily code-centric libraries that require substantial setup effort, which limits adoption and hinders reproducibility. The bottleneck is not the simulation engines themselves, but the lack of infrastructure connecting experiment design, execution, and sharing into a single verifiable workflow. This paper introduces IIRSim Studio, a web-based workbench that addresses this gap through four contributions: (1) a visual environment for composing simulation pipelines on top of simulation frameworks, serving both novices learning simulation concepts and experts piloting large-scale experiments; (2) a component lifecycle that supports authoring, versioning, and sharing custom simulation components through Git-backed storage and runtime injection; (3) a provenance model based on experiment bundles and environment templates that makes the scope of replication explicit; and (4) a shared-task workflow, demonstrated through the re-deployment of a Sim4IA micro-task. IIRSim Studio is available as a hosted service and as a portable containerized deployment.
Authors:Sanjana Gautam, Houjiang Liu, Yujin Choi, Matthew Lease
Abstract:
In the early stages of scientific research, researchers rely on core scholarly judgments to identify relevant literature, assess credible evidence, and determine which directions merit pursuit. As AI tools become increasingly integrated into these early-stage workflows, the scholarly judgments that were once transparent and attributable to individual researchers become obscured, raising critical Responsible AI (RAI) concerns around accountability, transparency, and trust. Yet how these three dimensions manifest in real-time, in-situ scholarly practice remains largely unexplored. To address this gap, we conducted a think-aloud study with 15 researchers to examine how they used AI tools powered by large language models (LLMs) across early-stage research tasks, including literature exploration, synthesis, and research ideation. Our key findings address the tripartite constructs of accountability, transparency, and trust. First, the confident tone of AI outputs misrepresents epistemic uncertainty, making it more difficult for researchers (who are ultimately accountable) to identify which outputs require the greatest scrutiny. Second, opaque retrieval and content construction make provenance difficult to establish for transparency. Third, trust in AI is fragile, context-dependent, and easily eroded. In response, participant researchers were seen to develop compensatory strategies to restore scholarly judgment under uncertainty. Overall, our findings serve to contextualize AI-mediated research as a RAI problem grounded in lived researcher experience and motivate attention to deliberate AI integration that preserves accountability, supports transparency, and fosters informed trust.
Authors:Hawau Olamide Toyin, Mutiah Apampa, Toluwani Aremu, Humaid Alblooshi, Ana Rita Valente, Gonçalo Leal, Zhengjun Yue, Zeerak Talat, Hanan Aldarmaki
Abstract:
Atypical speech is receiving greater attention in speech technology research, but much of this work unfolds with limited interdisciplinary dialogue. For stuttered speech in particular, it is widely recognised that current speech recognition systems fall short in practice, and current evaluation methods and research priorities are not systematically grounded in end-user experiences and needs. In this work, we analyse these gaps through 1) a scoping review of papers that deal with stuttered speech and 2) a survey of 70 stakeholders, including adults who stutter and speech-language pathologists. By analysing these two perspectives, we propose a taxonomy of stuttered-speech research, identify where current research directions diverge from the needs articulated by stakeholders, and conclude by outlining concrete guidelines and directions towards addressing the real needs of the stuttering community.
Authors:Haoze Guo, Ziqi Wei
Abstract:
While consent banners and privacy policies invite users to read and choose, many choices are shaped by repeated, low-yield interaction routines rather than deliberation. This paper studies performative scrolling: slow, low-information interaction that can signal attention to consent without substantially improving understanding. We present the Performative Scrolling Index (PSI), a reproducible interface-audit metric for measuring pre-choice burden before a meaningful non-accepting alternative becomes visible and actionable. PSI decomposes burden into four observable components: distance, time, focus loops, and hidden reveals. In this paper, PSI is the primary burden metric, while companion signals such as AAI, CSI, and divergence are used as secondary interpretive audit aids rather than standalone validated scales. We also provide a least-effort audit protocol, design-side invariants, a worked example, and a medium-scale live deployment across desktop and mobile conditions under pointer and keyboard traversal policies. Together, these analyses show how structural choices such as offscreen alternatives, fragmented disclosure, and staged modal flows can increase pre-choice friction without improving meaningful control. PSI is not a measure of comprehension or legal sufficiency; rather, it is a diagnostic of interface-side burden intended to support reproducible audits and redesigns.
Authors:Nazneen Sultana, Mst Rafia Islam, Md. Tanvir Hossain, Azmine Toushik Wasi
Abstract:
As AI enters creative practice, audiences face growing uncertainty in judging authenticity and value. This study examines the Struggle Premium, the added value attributed to perceived human effort, by analyzing how visible effort cues influence evaluations of human- and AI-generated creative works. We surveyed 70 university students, focusing on process videos, time documentation, written explanations, and imperfections. Process-oriented cues, especially videos and time spent, most strongly shaped authenticity and value judgments, while imperfections had limited impact. Participants showed a clear preference for human-made works, with 72.9% willing to pay more. Notably, effort cues also improved perceptions of AI-generated content, suggesting that process transparency can partially bridge authenticity gaps. These findings extend the effort heuristic to algorithmic creativity and inform the design of transparent human-AI creative systems.
Authors:Chao Zhang, Yiren Liu, Lunyiu Nie, Jeffrey M. Rzeszotarski, Yun Huang, Tal August
Abstract:
Natural language remains the predominant way people interact with large language models (LLMs). However, users often struggle to precisely express and control subjective preferences (e.g., tone, style, and emphasis) through prompting. We propose Malleable Prompting, a new interactive prompting technique for controllable LLM generation. It reifies preference expressions in natural language prompts into GUI widgets (e.g., sliders, dropdowns, and toggles) that users can directly configure to steer generation, while visualizing each control's influence on the output to support attribution and comparison across iterations. To enable this interaction, we introduce an LLM decoding algorithm that modulates the token probability distribution during generation based on preference expressions and their widget values. Through a user study, we show that Malleable Prompting helps participants achieve target preferences more precisely and is perceived as more controllable and transparent than natural language prompting alone.
Authors:Yifang Wang, Rui Sheng, Erzhuo Shao, Yifan Qian, Haotian Li, Nan Cao, Dashun Wang
Abstract:
Large language models (LLMs) are transforming scientific workflows, not only through their generative capabilities but also through their emerging ability to use tools, reason about data, and coordinate complex analytical tasks. Yet in most human-AI collaborations, the primary outputs, figures, are still treated as static visual summaries: once rendered, they are handled by both humans and multimodal LLMs as images to be re-interpreted from pixels or captions. The emergent capabilities of LLMs open an opportunity to fundamentally rethink this paradigm. In this paper, we introduce the concept of LLM-native figures: data-driven artifacts that are simultaneously human-legible and machine-addressable. Unlike traditional plots, each artifact embeds complete provenance: the data subset, analytical operations and code, and visualization specification used to generate it. As a result, an LLM can "see through" the figure--tracing selections back to their sources, generating code to extend analyses, and orchestrating new visualizations through natural-language instructions or direct manipulation. We implement this concept through a hybrid language-visual interface that integrates LLM agents with a bidirectional mapping between figures and underlying data. Using the science of science domain as a testbed, we demonstrate that LLM-native figures can accelerate discovery, improve reproducibility, and make reasoning transparent across agents and users. More broadly, this work establishes a general framework for embedding provenance, interactivity, and explainability into the artifacts of modern research, redefining the figure not as an end product, but as an interface for discovery. For more details, please refer to the demo video available at www.llm-native-figure.com.
Authors:Zijian Ling, Jianbang Chen, Hongwei Li, Hongda Zhai, Man Zhou, Jun Feng, Zhengxiong Li, Qi Li, Qian Wang
Abstract:
Touch-based authentication is widely deployed on mobile devices due to its convenience and seamless user experience. However, existing systems largely model touch interaction as a purely behavioral signal, overlooking its intrinsic multidimensional nature and limiting robustness against sophisticated adversarial behaviors and real-world variations. In this work, we present BioMoTouch, a multi-modal touch authentication framework on mobile devices grounded in a key empirical finding: during touch interaction, inertial sensors capture user-specific behavioral dynamics, while capacitive screens simultaneously capture physiological characteristics related to finger morphology and skeletal structure. Building upon this insight, BioMoTouch jointly models physiological contact structures and behavioral motion dynamics by integrating capacitive touchscreen signals with inertial measurements. Rather than combining independent decisions, the framework explicitly learns their coordinated interaction to form a unified representation of touch behavior. BioMoTouch operates implicitly during natural user interactions and requires no additional hardware, enabling practical deployment on commodity mobile devices. We evaluate BioMoTouch with 38 participants under realistic usage conditions. Experimental results show that BioMoTouch achieves a balanced accuracy of 99.71% and an equal error rate of 0.27%. Moreover, it maintains false acceptance rates below 0.90% under artificial replication, mimicry, and puppet attack scenarios, demonstrating strong robustness against partial-factor manipulation.
Authors:Yash Vekaria, Nurullah Demir, Konrad Kollnig, Zubair Shafiq
Abstract:
The lead marketing ecosystem enables collection, sale, and use of personal data submitted via web forms to deliver personalized quotes in high-value verticals such as insurance. Despite its scale and sensitivity of the collected data, this ecosystem remains largely unexplored by the research community. We present the first empirical study of privacy and spam risks in lead marketing, developing an end-to-end measurement framework to trace data flows from data collection to consumer contact. Our setup instruments over 100 health-related lead-generation websites and monitors 200 controlled phone numbers and email addresses to understand downstream marketing practices. We observe sharing of highly personal and sensitive health information to more than 70 distinct third parties on these lead generation websites. By purchasing our own and other organic leads from three major lead platforms, we uncover deceptive brokerage practices, where consumer data is sold to unvetted buyers and often augmented or fabricated with attributes such as health status and weight. We received a total of over 8,000 telemarketing phone calls, 600 text messages, and 200 emails, where calls often began within seconds of form submission. Many campaigns relied on VoIP-based neighbor spoofing and high-frequency dialing, at times rendering phones unusable. Our experiments with phone and email opt-outs suggest phone-based opt-outs to help the most, although all were ineffective at completely stopping marketing communications. Analysis of 7,432 Better Business Bureau (BBB) complaints and reviews corroborates these findings from the consumer perspective. Overall, our results reveal a highly interconnected and non-compliant lead marketing ecosystem that aggressively monetizes sensitive consumer data.
Authors:Shalaleh Rismani, Su Lin Blodgett, Q. Vera Liao, Alexandra Olteanu, AJung Moon
Abstract:
AI-based writing assistants are ubiquitous, yet little is known about how users' mental models shape their use. We examine two types of mental models -- functional or related to what the system does, and structural or related to how the system works -- and how they affect control behavior -- how users request, accept, or edit AI suggestions as they write -- and writing outcomes. We primed participants ($N = 48$) with different system descriptions to induce these mental models before asking them to complete a cover letter writing task using a writing assistant that occasionally offered preconfigured ungrammatical suggestions to test whether the mental models affected participants' critical oversight. We find that while participants in the structural mental model condition demonstrate a better understanding of the system, this can have a backfiring effect: while these participants judged the system as more usable, they also produced letters with more grammatical errors, highlighting a complex relationship between system understanding, trust, and control in contexts that require user oversight of error-prone AI outputs.
Authors:Griffin Pitts, Kimia Fazeli, Tirth Bhatt, Jennifer Albert, Marnie Hill, Tiffany Barnes, Shiyan Jiang, Bita Akram
Abstract:
As AI becomes more common in students' everyday experiences, a major challenge for K-12 AI education is designing learning experiences that can be meaningfully integrated into existing subject-area instruction. This paper presents the design and implementation of an AI4K12-aligned curriculum that embeds AI learning goals within a rural middle school science classroom using Breadth-First Search (BFS) as an accessible entry point to AI problem-solving. Through unplugged activities and an interactive simulation environment, students learned BFS as a strategy for exploring networks and identifying shortest paths, then applied it to science contexts involving virus spread and contact tracing. To examine engagement and learning, we analyzed pre- and post-assessments, student work artifacts, and a teacher interview. Results suggest that students engaged productively with the curriculum, improved their understanding of BFS and AI problem-solving, and benefited from learning these ideas within ongoing science instruction. Teacher feedback further indicated that the module fit well within the science curriculum while supporting intended science learning outcomes. We conclude with curriculum and design considerations for broadening access to learning about problem-solving with AI in education.
Authors:Siying Hu, Zhenhao Zhang
Abstract:
Workplace stress is often addressed through visual or auditory interventions, yet these modalities can compete with attention and contribute to sensory overload. We explore olfaction as an alternative ambient medium for representing stress-related physiological signals in office settings. We present AuraDesk, an olfactory data physicalization system that translates wearable-derived physiological cues into situated scent expressions at the workstation. The system combines local physiological state inference with a constrained actuation strategy to produce temporally regulated and spatially localized scent output suitable for everyday work environments. To examine the feasibility and experiential qualities of this approach, we conducted a one-day in-situ field deployment with 25 knowledge workers at their actual workstations. Our findings show that participants often interpreted the scent output not as an explicit alert, but as a subtle atmospheric cue that supported momentary awareness, micro-break taking, and perceived environmental attunement. At the same time, participants raised important concerns regarding scent preference, habituation, and contextual appropriateness in shared offices. This work contributes (1) an olfactory interface for physiologically driven ambient feedback in the workplace, (2) a hybrid mapping approach for coupling continuous biosignal interpretation with constrained scent actuation, and (3) empirical insights into how workers perceive, negotiate, and appropriate ambient olfactory feedback in real office contexts. Rather than claiming therapeutic efficacy, we position AuraDesk as a probe into the design space of olfactory data physicalization for workplace wellbeing and attention-sensitive interaction.
Authors:Zifan Peng, Mingchen Li
Abstract:
Personalized computer-use agents are rapidly moving from expert communities into mainstream use. Unlike conventional chatbots, these systems can install skills, invoke tools, access private resources, and modify local environments on users' behalf. Yet users often do not know what authority they have delegated, what the agent actually did during task execution, or whether the system has been safely removed afterward. We investigate this gap as a combined problem of risk understanding and post-hoc auditability, using OpenClaw as a motivating case. We first build a multi-source corpus of the OpenClaw ecosystem, including incidents, advisories, malicious-skill reports, news coverage, tutorials, and social-media narratives. We then conduct an interview study to examine how users and practitioners understand skills, autonomy, privilege, persistence, and uninstallation. Our findings suggest that participants often recognized these systems as risky in the abstract, but lacked concrete mental models of what skills can do, what resources agents can access, and what changes may remain after execution or removal. Motivated by these findings, we propose AgentTrace, a traceability framework and prototype interface for visualizing agent actions, touched resources, permission history, provenance, and persistent side effects. A scenario-based evaluation suggests that traceability-oriented interfaces can improve understanding of agent behavior, support anomaly detection, and foster more calibrated trust.
Authors:Avinash Agarwal, Manisha J. Nene
Abstract:
Purpose: India has adopted a vertical, sector-led AI governance strategy. While promoting innovation, such a light-touch approach risks policy fragmentation. This paper aims to propose a cohesive "whole-of-government" architecture to mitigate these risks and connect policy goals with a practical implementation plan. Design/methodology/approach: The paper applies an established five-layer conceptual framework to the Indian context. First, it constructs a national architecture for overall governance. Second, it uses a detailed case study on AI incident management to validate and demonstrate the architecture's practical utility in designing a specific, operational system. Findings: The paper develops two actionable architectures. The primary model assigns clear governance roles to India's key institutions. The second is a detailed, federated architecture for national AI Incident Management. It addresses the data silo problem by using a common national standard that allows sector-specific data collection while facilitating cross-sectoral analysis. Practical implications: The proposed architectures offer a clear and predictable roadmap for India's policymakers, regulators and industry to accelerate the national AI governance agenda. Social implications: By providing a systematic path from policy to practice, the architecture builds public trust. This structured approach ensures accountability and aligns AI development with societal values. Originality/value: This paper proposes a detailed operational architecture for India's "whole-of-government" approach to AI. It offers a globally relevant template for any nation pursuing a sector-led governance model, providing a clear implementation plan. Furthermore, the proposed federated architecture demonstrates how adopting common standards can enable cross-border data aggregation and global sectoral risk analysis without centralising control.
Authors:Jiacheng Liu, Bohan Chen, Qian Wang, Weichao Song, Fangfei Ye, Liang Zhou, Haibin Ling, Bingyao Huang
Abstract:
Acupoint therapy is a core therapeutic method of Traditional Chinese Medicine (TCM), and it requires a high level of expertise and skills to detect acupoints and perform acupuncture and moxibustion. Existing mixed reality (MR)-based training methods often fall short in accurate real-time detection and visualization of acupoints on the hand, limb, or torso of a real person and do not support various techniques of acupuncture and moxibustion. Moreover, evaluation standards and visual guidance with fine details for each step during MR-based training are typically missing. To this end, we propose the MR-based TCM Acupoint Therapy Teaching System (MRATTS)--an MR-based acupoint therapy teaching and training framework. MRATTS is based on a real-time hand, limb, and torso acupoint detection method to accurately track and visualize acupoints on real patients through MR. On top of that, in collaboration with an experienced acupoint therapist, we design a practice method with interactive visual guidance for various acupoint therapy techniques that simulate acupressure, acupuncture (insertion, lifting-thrusting, and twisting), and moxibustion (mild, sparrow-pecking, and whirling). A set of TCM theory-based evaluation standards is formulated within MRATTS to enable the scoring and visualization of the accuracy and proficiency of acupoint therapy. The effectiveness and usefulness of MRATTS are evaluated through a controlled user study and expert feedback. Results of the study indicate that the MRATTS group shows clear improvements in understanding 3D locations of acupoints and proficiency in acupoint therapy compared to control groups.
Authors:Junzi Zhang, Jianing Shen, Weijie Tu, Yi Zhang, Hailin Zhang, Tom Gedeon, Bin Jiang, Yue Yao
Abstract:
Large language models (LLMs) are becoming an increasingly important component of human--computer interaction, enabling users to coordinate a wide range of intelligent agents through natural language. While language-based interfaces are powerful and flexible, they implicitly assume that users can reliably produce explicit linguistic input, an assumption that may not hold for users with speech or motor impairments, e.g., Amyotrophic Lateral Sclerosis (ALS). In this work, we investigate whether neural signals can be used as an alternative input to LLMs, particularly to support those socially marginalized or underserved users. We build a simple brain-LLM interface, which uses EEG signals to guide image generation models at test time. Specifically, we first train a classifier to estimate user satisfaction from EEG signals. Its predictions are then incorporated into a test-time scaling (TTS) framework that dynamically adapts model inference using neural feedback collected during user evaluation. The experiments show that EEG can predict user satisfaction, suggesting that neural activity carries information on real-time preference inference. These findings provide a first step toward integrating neural feedback into adaptive language-model inference, and hopefully open up new possibilities for future research on adaptive LLM interaction.
Authors:Ivan Lopez, Selin S. Everett, Bryan J. Bunning, April S. Liang, Dong Han Yao, Shivam C. Vedak, Kameron C. Black, Sophie Ostmeier, Stephen P. Ma, Emily Alsentzer, Jonathan H. Chen, Akshay S. Chaudhari, Eric Horvitz
Abstract:
Large language models (LLMs) are entering clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during clinical interactions. We combined 61 New England Journal of Medicine Case Records with 92 real-world clinician-AI interactions to evaluate 21 reasoning LLM variants across 8 frontier models on differential diagnosis generation and next step recommendations under three conditions: reasoning alone, after expert clinician context, and after adversarial clinician context. LLM-clinician concordance increased substantially after clinician exposure, with simulations sharing >=3 differential diagnosis items rising from 65.8% to 93.5% and >=3 next step recommendations from 20.3% to 53.8%. Expert context significantly improved correct final diagnosis inclusion across all 21 models (mean +20.4 percentage points), reflecting both reasoning improvement and passive content echoing, while adversarial context caused significant diagnostic degradation in 14 models (mean -5.4 percentage points). Multi-turn disagreement probes revealed distinct model phenotypes ranging from highly conformist to dogmatic, with adversarial arguments remaining a persistent vulnerability even for otherwise resilient models. Inference-time scaling reduced harmful echoing of clinician-introduced recommendations across WHO-defined harm severity tiers (relative reductions: 62.7% mild, 57.9% moderate, 76.3% severe, 83.5% death-tier). In GPT-4o experiments, explicit clinician uncertainty signals improved diagnostic performance after adversarial context (final diagnosis inclusion 27% to 42%) and reduced alignment with incorrect arguments by 21%. These findings establish a foundation for evaluating clinician-AI collaboration, introducing interactive metrics and mitigation strategies essential for safety and robustness.
Authors:Tianhai Liang, Shiyi Guo, Baiye Cheng, Zhengrong Xue, Han Zhang, Huazhe Xu
Abstract:
Human-computer interaction in the visual and auditory domains has achieved considerable maturity, yet machine-to-human tactile feedback remains underdeveloped. Existing tactile displays struggle to simultaneously render multiple tactile dimensions, such as shape, stiffness, and friction, which limits the realism of haptic simulation. Here, we present ArrayTac, a piezoelectric-driven tactile display capable of simultaneously rendering shape, stiffness, and friction to reproduce realistic haptic signals. The system comprises a 4x4 array of 16 actuator units, each employing a three-stage micro-lever mechanism to amplify the micrometer-scale displacement of the piezoelectric element, with Hall sensor-based closed-loop control at the end effector to enhance response speed and precision. We further implement two end-to-end pipelines: 1) a vision-to-touch framework that converts visual inputs into tactile signals using multimodal foundation models, and 2) a real-time tele-palpation system operating over distances of several thousand kilometers. In user studies, first-time participants accurately identify object shapes and physical properties with high success rates. In a tele-palpation experiment over 1,000km, untrained volunteers correctly identified both the number and type of tumors in a breast phantom with 100% accuracy and precisely localized their positions. The system pioneers a new pathway for high-fidelity haptic feedback by introducing the unprecedented capability to simultaneously render an object's shape, stiffness, and friction, delivering a holistic tactile experience that was previously unattainable.
Authors:Abdullah Ghani, Yash Vekaria, Zubair Shafiq
Abstract:
Tracking pixels are used to optimize online ad campaigns through personalization, re-targeting, and conversion tracking. Past research has primarily focused on detecting the prevalence of tracking pixels on the web, with limited attention to how they are configured across websites. A tracking pixel may be configured differently on different websites. In this paper, we present a differential analysis framework: PixelConfig, to reverse-engineer the configurations of Meta Pixel deployments across the web. Using this framework, we investigate three types of Meta Pixel configurations: activity tracking (i.e., what a user is doing on a website), identity tracking (i.e., who a user is or who the device is associated with), and tracking restrictions (i.e., mechanisms to limit the sharing of potentially sensitive information). Using data from the Internet Archive's Wayback Machine, we analyze and compare Meta Pixel configurations on 18K health-related websites with a control group of the top 10K websites from 2017 to 2024. We find that activity tracking features, such as automatic events that collect button clicks and page metadata, and identity tracking features, such as first-party cookies that are unaffected by third-party cookie blocking, reached adoption rates of up to 98.4%, largely driven by the Pixel's default settings. We also find that the Pixel is being used to track potentially sensitive information, such as user interactions related to booking medical appointments and button clicks associated with specific medical conditions (e.g., erectile dysfunction) on health-related websites. Tracking restriction features, such as Core Setup, are configured on up to 34.3% of health websites and 8.7% of control websites. However, even when enabled, these tracking restriction features provide limited protection and can be circumvented in practice.
Authors:Jiyoon Kim, Jie Cai, Srishti Gupta, John M. Carroll
Abstract:
During community decision-making and civic collaboration, conflicts can escalate when people suspect misinformation. We introduce the concept of sense of misinformation as experiencing someone's language or behavior as misinformation when it is not, that is to say when no falsehood is involved. Misinformation and sense of misinformation feel similar and can have similar social consequences; but sense of misinformation rests upon a mistaken perception of someone else's information as false. Through a case study of a casino proposal in local community, we examine how sense of misinformation developed over time during a contentious civic process through key factors (i.e., miscoordination governance, miscommunication between local government and citizens, and conflict and the breakdown of civic discourse), undermining trust and community democracy. Distinguishing between misinformation and sense of misinformation presents a challenge, but it is important. We contribute a conceptual distinction to the misinformation literature by identifying this distinct phenomenon and discuss ways to help communities recognize and repair such misattributions. Finally, we discuss design approaches for mitigating sense of misinformation.
Authors:Eduardo Davalos, Yike Zhang
Abstract:
The rapid integration of conversational AI systems into educational settings has intensified ethical concerns about academic integrity, fairness, and students' cognitive development. Institutional responses have largely centered on AI detection tools and restrictive policies, yet such approaches have proven unreliable and ethically contentious. This paper reframes AI misuse in education not primarily as a detection problem, but as a measurement problem rooted in the loss of visibility into the learning process. When AI enters the assessment loop, educators often retain access to final outputs but lose valuable insight into how those outputs were produced. Drawing on research in cognitive offloading, learning analytics, and multimodal timeline reconstruction, we propose the Learning Visibility Framework, grounded in three principles: clear specification and modeling of acceptable AI use, recognition of learning processes as assessable evidence alongside outcomes, and the establishment of transparent timelines of student activity. Rather than promoting surveillance, the framework emphasizes transparency and shared evidence as foundations for ethical AI integration in classroom settings. By shifting focus from adversarial detection toward process visibility, this work offers a principled pathway for aligning AI use with educational values while preserving trust and transparency between students and educators
Authors:Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacon, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang
Abstract:
Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with a 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. Furthermore, our auto-labeling system further improves the wrist MPJPE by 13.1%.
Authors:Tram Thi Minh Tran, Adrian Wong, Callum Parker, Carlos Alfredo Tirado Cortes, Marius Hoggenmueller, Soojeong Yoo, Nate Zettna, Joel Fredericks
Abstract:
Crisis resilience planning raises urgent questions about how to include non-human species and ecological systems in participatory processes, which remain largely human-centred. This paper reports on a workshop with HCI researchers examining how more-than-human representation is approached in crisis contexts. The workshop combined scenario-based discussion with two design probes -- a voice-based conversational agent and an immersive embodied prototype -- to support sustained discussion of how emerging technologies shape engagement with non-human perspectives. Participants focused not on system usability, but on deliberating representational choices, such as voice, embodiment, and realism, and their potential role within participatory planning processes. The findings suggest that giving 'voice' to non-humans is not a neutral act of translation, but a design challenge that introduces tensions between legitimacy, authority, and authenticity. This paper provides empirical insight into how HCI researchers conceptualise more-than-human representation and positions crisis resilience planning as a critical site for examining AI- and immersion-mediated representation.
Authors:Haoze Guo, Ziqi Wei
Abstract:
Optical Character Recognition (OCR) is a critical but error-prone stage in digital humanities text pipelines. While OCR correction improves usability for downstream NLP tasks, common workflows often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. We present a provenance-aware framework for OCR-corrected humanities corpora that records correction lineage at the span level, including edit type, correction source, confidence, and revision status. Using a pilot corpus of historical texts, we compare downstream named entity extraction across raw OCR, fully corrected text, and provenance-filtered corrections. Our results show that correction pathways can substantially alter extracted entities and document-level interpretations, while provenance signals help identify unstable outputs and prioritize human review. We argue that provenance should be treated as a first-class analytical layer in NLP for digital humanities, supporting reproducibility, source criticism, and uncertainty-aware interpretation.
Authors:Saber Zerhoudi, Michael Granitzer
Abstract:
User simulators are essential for evaluating search systems, but they primarily copy user actions without understanding the underlying thought process. This gap exists since large-scale interaction logs record what users do, but not what they might be thinking or feeling, such as confusion or satisfaction. To solve this problem, we present a framework to infer cognitive traces from behavior logs. Our method uses a multi-agent system grounded in Information Foraging Theory (IFT) and human expert judgment. These traces improve model performance on tasks like forecasting session outcomes and user struggle recovery. We release a collection of annotations for several public datasets, including AOL and Stack Overflow, and an open-source tool that allows researchers to apply our method to their own data. This work provides the tools and data needed to build more human-like user simulators and to assess retrieval systems on user-oriented dimensions of performance.
Authors:Saber Zerhoudi, Michael Granitzer
Abstract:
Simulating nuanced user experiences within complex interactive search systems poses distinct challenge for traditional methodologies, which often rely on static user proxies or, more recently, on standalone large language model (LLM) agents that may lack deep, verifiable grounding. The true dynamism and personalization inherent in human-computer interaction demand a more integrated approach. This work introduces UXSim, a novel framework that integrates both approaches. It leverages grounded data from traditional simulators to inform and constrain the reasoning of an adaptive LLM agent. This synthesis enables more accurate and dynamic simulations of user behavior while also providing a pathway for the explainable validation of the underlying cognitive processes.
Authors:Ruiqing Han, Hao Cui, Taha Yasseri
Abstract:
This research examines whether competence cues can reduce gender bias in evaluations of AI managers and whether these effects depend on how the AI is represented. Across two preregistered experiments (N = 2,505), each employing a 2 x 2 x 3 design manipulating AI gender, competence, and decision outcome, we compared text-based descriptions of AI managers with visually generated AI faces created using a reverse-correlation paradigm. In the text condition, evaluations were driven by competence rather than gender. When participants received unfavourable decisions, high-competence AI managers were judged as fairer, more competent, and better leaders than low-competence managers, regardless of AI gender. In contrast, when the AI manager was visually represented, competence cues had attenuated influence once facial information was present. Instead, participants showed systematic gender-differentiated responses to AI faces, with feminine-appearing managers evaluated as more competent and more trustworthy than masculine-appearing managers, particularly when delivering favourable outcomes. These gender effects were largely absent when outcomes were unfavourable, suggesting that negative feedback attenuates the influence of both competence information and facial cues. Taken together, these findings show that competence information can mitigate negative reactions to AI managers in text-based interactions, whereas facial anthropomorphism elicits gendered perceptual biases not observed in text-only settings. The results highlight that representational modality plays a critical role in determining when gender stereotypes are activated in evaluations of AI systems and underscore that design choices are consequential for AI governance in evaluative contexts.
Authors:Jennica Li, Shirley Zhang, Dakota Sullivan, Bengisu Cagiltay, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz
Abstract:
Household robots boasting mobility, more sophisticated sensors, and powerful processing models have become increasingly prevalent in the commercial market. However, these features may expose users to unwanted privacy risks, including unsolicited data collection and unauthorized data sharing. While security and privacy researchers thus far have explored people's privacy concerns around household robots, literature investigating people's preferred privacy designs and mitigation strategies is still limited. Additionally, the existing literature has not yet accounted for multi-user perspectives on privacy design and household robots. We aimed to fill this gap by conducting in-person participatory design sessions with 15 households to explore how they would design a privacy-aware household robot based on their concerns and expectations. We found that participants did not trust that robots, or their respective manufacturers, would respect the data privacy of household members or operate in a multi-user ecosystem without jeopardizing users' personal data. Based on these concerns, they generated designs that gave them authority over their data, contained accessible controls and notification systems, and could be customized and tailored to suit the needs and preferences of each user over time. We synthesize our findings into actionable design recommendations for robot manufacturers and developers.
Authors:Chin Tseng, Arran Zeyu Wang, Ghulam Jilani Quadri, Danielle Albers Szafir
Abstract:
Colors and shapes are commonly used to encode categories in multi-class scatterplots. Designers often combine the two channels to create redundant encodings, aiming to enhance class distinctions. However, evidence for the effectiveness of redundancy remains conflicted, and guidelines for constructing effective combinations are limited. This paper presents four crowdsourced experiments evaluating redundant color-shape encodings and identifying high-performing configurations across different category numbers. Results show that redundancy significantly improves accuracy in assessing class-level correlations, with the strongest benefits for 5-8 categories. We also find pronounced interaction effects between colors and shapes, underscoring the need for careful pairing in designing redundant encodings. Drawing on these findings, we introduce a categorical palette design tool that enables designers to construct empirically grounded palettes for effective categorical visualization. Our work advances understanding of categorical perception in data visualization by systematically identifying effective redundant color-shape combinations and embedding these insights into a practical palette design tool.
Authors:Nayoung Choi, Jiseung Hong, Peace Cyebukayire, Ikseon Choi, Jinho D. Choi
Abstract:
Artificial intelligence (AI) is increasingly framed as a collaborative partner in creative activities, yet children's interactions with AI have largely been studied in AI-led instructional settings rather than co-creative collaboration. This leaves open questions about how children can meaningfully engage with AI through iterative co-creation. We present Tinker Tales, a tangible storytelling system designed with narrative and social-emotional scaffolding to support child-AI collaboration. The system combines a physical storytelling board, NFC-embedded toys representing story elements (e.g., characters, places, items, and emotions), and a mobile app that mediates child-AI interaction. Children shape and refine stories by placing and moving story elements and interacting with the AI through tangible and voice-based interaction. We conducted an exploratory user study with 10 children to examine how they interacted with Tinker Tales. Our findings show that children treated the AI as an attentive, responsive collaborator, while scaffolding supported coherent narrative refinement without diminishing children's agency.
Authors:Joel Wester, Samuel Rhys Cox, Henning Pohl, Niels van Berkel
Abstract:
Despite growing recognition that responsible AI requires domain knowledge, current work on conversational AI primarily draws on clinical expertise that prioritises diagnosis and intervention. However, much of everyday emotional support needs occur in non-clinical contexts, and therefore requires different conversational approaches. We examine how chaplains, who guide individuals through personal crises, grief, and reflection, perceive and engage with conversational AI. We recruited eighteen chaplains to build AI chatbots. While some chaplains viewed chatbots with cautious optimism, the majority expressed limitations of chatbots' ability to support everyday well-being. Our analysis reveals how chaplains perceive their pastoral care duties and areas where AI chatbots fall short, along the themes of Listening, Connecting, Carrying, and Wanting. These themes resonate with the idea of attunement, recently highlighted as a relational lens for understanding the delicate experiences care technologies provide. This perspective informs chatbot design aimed at supporting well-being in non-clinical contexts.
Authors:Haoze Guo, Ziqi Wei
Abstract:
People who use social media are learning about how the companies that run these platforms make their decisions on who gets to see what through visual indicators in the interface (UI) of each social media site. These indicators are different for each platform and are not always located in an easy-to-find location on the site. Therefore, it is hard for someone to compare different social media platforms or determine whether transparency leads to greater accountability or only leads to increased understanding. A new classification system has been developed to help provide a standard way of categorizing the way, that an algorithm is presented through UI elements and whether the company has provided any type of explanation as to why they are featured. This new classification system includes the following three areas of development: design form, information content, and user agency. This new classification system can be applied to the six social media platforms currently available and serves as a reference database for identifying common archetypes of features in the each social media platform's UI. The new classification system will assist in determining whether or not the transparency of an algorithm functions the way that it was intended when it was developed and provide future design ideas that can help improve the inspectibility, actionability, and contestability of algorithms.
Authors:Tram Thi Minh Tran, Soojeong Yoo, Oliver Weidlich, Yidan Cao, Xinyan Yu, Xin Cheng, Yin Ye, Natalia Gulbransen-Diaz, Callum Parker
Abstract:
While visual augmentation dominates the augmented reality landscape, devices like Meta Ray-Ban audio smart glasses signal growing industry movement toward audio augmented reality (AAR). Hearing is a primary channel for sensing context, anticipating change, and navigating social space, yet AAR's everyday potential remains underexplored. We address this gap through a collaborative autoethnography (N=5, authoring) and an online survey (N=74). We identify ten roles for AAR, grouped into three categories: task- and utility-oriented, emotional and social, and perceptual collaborator. These roles are further layered with a rhythmic and embodied collaborator framing, mapping them onto micro-, meso-, and macro-rhythms of everyday life. Our analysis surfaces nuanced tensions, such as blocking distractions without erasing social presence, highlighting the need for context-aware design. This paper contributes a foundational and forward-looking framework for AAR in everyday life, providing design groundwork for systems attuned to daily routines, sensory engagement, and social expectations.
Authors:Samuel Rhys Cox, Joel Wester, Niels van Berkel
Abstract:
As conversational agents become increasingly common in behaviour change interventions, understanding optimal feedback delivery mechanisms becomes increasingly important. However, choosing a style that both lessens psychological reactance (perceived threats to freedom) while simultaneously eliciting feelings of surprise and engagement represents a complex design problem. We explored how three different feedback styles: 'Direct', 'Politeness', and 'Verbal Leakage' (slips or disfluencies to reveal a desired behaviour) affect user perceptions and behavioural intentions. Matching expectations from literature, the 'Direct' chatbot led to lower behavioural intentions and higher reactance, while the 'Politeness' chatbot evoked higher behavioural intentions and lower reactance. However, 'Politeness' was also seen as unsurprising and unengaging by participants. In contrast, 'Verbal Leakage' evoked reactance, yet also elicited higher feelings of surprise, engagement, and humour. These findings highlight that effective feedback requires navigating trade-offs between user reactance and engagement, with novel approaches such as 'Verbal Leakage' offering promising alternative design opportunities.
Authors:Xiang Li, Wei He, Per Ola Kristensson
Abstract:
How do we evaluate experiences in immersive environments? Despite decades of research in immersive technologies such as virtual reality, the field remains fragmented. Studies rely on overlapping constructs, heterogeneous instruments, and little agreement on what counts as immersive experience. To better understand this landscape, we conducted a bottom-up scoping review of 375 papers published in ACM CHI, UIST, VRST, SUI, IEEE VR, ISMAR, and TVCG. Our analysis reveals that evaluation practices are often domain- and purpose-specific, shaped more by local choices than by shared standards. Yet this diversity also points to new directions. Instead of multiplying instruments, researchers benefit from integrating and refining them into smarter measures. Rather than focusing only on system outputs, evaluations must center the user's lived experience. Computational modeling offers opportunities to bridge signals across methods, but lasting progress requires open and sustainable evaluation practices that support comparability and reuse. Ultimately, our contribution is to map current practices and outline a forward-looking agenda for immersive experience research.
Authors:Olivia Pal, Veda Duddu, Agam Goyal, Drishti Goel, Koustuv Saha
Abstract:
Trust and reliance are often treated as coupled constructs in human-AI interaction research, with the assumption that calibrating trust will lead to appropriate reliance. We challenge this assumption in educational contexts, where students increasingly turn to AI for learning support. Through semi-structured interviews with graduate students (N=8) comparing AI-generated and human-generated responses, we find a systematic dissociation: students exhibit high trust but low reliance on human experts due to social barriers (fear of judgment, help-seeking anxiety), while showing low trust but high reliance on AI systems due to social affordances (accessibility, anonymity, judgment-free interaction). Using Mutual Theory of Mind as an analytical lens, we demonstrate that trust is shaped by epistemic evaluations while reliance is driven by social factors -- and these may operate independently.
Authors:Samuel Rhys Cox, Jade Martin-Lise, Simo Hosio, Niels van Berkel
Abstract:
People increasingly turn to conversational agents such as ChatGPT to seek guidance for their personal problems. As these systems grow in capability, many now display elements of "thinking": short reflective statements that reveal a model's intentions or values before responding. While initially introduced to promote transparency, such visible thinking can also anthropomorphise the agent and shape user expectations. Yet little is known about how these displays affect user perceptions in help-seeking contexts. We conducted a 3 x 2 mixed design experiment examining the impact of 'Thinking Content' (None, Emotionally-Supportive, Expertise-Supportive) and 'Conversation Context' (Habit-related vs. Feelings-related problems) on users' perceptions of empathy, warmth, competence, and engagement. Participants interacted with a chatbot that either showed no visible thinking or presented value-oriented reflections prior to its response. Our findings contribute to understanding how thinking transparency influences user experience in supportive dialogues, and offer implications for designing conversational agents that communicate intentions in sensitive, help-seeking scenarios.
Authors:Patrick Yung Kang Lee, Jessica Y. Bo, Zixin Zhao, Paula Akemi Aoyagui, Matthew Varona, Ashton Anderson, Anastasia Kuzminykh, Fanny Chevalier, Carolina Nobre
Abstract:
Individuals are turning to increasingly anthropomorphic, general-purpose chatbots for AI companionship, rather than roleplay-specific platforms. However, not much is known about how individuals perceive and conduct their relationships with general-purpose chatbots. We analyzed semi-structured interviews (n=13), survey responses (n=43), and community discussions on Reddit (41k+ posts and comments) to triangulate the internal dynamics, external influences, and steering strategies that shape AI companion relationships. We learned that individuals conceptualize their companions based on an interplay of their beliefs about the companion's own agency and the autonomy permitted by the platform, how they pursue interactions with the companion, and the perceived initiatives that the companion takes. In combination with the external entities that affect relationship dynamics, particularly model updates that can derail companion behaviour and stability, individuals make use of different types of steering strategies to preserve their relationship, for example, by setting behavioural instructions or porting to other AI platforms. We discuss implications for accountability and transparency in AI systems, where emotional connection competes with broader product objectives and safety constraints.
Authors:Mengli, Duan, Yuhe, Jiang, Matthew Varona, Carolina Nobre
Abstract:
Multimodal Large Language Models (MLLMs) are increasingly used to interpret visualizations, yet little is known about why they fail. We present the first systematic analysis of barriers to visualization literacy in MLLMs. Using the regenerated Visualization Literacy Assessment Test (reVLAT) benchmark with synthetic data, we open-coded 309 erroneous responses from four state-of-the-art models with a barrier-centric strategy adapted from human visualization literacy research. Our analysis yields a taxonomy of MLLM failures, revealing two machine-specific barriers that extend prior human-participation frameworks. Results show that models perform well on simple charts but struggle with color-intensive, segment-based visualizations, often failing to form consistent comparative reasoning. Our findings inform future evaluation and design of reliable AI-driven visualization assistants.
Authors:Houjiang Liu, Yujin Choi, Sanjana Gautam, Gabriel Jaffe, Soo Young Rieh, Matthew Lease
Abstract:
LLM-based agents offer new potential to accelerate science and reshape research work. However, the quality of researcher contributions can vary significantly depending on human ability to steer agent behaviors. How can we best use these tools to augment scientific creativity without undermining aspects of contribution and ownership that drive research? To investigate this, we developed an agentic research ideation system integrating three roles -- Ideator, Writer, and Evaluator -- across three control levels -- Low, Medium, and Intensive. Our mixed-methods study with 54 researchers suggests three key findings in how LLM-based agents reshape scientific creativity: 1) perceived creativity support does not simply increase linearly with greater control; 2) human effort shifts from ideating to verifying ideas; and 3) ownership becomes a negotiated outcome between human and AI. Our findings suggest that LLM agent design should emphasize researcher empowerment, fostering a sense of ownership over strong ideas rather than reducing researchers to operating an automated AI-driven process.
Authors:Niva Manchanda, Akshata Kishore Moharir, Isabel Michel, Ratna Kandala
Abstract:
Large Language Models (LLMs) are increasingly being used to provide support and advice in personal domains such as romantic relationships, yet little is known about user perceptions of this type of advice. This study investigated how people evaluate advice on LLM-generated romantic relationships. Participants rated advice satisfaction, model reliability, and helpfulness, and completed pre- and post-measures of their general attitudes toward LLMs. Overall, the results showed participants' high satisfaction with LLM-generated advice. Greater satisfaction was, in turn, strongly and positively associated with their perceptions of the models' reliability and helpfulness. Importantly, participants' attitudes toward LLMs improved significantly after exposure to the advice, suggesting that supportive and contextually relevant advice can enhance users' trust and openness toward these AI systems.
Authors:Haoze Guo, Ziqi Wei
Abstract:
Retrieval-augmented generation (RAG) systems put more and more emphasis on grounding their responses in user-generated content found on the Web, amplifying both their usefulness and their attack surface. Most notably, indirect prompt injection and retrieval poisoning attack the web-native carriers that survive ingestion pipelines and are very concerning. We provide OpenRAG-Soc, a compact, reproducible benchmark-and-harness for web-facing RAG evaluation under these threats, in a discrete data package. The suite combines a social corpus with interchangeable sparse and dense retrievers and deployable mitigations - HTML/Markdown sanitization, Unicode normalization, and attribution-gated answered. It standardizes end-to-end evaluation from ingestion to generation and reports attacks time of one of the responses at answer time, rank shifts in both sparse and dense retrievers, utility and latency, allowing for apples-to-apples comparisons across carriers and defenses. OpenRAG-Soc targets practitioners who need fast, and realistic tests to track risk and harden deployments.
Authors:Saber Zerhoudi, Michael Granitzer
Abstract:
The diversification of information access systems, from RAG to autonomous agents, creates a critical need for comparative user studies. However, the technical overhead to deploy and manage these distinct systems is a major barrier. We present UXLab, an open-source system for web-based user studies that addresses this challenge. Its core is a web-based dashboard enabling the complete, no-code configuration of complex experimental designs. Researchers can visually manage the full study, from recruitment to comparing backends like traditional search, vector databases, and LLMs. We demonstrate UXLab's value via a micro case study comparing user behavior with RAG versus an autonomous agent. UXLab allows researchers to focus on experimental design and analysis, supporting future multi-modal interaction research.
Authors:Saber Zerhoudi, Michael Granitzer
Abstract:
A fundamental tension exists between the demand for sophisticated AI assistance in web search and the need for user data privacy. Current centralized models require users to transmit sensitive browsing data to external services, which limits user control. In this paper, we present a browser extension that provides a viable in-browser alternative. We introduce a hybrid architecture that functions entirely on the client side, combining two components: (1) an adaptive probabilistic model that learns a user's behavioral policy from direct feedback, and (2) a Small Language Model (SLM), running in the browser, which is grounded by the probabilistic model to generate context-aware suggestions. To evaluate this approach, we conducted a three-week longitudinal user study with 18 participants. Our results show that this privacy-preserving approach is highly effective at adapting to individual user behavior, leading to measurably improved search efficiency. This work demonstrates that sophisticated AI assistance is achievable without compromising user privacy or data control.
Authors:Wei He, Xiang Li, Per Ola Kristensson, Ge Lin Kan
Abstract:
Virtual locomotion remains a challenge in VR, especially in space-limited environments where room-scale walking is impractical. We present LocoScooter, a low-cost, deployable locomotion interface combining foot-sliding on a compact treadmill with handlebar steering inspired by scooter riding. Built from commodity hardware, it supports embodied navigation through familiar, physically engaging movement. In a within-subject study (N = 14), LocoScooter significantly improved immersion, enjoyment, and bodily involvement over joystick navigation, while maintaining comparable efficiency and usability. Despite higher physical demand, users did not report increased fatigue, suggesting familiar movements can enrich VR navigation.
Authors:Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko
Abstract:
We design a large-language-model (LLM) agent that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both because of the LLM's semi-autonomy and because ultimately the FCM dynamical system's equilibria drive the LLM agents to fetch and process causal text. The fetched text can in principle modify the adaptive FCM causal structure and so modify the source of its quasi-autonomy--its equilibrium limit cycles and fixed-point attractors. This bidirectional process endows the evolving FCM dynamical system with a degree of autonomy while still staying on its agentic leash. We show in particular that a sequence of three finely tuned system instructions guide an LLM agent as it systematically extracts key nouns and noun phrases from text, as it extracts FCM concept nodes from among those nouns and noun phrases, and then as it extracts or infers partial or fuzzy causal edges between those FCM nodes. We test this FCM generation on a recent essay about the promise of AI from the late diplomat and political theorist Henry Kissinger and his colleagues. This three-step process produced FCM dynamical systems that converged to the same equilibrium limit cycles as did the human-generated FCMs even though the human-generated FCM differed in the number of nodes and edges. A final FCM mixed generated FCMs from separate Gemini and ChatGPT LLM agents. The mixed FCM absorbed the equilibria of its dominant mixture component but also created new equilibria of its own to better approximate the underlying causal dynamical system.
Authors:Kellie Yu Hui Sim, Kenny Tsu Wei Choo
Abstract:
Relationship-centred care (RCC) recognises that healthcare quality depends not only on outcomes, but on how voice, responsibility, and emotional labour are negotiated among patients, caregivers, and providers. As AI systems enter sensitive care contexts, they introduce a new participant into these negotiations. Drawing on empirical work in Advance Care Planning (ACP) and peer support, we argue that AI's primary impact in high-subjectivity domains is not optimisation but redistribution: it reorganises who speaks, who decides, and who bears moral responsibility. Across both settings, participants were less concerned with technical accuracy than with relational consequences: whether AI would appropriately represent their decision, reduce burden, or blur accountability, scaffold connection, or subtly displace it. We identify three relational dimensions: authority, temporality, and visibility, through which AI reshapes care relationships, and propose design provocations centred on relational legibility, bounded agency, responsibility traceability, and non-substitutive scaffolding.
Authors:Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer, Robin Jia
Abstract:
Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively. Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.
Authors:Megha Srivastava, Jonathan Ouyang, Eric Zhou, Andrew Silva, Emily Sumner, Dorsa Sadigh, Yuchen Cui, Deepak Gopinath, Guy Rosman
Abstract:
Skill atrophy, the gradual decline of human capability under AI assistance, poses a safety risk in shared-control of semi-autonomous systems, where operators may be unable to distinguish their own inputs from autonomous corrections. We propose Proximal State Nudging (PSN), a shared autonomy algorithm that jointly optimizes for skill development and task performance by nudging users toward states estimated to be most learnable. We first show that PSN outperforms existing shared autonomy baselines in balancing student improvement in unassisted reward with overall shared performance, using simulated students in the classic LunarLander environment. We then present, to the best of our knowledge, the first human subject studies of a planner incorporating learning-compatible shared autonomy: across two driving tasks in the CARLA simulator (High Performance Racing and Parallel Parking, n = 60), PSN produces up to 7x larger gains in unassisted skill than standard blended shared autonomy, while incurring 50% fewer collisions than unassisted self-practice.
Authors:Jingjing Li, Zhi Liu, Xiyao Jin, Tatsuki Fushimi, Yoichi Ochiai
Abstract:
Cultural heritage exhibitions often struggle to sustain attention and support reflective engagement. Physical exhibitions rely on fixed interpretive aids that lack adaptability to individual backgrounds or curiosity, and their effectiveness depends heavily on a visitor's Personal Context, prior knowledge, and cultural literacy. Meanwhile, digital exhibitions prioritize convenience and accessibility but risk weakening the Physical and Social Contexts that define embodied cultural experience. WhiteTesseract addresses this gap by enabling in-situ interpretation through high-resolution XR and conversational AI. The system integrates spatial intelligence via artwork recognition to allow visitors to selectively reduce environmental distractions (via diminished reality) and engage in context-aware dialogue (via large language models). The goal is to preserve the richness of the physical and social environment while providing a flexible space for personal reflection, enhancing Personal Context without compromising physical authenticity. We deployed the system in a Claude Monet exhibition and conducted a controlled user study with 26 participants. Quantitative results showed that WhiteTesseract modulation significantly increased average viewing duration from 35.3 to 98.3 seconds (p < 0.001). Analysis of 529 visitor-AI interactions revealed that 60% extended beyond factual queries to include analytical, emotional, and comparative inquiries. These findings demonstrate how XR and AI can enrich the physical exhibition experience by supporting deeper, more personalized engagement without displacing the embodied value of cultural heritage. We discuss technical and social constraints for real-world deployment and limitations of our controlled setting.
Authors:Amirmohammad Nazari, Sadra Sabouri, Wang Bill Zhu, Robin Jia, Souti Chattopadhyay, Mukund Raghothaman
Abstract:
Many software development tasks, such as implementing features and fixing bugs, begin with developers posing questions about a codebase. However, answering questions about codebases that span millions of lines of code across thousands of files is non-trivial. Standard tools like grep cannot answer questions requiring semantic or inter-procedural reasoning, and large language models (LLMs) struggle with large codebases due to resource and context constraints. In this paper, we present Merlin, a new system for answering free-form questions that require analytical reasoning about code. Merlin integrates an LLM with CodeQL, a program analysis framework that supports expressive queries over large codebases. We face two principal challenges in the design of such systems: First, program analysis queries are diverse and semantically complex; as a result, even syntactically well-formed queries frequently produce degenerate/empty results. Furthermore, relatively few CodeQL queries are available online, limiting the out-of-the-box effectiveness of LLMs as CodeQL query generators. We address these challenges by developing a RAG-based iterative query-generation approach and a novel self-test technique. Our query debugging technique builds on the idea of assistive queries, which generate concrete witnesses that expose and explain semantic flaws in candidate queries. We evaluate Merlin through both experimental and user studies. Over a set of natural language questions derived from common bug-finding tasks, Merlin discovered not only the majority of software issues reported by other approaches, but also issues that would have otherwise remained undetected. Through a within-subject user study, we found that access to Merlin increased task accuracy by an average of 3.8* and simultaneously reduced the time for programmers to complete all tasks by 31%.
Authors:Huyen N. Nguyen, Astrid van den Brandt, Nils Gehlenborg
Abstract:
Evaluating visualization systems in niche domains such as genomics is challenging due to scarcity of domain experts and difficulty recruiting a representative user base. While LLM-based synthetic personas are increasingly used to ease evaluation bottlenecks, they face well-founded skepticism. Rather than weighing synthetic personas as substitutes for real users, we ask a fundamental open question: when synthetic personas evaluate a real visualization system, what do they actually produce, and how does that output change when grounded in documented human contexts? We present Sycamore, an exploratory three-condition probe design using Geranium, a search engine for multimodal genomics visualization, as a case study. Sycamore evaluates Geranium using: (1) ungrounded synthetic personas from generic LLM priors; (2) grounded synthetic personas constrained by voice-of-customer artifacts from a prior interview study; and (3) a published baseline study of real domain experts. We observe that grounding shifts synthetic feedback toward the language and concerns of documented users, while ungrounded evaluators drift toward operational specifics that real participants did not raise; both synthetic conditions, however, converge on a find-and-adapt frame and miss the image-modality preference observed in the expert study. We discuss what these observations imply for where synthetic personas might fit alongside expert studies in domain-specific visualization evaluation. All supplemental materials are available at https://osf.io/kdfr3/.
Authors:Bin Wang, Yue Liu, Benjamin Newman, Ajoy S. Fernandes, Zhiyuan Wang, Robert Cavin, Michele A. Cox, Vijay Rajanna, Takumi Bolte, Melissa Hunfalvay, Ulas Bagci, Michael J. Proulx
Abstract:
Smart glasses with AI assistants are increasingly used in daily life. However, current systems lack awareness of the user's internal cognitive state, leaving them unable to proactively anticipate users' needs without access to cognitive load. Existing methods for assessing cognitive load either rely on impractical sensors for lightweight eyewear or utilize eye gaze-based models that suffer from poor interpretability, and require task-specific fine-tuning, often failing to generalize across individuals. We propose GazeMind, a gaze-guided LLM agent framework for cognitive load assessment on smart glasses. It encodes eye-tracking data into structured representations for LLM-based reasoning and provides interpretable cognitive load predictions. Importantly, GazeMind generalizes across scenarios without LLM fine-tuning through a novel task-guidance reasoning approach and achieves personalized adaptation by incorporating user-specific characteristics and historical references. To support evaluation, we introduce CogLoad-Bench, the largest gaze-based cognitive load dataset with 152 participants, 40+ hours of multimodal data, and 10K+ real-time annotations across controlled and real-world tasks. Experiments show that GazeMind achieves state-of-the-art performance, outperforming baselines by over 20% across all metrics.
Authors:Alvaro Becerra, Alejandra Palma, Ruth Cobos
Abstract:
Effective peer feedback is essential for developing critical reflection in higher education, yet its impact is often limited by the inconsistent quality of student-generated comments. This paper presents the implementation and deployment of AICoFe (AI-based Collaborative Feedback), a system designed to bridge this gap through a human-centered AI approach. We describe a modular architecture that orchestrates a multi-LLM pipeline, utilizing GPT-4.1-mini, Gemini 2.5 Flash, and Llama 3.1, to synthesize quantitative rubric data and qualitative observations into coherent, actionable feedback. Key to the system is a "teacher-in-the-loop" mediation workflow, where educators use specialized Learning Analytics dashboards to curate and refine AI-generated drafts before delivery. Furthermore, we detail the underlying data infrastructure, which employs a hybrid SQL and MongoDB strategy to ensure traceability and manage semi-structured feedback versions.
Authors:Alvaro Becerra, Diego Gomez, Ruth Cobos
Abstract:
Providing timely and actionable feedback on oral presentation slides is challenging in higher education, particularly in large classes where teachers cannot realistically deliver detailed formative feedback before students present. This paper introduces AISSA (AI-based Student Slides Analysis tool), a web-based system that combines large language models (LLMs) and Learning Analytics dashboards to support scalable, rubric-based feedback on presentation slides. AISSA allows students to upload their slide decks prior to an oral presentation and automatically receive quantitative scores and qualitative feedback based on teacher-defined evaluation rubrics. The system analyzes both slide-level features and slide content, generates structured feedback through an LLM (ChatGPT 5.2), and presents the results through interactive dashboards for students and teachers. We tested AISSA on a pilot deployment with 46 undergraduate students in a real academic setting. The results indicate that AISSA is technically reliable, economically feasible, and perceived by students as useful for iterative slide improvement. These findings suggest that combining LLM-based analysis with Learning Analytics dashboards is a promising approach for supporting formative feedback on presentation slides at scale.
Authors:Zikang Leng, Edan Eyal, Yingtian Shi, Jiaman He, Yaqi Liu, Thomas Plötz
Abstract:
Engagement, which links to attentional, emotional, and cognitive dimensions, plays an important role in learning. In online and video-based learning environments, learners often need to regulate their own interactions with instructional materials. Measuring and reflecting on engagement can therefore support both learners and adaptive learning systems. In this study, we use wearable and camera-based sensing devices to collect physiological and motion signals, including PPG, ECG, EDA, EEG, IMU, heart rate, temperature, and eye-tracking data, to estimate learner engagement. We conducted a user study with 16 participants in a video-based learning scenario, where participants completed learning tasks and provided repeated in-situ self-reports of engagement through brief probes. We develop and evaluate a system for engagement estimation, compare different sensing modalities, and further analyze the feasibility and effectiveness of multimodal modeling for characterizing learner engagement. Across participant-based cross-validation, our model achieves an MAE of 0.81, 83.75% within-1 accuracy, 73.93% binary accuracy, and 68.45% binary Macro-F1, outperforming sensor-free, statistical, deep temporal, foundation-model, and LLM-based baselines. Our results suggest that fine-grained engagement estimation is feasible but inherently noisy, and that practical systems should prioritize lightweight combinations of behavioral and physiological signals over full multimodal instrumentation. We release the EduGage dataset, including synchronized multimodal sensor signals, probe-aligned momentary engagement labels, video metadata, quizzes, and study materials, to support reproducible research on fine-grained sensor-based engagement modeling in self-guided learning.
Authors:Kangyu Yuan, Guanzheng Chen, Sizhe Liang, Hehai Lin, Qingyu Guo, Dingdong Liu, Xiaojuan Ma, Zhenhui Peng
Abstract:
Critical news reading (CNR), which requires grasping the holistic ideas of and raising critical thoughts on the news, is beneficial yet challenging for general people who usually get information on daily social media. Comments under the news can aid CNR by providing complementary information and other readers' diverse and critical thoughts. However, it is under-investigated how to leverage these comments to support users in CNR. In this paper, we first derive user requirements for a comment-based CNR tool from literature and a formative study (N=12). Then, we develop CoNewsReader, a comment-based interactive CNR tool powered by a large language model. CoNewsReader supports users in grasping the news idea with complementary information from comments, filtering useful comments for CNR, and getting questions generated based on the comments to conduct critical thinking. Our within-subjects study with 24 university students indicates that compared to a baseline news reading interface in social media, participants with CoNewsReader have a more engaging CNR experience and perform better on comprehending the news and raising critical thoughts. We discuss design considerations for supporting reading tasks with user- and machine-generated content.
Authors:Julius Rauscher, Frederik L. Dennig, Udo Schlegel, Daniel A. Keim, Tobias Schreck
Abstract:
The analysis of spatiotemporal data is essential in domains such as epidemiology and environmental monitoring, where understanding the interplay between spatially distributed phenomena and their temporal evolution is critical. Dense pixel visualizations offer a compact, effective overview of spatiotemporal dynamics. However, the necessary linearization of 2D geographic space into a 1D ordering inevitably introduces structural distortions that manifest as visual artifacts. We propose a measure-driven visual analytics approach that captures visual artifacts through neighborhood preservation measures for 1D orderings and renders them using visual boosting techniques such as glyphs, halos, and hatching. We demonstrate our approach through a usage scenario analyzing COVID-19 incidence data across German districts, showing that interactive, measure-driven boosting enables analysts to reliably distinguish genuine spatial patterns from linearization artifacts.
Authors:Michael F Xu, Bengisu Cagiltay, Yaxin Hu, Anjun Zhu, Bilge Mutlu
Abstract:
The sense of family connectedness may support positive outcomes including individual well-being, resilience, and healthy family functioning. However, as technologies advance, they often replace human-human interactions instead of nurturing them. In this work, we investigate how robot-facilitated communication tools might instead create new opportunities for family connection. We conducted two studies with families with children aged 5-12. We first explored the design space through in-home technology probe sessions with six families. These probes inspired us to explore two key interaction design dimensions: the robot's behavior strategy (passive, reactive, proactive) and the mode of communication (synchronous, asynchronous). We then conducted a laboratory study with 20 families to examine how the two dimensions shaped parent-child interaction and connection. Our findings characterize how parents and children appropriated robot-mediated exchanges, the tensions they experienced around initiative, timing, and privacy, and the opportunities they envisioned for supporting everyday connectedness.
Authors:Alexis Carrillo, Salvatore Citraro, Ali Aghazhadeh Ardebili, Enrique Taietta, Giulio Rossetti, Emilio Ferrara, Giuseppe Alessandro Veltri, Massimo Stella
Abstract:
Scarce longitudinal evidence examines LLMs' persuasiveness and humanness along time-evolving psychological frameworks. We introduce Talk2AI, a longitudinal framework quantifying psycho-social, reasoning and affective dimensions of LLMs' persuasiveness about polarizing societal topics. In a four-way longitudinal setup, Talk2AI's 770 participants engaged in structured conversations with one of four leading LLMs on topics like climate change, social media misinformation, and math anxiety. This produced 3,080 conversations over 60,000 turns. After each wave, participants reported conviction in their initial topic stance, perceived opinion change, LLM's perceived humanness, a self-donation to the topic and a textual explanation. Feedback time series showed longitudinal inertia in convictions, indicating some human anchoring to initial opinions even after repeated exposure to AI-generated arguments. Interestingly, NLP analyses revealed that both humans and LLMs relied on fallacious reasoning in 1 conversational quip every 6, countering the ``LLMs as superior systems" stereotype behind LLMs' cognitive surrender. LLMs' perceived humanness was most learnable from sociodemographic, psychological and engagement features ($R^2=0.44$), followed by opinion change ($R^2=0.34$), conviction ($R^2=0.26$) and personal endowment ($R^2=0.24$). Crucially, explainable AI (XAI) indicated: (i) the presence of individuals more susceptible to LLM-based opinion changes; (ii) psychological susceptibility to LLM-convincing consisted of having more trust in LLMs, being more agreeable and extraverted and with a higher need for cognition. A multiverse approach with mixed-effects models confirmed XAI results, alongside strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how GenAI can influence human opinions via multiple psycho-social pathways in AI-human digital platforms.
Authors:Xiaolin Wen, Changlin Li, Manusha Karunathilaka, Can Liu, Fangzhuo Jin, Yong Wang
Abstract:
Many expressive visualizations are shared online only as bitmap images, making them difficult to redesign or adapt to new data. Reusing such image-based visualizations requires substantial expertise and is often time-consuming, even for experienced visualization practitioners. Existing work on reproducing visualizations often relies on structured SVG or specifications, supports limited visualization types, and offers limited flexibility for customization. To address these challenges, we present ReVis, a human-AI collaboration approach that enables flexible reuse of image-based visualizations. First, a generic Domain-Specific language (DSL) is proposed to model complex visualizations and support both visualization decomposition and reproduction. Then, ReVis employs an MLLM-based pipeline to parse an image-based visualization into the DSL, delineating its core visual structures and data-to-encoding mappings, and further reproduces the visualization from the DSL. Finally, ReVis includes an interactive interface to allow users to upload visualization images, inspect reproduced results, update the underlying data, and customize visual encodings. A gallery of 40 visualizations demonstrates the expressiveness of the DSL, and a quantitative study evaluates the reproduction quality of ReVis on these examples. Two usage scenarios and user interviews with 16 visualization practitioners demonstrate the effectiveness of ReVis.
Authors:Jordan Taylor, Joel Mire, Alicia DeVrio, Maarten Sap, Haiyi Zhu, Sarah E. Fox
Abstract:
Art-making is a collective social activity through which queer people engage in political resistance, develop identities, archive queer memory, and form community. However, in recent years, generative AI has disrupted queer artistic communities. Through 15 semi-structured interviews, we examine how queer artists are making sense of the encroachment of GenAI into their art worlds. Our findings surface significant tensions between the relationality of our participants' queer art practices and the perceived anti-relationality of GenAI development and use. We detail how our participants refuse and resist GenAI use and development in response and highlight the limited role our participants saw for GenAI within art-making, such as the queer aesthetic potential of surreal image models. Drawing on queer theory, we discuss how CSCW researchers might support queer artists by refusing dominant AI imaginaries and supporting queer world-building.
Authors:Kellie Yu Hui Sim, Kenny Tsu Wei Choo
Abstract:
Peer support is increasingly positioned as a scalable response to gaps in mental health care, particularly in digitally mediated settings, yet what counts as peer support and how responsibility is distributed remain unevenly defined in practice. Drawing on interviews with peer supporters, we show how lived experience, moral commitment, and self-identification shape participation while blurring expectations around scope, authority, and accountability. Institutional ambiguity concentrates emotional labour, boundary-setting, and escalation of responsibility at the individual level, often without consistent organisational scaffolding. Participants evaluated AI not primarily through empathy or technical capability, but through how technologies redistribute risk, labour, and accountability within already fragile support roles. Building on these findings, we outline design futures for an AI-supported peer support ecosystem that foregrounds responsibility as a central design concern rather than treating AI as a mechanism of scale.
Authors:Kellie Yu Hui Sim, Pin Sym Foong, Darryl Lim, John-Henry Lim, Kenny Tsu Wei Choo
Abstract:
Work on persona-persistent post-mortem agents typically frames design around a life/death binary. This framing neglects a consequential yet under-theorised condition: when individuals remain alive but have impaired decisional capacity. Drawing on a multi-phase workshop in which participants trained and reflected on an AI agent for Advance Care Planning, we examined how people reason about agentic delegation post-capacity loss. Initially, participants favoured bounded agents grounded in first-party authorship and representational fidelity over autonomous or evolving stand-ins. However, temporality introduced novel ideas like adjacent use driven by persona persistence over functional expansion: agents should evolve while users retain capacity, remain static once capacity is lost, but somehow inform adjacent post-mortem uses. We discuss the implications of these findings and propose that the configuration of agents for post-capacity use reshapes our understanding of provenance, temporality, and legitimacy for post-mortem agents.
Authors:Yulin Yu, Yizhou Li, Siddharth Suri, Scott Counts
Abstract:
Conversational generative AI systems such as ChatGPT are transforming how people seek and engage with information online. Unlike traditional search engines, these systems support open-ended, conversational inquiry, yet it remains unclear whether they ultimately expand or constrain the diversity of knowledge that users encounter in online search spaces, a primary foundation for knowledge work, learning, and innovation. Using over 200,000 real-world human-ChatGPT interactions, we examine how generative-AI-mediated inquiry reshapes diversity in both user inputs and system outputs through the lens of searchability - whether queries could plausibly be answered by traditional search engines. We find that almost 80% of ChatGPT user queries are non-searchable and span a broader knowledge space and topics than searchable queries, indicating expanded modes of inquiry. However, for comparable searchable queries, AI responses are less diverse than Google search results in the majority of topics. Moreover, the diversity of AI responses predicts subsequent changes in users' inquiry diversity, revealing a feedback loop between AI outputs and human exploration. These findings highlight a tension between expanded inquiry and constrained information exposure, with implications for designing hybrid search and generative-AI systems that better support exploratory knowledge seeking.
Authors:Huyen N. Nguyen, Nils Gehlenborg
Abstract:
Current resources for data literacy education, such as visualization galleries and datasets, provide useful examples but lack mechanisms for learners to query, compare, and navigate the visualization design space efficiently. This position paper advocates for visualization retrieval as essential infrastructure for data literacy, transforming static collections into dynamic, inquiry-based learning environments. We analyze the role of retrieval across the data lifecycle, demonstrating how it facilitates design space exploration and vocabulary expansion, supports data consumption through visualization comparison and critique, and aids data management via resource curation. We outline key opportunities for future research and system design, including integrated retrieval-authoring environments, pedagogical relevance modeling, and collaborative educational corpora. Ultimately, we argue that visualization retrieval systems empower learners to articulate intent, bridge technical barriers, and proactively reason with data.
Authors:Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang, Aladine Chetouani, Ulas Bagci, Alessandro Bruno
Abstract:
Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
Authors:Kuang Yuan, Freddy Yifei Liu, Tong Xiao, Yiwen Song, Chengyi Shen, Saksham Bhutani, Justin Chan, Swarun Kumar
Abstract:
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real-time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100--1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.
Authors:Yi Ru Wang, Carter Ung, Evan Gubarev, Christopher Tan, Siddhartha Srinivasa, Dieter Fox
Abstract:
Evaluation of robotic manipulation systems has largely relied on fixed benchmarks authored by a small number of experts, where task instances, constraints, and success criteria are predefined and difficult to extend. This paradigm limits who can shape evaluation and obscures how policies respond to user-authored variations in task intent, constraints, and notions of success. We argue that evaluating modern manipulation policies requires reframing evaluation as a language-driven process over structured physical domains. We present RoboPlayground, a framework that enables users to author executable manipulation tasks using natural language within a structured physical domain. Natural language instructions are compiled into reproducible task specifications with explicit asset definitions, initialization distributions, and success predicates. Each instruction defines a structured family of related tasks, enabling controlled semantic and behavioral variation while preserving executability and comparability. We instantiate RoboPlayground in a structured block manipulation domain and evaluate it along three axes. A user study shows that the language-driven interface is easier to use and imposes lower cognitive workload than programming-based and code-assist baselines. Evaluating learned policies on language-defined task families reveals generalization failures that are not apparent under fixed benchmark evaluations. Finally, we show that task diversity scales with contributor diversity rather than task count alone, enabling evaluation spaces to grow continuously through crowd-authored contributions. Project Page: https://roboplayground.github.io
Authors:Jacy Reese Anthis, Hannah Cha, Solon Barocas, Alexandra Chouldechova, Jake Hofman
Abstract:
The capabilities of artificial intelligence (AI) lie along a jagged frontier, where AI systems surprisingly fail on tasks that humans find easy and succeed on tasks that humans find hard. To investigate user reactions to this phenomenon, we developed an incentive-compatible experimental methodology based on diagram generation tasks, in which we induce errors in generative AI output and test effects on user reliance. We demonstrate the interface in a preregistered 3x2 experiment (N = 577) with error rates of 10%, 30%, or 50% on easier or harder diagram generation tasks. We confirmed that observing more errors reduces use, but we unexpectedly found that easy-task errors did not significantly reduce use more than hard-task errors, suggesting that people are not averse to jaggedness in this experimental setting. We encourage future work that varies task difficulty at the same time as other features of AI errors, such as whether the jagged error patterns are easily learned.
Authors:Kenan Tang, Jiasheng Guo, Jeffrey Lin, Yao Qin
Abstract:
Facial expressions of characters are a vital component of visual storytelling. While current AI image editing models hold promise for assisting artists in the task of stylized expression editing, these models introduce global noise and pixel drift into the edited image, preventing the integration of these models into professional image editing software and workflows. To bridge this gap, we introduce ExpressEdit, a fully open-source Photoshop plugin that is free from common artifacts of proprietary image editing models and robustly synergizes with native Photoshop operations such as Liquify. ExpressEdit seamlessly edits an expression within 3 seconds on a single consumer-grade GPU, significantly faster than popular proprietary models. Moreover, to support the generation of diverse expressions according to different narrative needs, we compile a comprehensive expression database of 135 expression tags enriched with example stories and images designed for retrieval-augmented generation. We open source the code and dataset to facilitate future research and artistic exploration.
Authors:Jeremy H. M. Wong, Nancy F. Chen
Abstract:
In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word instead of phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
Authors:Shiwei Chen, Niruthikka Sritharan, Xiaolin Wen, Chenxi Zhang, Xingbo Wang, Yong Wang
Abstract:
Current Large Language Models (LLMs), especially Large Reasoning Models, can generate Chain-of-Thought (CoT) reasoning traces to illustrate how they produce final outputs, thereby facilitating trust calibration for users. However, these CoT reasoning traces are usually lengthy and tedious, and can contain various issues, such as logical and factual errors, which make it difficult for users to interpret the reasoning traces efficiently and accurately. To address these challenges, we develop an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level. Building on this pipeline, we propose ReasonDiag, an interactive visualization system for diagnosing CoT reasoning traces. ReasonDiag provides 1) an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and 2) a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies. We evaluate ReasonDiag through a technical evaluation for the error detection pipeline, two case studies, and user interviews with 16 participants. The results indicate that ReasonDiag helps users effectively understand CoT reasoning traces, identify erroneous steps, and determine their root causes.
Authors:Dora Zhao, Hannah Cha, Michael J. Ryan, Angelina Wang, Rachel Baker-Ramos Evyn-Bree Helekahi-Kaiwi, Rebecca Diego, Josiah Hester, Diyi Yang
Abstract:
Although generative AI is being deployed into classrooms with promises of aiding teachers, educators caution that these tools can have unintended pedagogical repercussions, including cultural misrepresentation and bias. These concerns are heightened in low-resource language and Indigenous education settings, where AI systems frequently underperform. We investigate these challenges in Hawai`i, where public schools operate under a statewide mandate to integrate Hawaiian language and culture into education. Through four co-design workshops with 22 public school educators, we surfaced concerns about using generative AI in educational settings, particularly around cultural misrepresentation, and corresponding designs for auditing tools that address these issues. We find that educators envision tools grounded in specific Hawaiian cultural values and practices, such as tracing the genealogy of knowledge in source materials. Building on these insights, we conceptualize AI auditing as a community-oriented process rather than the work of isolated individuals, and discuss implications for designing auditing tools.
Authors:Yinuo Yang, Zheng Zhang, Ningzhi Tang, Xu Wang, Alex Ambrose, Nathaniel Myers, Patrick Clauss, Toby Jia-Jun Li
Abstract:
AI-supported writing tools show strong potential for scaffolding students' learning of argumentative writing. Prior work has demonstrated the benefits of AI-supported cognitive scaffolds, such as idea exploration and argument refinement, but how these features function in authentic classroom settings remains underexplored. In this paper, we investigate the classroom integration of an AI-supported writing tool, VISAR. We deployed VISAR in an undergraduate writing course across three sections for one week each over two semesters (49 students total). Using a mixed-methods approach that combines interaction logs, writing artifact analysis, surveys, and interviews, we examine how students used VISAR features in authentic writing tasks. Our findings confirm that students appropriated AI-supported cognitive scaffolds for writing learning and achieved measurable learning gains. While prior studies suggest that students may bypass important cognitive processes when using AI writing assistants, our classroom deployment shows that when systems provide structured supports for planning and targeted generation, students naturally choose to engage with these cognition-preserving scaffolds. These learning-oriented interaction patterns were positively associated with argumentative writing quality, improved conceptual understanding, and emerging critical AI literacy, highlighting the design value of cognition-preserving features in AI writing tools. Together, these findings provide empirical evidence of how AI-supported writing scaffolds operate in authentic classroom contexts and offer design insights for future learning-oriented AI writing tools.
Authors:Nahal Mafi, Sahar Maleki, Babak Rahimi Ardabili, Hamed Tabkhi
Abstract:
Artificial intelligence systems increasingly operate in decision-critical environments where probabilistic outputs and Human-in-the-Loop (HITL) interactions reshape user engagement. Traditional user experience (UX) frameworks, designed for deterministic systems, fail to capture these evolving sociotechnical dynamics. This paper argues that in AI-enabled HITL systems, UX must transcend frontend usability to encompass backend performance, organizational workflows, and decision making structures. We employ a mixed-methods approach, combining an inductive social construction analysis of 269 stakeholder insights with the deployment of an operational HITL video anomaly detection system. Our findings reveal that stakeholders experience AI through multifaceted themes: risk, governance, and organizational capacity. Experimental results further demonstrate how detection behavior and alert routing directly calibrate human oversight and workload. Grounded in these results, we formalize a new evaluative framework centered on four sociotechnical metrics: Accuracy (FPR/FNR), Operational Latency (response time), Adaptation Time (deployment burden), and Trust (validated automation scales). This framework redefines UX as a multi-layered construct spanning infrastructure and governance, providing a rigorous foundation for evaluating AI systems embedded within complex real-world ecosystems.
Authors:Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Alessandro Bruno
Abstract:
Understanding human visual attention is key to preserving cultural heritage We introduce SPGen a novel deep learning model to predict scanpaths the sequence of eye movementswhen viewers observe paintings. Our architecture uses a Fully Convolutional Neural Network FCNN with differentiable fixation selection and learnable Gaussian priors to simulate natural viewing biases To address the domain gap between photographs and artworks we employ unsupervised domain adaptation via a gradient reversal layer allowing the model to transfer knowledge from natural scenes to paintings Furthermore a random noise sampler models the inherent stochasticity of eyetracking data. Extensive testing shows SPGen outperforms existing methods offering a powerful tool to analyze gaze behavior and advance the preservation and appreciation of artistic treasures.
Authors:Injun Baek, Yearim Kim, Nojun Kwak
Abstract:
While advancements in Text-to-Video (T2V) generative AI offer a promising path toward democratizing content creation, current models are often optimized for visual fidelity rather than instructional efficacy. This study introduces PedaCo-Gen, a pedagogically-informed human-AI collaborative video generating system for authoring instructional videos based on Mayer's Cognitive Theory of Multimedia Learning (CTML). Moving away from traditional "one-shot" generation, PedaCo-Gen introduces an Intermediate Representation (IR) phase, enabling educators to interactively review and refine video blueprints-comprising scripts and visual descriptions-with an AI reviewer. Our study with 23 education experts demonstrates that PedaCo-Gen significantly enhances video quality across various topics and CTML principles compared to baselines. Participants perceived the AI-driven guidance not merely as a set of instructions but as a metacognitive scaffold that augmented their instructional design expertise, reporting high production efficiency (M=4.26) and guide validity (M=4.04). These findings highlight the importance of reclaiming pedagogical agency through principled co-creation, providing a foundation for future AI authoring tools that harmonize generative power with human professional expertise.
Authors:Yuanrong Tang, Huiling Peng, Bingxi Zhao, Hengyang Ding, Hanchao Song, Tianhong Wang, Chen Zhong, Jiangtao Gong
Abstract:
Human-AI collaboration faces growing challenges as AI systems increasingly outperform humans on complex tasks, while humans remain responsible for orchestration, validation, and decision oversight. To address this imbalance, we introduce Human Tool, an MCP-style interface abstraction, building on recent Model Context Protocol designs, that exposes humans as callable tools within AI-led, proactive workflows. Here, "tool" denotes a coordination abstraction, not a reduction of human authority or responsibility. Building on LLM-based agent architectures, we operationalize Human Tool by modeling human contributions through structured tool schemas of capabilities, information, and authority. These schemas enable agents to dynamically invoke human input based on relative strengths and reintegrate it through efficient, natural interaction protocols. We validate the framework through controlled studies in both decision-making and creative tasks, demonstrating improved task performance, reduced human workload, and more balanced collaboration dynamics compared to baseline systems. Finally, we discuss implications for human-centered AI design, highlighting how MCP-style human tools enable strong AI leadership while amplifying uniquely human strengths.
Authors:Micheal P. Papazoglou, Bernd J. Krämer, Mira Raheem, Amal Elgammal
Abstract:
Chronic diseases constitute the principal burden of morbidity, mortality, and healthcare costs worldwide, yet current health systems remain fragmented and predominantly reactive. Patient Medical Digital Twins (PMDTs) offer a paradigm shift: holistic, continuously updated digital counterparts of patients that integrate clinical, genomic, lifestyle, and quality-of-life data. We report early implementations of PMDTs via ontology-driven modeling and federated analytics pilots. Insights from the QUALITOP oncology study and a distributed AI platform confirm both feasibility and challenges: aligning with HL7 FHIR and OMOP standards, embedding privacy governance, scaling federated queries, and designing intuitive clinician interfaces. We also highlight technical gains, such as automated reasoning over multimodal blueprints and predictive analytics for patient outcomes. By reflecting on these experiences, we outline actionable insights for software engineers and identify opportunities, such as DSLs and model-driven engineering, to advance PMDTs toward trustworthy, adaptive chronic care ecosystems.
Authors:Yaxin Hu, Masaki Kuribayashi, Allan Wang, Seita Kayukawa, Daisuke Sato, Bilge Mutlu, Hironobu Takagi, Chieko Asakawa
Abstract:
Group interactions are essential to social functioning, yet effective engagement relies on the ability to recognize and interpret visual cues, making such engagement a significant challenge for blind people. In this paper, we investigate how a mobile robot can support group interactions for blind people. We used the scenario of a guided tour with mixed-visual groups involving blind and sighted visitors. Based on insights from an interview study with blind people (n=5) and museum experts (n=5), we designed and prototyped a robotic system that supported blind visitors to join group tours. We conducted a field study in a science museum where each blind participant (n=8) joined a group tour with one guide and two sighted participants (n=8). Findings indicated users' sense of safety from the robot's navigational support, concerns in the group participation, and preferences for obtaining environmental information. We present design implications for future robotic systems to support blind people's mixed-visual group participation.
Authors:Amber Yijia Zheng, Jae Joong Lee, Bedrich Benes, Raymond A. Yeh
Abstract:
We present a vision-language model (VLM) that automatically edits website HTML to address Web Content Accessibility Guidelines 2 (WCAG2) violations. We formulate this as a supervised image-conditioned program synthesis task, where the model learns to correct HTML given the HTML and its rendering. We collected WebAccessVL, a new dataset with manually corrected accessibility violations, establishing paired training data. We then propose a violation-conditioned VLM that additionally conditions on the WCAG2 violation count to guide the correction process. Experiments demonstrate that our method effectively reduces the average number of violations from 5.34 to 0.44 per website, outperforming commercial LLM APIs (Gemini, GPT-5). A perceptual study confirms that our edited websites maintain the original visual appearance and content.
Authors:Nikhil Sharma, Zheng Zhang, Daniel Lee, Namita Krishnan, Guang-Jie Ren, Ziang Xiao, Yunyao Li
Abstract:
High-quality feedback is essential for effective human-AI interaction. It bridges knowledge gaps, corrects digressions, and shapes system behavior; both during interaction and throughout model development. Yet despite its importance, human feedback to AI is often infrequent and low quality. This gap motivates a critical examination of human feedback during interactions with AIs. To understand and overcome the challenges preventing users from giving high-quality feedback, we conducted two studies examining feedback dynamics between humans and conversational agents (CAs). Our formative study, through the lens of Grice's maxims, identified four Feedback Barriers -- Common Ground, Verifiability, Communication, and Informativeness -- that prevent high-quality feedback by users. Building on these findings, we derive three design desiderata and show that systems incorporating scaffolds aligned with these desiderata enabled users to provide higher-quality feedback. Finally, we detail a call for action to the broader AI community for advances in Large Language Models capabilities to overcome Feedback Barriers.
Authors:Shiwei Wu, Ziyao Gao, Zhendong He, Zongtan He, Zhupeng Huang, Xia Chen, Wei Zeng, Xiaojuan Ma, Zhenhui Peng
Abstract:
Visual designers often seek inspiration from Chinese paintings when tasked with creating Chinese-style illustrations, posters, etc. Our formative study (N=10) reveals that during ideation, designers learn the cultural symbols, emotions, compositions, and styles in Chinese paintings but face challenges in searching, analyzing, and integrating these dimensions. This paper leverages multi-modal large models to annotate the value of each dimension in 16,315 Chinese paintings, built on which we propose InkIdeator, an ideation support system for Chinese-style visual designs. InkIdeator suggests cultural symbols associated with the task theme, provides dimensional keywords to help analyze Chinese paintings, and generates visual examples integrating user-selected keywords. Our within-subjects study (N=12) using a baseline system without extracted dimensional keywords, along with two extended use cases by Chinese painters, indicates InkIdeator's effectiveness in creative ideation support, helping users efficiently explore cultural dimensions in Chinese paintings and visualize their ideas. We discuss implications for supporting culture-related visual design ideation with generative AI.
Authors:Can Liu, Jaeuk Lee, Tianhe Chen, Zhibang Jiang, Xiaolin Wen, Yong Wang
Abstract:
Interactivity is crucial for effective data visualizations. However, it is often challenging to implement interactions for existing static visualizations, since the underlying code and data for existing static visualizations are often not available, and it also takes significant time and effort to enable interactions for them even if the original code and data are available. To fill this gap, we propose Athanor, a novel approach to transform existing static visualizations into interactive ones using multimodal large language models (MLLMs) and natural language instructions. Our approach introduces three key innovations: (1) an action-modification interaction design space that maps visualization interactions into user actions and corresponding adjustments, (2) a multi-agent requirement analyzer that translates natural language instructions into an actionable operational space, and (3) a visualization abstraction transformer that converts static visualizations into flexible and interactive representations regardless of their underlying implementation. Athanor allows users to effortlessly author interactions through natural language instructions, eliminating the need for programming. We conducted two case studies and in-depth interviews with target users to evaluate our approach. The results demonstrate the effectiveness and usability of our approach in allowing users to conveniently enable flexible interactions for static visualizations.
Authors:Dong Yoon Lee, Alyssa Weakley, Hui Wei, Daniel Cardona, Shijia Pan
Abstract:
To support aging-in-place, adult children often provide care to their aging parents from a distance. These informal caregivers desire plug-and-play remote care solutions for privacy-preserving continuous monitoring that enabling real-time activity monitoring and intuitive, actionable information. This short paper presents insights from three iterations of deployment experience for remote monitoring system and the iterative improvement in hardware, modeling, and user interface guided by the Geriatric 4Ms framework (matters most, mentation, mobility, and medication). An LLM-assisted solution is developed to balance user experience (privacy-preserving, plug-and-play) and system performance.
Authors:Ziyi Wang, Yilong Dai, Duanya Lyu, Mateo Nader, Sihan Chen, Wanghao Ye, Zjian Ding, Xiang Yan
Abstract:
Designing inclusive cycling infrastructure requires balancing competing needs of diverse user groups, yet designers often struggle to anticipate how different cyclists experience the same street. We investigate how persona-based multi-agent evaluation can support inclusive design by making experiential conflicts explicit. We present StreetDesignAI, an interactive system that enables designers to (1) ground evaluation in street context through imagery and map data, (2) receive parallel feedback from cyclist personas spanning confident to cautious users, and (3) iteratively modify designs while surfacing conflicts across perspectives. A within-subjects study with 26 transportation professionals demonstrates that structured multi-perspective feedback significantly improves designers' understanding of diverse user perspectives, ability to identify persona needs, and confidence in translating them into design decisions, with higher satisfaction and stronger intention for professional adoption. Qualitative findings reveal how conflict surfacing transforms design exploration from single-perspective optimization toward deliberate trade-off reasoning. We discuss implications for AI tools that scaffold inclusive design through disagreement as an interaction primitive.
Authors:Ligao Ruan, Giles Hamilton-Fletcher, Mahya Beheshti, Todd E Hudson, Maurizio Porfiri, John-Ross Rizzo
Abstract:
Shopping is a routine activity for sighted individuals, yet for people who are blind or have low vision (pBLV), locating and retrieving products in physical environments remains a challenge. This paper presents a multimodal wearable assistive system that integrates object detection with vision-language models to support independent product or item retrieval, with the goal of enhancing users'autonomy and sense of agency. The system operates through three phases: product search, which identifies target products using YOLO-World detection combined with embedding similarity and color histogram matching; product navigation, which provides spatialized sonification and VLM-generated verbal descriptions to guide users toward the target; and product correction, which verifies whether the user has reached the correct product and provides corrective feedback when necessary. Technical evaluation demonstrated promising performance across all modules, with product detection achieving near-perfect accuracy at close range and high accuracy when facing shelves within 1.5 m. VLM-based navigation achieved up to 94.4% accuracy, and correction accuracy exceeded 86% under optimal model configurations. These results demonstrate the system's potential to address the last-meter problem in assistive shopping. Future work will focus on user studies with pBLV participants and integration with multi-scale navigation ecosystems.
Authors:Yue Deng, Xiaowei Chen, Junxiang Liao, Bo Li, Yixin Zou
Abstract:
Online fraud is a critical global threat that disproportionately targets older adults. Prior anti-fraud education for older adults has largely relied on static, traditional instruction that limits engagement and real-world transfer, whereas role-based simulation offers realistic yet low-risk opportunities for practice. Moreover, most interventions situate learners as victims, overlooking that fraud encounters often involve multiple roles, such as bystanders who witness scams and helpers who support victims. To address this gap, we developed ROLESafe, an anti-fraud educational intervention in which older adults learn through different learning roles, including Experiencer (experiencing fraud), Helper (assisting a victim), and Observer (witnessing fraud). In a between-subjects study with 144 older adults in China, we found that the Experiencer and Helper roles significantly improved participants' ability to identify online fraud. These findings highlight the promise of role-based, multi-perspective simulations for enhancing fraud awareness among older adults and provide design implications for future anti-fraud education.
Authors:Yue Deng, Changyang He, Bo Li, Yixin Zou
Abstract:
Biometric payment, i.e., biometric authentication implemented in digital payment systems, can reduce memory demands and streamline payment for older adults. However, older adults' perceptions and practices regarding biometric payment remain underexplored. We conducted semi-structured interviews with 22 Chinese older adults, including both users and non-users. Participants were motivated to use biometric payment due to convenience and perceived security. However, they also worried about loss of control due to its password-free nature and expressed concerns about biometric data security. Participants also identified desired features for biometric payment, such as lightweight and context-aware cognitive confirmation mechanisms to enhance user control. Based on these findings, we outline recommendations for more controllable and informative digital financial services that better support older adults.
Authors:Jordan Taylor, William Agnew, Maarten Sap, Sarah E. Fox, Haiyi Zhu
Abstract:
Visual generative AI models are trained using a one-size-fits-all measure of aesthetic appeal. However, what is deemed "aesthetic" is inextricably linked to personal taste and cultural values, raising the question of whose taste is represented in visual generative AI models. In this work, we study an aesthetic evaluation model--LAION Aesthetic Predictor (LAP)--that is widely used to curate datasets to train visual generative image models, like Stable Diffusion, and evaluate the quality of AI-generated images. To understand what LAP measures, we audited the model across three datasets. First, we examined the impact of aesthetic filtering on the LAION-Aesthetics Dataset (approximately 1.2B images), which was curated from LAION-5B using LAP. We find that the LAP disproportionally filters in images with captions mentioning women, while filtering out images with captions mentioning men or LGBTQ+ people. Then, we used LAP to score approximately 330k images across two art datasets, finding the model rates realistic images of landscapes, cityscapes, and portraits from western and Japanese artists most highly. In doing so, the algorithmic gaze of this aesthetic evaluation model reinforces the imperial and male gazes found within western art history. In order to understand where these biases may have originated, we performed a digital ethnography of public materials related to the creation of LAP. We find that the development of LAP reflects the biases we found in our audits, such as the aesthetic scores used to train LAP primarily coming from English-speaking photographers and western AI-enthusiasts. In response, we discuss how aesthetic evaluation can perpetuate representational harms and call on AI developers to shift away from prescriptive measures of "aesthetics" toward more pluralistic evaluation.
Authors:Shuyu Zhang, Yujie Liu, Xinru Wang, Cheng Zhang, Yanmin Zhu, Bin Li
Abstract:
Traditional task-oriented dialog systems are unable to evolve from ongoing interactions or adapt to new domains after deployment, that is a critical limitation in real-world dynamic environments. Continual learning approaches depend on episodic retraining with human curated data, failing to achieve autonomy lifelong improvement. While evolutionary computation and LLM driven self improvement offer promising mechanisms for dialog optimization, they lack a unified framework for holistic, iterative strategy refinement. To bridge this gap, we propose DarwinTOD, a lifelong self evolving dialog framework that systematically integrates these two paradigms, enabling continuous strategy optimization from a zero-shot base without task specific fine-tuning. DarwinTOD maintains an Evolvable Strategy Bank and operates through a dual-loop process: online multi-agent dialog execution with peer critique, and offline structured evolutionary operations that refine the strategy bank using accumulated feedback. This closed-loop design enables autonomous continuous improvement without human intervention. Extensive experiments show that DarwinTOD surpasses previous state-of-the-art methods and exhibits continuous performance gains throughout evolution. Our work provides a novel framework for building dialog systems with lifelong self evolution capabilities.
Authors:Xiang Zhang, Huan Yan, Jinyang Huang, Bin Liu, Yuanhao Feng, Jianchun Liu, Meng Li, Fusang Zhang, Zhi Liu
Abstract:
In this paper, we propose GesFi, a novel WiFi-based gesture recognition system that introduces WiFi latent domain mining to redefine domains directly from the data itself. GesFi first processes raw sensing data collected from WiFi receivers using CSI-ratio denoising, Short-Time Fast Fourier Transform, and visualization techniques to generate standardized input representations. It then employs class-wise adversarial learning to suppress gesture semantic and leverages unsupervised clustering to automatically uncover latent domain factors responsible for distributional shifts. These latent domains are then aligned through adversarial learning to support robust cross-domain generalization. Finally, the system is applied to the target environment for robust gesture inference. We deployed GesFi under both single-pair and multi-pair settings using commodity WiFi transceivers, and evaluated it across multiple public datasets and real-world environments. Compared to state-of-the-art baselines, GesFi achieves up to 78% and 50% performance improvements over existing adversarial methods, and consistently outperforms prior generalization approaches across most cross-domain tasks.
Authors:Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
Abstract:
Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users' perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
Authors:Avinash Baidya, Xinran Liang, Ruocheng Guo, Xiang Gao, Kamalika Das
Abstract:
Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with $α$-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6\% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10\% over naive prefix supervision, and that the full system improves frontier quality by 3-42\% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.
Authors:Yujin Park, Haejun Chung, Ikbeom Jang
Abstract:
Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception--distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision--language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation. Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.
Authors:Zahra Hassanzadeh, Anne Hsu, Rachel Kornfield, David Haag, Ananya Bhattacharjee, Jay Olson, Jan David Smeddinck, Norman Farb, Alex Mariakakis, Lydia Chilton, Joseph Jay Williams
Abstract:
This paper explores the design space for one-minute digital interventions that prompt immediate action without onboarding or sensing. By embracing Fogg's Behavior Model and four design principles informed by literature, the goal of these interventions was to provide triggers that encourage actions so simple that even people with low motivation would be willing to complete them. We examined the utility of these prompts by conducting a 14-day study with 22 participants interested in making small lifestyle improvements in at least one of three domains: physical activity, healthy eating, and mental well-being. When combined with insights drawn from participants' rewrites of our prompts, our findings suggest that intentional personalization through co-authorship could be a lightweight personalization mechanism that balances relevance with low friction.
Authors:Tatiana Chakravorti, Robert Fraleigh, Timothy Fritton, Christopher Griffin, Vaibhav Singh, Sai Koneru, C. Lee Giles, David Pennock, Anthony Kwasnica, Sarah Rajtmajer
Abstract:
Determining whether published scientific findings can successfully be replicated is a long-standing challenge in the empirical sciences. Existing approaches for replicability assessment typically rely either on human judgment, i.e., creative assembly of human experts, or on machine learning models trained on paper content metadata. While both approaches have demonstrated value, each also has important limitations. Human forecasts can be influenced by cognitive biases and narrow exposure to the research literature, while automated assessments often struggle to capture contextual cues and subtle signals of credibility. In this paper, we examine a hybrid approach. Specifically, we introduce a hybrid prediction market in which algorithmic agents trade alongside human participants to jointly estimate the likelihood that a published scientific finding will be corroborated via the outcome of a controlled replication study. Agents are trained on outcomes from hundreds of prior replication studies while human participants contribute domain knowledge through real-time trading. We evaluate this hybrid approach through multiple live experiments involving participants from different academic disciplines and compare its performance to artificial-only and human-only baselines. Our results show that, except for a few cases, hybrid markets match or outperform artificial prediction markets, producing more accurate and reliable replication forecasts.
Authors:Sven Kruschel, Julian Rosenberger, Lasse Bohlen, Mathias Kraus, Patrick Zschech
Abstract:
Explainable AI (XAI) techniques aim to provide insights into predictive models and enhance user performance, yet they often fall short of these expectations. Conversational XAI assistants promise to overcome such limitations, but empirical evidence on their impact on objective performance measures remains limited. We propose an experimental design for evaluating explanation assistance through prediction accuracy, model understanding, and error identification. Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against Q&A-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall. These findings inform refinements for our planned full study, including enhanced engagement interventions and investigation of the mechanisms driving improved predictions.
Authors:Mengke Wu, Mike Yao
Abstract:
As AI systems take on greater autonomy, a quiet anxiety has settled over the HCI community: human agency is eroding. Users no longer control execution, interfaces recede, and machines decide. We argue that this anxiety, while understandable, reflects a framing problem rather than an empirical finding. Agency has not diminished but has relocated. As interaction has shifted from command- and feature-based paradigms toward conversational, generative, and agentic AI, human agency migrates from interface affordances to interaction itself: articulating goals, evaluating outputs, and negotiating outcomes. To make this relocation visible, we revisit control as a diagnostic lens, distinguish process control and outcome control, and map different systems across this space to show that what looks like agency's disappearance is actually its redistribution. We take seriously the objection that outcome-based agency may be illusory in systems that produce plausible but unverifiable outputs, and argue that this concern reveals what agency in human-AI interaction truly requires. This paper invites the CUI community to reconsider what agency means, where it lives, and what it demands, including who gets to have it and who holds responsibility when it fails, before the consequences become impossible to overlook.
Authors:Tamunotonye Harry, Johanna Hidalgo, Matthew Price, Yuanyuan Feng, Kathryn Stanton, Connie Tompkins, Peter Sheridan Dodds, Mikaela Irene Fudolig, Laura Bloomfield, Christopher Danforth
Abstract:
Wearable devices capture physiological and behavioral data with increasing fidelity, but the psychological context shaping these outcomes is difficult to recover from sensor data alone, limiting passive sensing utility for digital health. We examined whether ultra-brief naturalistic concern text could serve as a scalable complement to passive sensing. In a year-long study of 458 university students (3,610 person-waves) tracked with Oura rings, participants responded bimonthly to an open-ended prompt about what concerned them most; responses had a median length of three words. We compared dictionary-based, general pretrained, and domain-adapted NLP approaches using within-person mixed-effects models across nine sleep and physical activity outcomes. Weeks dominated by academic concern framing were associated with lower physical activity; weeks characterized by emotional exhaustion language were associated with poorer sleep quality and lower heart rate variability. General pretrained embeddings outperformed domain-adapted models for most outcomes, with domain adaptation showing relative advantage for autonomic outcomes. Zero-shot classification of concern topics produced no significant associations, while affective dimensions across all three methods were consistently associated with outcomes, indicating emotional register rather than topical content carries the signal. These findings offer design guidance: ultra-brief affective prompts enrich the psychological interpretability of passive physiological data at minimal burden.
Authors:William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Storgaard Mosquera, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Sándor Beniczky, Sadasivan Puthusserypady, James Zou, Lars Kai Hansen
Abstract:
EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers: SleepFM, REVE, and LaBraM to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $α$-band restoration.
Authors:Vardhan Palod, Upasana Biswas, Subbarao Kambhampati
Abstract:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whether to trust the model's answer, aided by reasoning traces, their summaries, or post-hoc generated explanations. These reasoning traces, despite evidence that they are neither faithful representations of the model's computations nor necessarily semantically meaningful, are often interpreted as provenance explanations. It is unclear whether explanations or reasoning traces help users identify when the AI is incorrect, or whether they simply persuade users to trust the AI regardless. In this paper, we take a user-centered approach and develop an evaluation protocol to study how different explanation types affect users' ability to judge the correctness of AI-generated answers and engender false trust in the users. We conduct a between-subject user study, simulating a setting where users do not have the means to verify the solution and analyze the false trust engendered by commonly used LLM explanations - reasoning traces, their summaries and post-hoc explanations. We also test a contrastive dual explanation setting where we present arguments for and against the AI's answer. We find that reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, dual explanation is the only condition that genuinely improves users' ability to distinguish correct from incorrect AI outputs.
Authors:Robert Wolfe, Aayushi Dangol
Abstract:
Demand for expert-annotated data on the part of leading AI labs has created an expert gig economy with the potential to reshape white collar work and society's understanding of expertise. In this research, we study the vision for the future of expertise described in the public communication of five industry data annotation organizations and their CEOs, as reflected on social media feeds and public appearances on podcasts. We find that the industry envisions AI expertise as cheap, meaning that it can offer a better return on investment than human expertise. Human expertise, meanwhile, is viewed as an extractable resource, the value of which can be judged relative to AI expertise. Finally, institutional expertise (such as that created or possessed by universities and corporations) is viewed as in need of liberation or reform, such that it can be incorporated into the latest artificial intelligence systems. Our findings have implications for human experts, whose professional lives may be transformed and revalued by this industry, as well as for societal institutions that mediate expertise. We close this work with a series of provocations intended to elicit consideration of how society can best approach an AI-driven expert gig economy and the cheap expertise it intends to produce.
Authors:Anqi Wang, Dongyijie Pan, Xin Tong, Pan Hui
Abstract:
Although Large Language Models (LLMs) demonstrate proficiency in knowledge-intensive tasks, current interfaces frequently precipitate cognitive misalignment by failing to externalize users' underlying reasoning structures. Existing tools typically represent intent as "flat lists," thereby disregarding the causal dependencies and revisable assumptions inherent in human decision-making. We introduce CogInstrument, a system that represents user reasoning through cognitive motifs-compositional, revisable units comprising concepts linked by causal dependencies. CogInstrument extracts these motifs from natural language interactions and renders them as editable graphical structures to facilitate bidirectional alignment. This structural externalization enables both the user and the LLM to inspect, negotiate, and reconcile reasoning processes iteratively. A within-subjects study (N=12) demonstrates that CogInstrument explicitly surfaces implicit reasoning structures, facilitating more targeted revision and reusability over conventional LLM-based dialogue interfaces. By enabling users to verify the logical grounding of LLM outputs, CogInstrument significantly enhances user agency, trust, and structural control over the collaboration. This work formalizes cognitive motifs as a fundamental unit for human-LLM alignment, providing a novel framework for achieving structured, reasoning-based human-AI collaboration.
Authors:Anqi Wang, Bingqian Wang, Huiyang Chen, Keqing Jiao, Lei Han, Xin Tong, Pan Hui
Abstract:
Large Language Models (LLMs) offer vast potential for creative ideation; however, their standard interaction paradigm often produces unstructured textual outputs that lead users to prematurely converge on sub-optimal ideas-a phenomenon known as fixation. While recent creativity tools have begun to structure these outputs, they remain compositionally opaque: ideas are organized as monolithic units that cannot be decomposed, abstracted, or recombinable at a sub-idea level. To address this, we propose Cognitive Abstraction (CA), a computational pipeline that transforms raw LLM-generated inspiration into a navigable and transformable design space. We implement this pipeline in NexusAI, a prototype diagramming system that supports (I) decomposition of inspiration into typed functional fragments, (II) multi-level abstraction to externalize mental scaling, and (III) cross-dimensional recombination to spark novel design directions. A within-subject user study (N=14) demonstrates that NexusAI significantly improves design space exploration, reduces cognitive overhead, and facilitates perspective reframing compared to a baseline. Our work contributes: (1) a characterization of "compositional opacity" as a barrier in human-AI co-creation; (2) the CA pipeline for operationalizing creative cognitive primitives at scale; and (3) empirical evidence that structured, multi-level representations can effectively mitigate fixation and support divergent exploration.
Authors:Jae Young Choi, Seon Gyeom Kim, Hyungjun Yoon, Taeckyung Lee, Donggun Lee, Jaeryung Chung, Jihyung Kil, Ryan Rossi, Sung-Ju Lee, Tak Yeon Lee
Abstract:
Large Language Models (LLMs) have emerged as foundation models for IoT applications such as human activity recognition (HAR). However, directly applying high-frequency and multi-dimensional sensor data, such as eye-tracking data, leads to information loss and high token costs. To mitigate this, we investigate a visual prompting strategy that transforms sensor signals into data visualization images as an input to multimodal LLMs (MLLMs) using eye-tracking data. We conducted a systematic evaluation of MLLM-based HAR across three public eye-tracking datasets using three visualization types of timeline, heatmap, and scanpath, under varying temporal window sizes. Our findings suggest that visual prompting provides a token-efficient and scalable representation for eye-tracking data, highlighting its potential to enable MLLMs to effectively reason over high-frequency sensor signals in IoT contexts.
Authors:Xinyu Wang, Sai Koneru, Wenbo Zhang, Wenliang Zheng, Saksham Ranjan, Sarah Rajtmajer
Abstract:
Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.
Authors:Hengzhi Ye, Minghui Zhou
Abstract:
AI development is embracing open-source paradigm, but the fundamental distinction between AI models and traditional software artifacts may lead to a divergent open-source development paradigm with different collaborative practices, which remains unexplored. We therefore bridge the knowledge gap by quantifying and characterizing the differences in the collaborative development paradigms of traditional open source software (OSS) and open source AI models (OSM), and investigating the underlying factors that may drive these distinctions. We collect 1,428,792 OSS repositories from GitHub and 1,440,527 OSM repositories from HF Hub, and conduct comprehensive statistical, social network and content analyses to measure and understand the differences in collaboration intensity, collaboration openness, and user innovation across the two development paradigms, complementing these quantitative results with semi-structured interviews. In consequence, we find that compared to OSS development paradigm, the OSM development paradigm exhibits significantly lower collaboration intensity; lower collaboration openness regarding direct contribution while persisting relatively open knowledge exchange; and a divergence toward adaptive utilization user-innovation rather than collaborative improvement. Through semi-structured interviews, we further elucidate the socio-technical factors underlying these differences. These findings reveal the paradigmatic divergence in open source development between traditional OSS and OSM across three critical dimensions of open source collaboration and potential underlying factors, shedding light on how to improve collaborative work techniques and practices within the context of AI development.
Authors:Zhiyuan Wang, Erzhen Hu, Mark Rucker, Laura E. Barnes
Abstract:
Personal AI tools can now be generated from natural-language requests, but they often remain isolated after creation. We present PSI, a shared-state architecture that turns independently generated modules into coherent instruments: persistent, connected, and chat-complementary artifacts accessible through both GUIs and a generic chat agent. By publishing current state and write-back affordances to a shared personal-context bus, modules enable cross-module reasoning and synchronized actions across interfaces. We study PSI through a three-week autobiographical deployment in a self-developed personal AI environment and show that later-generated instruments can be integrated automatically through the same contract. PSI identifies shared state as the missing systems layer that transforms AI-generated personal software from isolated apps into coherent personal computing environments.
Authors:Christine Kwon, Phenyo Phemelo Moletsane, Michael W. Asher, Dieyu Ouyang, Lingkan Wang, Debbie Eleene Conejo, John Stamper, Paulo F. Carvalho, Amy Ogan
Abstract:
The benefits of learning in one's mother tongue are well documented, yet colonial languages dominate education, marginalizing local languages and limiting access for learners who rely on their mother tongue for understanding. With the rapid growth of educational technology, there is potential to integrate multilingual instruction supporting both colonial and local languages. This study is part of a larger quasi-experiment conducted in Uganda, where learners could choose to learn in English, Leb-Lango (a local language), or in Hybrid mode (a combination of both) in a remote EdTech course. We examined how learners who chose the Hybrid option navigated English and Leb-Lango. While many Hybrid learners did not consistently use both languages, those who did persisted longer in the course. Learners also shared how they managed language complexities. We provide the first empirical evidence of learner agency in bilingual remote EdTech instruction and offer insights for designing inclusive multilingual learning solutions.
Authors:Md Dilshadur Rahman, Devin Lange, Ghulam Jilani Quadri, Paul Rosen
Abstract:
Annotation is a central mechanism in visualization design that enables people to communicate key insights. Prior research has provided essential accounts of the visual forms annotations take, but less attention has been paid to the decisions behind them. This paper examines how annotations are designed in practice and how educators reflect on those practices. We conducted a two-phase qualitative study: interviews with ten practitioners from diverse backgrounds revealed the heuristics they draw on when creating annotations, and interviews with seven visualization educators offered complementary perspectives situated within broader concerns of clarity, guidance, and viewer agency. These studies provide a systematic account of annotation design knowledge in professional settings, highlighting the considerations, trade-offs, and contextual judgments that shape the use of annotations. By making this tacit expertise explicit, our work complements prior form-focused studies, strengthens understanding of annotation as a design activity, and points to opportunities for improved tool and guideline support.
Authors:Aayushi Dangol, Meghna Gupta, Daeun Yoo, Robert Wolfe, Jason Yip, Franziska Roesner, Julie A. Kientz
Abstract:
Generative AI (genAI) is increasingly being integrated into children's everyday lives, not only through screens but also through so-called "screen-free" AI toys. These toys can simulate emotions, personalize responses, and recall prior interactions, creating the illusion of an ongoing social connection. Such capabilities raise important questions about how children understand boundaries, agency, and relationships when interacting with AI toys. To investigate this, we conducted two participatory design sessions with eight children ages 6-11 where they engaged with three different AI toys, shifting between play, experimentation, and reflection. Our findings reveal that children approached AI toys with genuine curiosity, profiling them as social beings. However, frequent interaction breakdowns and mismatches between apparent intelligence and toy-like form disrupted expectations around play and led to adversarial play. We conclude with implications and design provocations to navigate children's encounters with AI toys in more transparent, developmentally appropriate, and responsible ways.
Authors:Karan Taneja, Anjali Singh, Ashok K. Goel
Abstract:
Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually-rich STEM domains remains under-explored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, while TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of the Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.
Authors:Xinyan Yu, Marius Hoggenmueller, Xin Lu, Ozan Balci, Martin Tomitsch, Andrew Vande Moere, Alex Binh Vinh Duc Nguyen
Abstract:
Urban HCI investigates how digital technologies shape human behaviour within the social, spatial, temporal dynamics of public space. Meanwhile, robotic furniture research demonstrates how the purposeful animation of mundane utilitarian elements can influence human behaviour in everyday contexts. Taken together, these strands highlight an untapped opportunity to investigate how animated public furniture could mediate social interaction in urban environments. In this paper, we present the design process and in-the-wild study of mobile robotic benches that reconfigure with a semi-outdoor public space. Our findings show that the gestural performance of the benches manifested three affordances perceived by passersby, they activated engagement as robots, redistributed engagement as spatial elements, and settled engagement as infrastructure. We proposed an Affordance Transition Model (ATM) describing how robotic furniture could proactively facilitate transition between these affordances to engage passersby. Our study bridges robotic furniture and urban HCI to activate human experience with the built environment purposefully.
Authors:Aayushi Dangol, Robert Wolfe, Nisha Devasia, Mitsuka Kiyohara, Jason Yip, Julie A. Kientz
Abstract:
Two of the most socially consequential issues facing today's children are the rise of artificial intelligence (AI) and the rapid changes to the earth's climate. Both issues are complex and contested, and they are linked through the notable environmental costs of AI use. Using a systems thinking framework, we developed an interactive system called Ecoprompt to help children reason about the environmental impact of AI. EcoPrompt combines a prompt-level environmental footprint calculator with a simulation game that challenges players to reason about the impact of AI use on natural resources that the player manages. We evaluated the system through two participatory design sessions with 16 children ages 6-12. Our findings surfaced children's perspectives on societal and environmental tradeoffs of AI use, as well as their sense of agency and responsibility. Taken together, these findings suggest opportunities for broadening AI literacy to include systems-level reasoning about AI's environmental impact.
Authors:Tom Bullock, Emily Machniak, You-Jin Kim, Radha Kumaran, Justin Kasowski, Apurv Varshney, Julia Ram, Melissa M. Hernandez, Stina Johansson, Neil M. Dundon, Tobias Höllerer, Barry Giesbrecht
Abstract:
Tracking moving objects is a critical skill for many everyday tasks, such as crossing a busy street, driving a car or catching a ball. Attention is a key cognitive function that supports object tracking; however, our understanding of the brain mechanisms that support attention is almost exclusively based on evidence from tasks that present stable objects at fixed locations. Accounts of multiple object tracking are also limited because they are largely based on behavioral data alone and involve tracking objects in a 2D plane. Consequently, the neural mechanisms that enable moment-by-moment tracking of goal-relevant objects remain poorly understood. To address this knowledge gap, we developed SABER (Spatial Attention, Brain, Extended Reality), a new framework for studying the behavioral and neural dynamics of attention to objects moving in 3D. Participants (n=32) completed variants of a task inspired by the popular virtual reality (VR) game, Beat Saber, where they used virtual sabers to strike stationary and moving color-defined target spheres while we recorded electroencephalography (EEG). We first established that standard univariate EEG metrics which are typically used to study spatial attention to static objects presented on 2D screens, can generalize effectively to an immersive VR context involving both static and dynamic 3D stimuli. We then used a computational modeling approach to reconstruct moment-by-moment attention to the locations of stationary and moving objects from oscillatory brain activity, demonstrating the feasibility of precisely tracking attention in a 3D space. These results validate SABER, and provide a foundation for future research that is critical not only for understanding how attention works in the physical world, but is also directly relevant to the development of better VR applications.
Authors:Yujin Park, Haejun Chung, Ikbeom Jang
Abstract:
Pairwise comparison labeling is emerging as it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparisons require quadratic cost. We propose Dodgersort, which leverages CLIP-based hierarchical pre-ordering, a neural ranking head and probabilistic ensemble (Elo, BTL, GP), epistemic--aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces human comparisons while improving the reliability of the rankings. In visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves a 11--16\% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. In FG-NET with ground-truth ages, the framework extracts 5--20$\times$ more ranking information per comparison than baselines, yielding Pareto-optimal accuracy--efficiency trade-offs.
Authors:Anjali Singh, Karan Taneja, Zhitong Guan, Soo Young Rieh
Abstract:
Generative AI (GenAI) search tools are increasingly used for information seeking, yet their design tends to encourage cognitive offloading, which may lead to passive engagement, selective attention, and informational homogenization. Effective use requires metacognitive engagement to craft good prompts, verify AI outputs, and critically engage with information. We developed MetaCues, a novel GenAI-based interactive tool for information seeking that delivers metacognitive cues alongside AI responses and a note-taking interface to guide users' search and associated learning. Through an online study (N = 146), we compared MetaCues to a baseline tool without cues, across two broad search topics that required participants to explore diverse perspectives in order to make informed judgments. Preliminary findings regarding participants' search behavior show that MetaCues leads to increased confidence in attitudinal judgments about the search topic as well as broader inquiry, with the latter effect emerging primarily for the topic that was less controversial and with which participants had relatively less familiarity. Accordingly, we outline directions for future qualitative exploration of search interactions and inquiry patterns.
Authors:Victor Nikhil Antony, Zhili Gong, Yoonjae Kim, Chien-Ming Huang
Abstract:
We present M, an open-source, low-cost social robot platform designed to reduce platform friction that slows social robotics research by making robots easier to reproduce, modify, and deploy in real-world settings. M combines a modular mechanical design, multimodal sensing, and expressive yet mechanically simple actuation architecture with a ROS2-native software package that cleanly separates perception, expression control, and data management. The platform includes a simulation environment with interface equivalence to hardware to support rapid sim-to-real transfer of interaction behaviors. We demonstrate extensibility through additional sensing/actuation modules and provide example interaction templates for storytelling and two-way conversational coaching. Finally, we report real-world use in participatory design and week-long in-home deployments, showing how M can serve as a practical foundation for longitudinal, reproducible social robotics research.
Authors:Christian Di Maio, Tommaso Guidi, Luigi Quarantiello, Jack Bell, Marco Gori, Stefano Melacci, Vincenzo Lomonaco
Abstract:
In this paper, we report our experience with ``TuringHotel'', a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where both human and artificial agents engage in time-bounded discussions and, interestingly, are both judges and respondents. This community is instantiated in the novel platform UNaIVERSE (https://unaiverse.io), creating a ``World'' which defines the roles and interaction dynamics, facilitated by the platform's built-in programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third parties can access the exchange. The platform also provides a unified interface for humans, accessible via both mobile devices and laptops, that was a key component of the experience in this paper. Results of our experimentation involving 17 human participants and 19 LLMs revealed that current models are still sometimes confused as humans. Interestingly, there are several unexpected mistakes, suggesting that human fingerprints are still identifiable but not fully unambiguous, despite the high-quality language skills of artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest to support ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.
Authors:Victor Nikhil Antony, Shiye Cao, Shuning Wang, Chien-Ming Huang
Abstract:
Early language development shapes children's later literacy and learning, yet many families have limited access to scalable, high-quality support at home. Recent advances in generative AI make it possible for social robots to move beyond scripted interactions and engage children in adaptive, conversational activities, but it remains unclear how to design such systems for pre-schoolers and how children engage with them over time in the home. We present ELLA (Early Language Learning Agent), an autonomous, generative AI-powered social robot that supports early language development through interactive storytelling, parent-selected language targets, and scaffolded dialogue. Using a multi-phased, human-centered process, we interviewed parents (n=7) and educators (n=5) and iteratively refined ELLA through twelve in-home design workshops. We then deployed ELLA with ten children for eight days. We report design insights from in-home workshops, characterize children's engagement and behaviors during deployment, and distill design implications for generative AI-powered social robots supporting early language learning at home.
Authors:Nathaniel Dennler, Zhonghao Shi, Yiran Tao, Andreea Bobu, Stefanos Nikolaidis, Maja Matarić
Abstract:
Robots that interact with humans must adapt to individual users' preferences to operate effectively in human-centered environments. An intuitive and effective technique to learn non-expert users' preferences is through rankings of robot behaviors, e.g., trajectories, gestures, or voices. Existing techniques primarily focus on generating queries that optimize preference learning outcomes, such as sample efficiency or final preference estimation accuracy. However, the focus on outcome overlooks key user expectations in the process of providing these rankings, which can negatively impact users' adoption of robotic systems. This work proposes the Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG) algorithm. CMA-ES-IG explicitly incorporates user experience considerations into the preference learning process by suggesting perceptually distinct and informative trajectories for users to rank. We demonstrate these benefits through both simulated studies and real-robot experiments. CMA-ES-IG, compared to state-of-the-art alternatives, (1) scales more effectively to higher-dimensional preference spaces, (2) maintains computational tractability for high-dimensional problems, (3) is robust to noisy or inconsistent user feedback, and (4) is preferred by non-expert users in identifying their preferred robot behaviors. This project's code is available at github.com/interaction-lab/CMA-ES-IG
Authors:Chu Li, Rock Yuren Pang, Arnavi Chheda-Kothary, Ather Sharif, Henok Assalif, Jeffrey Heer, Jon E. Froehlich
Abstract:
Geovisualizations are powerful tools for communicating spatial information, but are inaccessible to screen-reader users. To address this limitation, we present GeoVisA11y, an LLM-based question-answering system that makes geovisualizations accessible through natural language interaction. The system supports map reading, analysis, interpretation and navigation by handling analytical, geospatial, visual and contextual queries. Through user studies with 12 screen-reader users and sighted participants, we demonstrate that GeoVisA11y effectively bridges accessibility gaps while revealing distinct interaction patterns between user groups. We contribute: (1) an open-source, accessible geovisualization system, (2) empirical findings on query and navigation differences, and (3) a dataset of geospatial queries to inform future research on accessible data visualization.
Authors:Shuo Niu, Yao Lyu, He Zhang, Na Li, Bumjin Kim, Jie Cai
Abstract:
Generative Artificial Intelligence (GenAI) is reshaping creative labor by enabling the rapid production of text, images, and videos. On YouTube, creators are developing new ways to leverage these tools and share knowledge about how to pursue income through such strategies. However, little is known about what GenAI knowledge has been collectively constructed around monetizing GenAI as a community practice of acting both with and against algorithmically mediated platforms. We analyze 377 YouTube videos in which creators publicly promote workflows, revenue claims, and monetization strategies for GenAI-enabled content. Our analysis identifies ten shared use cases that frame AI-supported income opportunities, and examines how this GenAI knowledge repository embodies a collective effort to leverage platform infrastructures for monetization -- including advertising, direct sales, affiliate marketing, and revenue-sharing models. We further surface structural tensions in AI-mediated creative labor, including unverifiable income claims, content misappropriation, synthetic engagement practices, and shifting authorship norms. We conceptualize creators' collective understanding and adoption of GenAI in the context of monetizing creative labor, with implications for the design of creator-centered GenAI technologies and responsible platform policy.
Authors:Dominik P. Hofer, Haochen Song, Rania Islambouli, Laura Hawkins, Ananya Bhattacharjee, Meredith Franklin, Joseph Jay Williams, Jan D. Smeddinck
Abstract:
Behaviour Change Techniques (BCTs) are central to digital health interventions, yet selecting and delivering effective techniques remains challenging. Contextual bandits enable statistically grounded optimisation of BCT selection, while Large Language Models (LLMs) offer flexible, context-sensitive message generation. We conducted a 4-week study on physical activity motivation (N=54; 9 post-study interviews) that compared five daily messaging approaches: random templates, contextual bandit with templates, LLM generation, hybrid bandit+LLM, and LLM with interaction history. LLM-based approaches were rated substantially more helpful than templates, but no significant differences emerged among LLM conditions. Unexpectedly, bandit optimisation for BCTs selection yielded no additional perceived helpfulness compared with LLM-only approaches. Unconstrained LLMs focused heavily on a single BCT, whereas bandit systems enforced systematic exploration-exploitation across techniques. Quantitative and qualitative findings suggest contextual acknowledgement of user input drove perceived helpfulness. We contribute design suggestions for reflective AI health behaviour change systems that address a trade-off between structured exploration and generative autonomy.
Authors:Jamie Lee, Kyuha Jung, Cecilia Lee, Lauren MacDonnell, Jessica Kim, Daniel Otterson, Erin Newman, Emilie Chow, Yunan Chen
Abstract:
While prior research has focused on providers, caregivers, and adult patients, little is known about adolescents' perceptions of AI in health learning and management. Utilizing design fiction and co-design methods, we conducted seven workshops with 23 adolescents (aged 14-17) to understand how they anticipate using health AI in the context of a family celiac diagnosis. Our findings reveal that adolescents have four main envisioned roles of health AI: enhancing health understanding and help-seeking, reducing cognitive burden, supporting family health management, and providing guidance while respecting their autonomy. We also identified nuanced trust and a divided view on emotional support from health AI. These findings suggest that adolescents perceive AI's value as a tool that moves them from efficiency to meaning-one that creates time for valued activities. We discuss opportunities for future health AI systems to be designed to encourage adolescent autonomy and reflection, while also supporting meaningful, dialectical activities.
Authors:Yuepeng Chen, Kaili Zheng, Ji Wu, Zhuangzhuang Li, Ye Ma, Dongwei Liu, Chenyi Guo, Xiangling Fu
Abstract:
Surface electromyography (sEMG) signals exhibit substantial inter-subject variability and are highly susceptible to noise, posing challenges for robust and interpretable decoding. To address these limitations, we propose a discrete representation of sEMG signals based on a physiology-informed tokenization framework. The method employs a sliding window aligned with the minimal muscle contraction cycle to isolate individual muscle activation events. From each window, ten time-frequency features, including root mean square (RMS) and median frequency (MDF), are extracted, and K-means clustering is applied to group segments into representative muscle-state tokens. We also introduce a large-scale benchmark dataset, ActionEMG-43, comprising 43 diverse actions and sEMG recordings from 16 major muscle groups across the body. Based on this dataset, we conduct extensive evaluations to assess the inter-subject consistency, representation capacity, and interpretability of the proposed sEMG tokens. Our results show that the token representation exhibits high inter-subject consistency (Cohen's Kappa = 0.82+-0.09), indicating that the learned tokens capture consistent and subject-independent muscle activation patterns. In action recognition tasks, models using sEMG tokens achieve Top-1 accuracies of 75.5% with ViT and 67.9% with SVM, outperforming raw-signal baselines (72.8% and 64.4%, respectively), despite a 96% reduction in input dimensionality. In movement quality assessment, the tokens intuitively reveal patterns of muscle underactivation and compensatory activation, offering interpretable insights into neuromuscular control. Together, these findings highlight the effectiveness of tokenized sEMG representations as a compact, generalizable, and physiologically meaningful feature space for applications in rehabilitation, human-machine interaction, and motor function analysis.
Authors:Anupam Sharma, Harish Katti, Prajwal Singh, Shanmuganathan Raman, Krishna Miyapuram
Abstract:
An electroencephalogram (EEG) records the spatially averaged electrical activity of neurons in the brain, measured from the human scalp. Prior studies have explored EEG-based classification of objects or concepts, often for passive viewing of briefly presented image or video stimuli, with limited classes. Because EEG exhibits a low signal-to-noise ratio, recognizing fine-grained representations across a large number of classes remains challenging; however, abstract-level object representations may exist. In this work, we investigate whether EEG captures object representations across multiple hierarchical levels, and propose episodic analysis, in which a Machine Learning (ML) model is evaluated across various, yet related, classification tasks (episodes). Unlike prior episodic EEG studies that rely on fixed or randomly sampled classes of equal cardinality, we adopt hierarchy-aware episode sampling using WordNet to generate episodes with variable classes of diverse hierarchy. We also present the largest episodic framework in the EEG domain for detecting observed text from EEG signals in the PEERS dataset, comprising $931538$ EEG samples under $1610$ object labels, acquired from $264$ human participants (subjects) performing controlled cognitive tasks, enabling the study of neural dynamics underlying perception, decision-making, and performance monitoring. We examine how the semantic abstraction level affects classification performance across multiple learning techniques and architectures, providing a comprehensive analysis. The models tend to improve performance when the classification categories are drawn from higher levels of the hierarchy, suggesting sensitivity to abstraction. Our work highlights abstraction depth as an underexplored dimension of EEG decoding and motivates future research in this direction.
Authors:Yue Deng, Changyang He
Abstract:
Robotaxis are emerging as a promising form of urban mobility, yet research has largely emphasized technical driving performance while leaving open how passengers experience and evaluate rides without a human driver. To address the limitations of prior work that often relies on simulated or hypothetical settings, we investigate real-world robotaxi use through 18 semi-structured interviews and autoethnographic ride experiences. We found that users were drawn to robotaxis by low cost, social recommendation, and curiosity. They valued a distinctive set of benefits, such as an increased sense of agency, and consistent driving behavioral consistency and standardized ride experiences. However, they encountered persistent challenges around limited flexibility, insufficient transparency, management difficulty, robustness concerns in edge cases, and emergency handling concerns. Robotaxi experiences were shaped by privacy, safety, ethics, and trust. Users were often privacy-indifferent yet sensitive to opaque access and leakage risks; safety perceptions were polarized; and ethical considerations surfaced round issues such as accountability, feedback responsibility and absence of human-like social norms. Based on these findings, we propose a user-driven design framework spanning the end-to-end journey, such as pre-ride configuration (hailing), context-aware pickup facilitation (pick-up) in-ride explainability (traveling), and accountable post-ride feedback (drop-off) to guide robotaxi interaction and service design.
Authors:Yihuan Chen, Kexue Fu, Qianyi Chen, Zhicong Lu, Ray LC
Abstract:
Humans live and act in 3D space, but often work and communicate on 2D surfaces. The prevalence of online communication on 2D screens raises the issue of whether human spatial configuration affects our capabilities, social perception, and behaviors when interacting with others in 2D video chat. How do factors like location, setting, and context subtly shape our online communication, particularly in scenarios such as social support and topic-based discussions? Using Ohyay.co as a platform, we compared a normal gallery interface with a scene-based Room-type interface where participants are located in circular arrangement on screen in a social support task, and found that participants allocated attention to the group as a whole, and had pronounced self-awareness in the Room format. We then chose a two-sided topic for discussion in the Gallery interface and the Room interface where participants on each team face-off against each other, and found that they utilized spatial references to orient their allegiances, expressing greater engagement with those farther away in digital space and greater empathy with those closer, in the Room over the Gallery format. We found spatial effects in the way participants hide from the spotlight, in perspective-taking, and in their use of expressive gestures in time on the screen. This work highlights the need for considering spatial configuration in 2D in the design of collaborative communication systems to optimize for psychological needs for particular tasks.
Authors:Lawrence Obiuwevwi, Krzysztof J. Rechowicz, Vikas Ashok, Sampath Jayarathna
Abstract:
Accurately detecting hypoglycemia without invasive glucose sensors remains a critical challenge in diabetes management, particularly in regions where continuous glucose monitoring (CGM) is prohibitively expensive or clinically inaccessible. This extended study introduces a comprehensive, multimodal physiological framework for non-invasive hypoglycemia detection using wearable sensor signals. Unlike prior work limited to single-signal analysis, this chapter evaluates three physiological modalities, galvanic skin response (GSR), heart rate (HR), and their combined fusion, using the OhioT1DM 2018 dataset. We develop an end-to-end pipeline that integrates advanced preprocessing, temporal windowing, handcrafted and sequence-based feature extraction, early and late fusion strategies, and a broad spectrum of machine learning and deep temporal models, including CNNs, LSTMs, GRUs, and TCNs. Our results demonstrate that physiological signals exhibit distinct autonomic patterns preceding hypoglycemia and that combining GSR with HR consistently enhances detection sensitivity and stability compared to single-signal models. Multimodal deep learning architectures achieve the most reliable performance, particularly in recall, the most clinically urgent metric. Ablation studies further highlight the complementary contributions of each modality, strengthening the case for affordable, sensor-based glycemic monitoring. The findings show that real-time hypoglycemia detection is achievable using only inexpensive, non-invasive wearable sensors, offering a pathway toward accessible glucose monitoring in underserved communities and low-resource healthcare environments.
Authors:Yeon Su Park, Nadia Azzahra Putri Arvi, Seoyoung Kim, Juho Kim
Abstract:
Large language models (LLMs) are increasingly used as collaborative partners in writing. However, this raises a critical challenge of authorship, as users and models jointly shape text across interaction turns. Understanding authorship in this context requires examining users' evolving internal states during collaboration, particularly self-efficacy and trust. Yet, the dynamics of these states and their associations with users' prompting strategies and authorship outcomes remain underexplored. We examined these dynamics through a study of 302 participants in LLM-assisted writing, capturing interaction logs and turn-by-turn self-efficacy and trust ratings. Our analysis showed that collaboration generally decreased users' self-efficacy while increasing trust. Participants who lost self-efficacy were more likely to ask the LLM to edit their work directly, whereas those who recovered self-efficacy requested more review and feedback. Furthermore, participants with stable self-efficacy showed higher actual and perceived authorship of the final text. Based on these findings, we propose design implications for understanding and supporting authorship in human-LLM collaboration.
Authors:Luyi Sun, Wei Xu, Zaifeng Gao
Abstract:
As the paradigm of Human-Centered AI (HCAI) gains prominence, its benefits to society are accompanied by significant ethical concerns, one of which is the protection of individual privacy. This chapter provides a comprehensive overview of privacy within HCAI, proposing a human-centered privacy (HCP) framework, providing integrated solution from technology, ethics, and human factors perspectives. The chapter begins by mapping privacy risks across each stage of AI development lifecycle, from data collection to deployment and reuse, highlighting the impact of privacy risks on the entire system. The chapter then introduces privacy-preserving techniques such as federated learning and dif erential privacy. Subsequent chapters integrate the crucial user perspective by examining mental models, alongside the evolving regulatory and ethical landscapes as well as privacy governance. Next, advice on design guidelines is provided based on the human-centered privacy framework. After that, we introduce practical case studies across diverse fields. Finally, the chapter discusses persistent open challenges and future research directions, concluding that a multidisciplinary approach, merging technical, design, policy, and ethical expertise, is essential to successfully embed privacy into the core of HCAI, thereby ensuring these technologies advance in a manner that respects and ensures human autonomy, trust and dignity.
Authors:Linjie Qiu, Duotun Wang, Boyu Li, Jiawei Li, Yulin Shen, Zeyu Wang, Mingming Fan
Abstract:
Target selection is a fundamental interaction in virtual reality (VR). But the act of confirming a selection, such as a button press or pinch, can disturb the tracked pose and shift the intended target, which is referred to as the Heisenberg Effect. Prior research has mainly investigated controller input. However, it remains unclear how the effect manifests in the bare-hand input and how score-based techniques may mitigate the effect in different spatial variations. To fill the gap, we conduct a within-subject study to examine the Heisenberg Effect across two input modalities (i.e., controller and hand) and two selection mechanisms (i.e., direct and score-based). Our results show that hand input is more susceptible to the Heisenberg Effect, with direct selection more influenced by target width and score-based selection more sensitive to target density. Based on previous vote-oriented technique and our temporal analysis, we introduce weighted VOTE, a history-based intention accuracy model for target voting, that reweights recent interaction intent to counteract input disturbances. Our evaluation shows the method improves selection accuracy compared to baseline techniques. Finally, we discuss future directions for adaptive selection methods.
Authors:Valerio Belcamino, Mariya Kilina, Alessandro Carfì, Valeria Seidita, Fulvio Mastrogiovanni, Antonio Chella
Abstract:
Dialogue-based human-robot interaction requires robot cognitive assistants to maintain persistent user context, recover from underspecified requests, and ground responses in external evidence, while keeping intermediate decisions verifiable. In this paper we introduce JANUS, a cognitive architecture for assistive robots that models interaction as a partially observable Markov decision process and realizes control as a factored controller with typed interfaces. To this aim, Janus (i) decomposes the overall behavior into specialized modules, related to scope detection, intent recognition, memory, inner speech, query generation, and outer speech, and (ii) exposes explicit policies for information sufficiency, execution readiness, and tool grounding. A dedicated memory agent maintains a bounded recent-history buffer, a compact core memory, and an archival store with semantic retrieval, coupled through controlled consolidation and revision policies. Models inspired by the notion of inner speech in cognitive theories provide a control-oriented internal textual flow that validates parameter completeness and triggers clarification before grounding, while a faithfulness constraint ties robot-to-human claims to an evidence bundle combining working context and retrieved tool outputs. We evaluate JANUS through module-level unit tests in a dietary assistance domain grounded on a knowledge graph, reporting high agreement with curated references and practical latency profiles. These results support factored reasoning as a promising path to scalable, auditable, and evidence-grounded robot assistance over extended interaction horizons.
Authors:Victor Nikhil Antony, Adithya R N, Sarah Derrick, Zhili Gong, Peter M. Donley, Chien-Ming Huang
Abstract:
Plants offer a paradoxical model for interaction: they are ambient, low-demand presences that nonetheless shape atmosphere, routines, and relationships through temporal rhythms and subtle expressions. In contrast, most human-robot interaction (HRI) has been grounded in anthropomorphic and zoomorphic paradigms, producing overt, high-demand forms of engagement. Using a Research through Design (RtD) methodology, we explore plants as metaphoric inspiration for HRI; we conducted iterative cycles of ideation, prototyping, and reflection to investigate what design primitives emerge from plant metaphors and morphologies, and how these primitives can be combined into expressive robotic forms. We present a suite of speculative, open-source prototypes that help probe plant-inspired presence, temporality, form, and gestures. We deepened our learnings from design and prototyping through prototype-centered workshops that explored people's perceptions and imaginaries of plant-inspired robots. This work contributes: (1) Set of plant-inspired robotic artifacts; (2) Designerly insights on how people perceive plant-inspired robots; and (3) Design consideration to inform how to use plant metaphors to reshape HRI.
Authors:Victor Nikhil Antony, Zhili Gong, Guanchen Li, Clara Jeon, Chien-Ming Huang
Abstract:
Robotic objects are simple actuated systems that subtly blend into human environments. We design and introduce Lantern, a minimalist robotic object platform to enable building simple robotic artifacts. We conducted in-depth design and engineering iterations of Lantern's mechatronic architecture to meet specific design goals while maintaining a low build cost (~40 USD). As an extendable, open-source platform, Lantern aims to enable exploration of a range of HRI scenarios by leveraging human tendency to assign social meaning to simple forms. To evaluate Lantern's potential for HRI, we conducted a series of explorations: 1) a co-design workshop, 2) a sensory room case study, 3) distribution to external HRI labs, 4) integration into a graduate-level HRI course, and 5) public exhibitions with older adults and children. Our findings show that Lantern effectively evokes engagement, can support versatile applications ranging from emotion regulation to focused work, and serves as a viable platform for lowering barriers to HRI as a field.
Authors:Jeanne Malécot, Hamed Rahimi, Jeanne Cattoni, Marie Samson, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani
Abstract:
Existing human-robot interaction systems often lack mechanisms for sustained personalization and dynamic adaptation in multi-user environments, limiting their effectiveness in real-world deployments. We present HARMONI, a multimodal personalization framework that leverages large language models to enable socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses. Through extensive evaluation and ablation studies on four datasets, as well as a real-world scenario-driven user-study in a nursing home environment, we demonstrate that HARMONI supports robust speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches in user modeling accuracy, personalization quality, and user satisfaction.
Authors:Mark Colley, Simon Kopp, Debargha Dey, Pascal Jansen, Enrico Rukzio
Abstract:
With automated vehicles (AVs), the absence of a human operator could necessitate external Human-Machine Interfaces (eHMIs) to communicate with other road users. Existing research primarily focuses on pedestrian-AV interactions, with limited attention given to other road users, such as cyclists and drivers of manually driven vehicles. So far, no studies have compared the effects of eHMIs across these three road user roles. Therefore, we conducted a within-subjects virtual reality experiment (N=40), evaluating the subjective and objective impact of an eHMI communicating the AV's intention to pedestrians, cyclists, and drivers under various levels of distraction (no distraction, visual noise, interference). eHMIs positively influenced safety perceptions, trust, perceived usefulness, and mental demand across all roles. While distraction and road user roles showed significant main effects, interaction effects were only observed in perceived usability. Thus, a unified eHMI design is effective, facilitating the standardization and broader adoption of eHMIs in diverse traffic.
Authors:Pascal Jansen, Julian Britten, Mark Colley, Markus Sasalovici, Enrico Rukzio
Abstract:
Traffic is inherently dangerous, with around 1.19 million fatalities annually. Automotive Mediated Reality (AMR) can enhance driving safety by overlaying critical information (e.g., outlines, icons, text) on key objects to improve awareness, altering objects' appearance to simplify traffic situations, and diminishing their appearance to minimize distractions. However, real-world AMR evaluation remains limited due to technical challenges. To fill this sim-to-real gap, we present MIRAGE, an open-source tool that enables real-time AMR in real vehicles. MIRAGE implements 15 effects across the AMR spectrum of augmented, diminished, and modified reality using state-of-the-art computational models for object detection and segmentation, depth estimation, and inpainting. In an on-road expert user study (N=9) of MIRAGE, participants enjoyed the AMR experience while pointing out technical limitations and identifying use cases for AMR. We discuss these results in relation to prior work and outline implications for AMR ethics and interaction design.
Authors:Yuqi Tong, Ruiyang Li, Chengkun Li, Qixuan Liu, Shi Qiu, Pheng-Ann Heng
Abstract:
High-fidelity cinematic medical visualization on mobile virtual reality (VR) remains challenging. Although ClipGS enables cross-sectional exploration via 3D Gaussian Splatting, it lacks arbitrary-angle slicing on consumer-grade VR headsets. To achieve real-time interactive performance, we introduce ClipGS-VR and restructure ClipGS's neural inference into a consolidated dataset, integrating high-fidelity layers from multiple pre-computed slicing states into a unified rendering structure. Our framework further supports arbitrary-angle slicing via gradient-based opacity modulation for smooth, visually coherent rendering. Evaluations confirm our approach maintains visual fidelity comparable to offline results while offering superior usability and interaction efficiency.
Authors:Ye Tian, Haohua Du, Chao Gu, Junyang Zhang, Shanyue Wang, Hao Zhou, Jiahui Hou, Xiang-Yang Li
Abstract:
Silent speech interfaces (SSIs) enable silent interaction in noise-sensitive or privacy-sensitive settings. However, existing SSIs face practical deployment trade-offs among privacy, user experience, and energy consumption, and most remain limited to closed-set recognition over small, pre-defined vocabularies of words or sentences, which restricts real-world expressiveness. In this paper, we present Lip-Siri, to the best of our knowledge, the first Wi-Fi backscatter--based SSI that supports open-vocabulary sentence recognition via lexicon-guided subword decoding. Lip-Siri designs a frequency-shifted backscatter tag to isolate tag-modulated reflections and suppress interference from non-target motions, enabling reliable extraction of lip-motion traces from ubiquitous Wi-Fi signals. We then segment continuous traces into lip-motion units, cluster them, learn robust unit representations via cluster-based self-supervision, and finally propose a lexicon-guided Transformer encoder--decoder with beam search to decode variable-length sentence sequences. We implement an end-to-end prototype and evaluate it with 15 participants on 340 sentences and 3,398 words across multiple scenarios. Lip-Siri achieves 85.61% accuracy on word prediction and a WER of 36.87% on continuous sentence recognition, approaching the performance of representative vision-based lip-reading systems.
Authors:Anqi Wang, Zhengyi Li, Lan Luo, Xin Tong, Pan Hui
Abstract:
Creative coding requires continuous translation between evolving concepts and computational artifacts, making reflection essential yet difficult to sustain. Creators often struggle to manage ambiguous intentions, emergent outputs, and complex code, limiting depth of exploration. This work examines how large language models (LLMs) can scaffold reflection not as isolated prompts, but as a system-level mechanism shaping creative regulation. From formative studies with eight expert creators, we derived reflection challenges and design principles that informed Reflexa, an integrated scaffold combining dialogic guidance, visualized version navigation, and iterative suggestion pathways. A within-subject study with 18 participants provides an exploratory mechanism validation, showing that structured reflection patterns mediate the link between AI interaction and creative outcomes. These reflection trajectories enhanced perceived controllability, broadened exploration, and improved originality and aesthetic quality. Our findings advance HCI understanding of reflection from LLM-assisted creative practices, and provide design strategies for building LLM-based creative tools that support richer human-AI co-creativity.
Authors:Yimeng Liu, Misha Sra, Chang Xiao
Abstract:
Designing user interfaces that align with user preferences is a time-consuming process, which requires iterative cycles of prototyping, user testing, and refinement. Recent advancements in LLM-based UI generation have enabled efficient UI generation to assist the UI design process. We introduce AlignUI, a method that aligns LLM-generated UIs with user tasks and preferences by using a user preference dataset to guide the LLM's reasoning process. The dataset was crowdsourced from 50 general users (the target users of generated UIs) and contained 720 UI control preferences on eight image-editing tasks. We evaluated AlignUI by generating UIs for six unseen tasks and conducting a user study with 72 additional general users. The results showed that the generated UIs closely align with multiple dimensions of user preferences. We conclude by discussing the applicability of our method to support user-aligned UI design for multiple task domains and user groups, as well as personalized user needs.
Authors:Xiaokang Lei, Ching Christie Pang, Yuyang Jiang, Xin Tong, Pan Hui
Abstract:
Artificial intelligence (AI) and large language models (LLMs) are reshaping education, with virtual avatars emerging as digital teachers capable of enhancing engagement, sustaining attention, and addressing instructor shortages. Aligned with the Sustainable Development Goals (SDGs) for equitable quality education, these technologies hold promise yet lack clear guidelines for effective design and implementation in online learning. To fill this gap, we introduce a framework specifying when, what, and how digital teachers should be integrated. Our study combines (1) a design space analysis of 87 works across AI, educational technology, design, and HCI, (2) a survey of 132 learners' practices and preferences, and (3) three co-design workshops with 18 experts from pedagogy, design, and AI. It provides actionable guidance for educators, designers, and HCI researchers, advancing opportunities to build more engaging, equitable, and effective online learning environments powered by digital teachers.
Authors:Chao Wang, Anna Belardinelli, Michael Gienger
Abstract:
Social-physical human-robot interaction (spHRI) is difficult to study: building and programming robots that integrate multiple interaction modalities is costly and slow, while VR-based prototypes often lack physical contact, breaking users' visuo-tactile expectations. We present XR$^3$, a co-located dual-VR-headset platform for HRI research in which an attendee and a hidden operator share the same physical space while experiencing different virtual embodiments. The attendee sees an expressive virtual robot that interacts face-to-face in a shared virtual environment. In real time, the robot's upper-body motion, head and gaze behavior, and facial expressions are mapped from the operator's tracked limbs and face signals. Because the operator is co-present and calibrated in the same coordinate frame, the operator can also touch the attendee, enabling perceived robot touch synchronized with the robot's visible hands. Finger and hand motion is mapped to the robot avatar using inverse kinematics to support precise contact. Beyond motion retargeting, XR$^3$ supports social retargeting of multiple nonverbal cues that can be experimentally varied while keeping physical interaction constant. We detail the system design and calibration, and demonstrate the platform in a touch-based Wizard-of-Oz study, lowering the barrier to prototyping and evaluating embodied, contact-based robot behaviors.
Authors:Zaifeng Gao, Yuanxiu Zhao, Hanxi Pan, Wei Xu
Abstract:
With the rapid development of artificial intelligence (AI), machines are increasingly evolving into intelligent agents, and the human-machine relationship is shifting from traditional "human-computer interaction" toward a new paradigm of "human-AI collaboration." However, technology-centered approaches to AI development have gradually revealed limitations such as fragility, bias, and low explainability, highlighting the urgent need for human-centered AI (HCAI) design philosophy. As a systems engineering approach, the successful implementation of HCAI depends critically on the design and optimization of high-quality human-AI interaction (HAII). This paper systematically reviews our research team's nearly decade-long exploration and practice in HCAI. At the level of research vision, we were among the first in China to systematically propose HAII as an interdisciplinary field and to develop a human-centered conceptual framework for human--AI collaboration. At the theoretical level, we introduced frameworks for human-AI joint cognitive systems, team-level situation awareness among intelligent agents, and shared social understanding, forming a relatively comprehensive theoretical system. At the methodological level, we established a hierarchical HCAI framework and a taxonomy of HCAI implementation methods. At the application level, we conducted a series of studies in domains such as autonomous driving, intelligent aircraft cockpit, and trust in human-AI collaboration, empirically validating the effectiveness of the proposed frameworks. Looking ahead, research on HCAI and HAII must continue to advance along three dimensions: theoretical deepening, methodological innovation, and application expansion, promoting the development of an intelligent society that is human-centered and characterized by harmonious human-AI coexistence.
Authors:Greta Warren, Jingyi Sun, Irina Shklovski, Isabelle Augenstein
Abstract:
Although much research has focused on AI explanations to support decisions in complex information-seeking tasks such as fact-checking, the role of evidence is surprisingly under-researched. In our study, we systematically varied explanation type, AI prediction certainty, and correctness of AI system advice for non-expert participants, who evaluated the veracity of claims and AI system predictions. Participants were provided the option of easily inspecting the underlying evidence. We found that participants consistently relied on evidence to validate AI claims across all experimental conditions. When participants were presented with natural language explanations, evidence was used less frequently although they relied on it when these explanations seemed insufficient or flawed. Qualitative data suggests that participants attempted to infer evidence source reliability, despite source identities being deliberately omitted. Our results demonstrate that evidence is a key ingredient in how people evaluate the reliability of information presented by an AI system and, in combination with natural language explanations, offers valuable support for decision-making. Further research is urgently needed to understand how evidence ought to be presented and how people engage with it in practice.
Authors:JungMin Yun, Juhwan Choi, Kyohoon Jin, Soojin Jang, Jinhee Jang, YoungBin Kim
Abstract:
This paper incorporates the efficiency of automatic summarization and addresses the challenge of generating personalized summaries tailored to individual users' interests and requirements. To tackle this challenge, we introduce SummPilot, an interaction-based customizable summarization system. SummPilot leverages a large language model to facilitate both automatic and interactive summarization. Users can engage with the system to understand document content and personalize summaries through interactive components such as semantic graphs, entity clustering, and explainable evaluation. Our demo and user studies demonstrate SummPilot's adaptability and usefulness for customizable summarization.
Authors:Xiaotian Zhang, Jinhong Yu, Pengwei Yan, Le Jiang, Xingyi Shen, Mumo Cheng, Xiaozhong Liu
Abstract:
Chronic disease management requires regular adherence feedback to prevent avoidable hospitalizations, yet clinicians lack time to produce personalized patient communications. Manual authoring preserves clinical accuracy but does not scale; AI generation scales but can undermine trust in patient-facing contexts. We present a clinician-in-the-loop interface that constrains AI to data organization and preserves physician oversight through recognition-based review. A single-page editor pairs AI-generated section drafts with time-aligned visualizations, enabling inline editing with visual evidence for each claim. This division of labor (AI organizes, clinician decides) targets both efficiency and accountability. In a pilot with three physicians reviewing 24 cases, AI successfully generated clinically personalized drafts matching physicians' manual authoring practice (overall mean 4.86/10 vs. 5.0/10 baseline), requiring minimal physician editing (mean 8.3\% content modification) with zero safety-critical issues, demonstrating effective automation of content generation. However, review time remained comparable to manual practice, revealing an accountability paradox: in high-stakes clinical contexts, professional responsibility requires complete verification regardless of AI accuracy. We contribute three interaction patterns for clinical AI collaboration: bounded generation with recognition-based review via chart-text pairing, automated urgency flagging that analyzes vital trends and adherence patterns with fail-safe escalation for missed critical monitoring tasks, and progressive disclosure controls that reduce cognitive load while maintaining oversight. These patterns indicate that clinical AI efficiency requires not only accurate models, but also mechanisms for selective verification that preserve accountability.
Authors:Tianwang Jia, Xiaoqing Chen, Dongrui Wu
Abstract:
Electroencephalogram (EEG)-based brain-computer interfaces (BCIs) are widely adopted due to their efficiency and portability; however, their decoding algorithms still face multiple challenges, including inadequate generalization, adversarial vulnerability, and privacy leakage. This paper proposes Secure and Accurate FEderated learning (SAFE), a federated learning-based approach that protects user privacy by keeping data local during model training. SAFE employs local batch-specific normalization to mitigate cross-subject feature distribution shifts and hence improves model generalization. It further enhances adversarial robustness by introducing perturbations in both the input space and the parameter space through federated adversarial training and adversarial weight perturbation. Experiments on five EEG datasets from motor imagery (MI) and event-related potential (ERP) BCI paradigms demonstrated that SAFE consistently outperformed 14 state-of-the-art approaches in both decoding accuracy and adversarial robustness, while ensuring privacy protection. Notably, it even outperformed centralized training approaches that do not consider privacy protection at all. To our knowledge, SAFE is the first algorithm to simultaneously achieve high decoding accuracy, strong adversarial robustness, and reliable privacy protection without using any calibration data from the target subject, making it highly desirable for real-world BCIs.
Authors:Zhihao Yuan, Yunze Xiao, Ming Li, Weihao Xuan, Richard Tong, Mona Diab, Tom Mitchell
Abstract:
This paper presents a conceptual and methodological framework for large language model (LLM) based student simulation in educational settings. The authors identify a core failure mode, termed the "competence paradox" in which broadly capable LLMs are asked to emulate partially knowledgeable learners, leading to unrealistic error patterns and learning dynamics. To address this, the paper reframes student simulation as a constrained generation problem governed by an explicit Epistemic State Specification (ESS), which defines what a simulated learner can access, how errors are structured, and how learner state evolves over time. The work further introduces a Goal-by-Environment framework to situate simulated student systems according to behavioral objectives and deployment contexts. Rather than proposing a new system or benchmark, the paper synthesizes prior literature, formalizes key design dimensions, and articulates open challenges related to validity, evaluation, and ethical risks. Overall, the paper argues for epistemic fidelity over surface realism as a prerequisite for using LLM-based simulated students as reliable scientific and pedagogical instruments.
Authors:Guanyu Chen, Chenxiao Yu, Xiyang Hu
Abstract:
Large language models (LLMs) are increasingly used to simulate decision-making tasks involving personal data sharing, where privacy concerns and prosocial motivations can push choices in opposite directions. Existing evaluations often measure privacy-related attitudes or sharing intentions in isolation, which makes it difficult to determine whether a model's expressed values jointly predict its downstream data-sharing actions as in real human behaviors. We introduce a context-based assessment protocol that sequentially administers standardized questionnaires for privacy attitudes, prosocialness, and acceptance of data sharing within a bounded, history-carrying session. To evaluate value-action alignments under competing attitudes, we use multi-group structural equation modeling (MGSEM) to identify relations from privacy concerns and prosocialness to data sharing. We propose Value-Action Alignment Rate (VAAR), a human-referenced directional agreement metric that aggregates path-level evidence for expected signs. Across multiple LLMs, we observe stable but model-specific Privacy-PSA-AoDS profiles, and substantial heterogeneity in value-action alignment.
Authors:Ananya Bhattacharjee, Jina Suh, Mohit Chandra, Javier Hernandez
Abstract:
Cognitive reappraisal is a well-studied emotion regulation strategy that helps individuals reinterpret stressful situations to reduce their impact. Many digital mental health tools struggle to support this process because rigid scripts fail to accommodate how users naturally describe stressors. This study examined the feasibility of an LLM-based single-session intervention (SSI) for workplace stress reappraisal. We assessed short-term changes in stress-related outcomes and examined design tensions during use. We conducted a feasibility study with 100 employees at a large technology company who completed a structured cognitive reappraisal session delivered by a GPT-4o-based chatbot. Pre-post measures included perceived stress intensity, stress mindset, perceived demand, and perceived resources. These outcomes were analyzed using paired Wilcoxon signed-rank tests with correction for multiple comparisons. We also examined sentiment and stress trajectories across conversation quartiles using two RoBERTa-based classifiers and an LLM-based stress rater. Open-ended responses were analyzed using thematic analysis. Results showed significant reductions in perceived stress intensity and significant improvements in stress mindset. Changes in perceived resources and perceived demand trended in expected directions but were not statistically significant. Automated analyses indicated consistent declines in negative sentiment and stress over the course of the interaction. Qualitative findings suggested that participants valued the structured prompts for organizing thoughts, gaining perspective, and feeling acknowledged. Participants also reported tensions around scriptedness, preferred interaction length, and reactions to AI-driven empathy. These findings highlight both the promise and the design constraints of integrating LLMs into DMH interventions for workplace settings.
Authors:Arvind Srinivasan, Niklas Elmqvist
Abstract:
Where people look during shared activity carries coordination cues that speech and gesture cannot replace, but these patterns remain invisible to participants. XR headsets make gaze available as real-time input, yet few systems feed it back visually. We frame our work using the Attention-Aware Pipeline (Capture, Record, Revisualize), whose feedback loop means the systems visual response alters what users attend to next, triggering further responses. This generates design tensions whose form depends on each stages configuration. We trace the pipeline through three systems casting attention as a mirror (reflecting gaze history), a medium (sharing it across collaborators), and a mediator (intervening through diminished reality). Each encountered a tension the loop predicted, motivating the next. A formative eye-tracking study of four musicians surfaced attentional tunneling and near-total disconnection, confirming the need for intervention. We present these tensions and a next step: testing whether subtractive intervention reduces tunneling for a single sight-reader.
Authors:Goda Cicėnaitė, Thomas Welsh, Helmut Neukirchen
Abstract:
Cybersecurity threats are increasing in all aspects of society due to the integration of digital systems into modern-day life and a volatile geo-political landscape. Technical factors are an ongoing arms race; however, the threat surface from human and social factors is still present, often providing malicious actors the means to bypass complex technical security controls. Understanding human factors in light of technical evolution is essential to ensure security controls remain effective. This study presents the results of a survey on cybersecurity challenges within public and private sector organisations, including critical infrastructure providers, in Iceland (N = 130). From the management perspective, human factors were strongly noted as challenges and barriers to their organisations' security. These challenges include a lack of adequate training or awareness, hiring issues, poor cybersecurity culture, and time and/or financial resource constraints. Based on these findings, recommendations for mitigating threats from human factors are derived. These include: prioritising targeted over generic training to reduce employee fatigue, external government support for financially constrained organisations, and building a strong cybersecurity culture through constructive communication around shared responsibilities.
Authors:Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams, Israel Mason-Williams, Emmanouil Benetos, Joshua Reiss
Abstract:
Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.
Authors:Astrid van den Brandt, Kiroong Choe, Sehi L'Yi, Devin Lange, Nils Gehlenborg
Abstract:
Diverse genomics data, scientific questions, and analysis tasks typically demand highly specialized visualizations. Therefore, users often must customize or author new ones tailored to their data. Existing tools are usually either limited in customization or require substantial learning or programming, and even expressive tools assume visualization expertise many users lack. Agentic and large language model (LLM) approaches are increasingly applied to complex scientific tasks, including visualization. Natural-language conversational interfaces offer a promising path to democratizing the authoring of complex visualizations. In the context of genomics, these approaches face additional challenges: genomics visualizations typically integrate heterogeneous data types and are composed of multiple linked interactive views. These challenges motivate more structured LLM-based schemes. We first characterize where vanilla LLM generation succeeds and fails for genomics visualization, identifying eight quality dimensions. We then compare six schemes--direct generation, a fixed pipeline, and four agentic configurations varying in the number of specialist agents and the presence of a reviewer--across 159 cases spanning three levels of query ambiguity and specification complexity. All schemes use the Gosling visualization grammar as structured output. Agentic iteration substantially improves perceived quality over both baselines, while more complex agent architectures yield no additional benefit. We discuss implications for designing agentic systems for domain-specific visualization authoring. All supplemental materials are available at https://osf.io/uqe83.
Authors:Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He
Abstract:
Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30\% to 50\% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $Δ_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $Δ_{\text{floor}}=5$pp (within 5pp of the 15\% control rate) and at least one lightly-aligned model with 45pp (5\%--50\% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $Δ_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
Authors:Shang Wu, Hongyu Yao, Catarina Belem, Shuyuan Fu, Mark Steyvers, Padhraic Smyth
Abstract:
Artificial intelligence (AI) is being increasingly integrated into human problem-solving, yet its effects on individual skill development remain unclear. We examine how both AI usage and informativeness can shape learning in the context of a controlled logical reasoning task with on-demand access to AI assistance. We find that greater AI usage is associated with weaker skill development: heavy AI users underperform relative to comparable peers, whereas light AI users perform similarly to matched users who do not use AI. We also find in our study that these patterns are mediated by AI informativeness. Low-information AI neither improves immediate performance nor preserves performance after AI assistance is removed, and is linked to weaker learning overall. On the other hand, high-information AI was found to improve short-run performance without reducing post-AI outcomes on average in our experiments, but with heterogeneous effects. Our findings in general suggest that AI can, depending on context, either complement human skill development by amplifying independent reasoning or can act as a substitute that undermines such reasoning, with the implication that regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.
Authors:Inna Wanyin Lin, Sahand Sabour, Hong Sng, Maxine Chan, Minlie Huang, Andrew White, Tim Althoff
Abstract:
Clinicians are expected to disclose harmful medical errors to patients and families in line with ethical, regulatory, and patient care standards, yet these conversations remain challenging because of their emotional complexity and limited training opportunities. Most physicians still learn primarily through lectures and observation, while static video tools-though available-are underused, lack adaptability across specialties, and deliver delayed, generic feedback. These gaps restrict skill development, reduce self-efficacy, and contribute to avoidance of disclosure conversations, ultimately compromising patient care and eroding trust. To address these needs, we designed CandorMD -- an AI-assisted simulation system that provides real-time practice, actionable feedback, and diverse practice environments tailored to individual learning needs. We conducted semi-structured interviews with physicians, risk managers, patient advocates, and communication experts to understand current practices, identify gaps, and collect feedback on CandorMD. Based on these insights, we present findings and design recommendations for the future of AI-supported medical communication training.
Authors:Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao
Abstract:
While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0\% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark.While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.
Authors:Jiayi Shao, Jiaying Ye, Shengyao Liu, Zachary Englhardt, Girish Narayanswamy, Vikram Iyer, Qiuyue Shirley Xue
Abstract:
Wearables are widely used for mobile health monitoring, and photoplethysmography (PPG) is a key sensing modality for heart rate and related physiological measurements. However, public in-the-wild PPG datasets remain largely wrist-centric or limited to short, controlled studies, constraining research on emerging wearable form factors. We present Multi-site PPG, an in-the-wild physiological dataset collected from four custom-developed unobtrusive wearables: a smart earring, ring, watch, and necklace. Each device records green and infrared reflective PPG, 3-axis acceleration, and temperature with timestamps for cross-device alignment, while a Polar H10 chest strap provides reference electrocardiogram (ECG). Participants wore the devices for multiple days during daytime activities while continuing their normal routines. The dataset contains over 350 hours of raw data and 230-290 hours of modeling-ready 8-second windows per wearable. We benchmark heuristic, supervised, and self-supervised heart-rate estimation methods, showing substantial body-site differences: the best methods achieve mean absolute errors (MAEs) of 2.30 bpm on the earring, 5.13 bpm on the ring, 8.37 bpm on the watch, and 8.68 bpm on the necklace. We further analyze motion effects and evaluate multi-site and PPG-accelerometer fusion, demonstrating the dataset's value for robust physiological sensing across emerging wearable form factors.
Authors:Enkelejda Kasneci, Gjergji Kasneci
Abstract:
This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority ("my notes say I'm right") and social-affective face-saving ("please don't tell me I'm wrong"). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.
Authors:Lixiang Yan, Samuel Greiff, Jason M. Lodge, Dragan Gašević
Abstract:
Generative artificial intelligence (AI) is increasingly being integrated into education, where it can boost learners' performance. However, these uses do not promote the deep cognitive and metacognitive processing that are required for high-quality learning.
Authors:Shuntian Zheng, Jiaqi Li, Xiaoman Lu, Shuai He, Yu Guan
Abstract:
Millimeter-wave (mmWave) enables privacy-preserving, illumination-robust human pose estimation (HPE), with each mmWave frame represented as a range-angle-Doppler tensor, providing spatial magnitude for localization and Doppler signatures for motion-related cues. However, existing mmWave-based HPE methods either underutilize or naïvely fuse Doppler signatures with spatial magnitude, disregarding their distinct physical semantics. As a result, non-human Doppler signatures can be misinterpreted as human motion cues, leading to jittery trajectories. We propose PULSE, which converts Doppler signatures into confidence-aware motion prompts and injects them into spatial magnitude reasoning through constrained interactions. By screening Doppler prompts before they influence prediction, PULSE first suppresses spurious spectral motion cues and then uses the screened prompts to stabilize prediction. Across three datasets spanning single- and multi-person settings, PULSE consistently improves pose accuracy and temporal stability, indicating that controlled Doppler prompting is a practical direction for stable mmWave HPE.
Authors:Keyu Yao, Jinghui Cheng, Jin L. C. Guo
Abstract:
Designers hold primary responsibility for shaping the user interface (UI) and user experience (UX) of a product. This role goes beyond aesthetics and usability, extending to the privacy outcomes of user experience, which often emerge through collaboration with other stakeholders such as developers, product managers, and marketing teams. Previous studies on enhancing privacy for technological products primarily focused on the roles of developers -- understanding their needs and challenges -- but limited effort is devoted to examining how UI/UX designers consider and approach privacy in their work. Through 12 semi-structured interviews with privacy-advocating UI/UX designers, we explore the perceptions, influencing factors, challenges, and adaptive methods they use regarding privacy implementation. We pay special attention to how these challenges and adaptations play out in team-based settings where decisions are negotiated together. Our study reveals how personal and contextual factors shape designers' value of privacy, the collaborative nature of the challenges designers face when trying to prioritize privacy, and how they navigate tensions between business goals, team dynamics, and technical development. Based on our findings, we discuss implications for advocating a user-centered approach for supporting privacy-aware design, suggestions for organizational-level changes and bridging knowledge gaps through designer-centric tools and community building.
Authors:Rozhan Hozhabri Nezhad, Jin L. C. Guo, Jinghui Cheng
Abstract:
Open source software (OSS) often prioritizes technical functionality over usability and UX design. This imbalance limits OSS adoption among broader, non-technical users. Key underlying factors contributing to this issue are the shortage of design expertise in OSS and a dominant developer-centric mindset. To address these persistent issues, we explore the potential of speculative design as a catalyst for transforming the OSS community's mindset towards a more designer-inclusive environment. Our design was informed by an analysis of online forums, which revealed designers' motivations and challenges when contributing to OSS. Guided by these insights, we created two speculative societies, Husia (collectivist) and Reetar (individualist), in which designers are valued for different reasons and their work incorporated in different ways. Through a user study with 12 OSS practitioners (seven designers and five developers), we found that our speculative societies provoked participants' rich and critical reflections on OSS values, the root causes of challenges, and proposed actions. Our work provides insights into how speculative design can be used in the practical, sociotechnical context of OSS to stimulate critical reflection, improve awareness, and yield recommendations for fostering an equitable, sustainable, and inclusive OSS environment.
Authors:Kiyoshige Garces, Gloria Milena Fernandez-Nieto, Linxuan Zhao, Sachini Samaraweera, Dragan Gasevic, Roberto Martinez-Maldonado, Vanessa Echeverria
Abstract:
Research shows that dialogue, the interactive process through which participants articulate their thinking, plays a central role in constructing shared understanding, coordinating action, and shaping learning outcomes in teams. Analysing dialogue content has been central to advancing team learning theory and informing the design of computer-supported collaborative learning environments, yet this progress has depended on labour-intensive qualitative coding. LLMs offer new possibilities for automating and enhancing the dialogue layer within emerging multimodal learning analytics approaches, with recent studies showing that they can approximate human coding through few-shot prompting. However, prior work has focused on replicating human coding accuracy for research purposes, rather than addressing a more educationally consequential question: how can we design prompts that allow an LLM to label team dialogue accurately and fast enough to be useful in real settings, such as in-person healthcare simulations, where results must be returned quickly and computational cost and sustainability also matter? This paper investigates how prompt design and batching strategies can be optimised to balance coding accuracy, processing time, and environmental impact in team-based healthcare simulation debriefing. Using a dataset of 11,647 utterances coded across 6 dialogue constructs, we compared 4 prompt designs across varying batch sizes, evaluating coding performance, processing time, and energy consumption, as well as the trade-offs between these metrics. Results indicate that increasing batch size improves speed and reduces energy use, but negatively impacts coding performance. Beyond demonstrating the feasibility of LLM-based qualitative analysis, this study offers practical guidance for scaling dialogue analytics in contexts where timeliness, privacy, and sustainability are critical.
Authors:Xin Sun, Yue Su, Yifan Mo, Qingyu Meng, Yuxuan Li, Saku Sugawara, Mengyuan Zhang, Charlotte Gerritsen, Sander L. Koole, Koen Hindriks, Jiahuan Pei
Abstract:
Building trustworthy AI systems for mental health support is a shared priority across stakeholders from multiple disciplines. However, "trustworthy" remains loosely defined and inconsistently operationalized. AI research often focuses on technical criteria (e.g., robustness, explainability, and safety), while therapeutic practitioners emphasize therapeutic fidelity (e.g., appropriateness, empathy, and long-term user outcomes). To bridge the fragmented landscape, we propose a three-layer trust framework, covering human-oriented, AI-oriented, and interaction-oriented trust, integrating the viewpoints of key stakeholders (e.g., practitioners, researchers, regulators). Using this framework, we systematically review existing AI-driven research in mental health domain and examine evaluation practices for ``trustworthy'' ranging from automatic metrics to clinically validated approaches. We highlight critical gaps between what NLP currently measures and what real-world mental health contexts require, and outline a research agenda for building socio-technically aligned and genuinely trustworthy AI for mental health support.
Authors:Haoran Yin, Zhiyuan Wen, Jiannong Cao, Bo Yuan, Ruosong Yang
Abstract:
Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under $A{\to}B{\to}A$ task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop streams into multi-perspective personal logs.
Authors:Roberto Martinez-Maldonado, Vanessa Echeverria, Jenna Hawes, YJ Kim, Zara Maddigan, Mikaela Milesi, Todd Nelson, Yi-Shan Tsai
Abstract:
Education is not merely the transmission of information or the optimisation of individual performance; it is a fundamentally social, constructive, and relational practice. However, recent advances in generative artificial intelligence (GenAI) increasingly emphasise efficiency, automation, and individualised assistance, risking the weakening of relational learning processes. Despite growing adoption, AI in education (AIED) research has yet to fully articulate how AI can be designed in ways that sustain the social and ecological relationships through which learning occurs. In this paper, we re-centre education as relational and frame learner-AI interactions as context-specific relationships with clearly defined purposes and boundaries, rather than positioning them as substitutes for, or replacements of, human interaction. Grounded in participatory design practices and inspired by Indigenous worldviews (including Aboriginal Australian, Native American, and Mesoamerican traditions) that foreground reciprocity and relational accountability, we argue that meaningful educational AI should support learning with others rather than replace them. We advance this perspective by: i) conceptualising AIED as a relational design problem grounded in reciprocity; ii) articulating key tensions introduced by GenAI in education; and iii) outlining design directions that expand the AIED design space toward reciprocity, including when not to use AI, how to define pedagogical boundaries, and how to support responsible uses of AIED innovations that sustain communities and natural environments.
Authors:Vassilios Exarhakos, Jinghui Cheng, Jin L. C. Guo
Abstract:
Current AI-assisted programming tools are predominantly linear and chat-based, which deviates from the iterative and branching nature of programming itself. Our preliminary study with developers using AI assistants suggested that they often struggle to explore alternatives, manage prompting sequences, and trace changes. Informed by these insights, we created EvoGraph, an IDE plugin that integrates AI interactions and code changes as a lightweight and interactive development graph. EvoGraph automatically records a branching AI-assisted coding history and allows developers to manipulate the graph to compare, merge, and revisit prior collaborative AI programming states. Our user study with 20 participants revealed that EvoGraph addressed developers' challenges identified in our preliminary study while imposing lower cognitive load. Participants also found the graph-based representation supported safe exploration, efficient iteration, and reflection on AI-generated changes. Our work highlights design opportunities for tools to help developers make sense of and act on their problem-solving progress in the emerging AI-mediated programming context.
Authors:Weiyan Shi, Dorien Herremans, Kenny Tsu Wei Choo
Abstract:
Early-stage design ideation often relies on rough sketches created under time pressure, leaving much of the designer's intent implicit. In practice, designers frequently speak while sketching, verbally articulating functional goals and ideas that are difficult to express visually. We introduce TalkSketchD, a sketch-while-speaking dataset that captures spontaneous speech temporally aligned with freehand sketches during early-stage toaster ideation. To examine the dataset's value, we conduct a sketch-to-image generation study comparing sketch-only inputs with sketches augmented by concurrent speech transcripts using multimodal large language models (MLLMs). Generated images are evaluated against designers' self-reported intent using a reasoning MLLM as a judge. Quantitative results show that incorporating spontaneous speech significantly improves judged intent alignment of generated design images across form, function, experience, and overall intent. These findings demonstrate that temporally aligned sketch-and-speech data can enhance MLLMs' ability to interpret user intent in early-stage design ideation.
Authors:Lixiang Yan, Dragan Gašević
Abstract:
Learning theories have historically changed when the conditions of learning evolved. Generative and agentic AI create a new condition by allowing learners to delegate explanation, writing, problem solving, and other cognitive work to systems that can generate, recommend, and sometimes act on the learner's behalf. This creates a fundamental challenge for learning theory: successful performance can no longer be assumed to indicate learning. Learners may complete tasks effectively with AI support while developing less understanding, weaker judgment, and limited transferable capability. We argue that this problem is not fully captured by existing learning theories. Behaviourism, cognitivism, constructivism, and connectivism remain important, but they do not directly explain when AI-assisted performance becomes durable human capability. We propose Agentivism, a learning theory for human-AI interaction. Agentivism defines learning as durable growth in human capability through selective delegation to AI, epistemic monitoring and verification of AI contributions, reconstructive internalization of AI-assisted outputs, and transfer under reduced support. The importance of Agentivism lies in explaining how learning remains possible when intelligent delegation is easy and human-AI interaction is becoming a persistent and expanding part of human learning.
Authors:Priyan Vaithilingam, Elena L. Glassman, Nathalie Henry Riche, Gonzalo Ramos, Jeevana Priya Inala, Chenglong Wang
Abstract:
People working with data often move their data across multiple applications, because they rely on these apps' complementing user experiences to best complete their tasks. Since traditional copy-and-paste approaches do not accommodate diverse table representations adopted by different apps, users spend considerable effort to reconstruct data formats and visual representations, making cross-app workflows costly. For example, when transferring a spreadsheet table with conditional formatting to a markup document, users spend substantial time translating its structure into appropriate tags and manually reformat color. This paper introduces MagicCopy, an AI-powered cross-app copy-and-paste, leveraging source and target contexts and user-specified instructions in natural language to automatically extract, parse, transform, and (re)format data from one app to another. In a study with sixteen participants, users quickly learned and applied MagicCopy to move data across three pairs of tools. Participants further explored diverse applications of MagicCopy to support more streamlined crossed-application interaction in their workflows.
Authors:Alexander Loth, Martin Kappes, Marc-Oliver Pahl
Abstract:
Can humans tell whether a news article was written by a person or a large language model (LLM)? We investigate this question using JudgeGPT, a study platform that independently measures source attribution (human vs. machine) and authenticity judgment (legitimate vs. fake) on continuous scales. From 2,318 judgments collected from 1,054 participants across content generated by six LLMs, we report five findings: (1) participants cannot reliably distinguish machine-generated from human-written text (p > .05, Welch's t-test); (2) this inability holds across all tested models, including open-weight models with as few as 7B parameters; (3) self-reported domain expertise predicts judgment accuracy (r = .35, p < .001) whereas political orientation does not (r = -.10, n.s.); (4) clustering reveals distinct response strategies ("Skeptics" vs. "Believers"); and (5) accuracy degrades after approximately 30 sequential evaluations due to cognitive fatigue. The answer, in short, is no: humans cannot reliably tell. These results indicate that user-side detection is not a viable defense and motivate system-level countermeasures such as cryptographic content provenance.
Authors:Bijean Ghafouri, Eun Cheol Choi, Priyanka Dey, Emilio Ferrara
Abstract:
RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to when it holds and when it breaks down. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We argue that the ML community must treat measurement validity as logically prior to preference aggregation. Specifically, we contend that measuring human preferences in RLHF is a social science problem. We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each. This framework has two important implications. First, it raises the question of whether current RLHF practice may be systematically modeling noise as signal and elicitation artifacts as human values. Second, it provides a path forward by suggesting diagnostic tools that can distinguish valid preferences from artifacts before they enter the training pipeline.
Authors:Yunge Wen, Awu Chen, Jianing Yu, Jas Brooks, Hiroshi Ishii, Paul Pu Liang
Abstract:
Smell's deep connection with food, memory, and social experience has long motivated researchers to bring olfaction into interactive systems. Yet most olfactory interfaces remain limited to fixed scent cartridges and pre-defined generation patterns, and the scarcity of large-scale olfactory datasets has further constrained AI-based approaches. We present AromaGen, an AI-powered wearable interface capable of real-time, general-purpose aroma generation from free-form text or visual inputs. AromaGen is powered by a multimodal LLM that leverages latent olfactory knowledge to map semantic inputs to structured mixtures of 12 carefully selected base odorants, released through a neck-worn dispenser. Users can iteratively refine generated aromas through natural language feedback via in-context learning. Through a controlled user study ($N = 26$), AromaGen matches human-composed mixtures in zero-shot generation and significantly surpasses them after iterative refinement, achieving a median similarity of 8/10 to real food aromas and reducing perceived artificiality to levels comparable to real food. AromaGen is a step towards real-world interactive aroma generation, opening new possibilities for communication, wellbeing, and immersive technologies.
Authors:Rui Chen, Firman Isma Serdana, Domenico Chiaradia, Xianlong Mai, Elena Losanno, Gabriele Righi, Claudia De Santis, Federica Serra, Vincent Mendez, Cristian Camardella, Daniele Leonardis, Giulio Del Popolo, Silvestro Micera, Antonio Frisoli
Abstract:
Hand impairment following neurological disorders substantially limits independence in activities of daily living, motivating the development of effective assistive and rehabilitation strategies. Soft robotic gloves have attracted growing interest in this context, yet persistent challenges in customization, ergonomic fit, and flexion-extension actuation constrain their clinical utility. Here, we present a dual-action fabric-based soft robotic glove incorporating customized actuators aligned with individual finger joints. The glove comprises five independently controlled dual-action actuators supporting finger flexion and extension, together with a dedicated thumb abduction actuator. Leveraging computer numerical control heat sealing technology, we fabricated symmetrical-chamber actuators that adopt a concave outer surface upon inflation, thereby maximizing finger contact area and improving comfort. Systematic characterization confirmed that the actuators generate sufficient joint moment and fingertip force for ADL-relevant tasks, and that the complete glove system produces adequate grasping force for common household objects. A preliminary study with ten healthy subjects demonstrated that active glove assistance significantly reduces forearm muscle activity during object manipulation. A pilot feasibility study with three individuals with cervical spinal cord injury across seven functional tasks indicated that glove assistance promotes more natural grasp patterns and reduces reliance on tenodesis grasp, although at the cost of increased task completion time attributable to the current actuation interface. This customizable, ergonomic design represents a practical step toward personalized hand rehabilitation and assistive robotics.
Authors:Rui Chen, Xianlong Mai, Alireza Sanaei, Domenico Chiaradia, Antonio Frisoli, Daniele Leonardis
Abstract:
Object manipulation is fundamental to virtual reality (VR) applications, yet conventional fingertip haptic devices fail to render certain tactile features relevant for immersive and precise interactions, as i.e. detection of edges. This paper presents a compact, lightweight fingertip haptic device (24.3 g) that delivers distinguishable surface and edge contact feedback through a novel dual-motor mechanism. Pressure distribution characterization using a 6 x 6 flexible sensor array demonstrates distinct contact patterns between the two stimulation modes. A preliminary user study with five participants achieved 93% average classification accuracy across four conditions (edge/surface contact with light/heavy pressure), with mean response times of 2.79 seconds. The results indicate that the proposed device can effectively convey edge and surface tactile cues, potentially enhancing object manipulation fidelity in VR environments.
Authors:Ut Gong, Yibo Meng, Qihan Zhang, Xin Chen, Yan Guan
Abstract:
Relationship-centered care relies on trust and meaningful connection. As AI enters clinical settings, we must ask not just what it can do, but how it should be positioned to support these values. We examine a "middle, not top" approach where AI mediates communication without usurping human judgment. Through studies of CLEAR, an asynchronous messaging system, we show how this configuration addresses real-world constraints like time pressure and uneven health literacy. We find that mediator affordances (e.g., availability, neutrality) redistribute interpretive work and reduce relational friction. Ultimately, we frame AI mediation as relational infrastructure, highlighting critical design tensions around framing power and privacy.
Authors:Qianru Lyu, Conrad Borchers, Meng Xia, Karen Xiao, Paulo F. Carvalho, Kenneth R. Koedinger, Vincent Aleven
Abstract:
Past research has defined a general process for the data-driven redesign of educational technologies and has shown that in carefully-selected instances, this process can help make systems more effective. In the current work, we test the generality of the approach by applying it to four units of a middle-school mathematics intelligent tutoring system that were selected not based on suitability for redesign, as in previous work, but on topic. We tested whether the redesigned system was more effective than the original in a classroom study with 123 students. Although the learning gains did not differ between the conditions, students who used the Redesigned Tutor had more productive time-on-task, a larger number of skills practiced, and greater total knowledge mastery. The findings highlight the promise of data-driven redesign even when applied to instructional units *not* selected as likely to yield improvement, as evidence of the generality and wide applicability of the method.
Authors:Jérémy Barghorn, Anna Sotnikova, Sacha Friedli, Antoine Bosselut
Abstract:
Large-enrollment university courses face persistent challenges in providing timely and scalable instructional support. While generative AI holds promise, its effective use depends on reliability and pedagogical alignment. We present a human-centered case study of AI-assisted support in a Calculus I course, implemented in close collaboration with the course instructor. We developed a system to answer students' questions on a discussion forum, fine-tuning a lightweight language model on 2,588 historical student-instructor interactions. The model achieved 75.3% accuracy on a benchmark of 150 representative questions annotated by five instructors, and in 36% of cases, its responses were rated equal to or better than instructor answers. Post-deployment student survey (N = 105) indicated that students valued the alignment of the responses with the course materials and their immediate availability, while still relying on the instructor verification for trust. We highlight the importance of hybrid human-AI workflows for safe and effective course support.
Authors:Venkatesh Sivaraman, Patrick Vossler, Adam Perer, Julian Hong, Jean Feng
Abstract:
Generative artificial intelligence (AI) tools can now help people perform complex data science tasks regardless of their expertise. While these tools have great potential to help more people work with data, their end-to-end approach does not support users in evaluating alternative approaches and reformulating problems, both critical to solving open-ended tasks in high-stakes domains. In this paper, we reflect on two AI data science systems designed for the medical setting and how they function as tools for thought. We find that success in these systems was driven by constructing AI workflows around intentionally-designed intermediate artifacts, such as readable query languages, concept definitions, or input-output examples. Despite opaqueness in other parts of the AI process, these intermediates helped users reason about important analytical choices, refine their initial questions, and contribute their unique knowledge. We invite the HCI community to consider when and how intermediate artifacts should be designed to promote effective data science thinking.
Authors:Lynn Janzen, Üveys Eroglu, Dorothea Kolossa, Pia Knöferle, Sebastian Möller, Vera Schmitt, Veronika Solopova
Abstract:
LLMs are increasingly embedded in programming workflows, from code generation to automated code review. Yet, how gendered communication styles interact with LLM-assisted programming and code review remains underexplored. We present a mixed-methods pilot study examining whether gender-related linguistic differences in prompts influence code generation outcomes and code review decisions. Across three complementary studies, we analyze (i) collected real-world coding prompts, (ii) a controlled user study, in which developers solve identical programming tasks with LLM assistance, and (iii) an LLM-based simulated evaluation framework that systematically varies gender-coded prompt styles and reviewer personas. We find that gender-related differences in prompting style are subtle but measurable, with female-authored prompts exhibiting more indirect and involved language, which does not translate into consistent gaps in functional correctness or static code quality. For LLM code review, in contrast, we observe systematic biases: on average, models approve female-authored code more, despite comparable quality. Controlled experiments show that gender-coded prompt style affect code length and maintainability, while reviewer behavior varies across models. Our findings suggest that fairness risks in LLM-assisted programming arise less from generation accuracy than from LLM evaluation, as LLMs are increasingly deployed as automated code reviewers.
Authors:Kaijie Xu, Yiwei Zhang, Brian Yang, Clark Verbrugge
Abstract:
Open-world missions often rely on repeated formulas, yet designers lack systematic ways to examine pacing, variation, and experiential balance across large portfolios. We introduce the Mission Action Quality Vector (MAQV), a six-dimensional framework-covering combat, exploration, narrative, emotion, problem-solving, and uniqueness-paired with an action block grammar representing missions as gameplay sequences. Using about 2200 missions from 20 AAA titles, we apply LLM-assisted parsing to convert community walkthroughs into structured action sequences and score them with MAQV. An interactive dashboard enables designers to reveal underlying mission formulas. In a mixed-methods study with experienced players and designers, we validate the pipeline's fidelity and the tool's usability, and use thematic analysis to identify recurring design trade-offs, pacing grammars, and systematic differences by quest type and franchise evolution. Our work offers a reproducible analytical workflow, a data-driven visualization tool, and reflective insights to support more balanced, varied mission design at scale.
Authors:Zhuchenyang Liu, Yao Zhang, Yalan He, Hilla Paasio, Changyi Li, Guna Semjonova, Yu Xiao
Abstract:
Flex sensors are widely used in e-textiles for detecting joint motions and, subsequently, full-body movements. A critical initial step in utilizing these sensors is determining the optimal placement on the body to accurately capture human motions. This task requires a combination of expertise in fields such as anatomy, biomechanics, and textile design, which is seldom found in a single practitioner. Generative AI, such as Large Language Models (LLMs), has recently shown promise in facilitating design. However, to our knowledge, the extent to which LLMs can aid in the e-textile design process remains largely unexplored in the literature. To address this open question, we conducted a case study focusing on shoulder motion detection using flex sensors. We enlisted three human designers to participate in an experiment involving human-AI collaborative design. We examined design efficiency across three scenarios: designs produced by LLMs alone, by humans alone, and through collaboration between LLMs and human designers. Our quantitative and qualitative analyses revealed an intriguing relationship between expertise and outcomes: the least experienced human designer achieved continuous improvement through collaboration, ultimately matching the best performance achieved by humans alone, whereas the most experienced human designer experienced a decline in performance. Additionally, the effectiveness of human-AI collaboration is affected by the granularity of feedback - incremental adjustments outperformed sweeping redesigns - and the level of abstraction, with observation-oriented feedback producing better outcomes than prescriptive anatomical directives. These findings offer valuable insights into the opportunities and challenges associated with human-AI collaborative e-textile design.
Authors:Haneen Fatima, Muhammad Ali Imran, Ahmad Taha, Lina Mohjazi
Abstract:
The Internet of Mirrors (IoM) is an emerging IoT ecosystem of interconnected smart mirrors designed to deliver personalised services across a three-tier node hierarchy spanning consumer, professional, and hub nodes. Determining where computation should reside within this hierarchy is a critical design challenge, as placement decisions directly affect end-to-end latency, resource utilisation, and user experience. This paper presents the first physical IoM testbed study, evaluating four computational placement strategies across the IoM tier hierarchy under real Wi-Fi and 5G network conditions. Results show that offloading classification to higher-tier nodes substantially reduces latency and consumer resource load, but introduces network overhead that scales with payload size and hop count. No single strategy is universally optimal: the best choice depends on available network, node proximity, and concurrent user load. These findings empirically characterise the computation-communication trade-off space of the IoM and motivate the need for intelligent, adaptive task placement responsive to application requirements and live ecosystem conditions.
Authors:Weiyan Shi, Kenny Tsu Wei Choo
Abstract:
In early developmental contexts, particularly in parent-child interaction analysis, alignment involves families and professionals such as speech-language pathologists (SLPs) who interpret children's everyday interactions from different roles. When multimodal large language models (MLLMs) are introduced to support this process, alignment becomes a question of how authority, responsibility, and emotional risk are distributed across stakeholders. Through a three-part study with five families and three SLPs, we trace how MLLM-generated outputs move from expert-facing analysis to parent-facing feedback. We propose layered community alignment: grounding representations in expert-aligned structures, mediating translation through professional guardrails, and enabling family-level adaptation within those boundaries. We argue that alignment in developmental settings should be treated as a community-governed process rather than an individual optimisation problem.
Authors:Ruixuan Sun, Matthew Zent, Minzhu Zhao, Thanmayee Boyapati, Xinyi Li, Joseph A. Konstan
Abstract:
In this study, we applied the ``personalized diversity nudge framework'' with the goal of expanding user reading coverage in terms of news locality (i.e., domestic and world news). We designed a novel topic-locality dual calibration algorithmic nudge and a large language model-based news personalization presentation nudge, then launched a 5-week real-user study with 120 U.S. news readers on the news recommendation experiment platform POPROX. With user interaction logs and survey responses, we found that algorithmic nudges can successfully increase exposure and consumption diversity, while the impact of LLM-based presentation nudges varied. User-level topic interest is a strong predictor of user clicks, while highlighting the relevance of news articles to prior read articles outperforms generic topic-based and no personalization. We also demonstrate that longitudinal exposure to calibrated news may shift readers' reading habits to value a balanced news digest from both domestic and world articles. Our results provide direction for future work on nudging for diverse consumption in news recommendation systems.
Authors:Zheyuan Kuang, Weiwei Jiang, Nicholas Koemel, Matthew Ahmadi, Emmanuel Stamatakis, Benjamin Tag, Anusha Withana, Zhanna Sarsenbayeva
Abstract:
Multimodal Emotion Recognition (MER) increasingly depends on fine grained, evidence grounded annotations, yet inspection and label construction are hard to scale when cues are dynamic and misaligned across modalities. We present an LLM-assisted toolkit that supports multimodal emotion data annotation through an inspectable, event centered workflow. The toolkit preprocesses and aligns heterogeneous recordings, visualizes all modalities on an interactive shared timeline, and renders structured signals as video tracks for cross modal consistency checks. It then detects candidate events and packages synchronized keyframes and time windows as event packets with traceable pointers to the source data. Finally, the toolkit integrates an LLM with modality specific tools and prompt templates to draft structured annotations for analyst verification and editing. We demonstrate the workflow on multimodal VR emotion recordings with representative examples.
Authors:Zheyuan Kuang, Tinghui Li, Weiwei Jiang, Sven Mayer, Flora Salim, Benjamin Tag, Anusha Withana, Zhanna Sarsenbayeva
Abstract:
Virtual reality has been effectively used for eliciting emotions, yet most research focuses on the intensity of affective responses rather than on how interaction influences those experiences. To address this gap, we advance a validated VR emotion-elicitation dataset through two key extensions. First, we add a new high-arousal, high-valence scene and validate its effectiveness in a within-subject study (N=24). Second, we incorporate interactive elements into each scene, creating both interactive and non-interactive versions to examine the impact of interaction on emotional responses. We evaluate interaction through a multimodal approach combining subjective ratings and physiological signals to capture both conscious and unconscious affective responses. Our evaluation study (N=84) shows that interaction not only amplifies emotions but modulates them in context, supporting coping in negative scenes and enhancing enjoyment in positive scenes. These findings highlight the potential of scene-tailored interaction for different applications, where regulating emotions is as important as eliciting them.
Authors:Hayato Saiki, Chunggi Lee, Hikari Takahashi, Tica Lin, Hidetada Kishi, Kaori Tachibana, Yasuhiro Suzuki, Hanspeter Pfister, Kenji Suzuki
Abstract:
Training resources for parasports are limited, reducing opportunities for athletes and coaches to engage with sport-specific movements and tactical coordination. To address this gap, we developed BRIDGE, a system that integrates a reconstruction pipeline, which detects and tracks players from broadcast video to generate 3D play sequences, with an embodiment-aware visualization framework that decomposes head, trunk, and wheelchair base orientations to represent attention, intent, and mobility. We evaluated BRIDGE in two controlled studies with 20 participants (10 national wheelchair basketball team players and 10 amateur players). The results showed that BRIDGE significantly enhanced the perceived naturalness of player postures and made tactical intentions easier to understand. In addition, it supported functional classification by realistically conveying players' capabilities, which in turn improved participants' sense of self-efficacy. This work advances inclusive sports learning and accessible coaching practices, contributing to more equitable access to tactical resources in parasports.
Authors:Tianqi Song, Black Sun, Jingshu Li, Han Li, Chi-Lan Yang, Yijia Xu, Yi-Chieh Lee
Abstract:
AI-generated influencers are rapidly gaining popularity on Chinese short-video platforms, often adopting kinship-based roles such as AI grandchildren to attract older adults. Although this trend has raised public concern, little is known about the design strategies behind these influencers, how older adults experience them, and the benefits and risks involved. In this study, we combined social media analysis with interviews to unpack the above questions. Our findings show that influencers use both visual and conversational cues to enact kinship roles, prompting audiences to engage in kinship-based role-play. Interviews further show that these cues arouse emotional resonance, help fulfill older adults' informational and emotional needs, while also raising concerns about emotional displacement and unequal emotional investment. We highlight the complex relationship between virtual avatars and real family ties, shaped by broader sociocultural norms, and discuss how AI might strengthen social support for older adults while mitigating risks within cultural contexts.
Authors:Weiyan Shi, Kenny Tsu Wei Choo
Abstract:
As multimodal large language models (MLLMs) are increasingly integrated into early-stage design tools, it is important to understand how designers collaborate with AI during ideation. In a user study with 12 participants, we analysed sketch-based design interactions with an MLLM-powered system using automatically recorded interaction logs and post-task interviews. Based on how creative responsibility was allocated between humans and the AI, we predefined four interaction modes: Human-Only, Human-Lead, AI-Lead, and Co-Evolution, and analysed how these modes manifested during sketch-based design ideation. Our results show that designers rarely rely on a single mode; instead, human-led and AI-led roles are frequently interwoven and shift across ideation instances. These findings provide an empirical basis for future work to investigate why designers shift roles with AI and how interactive systems can better support such dynamic collaboration.
Authors:Chunggi Lee, Hayato Saiki, Tica Lin, Eiji Ikeda, Kenji Suzuki, Chen Zhu-Tian, Hanspeter Pfister
Abstract:
We present ViSTAR, a Virtual Skill Training system in AR that supports self-guided basketball skill practice, with feedback on balance, posture, and timing. From a formative study with basketball players and coaches, the system addresses three challenges: understanding skills, identifying errors, and correcting mistakes. ViSTAR follows the Behavioral Skills Training (BST) framework-instruction, modeling, rehearsal, and feedback. It provides feedback through visual overlays, rhythm and timing cues, and an AI-powered coaching agent using 3D motion reconstruction. We generate verbal feedback by analyzing spatio-temporal joint data and mapping features to natural-language coaching cues via a Large Language Model (LLM). A key novelty is this feedback generation: motion features become concise coaching insights. In two studies (N=16), participants generally preferred our AI-generated feedback to coach feedback and reported that ViSTAR helped them notice posture and balance issues and refine movements beyond self-observation.
Authors:Yuting Deng, Melanie Brucks, Olivier Toubia
Abstract:
Ideas generated by independent samples of humans tend to be more diverse than ideas generated from independent LLM samples, raising concerns that widespread reliance on LLMs could homogenize ideation and undermine innovation at a societal level. Drawing on cognitive psychology, we identify (both theoretically and empirically) two mechanisms undermining LLM idea diversity. First, at the individual level, LLMs exhibit fixation just as humans do, where early outputs constrain subsequent ideation. Second, at the collective level, LLMs aggregate knowledge into a unified distribution rather than exhibiting the knowledge partitioning inherent to human populations, where each person occupies a distinct region of the knowledge space. Through four studies, we demonstrate that targeted prompting interventions can address each mechanism independently: Chain-of-Thought (CoT) prompting reduces fixation by encouraging structured reasoning (only in LLMs, not humans), while ordinary personas (versus "creative entrepreneurs" such as Steve Jobs) improve knowledge partitioning by serving as diverse sampling cues, anchoring generation in distinct regions of the semantic space. Combining both approaches produces the highest idea diversity, outperforming humans. These findings offer a theoretically grounded framework for understanding LLM idea diversity and practical strategies for human-AI collaborations that leverage AI's efficiency without compromising the diversity essential to a healthy innovation ecosystem.
Authors:Yibo Meng, Bingyi Liu, Ruiqi Chen, Yan Guan
Abstract:
Attention Deficit Hyperactivity Disorder (ADHD) remains highly stigmatized in many cultural contexts, particularly in China, where ADHD-related behaviors are often moralized rather than understood as neurodevelopmental differences. As a result, challenges of self-perception, social misunderstanding, and collaboration between ADHD and non-ADHD individuals remain largely unaddressed. We present Misty Forest, a VR-based collaborative game that explores ADHD through asymmetric co-play. The system translates empirically grounded ADHD behavioral patterns -- such as fluctuating attention and time blindness -- into complementary roles that require mutual coordination between players. Rather than compensating for deficits, the design treats cognitive differences as a source of interdependence. In a controlled study with mixed ADHD--non-ADHD dyads, Misty Forest led to higher task completion, increased self-acceptance among ADHD participants, improved ADHD knowledge, and greater empathy among non-ADHD players. These findings suggest that neurodiversity-centered interactive design can foster understanding, reciprocity, and inclusive collaboration.
Authors:Yibo Meng, Bingyi Liu, Ruiqi Chen, Xin Chen, Yan Guan
Abstract:
Experiences of being misunderstood often stem not from a lack of voice, but from mismatches between how individuals express themselves and how others listen. Such communicative mismatches arise across many social settings, including situations involving linguistic and cultural displacement. While prior HCI research has explored empathy through virtual reality, many approaches rely on narrative explanation, positioning users as observers rather than embodied participants. We present 52-Hz Whale Song, an embodied VR experience that explores miscommunication through metaphor and perspective-shifting. Inspired by the real-world "52-Hz whale," whose calls are not responded to by others, the experience uses this phenomenon as an experiential lens on communicative mismatch rather than representing any specific social group. Players progress through a three-act arc that moves from failed communication to agency and ultimately to mediation. A preliminary mixed-methods study (N = 30) suggests increased perspective-taking and reduced self-reported social distance in immigrant-related situations. This work highlights how embodied metaphor and role-shifting can support empathic engagement and offers transferable design insights for empathy-oriented interactive systems.
Authors:Sebastian Hubenschmid, Arvind Srinivasan, Niklas Elmqvist, Dieter Schmalstieg, Michael Sedlmair
Abstract:
Augmented reality has great potential for embedding data visualizations in the world around the user. While this can enhance users' understanding of their surroundings, it also bears the risk of overwhelming their senses with a barrage of information. In contrast, calm technologies aim to place information in the user's attentional periphery, minimizing cognitive load instead of demanding focused engagement. In this column, we explore how visualizations can be harmoniously integrated into our everyday life through augmented reality, progressing from visual analytics to ambient analytics.
Authors:Varun Shiri, Charles Liu, Keyu Yao, Jin L. C. Guo, Jinghui Cheng
Abstract:
Despite having growing awareness and concerns about privacy, technology users are often insufficiently informed of the data practices of various digital products to protect themselves. Privacy policies and privacy labels, as two conventional ways of communicating data practices, are each criticized for important limitations -- one being lengthy and filled with legal jargon, and the other oversimplified and inaccurate -- causing users significant difficulty in understanding the privacy practices of the products and assessing their impact. To mitigate those issues, we explore ways to enhance privacy labels with the relevant content in complementary sources, including privacy policy, app reviews, and community-curated privacy assessments. Our user study results indicate that perceived usefulness and trust on those information sources are personal and influenced by past experience. Our work highlights the importance of considering various information needs for privacy practice and consolidating different sources for more useful privacy solutions.
Authors:Hyoungwook Jin, Minju Yoo, Jieun Han, Zixin Chen, So-Yeon Ahn, Xu Wang
Abstract:
Generative AI chatbots enable personalized problem-solving, but effective learning requires students to self-regulate both how they seek help and how they use AI-generated responses. Considering engagement modes across these two actions reveals nuanced reliance patterns: for example, a student may actively engage in help-seeking by clearly specifying areas of need, yet engage passively in response-use by copying AI outputs, or vice versa. However, existing research lacks systematic tools for jointly capturing engagement across help-seeking and response-use, limiting the analysis of such reliance behaviors. We introduce RelianceScope, an analytical framework that characterizes students' reliance on chatbots during problem-solving. RelianceScope (1) operationalizes reliance into nine patterns based on combinations of engagement modes in help-seeking and response-use, and (2) situates these patterns within a knowledge-context lens that accounts for students' prior knowledge and the instructional significance of knowledge components. Rather than prescribing optimal AI use, the framework enables fine-grained analysis of reliance in open-ended student-AI interactions. As an illustrative application, we applied RelianceScope to analyze chat and code-edit logs from 79 college students in a web programming course. Results show that active help-seeking is associated with active response-use, whereas reliance patterns remain similar across knowledge mastery levels. Students often struggled to articulate their knowledge gaps and to adapt AI responses. Using our annotated dataset as a benchmark, we further demonstrate that large language models can reliably detect reliance during help-seeking and response-use. We conclude by discussing the implications of RelianceScope and the design guidelines for AI-supported educational systems.
Authors:Lan Luo, Dongyijie Primo Pan, Junhua Zhu, Muzhi Zhou, Pan Hui
Abstract:
Business plan (BP) writing plays a key role in entrepreneurship education by helping learners construct, evaluate, and iteratively refine their ideas. However, conventional BP writing remains a rigid, linear process that often fails to reflect the dynamic and recursive nature of entrepreneurial ideation. This mismatch is particularly challenging for novice entrepreneurial students, who struggle with the substantial cognitive demands of developing and refining ideas. While reflection and meta-reflection are critical strategies for fostering divergent and convergent thinking, existing writing tools rarely scaffold these higher-order processes. To address this gap, we present the Meflex System, a large language model (LLM)-based writing tool that integrates BP writing scaffolding with a nonlinear idea canvas to support iterative ideation through reflection and meta-reflection. We report findings from an exploratory user study with 30 participants that examined the system's usability and cognitive impact. Results show that Meflex effectively scaffolds BP writing, promotes divergent thinking through LLM-supported reflection, and enhances meta-reflective awareness while reducing cognitive load during complex idea development. These findings highlight the potential of non-linear LLM-based writing tools to foster deeper and coherent entrepreneurial thinking.
Authors:Zhipeng Li, Yi-Chi Liao, Christian Holz
Abstract:
Generative models are increasingly powerful, yet users struggle to guide them through prompts. The generative process is difficult to control and unpredictable, and user instructions may be ambiguous or under-specified. Prior prompt refinement tools heavily rely on human effort, while prompt optimization methods focus on numerical functions and are not designed for human-centered generative tasks, where feedback is better expressed as binary preferences and demands convergence within few iterations. We present APPO, a preference-guided prompt optimization algorithm. Instead of iterating prompts, users only provide binary preferential feedback. APPO adaptively balances its strategies between exploiting user feedback and exploring new directions, yielding effective and efficient optimization. We evaluate APPO on image generation, and the results show APPO enables achieving satisfactory outcomes in fewer iterations with lower cognitive load than manual prompt editing. We anticipate APPO will advance human-AI collaboration in generative tasks by leveraging user preferences to guide complex content creation.
Authors:Zhipeng Li, Christoph Gebhardt, Yi-Chi Liao, Christian Holz
Abstract:
We present AutoOptimization, a novel multi-objective optimization framework for adapting user interfaces. From a user's verbal preferences for changing a UI, our framework guides a prioritization-based Pareto frontier search over candidate layouts. It selects suitable objective functions for UI placement while simultaneously parameterizing them according to the user's instructions to define the optimization problem. A solver then generates a series of optimal UI layouts, which our framework validates against the user's instructions to adapt the UI with the final solution. Our approach thus overcomes the previous need for manual inspection of layouts and the use of population averages for objective parameters. We integrate multiple agents sequentially within our framework, enabling the system to leverage their reasoning capabilities to interpret user preferences, configure the optimization problem, and validate optimization outcomes.
Authors:You Zhou, Bingyuan Wang, Hongcheng Guo, Rui Cao, Zeyu Wang
Abstract:
Chinese literati gatherings (Wenren Yaji), as a situated form of Chinese traditional culture, remain underexplored in depth. Although generative AI supports powerful multimodal generation, current cultural applications largely emphasize aesthetic reproduction and struggle to convey the deeper meanings of cultural rituals and social frameworks. Based on embodied cognition, we propose an AI-driven dual-path framework for cultural understanding, which we instantiate through GatheringSense, a literati-gathering experience. We conduct a mixed-methods study (N=48) to compare how AI-generated multimodal content and embodied participation complement each other in supporting the understanding of literati gatherings and fostering cultural resonance. Our results show that AI-generated content effectively improves the readability of cultural symbols and initial emotional attraction, yet limitations in physical coherence and micro-level credibility may affect users' satisfaction. In contrast, embodied experience significantly deepens participants' understanding of ritual rules and social roles, and increases their psychological closeness and presence. Based on these findings, we offer empirical evidence and five transferable design implications for generative experience in cultural heritage.
Authors:Dongyijie Primo Pan, Shuyue Li, Yawei Zhao, Junkun Long, Hao Li, Pan Hui
Abstract:
Large-scale outdoor mixed reality (MR) art exhibitions distribute curated virtual works across open public spaces, but interpretation rarely scales without turning exploration into a scripted tour. Through Research-through-Design, we created Dream-Butterfly, an in-situ conversational AI docent embodied as a small non-human companion that visitors summon for multilingual, exhibition-grounded explanations. We deployed Dream-Butterfly in a large-scale outdoor MR exhibition at a public university campus in southern China, and conducted an in-the-wild between-subject study (N=24) comparing a primarily human-led tour with an AI-led tour while keeping staff for safety in both conditions. Combining questionnaires and semi-structured interviews, we characterize how shifting the primary explanation channel reshapes explanation access, perceived responsiveness, immersion, and workload, and how visitors negotiate responsibility handoffs among staff, the AI guide, and themselves. We distill transferable design implications for configuring mixed human-AI guiding roles and embodying conversational agents in mobile, safety-constrained outdoor MR exhibitions.
Authors:Casey Ford, Madison Van Doren, Emily Dix
Abstract:
Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni) yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
Authors:Ziyi Xuan, Yiwen Wu, Zhaoyang Yan, Vinod Namboodiri, Yu Yang
Abstract:
Smart assistants increasingly act proactively, yet mistimed or intrusive behavior often causes users to lose trust and disable these features. Learning user preferences for proactive assistance is difficult because real-world studies are costly, limited in scale, and rarely capture how preferences change across multiple interaction sessions. Large language model based generative agents offer a way to simulate realistic interactions, but existing synthetic datasets remain limited in temporal depth, diverse personas, and multi-dimensional preferences. They also provide little support for transferring population-level insights to individual users under on-device constraints. We present a population-to-individual learning framework for preference-aligned proactive assistants that operates under on-device and privacy constraints. Our approach uses large-scale interaction simulation with 1,000 diverse personas to learn shared structure in how users express preferences across recurring dimensions such as timing, autonomy, and communication style, providing a strong cold start without relying on real user logs. The assistant then adapts to individual users on device through lightweight activation-based steering driven by simple interaction feedback, without model retraining or cloud-side updates. We evaluate the framework using controlled simulations with 1,000 simulated personas and a human-subject study with 30 participants. Results show improved timing decisions and perceived interaction quality over untuned and direct-response baselines, while on-device activation steering achieves performance comparable to reinforcement learning from human feedback. Participants also report higher satisfaction, trust, and comfort as the assistant adapts over multiple sessions of interactions.
Authors:Nicolás E. Díaz Ferreyra, Moritz Mock, Max Kretschmann, Barbara Russo, Mojtaba Shahin, Mansooreh Zahedi, Riccardo Scandariato
Abstract:
Static Analysis Tools (SATs) are central to security engineering activities, as they enable early identification of code weaknesses without requiring execution. However, their effectiveness is often limited by high false-positive rates and incomplete coverage of vulnerability classes. At the same time, developers frequently document security-related shortcuts and compromises as Self-Admitted Technical Debt (SATD) in software artifacts, such as code comments. While prior work has recognized SATD as a rich source of security information, it remains unclear whether -and in what ways- it is utilized during SAT-aided security analysis. OBJECTIVE: This work investigates the extent to which security-related SATD complements the output produced by SATs and helps bridge some of their well-known limitations. METHOD: We followed a mixed-methods approach consisting of (i) the analysis of a SATD-annotated vulnerability dataset using three state-of-the-art SATs and (ii) an online survey with 72 security practitioners. RESULTS: The combined use of all SATs flagged 114 of the 135 security-related SATD instances, spanning 24 distinct Common Weakness Enumeration (CWE) identifiers. A manual mapping of the SATD comments revealed 33 unique CWE types, 6 of which correspond to categories that SATs commonly overlook or struggle to detect (e.g., race conditions). Survey responses further suggest that developers frequently pair SAT outputs with SATD insights to better understand the impact and root causes of security weaknesses and to identify suitable fixes. IMPLICATIONS: Our findings show that such SATD-encoded information can be a meaningful complement to SAT-driven security analysis, while helping to overcome some of SATs' practical shortcomings.
Authors:Alexander Loth, Dominique Conceicao Rosario, Peter Ebinger, Martin Kappes, Marc-Oliver Pahl
Abstract:
The proliferation of generative AI poses challenges for information integrity assurance, requiring systems that connect model governance with end-user verification. We present Origin Lens, a privacy-first mobile framework that targets visual disinformation through a layered verification architecture. Unlike server-side detection systems, Origin Lens performs cryptographic image provenance verification and AI detection locally on the device via a Rust/Flutter hybrid architecture. Our system integrates multiple signals - including cryptographic provenance, generative model fingerprints, and optional retrieval-augmented verification - to provide users with graded confidence indicators at the point of consumption. We discuss the framework's alignment with regulatory requirements (EU AI Act, DSA) and its role in verification infrastructure that complements platform-level mechanisms.
Authors:Lei Han, Yi Gao, Xuanchen Lu, Bingyuan Wang, Lujin Zhang, Zeyu Wang, David Yip
Abstract:
The Kaiping Diaolou and Villages, a UNESCO World Heritage Site, exemplify hybrid Chinese and Western architecture shaped by migration culture. However, architectural heritage engagement often faces authenticity debates, resource constraints, and limited participatory approaches. This research explores current challenges of leveraging Artificial Intelligence (AI) for architectural heritage, and how AI-assisted interactive systems can foster cultural heritage understanding and preservation awareness. We conducted a formative study (N=14) to uncover empirical insights from heritage stakeholders that inform design. These insights informed the design of Gen-Diaolou, an integrated AI-assisted interactive system that supports heritage understanding and preservation. A pilot study (N=18) and a museum field study (N=26) provided converging evidence suggesting that Gen-Diaolou may support visitors' diachronic understanding and preservation awareness, and together informed design implications for future human-AI collaborative systems for digital cultural heritage engagement. More broadly, this work bridges the research gap between passive heritage systems and unconstrained creative tools in the HCI domain.
Authors:Yoonsang Kim, Divyansh Pradhan, Devshree Jadeja, Arie Kaufman
Abstract:
We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speechto-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
Authors:Awu Chen, Vera Yu Wu, Yunge Wen, Yaluo Wang, Jiaxuan Olivia Yin, Yichen Wang, Qian Xiang, Richard Zhang, Paul Pu Liang, Hiroshi Ishii
Abstract:
Olfaction plays an important role in human perception, yet its subjective and ephemeral nature makes it difficult to articulate, compare, and share across individuals. Traditional practices like the Japanese incense game Genji-ko offer one way to structure olfactory experience through shared interpretation. In this work, we present Smell with Genji, an AI-mediated olfactory interaction system that reinterprets Genji-ko as a collaborative human-AI sensory experience. By integrating a game setup, a mobile application, and an LLM-powered co-smelling partner equipped with olfactory sensing and LLM-based conversation, the system invites participants to compare scents and construct Genji-mon patterns, fostering reflection through a dialogue that highlights the alignment and discrepancies between human and machine perception. This work illustrates how sensing-enabled AI can participate in olfactory experience alongside users, pointing toward new possibilities for AI-supported sensory interaction and reflection in HCI.
Authors:Yoonsang Kim, Devshree Jadeja, Divyansh Pradhan, Yalong Yang, Arie Kaufman
Abstract:
Speaking aloud to a wearable AR assistant in public can be socially awkward, and re-articulating the same requests every day creates unnecessary effort. We present SpeechLess, a wearable AR assistant that introduces a speech-based intent granularity control paradigm grounded in personalized spatial memory. SpeechLess helps users "speak less," while still obtaining the information they need, and supports gradual explicitation of intent when more complex expression is required. SpeechLess binds prior interactions to multimodal personal context-space, time, activity, and referents-to form spatial memories, and leverages them to extrapolate missing intent dimensions from under-specified user queries. This enables users to dynamically adjust how explicitly they express their informational needs, from full-utterance to micro/zero-utterance interaction. We motivate our design through a week-long formative study using a commercial smart glasses platform, revealing discomfort with public voice use, frustration with repetitive speech, and hardware constraints. Building on these insights, we design SpeechLess, and evaluate it through controlled lab and in-the-wild studies. Our results indicate that regulated speech-based interaction, can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.
Authors:Duan Li, Jun Yuan, Xinyuan Guo, Xiting Wang, Yang Liu, Weikai Yang, Shixia Liu
Abstract:
Circle packing is widely used in visualization due to its aesthetic appeal and simplicity, particularly in tasks where the spatial arrangement and relationships between data are of interest, such as understanding proximity relationships (e.g., images with categories) or analyzing quantitative data (e.g., housing prices). Many applications require preserving neighborhood relationships while encoding a quantitative attribute using radii for data analysis. To meet these two requirements simultaneously, we present a neighborhood-preserving non-uniform circle packing method, NCP. This method preserves neighborhood relationships between the data represented by non-uniform circles to comprehensively analyze similar data and an attribute of interest. We formulate neighborhood-preserving non-uniform circle packing as a planar graph embedding problem based on the circle packing theorem. This formulation leads to a non-convex optimization problem, which can be solved by the continuation method. We conduct a quantitative evaluation and present two use cases to demonstrate that our NCP method can effectively generate non-uniform circle packing results.
Authors:Hongyu Zhou, Chia-An fan, Yihao Dong, Shuto Takashita, Masahiko Inami, Zhanna Sarsenbayeva, Anusha Withana
Abstract:
Wearable supernumerary robotic limbs (SRLs) sit at the intersection of human augmentation and embodied AI, transforming into extensions of the human body. However, their movements within the intimate near-body space raise unresolved challenges for perceived safety, user control, and trust. In this paper, we present results from a Wizard-of-Oz study (n=18), where participants completed near-body collaboration tasks with SRLs to explore these challenges. We collected qualitative data through think-aloud protocols and semi-structured interviews, complemented by physiological signals and post-task ratings. Findings indicate that greater autonomy did not inherently enhance perceived safety or trust. Instead, participants identified near-body zones and paired them with clear coordination rules. They also expressed expectations for how different arm components should behave, shaping preferences around autonomy, perceived safety, and trust. Building on these insights, we introduce SRL Proxemics, a zone- and segment-level design framework showing that autonomy is not monolithic: perceived safety hinges on spatially calibrated, legible behaviors, not higher autonomy.
Authors:Hongyu Zhou, Xincheng Huang, Winston Wijaya, Yi Fei Cheng, David Lindlbauer, Eduardo Velloso, Andrea Bianchi, Zhanna Sarsenbayeva, Anusha Withana
Abstract:
Remote VR teleoperation with supernumerary robotic limbs enables distant users to operate in another's local space. While a shared first-person view aids hand-eye coordination, locking the guest's camera to the host's head can degrade comfort, embodiment, and coordination. Based on a formative study (N=10) using a virtual supernumerary robotic limbs configuration to stress-test coordination, we propose guest-driven perspective switching from a shared first-person baseline (Shared Embodied View) to two alternatives: (a) a stabilized view with guest-controlled rotation (Embedded Anchored View), and (b) a fully decoupled third-person view (Out-of-body View). We ran a user study with 24 pairs (N=48) who switched between the baseline and proposed views as task demands changed. We measured performance, embodiment, fatigue, physiological arousal, and switching behaviors. Our results reveal role-dependent trade-offs: Out-of-body View improves navigation efficiency and reduces errors, while Embedded Anchored View supports embodiment. We conclude with guidelines: use Embedded Anchored View for hand-centric adjustments, Out-of-body View for navigation and object placement, and ensure smooth transitions.
Authors:Danlin Zheng, Xiaoying Wei, Chao Liu, Quanyu Zhang, Jingling Zhang, Shihui Duo, Mingming Fan
Abstract:
Over 100 million retired women in China engage in dance, but their performances are constrained by limited resources and age-related decline. While interactive dance technologies can enhance artistic expression, existing systems are largely inaccessible to non-professional older dancers. This paper explores how interactive dance technologies can be designed with an age-sensitive approach to support retired women in enhancing their stage performance. We conducted two workshops with community-based retired women dancers, employing interactive dance and LLM-powered video generation probes in co-design activities. Findings indicate that age-sensitive adaptations, such as low-barrier keyword input, motion-aligned visual effects, and participatory scaffolds, lowered technical barriers and fostered a sense of authorship. These features enabled retired women to empower their stage, transitioning from passive recipients of stage design to empowered co-creators of performance. We outline design implications for incorporating interactive dance and artificial intelligence-generated content (AIGC) into the cultural practices of retired women, offering broader strategies for age-sensitive creative technologies.
Authors:Venkatesh Sivaraman, Eric P. Mason, Mengfan Ellen Li, Jessica Tong, Andrew J. King, Jeremy M. Kahn, Adam Perer
Abstract:
Artificial intelligence (AI)-based decision support systems can be highly accurate yet still fail to support users or improve decisions. Existing theories of AI-assisted decision-making focus on calibrating reliance on AI advice, leaving it unclear how different system designs might influence the reasoning processes underneath. We address this gap by reconsidering AI interfaces as collections of intelligent reasoning cues: discrete pieces of AI information that can individually influence decision-making. We then explore the roles of eight types of reasoning cues in a high-stakes clinical decision (treating patients with sepsis in intensive care). Through contextual inquiries with six teams and a think-aloud study with 25 physicians, we find that reasoning cues have distinct patterns of influence that can directly inform design. Our results also suggest that reasoning cues should prioritize tasks with high variability and discretion, adapt to ensure compatibility with evolving decision needs, and provide complementary, rigorous insights on complex cases.
Authors:Varun Srivastava, Fan Lei, Alan M. MacEachren, Ross Maciejewski
Abstract:
Thematic maps are widely used to communicate spatial patterns to non-expert audiences. Although uncertainty is inherent in thematic map data, it is rarely visualized, raising questions about how its inclusion affects trust. Prior work offers mixed perspectives: some argue that uncertainty fosters trust through transparency, while others suggest it may reduce trust by introducing confusion. Yet few empirical studies explicitly measure trust in thematic maps. We conducted a between-subjects experiment (N=161) to evaluate how visualizing uncertainty at varying levels (low, medium, high) influences trust. We find that uncertainty visualization generally reduces trust, with greater reductions observed as uncertainty levels increase. However, maps dominated by low uncertainty do not significantly differ in trust from those with no uncertainty. Moreover, while uncertainty visualization tends to make readers question the accuracy of the data, it appears to have a weaker influence on perceptions of the mapmaker's integrity.
Authors:Alexander Loth, Martin Kappes, Marc-Oliver Pahl
Abstract:
As foundation models (FMs) approach human-level fluency, distinguishing synthetic from organic content has become a key challenge for Trustworthy Web Intelligence. This paper presents JudgeGPT and RogueGPT, a dual-axis framework that decouples "authenticity" from "attribution" to investigate the mechanisms of human susceptibility. Analyzing 918 evaluations across five FMs (including GPT-4 and Llama-2), we employ Structural Causal Models (SCMs) as a principal framework for formulating testable causal hypotheses about detection accuracy. Contrary to partisan narratives, we find that political orientation shows a negligible association with detection performance ($r=-0.10$). Instead, "fake news familiarity" emerges as a candidate mediator ($r=0.35$), suggesting that exposure may function as adversarial training for human discriminators. We identify a "fluency trap" where GPT-4 outputs (HumanMachineScore: 0.20) bypass Source Monitoring mechanisms, rendering them indistinguishable from human text. These findings suggest that "pre-bunking" interventions should target cognitive source monitoring rather than demographic segmentation to ensure trustworthy information ecosystems.
Authors:Dániel Szabó, Chi-Lan Yang, Aku Visuri, Jonas Oppenlaender, Bharathi Sekar, Koji Yatani, Simo Hosio
Abstract:
Proliferation of misinformation is a globally acknowledged problem. Cognitive Inoculation helps build resistance to different forms of persuasion, such as misinformation. We investigate Conversational Inoculation, a method to help people build resistance to misinformation through dynamic conversations with a chatbot. We built a Web-based system to implement the method, and conducted a within-subject user experiment to compare it with two traditional inoculation methods. Our results validate Conversational Inoculation as a viable novel method, and show how it was able to enhance participants' resistance to misinformation. A qualitative analysis of the conversations between participants and the chatbot reveal independence and trust as factors that boosted the efficiency of Conversational Inoculation, and friction of interaction as a factor hindering it. We discuss the opportunities and challenges of using Conversational Inoculation to combat misinformation. Our work contributes a timely investigation and a promising research direction in scalable ways to combat misinformation.
Authors:Yoonsang Kim, Swapnil Dey, Arie Kaufman
Abstract:
In time-critical eXtended reality (XR) scenarios where users must rapidly reorient their attention to hazards, alerts, or instructions while engaged in a primary task, spatial audio can provide an immediate directional cue without occupying visual bandwidth. However, such scenarios can afford only a brief auditory exposure, requiring users to interpret sound direction quickly and without extended listening or head-driven refinement. This paper reports a controlled exploratory study of rapid spatial-audio localization in XR. Using HRTF-rendered broadband stimuli presented from a semi-dense set of directions around the listener, we quantify how accurately users can infer coarse direction from brief audio alone. We further examine the effects of short-term visuo-auditory feedback training as a lightweight calibration mechanism. Our findings show that brief spatial cues can convey coarse directional information, and that even short calibration can improve users' perception of aural signals. While these results highlight the potential of spatial audio for rapid attention guidance, they also show that auditory cues alone may not provide sufficient precision for complex or high-stakes tasks, and that spatial audio may be most effective when complemented by other sensory modalities or visual cues, without relying on head-driven refinement. We leverage this study on spatial audio as a preliminary investigation into a first-stage attention-guidance channel for wearable XR (e.g., VR head-mounted displays and AR smart glasses), and provide design insights on stimulus selection and calibration for time-critical use.
Authors:Si Chen, Jingyi Xie, Yao Li, Ya-Fang Lin, He Zhang, Ge Wang, Gaojian Huang, Rui Yu, Ronald Anthony Metoyer, Ting Hua, Nitesh Chawla
Abstract:
Family learning takes place in everyday routines where children and caregivers read, practice, and develop new skills together. Despite growing interest in AI tutors, most existing systems are designed for single learners or classroom settings and do not address the distributed planning, coordination, and execution demands of learning at home. This paper introduces ParPal, a human-centred, LLM-powered system that supports multi-actor family learning by decomposing learning goals into actionable subtasks, allocating them across caregivers under realistic availability and expertise constraints, and providing caregiver-in-the-loop tutoring support with visibility into individual and collective contributions. Through expert evaluation of generated weekly learning plans and a one-week field deployment with 11 families, we identify systematic failure modes in current LLM-based planning, including misalignment with role expertise, unnecessary or costly collaboration, missing pedagogical learning trajectories, and physically or temporally infeasible tasks. While ParPal improves coordination clarity and recognition of caregiving effort, these findings expose fundamental limitations in how current LLMs operationalize pedagogical knowledge, reason about collaboration, and account for real-world, embodied constraints. We discuss implications for human-centred AI design and AI methodology, positioning multi-actor family learning as a critical testbed for advancing planning, adaptation, and pedagogical structure in next-generation AI systems.
Authors:Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel Molden, Gourab Ghoshal, Ehsan Hoque
Abstract:
Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.
Authors:Mehrnoosh Sadat Shirvani, Jackie Crowley, Cher Peng, Jackie Liu, Thomas Chao, Suky Martinez, Laura Brandt, Ig-Jae Kim, Dongwook Yoon
Abstract:
As digital tools increasingly mediate mental health care, self-clone chatbots can offer a uniquely novel approach to intra-personal exploration and self-derived support. Trained to replicate users' conversational patterns, self-clones allow users to talk to themselves through their digital replicas. Despite the promises, these systems may carry risks around identity confusion, negative reinforcement, and blurred user agency. Through interviews with 16 mental health professionals and 6 general users, we aim to uncover tensions and design opportunities in this emerging space to guide responsible self-clone design. Our analysis produces a design framework organized around three priorities: (1) defining goals and grounding the approach in existing therapeutic models, (2) design dimensions including the self-clone persona and user-clone relationship dynamics, and (3) considerations for minimizing potential emotional and ethical harms. This framework contributes an interdisciplinary foundation for designing self-clone chatbots as AI-mediated self-interaction tools that are emotionally and ethically attuned in mental health contexts.
Authors:Alva Markelius, Fethiye Irmak Doğan, Julie Bailey, Guy Laban, Jenny L. Gibson, Hatice Gunes
Abstract:
Institutional and social barriers in higher education often prevent students with disabilities from effectively accessing support, including lengthy procedures, insufficient information, and high social-emotional demands. This study empirically explores how disabled students perceive robot-based support, comparing two interaction roles, one information based (signposting) and one disclosure based (sounding board), and two embodiment types (physical robot/disembodied voice agent). Participants assessed these systems across five dimensions: perceived understanding, social energy demands, information access/clarity, task difficulty, and data privacy concerns. The main findings of the study reveal that the physical robot was perceived as more understanding than the voice-only agent, with embodiment significantly shaping perceptions of sociability, animacy, and privacy. We also analyse differences between disability types. These results provide critical insights into the potential of social robots to mitigate accessibility barriers in higher education, while highlighting ethical, social and technical challenges.
Authors:Yi-Chieh Lee, Junti Zhang, Tianqi Song, Yugin Tan
Abstract:
The integration of Conversational Agents (CAs) into daily life offers opportunities to tackle global challenges, leading to the emergence of Conversational AI for Social Good (CAI4SG). This paper examines the advancements of CAI4SG using a role-based framework that categorizes systems according to their AI autonomy and emotional engagement. This framework emphasizes the importance of considering the role of CAs in social good contexts, such as serving as empathetic supporters in mental health or functioning as assistants for accessibility. Additionally, exploring the deployment of CAs in various roles raises unique challenges, including algorithmic bias, data privacy, and potential socio-technical harms. These issues can differ based on the CA's role and level of engagement. This paper provides an overview of the current landscape, offering a role-based understanding that can guide future research and design aimed at the equitable, ethical, and effective development of CAI4SG.
Authors:Xinyu Li, Kaixun Yang, Jiameng Wei, Yixin Cheng, Dragan Gašević, Guanliang Chen
Abstract:
Information Problem Solving (IPS) is a critical competency for academic and professional success in education, work, and life. The advent of Generative Artificial Intelligence (GenAI), particularly tools like ChatGPT, has introduced new possibilities for supporting students in complex IPS tasks. However, empirical insights into how students engage with GenAI during IPS and how these tools can be effectively leveraged for learning remain limited. Moreover, differences in background, shaped by cultural and socioeconomic factors, pose additional challenges to the equitable integration of GenAI in educational contexts. To address this gap, we present an open-source dataset collected from 279 students at a public Australian university. The dataset was generated through students' use of FLoRA, a GenAI-powered educational platform that widely adopted in the field of learning analytics. Within FLoRA, students interacted with an embedded GenAI chatbot to gather information and synthesize it into data science project proposals. The dataset captures fine-grained, multi-dimensional records of GenAI-assisted IPS processes, including: (i) student-GenAI dialogue transcripts; (ii) writing process log traces; (iii) final project proposals with human-assigned assessment scores; (iv) surveys of biographic and prior knowledge in data science and AI; and (v) surveys capturing students' GenAI experience and perceptions of GenAI's effectiveness in supporting IPS. This dataset provides a valuable resource for advancing our understanding of GenAI's role in educational IPS and informing the design of adaptive, inclusive AI-powered learning tools.
Authors:Yu Yang, Ig-Jae Kim, Dongwook Yoon
Abstract:
AI compliance is becoming increasingly critical as AI systems grow more powerful and pervasive. Yet the rapid expansion of AI policies creates substantial burdens for resource-constrained practitioners lacking policy expertise. Existing approaches typically address one policy at a time, making multi-policy compliance costly. We present PASTA, a scalable compliance tool integrating four innovations: (1) a comprehensive model-card format supporting descriptive inputs across development stages; (2) a policy normalization scheme; (3) an efficient LLM-powered pairwise evaluation engine with cost-saving strategies; and (4) an interface delivering interpretable evaluations via compliance heatmaps and actionable recommendations. Expert evaluation shows PASTA's judgments closely align with human experts ($ρ\geq .626$). The system evaluates five major policies in under two minutes at approximately \$3. A user study (N = 12) confirms practitioners found outputs easy-to-understand and actionable, introducing a novel framework for scalable automated AI governance.
Authors:Nadine Kuo, Agnia Sergeyuk, Valerie Chen, Maliheh Izadi
Abstract:
Current in-IDE AI coding tools typically rely on time-consuming manual prompting and context management, whereas proactive alternatives that anticipate developer needs without explicit invocation remain underexplored. Understanding when humans are receptive to such proactive AI assistance during their daily work remains an open question in human-AI interaction research. We address this gap through a field study of proactive AI assistance in professional developer workflows. We present a five-day in-the-wild study with 15 developers who interacted with a proactive feature of an AI assistant integrated into a production-grade IDE that offers code quality suggestions based on in-IDE developer activity. We examined 229 AI interventions across 5,732 interaction points to understand how proactive suggestions are received across workflow stages, how developers experience them, and their perceived impact. Our findings reveal systematic patterns in human receptivity to proactive suggestions: interventions at workflow boundaries (e.g., post-commit) achieved 52% engagement rates, while mid-task interventions (e.g., on declined edit) were dismissed 62% of the time. Notably, well-timed proactive suggestions required significantly less interpretation time than reactive suggestions (45.4s versus 101.4s, W = 109.00, r = 0.533, p = 0.0016), indicating enhanced cognitive alignment. This study provides actionable implications for designing proactive coding assistants, including how to time interventions, align them with developer context, and strike a balance between AI agency and user control in production IDEs.
Authors:Shakyani Jayasiriwardene, Hongyu Zhou, Weiwei Jiang, Benjamin Tag, Emmanuel Stamatakis, Anusha Withana, Zhanna Sarsenbayeva
Abstract:
Conversational agents are increasingly expected to adapt across contexts and evolve their personalities through interactions, yet most remain static once configured. We present an exploratory study of how user expectations form and evolve when agent personality is made dynamically adjustable. To investigate this, we designed a prototype conversational interface that enabled users to adjust an agent's personality along eight research-grounded dimensions across three task contexts: informational, emotional, and appraisal. We conducted an online mixed-methods study with 60 participants, employing latent profile analysis to characterize personality classes and trajectory analysis to trace evolving patterns of personality adjustment. These approaches revealed distinct personality profiles at initial and final configuration stages, and adjustment trajectories, shaped by context-sensitivity. Participants also valued the autonomy, perceived the agent as more anthropomorphic, and reported greater trust. Our findings highlight the importance of designing conversational agents that adapt alongside their users, advancing more responsive and human-centred AI.
Authors:Shaz Furniturewala, Gerard Christopher Yeo, Kokil Jaidka
Abstract:
Large language models (LLMs) are increasingly used as conversational partners for learning, yet the interactional dynamics supporting users' learning and engagement are understudied. We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in political knowledge and confidence. Mediation analyses reveal that LLM explanatory richness partially supports confidence by fostering users' reflective insight, whereas its effect on knowledge gain operates entirely through users' cognitive engagement. Moderation analyses show that these effects are highly conditional and vary by political efficacy. Confidence gains depend on how high-efficacy users experience and resolve uncertainty. Knowledge gains depend on high-efficacy users' ability to leverage extended interaction, with longer conversations benefiting primarily reflective users. In summary, we find that learning from LLMs is an interactional achievement, not a uniform outcome of better explanations. The findings underscore the importance of aligning LLM explanatory behavior with users' engagement states to support effective learning in designing Human-AI interactive systems.
Authors:Behdokht Kiafar, Mohammad Fahim Abrar, Roghayeh Leila Barmaki
Abstract:
This study examines the impact of feedback on Electroencephalography (EEG) activity and performance during the Reading the Mind in the Eyes Test. In a within-subject design, eleven participants completed the test under Feedback and No-Feedback conditions. Using the principles of Epistemic Network Analysis (ENA) and Ordered Network Analysis (ONA), we extend these network-based models to explore the link between neural dynamics and task outcomes. ENA results showed that feedback is associated with stronger connections between higher frequency EEG bands (Beta and Gamma) and correct responses, while the absence of feedback activated lower frequency bands (Theta and Alpha). ONA further disclosed directional shifts toward higher frequency activity preceding correct answers in the Feedback condition, whereas the No-Feedback condition showed more self-connections in lower bands and a higher occurrence of wrong answers, suggesting less effective reasoning strategies without feedback. Both ENA and ONA revealed statistically significant differences between conditions (p = 0.01, Cohen's d > 2). This study highlights the methodological benefits of integrating EEG with ENA and ONA for network analysis, capturing both temporal and relational dynamics, as well as the practical insight that feedback can foster more effective reasoning processes and improve task performance.
Authors:Ben Carvell, Marc Thomas, Andrew Pace, Christopher Dorney, George De Ath, Richard Everson, Nick Pepper, Adam Keane, Samuel Tomlinson, Richard Cannon
Abstract:
We present a rigorous, human-in-the-loop evaluation framework for assessing the performance of AI agents on the task of Air Traffic Control, grounded in a regulator-certified simulator-based curriculum used for training and testing real-world trainee controllers. By leveraging legally regulated assessments and involving expert human instructors in the evaluation process, our framework enables a more authentic and domain-accurate measurement of AI performance. This work addresses a critical gap in the existing literature: the frequent misalignment between academic representations of Air Traffic Control and the complexities of the actual operational environment. It also lays the foundations for effective future human-machine teaming paradigms by aligning machine performance with human assessment targets.
Authors:Jing Ye, Lu Xiang, Yaping Zhang, Chengqing Zong
Abstract:
Current evaluation paradigms for emotional support conversations tend to reward generic empathetic responses, yet they fail to assess whether the support is genuinely personalized to users' unique psychological profiles and contextual needs. We introduce EmoHarbor, an automated evaluation framework that adopts a User-as-a-Judge paradigm by simulating the user's inner world. EmoHarbor employs a Chain-of-Agent architecture that decomposes users' internal processes into three specialized roles, enabling agents to interact with supporters and complete assessments in a manner similar to human users. We instantiate this benchmark using 100 real-world user profiles that cover a diverse range of personality traits and situations, and define 10 evaluation dimensions of personalized support quality. Comprehensive evaluation of 20 advanced LLMs on EmoHarbor reveals a critical insight: while these models excel at generating empathetic responses, they consistently fail to tailor support to individual user contexts. This finding reframes the central challenge, shifting research focus from merely enhancing generic empathy to developing truly user-aware emotional support. EmoHarbor provides a reproducible and scalable framework to guide the development and evaluation of more nuanced and user-aware emotional support systems.
Authors:Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang
Abstract:
Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (<=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.
Authors:Yiliang Zhou, Yawen Guo, Di Hu, Sairam Sutari, Emilie Chow, Steven Tam, Danielle Perret, Deepti Pandita, Kai Zheng
Abstract:
Ambient AI documentation systems generate clinical note drafts that clinicians frequently revise before signing off into electronic health records, yet how these edits alter hedging language remains unclear. We conducted paired analysis of clinician-edited portions of ambient AI drafts and final notes to examine (1) whether these edits change the prevalence of hedging language, (2) whether these edits exhibit a systematic shift toward greater certainty or uncertainty, and (3) whether these changes in hedging prevalence and directionality differ by ambient AI vendors and clinical specialties. Among 62,811 paired note sections, hedging terms were more often introduced into previously non-hedged text than removed from previously hedged text, and post-edit text contained more hedging mentions than pre-edit text. Directionality analyses showed a significant overall tendency toward greater uncertainty in hedging-related replacement edits. Vendor and specialty analyses revealed substantial heterogeneity in hedging prevalence, pre-to-post changes in hedging mentions, and directionality.
Authors:Mahjabin Nahar, Nafis Irtiza Tripto, Aiping Xiong, Ting-Hao 'Kenneth' Huang, Dongwon Lee
Abstract:
As AI-generated and AI-assisted content floods online spaces, source labels attached to such content can distort human reasoning judgments, with downstream consequences for moderation, evaluation, and decision-making. Whether LLMs share this vulnerability, or offer more source-agnostic evaluation, remains an open question with direct implications for human-AI collaboration. We examine this issue using logical fallacies as a controlled setting to isolate source-label effects on reasoning quality, independent of domain knowledge. We conduct an online study (N=505) where participants are assigned to a source condition (human, AI, human with AI assistance, AI with human assistance, or no disclosure) and evaluate comments containing logical fallacies, comparing their judgments with those of LLMs (GPT-5.2, Gemini 2.5 Flash, Claude Sonnet 4.5), who were evaluated across the same source conditions. Human evaluators were significantly more susceptible to fallacies labeled as written by human or human with AI assistance and assigned higher trust and evaluation ratings in these conditions. LLM evaluations remained comparatively stable across source labels, though performance varied across models. Confidence levels were similarly high across conditions for both humans and LLMs, regardless of fallacy presence. Our findings indicate that source-label bias in reasoning evaluation is primarily a human vulnerability and highlight the potential of human-LLM collaboration in increasingly AI-mediated environments.
Authors:Binglu Wang, Weixin Liang, Jiahui Xue, Yuhui Zhang, Hancheng Cao, Dashun Wang, Yian Yin
Abstract:
Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.
Authors:Yunge Wen, Yuancheng Shen, Paul Pu Liang
Abstract:
We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.
Authors:William Seymour, Adam Jenkins, Mark Cote, Jose Such
Abstract:
LLM-driven conversational AI is beginning to disappear into the background, shifting from something used directly towards something increasingly integrated into existing workflows. In the process, markers of origin and training are smoothed away as LLMs become commodified in the eyes of users. We explore how people approach using a web browser with conversational AI built in, focusing on how they develop their understanding and determine whether to trust its outputs. We conducted a study where 20 participants used the Copilot AI features in Microsoft Edge to conduct information retrieval and planning tasks. Participants relied on a combination of existing perceptions of LLMs and internet search, tracing the effect of beliefs about how Copilot generated answers on prompting strategies. The inclusion of citations increased the trustworthiness of answers without participants feeling the need to be check them, with participants often reaching for the same information sources as the CAI when fact-checking.
Authors:Ankur Kamboj, Rajiv Ranganathan, Xiaobo Tan, Vaibhav Srivastava
Abstract:
Designing effective practice schedules for high-dimensional motor learning tasks remains a challenge, especially when skill states are unobservable and task performance may not reflect the true learning. We propose an automated curriculum design framework that combines a human motor learning model and personalized real-time skill estimation with Stochastic Nonlinear Model Predictive Control in \emph{de-novo} (novel) motor learning paradigms. We validated our framework both through simulations and human-subject studies (N = 36) using a hand exoskeleton. Our proposed approach accelerates skill acquisition by $\sim23\%$, and ${\sim17\%}$ when compared to a random curriculum and a performance heuristics-based curriculum, respectively. These significant gains in learning efficiency highlight the potential of model-based, individualized curricula for motor rehabilitation and complex skill training.
Authors:Gloria Fernández-Nieto, Kiyoshige Garcés, Mladen Raković, Tongguang Li, Xinyu Li, Linxuan Zhao, Dragan Gašević
Abstract:
Background: Abilities for effective self-regulated learning (SRL) are critical for lifelong learning, particularly during adolescence when these skills consolidate and strongly influence future learning. Their importance has grown with the rise of online and blended education. Yet, little is known about how secondary school students self-regulate in online environments, how their SRL processes and strategies evolve, or how they affect outcomes. In secondary education, understanding these processes can reveal patterns and indicators of learning success, informing the design of online support mechanisms. Evidence from repeated-measures designs remains scarce. Objectives: This study aims to examine how secondary school students enact SRL strategies during online essay writing, how these strategies change over time, and how they relate to learning outcomes. Methods: We analysed metacognition-related trace data collected from secondary students during a two-wave online essay-writing task conducted one week apart in two Colombian schools (N = 93 for session 1, N = 95 for session 2) via a digital learning platform. Using a combination of process mining and unsupervised machine learning techniques, we identified dominant SRL strategies grounded in established SRL processes and examined their stability and association with learning outcomes. Results and conclusions: Three dominant SRL strategies were identified. Results showed variability: many students remained in or shifted to Read first, write next, while none used Write intensively, read selectively in session 2. Although less common, latter strategy was positively associated with learning outcomes.
Authors:Gauri Nayak, Farhana Shahid, Aditya Vashistha, Kiran Garimella
Abstract:
WhatsApp is one of the most widely used messaging platforms globally, with billions of users sharing information in private groups. Yet, it offers little infrastructure to support moderation and group governance. In the absence of platform-level oversight, group admins bear the responsibility of governing group behavior. In this paper, we explore how WhatsApp group admins collaborate with AI tools to create, enforce, and maintain group rules. Drawing on a two-phase speculative design study with 20 admins in India, we examine how participants interacted with an AI assistant (Meta AI) to co-create rules and responded to a series of probes illustrating AI-assisted moderation features. Our findings show that while admins appreciated the AI's ability to surface overlooked rules and reduce their moderation burden, they were highly sensitive to issues of relational trust, data privacy, tone, and social context. We identify how group type and admin style shaped their willingness to delegate authority, and surface the limitations of current chatbot interfaces in supporting collaborative rule-making. We conclude with design implications for building moderation tools that center human judgment, relational nuance, contextual adaptability, and collective governance.
Authors:Ludwig Sidenmark, Qian Zhou, George Fitzmaurice, Fraser Anderson
Abstract:
Creating 3D character animations traditionally requires significant time and effort from the animator. Advancements in generative methods now enable easy creation of multiple character animation variations for use or further editing. However, this capability introduces a new challenge in comparing character animations to select the best animation, which is challenging due to temporal misalignment and the large amount of spatial data. We present AnimationDiff, a visual comparison tool for generated character animations. AnimationDiff enables contextual comparisons in the intended scene and camera angle, and embedding of spatial information by combining established animation visualization techniques and easy switching between overlaid and side-by-side comparisons. AnimationDiff also supports filtering to handle information overload, and Temporal Lenses that visualize entire animations over time for overview, alignment, and comparison. We evaluated AnimationDiff in a user study, showcasing its efficacy in animation comparison and providing design insights for comparing motion.
Authors:Zhenyu Mao, Jacky Keung, Xiangyu Li, Yicheng Sun, Kehui Chen, Jingyu Zhang, Jialong Li
Abstract:
Online scams often unfold gradually through interaction, yet existing detection systems predominantly rely on snapshot-based signals and interruptive warnings, revealing two research gaps in the lack of signals that represent scam risk within conversational dynamics and the underexplored design of non-interruptive interaction. To address these gaps, we introduce multi-level alignment-based hints, informed by the Interactive Alignment Model, as a new detection signal for supporting sensemaking in scam-related conversations. These hints operationalize low-level lexical and syntactic alignments and high-level semantic and situation-model alignments between conversational participants, making conversational dynamics visible to users. We first conduct a preliminary evaluation on real-life scam dialogues, showing that as conversations approach scam attempts, low-level alignment scores remain stable while high-level alignment scores systematically decline, revealing a consistent cross-level pattern indicative of scam progression. Building on this insight, we conduct a user study with thirty participants, indicating that relative to the no-hint baseline, multi-level alignment-based hints increase precision by 0.25, recall by 0.16, and F1 score by 0.21, yielding substantially larger gains than the marginal improvements achieved by keyword-triggered alerts. Statistical analyses reveal that the proposed hints support earlier and more stable confidence formation over time, with ablation results further highlighting the effectiveness of combining alignment hints across levels in achieving these advantages.
Authors:Qiming Yuan, Linyi Han, Nam Ling, Cihan Ruan
Abstract:
People increasingly turn to large language models (LLMs) to interpret ambiguous social situations: a delayed text reply, an unusually cold supervisor, a teacher's mixed signals, or a boundary-crossing friend. Yet in many such cases, no stable interpretation can be verified from the available evidence alone. We study how LLMs respond to these situations across four domains: early-stage romantic relationships, teacher--student dynamics, workplace hierarchies, and ambiguous friendships. Across 72 responses from GPT, Claude, and Gemini, only 9 (12.5\%) genuinely preserved uncertainty. The remaining 87.5% produced interpretive closure through recurring pathways including narrative alignment, narrative reversal, normative advice under uncertainty, and hedged language that still supported a single conclusion. We further find that narrator perspective shapes the path to closure: first-person accounts more often elicited alignment, while third-person accounts invited more detached interpretation, even when the underlying situation remained comparable. Together, these findings show that LLMs do not simply assist interpersonal sensemaking; they tend to resolve ambiguity into coherent and actionable narratives. These results suggest that the central risk is not only that LLMs may misinterpret social situations, but that they may make unresolved situations feel prematurely settled. We frame this tendency as a design challenge for uncertainty-preserving social AI.
Authors:Beyza Cinar, Maria Maleshkova
Abstract:
Disease progression varies with age and is influenced by underlying genetic, biochemical, and hormonal etiologies, suggesting the need for tailored monitoring, care, and medication beyond standard clinical guidelines. Specifically, in autoimmune diseases like type 1 diabetes (T1D), where patients depend on exogenous insulin to compensate for insulin deficiency, medication dosing and the physiological response reflected in vital signs can differ. Insulin therapy can lead to hypoglycemia, a dangerous condition characterized by decreased blood glucose levels ($\leq$70). This risk can be mitigated through improved diabetes management supported by data analytics. Notably, leveraging data from continuous glucose monitoring (CGM) devices, hypoglycemia onset can be predicted. However, while glucose variability, auto-antibody levels, and hypoglycemia occurrence differ across age groups, hypoglycemia classification most often only relies on population-based models specialized in specific age ranges. In this work, we classify hypoglycemia 0, 5-15, 20-45, and 50-120 minutes before onset using DiaData, a large CGM dataset of patients with T1D ranging from children to seniors. In particular, we investigate: 1) the generalizability of a population-based model including all age groups, 2) the impact of age-segmented models trained separately per age group, and 3) the effect of model individualization through transfer learning. The results show that a global population-based model yields similar or superior performance compared to age-segmented models. These findings suggest that data from children, teenagers, and adults can be combined for training models on hypoglycemia classification. While glucose variation differs across age groups, short-term hypoglycemic patterns are similar. However, data of children obtain their best recall with age specialized model.
Authors:Can Liu, Sizhe Cheng, Feng Liang, Zhibang Jiang, Lingru Huang, Kavinda Athapaththu, Yong Wang
Abstract:
With the rise of mobile-first consumption, users increasingly engage with data visualizations on mobile devices. However, the vast majority of existing visualizations are originally authored for desktop environments. Due to significant differences in viewport size and interaction paradigms, directly scaling desktop charts often results in illegible text, information loss, and interaction failures. To bridge this gap, we propose an automated framework to adapt desktop-based visualizations for mobile screens. By systematically categorizing the operations involved in the adaptation process, we establish a multi-level design space. This space defines evolution rules spanning from the global topology level, through the reference frame level, down to the visual elements level. Guided by this theoretical framework, we developed Proteus, a large language model-driven multi-agent system that automatically parses online visualizations, predicts optimal transformation strategies within the design space, and generates equivalent, highly readable visualizations for mobile devices. Case studies and an in-depth user study with 12 participants demonstrate the effectiveness and usability of Proteus.
Authors:Markus Knauer, Edoardo Fiorini, Maximilian Mühlbauer, Stefan Schneyer, Promwat Angsuratanawech, Florian Samuel Lay, Timo Bachmann, Samuel Bustamante, Korbinian Nottensteiner, Freek Stulp, Alin Albu-Schäffer, João Silvério, Thomas Eiband
Abstract:
Industrial robot applications require increasingly flexible systems that non-expert users can easily adapt for varying tasks and environments. However, different adaptations benefit from different interaction modalities. We present an interactive framework that enables robot skill adaptation through three complementary modalities: kinesthetic touch for precise spatial corrections, natural language for high-level semantic modifications, and a graphical web interface for visualizing geometric relations and trajectories, inspecting and adjusting parameters, and editing via-points by drag-and-drop. The framework integrates five components: energy-based human-intention detection, a tool-based LLM architecture (where the LLM selects and parameterizes predefined functions rather than generating code) for safe natural language adaptation, Kernelized Movement Primitives (KMPs) for motion encoding, probabilistic Virtual Fixtures for guided demonstration recording, and ergodic control for surface finishing. We demonstrate that this tool-based LLM architecture generalizes skill adaptation from KMPs to ergodic control, enabling voice-commanded surface finishing. Validation on a 7-DoF torque-controlled robot at the Automatica 2025 trade fair demonstrates the practical applicability of our approach in industrial settings.
Authors:Suleyman Ozdel, Amr Nader, Yasmeen Abdrabou, Enkelejda Kasneci
Abstract:
With the growing use of eye tracking on VR and mobile platforms, gaze data is increasing. While scanpath comparison is important to gaze behavior analysis, existing methods lack privacy-preserving capabilities for real-world use. We present a garbled-circuit (GC)-based approach enabling secure storage and privacy-preserving scanpath comparison under the semi-honest model. It supports two configurations: (1) a two-party setting where the data owner and processor jointly compute similarity scores without revealing their inputs, and (2) a server-assisted setting where encrypted scanpaths are stored and processed while the data owner remains offline. All decryption and comparison operations are executed inside the GC. Experiments on three eye-tracking datasets evaluate fidelity, runtime, and communication, and show secure results for MultiMatch, ScanMatch, and SubsMatch closely match plaintext outcomes, with manageable runtime and communication overhead. Tests under various network conditions indicate that the design remains feasible for real-world privacy-preserving scanpath analysis and can be extended to other GC-based behavioral algorithms.
Authors:Duru Paker, Suleyman Ozdel, Enkelejda Kasneci
Abstract:
Passwords remain the primary authentication method, yet user-created passwords are often the weakest due to the security-usability trade-off. Although AI-based password generators are emerging, little is known about their effectiveness and user perceptions. This eye-tracking study examined how behavior during password creation, selection, and memorization relates to objective and subjective password quality. Four password models, three AI-based (DeepSeek-API, ChatGPT-API, PassGPT) and one rule-based random generator, generated suggestions from participants' self-generated passwords across four website contexts. Eye movements were recorded throughout the experiment. Results confirm the expected trade-off between AI-generated password strength and human memorability but also reveal a novel behavioral link. Despite stronger AI-generated passwords, participants favored self-generated ones. Notably, visual attention to contextual cues was significantly correlated with higher password entropy. This suggests that security is shaped not only by the generation tool but also by users' visual engagement with contextual cues, highlighting the potential of attention-driven security design.
Authors:Suleyman Ozdel, Virmarie Maquiling, Kadir Burak Buldu, Yasmeen Abdrabou, Enkelejda Kasneci
Abstract:
Reproducibility in eye-tracking research is increasingly important as researchers conduct diverse experiments and seek to validate or replicate findings. However, exact replication remains challenging due to differences in laboratory practices and experimental setups. Inconsistent stimulus presentation can yield divergent metrics from identical oculomotor behavior, yet the stimulus layer remains largely unstandardized. Existing tools often require programming expertise or depend on specific hardware vendors. We introduce VIVA Stimuli, a web-based platform for standardized eye-tracking stimulus presentation. It provides configurable task types, including fixation, smooth pursuit, cognitive load, blink, slippage, content display, and questionnaires within a unified environment. The platform supports any eye-tracking technology, including wearable and screen-based VOG trackers, LFI sensors, and EOG devices. ArUco markers enable synchronization for trackers with scene cameras, while a WebSocket architecture ensures temporal synchronization for those without. A visual experiment flow editor allows protocols to be exported and shared, enabling identical stimulus replication across laboratories.
Authors:Benedetta Tessa, Gautam Kishore Shahi, Amaury Trujillo, Stefano Cresci
Abstract:
During major political events, social media platforms encounter increased systemic risks. However, it is still unclear if and how they adjust their moderation practices in response. The Digital Services Act Transparency Database provides-for the first time-an opportunity to systematically examine content moderation at scale, allowing researchers and policymakers to evaluate platforms' compliance and effectiveness, especially at high-stakes times. Here we analyze 1.58 billion self-reported moderation actions by the eight largest social media platforms in Europe over an eight-month period surrounding the 2024 European Parliament elections. We found that platforms did not exhibit meaningful signs of adaptation in moderation strategies as their self-reported enforcement patterns did not change significantly around the elections. This raises questions about whether platforms made any concrete adjustments, or whether the structure of the database may have masked them. On top of that, we reveal that initial concerns regarding platforms' transparency and accountability still persist one year after the launch of the Transparency Database. Our findings highlight the limits of current self-regulatory approaches and point to the need for stronger enforcement and better data access mechanisms to ensure that online platforms meet their responsibilities in protecting the democratic processes.
Authors:Philippe Laban, Tobias Schnabel, Jennifer Neville
Abstract:
Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
Authors:Dipto Das, Christelle Tessono, Syed Ishtiaque Ahmed, Shion Guha
Abstract:
In November 2025, the Government of Canada operationalized its commitment to transparency by releasing its first Federal AI Register. In this paper, we argue that such registers are not neutral mirrors of government activity, but active instruments of ontological design that configure the boundaries of accountability. We analyzed the Register's complete dataset of 409 systems using the Algorithmic Decision-Making Adapted for the Public Sector (ADMAPS) framework, combining quantitative mapping with deductive qualitative coding. Our findings reveal a sharp divergence between the rhetoric of "sovereign AI" and the reality of bureaucratic practice: while 86\% of systems are deployed internally for efficiency, the Register systematically obscures the human discretion, training, and uncertainty management required to operate them. By privileging technical descriptions over sociotechnical context, the Register constructs an ontology of AI as "reliable tooling" rather than "contestable decision-making." We conclude that without a shift in design, such transparency artifacts risk automating accountability into a performative compliance exercise, offering visibility without contestability.
Authors:Zoe De Simone, Angie Boggust, Fredo Durand, Ashia Wilson, Arvind Satyanarayan
Abstract:
Text-to-image (T2I) systems enable rapid generation of high-fidelity imagery but are misaligned with how visual ideas develop. T2I systems generate outputs that make implicit visual decisions on behalf of the user, often introduce fine-grained details that can anchor users prematurely and limit their ability to keep options open early on, and cause unintended changes during editing that are difficult to correct and reduce users' sense of control. To address these concerns, we present Creo, a multi-stage T2I system that scaffolds image generation by progressing from rough sketches to high-resolution outputs, exposing intermediary abstractions where users can make incremental changes. Sketch-like abstractions invite user editing and allow users to keep design options open when ideas are still forming due to their provisional nature. Each stage in Creo can be modified with manual changes and AI-assisted operations, enabling fine-grained, step-wise control through a locking mechanism that preserves prior decisions so subsequent edits affect only specified regions or attributes. Users remain in the loop, making and verifying decisions across stages, while the system applies diffs instead of regenerating full images, reducing drift as fidelity increases. A comparative study with a one-shot baseline shows that participants felt stronger ownership over Creo outputs, as they were able to trace their decisions in building up the image. Furthermore, embedding-based analysis indicates that Creo outputs are less homogeneous than one-shot results. These findings suggest that multi-stage generation, combined with intermediate control and decision locking, is a key design principle for improving controllability, user agency, creativity, and output diversity in generative systems.
Authors:Louis Rosenberg, Hans Schumann, Ganesh Mani, Gregg Willcox
Abstract:
Hyperchat AI is a communication and collaboration architecture that employs intervening AI agents to enable real-time conversational deliberations among distributed human teams of unlimited size. Prior work has shown that teams as large as 250 people can hold productive real-time conversations by text, voice, or video using Hyperchat AI to discuss complex problems, brainstorm solutions, surface risks, assess alternatives, prioritize options, and converge on optimized results. Building on this prior work, this new study tasked groups of 25 to 30 basketball fans with conversationally forecasting 56 NBA games (against the spread) over a 12-week period. Results show that when discussing and debating NBA games (for five minutes each) using a Hyperchat AI enabled platform called Thinkscape, human teams were 62% accurate across the full set of NBA forecasts. This is a significant result versus the Vegas odds of 50% (p=0.059). Furthermore, had the participants wagered on the games, they would have produced an 18% ROI over the 12-week period. In addition, this study found that the conversation rate during each forecast was positively correlated with prediction accuracy. In fact, when excluding the 12 forecasts in the bottom 25th percentile by average conversation rate, the remaining 38 forecasts recorded a 68% accuracy against the published Vegas spread (p=0.017). This suggests that large-scale conversational deliberations, when facilitated by intervening AI-agents, positively impacts accuracy in groupwise forecasting.
Authors:Sanchita S. Kamath, Aziz N Zeidieh, Venkatesh Potluri, Sile O'Modhrain, Kenneth Perry, JooYoung Seo
Abstract:
Three-dimensional (3D) data visualizations, such as surface plots, are vital in STEM fields from biomedical imaging to spectroscopy, yet remain largely inaccessible to blind and low-vision (BLV) people. To address this gap, we conducted an Experience-Based Co-Design with BLV co-designers with expertise in non-visual data representations to create an accessible, multi-modal, web-native visualization tool. Using a multi-phase methodology, our team of five BLV and one non-BLV researcher(s) participated in two iterative sessions, comparing a low-fidelity tactile probe with a high-fidelity digital prototype. This process produced a prototype with empirically grounded features, including reference sonification, stereo and volumetric audio, and configurable buffer aggregation, which our co-designers validated as improving analytic accuracy and learnability. In this study, we target core analytic tasks essential for non-visual 3D data exploration: orientation, landmark and peak finding, comparing local maxima versus global trends, gradient tracing, and identifying occluded or partially hidden features. Our work offers accessibility researchers and developers a co-design protocol for translating tactile knowledge to digital interfaces, concrete design guidance for future systems, and opportunities to extend accessible 3D visualization into embodied data environments.
Authors:Can Liu, Wenjie Jiang, Shaolun Ruan, Kotaro Hara, Yong Wang
Abstract:
Pitch-based sonification of quantitative data increases the accessibility of data visualizations that are otherwise inaccessible for blind and low-vision (BLV) individuals. We argue that, although pitch representations can reveal the coarse-grained information of data, such as data trend and value comparison, they cannot effectively convey the fine-grained details like the sign and exact value of individual data points. Informed by existing sound perception research, we propose a spatial audio-based approach by representing data values as the sound direction in the azimuth plane to achieve accessible fine-grained data representation. We conducted a user study with 26 participants (including 10 BLV participants) on four data perception tasks. The results show our approach significantly outperforms pitch representation on fine-grained data perception tasks like recognizing data signs and exact values, and performs similarly on data trend identification, despite its inferior accuracy on data value comparison.
Authors:Sneha Gathani, Sirui Zeng, Diya Patel, Ryan Rossi, Dan Marshall, Cagatay Demiralp, Steven Drucker, Zhicheng Liu
Abstract:
What-if analysis (WIA) is an iterative, multi-step process where users explore and compare hypothetical scenarios by adjusting parameters, applying constraints, and scoping data through interactive interfaces. Current tools fall short of supporting effective interactive WIA: spreadsheet and BI tools require time-consuming and laborious setup, while LLM-based chatbot interfaces are semantically fragile, frequently misinterpret intent, and produce inconsistent results as conversations progress. To address these limitations, we present a two-stage workflow that translates natural language (NL) WIA questions into interactive visual interfaces via an intermediate representation, powered by the Praxa Specification Language (PSL): first, LLMs generate PSL specifications from NL questions capturing analytical intent and logic, enabling validation and repair of erroneous specifications; and second, the specifications are compiled into interactive visual interfaces with parameter controls and linked visualizations. We benchmark this workflow with 405 WIA questions spanning 11 WIA types, 5 datasets, and 3 state-of-the-art LLMs. The results show that across models, half of specifications (52.42%) are generated correctly without intervention. We perform an analysis of the failure cases and derive an error taxonomy spanning non-functional errors (specifications fail to compile) and functional errors (specifications compile but misrepresent intent). Based on the taxonomy, we apply targeted repairs on the failure cases using few-shot prompts and improve the success rate to 80.42%. Finally, we show how undetected functional errors propagate through compilation into plausible but misleading interfaces, demonstrating that the intermediate specification is critical for reliably bridging NL and interactive WIA interface in LLM-powered WIA systems.
Authors:Nicola Rossberg, Bennett Kleinberg, Barry O'Sullivan, Luca Longo, Andrea Visentin
Abstract:
With the growing pervasiveness of artificial intelligence, the ability to explain the inferences made by machine learning models has become increasingly important. Numerous techniques for model explainability have been proposed, with natural-language textual explanations among the most widely used approaches. When applied to tabular data, these explanations typically draw on input features to justify a given inference. Consequently, a user's ability to interpret the explanation depends on their understanding of the input features. To quantify this feature-level understanding, Rossberg et al. introduced the Feature Understandability Scale. Building on that work, this proof-of-concept study collects understandability scores across two datasets, proposes a co-optimisation methodology of understandability and accuracy and presents the resulting explanations alongside the model accuracies. This work contributes to the body of knowledge on model interpretability by design. It is found that accuracy and understandability can be successfully co-optimised while maintaining high classification performances. The resulting explanations are considered more understandable at face value. Further research will aim to confirm these findings through user evaluation.
Authors:Hita Kambhamettu, Will Crichton, Sean Welleck, Harrison Goldstein, Andrew Head
Abstract:
LLM-generated explanations can make technical content more accessible, but there is a ceiling on what they can support interactively. Because LLM outputs are static text, they cannot be executed or stepped through. We argue that grounding explanations in a formalized representation enables interactive affordances beyond what static text supports. We instantiate this idea for mathematical proof comprehension with explorable theorems, a system that uses LLMs to translate a theorem and its written proof into Lean, a programming language for machine-checked proofs, and links the written proof with the Lean code. Readers can work through the proof at a step-level granularity, test custom examples or counterexamples, and trace the logical dependencies bridging each step. Each worked-out step is produced by executing the Lean proof on that example and extracting its intermediate state. A user study ($n = 16$) shows potential advantages of this approach: in a proof-reading task, participants who had access to the provided explorability features gave better, more correct, and more detailed answers to comprehension questions, demonstrating a stronger overall understanding of the underlying mathematics.
Authors:Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu
Abstract:
Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
Authors:Chitralekha Gupta, Jing Peng, Ashwin Ram, Shreyas Sridhar, Christophe Jouffrais, Suranga Nanayakkara
Abstract:
Current scene perception tools for Blind and Low Vision (BLV) individuals rely on spoken descriptions but lack engaging representations of visually pleasing distant environmental landscapes (Vista spaces). Our proposed Scene2Audio framework generates comprehensible and enjoyable nonverbal audio using generative models informed by psychoacoustics, and principles of scene audio composition. Through a user study with 11 BLV participants, we found that combining the Scene2Audio sounds with speech creates a better experience than speech alone, as the sound effects complement the speech making the scene easier to imagine. A mobile app "in-the-wild" study with 7 BLV users for more than a week further showed the potential of Scene2Audio in enhancing outdoor scene experiences. Our work bridges the gap between visual and auditory scene perception by moving beyond purely descriptive aids, addressing the aesthetic needs of BLV users.
Authors:Xi Lu, Di Hu, An T. Nguyen, Brad Morse, Lisa M. Schilling, Kai Zheng, Michelle S. Keller, Lucila Ohno-Machado, Yunan Chen
Abstract:
Patient-controlled data-sharing systems are increasingly promoted as a way to empower patients with greater autonomy over their health data. Yet it remains unclear how different stakeholders, especially patients and health system leaders, perceive the benefits and challenges of enabling granular control over the sharing of de-identified medical data for research. To address this gap, we developed a high-fidelity prototype of a patient-controlled, web-based consent platform and conducted a two-phase mixed-methods study:semi-structured interviews with 16 health system leaders and a survey with 523 patient participants. While both groups appreciated the potential of such a platform to enhance transparency and autonomy, their views diverged in meaningful ways. Leaders viewed transparency and granular control through the lens of informed consent and institutional ethics, whereas patients interpreted these factors as safeguards against potential risks and uncertainties. Our findings underscore critical tensions such as individual control and research integrity. We offer design implications for building trustworthy, context-aware systems that support flexible granularity, provide ongoing benefit-centered transparency, and adapt to diverse literacy and privacy needs.
Authors:Weitong Cai, Hang Zhang, Yukai Huang, Shitong Sun, Jiankang Deng, Songcen Xu, Jifei Song, Zhensong Zhang
Abstract:
Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.
Authors:Sharifa Sultana, Zinnat Sultana, Jeffrey M. Rzeszotarski, Syed Ishtiaque Ahmed
Abstract:
There is an increasing interest in telling serious stories with data. Designers organize information, construct narratives, and present findings to inform audiences. However, many of these practices emerge from modern information visualization rhetoric and ethical frameworks which may marginalize communities with low digital and media literacy. In a ten-month-long ethnographic study in three Bangladeshi villages, we investigated how these communities use entertainment and cultural practices, namely Puthi, Bhandari Gaan, and Pot music, to instruct, communicate traditional moral lessons and recall history. We found that these communities embrace polyvocality and multiple ethical frameworks in their performances, construct narratives combining factuality, emotionality, and aesthetics, and adapt their performances to changing technology and audience needs. Our findings provide HCI, visualization, and ethical data practitioners with implications for the design of accessible and culturally appropriate ways of presenting data narratives in data-driven systems.
Authors:Niclas Pokel, Yiming Zhao, Pehuén Moure, Yingqiang Gao, Roman Böhringer
Abstract:
Personalizing Automatic Speech Recognition (ASR) for non-normative speech remains challenging because data collection is labor-intensive and model training is technically complex. To address these limitations, we propose Adapt4Me, a web-based decentralized environment that operationalizes Bayesian active learning to enable end-to-end personalization without expert supervision. The app exposes data selection, adaptation, and validation to lay users through a three-stage human-in-the-loop workflow: (1) rapid profiling via greedy phoneme sampling to capture speaker-specific acoustics; (2) backend personalization using Variational Inference Low-Rank Adaptation (VI-LoRA) to enable fast, incremental updates; and (3) continuous improvement, where users guide model refinement by resolving visualized model uncertainty via low-friction top-k corrections. By making epistemic uncertainty explicit, Adapt4Me reframes data efficiency as an interactive design feature rather than a purely algorithmic concern. We show how this enables users to personalize robust ASR models, transforming them from passive data sources into active authors of their own assistive technology.
Authors:Thomas Şerban von Davier, Hao-Ping Lee, Jodi Forlizzi, Sauvik Das
Abstract:
The evidence on the effects of generative AI (GenAI) on critical thinking is mixed, with studies suggesting both potential harms and benefits depending on its implementation. Some argue that AI-driven provocations, such as questions asking for human clarification and justification, are beneficial for eliciting critical thinking. Drawing on our experience designing and evaluating two GenAI-powered tools for knowledge work, ArtBot in the domain of fine art interpretation and Privy in the domain of AI privacy, we reflect on how design decisions shape the form and effectiveness of such provocations. Our observations and user feedback suggest that domain-specific provocations, implemented through productive friction and interactions that depend on user contribution, can meaningfully support critical thinking. We present participant experiences with both prototypes and discuss how supporting critical thinking may require moving beyond static provocations toward approaches that adapt to user preferences and levels of expertise.
Authors:Ankur Kamboj, Rajiv Ranganathan, Xiaobo Tan, Vaibhav Srivastava
Abstract:
In this work, we propose a data-driven framework to design optimal haptic nudge feedback leveraging the learner's estimated skill to address the challenge of learning a novel motor task in a high-dimensional, redundant motor space. A nudge is a series of vibrotactile feedback delivered to the learner to encourage motor movements that aid in task completion. We first model the stochastic dynamics of human motor learning under haptic nudges using an Input-Output Hidden Markov Model (IOHMM), which explicitly decouples latent skill evolution from observable performance measures. Leveraging this predictive model, we formulate the haptic nudge feedback design problem as a Partially Observable Markov Decision Process (POMDP). This allows us to derive an optimal nudging policy that minimizes long-term performance cost and implicitly guides the learner toward superior skill states. We validate our approach through a human participant study (N=30) involving a high-dimensional motor task rendered through a hand exoskeleton. Results demonstrate that participants trained with the POMDP-derived policy exhibit significantly accelerated movement efficiency and endpoint accuracy compared to groups receiving heuristic-based feedback or no feedback. Furthermore, synergy analysis reveals that the POMDP group discovers efficient low-dimensional motor representations more rapidly.
Authors:Ziqi Pan, Ziqi Liu, Jinhan Zhang, Zeyu Huang, Xiaojuan Ma
Abstract:
In today's in-person group discussions, smartphones are integrated as intelligent workstations; yet given their co-presence in such face-to-face interactions, whether and how they may enhance people's behavioral engagement with others remains underexplored. This work investigates how animating personal smartphones to move expressively, without compromising regular functions, can transform them into active embodied facilitators for co-located group interaction. In the four-stranger small-group discussion setting, guided by Tuckman's group-development theory, we conducted a design workshop (n=12) to identify problematic group-work circumstances and design expressive, attention-efficient animated phone facilitations. Subsequently, we developed AnimaStand, a movement-enabled phone stand that animates phones to deliver group facilitation cues according to conversation dynamics. In a between-subjects Wizard-of-Oz study (n=56) with four-stranger group discussions, where everyone's phone was on an AnimaStand, the facilitations re-engaged inactive members, enhancing group dynamics, task operation performance, and relationships. We finally discuss prospects for more adaptive and generalizable animated device personal facilitation.
Authors:Ruijia Chen, Yuheng Wu, Charlie Houseago, Filipe Gaspar, Filippo Aleotti, Dorian Gálvez-López, Oliver Johnston, Diego Mazala, Guillermo Garcia-Hernando, Maryam Bandukda, Gabriel Brostow, Jessica Van Brummelen
Abstract:
GPS and smartphones enable users to place location-based annotations, capturing rich environmental context. Previous research demonstrates that blind and low vision (BLV) people can use annotations to explore unfamiliar areas. However, current commercial systems allowing BLV users to create annotations have never been evaluated, and current GPS-based systems can deviate several meters. Motivated by high-accuracy visual positioning technology, we first conducted a formative study with 24 BLV participants to envision a more accurate and inclusive annotation system. Surprisingly, many participants viewed the high-accuracy technology not just as an annotation system but also as a tool for precise last-few-meters navigation. Guided by participant feedback, we developed NaviNote, which combines vision-based high-precision localization with an agentic architecture to enable voice-based annotation authoring and navigation. Evaluating NaviNote with 18 BLV participants showed that it significantly improved navigation performance and supported users in understanding and annotating their surroundings. Based on these findings, we discuss design considerations for future accessible annotation authoring systems.
Authors:Daryl Hedley, Doug Pietrzak, Jorge Dias, Ian Burden, Bakhtawar Ahtisham, Zhuqian Zhou, Kirk Vanacore, Josh Marland, Rachel Slama, Justin Reich, Kenneth Koedinger, René Kizilcec
Abstract:
Digital educational environments are expanding toward complex AI and human discourse, providing researchers with an abundance of data that offers deep insights into learning and instructional processes. However, traditional qualitative analysis remains a labor-intensive bottleneck, severely limiting the scale at which this research can be conducted. We present Sandpiper, a mixed-initiative system designed to serve as a bridge between high-volume conversational data and human qualitative expertise. By tightly coupling interactive researcher dashboards with agentic Large Language Model (LLM) engines, the platform enables scalable analysis without sacrificing methodological rigor. Sandpiper addresses critical barriers to AI adoption in education by implementing context-aware, automated de-identification workflows supported by secure, university-housed infrastructure to ensure data privacy. Furthermore, the system employs schema-constrained orchestration to eliminate LLM hallucinations and enforces strict adherence to qualitative codebooks. An integrated evaluations engine allows for the continuous benchmarking of AI performance against human labels, fostering an iterative approach to model refinement and validation. We propose a user study to evaluate the system's efficacy in improving research efficiency, inter-rater reliability, and researcher trust in AI-assisted qualitative workflows.
Authors:Jiaming Zhang, Mingxu Liu, Hongchao Shu, Ruixing Liang, Yihao Liu, Ojas Taskar, Amir Kheradmand, Mehran Armand, Alejandro Martin-Gomez
Abstract:
Surgical navigation provides real-time guidance by estimating the pose of patient anatomy and surgical instruments to visualize relevant intraoperative information. In conventional systems, instruments are typically tracked using fiducial markers and stationary optical tracking systems (OTS). Augmented reality (AR) has further enabled intuitive visualization and motivated tracking using sensors embedded in head-mounted displays (HMDs). However, most existing approaches rely on a clear line of sight, which is difficult to maintain in dynamic operating room environments due to frequent occlusions caused by equipment, surgical tools, and personnel. This work introduces a framework for tracking surgical instruments under occlusion by fusing multiple sensing modalities within a dynamic scene graph representation. The proposed approach integrates tracking systems with different accuracy levels and motion characteristics while estimating tracking reliability in real time. Experimental results demonstrate improved robustness and enhanced consistency of AR visualization in the presence of occlusions.
Authors:Ivy Xiao He, Stefanie Tellex, Jason Xinyu Liu
Abstract:
To assist humans in open-world environments, robots must interpret ambiguous instructions to locate desired objects. Foundation model-based approaches excel at multimodal grounding, but they lack a principled mechanism for modeling uncertainty in long-horizon tasks. In contrast, Partially Observable Markov Decision Processes (POMDPs) provide a systematic framework for planning under uncertainty but are often limited in supported modalities and rely on restrictive environment assumptions. We introduce LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search. Unlike prior work, LEGS-POMDP explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location. In simulation, multimodal fusion significantly outperforms unimodal baselines, achieving an average success rate of 89\% across challenging environments and object categories. Finally, we demonstrate the full system on a quadruped mobile manipulator, where real-world experiments qualitatively validate robust multimodal perception and uncertainty reduction under ambiguous instructions.
Authors:Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp, João Silvério
Abstract:
Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.
Authors:Esen K. Tütüncü, Qian Zhou, Frederik Brudy, George Fitzmaurice, Fraser Anderson
Abstract:
Current AI writing tools, which rely on text prompts, poorly support the spatial and interactive nature of storytelling where ideas emerge from direct manipulation and play. We present PlayWrite, a mixed-reality system where users author stories by directly manipulating virtual characters and props. A multi-agent AI pipeline interprets these actions into Intent Frames -structured narrative beats visualized as rearrangeable story marbles on a timeline. A large language model then transforms the user's assembled sequence into a final narrative. A user study (N=13) with writers from varying domains found that PlayWrite fosters a highly improvisational and playful process. Users treated the AI as a collaborative partner, using its unexpected responses to spark new ideas and overcome creative blocks. PlayWrite demonstrates an approach for co-creative systems that move beyond text to embrace direct manipulation and play as core interaction modalities.
Authors:Sicheng Yang, Yukai Huang, Weitong Cai, Shitong Sun, Fengyi Fang, You He, Yiqiao Xie, Jiankang Deng, Hang Zhang, Jifei Song, Zhensong Zhang
Abstract:
What if accessing the web did not require a screen, a stable desk, or even free hands? For people navigating crowded cities, living with low vision, or experiencing cognitive overload, smart glasses coupled with AI agents could turn the web into an always-on assistive layer over daily life. We present Egocentric Co-Pilot, a web-native neuro-symbolic framework that runs on smart glasses and uses a Large Language Model (LLM) to orchestrate a toolbox of perception, reasoning, and web tools. An egocentric reasoning core combines Temporal Chain-of-Thought with Hierarchical Context Compression to support long-horizon question answering and decision support over continuous first-person video, far beyond a single model's context window. Additionally, a lightweight multimodal intent layer maps noisy speech and gaze into structured commands. We further implement and evaluate a cloud-native WebRTC pipeline integrating streaming speech, video, and control messages into a unified channel for smart glasses and browsers. In parallel, we deploy an on-premise WebSocket baseline, exposing concrete trade-offs between local inference and cloud offloading in terms of latency, mobility, and resource use. Experiments on Egolife and HD-EPIC demonstrate competitive or state-of-the-art egocentric QA performance, and a human-in-the-loop study on smart glasses shows higher task completion and user satisfaction than leading commercial baselines. Taken together, these results indicate that web-connected egocentric co-pilots can be a practical path toward more accessible, context-aware assistance in everyday life. By grounding operation in web-native communication primitives and modular, auditable tool use, Egocentric Co-Pilot offers a concrete blueprint for assistive, always-on web agents that support education, accessibility, and social inclusion for people who may benefit most from contextual, egocentric AI.
Authors:Md Sabbir Ahmed, Kaitlyn Dorothy Petz, Noah French, Tanvi Lakhtakia, Aayushi Sangani, Mark Rucker, Xinyu Chen, Bethany A. Teachman, Laura E. Barnes
Abstract:
Social interactions are fundamental to well-being, yet automatically detecting them in daily life-particularly using wearables-remains underexplored. Most existing systems are evaluated in controlled settings, focus primarily on in-person interactions, or rely on restrictive assumptions (e.g., requiring multiple speakers within fixed temporal windows), limiting generalizability to real-world use. We present an on-watch interaction detection system designed to capture diverse interactions in naturalistic settings. A core component is a foreground speech detector trained on a public dataset. Evaluated on over 100,000 labeled foreground speech and background sound instances, the detector achieves a balanced accuracy of 85.51%, outperforming prior work by 5.11%. We evaluated the system in a real-world deployment (N=38), with over 900 hours of total smartwatch wear time. The system detected 1,691 interactions, 77.28% were confirmed via participant self-report, with durations ranging from under one minute to over one hour. Among correct detections, 81.45% were in-person, 15.7% virtual, and 1.85% hybrid. Leveraging participant-labeled data, we further developed a multimodal model achieving a balanced accuracy of 90.36% and a sensitivity of 91.17% on 33,698 labeled 15-second windows. These results demonstrate the feasibility of real-world interaction sensing and open the door to adaptive, context-aware systems responding to users' dynamic social environments.
Authors:Runhua Zhang, Ziqi Pan, Huiran Yi, Huamin Qu, Xiaojuan Ma
Abstract:
Sharing gendered experiences on social media has been widely recognized as supporting women's personal sense-making and contributing to digital feminism. However, there are known concerns, such as fear of judgment and backlash, that may discourage women from posting online. In this study, we examine a recurring practice on Xiaohongshu, a popular Chinese social media platform, in which women share their gendered experiences alongside screenshots of conversations with LLMs. We conducted semi-structured interviews with 20 women to investigate whether and how interactions with LLMs might support women in articulating and sharing gendered experiences. Our findings reveal that, beyond those external concerns, women also hold self-imposed standards regarding what feels appropriate and worthwhile to share publicly. We further show how interactions with LLMs help women meet these standards and navigate such concerns. We conclude by discussing how LLMs might be carefully and critically leveraged to support women's everyday expression online.
Authors:Runhua Zhang, Ziqi Pan, Kangyu Yuan, Qiaoyi Chen, Yulin Tian, Huamin Qu, Xiaojuan Ma
Abstract:
Everyday digital feminism refers to the ordinary, often pragmatic ways women articulate lived experiences and cultivate solidarity in online spaces. In China, such practices flourish on RedNote through discussions under hashtags like ''women's growth''. Recently, DeepSeek-generated content has been taken up as a new voice in these conversations. Given widely recognized gender biases in LLMs, this raises critical concerns about how LLMs interact with everyday feminist practices. Through an analysis of 430 RedNote posts, 139 shared DeepSeek responses, and 3211 comments, we found that users predominantly welcomed DeepSeek's advice. Yet feminist critical discourse analysis revealed that these responses primarily encouraged women to self-optimize and pursue achievements within prevailing norms rather than challenge them. By interpreting this case, we discuss the opportunities and risks that LLMs introduce for everyday feminism as a pathway toward women's empowerment, and offer design implications for leveraging LLMs to better support such practices.
Authors:Lindsay Popowski, Xiyuan Wu, Charlotte Zhu, Tiziano Piccardi, Michael S. Bernstein
Abstract:
Social media users have repeatedly advocated for control over the currently opaque operations of feed algorithms. Large language models (LLMs) now offer the promise of custom-defined feeds--but users often fail to foresee the gaps and edge cases in how they define their custom feed. We introduce feed elicitation interviews, an interactive method that guides users through identifying these gaps and articulating their preferences to better author custom social media feeds. We deploy this approach in an online study to create custom BlueSky feeds and find that participants significantly prefer the feeds produced from their elicited preferences to those produced by users manually describing their feeds. Through feed elicitation interviews, we advance users' ability to control their social media experience, empowering them to describe and implement their desired feeds.
Authors:Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver
Abstract:
Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear. Addressing this gap, we describe the results of a between-subject user study where participants interact with one of two versions of a chatbot called NAVI which assists users in an interactive map-based 2D navigation task. The two chatbot versions differ only in communication style: one is friendly and supportive, while the other is direct and task-focused. Our results show that the friendly style increases subjective satisfaction and significantly improves task completion rates among female participants only, while no baseline differences between female and male participants were observed in a control condition without the chatbot. Furthermore, we find little evidence of users mimicking the chatbot's style, suggesting limited linguistic accommodation. These findings highlight the importance of user- and task-sensitive conversational agents and support that communication style personalization can meaningfully enhance interaction quality and performance.
Authors:Lev Tankelevitch, Ava Elizabeth Scott, Nagaravind Challakere, Payod Panda, Sean Rintel
Abstract:
Ineffective meetings are pervasive. Thinking ahead explicitly about meeting goals may improve effectiveness, but current collaboration platforms lack integrated support. We tested a lightweight goal-reflection intervention in a preregistered field experiment in a global technology company (361 employees, 7196 meetings). Over two weeks, workers in the treatment group completed brief pre-meeting surveys in their collaboration platform, nudging attention to goals for upcoming meetings. To measure impact, both treatment and control groups completed post-meeting surveys about meeting effectiveness. While the intervention impact on meeting effectiveness was not statistically significant, mixed-methods findings revealed improvements in self-reported awareness and behaviour across both groups, with post-meeting surveys unintentionally functioning as an intervention. We highlight the promise of supporting goal reflection, while noting challenges of evaluating and supporting workplace reflection for meetings, including workflow and collaboration norms, and attitudes and behaviours around meeting preparation. We conclude with implications for designing technological support for meeting intentionality.
Authors:Alyssa Hwang, Hita Kambhamettu, Yue Yang, Ajay Patel, Joseph Chee Chang, Andrew Head
Abstract:
Understanding information-dense documents like recipes and scientific papers requires readers to find, interpret, and connect details scattered across text, figures, tables, and other visual elements. These documents are often long and filled with specialized terminology, hindering the ability to locate relevant information or piece together related ideas. Existing tools offer limited support for synthesizing information across media types. As a result, understanding complex material remains cognitively demanding. This paper presents a framework for fine-grained integration of information in complex documents. We instantiate the framework in an augmented reading interface, which populates a scientific paper with clickable points on figures, interactive highlights in the body text, and a persistent reference panel for accessing consolidated details without manual scrolling. In a controlled between-subjects study, we find that participants who read the paper with our tool achieved significantly higher scores on a reading quiz without evidence of increased time to completion or cognitive load. Fine-grained integration provides a systematic way of revealing relationships within a document, supporting engagement with complex, information-dense materials.
Authors:Most. Sharmin Sultana Samu, Nafisa Khan, Kazi Toufique Elahi, Tasnuva Binte Rahman, Md. Rakibul Islam, Farig Sadeque
Abstract:
The integration of Artificial Intelligence (AI) necessitates determining whether systems function as tools or collaborative teammates. In this study, by synthesizing Human-AI Interaction (HAI) literature, we analyze this distinction across four dimensions: interaction design, trust calibration, collaborative frameworks and healthcare applications. Our analysis reveals that static interfaces and miscalibrated trust limit AI efficacy. Performance hinges on aligning transparency with cognitive workflows, yet a fluency trap often inflates trust without improving decision-making. Consequently, an overemphasis on explainability leaves systems largely passive. Our findings show that current AI systems remain largely passive due to an overreliance on explainability-centric designs and that transitioning AI to an active teammate requires adaptive, context-aware interactions that support shared mental models and the dynamic negotiation of authority between humans and AI.
Authors:Kexin Quan, Jessie Chin
Abstract:
Many real-world decisions rely on information search, where people sample evidence and decide when to stop under uncertainty. The uncertainty in the environment, particularly how diagnostic evidence is distributed, causes complexities in information search, further leading to suboptimal decision-making outcomes. Yet AI decision support often targets outcome optimization, and less is known about how to scaffold search without increasing cognitive load. We introduce SERA, an LLM-based assistant that provides either gist or verbatim feedback during search. Across two experiments (N1=54, N2=54), we examined decision-making outcomes and information search in SERA-Gist, SERA-Verbatim, and a no-feedback baseline across three environments varying in uncertainty. The uncertainty in environment is operationalized by the perceived gain of information across the course of sampling, which individuals may experience diminishing return of information gain (decremental; low-uncertainty), or a local drop of information gain (local optimum; medium-uncertainty), or no patterns in information gain (high-uncertainty), as they search more. Individuals show more accurate decision outcomes and are more confident with SERA support, especially under higher uncertainty. Gist feedback was associated with more efficient integration and showed a descriptive pattern of reduced oversampling, while verbatim feedback promoted more extensive exploration. These findings establish feedback representation as a design lever when search matters, motivating adaptive systems that match feedback granularity to uncertainty.
Authors:Jisung Shin, Daniel Platnick, Marjan Alirezaie, Hossein Rahnama
Abstract:
Perspective-Aware AI requires modeling evolving internal states--goals, emotions, contexts--not merely preferences. Progress is limited by a data bottleneck: digital footprints are privacy-sensitive and perspective states are rarely labeled. We propose Situation Graph Prediction (SGP), a task that frames perspective modeling as an inverse inference problem: reconstructing structured, ontology-aligned representations of perspective from observable multimodal artifacts. To enable grounding without real labels, we use a structure-first synthetic generation strategy that aligns latent labels and observable traces by design. As a pilot, we construct a dataset and run a diagnostic study using retrieval-augmented in-context learning as a proxy for supervision. In our study with GPT-4o, we observe a gap between surface-level extraction and latent perspective inference--indicating latent-state inference is harder than surface extraction under our controlled setting. Results suggest SGP is non-trivial and provide evidence for the structure-first data synthesis strategy.
Authors:Gun Woo, Park, Frederik Brudy, George Fitzmaurice, Fraser Anderson
Abstract:
Virtual Production (VP) professionals often face challenges accessing tacit knowledge and creative intent, which are important in forming common ground with collaborators and in contributing more effectively and efficiently to the team. From our formative study (N=23) with a follow-up interview (N=6), we identified the significance and prevalence of this challenge. To help professionals access knowledge, we present GroundLink, a Unity add-on that surfaces meeting-derived knowledge directly in the editor to support establishing common ground. It features a meeting knowledge dashboard for capturing and reviewing decisions and comments, constraint-aware feedforward that proactively informs the editor environment, and cross-modal synchronization that provides referential links between the dashboard and the editor. A comparative study (N=12) suggested that GroundLink help users build common ground with their team while improving perceived confidence and ease of editing the 3D scene. An expert evaluation with VP professionals (N=5) indicated strong potential for GroundLink in real-world workflows.
Authors:Amy Koike, Serena Ge Guo, Xinning He, Callie Y. Kim, Dakota Sullivan, Bilge Mutlu
Abstract:
Robot morphology, the form, shape, and structure of robots, is a key design space in human-robot interaction (HRI), shaping how robots function, express themselves, and interact with people. Yet, despite its importance, little is known about how design frameworks can guide systematic form exploration. To address this gap, we introduce Elements of Robot Morphology, a framework that identifies five fundamental elements: perception, articulation, end effectors, locomotion, and structure. Derived from an analysis of existing robots, the framework supports structured exploration of diverse robot forms. To operationalize the framework, we developed Morphology Exploration Blocks (MEB), a set of tangible blocks that enable hands-on, collaborative experimentation with robot morphologies. We evaluate the framework and toolkit through a case study and design workshops, showing how they support analysis, ideation, reflection, and collaborative robot design.
Authors:Aditya Gulati, Nuria Oliver
Abstract:
As chatbots increasingly blur the boundary between automated systems and human conversation, the foundations of trust in these systems warrant closer examination. While regulatory and policy frameworks tend to define trust in normative terms, the trust users place in chatbots often emerges from behavioral mechanisms. In many cases, this trust is not earned through demonstrated trustworthiness but is instead shaped by interactional design choices that leverage cognitive biases to influence user behavior. Based on this observation, we propose reframing chatbots not as companions or assistants, but as highly skilled salespeople whose objectives are determined by the deploying organization. We argue that the coexistence of competing notions of "trust" under a shared term obscures important distinctions between psychological trust formation and normative trustworthiness. Addressing this gap requires further research and stronger support mechanisms to help users appropriately calibrate trust in conversational AI systems.
Authors:Jialin Li, Zhenhao Chen, Hanjun Luo, Hanan Salam
Abstract:
LLM-based agents can complete tasks correctly yet still frustrate users through poor interaction patterns, such as excessive confirmations, opaque reasoning, or misaligned pacing. Current benchmarks evaluate task accuracy but overlook how agents interact: whether they infer preferences from implicit cues, adapt dynamically, or maintain fine-grained interaction quality. We introduce Prefix, a configurable environment that evaluates both what agents accomplish and how they interact. Central to Prefix is the Interaction-as-a-Tool (IaaT) paradigm, which treats interaction behaviors as structured tool calls, unifying them with existing evaluation frameworks. We define 31 preference settings across 14 attributes and formalize user experience (UX) as a core metric alongside task accuracy. A composite LLM-as-a-Judge mechanism across seven UX dimensions achieves strong aggregate reliability (ICC > 0.79), high internal consistency (alpha = 0.943), and human correlation (rho = 0.52-0.78). Preference-aware agents show 7.6% average UX improvement and 18.5% gain in preference alignment. Our work is openly accessible.
Authors:Han Meng, Qiuyuan Lyu, Peinuan Qin, Yitian Yang, Renwen Zhang, Wen-Chieh Lin, Yi-Chieh Lee
Abstract:
Exploring causal relationships for qualitative data analysis in HCI and social science research enables the understanding of user needs and theory building. However, current computational tools primarily characterize and categorize qualitative data; the few systems that analyze causal relationships either inadequately consider context, lack credibility, or produce overly complex outputs. We first conducted a formative study with 15 participants interested in using computational tools for exploring causal relationships in qualitative data to understand their needs and derive design guidelines. Based on these findings, we designed and implemented QualCausal, a system that extracts and illustrates causal relationships through interactive causal network construction and multi-view visualization. A feedback study (n = 15) revealed that participants valued our system for reducing the analytical burden and providing cognitive scaffolding, yet navigated how such systems fit within their established research paradigms, practices, and habits. We discuss broader implications for designing computational tools that support qualitative data analysis.
Authors:Maia Stiber, Sameer Khan, Russell Taylor, Chien-Ming Huang
Abstract:
In the real world, robots frequently make errors, yet little is known about people's social responses to errors outside of lab settings. Prior work has shown that social signals are reliable and useful for error management in constrained interactions, but it is unclear if this holds in the real world - especially with a non-social robot in repeated and group interactions with successive or propagated errors. To explore this, we built a coffee robot and conducted a public field deployment ($N = 49$). We found that participants consistently expressed varied social signals in response to errors and other stimuli, particularly during group interactions. Our findings suggest that social signals in the wild are rich (with participants volunteering information about the interaction), but "noisy." We discuss lessons, benefits, and challenges for using social signals in real-world HRI.
Authors:Xinyi Wen, Lena Hegemann, Xiaofu Jin, Shuai Ma, Antti Oulasvirta
Abstract:
Aligning text-to-image generation with user intent remains challenging, for users who provide ambiguous inputs and struggle with model idiosyncrasies. We propose Adaptive Prompt Elicitation (APE), a technique that adaptively asks visual queries to help users refine prompts without extensive writing. Our technical contribution is a formulation of interactive intent inference under an information-theoretic framework. APE represents latent intent as interpretable feature requirements using language model priors, adaptively generates visual queries, and compiles elicited requirements into effective prompts. Evaluation on IDEA-Bench and DesignBench shows that APE achieves stronger alignment with improved efficiency. A user study with challenging user-defined tasks demonstrates 19.8% higher alignment without workload overhead. Our work contributes a principled approach to prompting that, for general users, offers an effective and efficient complement to the prevailing prompt-based interaction paradigm with text-to-image models.
Authors:Saleh Afzoon, Amin Beheshti, Usman Naseem
Abstract:
Understanding and classifying user personas is critical for delivering effective personalization. While persona information offers valuable insights, its full potential is realized only when contextualized, linking user characteristics with situational context to enable more precise and meaningful service provision. Existing systems often treat persona and context as separate inputs, limiting their ability to generate nuanced, adaptive interactions. To address this gap, we present PersoPilot, an agentic AI-Copilot that integrates persona understanding with contextual analysis to support both end users and analysts. End users interact through a transparent, explainable chat interface, where they can express preferences in natural language, request recommendations, and receive information tailored to their immediate task. On the analyst side, PersoPilot delivers a transparent, reasoning-powered labeling assistant, integrated with an active learning-driven classification process that adapts over time with new labeled data. This feedback loop enables targeted service recommendations and adaptive personalization, bridging the gap between raw persona data and actionable, context-aware insights. As an adaptable framework, PersoPilot is applicable to a broad range of service personalization scenarios.
Authors:Saleh Afzoon, MohammadHossein Ahmadi, Usman Naseem, Amin Beheshti
Abstract:
Personalization and contextual coherence are two essential components in building effective persona-grounded dialogue systems. These aspects play a crucial role in enhancing user engagement and ensuring responses are more relevant and consistent with user identity. However, recent studies indicate that open-source large language models (LLMs) continue to struggle to generate responses that are both contextually grounded and aligned with persona cues, despite exhibiting strong general conversational abilities like fluency and naturalness. We present PersoDPO, a scalable preference optimisation framework that uses supervision signals from automatic evaluations of responses generated by both closed-source and open-source LLMs to fine-tune dialogue models. The framework integrates evaluation metrics targeting coherence and personalization, along with a length-format compliance feature to promote instruction adherence. These signals are combined to automatically construct high-quality preference pairs without manual annotation, enabling a scalable and reproducible training pipeline. Experiments on the FoCus dataset show that an open-source language model fine-tuned with the PersoDPO framework consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.
Authors:Erzhen Hu, Frederik Brudy, David Ledo, George Fitzmaurice, Fraser Anderson
Abstract:
In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film's possibilities before fullscale production, yet conventional approaches involve trade-offs in efficiency and expressiveness. Hand-drawn storyboards often lack spatial precision needed for complex cinematography, while 3D previsualization demands expertise and high-quality rigged assets. To address this gap, we present PrevizWhiz, a system that leverages rough 3D scenes in combination with generative image and video models to create stylized video previews. The workflow integrates frame-level image restyling with adjustable resemblance, time-based editing through motion paths or external video inputs, and refinement into high-fidelity video clips. A study with filmmakers demonstrates that our system lowers technical barriers for film-makers, accelerates creative iteration, and effectively bridges the communication gap, while also surfacing challenges of continuity, authorship, and ethical consideration in AI-assisted filmmaking.
Authors:Ziwen Li, Ziang Xiao, Tianshi Li
Abstract:
Collecting data on sensitive topics remains challenging in HCI, as participants often withhold information due to privacy concerns and social desirability bias. While chatbots' perceived anonymity may reduce these barriers, research paradoxically suggests people tend to over-share personal or sensitive information with chatbots. In this work, we explore privacy controls in chatbot interviews to address this problem. The privacy control allows participants to revise their transcripts at the end of the interview, featuring two design variants: free editing and AI-aided editing. In a between-subjects study \red{($N=188$)}, we compared no-editing, free-editing, and AI-aided editing conditions in a chatbot-based interview on a sensitive topic. Our results confirm the prevalent issue of oversharing in chatbot-based interviews and show that AI-aided editing serves as an effective privacy-control mechanism, reducing PII disclosure while maintaining data quality and user engagement, thereby offering a promising approach to balancing ethical practice and data quality in such interviews.
Authors:Avinash Ajit Nargund, Andrea M. Park, Tobias Höllerer, Misha Sra
Abstract:
While augmented reality (AR) research demonstrates benefits of embedded visualizations for gross motor training, its applicability to facial exercises remains under-explored. Providing effective real-time feedback for facial muscle training presents unique design challenges, given the complexity of facial musculature. We developed three AR feedback approaches varying in spatial relationship to the user: situated (screen-fixed), proxy-embedded (on a mannequin), and fully embedded (overlaid on the user's face). In a within-subjects study (N=24), we measured exercise accuracy, cognitive load, and user preference during facial training tasks. The embedded feedback reduced cognitive load and received higher preference ratings, while the situated feedback enabled more precise corrections and higher accuracy. Qualitative analysis revealed a key design tension: embedded feedback improved experience but created self-consciousness and interpretive difficulty. We distill these insights into design considerations addressing the trade-offs for facial training systems, with implications for rehabilitation, performance training, and motor skill acquisition.
Authors:Avinash Ajit Nargund, Andrew L. Huard, Tobias Höllerer, Misha Sra
Abstract:
Outdoor virtual reality (VR) places users in dynamic physical environments where they must remain aware of real-world obstacles, including static structures and moving bystanders, while immersed in a virtual scene. This dual demand introduces challenges for both user safety and presence. Millimeter-wave (mmWave) radar offers a privacy-preserving alternative to camera-based sensing by detecting obstacles without capturing identifiable visual imagery, yet effective methods for communicating its sparse spatial information to users remain underexplored. In this work, we developed and validated WaveWalkerClone, a reproduction of the WaveWalker system, to establish reliable radar- and GPS-IMU-based sensing under varied outdoor lighting conditions. Building on this feasibility validation, we conducted a user study (n=18) comparing three visualization techniques for radar-detected obstacles : (1) diegetic alien avatars that visually embed obstacles within the virtual narrative, (2) non-diegetic human avatars represented obstacles as humans inconsistent with the virtual narrative, and (3) abstract point clouds centered around the obstacles conveying spatial data without anthropomorphic or narrative associations. Our results show that all three approaches supported user safety and situational awareness, but yielded distinct trade-offs in perceived effort, frustration, and user preference. Qualitative feedback further revealed divergent user responses across conditions, highlighting the limitations of a one-size-fits-all approach. We conclude with design considerations for obstacle visualization in outdoor VR systems that seek to balance immersion, safety, and bystander privacy.
Authors:Yuheng Shao, Junjie Xiong, Chaoran Wu, Xiyuan Wang, Ziyu Zhou, Yang Ouyang, Qinyi Tao, Quan Li
Abstract:
Applying the keyword method for vocabulary memorization remains a significant challenge for L1 Chinese-L2 English learners. They frequently struggle to generate phonologically appropriate keywords, construct coherent associations, and create vivid mental imagery to aid long-term retention. Existing approaches, including fully automated keyword generation and outcome-oriented mnemonic aids, either compromise learner engagement or lack adequate process-oriented guidance. To address these limitations, we conducted a formative study with L1 Chinese-L2 English learners and educators (N=18), which revealed key difficulties and requirements in applying the keyword method to vocabulary learning. Building on these insights, we introduce WordCraft, a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs). WordCraft scaffolds the keyword method by guiding learners through keyword selection, association construction, and image formation, thereby enhancing the effectiveness of vocabulary memorization. Two user studies demonstrate that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.
Authors:MH Farhadi, Ali Rabiee, Sima Ghafoori, Anna Cetera, Andrew Fisher, Reza Abiri
Abstract:
Shared autonomy systems require principled methods for inferring user intent and determining appropriate assistance levels. This is a central challenge in human-robot interaction, where systems must be successful while being mindful of user agency. Previous approaches relied on static blending ratios or separated goal inference from assistance arbitration, leading to suboptimal performance in unstructured environments. We introduce BRACE (Bayesian Reinforcement Assistance with Context Encoding), a novel framework that fine-tunes Bayesian intent inference and context-adaptive assistance through an architecture enabling end-to-end gradient flow between intent inference and assistance arbitration. Our pipeline conditions collaborative control policies on environmental context and complete goal probability distributions. We provide analysis showing (1) optimal assistance levels should decrease with goal uncertainty and increase with environmental constraint severity, and (2) integrating belief information into policy learning yields a quadratic expected regret advantage over sequential approaches. We validated our algorithm against SOTA methods (IDA, DQN) using a three-part evaluation progressively isolating distinct challenges of end-effector control: (1) core human-interaction dynamics in a 2D human-in-the-loop cursor task, (2) non-linear dynamics of a robotic arm, and (3) integrated manipulation under goal ambiguity and environmental constraints. We demonstrate improvements over SOTA, achieving 6.3% higher success rates and 41% increased path efficiency, and 36.3% success rate and 87% path efficiency improvement over unassisted control. Our results confirmed that integrated optimization is most beneficial in complex, goal-ambiguous scenarios, and is generalizable across robotic domains requiring goal-directed assistance, advancing the SOTA for adaptive shared autonomy.
Authors:Bhada Yun, Evgenia Taranova, April Yi Wang
Abstract:
AI chatbots are shifting from tools to companions. This raises critical questions about agency: who drives conversations and sets boundaries in human-AI chatrooms? We report a month-long longitudinal study with 22 adults who chatted with Day, an LLM companion we built, followed by a semi-structured interview with post-hoc elicitation of notable moments, cross-participant chat reviews, and a 'strategy reveal' disclosing Day's vertical (depth-seeking) vs. horizontal (breadth-seeking) modes. We discover that agency in human-AI chatrooms is an emergent, shared experience: as participants claimed agency by setting boundaries and providing feedback, and the AI was perceived to steer intentions and drive execution, control shifted and was co-constructed turn-by-turn. We introduce a 3-by-5 framework mapping who (human, AI, hybrid) x agency action (Intention, Execution, Adaptation, Delimitation, Negotiation), modulated by individual and environmental factors. Ultimately, we argue for translucent design (i.e. transparency-on-demand), spaces for agency negotiation, and guidelines toward agency-aware conversational AI.
Authors:Bhada Yun, Renn Su, April Yi Wang
Abstract:
Does AI understand human values? While this remains an open philosophical question, we take a pragmatic stance by introducing VAPT, the Value-Alignment Perception Toolkit, for studying how LLMs reflect people's values and how people judge those reflections. 20 participants texted a human-like chatbot over a month, then completed a 2-hour interview with our toolkit evaluating AI's ability to extract (pull details regarding), embody (make decisions guided by), and explain (provide proof of) human values. 13 participants left our study convinced that AI can understand human values. Participants found the experience insightful for self-reflection and found themselves getting persuaded by the AI's reasoning. Thus, we warn about "weaponized empathy": a potentially dangerous design pattern that may arise in value-aligned, yet welfare-misaligned AI. VAPT offers concrete artifacts and design implications to evaluate and responsibly build value-aligned conversational agents with transparency, consent, and safeguards as AI grows more capable and human-like into the future.
Authors:Rudrajit Choudhuri, Christopher Sanchez, Margaret Burnett, Anita Sarma
Abstract:
Context: Many students now use generative AI in their coursework, yet its effects on intellectual development remain poorly understood. While prior work has investigated students' cognitive offloading during episodic interactions, it remains unclear whether using genAI routinely is tied to more fundamental shifts in students' thinking habits. Objective: We investigate (RQ1-How): how students' trust in and routine use of genAI affect their cognitive engagement -- specifically, reflection, need for understanding, and critical thinking in STEM coursework. Further, we investigate (RQ2-Who): which students are particularly vulnerable to these cognitive disengagement effects. Method: We drew on dual-process theory, cognitive offloading, and automation bias literature to develop a statistical model explaining how and to what extent students' trust-driven routine use of genAI affected their cognitive engagement habits in coursework, and how these effects differed across students' cognitive styles. We empirically evaluated this model using Partial Least Squares Structural Equation Modeling on survey data from 299 STEM students across five North American universities. Results: Students who trusted and routinely used genAI reported significantly lower cognitive engagement. Unexpectedly, students with higher technophilic motivations, risk tolerance, and computer self-efficacy -- traits often celebrated in STEM -- were more prone to these effects. Interestingly, prior experience with genAI or academia did not protect them from cognitively disengaging. Implications: Our findings suggest a potential cognitive debt cycle in which routine genAI use progressively weakens students' intellectual habits, potentially driving over-reliance and escalating usage. This poses critical challenges for curricula and genAI system design, requiring interventions that actively support cognitive engagement.
Authors:Zeyang Huang, Takanori Fujiwara, Angelos Chatzimparmpas, Wandrille Duchemin, Andreas Kerren
Abstract:
We present a new nonlinear dimensionality reduction method, MAPLE, that enhances UMAP by improving manifold modeling. MAPLE employs a self-supervised learning approach to more efficiently encode low-dimensional manifold geometry. Central to this approach are maximum manifold capacity representations (MMCRs), which help untangle complex manifolds by compressing variances among locally similar data points while amplifying variance among dissimilar data points. This design is particularly effective for high-dimensional data with substantial intra-cluster variance and curved manifold structures, such as biological or image data. Our qualitative and quantitative evaluations demonstrate that MAPLE can produce clearer visual cluster separations and finer subcluster resolution than UMAP while maintaining comparable computational cost.
Authors:Yuansong Xu, Yichao Zhu, Haokai Wang, Yuchen Wu, Yang Ouyang, Hanlu Li, Wenzhe Zhou, Xinyu Liu, Chang Jiang, Quan Li
Abstract:
Large language models (LLMs) have shown considerable potential in supporting medical diagnosis. However, their effective integration into clinical workflows is hindered by physicians' difficulties in perceiving and trusting LLM capabilities, which often results in miscalibrated trust. Existing model evaluations primarily emphasize standardized benchmarks and predefined tasks, offering limited insights into clinical reasoning practices. Moreover, research on human-AI collaboration has rarely examined physicians' perceptions of LLMs' clinical reasoning capability. In this work, we investigate how physicians perceive LLMs' capabilities in the clinical reasoning process. We designed clinical cases, collected the corresponding analyses, and obtained evaluations from physicians (N=37) to quantitatively represent their perceived LLM diagnostic capabilities. By comparing the perceived evaluations with benchmark performance, our study highlights the aspects of clinical reasoning that physicians value and underscores the limitations of benchmark-based evaluation. We further discuss the implications of opportunities for enhancing trustworthy collaboration between physicians and LLMs in LLM-supported clinical reasoning.
Authors:Yang Ouyang, Shenghan Gao, Ruichuan Wang, Hailiang Zhu, Yuheng Shao, Xiaoyu Gu, Quan Li
Abstract:
Online comments significantly influence users' judgments, yet their presentation, often determined by platform algorithms, can introduce biases, such as anchoring effects, which distort reasoning. While existing research emphasizes mitigating individual cognitive biases, the evolution of user judgments during comment engagement remains overlooked. This study investigates how presentation cues impact reasoning and explores interface design strategies to mitigate bias. Through a preliminary experiment (N=18) and a co-design workshop, we identified key challenges users face across a four-stage process and distilled four design requirements: pre-engagement framing, interactive organization, reflective prompts, and synthesis support. Based on these insights, we developed CommSense, an on-the-fly plugin that enhances user engagement with online comments by providing visual overviews and lightweight prompts to guide reasoning. A between-subject evaluation (N=24) demonstrates that CommSense improves bias awareness and reflective thinking, helping users produce more comprehensive, evidence-based rationales while maintaining high usability.
Authors:Yang Ouyang, Yuansong Xu, Chang Jiang, Yifan Jin, Haoran Jiang, Quan Li
Abstract:
Preparing an oral case presentation (OCP) is a crucial skill for medical students, requiring clear communication of patient information, clinical findings, and treatment plans. However, inconsistent student participation and limited guidance can make this task challenging. While Large Language Models (LLMs) can provide structured content to streamline the process, their role in facilitating skill development and supporting medical education integration remains underexplored. To address this, we conducted a formative study with six medical educators and developed CaseMaster, an interactive probe that leverages LLM-generated content tailored to medical education to help users enhance their OCP skills. The controlled study suggests CaseMaster has the potential to both improve presentation quality and reduce workload compared to traditional methods, an implication reinforced by expert feedback. We propose guidelines for educators to develop adaptive, user-centered training methods using LLMs, while considering the implications of integrating advanced technologies into medical education.
Authors:Peinuan Qin, Chi-Lan Yang, Nattapat Boonprakong, Jingzhu Chen, Yugin Tan, Yi-Chieh Lee
Abstract:
AI-assisted writing raises concerns about autonomy and ownership when benefiting writers. Personalization has been proposed as an effective solution while also risking writers' reliance on AI and behavior shifting. For better personalization design, existing studies rely on interaction and information solely within the writing phase; however, few studies have examined how reading behaviors can inform personalized writing. This study investigates the effects of integrating reading highlights for personalization on AI-assisted writing. A between-subjects study with 46 participants revealed that the personalization condition encouraged participants to produce more highlights. However, highlighting unexpectedly shifted from a sense-making strategy to an instrumental act of "feeding the AI," leading to significant reliance on AI and declines in writers' sense of autonomy, ownership, and self-credit. These findings indicate personalization risks in AI-assisted writing, emphasize the importance of personalization strategies, and provide design implications.
Authors:Sharifa Sultana, Rupali Samad, Mehzabin Haque, Zinnat Sultana, Zulkarin Jahangir, B M Mainul Hossain, Rashed Mujib Noman, Syed Ishtiaque Ahmed
Abstract:
Artificial Intelligence (AI) readiness in the Global South extends beyond infrastructure to include curriculum design, workforce development, and cross-sector collaboration. Bangladesh, ranked 82nd in the 2023 Oxford Insights AI Readiness Index, exhibits significant deficits in technology capacity and research ecosystems, despite strong governmental visions. While HCI and ICTD research have explored digital inclusion and responsible AI, little empirical work examines how educational, industrial, and policy domains intersect to shape readiness. We present a multi-method qualitative study of AI readiness in Bangladesh, combining institutional analyses, 59 stakeholder interviews, and curriculum benchmarking against global exemplars. Findings reveal outdated curricula, limited faculty upskilling, inadequate computing resources, entrenched gender disparities, and the near-total absence of AI ethics instruction. We contribute empirical mapping of current practices, identification of structural and cultural barriers, and actionable pathways for embedding human-centered, inclusive, and responsible AI practices into national agendas, advancing equitable innovation in emerging AI ecosystems.
Authors:Sharifa Sultana, Pratyasha Saha, Nadira Nowsher, Sumaia Arefin Ritu, Zinnat Sultana, Syed Ishtiaque Ahmed, S M Taiabul Haque
Abstract:
As deepfake technology becomes more accessible, concerns about its misuse and societal impact are escalating, particularly in regions like the Global South where digital literacy and regulatory measures are often limited. While previous research has explored deepfakes in contexts such as detection and media manipulation, there is a noticeable gap in understanding how individuals in these regions perceive and interact with deepfake media. This study addresses this gap by investigating how Bangladeshi women perceive deepfakes and the socio-cultural factors influencing their awareness, concerns, and responses to this technology. Drawing on 15 semi-structured interviews, we uncover how cultural values, gendered norms, trust in institutions, and the prevalence of digital harassment shape their perceptions and coping mechanisms. Through this research, we aim to advance existing scholarship in HCI by offering insights into the design of culturally sensitive interventions, educational initiatives, and policy frameworks to address the challenges posed by deepfakes in the Global South.
Authors:Dipto Das, Afrin Prio, Pritu Saha, Shion Guha, Syed Ishtiaque Ahmed
Abstract:
This paper examines how non-resident Bangladeshis mobilized during the 2024 quota-reform turned pro-democracy movement, leveraging social platforms and remittance flows to challenge state authority. Drawing on semi-structured interviews, we identify four phases of their collective action: technology-mediated shifts to active engagement, rapid transnational network building, strategic execution of remittance boycott, reframing economic dependence as political leverage, and adaptive responses to government surveillance and information blackouts. We extend postcolonial computing by introducing the idea of "diasporic superposition," which shows how diasporas can exercise political and economic influence from hybrid positionalities that both contest and complicate power asymmetries. We reframe diaspora engagement by highlighting how migrants participate in and reshape homeland politics, beyond narratives of integration in host countries. We advance the scholarship on financial technologies by foregrounding their relationship with moral economies of care, state surveillance, regulatory constraints, and uneven international economic power dynamics. Together, these contributions theorize how transnational activism and digital technologies intersect to mobilize political change in Global South contexts.
Authors:Taufiq Daryanto, Xiaohan Ding, Kaike Ping, Lance T. Wilhelm, Yan Chen, Chris Brown, Eugenia H. Rho
Abstract:
As AI assistance becomes embedded in programming practice, researchers have increasingly examined how these systems help learners generate code and work more efficiently. However, these studies often position AI as a replacement for human collaboration and overlook the social and learning-oriented aspects that emerge in collaborative programming. Our work introduces human-human-AI (HHAI) triadic programming, where an AI agent serves as an additional collaborator rather than a substitute for a human partner. Through a within-subjects study with 20 participants, we show that triadic collaboration enhances collaborative learning and social presence compared to the dyadic human-AI (HAI) baseline. In the triadic HHAI conditions, participants relied significantly less on AI-generated code in their work. This effect was strongest in the HHAI-shared condition, where participants had an increased sense of responsibility to understand AI suggestions before applying them. These findings demonstrate how triadic settings activate socially shared regulation of learning by making AI use visible and accountable to a human peer, suggesting that AI systems that augment rather than automate peer collaboration can better preserve the learning processes that collaborative programming relies on.
Authors:Savvas Petridis, Michael Xieyang Liu, Alexander J. Fiannaca, Carrie J. Cai, Michael Terry
Abstract:
As AI systems (foundation models, agentic systems) grow increasingly capable of operating for minutes or hours at a time, users' prompts are transforming into highly detailed, elaborate specifications for the AI to autonomously work on. While interactive prompting has been extensively studied, comparatively less is known about how people communicate specifications for these types of long-horizon tasks. In a qualitative study in which 16 professionals drafted specifications for both a human colleague and an AI, we found a core divergence in how people specified problems to people versus AI: people approached communication with humans as providing a "compass", offering high-level intent to encourage flexible exploration. In contrast, communication with AI resembled painstakingly laying down "railway tracks": rigid, exhaustive instructions to minimize ambiguity and deviation. This strategy was driven by a perception that current AI has limited ability to infer intent, prioritize, and make judgments on its own. When envisioning an idealAI collaborator, users expressed a desire for a hybrid between current AI and human colleagues: a collaborator that blends AI's efficiency and large context window with the critical thinking and agency of a human colleague. We discuss design implications for future AI systems, proposing that they align on outcomes through generated rough drafts, verify feasibility via end-to-end "test runs," and monitor execution through intelligent check-ins, ultimately transforming AI from a passive instruction-follower into a reliable collaborator for ambiguous, long-horizon problems.
Authors:Kazi Noshin, Syed Ishtiaque Ahmed, Sharifa Sultana
Abstract:
While concerns about LLM sycophancy have grown among researchers and developers, how users themselves experience this behavior remains largely unexplored. We analyze Reddit discussions to investigate how users detect, mitigate, and perceive sycophantic AI. We develop the DCR epistemology that maps user experiences across three stages: observing sycophantic behaviors, detecting sycophancy, and responding to these behaviors. Our findings reveal that users employ various detection techniques, including cross-platform comparison and inconsistency testing. We document diverse mitigation approaches, including persona-based prompts and targeted language patterns in prompt engineering. We find sycophancy's effects are context-dependent rather than universally harmful. Specifically, vulnerable populations experiencing trauma, mental health challenges, or isolation actively seek and value sycophantic behaviors as emotional support. Users develop both technical and folk explanations for why sycophancy occurs. These findings challenge the assumption that sycophancy should be eliminated universally. We conclude by proposing context-aware AI design that balances risks with benefits of affirmative interaction, while discussing implications for user education and transparency.
Authors:JungMin Yun, JuneHyoung Kwon, MiHyeon Kim, YoungBin Kim
Abstract:
The rapid expansion of AI research has intensified the Reviewer Gap, threatening the peer-review sustainability and perpetuating a cycle of low-quality evaluations. This position paper critiques existing LLM approaches that automatically generate reviews and argues for a paradigm shift that positions LLMs as tools for assisting and educating human reviewers. We define the core principles of high-quality peer review and propose two complementary systems grounded in these foundations: (i) an LLM-assisted mentoring system that cultivates reviewers' long-term competencies, and (ii) an LLM-assisted feedback system that helps reviewers refine the quality of their reviews. This human-centered approach aims to strengthen reviewer expertise and contribute to building a more sustainable scholarly ecosystem.
Authors:Mayank Sharma, Roy Pea, Hari Subramonyam
Abstract:
In educational applications, LLMs exhibit several fundamental pedagogical limitations, such as their tendency to reveal solutions rather than support dialogic learning. We introduce ConvoLearn (https://huggingface.co/datasets/masharma/convolearn ), a dataset grounded in knowledge building theory that operationalizes six core pedagogical dimensions: cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. We construct a semi-synthetic dataset of 1250 tutor-student dialogues (20 turns each) in middle school Earth Science through controlled interactions between human teachers and a simulated student. Using QLoRA, we demonstrate that training on this dataset meaningfully shifts LLM behavior toward knowledge-building strategies. Human evaluation by 31 teachers shows our fine-tuned Mistral 7B (M = 4.10, SD = 1.03) significantly outperforms both its base version (M = 2.59, SD = 1.11) and Claude Sonnet 4.5 (M = 2.87, SD = 1.29) overall. This work establishes a potential framework to guide future development and evaluation of constructivist AI tutors.
Authors:Britt Besch, Tai Mai, Jeremias Thun, Markus Huff, Jörn Vogel, Freek Stulp, Samuel Bustamante
Abstract:
Whenever humans and robots work together, it is essential that unexpected robot behavior can be explained to the user. Especially in applications such as shared control the user and the robot must share the same model of the objects in the world, and the actions that can be performed on these objects. In this paper, we achieve this with a so-called model reconciliation framework. We leverage a Large Language Model to predict and explain the difference between the robot's and the human's mental models, without the need of a formal mental model of the user. Furthermore, our framework aims to solve the model divergence after the explanation by allowing the human to correct the robot. We provide an implementation in an assistive robotics domain, where we conduct a set of experiments with a real wheelchair-based mobile manipulator and its digital twin.
Authors:Alicia Guo, David Ledo, George Fitzmaurice, Fraser Anderson
Abstract:
As an emergent process, creativity relies on explorations via sampling and prototyping for problem construction. These activities compile knowledge, provide a context enveloping the solution, and answer questions. With Generative AI, practitioners can go beyond sampling existing media towards instantly generating and remixing new ones. We refer to this convergence as 'protosampling'. Using existing literature we ground a definition for protosampling and operationalize it through Atelier, a canvas-like system that leverages a variety of generative image and video models for visual creation. Atelier: (1) blends the spaces for thinking and creation, where both references and generated assets co-exist in one space, (2) provides various encapsulated technical workflows that focus on the activity at hand, and (3) enables navigating emergence through interactive visualizations, smart search, and collections. Protosampling as a lens reframes creative work to emphasize the process itself and how seemingly disjointed thoughts can tightly interweave into a final solution.
Authors:Zhihao Zhou, Weishan Ye, Li Zhang, Gan Huang, Zhen Liang
Abstract:
Continuous electroencephalography (EEG) emotion prediction aims to model the temporal evolution of human emotional states from EEG signals. Unlike conventional discrete emotion recognition, continuous prediction requires capturing long-range temporal dependencies and coherent emotional dynamics. However, existing methods mainly rely on point-wise regression and directly model noisy high-dimensional EEG features, limiting their ability to characterize continuous emotional evolution.To address these challenges, we propose EEGDancer, a dynamic emotional latent space learning framework for continuous EEG emotion prediction. The framework integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning-based trajectory optimization into a unified architecture.Specifically, a causal spatiotemporal Vector-Quantization Variational Autoencoder (VQ-VAE) is designed to learn structured emotional prototypes and construct a discrete-continuous emotional latent space from EEG signals. Based on the learned latent representations, a Transformer-based masked dynamic modeling strategy captures long-range emotional dependencies and temporal evolution patterns. Furthermore, continuous emotion prediction is formulated as a sequential decision-making problem, and a Soft Actor-Critic (SAC) framework is introduced to optimize emotional prediction trajectories at the sequence level instead of frame-wise local fitting.Extensive experiments on the SEED, SEED-IV, and Long-Term Naturalistic Emotion datasets demonstrate that EEGDancer consistently outperforms existing machine learning and deep learning methods. Ablation studies further verify the effectiveness of the proposed latent space and reinforcement learning-based trajectory optimization for modeling continuous EEG emotional dynamics.
Authors:Gregory Reardon, Max Linnander, Dustin Goetz, Neeli Tummala, Yon Visell
Abstract:
We address the challenge of engineering distributed haptic displays capable of reproducing multiple localized, independently addressable vibrations -- representing virtual tactile pixels -- at arbitrary locations on a surface. Our technique is based on the focusing of mechanical waves in a flexural plate using a sparse set of actuators. At tactile frequencies, wave diffraction prevents the formation of localized virtual tactile pixels at spatial scales relevant for multi-digit touch interactions. We overcome this limitation by augmenting the plate with a lattice of mechanical resonators, forming a locally resonant metamaterial plate. Coupling between the plate's dynamic modes and those of the resonators alters the dispersion relation governing wave transmission, introducing a slow-wave branch that enables focusing beyond the diffraction limit imposed by the unmodified plate. We use numerical simulations to engineer the dispersion relation of the metamaterial system for high-resolution focusing at tactile frequencies. We then fabricate a metamaterial tactile display and experimentally demonstrate virtual pixels that are far more localized than those generated on an otherwise identical plate without resonators, resulting in a tenfold reduction in virtual-pixel area. In behavioral experiments, we show that this system can deliver perceptually localized single- and multi-point tactile feedback and moving tactile sources while maintaining independent control over temporal waveforms at multiple display locations. The methods reported here can enable high-resolution haptic displays for widespread applications using a small number of actuated degrees of freedom.
Authors:Russian, Wu, Tim Moesgen, Myung Jin, Kim, Xinyan Yu, Naoki Kameyama, Anusha Withana, Marius Hoggenmueller, Luke Hespanhol
Abstract:
With growing research on haptic interfaces, Mediated Social Touch (MST) technologies offer the potential to record, synthesise, and reproduce (RSR) touch experiences across space and time, enabling, for instance, a hug from afar and from the past. Although much of the existing research highlights the direct benefits of these systems, such as reducing loneliness and providing emotional support, little attention has been paid to their broader sociotechnical impacts. To address this gap, we used the Future Ripples method to speculate on possible effects of MST. We conducted three workshops with 24 participants, including potential users, domain experts, and haptics researchers. Throughout these sessions, participants collectively envisioned possible future scenarios, alongside opportunities and threats, and proposed actionable responses. Our qualitative analysis organised these insights into four themes and three distinctive challenges. These findings offer haptics researchers intervention points across the RSR pipeline to inform MST design, alongside methodological insights from applying Future Ripples to MST technology.
Authors:Yiliang Zhou, Yawen Guo, Sairam Sutari, Jasmine Dhillon, Alexandra L. Beck, Emilie Chow, Steven Tam, Danielle Perret, Deepti Pandita, Gelareh Sadigh, Archana J. McEligot, Kai Zheng
Abstract:
Ambient artificial intelligence (AI) documentation tools are increasingly deployed to reduce clinician documentation burden, but their implications for biased language in clinical notes remain unclear. We conducted a large-scale comparison analysis of AI drafts and corresponding clinician finalized notes to quantify stigmatizing language changes pre- and post-editing. Using a lexicon-based natural language processing (NLP) pipeline, we measured (1) the prevalence of stigmatizing language in AI drafts, (2) the prevalence and term composition in final notes, and (3) the frequency of removal or introduction of stigmatizing terms. Across 66,297 paired note sections, 21.4% of AI draft sections contained at least one stigmatizing language mention, rising to 24.0% in clinician finalized versions. Introductions occurred more often than removals, suggesting clinician editing can be a net source of stigmatizing language entering the EHR with using Ambient AI.
Authors:Haley Noh, Aarna Chowdhary, Jeroen Ooge, Vincent Aleven, Conrad Borchers
Abstract:
Intelligent Tutoring Systems often grant learners shared control over skill and problem selection. Prior work suggests learners exhibit diverse task-selection strategies, such as avoiding challenge, which may interact with mastery learning systems that optimize task selection based on estimated knowledge. Algorithmic constraints on problem selection may help mitigate these effects, but testing such constraints in classrooms is costly. We propose a simulation-based framework to examine how learner task-selection strategies and system constraints shape mastery learning efficiency. Using interaction data from 261 students across two mathematical domains (equation solving and graph interpretation), we simulate strategies such as Weakness Targeting and Interleaving. We evaluate how these strategies affect overpractice as a measure of efficiency. Results show substantial variability across strategies, with risk-averse strategies producing higher levels of overpractice, especially for complex multi-step problems. Targeted system constraints significantly reduce inefficiencies for maladaptive strategies while minimally affecting already efficient strategies. These findings show how simulation grounded in student data can guide the redesign of shared-control tutoring systems before classroom deployment.
Authors:Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella
Abstract:
Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, especially in scenarios involving jaywalking, which is governed by latent personality traits that the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) produces more realistic interaction scenarios than training the SDC against fixed pedestrian policies, and that the resulting behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. This paper describes a MARL environment in which an SDC and 12 pedestrians are co-trained using Multi-Agent Proximal Policy Optimization (MAPPO). Pedestrian locomotion follows scripted Dijkstra pathfinding, while an RL policy controls high-level go/wait decisions. Jaywalking probability depends on a per-pedestrian personality trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, compared to 35% goals and 33% collisions for the best rule-based baseline. A speed differential metric shows that the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating that jaywalking encounters were not anticipated. Jaywalking accounted for 13% of crossing events but was associated with 62% of collisions. Co-training with MARL pedestrians reduced collisions by 30% relative to single-agent RL, as pedestrians learned to wait when the SDC approached at speed.
Authors:Felix Henry, Xiaochen Lin, Jiangyou Zhu, Yangfan, Bingqian Zhang, Min Chen, Shiyu Huang
Abstract:
Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.
Authors:Prakash Aryan, Cem Erdogdu, Kavinaya Kumarchokkappan, Timo Kehrer, Sebastiano Panichella
Abstract:
Operating a multi-robot fleet for simultaneous localization and mapping (SLAM) in applications such as building inspection or warehouse-aisle monitoring requires the operator to maintain spatial awareness of each robot's position and mapping state, a task that scales poorly on conventional 2D interfaces. We present MR-SLAM, a mixed reality (MR) system in which an operator wearing a Meta Quest 3 headset teleoperates three simulated TurtleBot3 robots through a passthrough view with real-world occlusion, while spatially anchored dashboard panels report mapping progress in situ. Each robot runs an independent SLAM Toolbox instance whose occupancy grid is merged in real time on a Robot Operating System 2 (ROS 2) back end. Across five 9-minute evaluation sessions, the system delivered scans at 8.83 +/- 0.16 Hz, mapped 17.9 +/- 0.8 m^2 of merged occupancy, and reached 94.7 +/- 0.5% cross-instance occupancy consistency across robot pairs. An additional session recorded 6.3 ms median transform jitter and 26.7 m^2 coverage of a 41 m^2 grid. We position MR-SLAM as a reference implementation for combining passthrough mixed reality supervision with multi-robot SLAM on consumer hardware.
Authors:Susanne Gaube, Markus Langer, Tim Miller, Kevin Baum, Raimund Dachselt, Anna Maria Feit, Ujwal Gadiraju, Harmanpreet Kaur, Mark T. Keane, Richard Landers, Johann Laux, Q. Vera Liao, Brian Lim, Linda Onnasch, Tim Schrills, Liz Sonenberg, Chenhao Tan, Nava Tintarev, Ziang Xiao, Hanwei Zhang
Abstract:
The use of Artificial Intelligence (AI) in high-risk, decision-making scenarios presents technical, safety, and normative challenges; problems that may only be ameliorated by human oversight. However, notions of human oversight lack a common foundational understanding: oversight architectures are not well defined, the roles involved remain unclear, and implementation steps are opaque. Hence, researchers and practitioners struggle to determine how to design, implement, and evaluate systems that enable effective human oversight. This paper advances a practical framework for effective human oversight of AI systems, based on a cross-disciplinary perspective that draws on insights from computer science, human-computer interaction, psychology, philosophy, and law. The core contributions are: (1) a foundational framework, with a working definition, architecture and processes for effective human oversight of AI systems; (2) an initial template for documenting oversight architectures and processes, applied to diverse domains; and (3) a synthesis of open research challenges that need to be considered in the emerging field of effective human oversight of AI systems.
Authors:Samuel Schapiro, Alexi Gladstone, Jonah Black, Heng Ji
Abstract:
Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.
Authors:Dengzhe Hou, Zihao Wu, Lingyu Jiang, Zirui Li, Fangzhou Lin, Kazunori D. Yamada
Abstract:
Electroencephalography (EEG) is a cornerstone of brain-computer interfaces and clinical neuroscience, yet deep learning models are typically trained and evaluated under a single, unreported preprocessing pipeline. We formalize preprocessing choices as a counterfactual intervention space and show that EEG predictions are surprisingly unstable under this space: across six datasets spanning four paradigms, up to 42% of trial-level predictions flip when only the preprocessing changes, a variability that standard uncertainty methods do not explicitly quantify because they condition on a fixed preprocessing pipeline. We provide three tools to make this instability measurable, decomposable, and reducible. First, a Walsh-Hadamard decomposition of the 2^7 pipeline space reveals that sensitivity is near-additive in practice under the binary intervention design, enabling efficient step-by-step optimization. Second, we introduce Preprocessing Uncertainty (PU), a per-trial diagnostic that captures a dimension of instability complementary to model-based confidence. Third, we study Normalized Adaptive PGI (NA-PGI), a graph-structured regularizer that exploits the compositional structure of preprocessing interventions as one mitigation strategy with clear scope conditions.
Authors:Foong Ming Lai, Yujin Tan, Han Meng, Yi-Chieh Lee
Abstract:
Code-switching in contact varieties like Singaporean English (Singlish) challenges natural language generation due to limited parallel data and rapid lexical evolution. We propose a retrieval-augmented generation (RAG) framework that externalizes dialectal knowledge into a curated lexicon, enabling controlled lexical code-switching without fine-tuning. Our approach retrieves candidate Singlish expressions and guides generation through sparse lexical substitution. Human evaluation with 164 Singaporean participants found RAG and zero-shot prompting equally natural and appropriate. Automatic analyses reveal different transformation regimes: zero-shot prompting induces extensive paraphrasing (median 23 token edits), whereas RAG performs minimal substitutions (median 1 edit) with higher semantic preservation (mean cosine similarity 0.978 vs. 0.926). Our results demonstrate that externalizing code-switching into lexical resources enables control and auditability without sacrificing perceived quality, offering practical advantages for rapidly evolving contact varieties.
Authors:Emma C. Wolfe, Ting Su, Olivier Tieleman, Thomas D. Hull, Matteo Malgaroli, Caitlin A. Stamatis
Abstract:
Background: Conversational AI chatbots are emerging as scalable mental health tools, but little is known about real world engagement or its relationship to clinical outcomes. Objective: To characterize engagement phenotypes among users of Ash, a purpose-built AI mental health chatbot, and examine associations with clinical change and working alliance. Methods: K-means clustering across eight behavioral features identified engagement phenotypes among 102,684 users. Subsamples completed the PHQ-9 (n=298), GAD-7 (n=298), and MSPSS (social support; n=194) baseline and 3 weeks; 11,437 users completed baseline Working Alliance Inventory (WAI). Results: Five engagement phenotypes emerged: Early Dropouts (52.2%), Power Users (1.6%), Intensive Users (4.1%), Weekly Users (25.3%), and a novel Concentrated User pattern (16.8%); across users, 66.9% had at least one overnight session (9pm-5am). Significant pre-post improvements occurred in depression (d = -0.51), anxiety (d = -0.57), and social support (d = 0.22). An observed dose-response gradient in self-reported depression improvement was replicated in a larger sample with model-predicted PHQ-9 (n = 23,813; Power Users d = -0.54; Early Dropouts d = -0.13). Higher working alliance predicted depression improvement and moderated the engagement-social support relationship. Conclusions: Engagement with AI mental health tools is multidimensional, and different clinical outcomes respond to different dimensions of use. Findings caution against treating session counts as a primary engagement metric and offer naturalistic evidence for the clinical value of purpose-built conversational AI.
Authors:Michael F Xu, Qiyao Yang, Heather Kirkorian, Bilge Mutlu
Abstract:
Family-school partnerships (FSP) are critical to children's development, yet families often face barriers such as time constraints, fragmented communication, and limited opportunities for meaningful engagement. As a step toward facilitating broader family-school partnerships, we explore a novel approach that integrates a social robot into family settings, specifically supporting home-based activities. Through interviews and co-design sessions, we designed and developed a robotic system informed by both parents and children, that supported, among other interactions, family communication about school topics. We evaluated the robot in a week-long, in-home study with 10 families. Our findings show how families integrated the robot into daily life, how parental facilitation styles shaped use, and how families perceived both the helpfulness and challenges of the robot. We contribute empirical insights, a modular system, and design implications for family- and child-robot interactions. We discuss ethical and privacy considerations, and broaden the design space for technologies supporting family-school partnerships.
Authors:Megan Li, Wendy Bickersteth, Ningjing Tang, Parv Kapoor, Khinezin Win, Peter Zhong, Jason I. Hong, Lorrie Faith Cranor, Hoda Heidari, Hong Shen
Abstract:
Despite growing concerns about the risks of Generative AI (GenAI), there is limited understanding of public perceptions of these risks and their associated failure modes -- defined as recurring patterns of sociotechnical breakdown across the GenAI lifecycle that contribute to risks of real-world harm. To address this gap, we present a survey instrument, validated with eight subject matter experts and deployed on a sample of 960 U.S.-based participants, to assess awareness and perceptions of GenAI's failure modes, their associated risks, and stakeholder responsibilities to address them. To support realism and content validity, our instrument is structured around scenarios grounded in publicly reported incidents and a taxonomy of GenAI's failure modes. Findings suggest that our instrument is (1) effective for assessing risk awareness and perceptions in a way that is grounded in people's current contexts of use, yet is extensible to new contexts that will inevitably arise; and (2) potentially useful for informing the design of AI literacy tools and interventions. We argue for AI literacy and governance approaches that align with how people encounter and reason about GenAI in everyday life.
Authors:Mohamed Ouf, Mariam Guizani
Abstract:
Open source projects depend on newcomers who stay, yet most leave after a single contribution. Contribution events such as Google Summer of Code, LFX Mentorship, Hacktoberfest, and 24 Pull Requests attract thousands of newcomers each year, but whether they produce lasting contributors remains unclear. We conduct the first matched-cohort study comparing 2,001 event-based and 2,001 organic contributors across 330 projects. Our results reveal three key findings. First, event contributors have significantly higher odds of becoming core contributors (12.1% vs. 9.6%, p < 0.001, OR = 1.31) and stay significantly longer (median 8.2 vs. 4.8 months). Second, each entry mechanism is associated with a fundamentally different engagement rhythm: 68.9% of mentorship contributors sustain Steady weekly activity across their first 12 weeks, whereas 61.0% of non-mentorship contributors exhibit Front-Loading and 57.0% of organic contributors exhibit Intermittent engagement (p < 0.001). Third, Steady engagement is associated with significantly longer retention regardless of group (median 13 vs. 8 months for Front-Loading), yet mentorship contributors who lose their program scaffolding show shorter retention than self-sustained non-mentorship contributors, revealing a mentor-dependency effect. A newcomer's first 12 weeks are strongly indicative of their long-term trajectory.
Authors:Xingyu Xiao, Mingwei Xiao, Hongbo Li, Jingang Liang, Jiejuan Tong, Haitao Wang
Abstract:
Digitalization has fundamentally transformed human system interaction in nuclear main control rooms, yet the quantitative mechanisms by which interfaces amplify procedural risks remain insufficiently understood. This study presents a systematic assessment of interface procedure coupling based on real operational events collected from 2021 to 2025 in a modern nuclear power plant. A reusable three dimensional labeling framework and a four factor interface mechanism model are developed to characterize layout, semantic, mismatch, and labeling deficiencies. Results show that interface issues function as a significant risk amplifier. A total of 42.6 percent of events involved interface deficiencies, and their presence more than doubled the likelihood of procedural deviation. Machine learning interpretation further reveals that composite interface procedure coupling, particularly driven by semantic mismatches and layout induced traps, is the dominant contributor to coupled failures. Simulator based validation confirms that semantic confusion accounts for 27.3 percent of interface induced errors, with overall error patterns consistent with historical data. The study provides a data driven HRA workflow for early vulnerability identification in digital control rooms and proposes a systematic framework for interface procedure semantic alignment to support risk informed design and verification.
Authors:Lucas Alexandre, João Rulff, Talisson Souza, Gustavo Moreira, Daniel de Oliveira, Claudio Silva, Fabio Miranda, Marcos Lage
Abstract:
The development of visual analytics (VA) systems has traditionally been a labor-intensive process, balancing design methodologies with complex software engineering practices. In domain-specific fields like urban VA, this challenge is amplified by heterogeneous data streams and a reliance on complex, multi-service architectures that hinder fast development, deployment, and reproducibility. Despite the richness of the urban VA literature, the field lacks a consolidated toolkit that encapsulates the core components of these systems, such as spatial data management, analytical processing, and visualization, into a unified, lightweight framework. In this paper, we introduce Autark, a serverless toolkit designed for the rapid prototyping of urban VA systems. Autark provides domain-aware abstractions through a self-contained architecture, enabling researchers to transition from design intention to deployed, shareable systems within hours. Furthermore, Autark's structured, tightly scoped interfaces make it well-suited for AI-assisted coding workflows, where LLMs produce more reliable code when composing from well-defined abstractions rather than generating complex solutions from scratch. Our contributions are: (1) the Autark toolkit, a serverless architecture for rapid prototyping of urban VA; (2) a comparative study of LLM coding effectiveness with and without Autark; and (3) a series of usage scenarios demonstrating its capability to streamline the creation of robust, shareable urban VA prototypes. Autark is available at https://autarkjs.org/.
Authors:Chenxi Wang, Haining Ding, Michal Gath-Morad
Abstract:
Buildings shape how people feel, yet the mechanisms through which specific facade properties drive affective states remain empirically underspecified. Here we introduce the Cambridge Facade Affect Dataset (CFAD), 86 orthogonally rectified facade images annotated with continuous arousal and valence ratings from 85 participants, and establish a validated pipeline linking machine-vision-derived surface metrics to human affective responses. Focusing on three quantifiable attributes, complexity, transparency (window-to-wall ratio), and materiality (proportion of natural versus artificial surface composition), we show that perceived complexity is the dominant affective predictor, with significant positive associations for both arousal (beta = 0.507, p < 0.001) and valence (beta = 0.376, p < 0.001) and a curvilinear amplification at higher complexity levels. Transparency exhibits an inverted-U relationship with valence, while increasing surface artificiality suppresses arousal and reduces pleasantness consistent with biophilic response theory. Critically, machine-derived metrics show limited direct predictive power over affective outcomes; mediation analyses reveal that human perceptual evaluation functions as a necessary intermediate layer, with perceived materiality significantly mediating the machine-valence relationship (indirect effect = -0.205, p = 0.003). Cross-context validation demonstrates moderate stability of complexity and materiality ratings across image-based and in-situ conditions, while affective responses, particularly valence, exhibit significant context-dependence (ICC = 0.332). These findings advance facade research from descriptive morphological analysis toward predictive, perception-grounded modelling, and provide an empirically validated basis for affect-informed design of the urban environment.
Authors:Tianyi Xiao, Yizi Chen, Sidi Wu, Peter Kiefer, Yan Feng, Martin Raubal
Abstract:
Sketch mapping is widely used in crime scene investigation (CSI) to document, interpret, and communicate spatial information. However, it is typically performed on 2D media, which limits its ability to represent 3D spatial relationships. We present HolmeSketcher, a generative 3D sketch mapping system that combines a front-end 3D drawing interface with a back-end deep learning pipeline to support object generation and scene reconstruction in extended reality. In a within-subject user study (N = 15), HolmeSketcher improved the spatial accuracy and interpretability of reconstructed scenes, but with a clear trade-off of higher task load and lower usability compared with paper-based 2D sketch mapping. By integrating findings from the user study and expert interviews (N = 3), we further derive three design implications for next-generation 3D sketch mapping tools for CSI.
Authors:Aleksandar Anžel, Zewen Yang, Georges Hattab
Abstract:
While the polar system may lack the universal familiarity of its Cartesian counterpart, it remains indispensable for certain tasks. Summary polar diagrams, such as Taylor and mutual information diagrams, address tasks like discovering relationships, visualizing data similarity, and quantifying correspondence. Although these diagrams are invaluable tools for uncovering data relationships, their polar nature can hinder intuitiveness and lead to issues like overplotting. We present a hybrid approach that combines overview+detail, aggregation, interactive filtering, Cartesian linking, and small multiples to enhance the clarity, comprehensiveness, and functionality of summary polar diagrams. We performed a user study to assess this approach's effectiveness, noting comparable response times among participants. Additionally, three domain experts with varying visualization experience reviewed an implemented solution applying summary polar diagrams to climate, data science (novel), and machine learning, refining the approach prior to the user study. The findings underscore the versatility of our approach in enhancing comprehension, accessibility, and utility.
Authors:Caitlin A. Stamatis, Emma C. Wolfe, Matteo Malgaroli, Thomas D. Hull
Abstract:
Background: Many people who could benefit from therapy do not receive it. Conversational AI is increasingly used for mental health support, yet it is unclear which barriers AI helps mitigate. We examined whether evaluation-sensitive (shame/stigma) and structural barriers (cost/coverage/access) to psychotherapy predict perceived helpfulness of an AI mental health conversational tool (Ash), and whether effects differ by prior therapy experience or user engagement. Methods: Participants (n=395) rated Ash's helpfulness (1-5) and described barriers to therapy. Open-text responses were coded for shame/stigma, access, and cost/coverage themes. Linear regressions examined associations between barriers and perceived helpfulness, adjusting for demographics and mental health, with moderation by therapy experience. Results: Shame/stigma (B=.45, p<.001) and access barriers (B=.31, p=.020) predicted higher perceived helpfulness but cost/coverage did not (B=.13, p=.262). Prior therapy experience moderated the shame effect (interaction B=.56, p=.036): shame predicted higher helpfulness among therapy-experienced users ($Δ$=.62, p<.001) but not therapy-naive users ($Δ$=.03, p=.877). Among therapy-experienced participants (n=258), shame/stigma (B=.75, p<.001) and access barriers (B=.51, p=.006) predicted rating Ash more favorably. Access barriers predicted higher engagement (IRR=1.64, p<.001) and cost/coverage barriers predicted 70% more sessions (IRR=1.70, p<.001). Shame/stigma was not associated with total sessions (IRR=.80, p=.094). Conclusions: AI mental health support was perceived as most helpful by users facing shame/stigma and access barriers, particularly for therapy-experienced individuals. Access and cost barriers were most predictive of usage intensity, suggesting unmet needs. Findings highlight the importance of aligning AI tools for emotional support with user-reported barriers.
Authors:Wong Kam-Kwai, Yi-Lin Ye, Wai Tong, Haobo Li, Kentaro Takahira, Aastha Bhatta, Sunil Poudyal, Charles Wang Wai Ng, Huamin Qu, Leni Yang
Abstract:
Landslides pose a significant threat to public safety, but their dynamic processes are difficult to analyze from post-event observation alone. Computational simulation is therefore essential, but it generates vast, abstract datasets that create a cognitive gap between the analyst and the real-world, physical terrain. While Immersive Analytics (IA) begins to bridge this gap by visualizing data in 3D, we explore how these systems evolve beyond abstract data and integrate data visceralization to enhance Situational Awareness (SA). We present LandSAR, an immersive analytics system that enhances SA for landslide analysis by visceralizing landslide data through integrated simulations and visualizations. LandSAR supports real-time simulations of landslide dynamics, prevention strategies, and climate impacts, enabling multi-perspective what-if analyses. The system uses 3D-printed terrain models as tangible interfaces to facilitate haptic feedback and enable gesture-based exploration, allowing for intuitive geographical perception. Expert interviews and workshops demonstrate that LandSAR effectively improves SA and engagement.
Authors:Kobi Hackenburg, Luke Hewitt, Caroline Wagner, Ben M. Tappin, Christopher Summerfield
Abstract:
There is substantial concern about the ability of advanced artificial intelligence to influence people's behaviour. A rapidly growing body of research has found that AI can produce large persuasive effects on people's attitudes, but whether AI can persuade people to take consequential real-world actions has remained unclear. In two large preregistered experiments N=17,950 responses from 14,779 people), we used conversational AI models to persuade participants on a range of attitudinal and behavioural outcomes, including signing real petitions and donating money to charity. We found sizable AI persuasion effects on these behavioural outcomes (e.g. +19.7 percentage points on petition signing). However, we observed no evidence of a correlation between AI persuasion effects on attitudes and behaviour. Moreover, we replicated prior findings that information provision drove effects on attitudes, but found no such evidence for our behavioural outcomes. In a test of eight behavioural persuasion strategies, all outperformed the most effective attitudinal persuasion strategy, but differences among the eight were small. Taken together, these results suggest that previous findings relying on attitudinal outcomes may generalize poorly to behaviour, and therefore risk substantially mischaracterizing the real-world behavioural impact of AI persuasion.
Authors:TianZe Zhang, Sirui Sun, Yuhang Xie, Xin Zhang, Zhiqiang Wu, Guojie Song
Abstract:
Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents often exhibit behavioral rigidity, a flaw frequently masked by the self-referential bias of current "LLM-as-a-judge" evaluations. By evaluating against empirical ground truth, we reveal a counter-intuitive phenomenon: increasing the intensity of prompt-driven reasoning does not enhance fidelity but rather exacerbates value polarization, collapsing population diversity. To address this, we propose the Context-Value-Action (CVA) architecture, grounded in the Stimulus-Organism-Response (S-O-R) model and Schwartz's Theory of Basic Human Values. Unlike methods relying on self-verification, CVA decouples action generation from cognitive reasoning via a novel Value Verifier trained on authentic human data to explicitly model dynamic value activation. Experiments on CVABench, which comprises over 1.1 million real-world interaction traces, demonstrate that CVA significantly outperforms baselines. Our approach effectively mitigates polarization while offering superior behavioral fidelity and interpretability.
Authors:Suyash Fulay, Prerna Ravi, Emily Kubin, Shrestha Mohanty, Michiel Bakker, Deb Roy
Abstract:
AI is increasingly used to scale collective decision-making, but far less attention has been paid to how such systems can support procedural legitimacy, particularly the conditions shaping losers' consent: whether participants who do not get their preferred outcome still accept it as fair. We ask: (1) how can AI help ground collective decisions in participants' different experiences and beliefs, and (2) whether exposure to these experiences can increase trust, understanding, and social cohesion even when people disagree with the outcome. We built a system that uses a semi-structured AI interviewer to elicit personal experiences on policy topics and an interactive visualization that displays predicted policy support alongside those voiced experiences. In a randomized experiment (n = 181), interacting with the visualization increased perceived legitimacy, trust in outcomes, and understanding of others' perspectives, even though all participants encountered decisions that went against their stated preferences. Our hope is that the design and evaluation of this tool spurs future researchers to focus on how AI can help not only achieve scale and efficiency in democratic processes, but also increase trust and connection between participants.
Authors:Yuanchen Bai, Zijian Ding, Ruixiang Han, Niti Parikh, Wendy Ju, Angelique Taylor
Abstract:
The rapid advancement of robotics, spanning expanded capabilities, more intuitive interaction, and more integration into real-world workflows, is reshaping what it means for humans and robots to coexist. Beyond sharing physical space, this coexistence is increasingly characterized by organizational embeddedness, temporal evolution, social situatedness, and open-ended uncertainty. However, prior work has largely focused on static snapshots of attitudes and acceptance, offering limited insight into how perceptions form and evolve, and what active role humans play in shaping coexistence as a dynamic process. We address these gaps through in-depth follow-up interviews with nine participants from a 14-week co-design study on healthcare robots. We identify the human perception space, including four interpretive dimensions (i.e., degree of decomposition, temporal orientation, scope of reasoning, and source of evidence). We enrich the conceptual framework of human-robot coexistence by conceptualizing the mutual relationship between the human perception space and the robot design space as a co-evolving loop, in which human needs, design decisions, situated interpretations, and social mediation continuously reshape one another over time. Building on this, we propose considerate human-robot coexistence, arguing that humans act not only as design contributors but also as interpreters and mediators who actively shape how robots are understood and integrated across deployment stages.
Authors:Daniel Ogenrwot, John Businge
Abstract:
Software Engineering 3.0 marks a paradigm shift in software development, in which AI coding agents are no longer just assistive tools but active contributors. While prior empirical studies have examined productivity gains and acceptance patterns in AI-assisted development, the challenges associated with integrating agent-generated contributions remain less understood. In particular, merge conflicts, a fundamental aspect of collaborative software development, remain underexplored in this context. In this paper, we present AgenticFlict, a large-scale dataset of textual merge conflicts in AI coding agent pull requests (Agentic PRs). The dataset comprises 142K+ Agentic PRs collected from 59K+ repositories, of which 107K+ are successfully processed through deterministic merge simulation. Our pipeline identifies 29K+ PRs exhibiting merge conflicts, yielding a conflict rate of 27.67%, and extracts 336K+ fine-grained conflict regions across these instances. Our preliminary exploratory analysis indicates that merge conflicts are both frequent and often substantial in AI-generated contributions, with noticeable variation across agents, emphasizing the need to better understand and manage integration challenges in AI-assisted software development. The dataset, code and supplementary materials are available in zenodo: https://doi.org/10.5281/zenodo.19396917.
Authors:Dina Albassam, Kexin Quan, Mengke Wu, Sanika Pande, ChengXiang Zhai, Yun Huang
Abstract:
YouTube is widely used for informal learning, where learners explore lectures and tutorials without a predefined curriculum. However, learning across videos remains fragmented: learners must decide what to watch, how videos relate, and how knowledge builds. Existing tools provide partial support but treat planning and learning as separate activities, lacking a persistent interaction structure that connects them. Grounded in self-regulated learning theory (SRLT), we introduce YT-Pilot, a pathway-aware learning system that operationalizes the learning pathway as a persistent, user-facing interaction structure spanning planning and learning. The pathway coordinates goal setting, planning, navigation, progress tracking, and cross-video assistance. Through a within-subjects study ($N=20$), we show that YT-Pilot significantly improves perceived goal clarity, pathway coherence, and progress tracking, while shifting interaction toward pathway-level reasoning across multiple resources.
Authors:Yibo Meng, Guangrui Fan, Bingyi Liu, Yingfangzhong Sun, Ruiqi Chen, Haipeng Mi
Abstract:
This study examines whether engagement with social robots translates into improved human-directed social abilities in autistic children. We conducted an 8-week home-based randomized controlled trial with 40 children aged 5--9 using a commercial social robot (Qrobot). Families were assigned to either continued robot access or robot withdrawal. Quantitative measures and caregiver interviews assessed anxiety, social motivation, emotion inference, and empathy. Results showed that continued robot access significantly reduced anxiety, confirming strong affective benefits and high usability. However, children in the withdrawal group demonstrated greater improvements in social motivation, emotion understanding, and empathic behaviors toward caregivers and peers. Qualitative findings revealed a "handoff versus siloing" pattern: withdrawal promoted reorientation toward human social interaction, while continued access concentrated engagement within the child--robot dyad and limited transfer to real-world contexts. We interpret these results as evidence that high engagement does not guarantee social transfer.
Authors:Celia Chen, Alex Leitch, Scotty Beland, Ingo Burghardt, William Conway, Rajesh Kumar Gnanasekaran, Marilyn Harbert, Emily Klein, Jennifer Golbeck
Abstract:
Incels are an online community of men who share a belief in extreme misogyny, the glorification of violence, and biological essentialism. They refer to their core ideology as "The Blackpill", a belief that physical attraction is the only path to romantic success and that women are only attracted to one very specific, hypermasculine archetype. This is not only a belief system; incels believe their ideology grounded in hard science. The research that incels use as evidence of their belief system is collected in an extensive online document, the Scientific Blackpill wiki page. In this research, we analyze the claims made on the wiki against the research cited to assess how the wiki authors are using or misusing science in support of their ideology. We find that the page largely cites legitimate science and describes it partly or mostly accurately. However, in discussing it, the results are often overgeneralized, stripped of context, or otherwise distorted to support the preexisting incel viewpoint. This echoes previous findings about motivated reasoning and borrowing scientific legitimacy in other misinformation and conspiracy-minded ideologies. We discuss the implications this has for understanding online radicalization and information quality.
Authors:Virmarie Maquiling, Yasmeen Abdrabou, Enkelejda Kasneci
Abstract:
Vergence is widely used as a proxy for depth perception and spatial attention in immersive and real-world eye-tracking studies. In this paper, we investigate how pupil size artefacts affect vergence estimates during real physical depth viewing with a head-mounted eye tracker. Using a beamsplitter setup with physically near and far targets, we elicited controlled convergent and divergent eye movements under static, luminance-modulated, and blockwise fixation conditions. Near and far targets were reliably separable in vergence angle across participants. However, pupil-vergence coupling varied substantially across individuals and conditions. Static illumination produced large inter-participant variability, while luminance modulation reduced this spread, yielding more clustered estimates. Blockwise and audio-cued recordings further showed that pupil-vergence coupling persists even without visual depth onsets. These results suggest that pupil size fluctuations can systematically influence vergence estimates, and that controlled viewing conditions can reduce--but not eliminate--this effect.
Authors:Virmarie Maquiling, Yasmeen Abdrabou, Enkelejda Kasneci
Abstract:
Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for mulit-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
Authors:Mingda Han, Huanqi Yang, Zehua Sun, Wenhao Li, Yanni Yang, Guoming Zhang, Yetong Cao, Weitao Xu, Pengfei Hu
Abstract:
Millimeter-wave (mmWave) radar enables privacy-preserving human activity recognition (HAR), yet real-world deployment remains hindered by costly annotation and poor transferability under domain shift. Although prior efforts partially alleviate these challenges, most still require retraining or adaptation for each new deployment setting. This keeps mmWave HAR in a repeated collect-tune-redeploy cycle, making scalable real-world deployment difficult. In this paper, we present RAGent, a deployment-time training-free framework for mmWave HAR that reformulates recognition as evidence-grounded inference over reusable radar knowledge rather than deployment-specific model optimization. Offline, RAGent constructs a reusable radar knowledge base through constrained cross-modal supervision, where a Vision-Language Model (VLM) transfers activity semantics from synchronized videos to paired radar segments without manual radar annotation. At deployment time, RAGent recognizes activities from radar alone by retrieving physically comparable precedents in an explicit kinematic space and resolving the final label through structured multi-role reasoning. The reasoning protocol is further refined offline through zero-gradient self-evolution. Extensive experiments on a self-collected dataset show that RAGent achieves 93.39% accuracy without per-domain retraining or target-domain adaptation, while generalizing robustly across domains.
Authors:Mingda Han, Huanqi Yang, Chaoqun Li, Wenhao Li, Guoming Zhang, Yanni Yang, Yetong Cao, Weitao Xu, Pengfei Hu
Abstract:
Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal frame-work that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat vibrations; and (3) a dual-stage strategy that combines signal-level onset detection and semantic-level coherence to align asynchronous radar and audio streams. Unlike liveness detection, which only confirms whether speech occurred, VoxAnchor verifies what was spoken through word-level content consistency, exposing localized edits that preserve identity and global authenticity cues. Extensive evaluations show that VoxAnchor achieves robust, fine-grained detection across diverse forgeries (editing, splicing, replay, deepfake) and conditions, with an overall EER of 0.017, low latency, and modest computational cost.
Authors:Ching Christie Pang, Xuetong Wang, Yuk Hang Tsui, Pan Hui
Abstract:
Online knowledge communities (OKC) such as Stack Exchange, Reddit, and Zhihu have long functioned as socio technical infrastructures for collective problem solving. The rapid adoption of Generative AI (GenAI) introduces both complementarity and substitution. Large language models (LLMs) offer faster, more accessible drafts, yet divert traffic and contributions away from OKC that also provided their training data. To understand how communities adapt under this systemic shock, we report a mixed-methods study combining an online survey (N=217) and interviews with 11 current users. Findings show that while users increasingly rely on AI for convenience, they still turn to OKC for complex, ambiguous, or trust sensitive questions. Participants express polarized attitudes toward AI, reflecting divergent hopes and uncertainties about its role. Yet across perspectives, sustaining sociability, empathy, and reciprocity emerges as essential for community resilience. We argue that GenAI's impact constitutes not a terminal decline but a design challenge: to reimagine socio-technical complementarities that balance automation's efficiency with human judgment, trust, and collective stewardship in the evolving knowledge commons. To decline or sustain, it is now or never to take action.
Authors:Jiamin Zheng, Yue Deng, Jessica Chen, Shujun Li, Yixin Zou, Jingjie Li
Abstract:
A new form of human trafficking has emerged across Chinese borders, where individuals are lured to Southeast Asia with fraudulent job offers and then coerced into operating online scams. Despite its massive economic and human toll, this scam-driven trafficking remains underexplored in academic research. Through qualitative analysis of 158 RedNote posts, we examined how Chinese online communities respond to this threat. Our findings reveal that perpetrators exploit cultural ties to recruit victims for cybercriminal roles within self-sustaining compounds, using sophisticated manipulation tactics. Survivors face serious reintegration barriers, including family rejection, as the cultural values that enable trafficking also hinder their recovery. While communities present protective strategies, efforts are complicated by doubts about the reliability of support and cross-border coordination. We discuss key implications for prevention, platform governance, and international cooperation against scam-driven trafficking. Warning: This paper contains descriptions of physical, psychological, and sexual abuse.
Authors:Wenhao Yang, Runzhi He, Minghui Zhou
Abstract:
Generative AI (GenAI) is playing an increasingly important role in open source software (OSS). Beyond completing code and documentation, GenAI is increasingly involved in issues, pull requests, code reviews, and security reports. Yet, cheaper generation does not mean cheaper review - and the resulting maintenance burden has pushed OSS projects to experiment with GenAI-specific rules in contribution guidelines, security policies, and repository instructions, even including a total ban on AI-assisted contributions. However, governing GenAI in OSS is far more than a ban-or-not question. The responses remain scattered, with neither a shared governance framework in practice nor a systematic understanding in research. Therefore, in this paper, we conduct a multi-stage analysis on various qualitative materials related to GenAI governance retrieved from 67 highly visible OSS projects. Our analysis identifies recurring concerns across contribution workflows, derives three governance orientations, and maps out 12 governance strategies and their implementation patterns. We show that governing GenAI in OSS extends well beyond banning - it requires coordinated responses across accountability, verification, review capacity, code provenance, and platform infrastructure. Overall, our work distills dispersed community practices into a structured overview, providing a conceptual baseline for researchers and a practical reference for maintainers and platform designers.
Authors:Ran Zhang, Yucong Lin, Zhaoli Su, Bowen Liu, Danni Ai, Tianyu Fu, Deqiang Xiao, Jingfan Fan, Yuanyuan Wang, Mingwei Gao, Yuwan Hu, Shuya Gao, Jingtao Li, Jian Yang, Hong Song, Hongliang Sun
Abstract:
Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
Authors:Zefei Xie, Yuhan Guo, Kai Xu
Abstract:
There are different goals for literature research, from understanding an unfamiliar topic to generate hypothesis for the next research project. The nature of literature research also varies according to user's familiarity level of the topic. For inexperienced researchers, identifying gaps in the existing literature and generating feasible hypothesis are crucial but challenging. While general ``deep research'' tools can be used, they are not designed for such use case, thus often not effective. In addition, the ``black box" nature and hallucination of Large Language Models (LLMs) often lead to distrust. In this paper, we introduce a human-agent collaborative visualization system AwesomeLit to address this need. It has several novel features: a transparent user-steerable agentic workflow; a dynamically generated query exploring tree, visualizing the exploration path and provenance; and a semantic similarity view, depicting the relationships between papers. It enables users to transition from general intentions to detailed research topics. Finally, a qualitative study involving several early researchers showed that AwesomeLit is effective in helping users explore unfamiliar topics, identify promising research directions, and improve confidence in research results.
Authors:Protiva Das, Sovon Chakraborty, Sidhant Narula, Lucas Potter, Xavier-Lewis Palmer, Pratip Rana, Daniel Takabi, Mohammad Ghasemigol
Abstract:
The rapid advancement of Large Language Models (LLMs) in biological research has significantly lowered the barrier to accessing complex bioinformatics knowledge, ex perimental design strategies, and analytical workflows. While these capabilities accelerate innovation, they also introduce serious dual-use risks, as Bio-LLMs can be exploited to generate harmful biological insights under the guise of legitimate research queries. Existing safeguards, such as static prompt filtering and policy-based restrictions, are insufficient when LLMs are embedded within dynamic biological workflows and application-layer systems. In this paper, we present BioShield, a context-aware application-level firewall designed to secure Bio LLMs against dual-use attacks. At the core of BioShield is a domain-specific prompt scanner that performs contextual risk analysis of incoming queries. The scanner leverages a harmful scoring mechanism tailored to biological dual-use threat cat egories to identify prompts that attempt to conceal malicious intent within seemingly benign research requests. Queries ex ceeding a predefined risk threshold are blocked before reaching the model, effectively preventing unsafe knowledge generation at the source. In addition to pre-generation protection, BioShield deploys a post-generation output verification module that inspects model responses for actionable or weaponizable biological content. If an unsafe response is detected, the system triggers controlled regeneration under strengthened safety constraints. By combining contextual prompt scanning with response-level validation, BioShield provides a layered defense framework specifically designed for bio-domain LLM deployments. Our framework advances cyberbiosecurity by formalizing dual-use threat detection in Bio-LLMs and proposing a structured mitigation strategy for secure, responsible AI driven biological research.
Authors:Vasco Xu, Brian Chen, Eric J. Gonzalez, Andrea Colaço, Henry Hoffmann, Mar Gonzalez-Franco, Karan Ahuja
Abstract:
Mid-air gestures in Extended Reality (XR) often cause fatigue and imprecision. Surface-based interactions offer improved accuracy and comfort, but current egocentric vision methods struggle due to hand tracking challenges and unreliable surface plane estimation. We introduce SurfaceXR, a sensor fusion approach combining headset-based hand tracking with smartwatch IMU data to enable robust inputs on everyday surfaces. Our insight is that these modalities are complementary: hand tracking provides 3D positional data while IMUs capture high-frequency motion. A 21-participant study validates SurfaceXR's effectiveness for touch tracking and 8-class gesture recognition, demonstrating significant improvements over single-modality approaches.
Authors:Max Linnander, Yon Visell
Abstract:
We present thermopneumatic pixels (TPPs), which are tactile actuators designed for rapid fabrication and straightforward integration into compact wearable and surface-based haptic systems. Each TPP converts low-voltage ($\sim$10 V) electrical pulses into transient pressure increases within a sealed cavity, producing out-of-plane forces and displacements suitable for tactile stimulation. The architecture enables scalable fabrication and spatially distributed actuation while maintaining simple electrical interfacing. The TPPs are constructed from inexpensive, readily available materials using straightforward layer-based assembly, facilitating rapid prototyping and integration into interactive devices. Mechanical characterization demonstrates peak forces exceeding 1 N and millimeter displacements. We further present driving electronics for operating multiple TPP modules concurrently and report perceptual study results demonstrating the effectiveness of the resulting tactile feedback. Together, these results establish low-voltage thermopneumatic actuation as an accessible and high-performance approach for embedding tactile feedback into experimental and consumer-facing interfaces.
Authors:Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
Abstract:
Deep learning models for chest X-ray diagnosis are constrained by limited coverage of clinically meaningful concept combinations in publicly available training datasets. While synthetic image generation has been explored to increase data diversity, existing methods rarely enforce clinical or anatomical constraints, limiting utility for improving model reliability. We propose CARPA, a clinically aware and anatomically grounded framework for synthetic chest X-ray generation that applies targeted perturbations to clinical concept vectors while preserving anatomical structure. By producing anatomically faithful synthetic images with controlled concept insertions and deletions, CARPA expands clinically relevant concept coverage. We evaluate CARPA across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior concept perturbation approaches, fine-tuning on CARPA-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong concept alignment, and low semantic uncertainty. Evaluation by two expert radiologists further confirms realism and clinical agreement. Together, these results show that anatomically grounded concept perturbations enable more effective use of synthetic data, improving both performance and reliability of chest X-ray classification models and supporting safer clinical deployment.
Authors:Maryam Cheema, Sina Elahimanesh, Pooyan Fazli, Hasti Seifi
Abstract:
Advances in multimodal large language models enable automatic video narration and question answering (VQA), offering scalable alternatives to labor-intensive, human-authored audio descriptions (ADs) for blind and low vision (BLV) viewers. However, prior AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals across videos and are typically evaluated in controlled, single-session settings. We present ViDscribe, a web-based platform that integrates AI-generated ADs with six types of user customizations and a conversational VQA interface for YouTube videos. Through a longitudinal, in-the-wild study with eight BLV participants, we examine how users engage with customization and VQA features over time. Our results show sustained engagement with both features and that customized ADs improve effectiveness, enjoyment, and immersion compared to default ADs, highlighting the value of personalized, interactive video access for BLV users.
Authors:Hiruni Kegalle, Flora D. Salim, Mark Sanderson, Jeffrey Chan, Danula Hettiachchi
Abstract:
Location-Based Services (LBS) such as ride-sharing, accommodation, food delivery, and location-driven social media platforms entangle digital systems with physical spaces, thereby generating impacts that extend beyond users to others who share the same environments. Existing design approaches struggle to address the dual challenge of value tensions that arise in shared physical spaces and the locality-specific contexts in which LBS operate. To respond, we introduce Location-Aware Value Sensitive Design (LA-VSD), a domain-specific adaptation of VSD tailored to the distinctive characteristics of LBS. LA-VSD guides designers through three heuristics to help (1) identify and prioritise stakeholders through local space-sharing scenarios, (2) adapt empirical methods to capture values and tensions in context, and (3) support value-aligned interactions across both digital and physical layers of the service. Through a case study of e-scooter sharing in Melbourne, Australia, we demonstrate how LA-VSD enables more grounded, context-aware, and actionable design of LBS.
Authors:Ching Christie Pang, Yi Gao, Xuetong Wang, Pan Hui
Abstract:
What does it mean to fall in love with something we know is virtual? The proliferation of conversational AI enables users to create customizable companions, fostering new intimate relationships that, while virtual, are perceived as authentic. However, public understanding of these bonds is limited, and platform policies regarding these interactions remain inconsistent. There is a pressing need for further HCI research to investigate: (a) the design affordances in AI that construct bonds and a sense of intimacy, (b) how such long-term engagement impacts users' real lives, and (c) how to balance user autonomy with platform regulation in the design of these systems without compromising users' well-being and experiences. This paper takes a step toward addressing these goals by providing a concrete definition of human AI intimacy based on in depth interviews with 30 users engaged in romantic relationships with AI companions. We elucidate the complexities of these relationships, from their formation to sustainability, and identify key features of the bonds formed. Notably, we introduce the AI Amplifier Effect, where the AI serves as a medium that intensifies the user's existing emotional state, leading to divergent positive, neutral, and negative impacts. We argue that designing for emotion must extend beyond technical affordances to encompass the essence of human affection. This paper's contributions aim to initiate a conversation and guide future research on human AI relationships within the HCI community.
Authors:Selin Choi, Dooyoung Kim, Taewook Ha, Seonji Kim, Woontack Woo
Abstract:
We propose a method for generating task breakpoints based on an Origin-Centric Graph (OCG) to segment goal-oriented activity recordings into task units for adaptive playback in Virtual Reality (VR) environments. With the development of Augmented Reality (AR)/VR head-mounted displays (HMDs), research on adaptive tutorials and authoring tools has become active, but existing task segmentation methods mainly rely on manual annotation or are restricted to 2D video which limits their applicability to 3D VR contexts. In our approach, assembly scenarios with clearly defined task boundaries are recorded using a structured spatio-temporal scene graph (STSG), and the OCG is employed to track changes in the central object and the formation of new groups, thereby generating task breakpoints automatically. A user study collected user-perceived task breakpoints to establish ground truth (GT), and comparison with the algorithm-detected breakpoints demonstrated high agreement and confirmed accuracy in supporting adaptive playback. The proposed task segmentation method provides a foundation for dynamically adjusting VR playback according to user proficiency and progress, with potential for extension into automatic timeline segmentation systems for diverse VR recordings.
Authors:Suyash Fulay, Prerna Ravi, Om Gokhale, Eugene Yi, Michiel Bakker, Deb Roy
Abstract:
Deliberative democratic theory suggests that civic competence: the capacity to navigate disagreement, weigh competing values, and arrive at collective decisions is not innate but developed through practice. Yet opportunities to cultivate these skills remain limited, as traditional deliberative processes like citizens' assemblies reach only a small fraction of the population. We present Agora, an early-stage AI-powered platform that uses LLMs to organize authentic human voices on policy issues, helping users build consensus-finding skills by proposing and revising policy recommendations, hearing supporting and opposing perspectives, and receiving feedback on how policy changes affect predicted support. In a preliminary study with 44 university students, participants using the full interface (with access to voice explanations) reported higher levels of problem-solving skills, internal deliberation, and produced higher quality consensus statements compared to a control condition showing only aggregate support distributions. These initial findings point toward a promising direction for scaling civic education.
Authors:Nikita Soni, Dhruv Vijay Kunjadiya, Pratham Piyush Shah, Dikshya Mohanty, H. Andrew Schwartz, Niranjan Balasubramanian
Abstract:
Language model training and inference ignore a fundamental linguistic fact -- there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of \textit{ecological fallacy} can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author's language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author's language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (\textit{HuFT:Human-aware Fine-Tuning}). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.
Authors:Yuxin Zhang, Fan Zhang, Zihao Song, Chao Zhao
Abstract:
This study develops sustainable materials using hydrogel as the matrix and explores the transition from sustainable materials to user-centered sustainability, with a particular focus on achieving art healing through material experience. The findings reveal that "Aesthetic" property exert the greatest influence on art healing in the context of multimodal material experiences involving visual, tactile, and smell, followed by "Intrinsic" property, whereas "Physical" property have a comparatively limited effect. Furthermore, the study proposes a material experience framework that enables designers to systematically and holistically understanding material characteristics. It highlights the importance of considering users' psychological perceptions and emotional needs in the material design process.
Authors:Liangwei Wang, Zhengxuan Zhang, Yifan Cao, Fugee Tsung, Yuyu Luo
Abstract:
Data tables play a central role in scientific papers. However, their meaning is often co-constructed with surrounding text through narrative interplay, making comprehension cognitively demanding for readers. In this work, we explore how interfaces can better support this reading process. We conducted a formative study that revealed key characteristics of text-table narrative interplay, including linking mechanisms, multi-granularity alignments, and mention typologies, as well as a layered framework of readers' intents. Informed by these insights, we present TableTale, an augmented reading interface that enriches text with data tables at multiple granularities, including paragraphs, sentences, and mentions. TableTale automatically constructs a document-level linking schema within the paper and progressively renders cascade visual cues on text and tables that unfold as readers move through the text. A within-subject study with 24 participants showed that TableTale reduced cognitive workload and improved reading efficiency, demonstrating its potential to enhance paper reading and inform future reading interface design.
Authors:Jiwan Kim, Chi-Jung Lee, Hohurn Jung, Tianhong Catherine Yu, Ruidong Zhang, Ian Oakley, Cheng Zhang
Abstract:
Tracking hand poses on wrist-wearables enables rich, expressive interactions, yet remains unavailable on commercial smartwatches, as prior implementations rely on external sensors or custom hardware, limiting their real-world applicability. To address this, we present WatchHand, the first continuous 3D hand pose tracking system implemented on off-the-shelf smartwatches using only their built-in speaker and microphone. WatchHand emits inaudible frequency-modulated continuous waves and captures their reflections from the hand. These acoustic signals are processed by a deep-learning model that estimates 3D hand poses for 20 finger joints. We evaluate WatchHand across diverse real-world conditions -- multiple smartwatch models, wearing-hands, body postures, noise conditions, pose-variation protocols -- and achieve a mean per-joint position error of 7.87 mm in cross-session tests with device remounting. Although performance drops for unseen users or gestures, the model adapts effectively with lightweight fine-tuning on small amounts of data. Overall, WatchHand lowers the barrier to smartwatch-based hand tracking by eliminating additional hardware while enabling robust, always-available interactions on millions of existing devices.
Authors:Jiasheng Li, Zining Zhang, Zeyu Yan, Matthew Wong, Arnav Mittal, Ge Gao, Huaishu Peng
Abstract:
Creating webpages requires generating content and arranging layout while iteratively refining both to achieve a coherent design, a process that can be challenging for blind individuals. To understand how blind designers navigate this process, we conducted two rounds of co-design sessions with blind participants, using design probes to elicit their strategies and support needs. Our findings reveal a preference for content and layout to co-evolve, but this process requires external support through cues that situate local elements within the broader page structure as well as multimodal interactions. Building on these insights, we developed TangibleSite, an accessible web design tool that provides real-time multimodal feedback through tangible, auditory, and speech-based interactions. TangibleSite enables blind individuals to create, edit, and reposition webpage elements while integrating content and layout decisions. A formative evaluation with six blind participants demonstrated that TangibleSite enabled independent webpage creation, supported refinement across content and layout, and reduced barriers to achieving visually consistent designs.
Authors:Madeleine Grunde-McLaughlin, Hussein Mozannar, Maya Murad, Jingya Chen, Saleema Amershi, Adam Fourney
Abstract:
To enable human oversight, agentic AI systems often provide a trace of reasoning and action steps. Designing traces to have an informative, but not overwhelming, level of detail remains a critical challenge. In three user studies on a Computer User Agent, we investigate the utility of basic action traces for verification, explore three alternatives via design probes, and test a novel interface's impact on error finding in question-answering tasks. As expected, we find that current practices are cumbersome, limiting their efficacy. Conversely, our proposed design reduced the time participants spent finding errors. However, although participants reported higher levels of confidence in their decisions, their final accuracy was not meaningfully improved. To this end, our study surfaces challenges for human verification of agentic systems, including managing built-in assumptions, users' subjective and changing correctness criteria, and the shortcomings, yet importance, of communicating the agent's process.
Authors:Sutapa Dey Tithi, Xiaoyi Tian, Ally Limke, Min Chi, Tiffany Barnes
Abstract:
Tutoring systems improve learning through tailored interventions, such as worked examples, but often suffer from the aptitude-treatment interaction effect where low prior knowledge learners benefit more. We applied the ICAP learning theory to design two new types of worked examples, Buggy (students fix bugs), and Guided (students complete missing rules), requiring varying levels of cognitive engagement, and investigated their impact on learning in a controlled experiment with 155 undergraduate students in a logic problem solving tutor. Students in the Buggy and Guided examples groups performed significantly better on the posttest than those receiving passive worked examples. Buggy problems helped high prior knowledge learners whereas Guided problems helped low prior knowledge learners. Behavior analysis showed that Buggy produced more exploration-revision cycles, while Guided led to more help-seeking and fewer errors. This research contributes to the design of interventions in logic problem solving for varied levels of learner knowledge and a novel application of behavior analysis to compare learner interactions with the tutor.
Authors:Samuel Reinders, Munazza Zaib, Matthew Butler, Bongshin Lee, Ingrid Zukerman, Lizhen Qu, Kim Marriott
Abstract:
Combining conversational AI with refreshable tactile displays (RTDs) offers significant potential for creating accessible data visualization for people who are blind or have low vision (BLV). To support researchers and developers building accessible data visualizations with RTDs, we present a multimodal data interaction architecture along with an open-source reference implementation. Our system is the first to combine touch input with a conversational agent on an RTD, enabling deictic queries that fuse touch context with spoken language, such as "what is the trend between these points?" The architecture addresses key technical challenges, including touch sensing on RTDs, visual-to-tactile encoding, integrating touch context with conversational AI, and synchronizing multimodal output. Our contributions are twofold: (1) a technical architecture integrating RTD hardware, external touch sensing, and conversational AI to enable multimodal data interaction; and (2) an open-source reference implementation demonstrating its feasibility. This work provides a technical foundation to support future research in multimodal accessible data visualization.
Authors:Dániel Szabó, Aku Visuri, Benjamin Tag, Simo Hosio
Abstract:
Navigating large and complex indoor environments, such as universities, airports, and hospitals, can be cognitively demanding and requires attention and effort. While mobile applications provide convenient navigation support, they occupy the user's hands and visual attention, limiting natural interaction. In this paper, we explore conversation hand-off as a method for multi-device indoor navigation, where a Conversational Agent (CA) transitions seamlessly from a stationary social robot to a wearable device. We evaluated robot-only, wearable-only, and robot-to-wearable hand-off in a university campus setting using a within-subjects design with N=24 participants. We find that conversation hand-off is experienced as engaging, even though no performance benefits were observed, and most preferred using the wearable-only system. Our findings suggest that the design of such re-embodied assistants should maintain a shared voice and state across embodiments. We demonstrate how conversational hand-offs can bridge cognitive and physical transitions, enriching human interaction with embodied AI.
Authors:Zirong Chen, Meiyi Ma
Abstract:
Emergency call-takers form the first operational link in public safety response, handling over 240 million calls annually while facing a sustained training crisis: staffing shortages exceed 25\% in many centers, and preparing a single new hire can require up to 720 hours of one-on-one instruction that removes experienced personnel from active duty. Traditional training approaches struggle to scale under these constraints, limiting both coverage and feedback timeliness. In partnership with Metro Nashville Department of Emergency Communications (MNDEC), we designed, developed, and deployed a GenAI-powered call-taking training system under real-world constraints. Over six months, deployment scaled from initial pilot to 190 operational users across 1,120 training sessions, exposing systematic challenges around system delivery, rigor, resilience, and human factors that remain largely invisible in controlled or purely simulated evaluations. By analyzing deployment logs capturing 98,429 user interactions, organizational processes, and stakeholder engagement patterns, we distill four key lessons, each coupled with concrete design and governance practices. These lessons provide grounded guidance for researchers and practitioners seeking to embed AI-driven training systems in safety-critical public sector environments where embedded constraints fundamentally shape socio-technical design.
Authors:Stephan Vonschallen, Rahel Häusler, Theresa Schmiedel, Friederike Eyssel
Abstract:
Generative Social Agents (GSAs) are increasingly impacting human users through persuasive means. On the one hand, they might motivate users to pursue personal goals, such as healthier lifestyles. On the other hand, they are associated with potential risks like manipulation and deception, which are induced by limited control over probabilistic agent outputs. However, as GSAs manifest communicative patterns based on available knowledge, their behavior may be regulated through their access to such knowledge. Following this approach, we explored persuasive ChatGPT-generated messages in the context of human-robot physiotherapy motivation. We did so by comparing ChatGPT-generated responses to predefined inputs from a hypothetical physiotherapy patient. In Study 1, we qualitatively analyzed 13 ChatGPT-generated dialogue scripts with varying knowledge configurations regarding persuasive message characteristics. In Study 2, third-party observers (N = 27) rated a selection of these dialogues in terms of the agent's expressiveness, assertiveness, and persuasiveness. Our findings indicate that LLM-based GSAs can adapt assertive and expressive personality traits -- significantly enhancing perceived persuasiveness. Moreover, persuasiveness significantly benefited from the availability of information about the patients' age and past profession, mediated by perceived assertiveness and expressiveness. Contextual knowledge about physiotherapy benefits did not significantly impact persuasiveness, possibly because the LLM had inherent knowledge about such benefits even without explicit prompting. Overall, the study highlights the importance of empirically studying behavioral patterns of GSAs, specifically in terms of what information generative AI systems require for consistent and responsible communication.
Authors:Stephan Vonschallen, Dominique Oberle, Theresa Schmiedel, Friederike Eyssel
Abstract:
Generative social robots (GSRs) powered by large language models enable adaptive, conversational tutoring but also introduce risks such as hallucinations, overreliance, and privacy violations. Existing frameworks for educational technologies and responsible AI primarily define desired behaviors, yet they rarely specify the knowledge prerequisites that enable generative systems to express these behaviors reliably. To address this gap, we adopt a knowledge-based design perspective and investigate what information tutoring-oriented GSRs require to function responsibly and effectively in higher education. Based on twelve semi-structured interviews with university students and lecturers, we identify twelve design requirements across three knowledge types: self-knowledge (assertive, conscientious and friendly personality with customizable role), user-knowledge (personalized information about student learning goals, learning progress, motivation type, emotional state and background), and context-knowledge (learning materials, educational strategies, course-related information, and physical learning environment). By identifying these knowledge requirements, this work provides a structured foundation for the design of tutoring GSRs and future evaluations, aligning generative system capabilities with pedagogical and ethical expectations.
Authors:Stephan Vonschallen, Friederike Eyssel, Theresa Schmiedel
Abstract:
Generative social agents (GSAs) use artificial intelligence to autonomously communicate with human users in a natural and adaptive manner. Currently, there is a lack of theorizing regarding interactions with GSAs, and likewise, few guidelines exist for studying how they influence user attitudes and behaviors. Consequently, we propose the Knowledge-based Persuasion Model (KPM) as a novel theoretical framework. According to the KPM, a GSA's self, user, and context-related knowledge drives its persuasive behavior, which in turn shapes the attitudes and behaviors of a responding human user. By synthesizing existing research, the model offers a structured approach to studying interactions with GSAs, supporting the development of agents that motivate rather than manipulate humans. Accordingly, the KPM encourages the integration of responsible GSAs that adhere to social norms and ethical standards with the goal of increasing user wellbeing. Implications of the KPM for research and application domains such as healthcare and education are discussed.
Authors:Yuxin Zhang, Fan Zhang
Abstract:
This study employs linear regression and structural equation modeling to explore how Thinking Skills, Design Thinking, Creative Self-Efficacy (CSE), and Collective Creative Efficacy (CCE) drive Design Creativity & Innovation, and analyzes the structural stability of the model across different levels of experience. Path analysis results indicate that the four Design Thinking Skills, Problem-driven Design (beta = 0.198, p < 0.01), Information-driven Design (beta = 0.241, p < 0.001), Solution-driven Design (beta = 0.227, p < 0.001), and Knowledge-driven Design (beta = 0.263, p < 0.001) all significantly and positively influence Design Thinking. Furthermore, Design Thinking has a significant positive predictive effect on Design Creativity & Innovation (beta = 0.286, p < 0.001). Mediation analysis confirms three significant mediation paths: the CSE mediation path (beta = 0.128, p < 0.001), the CCE mediation path (beta = 0.073, p < 0.01), and the "CSE to CCE" chain mediation path (beta = 0.025, p < 0.01). Multi-group comparison results reveal significant differences between the student and professional groups under the full equivalence model. After relaxing specific constraints, there were no significant differences between the nested models of the baseline model, partial measurement invariance, structural weight invariance, and structural covariance invariance. These findings elucidate the multi-dimensional pathways of Design Creativity & Innovation, providing a robust empirical basis for optimizing differentiated pedagogical models and professional practice guidelines.
Authors:Ziyi Wang, Congrong Zhang, Jingying Deng, Xiaofan Hu, Jie Cai, Nan Gao, Chun Yu, Haining Zhang
Abstract:
Homework tutoring work is a demanding and often conflict-prone practice in family life, and parents often lack targeted support for managing its cognitive and emotional burdens. Through interviews with 18 parents of children in grades 1-3, we examine how homework-related labor is divided and coordinated between parents, and where AI might meaningfully intervene. We found three key insights: (1) Homework labor encompasses distinct dimensions: physical, cognitive, and emotional, with the latter two often remaining invisible. (2) We identified father-mother-child triadic dynamics in labor division, with children's feedback as the primary factor shaping parental labor adjustments. (3) Building on prior HCI research, we propose an AI design that prioritizes relationship maintenance over task automation or broad labor mitigation. By employing labor as a lens that integrates care work, we explore the complexities of labor within family contexts, contributing to feminist and care-oriented HCI and to the development of context-sensitive coparenting practices.
Authors:Yue Fu, Joel Wester, Niels Van Berkel, Alexis Hiniker
Abstract:
College students increasingly use AI chatbots to support academic reading, yet we lack granular understanding of how these interactions shape their reading experience and cognitive engagement. We conducted an eight-week longitudinal study with 15 undergraduates who used AI to support assigned readings in a course. We collected 838 prompts across 239 reading sessions and developed a coding schema categorizing prompts into four cognitive themes: Decoding, Comprehension, Reasoning, and Metacognition. Comprehension prompts dominated (59.6%), with Reasoning (29.8%), Metacognition (8.5%), and Decoding (2.1%) less frequent. Most sessions (72%) contained exactly three prompts, the required minimum of the reading assignment. Within sessions, students showed natural cognitive progression from comprehension toward reasoning, but this progression was truncated. Across eight weeks, students' engagement patterns remained stable, with substantial individual differences persisting throughout. Qualitative analysis revealed an intention-behavior gap: students recognized that effective prompting required effort but rarely applied this knowledge, with efficiency emerging as the primary driver. Students also strategically triaged their engagement based on interest and academic pressures, exhibiting a novel pattern of reading through AI rather than with it: using AI-generated summaries as primary material to filter which sections merited deeper attention. We discuss design implications for AI reading systems that scaffold sustained cognitive engagement.
Authors:Shijing He, Chenkai Ma, Chi Zhang, Adam Jenkins, Ruba Abu-Salma, Jose Such
Abstract:
As more young women in China live alone, they navigate entangled privacy, security, and safety (PSS) risks across smart homes, online platforms, and public infrastructures. Drawing on six participatory threat modeling (PTM) workshops (n = 33), we present a human-centered threat model that illustrates how digitally facilitated physical violence, digital harassment and scams, and pervasive surveillance by individuals, companies, and the state are interconnected and mutually reinforcing. We also document four mitigation strategies employed by participants: smart home device configurations, boundary management, sociocultural practices, and social media tactics--each of which can introduce new vulnerabilities and emotional burdens. Based on these insights, we developed a digital PSS guidebook for young women living alone (YWLA) in China. We further propose actionable design implications for smart home devices and social media platforms, along with policy and legal recommendations and directions for educational interventions.
Authors:Cindy Peng, Megan Chai, Gao Mo, Naveen Raman, Ningjing Tang, Shannon Pagdon, Margaret Swarbrick, Nev Jones, Fei Fang, Hong Shen
Abstract:
Peer-run organizations (PROs) provide critical, recovery-based behavioral health support rooted in lived experience. As large language models (LLMs) enter this domain, their scale, conversationality, and opacity introduce new challenges for situatedness, trust, and autonomy. Partnering with Collaborative Support Programs of New Jersey (CSPNJ), a statewide PRO in the Northeastern United States, we used comicboarding, a co-design method, to conduct workshops with 16 peer specialists and 10 service users exploring perceptions of integrating an LLM-based recommendation system into peer support. Findings show that depending on how LLMs are introduced, constrained, and co-used, they can reconfigure in-room dynamics by sustaining, undermining, or amplifying the relational authority that grounds peer support. We identify opportunities, risks, and mitigation strategies across three tensions: bridging scale and locality, protecting trust and relational dynamics, and preserving peer autonomy amid efficiency gains. We contribute design implications that center lived-experience-in-the-loop, reframe trust as co-constructed, and position LLMs not as clinical tools but as relational collaborators in high-stakes, community-led care.
Authors:Gefei Zhang, Guodao Sun, Meng Xia, Ronghua Liang
Abstract:
Generative AI is reshaping education, but it also raises concerns about instability and overreliance. In programming classrooms, we aim to leverage its feedback capabilities while reinforcing the educator's role in guiding student-AI interactions. We developed ClassAid, a real-time orchestration system that integrates TA Agents to provide personalized support and an AI-driven dashboard that visualizes student-AI interactions, enabling instructors to dynamically adjust TA Agent modes. Instructors can configure the Agent to provide technical feedback (direct coding solutions), heuristic feedback (hint-based guidance), automatic feedback (autonomously selecting technical or heuristic support), or silent operation (no AI support). We evaluated ClassAid through three aspects: (1) the TA Agents' performance, (2) feedback from 54 students and one instructor during a classroom deployment, and (3) interviews with eight educators. Results demonstrate that dynamic instructor control over AI supports effective real-time personalized feedback and provides design implications for integrating AI into authentic educational settings.
Authors:Mandi Yang, Zhiqi Gao, Yibo Meng, Dongyijie Primo Pan
Abstract:
We present an LLM-mediated role-playing game that supports reflection on socialization, moral responsibility, and educational role positioning. Grounded in socialization theory, the game follows a four-season structure in which players guide a child prince through morally charged situations and compare the LLM-mediated NPC's differentiated responses across stages, helping them reason about how educational guidance shifts with socialization. To approximate real educational contexts and reduce score-chasing, the system hides real-time evaluative scores and provides delayed, end-of-stage growth feedback as reflective prompts. We conducted a user study (N=12) with gameplay logs and post-game interviews, analyzed via reflexive thematic analysis. Findings show how players negotiated responsibility and role positioning, and reveal an entry-load tension between open-ended expression and sustained engagement. We contribute design knowledge on translating sociological models of socialization into reflective AI-mediated game systems.
Authors:Zhiqi Gao, Guo Zhu, Huarui Luo, Dongyijie Primo Pan, Haoming Tang, Bingquan Zhang, Jiahuan Pei, Jie Li, Benyou Wang
Abstract:
Standardized patients (SPs) play a central role in clinical communication training but are costly, difficult to scale, and inconsistent. Large language model (LLM) based AI standardized patients (AI-SPs) promise flexible, on-demand practice, yet learners often report that they talk like a patient but feel different. We interviewed 12 clinical-year medical students and conducted three co-design workshops to examine how learners experience constraints of SP encounters and what they expect from AI-SPs. We identified six learner-centered needs, translated them into AI-SP design requirements, and synthesized a conceptual workflow. Our findings position AI-SPs as tools for deliberate practice and show that instructional usability, rather than conversational realism alone, drives learner trust, engagement, and educational value.
Authors:Aashish Panta, Giorgio Scorzelli, Amy A. Gooch, Werner Sun, Katherine S. Shanks, Suchismita Sarker, Devin Bougie, Keara Soloway, Rolf Verberg, Tracy Berman, Glenn Tarcea, John Allison, Michela Taufer, Valerio Pascucci
Abstract:
Synchrotron facilities like the Cornell High Energy Synchrotron Source (CHESS) generate massive data volumes from complex beamline experiments, but face challenges such as limited access time, the need for on-site experiment monitoring, and managing terabytes of data per user group. We present the design, deployment, and evaluation of a framework that addresses CHESS's data acquisition and management issues. Deployed on a secure CHESS server, our system provides real time, web-based tools for remote experiment monitoring and data quality assessment, improving operational efficiency. Implemented across three beamlines (ID3A, ID3B, ID4B), the framework managed 50-100 TB of data and over 10 million files in late 2024. Testing with 43 research groups and 86 dashboards showed reduced overhead, improved accessibility, and streamlined data workflows. Our paper highlights the development, deployment, and evaluation of our framework and its transformative impact on synchrotron data acquisition.
Authors:Ritik Batra, Roy Zunder, Amy Cheatle, Amritansh Kwatra, Ilan Mandel, Thijs Roumen, Steven J. Jackson
Abstract:
Computational tools for fabrication often treat materials as passive rather than active participants in design, abstracting away relationships between craftspeople and materials. For craft communities that value relational practices, abstractions limit the adoption and creative uptake of computational tools which might otherwise be beneficial. To understand how better tool design could support richer relations between individuals, tools, and materials, we interviewed expert woodworkers, fiber artists, and metalworkers. We identify three orders of convivial relations central to craft: immediate relations between individuals, tools, and materials; mid-range relations between communities, platforms, and shared materials; and extended relations between institutions, infrastructures, and ecologies. Our analysis shows how craftspeople engage and struggle with convivial relations across all three orders, creating workflows that learn from materials while supporting autonomy. We conclude with design principles for computational tools and infrastructures to better support material dialogue, collective knowledge, and accountability, along with richer and more convivial relations between craftspeople, tools, and the material worlds around them.
Authors:Roberta Mota, Julio D. Silva, Fabio Miranda, Usman Alim, Ehud Sharlin, Nivan Ferreira
Abstract:
The visualization of temporal data on urban buildings, such as shadows, noise, and solar potential, plays a critical role in the analysis of dynamic urban phenomena. However, in dense and geographically constrained 3D urban environments, visual representations of time-varying building data often suffer from occlusion and visual clutter. To address these two challenges, we introduce an immersive lens visualization that integrates a view-dependent cutaway de-occlusion technique and a temporal display derived from a conformal mapping algorithm. The mapping process first partitions irregular building footprints into smaller, sufficiently regular subregions that serve as structural primitives. These subregions are then seamlessly recombined to form a conformal, layered layout for our temporal lens visualization. The view-responsive cutaway is inspired by traditional architectural illustrations, preserving the overall layout of the building and its surroundings to maintain users' sense of spatial orientation. This lens design enables the occlusion-free embedding of shape-adaptive temporal displays across building facades on demand, supporting rapid time-space association for the discovery, access and interpretation of spatiotemporal urban patterns. Guided by domain and design goals, we outline the rationale behind the lens visual and interaction design choices, such as the encoding of time progression and temporal values in the conforming lens image. A user study compares our approach against conventional juxtaposition and x-ray spatiotemporal designs. Results validate the usage and utility of our lens, showing that it improves task accuracy and completion time, reduces navigation effort, and increases user confidence. From these findings, we distill design recommendations and promising directions for future research on spatially-embedded lenses in 3D visualization and urban analytics.
Authors:Zeynep G. Saribatur, Johannes Langer, Ute Schmid
Abstract:
Explanations are central to human cognition, yet AI systems often produce outputs that are difficult to understand. While symbolic AI offers a transparent foundation for interpretability, raw logical traces often impose a high extraneous cognitive load. We investigate how formal abstractions, specifically removal and clustering, impact human reasoning performance and cognitive effort. Utilizing Answer Set Programming (ASP) as a formal framework, we define a notion of irrelevant details to be abstracted over to obtain simplified explanations. Our cognitive experiments, in which participants classified stimuli across domains with explanations derived from an answer set program, show that clustering details significantly improve participants' understanding, while removal of details significantly reduce cognitive effort, supporting the hypothesis that abstraction enhances human-centered symbolic explanations.
Authors:Siyuan Wang, Ke Li, Jingyuan Huang, Jike Wang, Cheng Zhang, Alanson Sample, Dongyao Chen
Abstract:
Self-touch gestures (e.g., nuanced facial touches and subtle finger scratches) provide rich insights into human behaviors, from hygiene practices to health monitoring. However, existing approaches fall short in detecting such micro gestures due to their diverse movement patterns. This paper presents μTouch, a novel magnetic sensing platform for self-touch gesture recognition. μTouch features (1) a compact hardware design with low-power magnetometers and magnetic silicon, (2) a lightweight semi-supervised framework requiring minimal user data, and (3) an ambient field detection module to mitigate environmental interference. We evaluated μTouch in two representative applications in user studies with 11 and 12 participants. μTouch only requires three-second fine-tuning data for each gesture, and new users need less than one minute before starting to use the system. μTouch can distinguish eight different face-touching behaviors with an average accuracy of 93.41%, and reliably detect body-scratch behaviors with an average accuracy of 94.63%. μTouch demonstrates accurate and robust sensing performance even after a month, showcasing its potential as a practical tool for hygiene monitoring and dermatological health applications.
Authors:Yuxin Zhang, Fan Zhang
Abstract:
Based on a bibliometric analysis of literature from 2005 to 2024, this study reveals that material experience is undergoing a profound transformation characterized by evolving material definitions, methodological advances, and increasing interdisciplinary integration. Material types now extend beyond traditional substances to encompass virtual and biological media, underscoring a growing emphasis on perception and interaction. Methodologically, the field has transitioned from subjective descriptions to data-driven, quantifiable models focused on objective sensory analysis and multisensory integration to enhance immersion. Key drivers, including human-machine perception convergence, material-driven interface interactions, and the embedding of intelligent interactive functions, propel the discipline toward an experience-centered paradigm reflecting a deep convergence of design, science, and technology. At the national/regional level, the United States, China, Japan, Germany, and the Netherlands lead in contributions, while France, the United Kingdom, and Romania demonstrate significant interdisciplinary progress. At the institutional level, Delft University of Technology, Justus Liebig University Giessen, and the Centre National de la Recherche Scientifique show significant advantages. In particular, the Material-Driven Design theory has established a foundational impact on the discipline, while, regarding general research trends, scholars from the United States, the Netherlands, and Germany maintain the highest academic visibility. Overall, material experience research is at a critical juncture, its future development will depend on progress in material innovation, technological integration, and perceptual quantification, as well as the establishment of socio-cultural values, all of which must be effectively unified through design to address complex evolving needs.
Authors:Dev Vikesh Doshi, Mehjabeen Tasnim, Fernando Landeros, Chinthagumpala Muni Venkatesh, Daniel Timko, Muhammad Lutfor Rahman
Abstract:
Phishing attacks through text, also known as smishing, are a prevalent type of social engineering tactic in which attackers impersonate brands to deceive victims into providing personal information and/or money. While smishing awareness and cyber education are a key method by which organizations communicate this awareness, the guidance itself varies widely. In this paper, we investigate the state of practice of how 149 well-known brands across 25 categories educate their customers about smishing and what smishing prevention and reporting advice they provide. After conducting a comprehensive content analysis of the brands, we identified significant gaps in the smishing-related information provided: only 46\% of the 149 brands mentioned the definition of smishing, less than 1\% had a video tutorial on smishing, and only 50\% of brands provided instructions on how to report. Our study highlights variation in terminology, prevention advice, and reporting mechanisms across industries, with some brands recommending potentially ineffective strategies such as "ignoring suspicious messages." These findings establish a baseline for understanding the current state of industry smishing awareness advice and provide specific areas where standardization improvements are needed. From our evaluation, we provide recommendations for brands on how to offer streamlined education to their respective customers on smishing for better awareness and protection against increasing smishing attacks.
Authors:Lixiang Zhao, Fuqi Xie, Tobias Isenberg, Hai-Ning Liang, Lingyun Yu
Abstract:
We present ScaleFree, a GPU-accelerated adaptive Kernel Density Estimation (KDE) algorithm for scalable, interactive multiscale point cloud exploration. With this technique, we cater to the massive datasets and complex multiscale structures in advanced scientific computing, such as cosmological simulations with billions of particles. Effective exploration of such data requires a full 3D understanding of spatial structures, a capability for which immersive environments such as VR are particularly well suited. However, simultaneously supporting global multiscale context and fine-grained local detail remains a significant challenge. A key difficulty lies in dynamically generating continuous density fields from point clouds to facilitate the seamless scale transitions: while KDE is widely used, precomputed fields restrict the accuracy of interaction and omit fine-scale structures, while dynamic computation is often too costly for real-time VR interaction. We address this challenge by leveraging GPU acceleration with k-d-tree-based spatial queries and parallel reduction within a thread group for on-the-fly density estimation. With this approach, we can recalculate scalar fields dynamically as users shift their focus across scales. We demonstrate the benefits of adaptive density estimation through two data exploration tasks: adaptive selection and progressive navigation. Through performance experiments, we demonstrate that ScaleFree with GPU-parallel implementation achieves orders-of-magnitude speedups over sequential and multi-core CPU baselines. In a controlled experiment, we further confirm that our adaptive selection technique improves accuracy and efficiency in multiscale selection tasks.
Authors:Peter Zeng, Weiling Li, Amie Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan Brennan, Owen Rambow
Abstract:
For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.
Authors:Junling Wang, Lahari Goswami, Gustavo Kreia Umbelino, Kiara Chau, Mrinmaya Sachan, April Yi Wang
Abstract:
LLM-based chatbots like ChatGPT have become popular tools for assisting with coding tasks. However, they often produce isolated responses and lack mechanisms for social learning or contextual grounding. In contrast, online coding communities like Kaggle offer socially mediated learning environments that foster critical thinking, engagement, and a sense of belonging. Yet, growing reliance on LLMs risks diminishing participation in these communities and weakening their collaborative value. To address this, we propose Community-Enriched AI, a design paradigm that embeds social learning dynamics into LLM-based chatbots by surfacing user-generated content and social design feature from online coding communities. Using this paradigm, we implemented a RAG-based AI chatbot leveraging resources from Kaggle to validate our design. Across two empirical studies involving 28 and 12 data science learners, respectively, we found that Community-Enriched AI significantly enhances user trust, encourages engagement with community, and effectively supports learners in solving data science tasks. We conclude by discussing design implications for AI assistance systems that bridge -- rather than replace -- online coding communities.
Authors:Xiaowei Chen, Mindy Tran, Yue Deng, Bhupendra Acharya, Yixin Zou
Abstract:
How do individuals recover from cybercrimes? Victims experience various types of harm after cybercrimes, including monetary loss, data breaches, negative emotions, and even psychological trauma. The aspects that support their recovery process and contribute to individual cyber resilience remain underinvestigated. To address this gap, we interviewed 18 cybercrime victims from Western Europe using a trauma-informed approach. We identified four common stages following victimization: recognition, coping, processing, and recovery. Participants adopted various strategies to mitigate the impact of cybercrime and used different indicators to describe recovery. While they mostly relied on social support and self-regulation for emotional coping, service providers largely determined whether victims were able to recover their money. Internal factors, external support, and context sensitivity collectively contribute to individuals' cyber resilience. We recommend trauma-informed support for cybercrime victims. Extending our conceptualization of individual cyber resilience, we propose collaborative and context-sensitive strategies to address the harmful impacts of cybercrime.
Authors:Shiye Cao, Jiwon Moon, Yifan Xu, Anqi Liu, Chien-Ming Huang
Abstract:
Large language models (LLMs) have enabled conversational robots to move beyond constrained dialogue toward free-form interaction. However, without context-specific adaptation, generic LLM outputs can be ineffective or inappropriate. This adaptation is often attempted through prompt engineering, which is non-intuitive and tedious. Moreover, predominant design practice in HRI relies on impression-based, trial-and-error refinement without structured methods or tools, making the process inefficient and inconsistent. To address this, we present the AI-Aided Conversation Engine (ACE), a system that supports the deliberate design of human-robot conversations. ACE contributes three key innovations: 1) an LLM-powered voice agent that scaffolds initial prompt creation to overcome the "blank page problem," 2) an annotation interface that enables the collection of granular and grounded feedback on conversational transcripts, and 3) using LLMs to translate user feedback into prompt refinements. We evaluated ACE through two user studies, examining both designs' experience and end users' interactions with robots designed using ACE. Results show that ACE facilitates the creation of robot behavior prompts with greater clarity and specificity, and that the prompts generated with ACE lead to higher-quality human-robot conversational interactions.
Authors:Riju Marwah, Vishal Pallagani, Ritvik Garimella, Amit Sheth
Abstract:
LLMs are increasingly being deployed as chatbots, but today's interfaces offer little to no friction: users interact through seamless conversations that conceal when the model is drifting, hallucinating or failing. This lack of transparency fosters blind trust, even as models produce unstable or repetitive outputs. We introduce an interactive demo that surfaces and mitigates cognitive fatigue, a failure mode where LLMs gradually lose coherence during auto-regressive generation. Our system, Chatsparent, instruments real-time, token-level signals of fatigue, including attention-to-prompt decay, embedding drift, and entropy collapse, and visualizes them as a unified fatigue index. When fatigue thresholds are crossed, the interface allows users to activate lightweight interventions such as attention resets, entropy-regularized decoding, and self-reflection checkpoints. The demo streams live text and fatigue signals, allowing users to observe when fatigue arises, how it affects output quality, and how interventions restore stability. By turning passive chatbot interaction into an interactive diagnostic experience, our system empowers users to better understand LLM behavior while improving reliability at inference time.
Authors:Max Linnander, Yon Visell
Abstract:
We present Haptic Light-Emitting Diodes (HLEDs), luminous thermopneumatic actuators that directly convert pulsed light into mechanical forces and displacements. Each device packages a miniature surface-mount LED in a gas-filled cavity that contains a low-inertia graphite photoabsorber. The cavity is sealed by an elastic membrane, which functions as a working diaphragm. Brief optical pulses heat the photoabsorber, which heats the gas. The resulting rapid pressure increases generate forces and displacements at the working diaphragm. Millimeter-scale HLEDs produce forces exceeding 0.4 N and displacements of 0.9 mm at low voltages, with 5 to 100 ms response times, making them attractive as actuators providing tactile feedback in human-machine interfaces. Unusually, these actuators are also light-emitting, as a fraction of optical energy is transmitted through the membrane. These photomechanical actuators have many potential applications in tactile displays, human interface engineering, wearable computing, and other areas.
Authors:Thomas Krämer, Daniel Hienert, Francesco Chiossi, Thomas Kosch, Dagmar Kern
Abstract:
Selective exposure to online news occurs when users favor information that confirms their beliefs, creating filter bubbles and limiting diverse perspectives. Interactive systems can counter this by recommending different perspectives, but to achieve this, they need a real-time metric for selective exposure. We present an experiment where we evaluate Electroencephalography (EEG) and eye tracking as indicators for selective exposure by using eye tracking to recognize which textual parts participants read and using EEG to quantify the magnitude of selective exposure. Participants read online news while we collected EEG and eye movements with their agreement towards the news. We show that the agreement with news correlates positively with the theta band power in the parietal area. Our results indicate that future interactive systems can sense selective exposure using EEG and eye tracking to propose a more balanced information diet. This work presents an integrated experimental setup that identifies selective exposure using gaze and EEG-based metrics.
Authors:Daniel Hienert, Heiko Schmidt, Thomas Krämer, Dagmar Kern
Abstract:
Existing eye tracking software have certain limitations, especially with respect to monitoring reading online: (1) Most eye tracking software record eye tracking data as raw coordinates and stimuli as screen images/videos, but without inherent links between both. Analysts must draw areas of interest (AOIs) on webpage text for more fine-grained reading analysis. (2) The computation and analysis of fixation and reading metrics are done after the experiment and thus cannot be used for live applications. We present EyeLiveMetrics, a browser plugin that automatically maps raw gaze coordinates to text in real time. The plugin instantly calculates, stores, and provides fixation, saccade, and reading measures on words and paragraphs so that gaze behavior can be analyzed immediately. We also discuss the results of a comparative evaluation. EyeLiveMetrics offers a flexible way to measure reading on the web - for research experiments and live applications.
Authors:Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos
Abstract:
Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.
Authors:Haoyang Ge, Jian Ma, Ziwen Wang, Qihe Wang, Jianqi Fan, Hongzhi Yu, Xingyu Chen, Kun Li
Abstract:
High-quality dynamic human pose annotation equips AI with precise motion kinematics to enable human behavior mastery, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates one-frame multi-person pose corrections from annotators across entire videos. The keypoint-level ensures corrections temporal propagation via sequential modeling, while the instance-level employs keypoint-aware embedding with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy efficiency trade off under different interaction budgets, demonstrating particular advantages in low click annotation settings. IMPose achieves high precision annotation with high efficiency, requiring only 27 clicks per 1,050 frame video on 3DPW and 3 clicks per tracklet per 84-frame on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators in 10 hours. The annotation tool, codes, and extended dataset will be open-sourced.
Authors:Yiquan Li, Taeyoung Yeon, Chenfeng Gao, Vasco Xu, Xuanyou Liu, Karan Ahuja
Abstract:
Inertial odometry (IO) using only Inertial Measurement Units (IMUs) provides a lightweight solution for human motion tracking in augmented reality (AR) and wearable devices. Recent learning-based IO methods have improved the generalizability of inertial localization through large-scale pretraining on human motion datasets. However, these approaches remain prone to drift and noise because they do not explicitly capture human motion dynamics, especially on daily activity datasets such as Nymeria. In this work, we propose to ground inertial odometry in human kinematics through a learned IMU-inferred pose prior, which promotes physically consistent motion constraints. We integrate this pose prior into existing IO architectures and reduce positional drift by up to 36% on the challenging Nymeria dataset, which is 5x larger than datasets used in prior work. We further improve long-term performance with a sensor-fusion framework that incorporates auxiliary signals from lightweight sensors already available on commercial AR glasses, including magnetometers, barometers, and secondary IMUs. With this fusion strategy, positional drift is reduced by up to 42%, improving robustness and generalization across diverse motion conditions. Together, our results introduce a new paradigm for inertial and lightweight odometry by unifying human motion kinematics with multimodal sensing, setting a new benchmark for accurate and robust camera-less human tracking. Our website is available at https://spice-lab.org/projects/MARIO/.
Authors:Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller
Abstract:
Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.
Authors:Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha
Abstract:
As large language models (LLMs) increasingly act as collaborative partners, human--AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models. We find that nearest human--agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.
Authors:Om Dobariya, Akhil Kumar
Abstract:
The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.
Authors:Tobias Weinberg, Aaleyah Lewis, Ricardo E. Gonzalez Penuela, Weicong Hong, Jennifer Mankoff, Thijs Roumen
Abstract:
Voice is a central element of identity. We recognize people by their voice, and we uniquely express who we are with it. For people who rely on augmentative and alternative communication~(AAC) systems, such as speech-generating devices~(SGD), the device's voice becomes an identity marker others associate with them. Yet, it is hard to find a voice that truly aligns with one's identity both linguistically and culturally. Although modern AI-generated voices can reproduce diverse accents and speaking styles, AAC users still lack accessible ways to articulate how they want an identity-aligned voice to sound like. We first conducted a survey of AAC users (across eight countries) to characterize current voice representation, finding that non-binary, transgender, and non-US-born respondents rated their current voice support identity alignment consistently lower than other respondents. To examine how AAC users respond to voices designed to reflect their cultural identity, we built a tool that elicits cultural markers through guided questions and generates personalized voice candidates for participants to hear and reflect on. After participants heard the voices, we interviewed them to examine what it means for a voice to feel culturally representative, how they interpreted voices with cultural connotations, and how these voices shaped their sense of identity and agency. Our findings show that cultural voice alignment runs deeper than accent or language alone; it touches on belonging, self-recognition, and what it means to be heard as who you are.
Authors:Ko Watanabe, Shoya Ishimaru
Abstract:
Employees often struggle to identify ``who knows what,'' leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.
Authors:Arun-Balajiee Lekshmi-Narayanan, Mohammad Hassany, Peter Brusilovsky
Abstract:
Worked examples are step-by-step solutions to problems in a specific domain, offered to students to acquire domain-specific problem-solving skills. The effectiveness of worked examples could be enhanced by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the student's explanations. In the prevailing approach, student explanations are judged by their semantic similarity to an instructor's or domain expert's explanation. Given recent advances in LLM-based automated scoring, it remains unclear whether semantic similarity methods are still the most effective technique to automatically score textual student responses like essays or code explanations. Comparing these methods also requires quality datasets that offer distinctive features such as balanced class distributions and domain-specific labeled data for automated scoring tasks. In this paper, we present a rigorous comparison between LLMs and semantic similarity used for automated scoring, framed as a binary classification task.
Authors:Syed Mhamudul Hasan, Abdur R. Shahid
Abstract:
Generative AI systems are increasingly deployed as interactive agents in online environments, such as a social network called Moltbook. In Moltbook, large-scale agentic AIs can post, comment, and engage in activities generated at scale by AI-driven text. Yet these agent behavioral characteristics remain insufficiently understood, particularly in complex, multi-agent interaction. In this study, we analyze the emotional dynamics of agent interactions within Moltbook. We construct an emotion-aware framework that maps textual interactions to a predefined set of fine-grained emotional categories, enabling the extraction of structured emotion profiles across agents and interaction contexts. To further evaluate behavioral reliability, we introduce an emotion-based domain called Persona-Stimulus-Reaction (PSR) that captures the alignment of emotional responses across similar contexts. Our analysis shows distinct emotional patterns and varying levels of behavioral stability across agents. Our analysis reveals that agents exhibit distinct emotional signatures with varying levels of behavioral stability influenced by interaction context.
Authors:Meisam Jamshidi Seikavandi, Alice Modica, Anna Obara, Fabricio Batista Narcizo, Tanya Ignatenko, Ted Vucurevich, Jesper Bünsow Boldt, Paolo Burelli, Andrew Burke Dittberner
Abstract:
We present AffectAI-Capture, a protocol for collecting synchronized multimodal data in four-person meeting-like interactions, combining eye tracking, wearable physiology, close-talk and room audio, multi-view video, event logging, and structured self-report. Sessions use fixed task blocks grounded in established group-interaction paradigms, while acquisition and post-processing are organized around a single authoritative event timeline and standardized outputs. We describe the experimental rationale, synchronization philosophy, data organization, and practical trade-offs. Pilot-level validation of audio quality and video synchronization has been conducted using controlled bench tests; full protocol sessions with participants remain ongoing work. The contribution is a reproducible protocol architecture linking task design, instrumentation, timing provenance, and data packaging for affective, behavioral, and meeting-analytics research.
Authors:Cansu Koyuturk, Sabrina Guidotti, Dimitri Ognibene
Abstract:
Large Language Models (LLMs) are increasingly used in educational settings as interactive tools for collaboration. However, their tendency toward sycophancy, aligning with user beliefs even when incorrect, raises concerns for learning and decision-making, especially for less knowledgeable users. This study investigates how sycophantic alignment emerges in authentic multi-turn human-AI interactions and whether interventions targeting increasing AI literacy and prompting competencies can mitigate its effects. In a controlled mixed-design experiment, 60 participants completed analytical survival ranking tasks by first generating individual rankings and then making final decisions after collaborating with an AI assistant, both before and after receiving either general or sycophancy-focused prompting training. Preliminary results show that LLMs are highly sensitive to user input: lower-quality initial responses lead to poorer AI advice, suggesting that the model mirrors or incorporates user reasoning rather than correcting it or offering better alternatives that are missing or less frequent in the conversation. Critically, the propagation of user errors into AI responses significantly reduced both the quality of AI feedback and final user task performance, revealing a form of contextual sycophantic dependence. While the intervention did not eliminate the propagation of contextual errors, it significantly improved AI advice by reducing the direct mirroring of incorrect user rankings. These findings suggest that prompting and AI literacy alone may be insufficient to ensure epistemically independent AI support, highlighting the need for system-level approaches that better promote critical engagement in human-AI collaboration.
Authors:Mark S. Keller, Nils Gehlenborg
Abstract:
Tools used for implementing visualization software systems can generally be divided into camps such as static versus interactive and desktop versus web-based. We contribute Pluot, an architecture that bridges these divides, enabling a single software implementation of a visualization to be used regardless of the target level of interactivity or computing environment. With Pluot, a visualization developer implements a given visualization rendering function once, using the Rust programming language. Then, bindings to the Rust program can be generated to enable reproducible execution of the rendering function from other languages, such as Python or JavaScript. Pluot can render visualizations to bitmap or vector graphics format, bridging gaps between interactive performance and publication-quality figure creation. The software is available at https://pluot.dev.
Authors:Patrick Callaghan, Reid Simmons, Henny Admoni
Abstract:
Discrepancies between an agent's actual knowledge and what a person thinks the agent knows can hinder interactions. If an agent could detect such discrepancies, it could provide feedback to account for them and improve current and future interactions. Using the I-POMDP as a framework for a second-order Theory of Mind (ToM-2), this work endows an agent with the ability to model the evolution of a person's erroneous beliefs about an agent and the cognitive biases and heuristics (CBH) from which they arise. In doing so, the agent can detect when CBH might be at play during an interaction and adaptively generate feedback that accounts for them. An in-person user study shows how a ToM-2 learner can account for the effects of a teacher's CBH to significantly improve the informativeness of teacher actions, and subjective results suggest people find the ToM-2 learner's feedback more useful.
Authors:Nina Corvelo Benz, Eleni Straitouri, Manuel Gomez-Rodriguez
Abstract:
It is widely agreed that when AI models assist decision-makers in high-stakes domains by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence suggests that decision-makers often struggle to determine when to trust a prediction based solely on this communicated confidence. In this context, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI confidence and the decision-makers' confidence in their own predictions. Crucially, these findings do not yet elucidate the extent to which this alignment influences the complexity of learning to make optimal decisions through repeated interactions. In this paper, we address this question in the canonical case of binary predictions and binary decisions. We first show that this problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of $Ω(\sqrt{|H| \cdot |B| \cdot T} )$ on the expected regret any learner can attain, where $H$ and $B$ denote the sets of human and AI confidence values. We then demonstrate that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of $O(\sqrt{|H| \cdot T\log T})$ and, when $\sqrt{|H|} = O(\log T)$ and $B$ is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to $O(\sqrt{T\log T})$. Taken together, these results reveal that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments on real data from two different human-subject studies where participants solve simple decision-making tasks assisted by AI models show that our theoretical results are robust to violations of perfect alignment.
Authors:Fujiko Robledo Yamamoto, Nicholas Mattei, Pradeep Ragothaman, Robin Burke, Amy Voida
Abstract:
Fairness in machine learning is often conceptualized narrowly in comparative, distributional terms. In studying stakeholders' concepts of fairness, we find that this framing is insufficient to capture the full range of issues raised. As an alternative, we propose organizational justice as a framework that subsumes distributional fairness as well as other normative concerns. We conduct a case study of organizational justice relative to personalized recommendation in the context of Kiva Microfunds, a nonprofit micro-lending organization whose mission is to increase financial access for underserved communities across the world. We report on the results of co-design workshops conducted with Kiva employees who are involved in different departments and whose roles often lead them to prioritize normative concerns that are most supportive of the stakeholders with whom they work most closely. We apply organizational justice to understand design trade-offs among different normative goals stakeholders invoke. Based on these goals, we identify a suite of metrics that Kiva employees can use to monitor and assess the recommender system's impact on their organizational justice concerns and to seed discussions within the organization about appropriate configuration and deployment of this system in context.
Authors:Karthik Sreedhar, Aryan Kaul, Lydia B. Chilton
Abstract:
Customization has long been a central goal in interactive systems, yet prior work shows that end-user tailoring occurs infrequently and is often confined to initial setup or moments of breakdown. Recent advances in generative AI suggest that highly malleable systems-where users can modify system behavior through natural language-are now technically feasible. However, it remains unclear how such malleability is used in practice: What kinds of customizations do users create, when do they choose to customize, and how do these modifications shape their experience of everyday tools? We present a design probe that uses a conversationally customizable email system as an instrument to study how users create and refine functionality within everyday tools. The system allows users to iteratively modify their inbox by restructuring categories, introducing interface elements, and authoring new workflow behaviors directly through natural language interaction. We study how participants create, refine, and use these features over several days within their own email workflows. We find that users' customizations are often grounded in existing patterns, which they adapt and specialize to fit their needs, rather than generating entirely novel functionality. Malleability changes how users engage with their inbox, shifting it from a fixed interface to a flexible data layer shaped through user-authored features. At the same time, customization introduces new forms of risk, including mis-specified behavior, unintended filtering, and uncertainty around outcomes, which users manage through ongoing oversight and refinement. These findings highlight how conversational customization becomes embedded within everyday interaction, and point toward the need for systems that support iterative refinement, visibility into behavior, and safe experimentation as users shape their own tools.
Authors:Anthea Dathe, Kiran Hoffmann, Aline Mangold
Abstract:
Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches are not adequately capturing human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency, verification efficiency and careful integration of AI tools into researchers' workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.
Authors:Andrew Stratton, Phani Teja Singamaneni, Pranav Goyal, Rachid Alami, Christoforos Mavrogiannis
Abstract:
We contribute Bi3, a dataset of social robot navigation among groups of people in a constrained lab space. Compared to prior data collection efforts for social robot navigation, our dataset is unique in that it features: an original experiment design giving rise to close navigation encounters between two humans and a robot; five different navigation algorithms; two different robot platforms; a diverse participant pool of 74 people recruited from two sites in the USA and France; multimodal data streams including 10.5 hours of human and robot ground-truth motion tracks, RGB video, and user impressions over robot performance. Our analysis of the collected dataset through metrics like interaction density and human velocity suggests that Bi3 represents a benchmark of unique diversity and modeling complexity. Bi3 contributes towards understanding how humans and robots can productively mesh their activities in constrained environments, and can be a resource for training models of human motion prediction and robot control policies for navigation in densely crowded spaces.
Authors:Hasibur Rahman, Kenji Numata, Evelyn T Lai, Maria Cheriyan, Adrian Haimovich, Kei Ouchi, Smit Desai
Abstract:
Serious illness conversations (SICs) align care with patients' values, goals, and preferences, yet they rarely occur in emergency departments (EDs), where time constraints and emotional burden often leave clinicians making high-stakes decisions without documented insight into what matters most to patients. We present a case study of ED GOAL-AI, a voice-based conversational agent for brief, structured values discussions with older adults in the ED, evaluated with 55 patients for feasibility and acceptability. Most participants completed the conversation and reported the interaction as acceptable and feasible, with ratings of feeling heard and understood comparable to clinicians. However, we also observed critical failure modes, including boundary violations such as hallucinated diagnostic statements, highlighting ethical and emotional risks. This work points to early promise for AI-mediated SICs while underscoring the need for careful boundary setting and participatory design before broader deployment.
Authors:Sijia Liu, Hoi Ching Silvester Mok, Long Ling, Tobias Klein, Ray LC
Abstract:
Chinese ceramic-making involves complex and interdependent steps, making it technically demanding. Digital fabrication methods attempt to make the process more accessible, but for craft-creators, technical challenges such as CAD and CAM skills remain major obstacles. To address this, we designed a hybrid workflow that integrates Generative AI with clay 3D printing to support new creative possibilities. We evaluated the workflow through ClayScape, a design tool that operationalizes this approach, with four ceramic creators. Our findings show that the workflow supports accessible ceramic creation while revealing both expanded opportunities for creative exploration and challenges in balancing agency and control. This work demonstrates how hybrid workflows can lower barriers to digital fabrication while supporting creative possibilities in culturally grounded ceramic practices.
Authors:Xiaowei Jiang, Sudong Shang, Adrian Wilkinson, Michael L. Platt, Da Xiao, Bening Cao, Thomas Do
Abstract:
P300-based brain-computer interfaces (BCIs) are widely used for communication, but population heterogeneity may alter the neural patterns available for decoding. Prior work has mainly examined such differences at the signal or performance level, while the representation structure learned by the decoder remains underexplored. In this study, we propose an interpretable fuzzy spatiotemporal framework for P300 classification and use it to analyze population-level differences across amyotrophic lateral sclerosis (ALS), autism (AUT), and neurotypical (NT) cohorts. The model employs spatial and temporal fuzzy filters with learnable prototypes, enabling both classification and reconstruction of cohort-specific fuzzy centers. Experiments were conducted on ALS and NT subsets from bigP3BCI and on the BCIAUT-P300 benchmark in a within-subject setting. The proposed model achieved competitive performance against multiple deep learning baselines. More importantly, the reconstructed fuzzy centers revealed systematic cohort-dependent differences in waveform morphology and representation geometry. Point-wise statistical analysis identified significant temporal differences between cohorts, including intervals overlapping with the canonical P300 window, and low-dimensional embeddings showed partially separated cohort-specific prototype organizations. These results suggest that population heterogeneity in P300-BCI is reflected not only in decoding performance but also in the discriminative structure learned by the model. The proposed framework provides an interpretable route toward population-aware P300-BCI analysis and design.
Authors:Tin Nguyen, Thang T. Truong, Runtao Zhou, Trung Bui, Chirag Agarwal, Anh Totti Nguyen
Abstract:
Users browsing the web daily struggle to quickly locate relevant information in cluttered pages, complete unfamiliar multi-step tasks, and stay focused amid distracting content. State-of-the-art AI assistants (e.g., ChatGPT, Gemini, Claude) and browser agents (e.g., OpenAI Operator, Browser Use) can answer questions and automate actions, yet they return answers without showing where the information comes from on the page, forcing users to manually verify results and blindly trust every automated steps. We present PageGuide, a browser extension that grounds LLM answers directly in the HTML DOM via visual overlays, addressing three core user needs: (a) Find-locating and highlighting relevant evidence in-situ so users can instantly verify answers on the page; (b) Guide-showing step-by-step instructions (e.g. how to change password) one at a time so users can follow and perform actions by themselves; and (c) Hide-hiding distracting content-giving users a chance to decide to hide an element or not. In a user study (N=94), PageGuide outperform unaided browsing across all modes: Hide accuracy improve by 26 percentage points (86.7% relative gain) and task completion time drops by 70%; Guide completion rate increases by 30 percentage points; and Find reduces manual search effort, with Ctrl+F usage falling by 80% and task time decreasing by 19%. Code and demo is at: pageguide.github.io.
Authors:Nicy Scaria, Silvester John Joseph Kennedy, Deepak Subramani
Abstract:
Most digital language learning curricula rely on discrete-item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can advance despite persistent gaps in using grammar and vocabulary during interaction. Recent work on LLM-based judging suggests a path toward scoring open-ended conversations, but using interaction evidence to drive progression and review requires scoring protocols that are reliable and validated. We introduce Learning in Blocks, a framework that grounds progression in demonstrated conversational competence evaluated using CEFR-aligned rubrics. The framework employs heterogeneous multi-agent debate (HeteroMAD) in two stages: a scoring stage where role-specialized agents independently evaluate Grammar, Vocabulary, and Interactive Communication, engage in debate to address conflicting judgments, and a judge synthesizes consensus scores; and a recommendation stage that identifies specific grammar skills and vocabulary topics for targeted review. Progression requires demonstrating 70% mastery, and spaced review targets identified weaknesses to counter skill decay. We benchmark four scoring and recommendation methods on CEFR A2 conversations annotated by ESL experts. HeteroMAD achieves a superior score agreement with a 0.23 degree of variation and recommendation acceptability of 90.91%. An 8-week study with 180 CEFR A2 learners demonstrates that combining rubric-aligned scoring and recommendation with spaced review and mastery-based progression produces better learning outcomes than feedback alone.
Authors:Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, Hugo Ferreira, Pedro Bizarro
Abstract:
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
Authors:Patrick Vossler, Jean Feng, Venkat Sivaraman, Robert Gallo, Hemal Kanzaria, Dana Freiser, Christopher Ross, Amy Ou, James Marks, Susan Ehrlich, Christopher Peabody, Lucas Zier
Abstract:
Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A critical step of QI is to identify the key modifiable contributing factors, a process we call QI factor discovery, typically through expert-driven semi-structured qualitative tools like fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. Nevertheless, current AI alignment methods assume the task is well-defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only LLM prompts but also the overarching natural-language specifications. In particular, we map QI factor discovery to steps of the classical AI/ML development process (problem formalization, model learning, and model validation) where the specifications are tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and AI pipeline until AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved $\ge 70\%$ concordance with expert annotations. Compared to prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.
Authors:Sarah Lykke Tost, Adson Lucas de Paiva Sales, Henrik Østergaard, Vaishali Dhanoa, Gabriela Molina León
Abstract:
We designed and implemented InvestChat, a multimodal tablet-based application that supports stock market exploration with multiple coordinated views and an LLM-powered chat. We evaluated the application with 12 novice investors. Our findings suggest that combining natural language, touch, and pen input during stock market exploration facilitates user engagement. Participants leveraged the modalities in complementary ways, enjoying the freedom of choice and finding natural language most effective.
Authors:Botao Amber Hu, Yilan Elan Tao, Bernhard Riecke, Yue Li
Abstract:
Mixed reality systems support shared anchors and co-located interaction, yet they lack a socially legible protocol for entering another person's mixed reality in public settings. We frame this as a protocol problem: co-located MR sharing requires a staged sequence -- Discover, Consent, Confirm, Allow, Spatial Colocation, Sync Objects, Permission Management -- each demanding user understanding and agreement. Using AirDrop and Apple Vision Pro SharePlay as a baseline, we show that MR encounter complexity far exceeds file transfer, yet must feel equally effortless. We present TouchPort, an embodied sharing protocol that collapses this multi-stage sequence into a single gesture: a handshake and pull that simultaneously signals intent, negotiates consent, and initiates a temporary shared encounter layer between otherwise separate mixed realities. Through three implied scenarios, we demonstrate the protocol's expressive range in the transition from isolated to spontaneously shared realities. We discuss how embodied gestures can address the consent problem in ubiquitous MR and examine the ethical tensions of encounter protocols for MR futures.
Authors:Tengyou Xu, Detao Ma, Xiang 'Anthony' Chen
Abstract:
The rise of large language models (LLMs) has given rise to a class of prompt-based interactive systems where users primarily express their input in natural language. However, composing a prompt as a linear text string becomes unwieldy when capturing users' multifaceted intents. We present Object-Oriented Prompting (OOPrompt), an emergent interaction paradigm that enables users to create, edit, iterate, and reuse prompts as structured, manipulable artifacts, unifying and generalizing several existing point systems. We first outlined a design space from existing work and built an early prototype, which we deployed as a probe in a formative study with 20 participants. Their feedback informed an expanded OOPrompt design space. We then developed the full OOPrompt prototype and conducted a validation study to further understand OOPrompt's added values and trade-offs. We expect the OOPrompt design space to provide theoretical and empirical guidance to the design and engineering of prompt-based, LLM-enabled interactive systems.
Authors:Zhijun Zheng, Tian Qiu, Yuheng Zhao, Siming Chen
Abstract:
In visual analytics, applying filters to drill-down and extract higher-value insights is a common and important data analysis method. When the drill-down space becomes excessively large, analysts may lose orientation, leading to decreased efficiency in the drill-down process. To tackle these challenges, we propose the Intelligent Drill-Down Framework, in which a large language model (LLM) facilitates the generation of visual insights, leverages user interaction data to interpret user intent, and generates appropriate drill-down paths. Our method is designed to assist users in identifying valuable drill-down paths when exploring multidimensional data, thereby reducing the cognitive burden of data interpretation and facilitating the generation of insights. Specifically, we propose a drill-down path recommendation method, in which the LLM is trained to approximate a validated greedy algorithm. Secondly, we analyze the user's intent to construct a drill-down chart. Finally, we design a branch management method. Building upon this framework, we designed a system that includes a hybrid interface providing hierarchical navigation to monitor users and manage parallel branches, a visualization panel for interactive data exploration, and an insight panel to present analytical findings and generate drill-down recommendations. We evaluated the effectiveness of our method through a demonstrative use case and a user study.
Authors:Willem van der Maden, Malak Sadek, Ziang Xiao, Aske Mottelson, Q. Vera Liao, Jichen Zhu
Abstract:
How do product teams evaluate LLM-powered products? As organizations integrate large language models (LLMs) into digital products, their unpredictable nature makes traditional evaluation approaches inadequate, yet little is known about how practitioners navigate this challenge. Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices spanning informal 'vibe checks' to organizational meta-work. Beyond confirming four documented challenges, we introduce a novel fifth we call the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. Drawing on patterns from successful teams, we contribute strategies to bridge this gap, supporting practitioners' formalization journey from ad-hoc interpretive practices (e.g., vibe checks) toward systematic evaluation. Our analysis suggests these interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures. For HCI researchers, this presents a research opportunity to support practitioners in systematizing emerging practices rather than developing new evaluation frameworks.
Authors:Gabriela Molina León, Benjamin Bach, Matheus Valentim, Niklas Elmqvist
Abstract:
How do we assess people's abilities to interact with data visualizations? The current state-of-the-art visualization literacy tests -- such as VLAT and its derivatives -- only involve the use of static visualizations. Despite advances in investigating multiple visualization abilities, we do not yet have formal methods to assess the ability of a person to interact with a data visualization effectively. In this position paper, we discuss related literacy concepts and assessments to propose and compare different approaches for assessing the abilities that people leverage to use visualizations in interactive sensemaking tasks.
Authors:Dongyang Guo, Yasmeen Abdrabou, Enkelejda Kasneci
Abstract:
Gaze event detection is fundamental to vision science, human-computer interaction, and applied analytics. However, current workflows often require specialized programming knowledge and careful handling of heterogeneous raw data formats. Classical detectors such as I-VT and I-DT are effective but highly sensitive to preprocessing and parameterization, limiting their usability outside specialized laboratories. This work introduces a code-free, large language model (LLM)-driven pipeline that converts natural language instructions into an end-to-end analysis. The system (1) inspects raw eye-tracking files to infer structure and metadata, (2) generates executable routines for data cleaning and detector implementation from concise user prompts, (3) applies the generated detector to label fixations and saccades, and (4) returns results and explanatory reports, and allows users to iteratively optimize their code by editing the prompt. Evaluated on public benchmarks, the approach achieves accuracy comparable to traditional methods while substantially reducing technical overhead. The framework lowers barriers to entry for eye-tracking research, providing a flexible and accessible alternative to code-intensive workflows.
Authors:Maciej Grzeszczuk, Kinga Skorupska, Grzegorz M. Wójcik
Abstract:
Digitizing magnetic media containing computer data is only the first step towards the preservation of early home computing era artifacts. The audio tape images must be decoded, verified, repaired if necessary, tested, and documented. If parts of this process could be effectively automated, volunteers could focus on contributing contextual and historical knowledge rather than struggling with technical tools. We therefore propose a feature representation based on Checksum Count Vectors and evaluate its applicability to detecting duplicates and variants of recordings within a large data store. The approach was tested on a collection of decoded tape images (n=4902), achieving 58\% accuracy in detecting variants and 97% accuracy in identifying alternative copies, for damaged recordings with up to 75% of records missing. These results represent an important step towards fully automated pipelines for restoration, de-duplication, and semantic integration of historical digital artifacts through sequence matching, automatic repair and knowledge discovery.
Authors:Xinyu Wang, Emma Carpenetti, Bruce Desmarais, Sarah Rajtmajer
Abstract:
Unlike the more observable phenomenon of group opinion reinforcement, self-censorship online has received comparatively less attention. Our goal in this work is to dissect the phenomena of self-censorship and to examine the implications of restrained expression for participation in public discourse, particularly in polarized contexts. We explore how social media users express their opinions online through analyses of 390 survey responses and 20 semi-structured interviews using a mixed-methods approach. We ask social media users about the differences between their publicly shared opinions and privately held beliefs, highlighting the influence of contextual factors on self-expression. Our findings show that self-censorship is associated with community context; social media users embedded within larger audiences, with lower posting frequency and perceived support, are less likely to express their opinions, and those who do speak often adjust their expressed views to align with perceived group norms. The study complements the rich literature on echo chambers and opinion reinforcement on social media platforms, highlighting the silence within the noise and its potential consequences for public discourse, which have become increasingly pertinent in an era where online platforms are pivotal to social and political narratives.
Authors:Takashi Sato, Ryo Takahashi, Kento Yamagishi, Takao Someya, Michinao Hashimoto, Eiji Iwase, Yoshihiro Kawahara, Junya Kurumida, Wataru Iwasaki
Abstract:
A recyclable and cuttable wireless power transfer (WPT) sheet is proposed, enabled by H-tree wiring and water-soluble channels filled with liquid metal (LM). Conventional 2D WPT systems lose their functionality when physically damaged or modified. The H-tree wiring pattern maintains the operation of the remaining coils even after the outer region of the sheet is cut away. The LM can be recovered by dissolving 3D-printed polyvinyl alcohol (PVA) channels in water. The sheet dimensions were experimentally optimized, and a Q-factor over 55 was achieved at 6.78 MHz. The sheet maintained its bending stiffness and electrical resistance during 100 bending cycles. After four dissolution-refabrication cycles, 98 percent of the LM was recovered with stable electrical properties. The WPT sheet can be integrated into everyday objects and enables long-term, continuous operation of surrounding electronic devices, contributing to IoT applications and ambient computing.
Authors:Haichang Li, Qinshi Zhang, Piaohong Wang, Zhicong Lu
Abstract:
In the human-AI collaboration area, the context formed naturally through multi-turn interactions is typically flattened into a chronological sequence and treated as a fixed whole in subsequent reasoning, with no mechanism for dynamic organization and management along the collaboration workflow. Yet these contexts differ substantially in lifecycle, structural hierarchy, and relevance. For instance, temporary or abandoned exchanges and parallel topic threads persist in the limited context window, causing interference and even conflict. Meanwhile, users are largely limited to influencing context indirectly through input modifications (e.g., corrections, references, or ignoring), leaving their control neither explicit nor verifiable. To address this, we propose Mixed-Initiative Context, which reconceptualizes the context formed across multi-turn interactions as an explicit, structured, and manipulable interactive object. Under this concept, the structure, scope, and content of context can be dynamically organized and adjusted according to task needs, enabling both humans and AI to actively participate in context construction and regulation. To explore this concept, we implement Contextify as a probe system and conduct a user study examining users' context management behaviors, attitudes toward AI initiative, and overall collaboration experience. We conclude by discussing the implications of this concept for the HCI community.
Authors:Jane Hanqi Li, Yuhong Zhang, Jiaqi Liu, Tzyy-Ping Jung, Amy Eguchi
Abstract:
The use of generative AI (genAI) tools as informal tutors is becoming increasingly prevalent among secondary school students in mathematics learning. In many schools, individualized instructional support is limited, and one-on-one human tutoring remains costly in most learning contexts. GenAI has the potential to provide timely, on-demand help to students when teachers or tutors are not available. However, there are still few studies that examine students' preferences for AI tutor support that enhances autonomous learning. We investigated learner expectations for AI tutoring through a survey with secondary school students in China (Grades 7-11; N=330). Students generally preferred support that preserves learner autonomy (e.g., time to think, hints over direct answers), expressed mixed or cautious preferences between human and AI tutors, and held nuanced views of proactive intervention, valuing adaptivity but also worrying about annoyance and autonomy. Privacy boundaries were uneven: many accepted sharing problem steps and error patterns, while willingness dropped for more sensitive signals such as attention or behavior. Our findings offer learner-centered insights for designing AI tutors that balance timely intervention with student agency, and personalization with perceived boundaries in a K-12 context.
Authors:Xiaoan Liu, Eric J Gonzalez, Nels Numan, Andrea Colaço, Lucy Abramyan, Chen Zhu-Tian, Ryo Suzuki, Mar Gonzalez-Franco
Abstract:
Bridging the physical and digital world through interaction remains a core challenge in augmented reality (AR). Existing systems target single objects, limiting support for planning, comparison, and assembly tasks that depend on relationships among multiple items. We present Semantic Reality, an AR system focused on surfacing inter-object connectivity and making it interactive. Leveraging multimodal reasoning, spatial anchoring, and physical action recognition, Semantic Reality maintains a persistent model of objects around the user and their relationships. Connections are visualized in-situ to highlight compatibility, reveal next steps, and reduce ambiguity during tasks. We contribute a connectivity-centered interaction paradigm and a system architecture that couples anchor tracking, action sensing, and model inference to construct a live connectivity graph. In an exploratory study comparing Semantic Reality to a single-object baseline, participants reported clearer inter-object understanding and higher engagement and satisfaction, without increased workload. A scenario study illustrates where connectivity aids planning, sequencing, and disambiguation.
Authors:Saketh Ram Kasibatla, Raven Rothkopf, Hila Peleg, Benjamin C. Pierce, Sorin Lerner, Harrison Goldstein, Nadia Polikarpova
Abstract:
AI agents allow developers to express computational intent abstractly, reducing cognitive effort and helping achieve flow during programming. Increased abstraction, however, comes at a cost: developers cede decision-making authority to agents, often without realizing that important design decisions are being made without them. We aim to bring these decisions to the foreground in a paradigm we dub decision-oriented programming. In DOP, (1) decisions are explicit and structured, serving as the shared medium between the programmer and the agent; (2) decisions are co-authored interactively, with the agent proactively eliciting them from the programmer; and (3) each decision is traceable to code. As a step towards this vision, we have built Aporia, a design probe that tracks decisions in a persistent, editable Decision Bank; elicits them by asking programmers design questions; and encodes each decision as an executable test suite that can be used to validate the implementation. In a user study of 14 programmers, Aporia increased engagement in the design process and scaffolded both exploration and validation. Participants also gained a more accurate understanding of their implementations, with their mental models 5x less likely to disagree with the code than a baseline coding agent.
Authors:Xiaoan Liu, DaeHo Lee, Eric J Gonzalez, Mar Gonzalez-Franco, Ryo Suzuki
Abstract:
We present VisionClaw, an always-on wearable AI agent that integrates live egocentric perception with agentic task execution. Running on Meta Ray-Ban smart glasses, VisionClaw continuously perceives real-world context and enables in-situ, speech-driven action initiation and delegation via OpenClaw AI agents. Therefore, users can directly execute tasks through the smart glasses, such as adding real-world objects to an Amazon cart, generating notes from physical documents, receiving meeting briefings on the go, creating events from posters, or controlling IoT devices. We evaluate VisionClaw through a controlled laboratory study (N=12) and a longitudinal deployment study (N=5). Results show that integrating perception and execution enables faster task completion and reduces interaction overhead compared to non-always-on and non-agent baselines. Beyond performance gains, deployment findings reveal a shift in interaction: tasks are initiated opportunistically during ongoing activities, and execution is increasingly delegated rather than manually controlled. These results suggest a new paradigm for wearable AI agents, where perception and action are continuously coupled to support situated, hands-free interaction.
Authors:Veda Duddu, Jash Rajesh Parekh, Andy Mao, Hanyi Min, Ziang Xiao, Vedant Das Swain, Koustuv Saha
Abstract:
AI-driven conversational coaching is increasingly used to support workplace negotiation, yet prior work assumes uniform effectiveness across users. We challenge this assumption by examining how individual differences, particularly personality traits, moderate coaching outcomes. We conducted a between-subjects experiment (N=267) comparing theory-driven AI (Trucey), general-purpose AI (Control-AI), and a traditional negotiation handbook (Control-NoAI). Participants were clustered into three profiles -- resilient, overcontrolled, and undercontrolled -- based on the Big-Five personality traits and ARC typology. Resilient workers achieved broad psychological gains primarily from the handbook, overcontrolled workers showed outcome-specific improvements with theory-driven AI, and undercontrolled workers exhibited minimal effects despite engaging with the frameworks. These patterns suggest personality as a predictor of readiness beyond stage-based tailoring: vulnerable users benefit from targeted rather than comprehensive interventions. The study advances understanding of personality-determined intervention prerequisites and highlights design implications for adaptive AI coaching systems that align support intensity with individual readiness, rather than assuming universal effectiveness.
Authors:Junwei Yu, Mufeng Yang, Yepeng Ding, Hiroyuki Sato
Abstract:
The proliferation of AI-powered search engines has shifted information discovery from traditional link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. While existing Generative Engine Optimization (GEO) approaches focus primarily on semantic content modification, the role of structural features in influencing citation behavior remains underexplored. In this paper, we propose GEO-SFE, a systematic framework for structural feature engineering in generative engine optimization. Our approach decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis), and models their impact on citation probability across different generative engine architectures. We develop architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness. Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed framework. This work establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.
Authors:Long Ling, Xiyu Zheng, Gengchen Cao, Ray LC
Abstract:
People traditionally divine the future by interpreting natural phenomena as oracular signals, especially in societies adhering to traditional beliefs like China. With the advent of Generative AI (GenAI), people gain access to new ways of probing digital oracles for predicting the future. To understand how people use and interpret GenAI for divination in China, we interviewed 22 participants who habitually use GenAI platforms for fortune-telling, complemented by a three-week digital ethnography with 1,842 community posts. Qualitative analysis showed that people who seek psychological comfort are particularly receptive to GenAI-based decision-making. Users valued GenAI's accessibility, convenience, and efficiency while perceiving its lack of spiritual mystique. We observed community dynamics forming around GenAI tools, where users reinforce interpretations by sharing and discussing with each other, repeating queries until responses align with expectations. Our work uncovers how AI technologies change the way people and communities engage in traditional cultural practices while yearning for the same goals.
Authors:Einari Vaaras, Manu Airaksinen, Okko Räsänen
Abstract:
Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
Authors:Yimeng Wang, Yinzhou Wang, Alicia Hong, Yixuan Zhang
Abstract:
Social anxiety (SA) is a prevalent mental health challenge that significantly impacts daily social interactions. Imaginal Exposure (IE), a Cognitive Behavioral Therapy (CBT) technique involving imagined anxiety-provoking scenarios, is effective but underutilized, in part because traditional IE homework requires clients to construct and sustain clinically relevant fear narratives. In this work, we explore the feasibility of an LLM-enabled tool that supports IE by generating vivid, personalized exposure scripts. We first co-designed ImaginalExpoBot with mental health professionals, followed by a formative evaluation with five therapists and a user study involving 19 individuals experiencing SA symptoms. Our findings show that LLM-enabled support can facilitate preparation for anxiety-inducing situations while enabling immediate, user-specific adaptation, with scenarios remaining within a therapeutically beneficial "window of tolerance". Our participants and MHPs also identified limitations in continuity and customization, pointing to the need for deeper adaptivity in future designs. These findings offer preliminary design insights for integrating LLMs into structured therapeutic practices in accessible, scalable ways.
Authors:Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim
Abstract:
Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software. GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context significantly improved the performance, raising help prediction by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.
Authors:Abdullah Hamdi, Changchun Yang, Xin Gao
Abstract:
Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench .
Authors:Gregor Baer, Chao Zhang, Isel Grau, Pieter Van Gorp
Abstract:
Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.
Authors:Nikolas Papadopoulos, Shreenithi Navaneethan, Sheng Bai, Ankur Samanta, Paul Sajda
Abstract:
Preference learning methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on pairwise human judgments, yet little is known about the cognitive processes underlying these judgments. We investigate whether eye-tracking can reveal preference formation during pairwise AI-generated image evaluation. Thirty participants completed 1,800 trials while their gaze was recorded. We replicated the gaze cascade effect, with gaze shifting toward chosen images approximately one second before the decision. Cascade dynamics were consistent across confidence levels. Gaze features predicted binary choice (68% accuracy), with chosen images receiving more dwell time, fixations, and revisits. Gaze transitions distinguished high-confidence from uncertain decisions (66% accuracy), with low-confidence trials showing more image switches per second. These results show that gaze patterns predict both choice and confidence in pairwise image evaluations, suggesting that eye-tracking provides implicit signals relevant to the quality of preference annotations.
Authors:Boxuan Ma, Baofeng Ren, Huiyong Li, Gen Li, Li Chen, Atsushi Shimada, Shin'Ichi Konomi
Abstract:
Generative AI tools are increasingly used for coursework help, shifting much of students' help-seeking and reasoning into student-AI chats that are largely invisible to instructors. This loss of visibility can weaken instructors' ability to understand students' difficulties, ensure alignment with course goals, and uphold course policies. Yet transcript-level access is neither scalable nor ethically straightforward: reading raw chat logs across a class is impractical, and exposing detailed dialogue can raise privacy concerns and chilling effects on help seeking. As a result, instructors face a tension between needing actionable insight and avoiding default surveillance of student conversations. To address this gap, we propose a meta-reflective dashboard that makes student-AI sessions interpretable without exposing raw chat logs by default. After each help-seeking session, a reflection AI produces a structured, session-level summary of the student's interaction trajectory, AI usage patterns, and potential risks. We co-designed the dashboard with instructors and students to surface key challenges and design goals, and conducted a formative evaluation of perceived usefulness, trust in the summaries, and privacy acceptability. Findings suggest that the proposed dashboard can reduce instructors' sensemaking effort while mitigating privacy concerns associated with transcript-level access, and they also yield design implications for evidence, governance, and scalable class-level analytics for AI-supported learning.
Authors:Boxuan Ma, Yinjie Xie, Huiyong Li, Gen Li, Li Chen, Atsushi Shimada, Shin'Ichi Konomi
Abstract:
AI-powered coding assistants can support students in programming courses by providing on-demand explanations and debugging help. However, existing research often focuses on individual tools, leaving a gap in evidence-based design recommendations that reflect both educator and student perspectives in education settings. To ground the design of learning-oriented AI coding assistants for both sides' needs, we conducted parallel surveys of educators (N=50) and students (N=90) to compare preferences about (i) how students should request help, (ii) how AI should respond, and (iii) who should control. Our results show that educators generally favored indirect scaffolding that preserves students' reasoning, whereas students were more likely to prefer direct, actionable help. Educators further highlighted the need for course-aligned constraints and instructor-facing oversight, while students emphasized timely support and clarity when stuck. Based on these findings, we discuss the interaction-focused design space and derive design implications for learning-oriented AI coding assistants, highlighting scaffolding and control mechanisms that balance students' agency with instructional constraints.
Authors:Boxuan Ma, Huiyong Li, Gen Li, Li Chen, Cheng Tang, Atsushi Shimada, Shin'ichi Konomi
Abstract:
Generative AI (GenAI) tools such as ChatGPT now provide novice programmers with instant, personalized support and are reshaping computing education. While a growing body of work examines AI's immediate impacts, longitudinal evidence remains limited on how students' awareness, student-AI interaction patterns, and course outcomes evolve as AI becomes routine in classrooms. To address this gap, we investigate an introductory Python course across three successive AI-supported cohorts (2023-2025). Using questionnaires, coded student-AI dialogue logs, and course assessment records, we examine cohort-to-cohort shifts in students' AI awareness, interaction practices, and learning outcomes. We find that students' relationships with GenAI change systematically over time: familiarity and uptake become increasingly normative, and help-seeking practices evolve alongside growing AI literacy and shifting expectations of what the assistant should provide. These changes suggest that, in the AI era, the central instructional challenge is less about whether students use AI and more about how courses redefine productive learning practices while maintaining student agency. Our study offers longitudinal evidence and practical implications for designing and integrating AI programming support in course settings.
Authors:Tatiana Chakravorti, Pranav Narayanan Venkit, Sourojit Ghosh, Sarah Rajtmajer
Abstract:
Generative AI tools are increasingly entering academic peer review workflows, raising questions about fairness, accountability, and the legitimacy of evaluative judgment. While these systems promise efficiency gains amid growing reviewer overload, their use introduces new sociotechnical risks. This paper presents a convergent mixed-method study combining discourse analysis of 448 social media posts with interviews with 14 area chairs and program chairs from leading AI and HCI conferences to examine how GenAI is discussed and experienced in peer review. Across both datasets, we find broad agreement that GenAI may be acceptable for limited supportive tasks, such as improving clarity or structuring feedback, but that core evaluative judgments, assessing novelty, contribution, and acceptance, should remain human responsibilities. At the same time, participants highlight concerns about epistemic harm, over-standardization, unclear responsibility, and adversarial risks such as prompt injection. User interviews reveal how structural strain and institutional policy ambiguity shift interpretive and enforcement burdens onto individual scholars, disproportionately affecting junior authors and reviewers. By triangulating public governance discourse with lived review practices, this work reframes AI mediated peer review as a sociotechnical governance challenge and offers recommendations for preserving accountability, trust, and meaningful human oversight. Overall, we argue that AI-assisted peer review is best governed not by blanket bans or detection alone, but by explicitly reserving evaluative judgment for humans while instituting enforceable, role-specific controls that preserve accountability. We conclude with role specific recommendations that formalize the support judgment boundary.
Authors:Lingavasan Suresh Kumar, Yang Ba, Rong Pan
Abstract:
Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as passive storage, lacking mechanisms to resolve contradictions, enforce privacy, or prevent outdated information ("zombie memories") from contaminating the context window. We introduce MemArchitect, a governance layer that decouples memory lifecycle management from model weights. MemArchitect enforces explicit, rule-based policies, including memory decay, conflict resolution, and privacy controls. We demonstrate that governed memory consistently outperforms unmanaged memory in agentic settings, highlighting the necessity of structured memory governance for reliable and safe autonomous systems.
Authors:Artemis Kontou, Natalia Miroshnikova, Costakis Matheou, Sophocles Sophocleous, Nicholas Tsekouras, Kleanthis Malialis, Panayiotis Kolios
Abstract:
This study presents AI-HEART, a cloud-based information system for managing and analysing long-duration ambulatory electrocardiogram (ECG) recordings and supporting clinician decision-making. The platform operationalises an end-to-end pipeline that ingests multi-day three-lead ECGs, normalises inputs, performs signal preprocessing, and applies dedicated deep neural networks for wave delineation, noise/quality detection, and beat- and rhythm-level multi-class arrhythmia classification. To address class imbalance and real-world signal variability, model development combines large clinically annotated datasets with expert-in-the-loop curation and generative augmentation for under-represented rhythms. Empirical evaluation on three-lead ambulatory ECG data shows that delineation accuracy is sufficient for automated interval measurement, noise detection reliably flags poor-quality segments, and arrhythmia classification achieves high specificity with clinically useful macro-averaged performance across common and rarer rhythms. Beyond predictive accuracy, AI-HEART provides a scalable deployment approach for integrating AI into routine ECG services, enabling traceable outputs, audit-friendly storage of recordings and derived annotations, and clinician review/editing that captures feedback for controlled model improvement. The findings demonstrate the technical feasibility and operational value of a noise-aware AI-ECG platform as a digital health information system.
Authors:Mathias N. Lystbæk, Haley Adams, Ranjith Kagathi Ananda, Eric J Gonzalez, Luca Ballan, Qiuxuan Wu, Andrea Colaço, Peter Tan, Mar Gonzalez-Franco
Abstract:
Audio-only walking navigation can leave users disoriented, relying on vague cardinal directions and lacking real-time environmental context, leading to frequent errors. To address this, we present a novel system that integrates a Vision Language Model (VLM) with a spatial audio cue. Our system extracts environmental landmarks to anchor navigation instructions and, crucially, provides a directional spatial audio signal when the user faces the wrong direction, indicating the precise turn direction. In a user study (n=12), the spatial audio cue with VLM reduced route deviations compared to both VLM-only and Google Maps (audio-only) baseline systems. Users reported that the spatial audio cue effectively supported orientation and that landmark-anchored instructions provided a better navigation experience over audio-only Google Maps. This work serves as an initial look at the utility of future audio-only navigation systems for incorporating directional cues, especially real-time corrective spatial audio.
Authors:Adrian Iste, Kazuki Nishizawa, Chisa Tanaka, Andrew Vargo, Anna Scius-Bertrand, Andreas Fischer, Koichi Kise
Abstract:
Digital handwriting acquisition enables the capture of detailed temporal and kinematic signals reflecting the motor processes underlying writing behavior. While handwriting analysis has been extensively explored in clinical or adult populations, its potential for studying developmental and educational characteristics in children remains less investigated. In this work, we examine whether handwriting dynamics encode information related to student characteristics using a large-scale online dataset collected from Japanese students from elementary school to junior high school. We systematically compare three families of handwriting-derived features: basic statistical descriptors of kinematic signals, entropy-based measures of variability, and parameters obtained from the sigma-lognormal model. Although the dataset contains dense stroke-level recordings, features are aggregated at the student level to enable a controlled comparison between representations. These features are evaluated across three prediction tasks: grade prediction, gender classification, and academic performance classification, using Linear or Logistic Regression and Random Forest models under consistent experimental settings. The results show that handwriting dynamics contain measurable signals related to developmental stage and individual differences, especially for the grade prediction task. These findings highlight the potential of kinematic handwriting analysis and confirm that through their development, children's handwriting evolves toward a lognormal motor organization.
Authors:Chisa Tanaka, Andrew Vargo, Anna Scius-Bertrand, Andreas Fischer, Koichi Kise
Abstract:
While handwriting has traditionally been studied for character recognition and disease classification, its potential to reflect day-to-day physiological fluctuations in healthy individuals remains unexplored. This study examines whether daily variations in sleep-related recovery states can be inferred from online handwriting dynamics. % We propose a personalized binary classification framework that detects low-recovery days using features derived from the Sigma-Lognormal model, which captures the neuromotor generation process of pen strokes. In a 28-day in-the-wild study involving 13 university students, handwriting was recorded three times daily, and nocturnal cardiac indicators were measured using a wearable ring. For each participant, the lowest (or highest) quartile of four sleep-related metrics -- HRV, lowest heart rate, average heart rate, and total sleep duration -- defined the positive class. Leave-One-Day-Out cross-validation showed that PR-AUC significantly exceeded the baseline (0.25) for all four variables after FDR correction, with the strongest performance observed for cardiac-related variables. Importantly, classification performance did not differ significantly across task types or recording timings, indicating that recovery-related signals are embedded in general movement dynamics. These results demonstrate that subtle within-person autonomic recovery fluctuations can be detected from everyday handwriting, opening a new direction for non-invasive, device-independent health monitoring.
Authors:Botao Amber Hu, Danlin Huang, Yilan Elan Tao, Xiaobo Aaron Hu, Rem RunGu Lin
Abstract:
Mycorrhizal networks -- often called nature's ``wood-wide web'' -- are vast underground mycelial systems that connect individual plants through countless hyphae of mycorrhizal fungi joining with plant roots. Through these hyphal webs, resources and signals -- carbohydrates, minerals, and biochemical cues -- are mutualistically exchanged and redistributed across plants, sustaining forests as relational symbiotic ecologies rather than isolated individuals. What is it like to be a plant within the wood-wide web? We present \emph{FungiSync}, a multi-person, co-located mixed reality (MR) experience that translates mycorrhizal interdependence into a felt, somaesthetic participatory ritual. Participants embody different forest plants by holding masquerade-style MR headset masks with wood-branch-like handles decorated with mushrooms. In MR, each participant perceives a distinct, audio-reactive psychedelic augmented reality overlay -- composed of resource-representing visual elements -- layered atop a shared physical terrain, symbolizing an individualized digital \emph{umwelt} (perceptual world). FungiSync reprograms human hand touch into a metaphorical mycorrhizal exchange. When participants touch hands, their digital \emph{umwelten} begin to entangle: visual elements leak, mix, and merge across perspectives, as if hyphae were forging new connections and carrying resources between hosts within a larger mycelial network. By making mycorrhizal interdependence perceptible through embodied contact, FungiSync invites participants to feel with \emph{fungal epistemics} -- a more-than-human alternative way of knowing grounded in symbiotic relationality as both an aesthetic experience and an ethical orientation -- offering a critique of the accelerated individualism characterizing our technology-mediated posthuman era.
Authors:Niharika Mathur, Hasibur Rahman, Smit Desai
Abstract:
LLM-based voice assistants (VAs) increasingly support older adults aging in place, yet how an assistant's agreeableness shapes explanation perception remains underexplored. We conducted a study(N=70) examining how VA agreeableness influences older adults' perceptions of explanations across routine and emergency home scenarios. High-agreeableness assistants were perceived as more trustworthy, empathetic, and likable, but these benefits diminished in emergencies where clarity outweighed warmth. Agreeableness did not affect perceived intelligence, suggesting social tone and competence are separable dimensions. Real-time environmental explanations outperformed history-based ones, and agreeable older adults penalized low-agreeableness assistants more strongly. These findings show the need to move beyond a one-size-fits-all approach to AI explainability, while balancing personality, context, and audience.
Authors:Niharika Mathur, Hasibur Rahman, Smit Desai
Abstract:
Large Language Model-based Voice Assistants (LLM-VAs) are increasingly deployed in assistive settings for older adults, yet little is known about how an agent's personality shapes user perceptions of its explanations. This paper presents a mixed factorial experiment (N=140) examining how agreeableness and extraversion in an LLM-VA ("Robin") influence older adults' perceptions across seven measures: empathy, likeability, trust, reliance, satisfaction, intention to adopt, and perceived intelligence. Results reveal that high agreeableness drove stronger empathy perceptions, while low agreeableness consistently penalized likeability. Importantly, perceived intelligence remained unaffected by personality, suggesting that personality shapes sociability without altering competence perceptions. Real-time environmental explanations outperformed conversational history explanations on five measures, with advantages concentrated in emergency contexts. Notably, highly agreeable participants were especially critical of low-agreeableness agents, revealing a user-agent personality congruence effect. These findings offer design implications for personality-aware, context-sensitive LLM-VAs in assistive settings.
Authors:Tirthankar Halder, Argha Sen, Swadhin Pradhan, Rijurekha Sen, Sandip Chakraborty
Abstract:
Occupational exposure to airborne particulate matter (PM) poses a severe health risk in open industrial workspaces such as stonecutting yards. Conventional monitoring solutions such as wearable PM sensors and camera-based tracking are impractical due to discomfort, maintenance issues, and privacy concerns. We present MIRO, a privacy-preserving framework that integrates continuous PM sensing with a multi-radar millimeter-wave (mmWave) re-identification (re-ID) backbone. A distributed network of PM sensors captures localized pollutant concentrations, while spatially overlapping mmWave radars track and re-associate workers across viewpoints without relying on visual cues. To ensure identity consistency across radars, we introduce a GAN-based view adaptation network that compensates for azimuthal distortions in range-Doppler (RD) signatures, combined with correlation-based cross-radar matching. In controlled laboratory experiments, our system achieves a re-ID F1-score of 90.4% and a mean Structural Similarity Index Measure (SSIM) of 0.70 for view adaptation accuracy. Field trials in rural stone-cutting yards further validate the system's robustness, demonstrating reliable worker-specific PM exposure estimation.
Authors:Bonnie Rushing, William Hersch, Shouhuai Xu
Abstract:
Cognitive warfare has emerged as a central feature of modern conflict, yet it remains inconsistently defined and difficult to evaluate. Existing approaches often treat cognitive operations as a subset of information operations, limiting the ability to assess cognitive attacker-defender interactions or determine when advantage has been achieved. This article proposes a unified definition of cognitive warfare, introduces an interaction framework grounded in the OODA loop, and identifies measurable attributes associated with cognitive superiority. To illustrate the use of the framework, a notional case study demonstrates how these concepts can be applied to assess cognitive attacks and defenses in a contested environment. Thus, the framework provides joint force leaders and analysts with a practical foundation for understanding, comparing, and evaluating cognitive warfare campaigns.
Authors:Hideaki Yamamoto, Yifan Li, Wakako Yukita, Tomoyuki Yokota, Takao Someya, Ryo Takahashi, Yoshihiro Kawahara
Abstract:
Near Field Communication (NFC) is a promising technology for ultra-low-power wearables, yet its short communication range limits its use to narrow-area, point-to-point interactions. We propose a body-scale NFC networking system that extends NFC coverage around the body, enabling surface-to-multipoint communication with distributed NFC sensor tags. This demonstration introduces two key technologies: Meander NFC and picoRing NFC. First, Meander NFC expands a clothing-based NFC networking area up to body scale while enabling a stable readout of small NFC tags occupying 1% of the coverage area. Meander NFC uses a meander coil which creates a spatially confined inductive field along the textile surface, ensuring robust coupling with small tags while preventing undesired electromagnetic body coupling. Second, picoRing NFC solves the weak inductive coupling caused by distance and size mismatches. By leveraging middle-range NFC and coil optimization, picoRing NFC extends the communication range to connect these disparate nodes between the ring and wristband.
Authors:Yotam Sechayk, Hennes Rave, Max Rädler, Mark Colley, Zhongyi Zhou, Ariel Shamir, Takeo Igarashi
Abstract:
Despite widespread use, charts remain largely inaccessible for Low-Vision Individuals (LVI). Reading charts requires viewing data points within a global context, which is difficult for LVI who may rely on magnification or experience a partial field of vision. We aim to improve exploration by providing visual access to critical context. To inform this, we conducted a formative study with five LVI. We identified four fundamental contextual elements common across chart types: axes, legend, grid lines, and the overview. We propose two pointer-based interaction methods to provide this context: Dynamic Context, a novel focus+context interaction, and Mini-map, which adapts overview+detail principles for LVI. In a study with N=22 LVI, we compared both methods and evaluated their integration to current tools. Our results show that Dynamic Context had significant positive impact on access, usability, and effort reduction; however, worsened visual load. Mini-map strengthened spatial understanding, but was less preferred for this task. We offer design insights to guide the development of future systems that support LVI with visual context while balancing visual load.
Authors:Chenyang Zhang, Tianjian Wei, Haoyang Yang, Mar Gonzalez-Franco, Yalong Yang, Eric J Gonzalez
Abstract:
Most XR web browsers still present webpages as a single floating window, carrying over desktop design assumptions into immersive space. We explore an alternative by breaking the browser window and distributing a webpage into spatial UI chunks within a mixed-reality workspace. We present Break-the-Window (BTW), an exploratory prototype that spatially decomposes live, fully functional webpages into movable panels supporting mid-air and surface-attached placement, as well as direct touch and ray-based interaction. Through a formative study with XR practitioners and an exploratory qualitative study with 15 participants, we observed how spatial decomposition supports distributed attention and spatial meaning-making, while also surfacing challenges around coordination effort, interaction precision, and the lack of shared spatial UI conventions. This work invites discussion on how web interfaces might be reimagined for spatial computing beyond the single-window paradigm.
Authors:Zahra Zahedi, Xinyue Hu, Shashank Mehrotra, Mark Steyvers, Kumar Akash
Abstract:
We propose a decision-theoretic framework in which a robot strategically can shape inferred human's prosocial state during repeated interactions. Modeling the human's prosociality as a latent state that evolves over time, the robot learns to infer and influence this state through its own actions, including helping and signaling. We formalize this as a latent-state POMDP with limited observations and learn the transition and observation dynamics using expectation maximization. The resulting belief-based policy balances task and social objectives, selecting actions that maximize long-term cooperative outcomes. We evaluate the model using data from user studies and show that the learned policy outperforms baseline strategies in both team performance and increasing observed human cooperative behavior.
Authors:Haiyue Yuan, Shujun Li, Fatima Gillani, Xiao Ma
Abstract:
People's attitudes towards personal data sharing have been extensively researched, however, limited research studied their evolving nature in across different stages of a leisure trip. This paper addresses this gap by exploring how leisure travellers' attitudes towards sharing personal data change before, during and after travel. Analysing data from an online survey with 318 participants, we found that participants' privacy attitudes towards sharing different personal data vary based on sharing purposes and travel stages. Interestingly, participants exhibited a more relaxed attitude towards sharing commonly sensitive personal data (e.g., name, gender) compared to other types of personal data. This is likely because sharing such data for travel bookings has become essential and widely accepted among travellers when using booking sites, which is in line with previous work stating that information easily obtainable is typically not seen as highly confidential. Moreover, despite participants' self-reported frequent use of social media platforms, content sharing is minimal on TikTok, YouTube, Snapchat, Pinterest, and Twitter. Conversely, Facebook and Instagram were more common for travel-related content sharing. This pattern remains consistent across the three stages of travel, suggesting that the stage of travel does not significantly influence how people share on social media platforms, which has been overlooked in past studies. Furthermore, we discovered that a participant's gender, previous travel frequency, and country of residence can influence their perceptions of personal data sharing at different travel stages, confirming the complex and context-dependent nature of privacy perception and attitudes. Based on the findings observed from this study, we further discuss implications and potential contributions of our work to the privacy and security community in general.
Authors:Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
Abstract:
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
Authors:Jielin Feng, Xinwu Ye, Qianhui Li, Verena Ingrid Prantl, Jun-Hsiang Yao, Yuheng Zhao, Yun Wang, Siming Chen
Abstract:
Storytelling infographics are a powerful medium for communicating data-driven stories through visual presentation. However, existing authoring tools lack support for maintaining story consistency and aligning with users' story goals throughout the design process. To address this gap, we conducted formative interviews and a quantitative analysis to identify design needs and common story-informed layout patterns in infographics. Based on these insights, we propose a narrative-centric workflow for infographic creation consisting of three phases: story construction, visual encoding, and spatial composition. Building on this workflow, we developed InfoAlign, a human-AI co-creation system that transforms long or unstructured text into stories, recommends semantically aligned visual designs, and generates layout blueprints. Users can intervene and refine the design at any stage, ensuring their intent is preserved and the infographic creation process remains transparent. Evaluations show that InfoAlign preserves story coherence across authoring stages and effectively supports human-AI co-creation for storytelling infographic design.
Authors:Masahiro Yoshida, Bingxuan Li, Songyan Zhao, Qinyi Zhou, Shiwei Hu, Xiang Anthony Chen, Nanyun Peng
Abstract:
We propose CoLyricist, an AI-assisted lyric writing tool designed to support the typical workflows of experienced lyricists and enhance their creative efficiency. While lyricists have unique processes, many follow common stages. Tools that fail to accommodate these stages challenge integration into creative practices. Existing research and tools lack sufficient understanding of these songwriting stages and their associated challenges, resulting in ineffective designs. Through a formative study involving semi-structured interviews with 10 experienced lyricists, we identified four key stages: Theme Setting, Ideation, Drafting Lyrics, and Melody Fitting. CoLyricist addresses these needs by incorporating tailored AI-driven support for each stage, optimizing the lyric writing process to be more seamless and efficient. To examine whether this workflow-aligned design also benefits those without prior experience, we conducted a user study with 16 participants, including both experienced and novice lyricists. Results showed that CoLyricist enhances the songwriting experience across skill levels. Novice users especially appreciated the Melody-Fitting feature, while experienced users valued the Ideation support.
Authors:Miriam Remshard, Yara Kyrychenko, Sander van der Linden, Matthew H. Goldberg, Anthony Leiserowitz, Elena Savoia, Jon Roozenbeek
Abstract:
Mitigating climate change requires behaviour change. However, even climate-concerned individuals often hold misperceptions about which actions most reduce carbon emissions. We recruited 1201 climate-concerned individuals to examine whether discussing climate actions with a large language model (LLM) equipped with climate knowledge and prompted to provide personalised responses would foster more accurate perceptions of the impacts of climate actions and increase willingness to adopt feasible, high-impact behaviours. We compared this to having participants run a web search, have a conversation with an unspecialised LLM, and no intervention. The personalised climate LLM was the only condition that led to increased knowledge about the impacts of climate actions and greater intentions to adopt impactful behaviours. While the personalised climate LLM did not outperform a web search in improving understanding of climate action impacts, the ability of LLMs to deliver personalised, actionable guidance may make them more effective at motivating impactful pro-climate behaviour change.
Authors:Griffin Pitts, Sanaz Motamedi
Abstract:
Conversational AI tools have been rapidly adopted by students and are becoming part of their learning routines. To understand what drives this adoption, we draw on the Technology Acceptance Model (TAM) and examine how perceived usefulness and perceived ease of use relate to students' behavioral intention to use conversational AI that generates responses for learning tasks. We extend TAM by incorporating trust, perceived enjoyment, and subjective norms as additional factors that capture social and affective influences and uncertainty around AI outputs. Using partial least squares structural equation modeling, we find perceived usefulness remains the strongest predictor of students' intention to use conversational AI. However, perceived ease of use does not exert a significant direct effect on behavioral intention once other factors are considered, operating instead indirectly through perceived usefulness. Trust and subjective norms significantly influence perceptions of usefulness, while perceived enjoyment exerts both a direct and indirect effect on usage intentions. These findings suggest that adoption decisions for conversational AI systems are influenced less by effort-related considerations and more by confidence in system outputs, affective engagement, and social context. Future research is needed to further examine how these acceptance relationships generalize across different conversational systems and usage contexts.
Authors:Poorna Talkad Sukumar, Maurizio Porfiri, Oded Nov
Abstract:
Visualizations often encode multivariate data by mapping attributes to distinct visual channels such as color, size, or shape. The effectiveness of these encodings depends on separability--the extent to which channels can be perceived independently. Yet systematic evidence for separability, especially in map-based contexts, is lacking. We present a crowdsourced experiment that evaluates the separability of four channel pairs--color (ordered) x shape, color (ordered) x size, size x shape, and size x orientation--in the context of bivariate symbol maps. Both accuracy and speed analyses show that color x shape is the most separable and size x orientation the least separable, while size x color and size x shape do not differ. Separability also proved asymmetric--performance depended on which channel encoded the task-relevant variable, with color and shape outperforming size, and square shape especially difficult to discriminate. Our findings advance the empirical understanding of visual separability, with implications for multivariate map design.
Authors:Kartik Chandra, Max Kleiman-Weiner, Jonathan Ragan-Kelley, Joshua B. Tenenbaum
Abstract:
"AI psychosis" or "delusional spiraling" is an emerging phenomenon where AI chatbot users find themselves dangerously confident in outlandish beliefs after extended chatbot conversations. This phenomenon is typically attributed to AI chatbots' well-documented bias towards validating users' claims, a property often called "sycophancy." In this paper, we probe the causal link between AI sycophancy and AI-induced psychosis through modeling and simulation. We propose a simple Bayesian model of a user conversing with a chatbot, and formalize notions of sycophancy and delusional spiraling in that model. We then show that in this model, even an idealized Bayes-rational user is vulnerable to delusional spiraling, and that sycophancy plays a causal role. Furthermore, this effect persists in the face of two candidate mitigations: preventing chatbots from hallucinating false claims, and informing users of the possibility of model sycophancy. We conclude by discussing the implications of these results for model developers and policymakers concerned with mitigating the problem of delusional spiraling.
Authors:Albert Tang, Yifan Mo, Jie Li, Yue Su, Mengyuan Zhang, Sander L. Koole, Koen Hindriks, Jiahuan Pei
Abstract:
The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between-subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition-time effect on deficit-based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI-based interpretation can support attributional change by helping users recognize communication challenges as mutual.
Authors:Anirban Mukhopadhyay, Kevin Salubre, Hifza Javed, Shashank Mehrotra, Kumar Akash
Abstract:
Collaborative problem-solving under time pressure is common but difficult, as teams must generate ideas quickly, coordinate actions, and track progress. Generative AI offers new opportunities to assist, but we know little about how proactive agents affect the dynamics of real-time, co-located teamwork. We studied two forms of proactive support in digital escape rooms: a facilitator agent that offered summaries and group structures, and a peer agent that proposed ideas and answered queries. In a within-subjects study with 24 participants, we compared group performance and processes across three conditions: no AI, peer, and facilitator. Results show that the peer agent occasionally enhanced problem-solving by offering timely hints and memory support; however, it also disrupted flow, increased workload, and created over-reliance. In comparison, the facilitator agent provided light scaffolding but had a limited impact on outcomes. We provide design considerations for proactive generative AI agents based on our findings.
Authors:Yi Shan, Yixuan He, Zekai Shao, Kai Xu, Siming Chen
Abstract:
High-quality exploratory data analysis (EDA) is essential in the data science pipeline, but remains highly dependent on analysts' expertise and effort. While recent LLM-based approaches partially reduce this burden, they struggle to generate effective analysis plans and appropriate insights and visualizations when user intent is abstract. Meanwhile, a vast collection of analysis notebooks produced across platforms and organizations contains rich analytical knowledge that can potentially guide automated EDA. Retrieval-augmented generation (RAG) provides a natural way to leverage such corpora, but general methods often treat notebooks as static documents and fail to fully exploit their potential knowledge for automating EDA. To address these limitations, we propose NotebookRAG, a method that takes user intent, datasets, and existing notebooks as input to retrieve, enhance, and reuse relevant notebook content for automated EDA generation. For retrieval, we transform code cells into context-enriched executable components, which improve retrieval quality and enable rerun with new data to generate updated visualizations and reliable insights. For generation, an agent leverages enhanced retrieval content to construct effective EDA plans, derive insights, and produce appropriate visualizations. Evidence from a user study with 24 participants confirms the superiority of our method in producing high-quality and intent-aligned EDA notebooks.
Authors:Fangjie Li, Nicholas Kavoussi, Charan Mohan, Matthieu Chabanas, Jie Ying Wu
Abstract:
Purpose: Kidney ureteroscopic navigation is challenging with a steep learning curve. However, current clinical training has major deficiencies, as it requires one-on-one feedback from experts and occurs in the operating room (OR). Therefore, there is a need for a phantom training system with automated feedback to greatly \revision{expand} training opportunities. Methods: We propose a novel, purely ureteroscope video-based scope localization framework that automatically identifies calyces missed by the trainee in a phantom kidney exploration. We use a slow, thorough, prior exploration video of the kidney to generate a reference reconstruction. Then, this reference reconstruction can be used to localize any exploration video of the same phantom. Results: In 15 exploration videos, a total of 69 out of 74 calyces were correctly classified. We achieve < 4mm camera pose localization error. Given the reference reconstruction, the system takes 10 minutes to generate the results for a typical exploration (1-2 minute long). Conclusion: We demonstrate a novel camera localization framework that can provide accurate and automatic feedback for kidney phantom explorations. We show its ability as a valid tool that enables out-of-OR training without requiring supervision from an expert.
Authors:Tianyu Song, Feng Li, Felix Pabst, Miruna-Alexandra Gafencu Yuan Bi, Ulrich Eck, Nassir Navab
Abstract:
Purpose: This study compares two augmented reality (AR)-guided imaging workflows, one based on ultrasound shape completion and the other on cone-beam computed tomography (CBCT), for planning and executing lumbar needle interventions. The aim is to assess how imaging modality influences user performance, usability, and trust during AR-assisted spinal procedures. Methods: Both imaging systems were integrated into an AR framework, enabling in situ visualization and trajectory guidance. The ultrasound-based workflow combined AR-guided robotic scanning, probabilistic shape completion, and AR visualization. The CBCT-based workflow used AR-assisted scan volume planning, CBCT acquisition, and AR visualization. A between-subject user study was conducted and evaluated in two phases: (1) planning and image acquisition, and (2) needle insertion. Results: Planning time was significantly shorter with the CBCT-based workflow, while SUS, SEQ, and NASA-TLX were comparable between modalities. In the needle insertion phase, the CBCT-based workflow yielded marginally faster insertion times, lower placement error, and better subjective ratings with higher Trust. The ultrasound-based workflow achieved adequate accuracy for facet joint insertion, but showed larger errors for lumbar puncture, where reconstructions depended more heavily on shape completion. Conclusion: The findings indicate that both AR-guided imaging pipelines are viable for spinal intervention support. CBCT-based AR offers advantages in efficiency, precision, usability, and user confidence during insertion, whereas ultrasound-based AR provides adaptive, radiation-free imaging but is limited by shape completion in deeper spinal regions. These complementary characteristics motivate hybrid AR guidance that uses CBCT for global anatomical context and planning, augmented by ultrasound for adaptive intraoperative updates.
Authors:Mersedeh Sadeghi, Simon Scholz, Max Unterbusch, Andreas Vogelsang
Abstract:
Explanations are essential for helping users interpret and trust autonomous smart-home decisions, yet evaluating their quality and impact remains methodologically difficult in this domain. V-SHiNE addresses this gap: a browser-based smarthome simulation framework for scalable and realistic assessment of explanations. It allows researchers to configure environments, simulate behaviors, and plug in custom explanation engines, with flexible delivery modes and rich interaction logging. A study with 159 participants demonstrates its feasibility. V-SHiNE provides a lightweight, reproducible platform for advancing user-centered evaluation of explainable intelligent systems
Authors:Matthew Prock, Ziv Epstein, Hope Schroeder, Amy Smith, Cassandra Lee, Vana Goblot, Farnaz Jahanbakhsh
Abstract:
While generative AI tools are increasingly adopted for creative and analytical tasks, their role in interpretive practices, where meaning is subjective, plural, and non-causal, remains poorly understood. This paper examines AI-assisted tarot reading, a divinatory practice in which users pose a query, draw cards through a randomized process, and ask AI systems to interpret the resulting symbols. Drawing on interviews with tarot practitioners and Hartmut Rosa's Theory of Resonance, we investigate how users seek, negotiate, and evaluate resonant interpretations in a context where no causal relationship exists between the query and the data being interpreted. We identify distinct ways practitioners incorporate AI into their interpretive workflows, including using AI to navigate uncertainty and self-doubt, explore alternative perspectives, and streamline or extend existing divinatory practices. Based on these findings, we offer design recommendations for AI systems that support interpretive meaning-making without collapsing ambiguity or foreclosing user agency.
Authors:Liuchuan Yu, Yongqi Zhang, Lap-Fai Yu
Abstract:
Large Multimodal Models (LMMs) have shown strong potential for assisting users in tasks, such as programming, content creation, and information access, yet their interaction remains largely limited to traditional interfaces such as desktops and smartphones. Meanwhile, advances in mixed reality (MR) hardware have enabled applications that extend beyond entertainment and into everyday use. However, most existing MR systems rely primarily on manual input (e.g., hand gestures or controllers) and provide limited intelligent assistance due to the lack of integration with large-scale AI models. We present Reality Copilot, a voice-first human-AI assistant for mixed reality that leverages LMMs to enable natural speech-based interaction. The system supports contextual understanding of physical environments, realistic 3D content generation, and real-time information retrieval. In addition to in-headset interaction, Reality Copilot facilitates cross-platform workflows by generating context-aware textual content and exporting generated assets. This work explores the design space of LMM-powered human-AI collaboration in mixed reality.
Authors:Gabriela Molina León, Benjamin Bach, Matheus Valentim, Niklas Elmqvist
Abstract:
This paper presents a theoretical model for interactive visualization literacy to describe how people use interactive data visualizations and systems. Literacies have become an important concept in describing modern life skills, with visualization literacy generally referring to the use and interpretation of data visualizations. However, prior work on visualization literacy overlooks interaction and its associated challenges, despite it being an intrinsic aspect of using visualizations. Based on existing theoretical frameworks, we derive a two-dimensional model that combines four well-known literacies with five novel ones. We found evidence for our model through analyzing existing visualization systems as well as through observations from an exploratory study involving such systems. We conclude by outlining steps towards measuring, evaluating, designing for, and teaching interactive visualization literacy.
Authors:Xuechen Li, Shuai Zhang, Nan Cao, Qing Chen
Abstract:
While the proliferation of foundation models has significantly boosted individual productivity, it also introduces a potential challenge: the homogenization of creative content. In response, we revisit Design-by-Analogy (DbA), a cognitively grounded approach that fosters novel solutions by mapping inspiration across domains. However, prevailing perspectives often restrict DbA to early ideation or specific data modalities, while reducing AI-driven design to simplified input-output pipelines. Such conceptual limitations inadvertently foster widespread design fixation. To address this, we expand the understanding of DbA by embedding it into the entire creative process, thereby demonstrating its capacity to mitigate such fixation. Through a systematic review of 85 studies, we identify six forms of representation and classify techniques across seven stages of the creative process. We further discuss three major application domains: creative industries, intelligent manufacturing, and education and services, demonstrating DbA's practical relevance. Building on this synthesis, we frame DbA as a mediating technology for human-AI collaboration and outline the potential opportunities and inherent risks for advancing creativity support in HCI and design research.
Authors:Cameron R. Jones, Agnese Lombardi, Kyle Mahowald, Benjamin K. Bergen
Abstract:
Humans align to one another in conversation -- adopting shared conventions that ease communication. We test whether LLMs form the same kinds of conventions in a multimodal communication game. Both humans and LLMs display evidence of convention-formation (increasing the accuracy and consistency of their turns while decreasing their length) when communicating in same-type dyads (humans with humans, AI with AI). However, heterogenous human-AI pairs fail -- suggesting differences in communicative tendencies. In Experiment 2, we ask whether LLMs can be induced to behave more like human conversants, by prompting them to produce superficially humanlike behavior. While the length of their messages matches that of human pairs, accuracy and lexical overlap in human-LLM pairs continues to lag behind that of both human-human and AI-AI pairs. These results suggest that conversational alignment requires more than just the ability to mimic previous interactions, but also shared interpretative biases toward the meanings that are conveyed.
Authors:Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani
Abstract:
Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain - spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS and EMER demonstrate that our method significantly improves the performance of the reference baseline models with 6-19% of relative performance gains in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models and benchmark will be released at https://avere-iclr.github.io.
Authors:Lucile Favero, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver
Abstract:
Artificial intelligence (AI) is rapidly being integrated into educational contexts, promising personalized support and increased efficiency. However, growing evidence suggests that the uncritical adoption of AI may produce unintended harms that extend beyond individual learning outcomes to affect broader societal goals. This paper examines the societal implications of AI in education through an integrative framework with four interrelated dimensions: cognition, agency, emotional well-being, and ethics. Drawing on research from education, cognitive science, psychology, and ethics, we synthesize existing evidence to show how AI-driven cognitive offloading, diminished learner agency, emotional disengagement, and surveillance-oriented practices can mutually reinforce one another. We argue that these dynamics risk undermining critical thinking, intellectual autonomy, emotional resilience, and trust, capacities that are foundational both for effective learning and also for democratic participation and informed civic engagement. Moreover, AI's impact is contingent on design and governance: pedagogically aligned, ethically grounded, and human-centered AI systems can scaffold effortful reasoning, support learner agency, and preserve meaningful social interaction. By integrating fragmented strands of prior research into a unified framework, this paper advances the discourse on responsible AI in education and offers actionable implications for educators, designers, and institutions. Ultimately, the paper contends that the central challenge is not whether AI should be used in education, but how it can be designed and governed to support learning while safeguarding the social and civic purposes of education.
Authors:Hyunsung Cho, Xuejing Luo, Byungjoo Lee, David Lindlbauer, Antti Oulasvirta
Abstract:
Locating a target based on auditory and visual cues$\unicode{x2013}$such as finding a car in a crowded parking lot or identifying a speaker in a virtual meeting$\unicode{x2013}$requires balancing effort, time, and accuracy under uncertainty. Existing models of audiovisual search often treat perception and action in isolation, overlooking how people adaptively coordinate movement and sensory strategies. We present Sensonaut, a computational model of embodied audiovisual search. The core assumption is that people deploy their body and sensory systems in ways they believe will most efficiently improve their chances of locating a target, trading off time and effort under perceptual constraints. Our model formulates this as a resource-rational decision-making problem under partial observability. We validate the model against newly collected human data, showing that it reproduces both adaptive scaling of search time and effort under task complexity, occlusion, and distraction, and characteristic human errors. Our simulation of human-like resource-rational search informs the design of audiovisual interfaces that minimize search cost and cognitive load.
Authors:Prasenjit Karmakar, Manjeet Yadav, Swayanshu Rout, Swadhin Pradhan, Sandip Chakraborty
Abstract:
Indoor carbon dioxide (CO2) can rapidly accumulate to form invisible pollution hotspots, posing significant health risks due to its odorless and colorless nature. Despite growing interest in wearable or stationary sensors for pollutant detection, effectively visualizing CO2 levels and engaging individuals remains an ongoing challenge. In this paper, we develop a portable wrist-sized pollution sensor that detects CO2 in real time at any indoor location and reveals CO2 bubbles by highlighting sudden spikes. In order to promote better ventilation habits and user awareness, we also develop a smartphone-based augmented reality (AR) game for users to locate and disperse these high-CO2 zones. A user study with 35 participants demonstrated increased engagement and heightened understanding of CO2's health impacts. Our system's usability evaluations yielded a median score of 1.88, indicating its strong practicality.
Authors:Yi Fei Cheng, Jarod Bloch, Alexander Wang, Andrea Bianchi, Anusha Withana, Anhong Guo, Laurie M. Heller, David Lindlbauer
Abstract:
Embodiment can enhance conversational agents, such as increasing their perceived presence. This is typically achieved through visual representations of a virtual body; however, visual modalities are not always available, such as when users interact with agents using headphones or display-less glasses. In this work, we explore auditory embodiment. By introducing auditory cues of bodily presence - through spatially localized voice and situated Foley audio from environmental interactions - we investigate how audio alone can convey embodiment and influence perceptions of a conversational agent. We conducted a 2 (spatialization: monaural vs. spatialized) x 2 (Foley: none vs. Foley) within-subjects study, where participants (n=24) engaged in conversations with agents. Our results show that spatialization and Foley increase co-presence, but reduce users' perceptions of the agent's attention and other social attributes.
Authors:Shashiwadana Nirmania, Garima Sharma, Hourieh Khalajzadeh, Mojtaba Shahin
Abstract:
In recent years, mobile applications have become indispensable tools for managing various aspects of life. From enhancing productivity to providing personalized entertainment, mobile apps have revolutionized people's daily routines. Despite this rapid growth and popularity, gaps remain in how these apps address the needs of users from different age groups. Users of varying ages face distinct challenges when interacting with mobile apps, from younger users dealing with inappropriate content to older users having difficulty with usability due to age-related vision and cognition impairments. Although there have been initiatives to create age-inclusive apps, a limited understanding of user perspectives on age-related issues may hinder developers from recognizing specific challenges and implementing effective solutions. In this study, we explore age discussions in app reviews to gain insights into how mobile apps should cater to users across different age groups.We manually curated a dataset of 4,163 app reviews from the Google Play Store and identified 1,429 age-related reviews and 2,734 non-age-related reviews. We employed eight machine learning, deep learning, and large language models to automatically detect age discussions, with RoBERTa performing the best, achieving a precision of 92.46%. Additionally, a qualitative analysis of the 1,429 age-related reviews uncovers six dominant themes reflecting user concerns.
Authors:Zhuoyan Li, Aditya Bansal, Jinzhao Li, Shishuang He, Zhuoran Lu, Mutian Zhang, Qin Liu, Yiwei Yang, Swati Jain, Ming Yin, Yunyao Li
Abstract:
Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance. However, current approaches typically assign the LLM as a black-box optimizer, responsible for both proposing and selecting operations based solely on its internal heuristics, which often lack calibrated estimations of operation utility and consequently lead to repeated exploration of low-yield operations without a principled strategy for prioritizing promising directions. In this paper, we propose a human-LLM collaborative feature engineering framework for tabular learning. We begin by decoupling the transformation operation proposal and selection processes, where LLMs are used solely to generate operation candidates, while the selection is guided by explicitly modeling the utility and uncertainty of each proposed operation. Since accurate utility estimation can be difficult especially in the early rounds of feature engineering, we design a mechanism within the framework that selectively elicits and incorporates human expert preference feedback, comparing which operations are more promising, into the selection process to help identify more effective operations. Our evaluations on both the synthetic study and the real user study demonstrate that the proposed framework improves feature engineering performance across a variety of tabular datasets and reduces users' cognitive load during the feature engineering process.
Authors:Tawfiq Ammari, Meilun Chen, S M Mehedi Zaman, Kiran Garimella
Abstract:
How do students develop AI literacy through everyday practice rather than formal instruction? While normative AI literacy frameworks proliferate, empirical understanding of how students actually learn to work with generative AI remains limited. This study analyzes 10,536 ChatGPT messages from 36 undergraduates over one academic year, revealing five use genres -- academic workhorse, emotional companion, metacognitive partner, repair and negotiation, and trust calibration -- that constitute distinct configurations of student-AI learning. Drawing on domestication theory and emerging frameworks for AI literacy, we demonstrate that functional AI competence emerges through ongoing relational negotiation rather than one-time adoption. Students develop sophisticated genre portfolios, strategically matching interaction patterns to learning needs while exercising critical judgment about AI limitations. Notably, repair work during AI breakdowns produces substantial learning about AI capabilities, developing what we term "repair literacy" -- a crucial but underexplored dimension of AI competence. Our findings offer educators empirically grounded insights into how students actually learn to work with generative AI, with implications for AI literacy pedagogy, responsible AI integration, and the design of AI-enabled learning environments that support student agency.
Authors:Sizhe Cheng, Songheng Zhang, Dong Ma, Yong Wang
Abstract:
With the prevalence of mobile data visualizations, there have been growing concerns about their privacy risks, especially shoulder surfing attacks. Inspired by prior research on visual illusion, we propose BAIT, a novel approach to automatically generate privacy-preserving visualizations by stacking a decoy visualization over a given visualization. It allows visualization owners at proximity to clearly discern the original visualization and makes shoulder surfers at a distance be misled by the decoy visualization, by adjusting different visual channels of a decoy visualization (e.g., shape, position, tilt, size, color and spatial frequency). We explicitly model human perception effect at different viewing distances to optimize the decoy visualization design. Privacy-preserving examples and two in-depth user studies demonstrate the effectiveness of BAIT in both controlled lab study and real-world scenarios.
Authors:Yimeng Wang, Liabette Escamilla, Yinzhou Wang, Bianca R. Augustine, Yixuan Zhang
Abstract:
Therapeutic homework (i.e., tasks assigned by therapists for clients to complete between sessions) is essential for effective psychotherapy, yet therapists often interpret fragmented client logs, assessments, and reflections within limited preparation time. Our formative study with licensed therapists revealed three critical design requirements: support for interpreting unstructured client self-reports, customization aligned with clinical objectives, and seamless integration across multiple data sources. We then designed and developed TheraTrack, a customizable, therapist-facing tool that integrates multi-dimensional data and leverages large language models to generate traceable summaries and support natural-language queries, to streamline between-session homework tracking. Our pilot study with 14 therapists showed that TheraTrack reduced their cognitive load, enabled verification through direct navigation from AI summaries to original data entries, and was adapted differently for private analysis compared to in-session use, with dependence varying based on therapist experience and usage duration. We also discuss design implications for clinician-centered AI for mental health.
Authors:Ruishi Zou, Shiyu Xu, Margaret E Morris, Jihan Ryu, Timothy D. Becker, Nicholas Allen, Anne Marie Albano, Randy Auerbach, Dan Adler, Varun Mishra, Lace Padilla, Dakuo Wang, Ryan Sultan, Xuhai "Orson" Xu
Abstract:
Advances in data collection enable the capture of rich patient-generated data: from passive sensing (e.g., wearables and smartphones) to active self-reports (e.g., cross-sectional surveys and ecological momentary assessments). Although prior research has demonstrated the utility of patient-generated data in mental healthcare, significant challenges remain in effectively presenting these data streams along with clinical data (e.g., clinical notes) for clinical decision-making. Through co-design sessions with five clinicians, we propose MIND, a large language model-powered dashboard designed to present clinically relevant multimodal data insights for mental healthcare. MIND presents multimodal insights through narrative text, complemented by charts communicating underlying data. Our user study (N=16) demonstrates that clinicians perceive MIND as a significant improvement over baseline methods, reporting improved performance to reveal hidden and clinically relevant data insights (p<.001) and support their decision-making (p=.004). Grounded in the study results, we discuss future research opportunities to integrate data narratives in broader clinical practices.
Authors:Jaeyoung Moon, Youjin Choi, Yucheon Park, David Melhart, Georgios N. Yannakakis, Kyung-Joong Kim
Abstract:
Self-annotation is the gold standard for collecting affective state labels in affective computing. Existing methods typically rely on full annotation, requiring users to continuously label affective states across entire sessions. While this process yields fine-grained data, it is time-consuming, cognitively demanding, and prone to fatigue and errors. To address these issues, we present PREFAB, a low-budget retrospective self-annotation method that targets affective inflection regions rather than full annotation. Grounded in the peak-end rule and ordinal representations of emotion, PREFAB employs a preference-learning model to detect relative affective changes, directing annotators to label only selected segments while interpolating the remainder of the stimulus. We further introduce a preview mechanism that provides brief contextual cues to assist annotation. We evaluate PREFAB through a technical performance study and a 25-participant user study. Results show that PREFAB outperforms baselines in modeling affective inflections while mitigating workload (and conditionally mitigating temporal burden). Importantly PREFAB improves annotator confidence without degrading annotation quality.
Authors:Yumou Wei, John Carney, John Stamper, Nancy Belmont
Abstract:
Most privacy regulations function as a passive defensive shield that users must wield themselves. Users are incessantly asked to "opt-in" or "opt-out" of data collection, forced to make defensive decisions whose consequences are increasingly difficult to predict. Viewed through the Johari Window, a psychological framework of self-awareness based on what is known and unknown to self and others, current policies require users to manage the Open Self and shield the Hidden Self through notice and consent. However, as organizations increasingly use AI to make inferences, the rapid expansion of Blind Self, attributes known to algorithms but unknown to the user, emerges as a critical challenge. We illustrate how current regulations fall short because they focus on data collection rather than inference and leave this blind spot unguarded. Building on the theory of Contextual Integrity, we propose a paradigm shift from defensive privacy management to proactive privacy advocacy. We argue for the necessity of personal advocacy agents capable of operationalizing social norms to harness the power of AI inference. By illuminating the hidden inferences that users can strategically leverage or suppress, these agents not only restrain the growth of Blind Self but also mine it for value. By transforming the Unknown Self into a personal asset for users, we can foster a flow of personal information that is equitable, transparent, and individually beneficial in the age of AI.
Authors:Tyler Reinmund, Lars Kunze, Marina Jirotka
Abstract:
Sociotechnical challenges of machine learning in healthcare and social welfare are mismatches between how a machine learning tool functions and the structure of care practices. While prior research has documented many such issues, existing accounts often attribute them either to designers' limited social understanding or to inherent technical constraints, offering limited support for systematic description and comparison across settings. In this paper, we present a framework for conceptualizing sociotechnical challenges of machine learning grounded in qualitative fieldwork, a review of longitudinal deployment studies, and co-design workshops with healthcare and social welfare practitioners. The framework comprises (1) a categorization of eleven sociotechnical challenges organized along an ML-enabled care pathway, and (2) a process-oriented account of the conditions through which these challenges emerge across design and use. By providing a parsimonious vocabulary and an explanatory lens focused on practice, this work supports more precise analysis of how machine learning tools function and malfunction within real-world care delivery.
Authors:Andrew Stratton, Phani Teja Singamaneni, Pranav Goyal, Rachid Alami, Christoforos Mavrogiannis
Abstract:
Motivated by the vision of integrating mobile robots closer to humans in warehouses, hospitals, manufacturing plants, and the home, we focus on robot navigation in dynamic and spatially constrained environments. Ensuring human safety, comfort, and efficiency in such settings requires that robots are endowed with a model of how humans move around them. Human motion prediction around robots is especially challenging due to the stochasticity of human behavior, differences in user preferences, and data scarcity. In this work, we perform a methodical investigation of the effects of human motion prediction quality on robot navigation performance, as well as human productivity and impressions. We design a scenario involving robot navigation among two human subjects in a constrained workspace and instantiate it in a user study ($N=80$) involving two different robot platforms, conducted across two sites from different world regions. Key findings include evidence that: 1) the widely adopted average displacement error is not a reliable predictor of robot navigation performance and human impressions; 2) the common assumption of human cooperation breaks down in constrained environments, with users often not reciprocating robot cooperation, and causing performance degradations; 3) more efficient robot navigation often comes at the expense of human efficiency and comfort.
Authors:Donghuo Zeng, Roberto Legaspi, Kazushi Ikeda
Abstract:
Effective persuasive dialogue agents adapt their strategies to individual users, accounting for the evolution of their psychological states and intentions throughout conversations. We present a personality-aware reinforcement learning approach comprising three main modules: (1) a Strategy-Oriented Interaction Framework, which serves as an agenda-based strategy controller that selects strategy-level actions and generate responses via Maximal Marginal Relevance (MMR) retrieval to ensure contextual relevance, diversity, and scalable data generation; (2) Personality-Aware User Representation Learning, which produces an 81-dimensional mixed-type embedding predicted at each turn from recent exchanges and appended to the reinforcement learning state; and (3) a Dueling Double DQN (D3QN) model and Reward Prediction, in which the policy is conditioned on dialogue history and turn-level personality estimates and trained using a composite reward incorporating agreement intent, donation amount, and changeof-mind penalties. We use an agenda-based LLM simulation pipeline to generate diverse interactions, from which personality estimation is inferred from the generated utterances. Experiments on the PersuasionForGood (P4G) dataset augmented with simulated dialogues reveal three main findings: (i) turn-level personality conditioning improves policy adaptability and cumulative persuasion rewards; (ii) LLM-driven simulation enhances generalization to unseen user behaviors; and (iii) incorporating a change-of-mind penalty reduces post-agreement retractions while slightly improving donation outcomes. These results demonstrate that structured interaction, dynamic personality estimation, and behaviorally informed rewards together yield more effective persuasive policies.
Authors:Dileepa Pitawela, Gustavo Carneiro, Hsiang-Ting Chen
Abstract:
Recent research highlights the potential of machine learning models to learn to complement (L2C) human strengths; however, generalizing this capability to unseen users remains a significant challenge. Existing L2C methods oversimplify interaction between human and AI by relying on a single, global user model that neglects individual user variability, leading to suboptimal cooperative performance. Addressing this, we introduce L2CU, a novel L2C framework for human-AI cooperative classification with unseen users. Given sparse and noisy user annotations, L2CU identifies representative annotator profiles capturing distinct labeling patterns. By matching unseen users to these profiles, L2CU leverages profile-specific models to complement the user and achieve superior joint accuracy. We evaluate L2CU on datasets (CIFAR-10N, CIFAR-10H, Fashion-MNIST-H, Chaoyang and AgNews), demonstrating its effectiveness as a model-agnostic solution for improving human-AI cooperative classification.
Authors:Yuxuan Huang, Qiao Jin, Tongyu Nie, Victoria Interrante, Evan Suma Rosenberg
Abstract:
As virtual reality (VR) becomes more widely adopted, secure and efficient text entry is an increasingly critical need. In this paper, we identify a vulnerability in a state-of-the-art secure VR text entry method and introduce a novel virtual radial keyboard designed to achieve a balance between security with usability. Keys are arranged alphabetically in a circular layout, with each key selected by controller rotation and dynamically expanding to facilitate precise selection. A randomized rotation mechanism shifts the keyboard after each keystroke, preserving relative key positions while disrupting absolute spatial mappings to protect against inference attacks. We conducted a within-subject study (N=30) comparing our method with the prior secure technique and a standard QWERTY keyboard. Results showed that the radial keyboard significantly improves resistance to keystroke prediction attacks while incurring a tradeoff in entry speed and subjective workload due to the unfamiliar non-QWERTY layout. However, both quantitative trends and qualitative feedback indicate strong potential for performance improvements with practice. We also discuss design implications, possible interface refinements, and directions for future work, including layout variations and visual enhancements.
Authors:Chaerin Yu, Chihun Choi, Sunjae Lee, Hyosu Kim, Steven Y. Ko, Young-Bae Ko, Sangeun Oh
Abstract:
The proliferation of smart home devices has increased the complexity of controlling and managing them, leading to user fatigue. In this context, large language models (LLMs) offer a promising solution by enabling natural-language interfaces for Internet of Things (IoT) control. However, existing LLM-based approaches suffer from unreliable and inefficient device control due to the non-deterministic nature of LLMs, high inference latency and cost, and limited personalization. To address these challenges, we present IoTGPT, an LLM-based smart home agent designed to execute IoT commands in a reliable, efficient, and personalized manner. Inspired by how humans manage complex tasks, IoTGPT decomposes user instructions into subtasks and memorizes them. By reusing learned subtasks, subsequent instructions can be processed more efficiently with fewer LLM calls, improving reliability and reducing both latency and cost. IoTGPT also supports fine-grained personalization by adapting individual subtasks to user preferences. Our evaluation demonstrates that IoTGPT outperforms baselines in accuracy, latency/cost, and personalization, while reducing user workload.
Authors:Leonardo Bottona, Nicolò Penzo, Bruno Lepri, Marco Guerini, Sara Tonelli
Abstract:
We present LLMberjack, a platform for creating multi-party conversations starting from existing debates, originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and enables users to construct coherent linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of the messages and speakers' descriptions. We demonstrate the platform's utility by showing how tree visualization facilitates the creation of coherent, meaningful conversation threads and how LLM support enhances output quality while reducing human effort. The tool is open-source and designed to promote transparent and reproducible workflows to create multi-party conversations, addressing a lack of resources of this type.
Authors:Jiachen Li, Reina Szeyi Chan, Akshat Choube, Xiang Zhi Tan, Elizabeth Mynatt, Varun Mishra
Abstract:
With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content - such as retrospective summaries - remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi-modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi-layer, multi-agent, insight-driven summary approach that builds from objective statistics and descriptions to enriched, context-aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. We conclude by presenting design implications for AI-generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs' sensemaking shift from simply presenting ''What'' data were collected, to explaining ''How'' is my loved one doing and ''Why''.
Authors:Tobias King, Steven Kehrberg, Michael Beigl, Tobias Röddiger
Abstract:
Translating natural-language hardware requirements into correct printed circuit board (PCB) schematics remains difficult in embedded, IoT, and wearable development. Designers must choose compatible components, interpret datasheets, add support circuitry, and expose correct interfaces before layout and prototyping can begin, while many such circuits cannot be validated through straightforward simulation. We present pcbGPT, a grounded system for generating editable KiCad schematics from natural-language specifications. pcbGPT represents circuits in a Python DSL and combines tool-augmented synthesis with component-library search, datasheet-grounded design knowledge, execution-based checking, structural and semantic validation, and an interactive web workflow that supports iterative refinement and synchronization with KiCad projects. We evaluate the system on 20 embedded schematic-generation tasks with reference implementations, required components, and interface constraints that enable automatic comparison. The best model reaches overall pass@1 of 0.90 and pass@5 of 1.00; pass@1 is 1.00 on basic and easy tasks, 0.91 on medium tasks, and 0.72 on hard tasks. These results, together with failure analysis, show that pcbGPT can already generate useful, reviewable first-draft schematics for early prototyping, but is not yet reliable enough to replace expert review.
Authors:Adnana Dragut, Raquel Lacuesta, F. Xavier Gaya-Morey, Jose M. Buades-Rubio
Abstract:
This article presents a multimodal emotion recognition module integrated into a proactive Socially Interactive Agent (SIA) powered by generative artificial intelligence. The system evaluates real-time affective states through two distinct channels: a computer vision-based facial recognition module and a semantic linguistic analysis engine. To validate the framework, an empirical study was conducted with 20 users who engaged in dynamic, unscripted dialogues with the conversational agent. The findings reveal a significant discrepancy between automated visual cues and actual internal emotional states. When interacting with the AI, users consistently exhibited a "poker face" effect, displaying serious, concentrated facial expressions even when experiencing positive emotions. Consequently, the generative AI linguistic analysis proved significantly more reliable, by contextualizing the users' verbal expressions. Furthermore, an analysis of the interaction dynamics demonstrated that SIAs can effectively elicit specific emotions by adapting conversational themes and employing structured linguistic patterns, such as empathetic or humorous language. However, the study also noted that instances of uncalibrated proactivity occasionally led to user disengagement and a perception of artificiality. Ultimately, this research highlights the necessity of refining SIAs to dynamically adapt to users' emotional evolution, relying on deep linguistic context to foster more natural, human-like interactions.
Authors:Rifat Ara Proma, Paul Rosen
Abstract:
Simplifying line charts for responsive displays typically applies a single algorithm uniformly across devices, despite the availability of multiple techniques that preserve different signal characteristics (e.g., peaks, trends, periodicity). We investigate whether users benefit from algorithmic choice when adapting charts across screen sizes. In a within-subjects study (N=30), participants simplified nine datasets under three conditions: single pre-assigned technique (C1), multiple techniques (C2), and multiple techniques with manual point selection (C3), each with control over simplification level. We found that users adapted technique selections across datasets rather than devices, leveraging dataset-level strategies rather than per-device optimization. Additionally, interaction complexity did not always increase engagement uniformly, suggesting that responsive simplification tools should balance algorithmic flexibility with progressive disclosure and strong defaults. Supplemental materials are available at https://osf.io/yjp76/?view_only=b77b5e97f0cc4f689fbf48ad0d965af3.
Authors:Yiran Du, Qian Chen, Huimin He
Abstract:
This study examines the psychological mechanisms underlying Chinese K-12 teachers' discontinuance intention toward generative AI. Drawing on the Cognition-Affect-Conation framework, the study investigates how cognitive evaluations of generative AI shape affective responses and subsequently influence behavioural intention. Survey data from 256 Chinese K-12 teachers were analysed using structural equation modelling and fuzzy-set qualitative comparative analysis. The results showed that privacy concern, algorithmic opacity, and information hallucination increased AI anxiety, which in turn strengthened discontinuance intention. Conversely, perceived intelligence, perceived personalisation, and perceived interactivity enhanced satisfaction, which reduced discontinuance intention. The configurational analysis further identified multiple pathways leading to high discontinuance intention, highlighting the combined roles of technological risks, AI anxiety, weak affordance perceptions, and low satisfaction. These findings extend research on post-adoption generative AI use in education and suggest that sustainable integration requires both reducing technological uncertainty and enhancing teachers' positive user experiences.
Authors:Rifat Mehreen Amin, Alperen Adatepe, Daniela Fernandes, Daniel Buschek, Andreas Butz
Abstract:
Conversational interfaces powered by large language models (LLMs) are widely used for ideation and analysis, yet their linear structure limits exploration of alternatives and management of long-running interactions. We present CanvasConvo, a conversational interface concept that transforms linear chat into a branching conversation tree embedded in a spatial canvas. CanvasConvo enables users to explore what-if scenarios by branching directly from conversational content, supporting parallel development of alternative directions. These branches are visualized on a canvas while remaining integrated with a familiar chat interface, allowing users to switch between linear and non-linear interaction. Features such as timeline-based navigation, automatic tagging and summarization, and context-aware controls (e.g., goals, reusable prompts) support structured interaction and continuity. We evaluated CanvasConvo in a 5-7 day field study with 24 participants. Our findings highlight how non-linear conversational structures support exploratory workflows and different interactions in LLM-based work.
Authors:Haoze Wu, Rocky Klopfenstein, Keith Farkas, Nina Narodytska
Abstract:
A fundamental limitation of Text-to-Code is that no guarantee can be obtained about the correctness of the generated code. Therefore, to ensure its correctness, the generated code still has to be reviewed, tested, and maintained by developers. However, parsing through LLM-generated code can be tedious and time-consuming, potentially negating the productivity gains promised by AI-coding tools. To address this challenge, we present Viverra, a system that automatically produces formally verified annotations alongside generated code to aid user's understanding of the generated program. Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties. It then verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers. Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions, and that these assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.
Authors:Yingtian Shi, Abivishaq Balasubramanian, Jessica Herring, Jiachen Li, Juan Macias Romero, Rosemarie Santa Gonzalez, Varun Mishra, Agata Rozga, Xiang Zhi Tan, Thomas Plötz
Abstract:
Human activity recognition (HAR) in smart homes remains challenging because many daily activities exhibit similar local sensor patterns, while minimally intrusive sensing provides sparse and ambiguous observations. As a result, methods based on short temporal or event windows often fail to capture the broader temporal and behavioral context needed for reliable activity understanding. We present TRACE (Temporal Reasoning over Context and Evidence), a contextual activity recognition framework for smart homes that integrates multi-source sensor evidence with user-specific contextual priors to improve activity interpretation. Rather than treating recognition as a local classification problem, TRACE leverages contextual reasoning to resolve ambiguities, reduce fragmented predictions, and infer more semantically specific activities. We evaluate TRACE on public benchmarks and in a deployment study conducted in our smart-home environment. Results show that TRACE improves recognition accuracy for semantically complex activities, produces more temporally coherent predictions that better align with user-specific routines, and maintains robust performance under cross-domain transfer and missing-modality conditions. These findings demonstrate the value of contextual reasoning for advancing smart-home HAR.
Authors:Yanzeng Li, Xiaoning Cao, Jialun Zhong, Jianpeng Hu, Jiangshan Tan, Ningning Liu, Feng Xiang, Shasha Han
Abstract:
Choosing suitable psychometric scales is an essential and difficult step in psychological consultation, which requires clinicians to integrate patient information, behaviors, and dynamic contextual information. Existing systems mainly use static pipelines to choose scale, or directly predict symptoms according to user inputs, limiting their ability to support dynamic assessment, risk management, and transparent decision-making. To address these limitations, we propose DySRec, a multi-agent conversational system for dynamic psychometric scale recommendation. DySRec operates as an interactive chatbot that engages users in multi-turn dialogue, models scale selection as a continuous conversational decision process, and coordinates specialized agents to maintain user context, recommend assessment scales, monitor psychological risk, and log decision trajectories. In this way, DySRec can integrate and capture heterogeneous signals, including semantic, interaction behaviors, assessment history, and content state, to dynamically update user representations and calculate scale-context compatibility score for recommending most matched scales. Moreover, DySRec incorporates a closed-loop refinement mechanism. Recommendation agent will feedback the missing or uncertain attributes and guide the conversation to elicit the targeted information. In this paper, we showcase the prototype design and architecture of DySRec, and this system has been verified in a real-world application.
Authors:Ashwin George, Lucas Elbert Suryana, Lorenzo Flipse, Bart van Arem, David A. Abbink, Simeon Craig Calvert, Luciano Cavalcante Siebert, Arkady Zgonnikov
Abstract:
Partial driving automation creates a tension: drivers remain legally responsible for vehicle behaviour, yet their active control is significantly reduced. This reduction undermines the engagement and sense of agency needed to intervene safely. Meaningful human control (MHC) has been proposed as a normative framework to address this tension. However, empirical methods for evaluating whether existing systems actually provide MHC remain underdeveloped. In this study, we investigated the extent to which drivers experience MHC when interacting with partially automated driving systems. Twenty-four drivers completed a simulator study involving silent automation failures under two modes - haptic shared control (HSC) and traded control (TC). We derived behavioural metrics from telemetry data, subjective perception scores from post-trial surveys and used them to test hypothesised relations between them derived from the properties of systems under MHC. The confirmatory analysis showed a significant negative correlation between the perception of the automated vehicle (AV) understanding the driver and conflict in steering torques. An exploratory analysis also revealed a surprising positive correlation between reaction times and the perception of sufficient control. Qualitative feedback from open-ended post-experiment questionnaires revealed that mismatches in intentions between the driver and automation, lack of safety, and resistance to driver inputs contribute to the reduction of perceived MHC, while subtle haptic guidance aligned with driver intent had a positive effect. These findings suggest that future designs should prioritise effortless driver interventions, transparent communication of automation intent, and context-sensitive authority allocation to strengthen meaningful human control in partially automated driving.
Authors:Kunjie Jia, Kai Cui, Huimin He, Yiran Du
Abstract:
This study investigates Chinese teachers' continuance intention to use generative artificial intelligence (AI) by integrating the Expectation-Confirmation Model with Institutional Theory. A sequential explanatory mixed-methods design was employed. Questionnaire data from 437 teachers were analysed using structural equation modelling, followed by semi-structured interviews with 15 teachers to further interpret the findings. The results indicate that confirmation, perceived usefulness, and satisfaction play important roles in shaping teachers' continuance intention, while institutional pressures, including coercive, normative, and mimetic influences, also contribute to continued use. Qualitative findings further reveal that teachers often use generative AI pragmatically to support tasks such as lesson preparation and idea generation, while simultaneously exercising caution and critically evaluating the reliability of AI-generated content. These findings highlight the combined influence of individual evaluations and institutional contexts on teachers' sustained engagement with generative AI in education.
Authors:Yiran Du, Huimin He
Abstract:
This study examined university students' discontinuance intention towards AI-mediated informal digital learning of English (AI-IDLE). Drawing on the cognition-affect-conation framework, the study investigated how three cognitive factors, namely disconfirmation, perceived complexity, and perceived risk, influence two affective responses, namely dissatisfaction and frustration, and how these affective responses predict discontinuance intention. A cross-sectional survey was conducted with 746 Chinese university students who had experience using AI tools for informal English learning. Data were analysed using structural equation modelling (SEM) and fuzzy-set qualitative comparative analysis (fsQCA). The SEM results showed that dissatisfaction and frustration positively predicted discontinuance intention, with frustration showing the stronger effect. Disconfirmation, perceived complexity, and perceived risk also positively influenced dissatisfaction and frustration. The fsQCA results further identified multiple sufficient configurations leading to high AI-IDLE discontinuance intention, indicating that discontinuance is shaped by causal complexity and equifinality rather than by a single necessary condition. These findings extend AI-IDLE research from adoption and engagement to post-adoption disengagement and provide implications for reducing learners' dissatisfaction, frustration, perceived complexity, and risk in AI-supported informal English learning.
Authors:Yiran Du, Huimin He
Abstract:
This study examined intermittent discontinuance in AI-mediated informal digital learning of English (AI-IDLE) through the cognition-affect-conation framework. Survey data were collected from 632 Chinese university EFL learners with prior AI-IDLE experience and analysed using structural equation modelling and fuzzy-set qualitative comparative analysis. The SEM results showed that perceived intelligence, perceived interactivity, and perceived personalisation reduced AI-IDLE intermittent discontinuance indirectly through enjoyment, whereas perceived ineffectiveness, perceived uncontrollability, and perceived complexity increased discontinuance indirectly through boredom. The fsQCA results further identified four configurational pathways leading to intermittent discontinuance, indicating that learners' temporary withdrawal from AI-IDLE can result from different combinations of cognitive barriers and affective disengagement. These findings extend AI-IDLE research from adoption and continuance to post-adoption discontinuance and highlight the need to design AI-supported English learning experiences that are enjoyable, personalised, controllable, and cognitively manageable.
Authors:Kashif Imteyaz, Isabel Lopez, Nakul Rajpal, Hunjun Shin, Saiph Savage
Abstract:
Freelance workers must continually acquire new skills to remain competitive in online labor markets, yet they lack the organizational training, mentorship, and infrastructure available to traditional employees. Generative AI-powered tools like ChatGPT are reshaping market skill demands while also offering new forms of on-demand learning support to meet those demands. Despite growing interest in AI-powered learning tools, little is known about how freelancers actually use these tools to learn, the challenges they encounter, and how generative AI for learning interacts with precarity and competition in platform-based work. We present a mixed-methods study combining a survey and semi-structured interviews with freelance knowledge workers. Grounded in self-directed learning theory, we examine how freelancers integrate generative AI tools into their learning practices. Our findings show that freelancers increasingly rely on generative AI to structure learning and support exploratory skill acquisition, but do not treat it as their primary learning resource due to inconsistency, lack of contextual relevance, and verification overhead. We identify a shift from learning as growth to learning as survival, where upskilling is oriented toward immediate market viability rather than long-term development. We also surface a structural challenge we term invisible competencies, in which workers acquire skills through generative AI tools but lack credible ways to signal or validate these skills in competitive freelance markets. Based on these insights, we offer design recommendations for generative AI-powered learning tools for freelancers.
Authors:Daiana Rinja, Eduardo Araujo Oliveira, Sonsoles López-Pernas, Mohammed Saqr, Marcus Specht, Kamila Misiejuk
Abstract:
Generative AI is reshaping higher education programming through vibe coding, where students collaborate with AI via natural language rather than writing code line-by-line. We conceptualize this practice as help-seeking, analyzing 19,418 interaction turns from 110 undergraduate students. Using inductive coding and Heterogeneous Transition Network Analysis, we examined interaction sequences to compare top- and low-performing students. Results reveal that top performers engaged in instrumental help-seeking -- inquiry and exploration -- eliciting tutor-like AI responses. In contrast, low performers relied on executive help-seeking, frequently delegating tasks and prompting the AI to assume an executor role focused on ready-made solutions. These findings indicate that currently generative AI mirrors student intent (whether productive or passive) rather than optimizing for learning. To evolve from tools to teammates, AI systems must move beyond passive compliance. We argue for pedagogically aligned design that detect unproductive delegation and adaptively steer educational interactions toward inquiry, ensuring student-AI partnerships augment rather than replace cognitive effort.
Authors:Ruei-Che Chang, Xirui Jiang, Rosiana Natalie, Hao Chen, Vlad Roznyatovskiy, Jianzhong Zhang, Kang G. Shin, Ke Sun, Anhong Guo
Abstract:
Real-world environments evolve continuously, yet blind and low-vision (BLV) individuals often have limited access to understanding how they change over time. Unexpected or relocated objects, layout modifications, and content updates (e.g., price changes) can introduce safety risks and cognitive burden. While existing visual assistive technologies can describe immediate surroundings, they operate as one-off interactions and lack mechanisms to surface meaningful changes across revisits. Informed by a survey of 33 BLV individuals, we develop StateScribe, a system that supports accessible awareness of real-world changes across revisits. StateScribe employs a dual-layer memory architecture that integrates episodic scene memory and object-centric temporal memory to enable scalable and structured change tracking. It provides both live descriptions of the current scene, and descriptions of what has changed, when and where it occurred across revisits, such as "The shop on your right has a "CLOSED" sign; it was open at this time last week.'' Our evaluation shows that StateScribe maintains high accuracy (F1-score=83.1%) across 11 revisits, while remaining low-latency (mean<1.54s) and memory-efficient (<54MB) across 110 revisits. A user study with nine BLV participants demonstrates that StateScribe improves change awareness across revisits in three real-world locations. Finally, we discuss implications for long-term AI-assisted companions that support broader change observation using multimodal sensing, extend beyond changes to other memory capabilities, and adapt to individual users, intents, and contexts.
Authors:Minji Jung, Minjae Lee, Yejin Kim, Sarang Choi, Minsuk Kahng
Abstract:
LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. Building on this analysis, we introduce a visualization interface that allows users to define their own evaluation priorities by selecting and weighting prompt slices and to explore how rankings change accordingly. A qualitative study suggests that this interactive approach improves transparency and supports more context-specific model evaluation, pointing toward alternative ways to design and use LLM leaderboards.
Authors:Kelly McConvey, Dipto Das, Maya Ghai, Angelina Zhai, Rosa Lee, Shion Guha
Abstract:
Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.
Authors:Andre Ye, Jenny Y. Huang, Alicia Guo, Rose Novick, Tamara Broderick, Mitchell L. Gordon
Abstract:
When language models answer open-ended problems, they implicitly make hidden decisions that shape their outputs, leaving users with uncontextualized answers rather than a working map of the problem; drawing on multiverse analysis from statistics, we build and evaluate the conceptual multiverse, an interactive system that represents conceptual decisions such as how to frame a question or what to value as a space users can transparently inspect, intervenably change, and check against principled domain reasoning; for this structure to be worth navigating rather than misleading, it must be rigorous and checkable against domain reasoning norms, so we develop a general verification framework that enforces properties of good decision structures like unambiguity and completeness calibrated by expert-level reasoning; across three domains, the conceptual multiverse helped participants develop a working map of the problem, with philosophy students rewriting essays with sharper framings and reversed theses, alignment annotators moving from surface preferences to reasoning about user intent and harm, and poets identifying compositional patterns that clarified their taste.
Authors:Hengky Susanto, David James Woo, Chingyi Yeung, Stephanie Wing Yan Lo-Philip, Chi Ho Yeung
Abstract:
The rapid evolution of Large Language Models (LLMs) has made them powerful tools for enhancing student writing. This study explores the extent and limitations of LLMs in assisting secondary-level English as a Foreign Language (EFL) students with their writing tasks. While existing studies focus on output quality, our research examines the developmental shift in LLMs and their impact on EFL students, assessing whether smarter models act as true scaffolds or mere compensatory crutches. To achieve this, we analyse student compositions assisted by LLMs before and after ChatGPT's release, using both expert qualitative scoring and quantitative metrics (readability tests, Pearson's correlation coefficient, MTLD, and others). Our results indicate that advanced LLMs boost assessment scores and lexical diversity for lower-proficiency learners, potentially masking their true ability. Crucially, increased LLM assistance correlated negatively with human expert ratings, suggesting surface fluency without deep coherence. To transform AI-assisted practice into genuine learning, pedagogy must shift from focusing on output quality to verifying the learning process. Educators should align AI functions, specifically differentiating ideational scaffolding from textual production, within the learner's Zone of Proximal Development.
Authors:Nuredin Ali Abdelkadir, Tianling Yang, Shivani Kapania, Kauna Ibrahim Malgwi, Fasica Berhane Gebrekidan, Adio-Adet Dinika, Elaine O. Nsoesie, Milagros Miceli, Stevie Chancellor
Abstract:
Content moderators review disturbing content to protect social media users, often at significant cost to their mental health. Recent reports document the mental health conditions of African moderators as notably problematic. Beyond the content itself, what factors contribute to the deteriorating mental health of these workers? We surveyed 134 moderators across Africa to understand their mental health and interviewed 15 moderators to contextualize their experiences. We found that African moderators suffer from high psychological distress and lower well-being compared to moderators in other areas. Former moderators showed significantly higher distress levels, demonstrating long term impact that extends beyond their moderation work. Our interviews showed that systemic and structural labor conditions contribute to moderators' severe psychological distress and diminished mental well-being. Corporate wellness programs promoted by platforms were found ineffective and inadequate. We discuss how this requires holistic attention and structural solutions by all involved parties to improve moderators' mental health.
Authors:Yiran Du, Jinlong Li, Huimin He, Chenghao Wang, Bin Zou
Abstract:
This study investigates Chinese primary school students' acceptance of a social robot for English-as-a-foreign-language (EFL) speaking practice through a sequential explanatory mixed-methods design. Integrating the Technology Acceptance Model (TAM) and the Computers Are Social Actors (CASA) paradigm, the research explores both functional and social factors influencing learners' behavioural intention to use the robot. Quantitative data from 436 students were analysed using structural equation modelling, followed by qualitative interviews with twelve students to interpret the findings. Results show that perceived enjoyment and ease of use are the strongest predictors of acceptance, while social attributes such as warmth, anthropomorphism, and social presence significantly enhance enjoyment. Perceived intelligence affects usefulness but not ease of use. The findings suggest that emotional and social engagement are central to young learners' acceptance of educational robots, highlighting the importance of designing socially intelligent technologies that promote motivation and speaking confidence in EFL learning contexts.
Authors:Yiran Du, Huimin He
Abstract:
The growing use of generative artificial intelligence (AI) in academic writing has raised increasing concerns regarding transparency and academic integrity in higher education. This study examines the psychological factors influencing English for Academic Purposes (EAP) students' intention to disclose their use of AI tools. Drawing on the cognition-affect-conation framework, the study proposes a model integrating both enabling and inhibiting factors shaping disclosure intention. A sequential explanatory mixed-methods design was employed. Quantitative data from 324 EAP students at an English-medium instruction university in China were analysed using structural equation modelling, followed by semi-structured interviews with 15 students to further interpret the findings. The quantitative results indicate that psychological safety positively predicts AI disclosure intention, whereas fear of negative evaluation negatively predicts it. The qualitative findings further reveal that supportive teacher practices and clear guidance foster psychological safety, while policy ambiguity and reputational concerns intensify fear of negative evaluation and discourage disclosure. These findings highlight the importance of clear institutional policies and supportive pedagogical environments in promoting transparent AI use.
Authors:Yiran Du, Huimin He
Abstract:
This study investigates students' AI use concealment intention in higher education by integrating the cognition-affect-conation (CAC) framework with a dual-method approach combining structural equation modelling (SEM) and fuzzy-set qualitative comparative analysis (fsQCA). Drawing on data from 1346 university students, the findings reveal two opposing mechanisms shaping concealment intention. The enabling pathway shows that perceived stigma, perceived risk, and perceived policy uncertainty increase fear of negative evaluation, which in turn promotes concealment. In contrast, the inhibitory pathway demonstrates that AI self-efficacy, perceived fairness, and perceived social support enhance psychological safety, thereby reducing concealment intention. SEM results confirm the hypothesised relationships and mediation effects, while fsQCA identifies multiple configurational pathways, highlighting equifinality and the central role of fear of negative evaluation across conditions. The study contributes to the literature by conceptualising concealment as a distinct behavioural outcome and by providing a nuanced explanation that integrates both net-effect and configurational perspectives. Practical implications emphasise the need for clear institutional policies, destigmatisation of appropriate AI use, and the cultivation of supportive learning environments to promote transparency.
Authors:Anna Bodonhelyi, Augustin Curinier, Babette Bühler, Gerrit Anders, Lisa Rausch, Markus Huff, Ulrich Trautwein, Ralph Ewerth, Peter Gerjets, Enkelejda Kasneci
Abstract:
Detecting mind wandering is crucial in online education, and it occurs 30% of the time, as it directly impacts learners' retention, comprehension, and overall success in self-directed learning environments. Integrating automated detection algorithms enables the deployment of targeted interventions within adaptive learning environments, paving the way for more responsive and personalized educational systems. However, progress is hampered by a lack of coherent frameworks for identifying mind wandering in online environments. This work presents a comprehensive systematic review and benchmark of mind wandering detection across 14 datasets covering EEG, facial video, eye tracking, and physiological signals in educational settings, motivated by the challenges in achieving reliable detection and the inconsistency of results across studies caused by variations in models, preprocessing approaches, and evaluation metrics. We implemented a generalizable preprocessing and feature extraction pipeline tailored to each modality, ensuring fair comparison across diverse experimental paradigms. 13 traditional machine learning and neural network models, including federated learning approaches, were evaluated on each dataset. In a novel ablation study, we explored mind wandering detection from post-probe data, motivated by findings that learners often re-engage with material after mind wandering episodes through re-reading or re-watching. Results highlight the potential and limitations of different modalities and classifiers for mind wandering detection, and point to new opportunities for supporting online learning. All code and preprocessing scripts are made openly available to support reproducibility and future research.
Authors:Jiaxiong Hu, Ruowen Niu, Qiuxin Du, Chenzhuo Xiang, Yirui Zuo, Jihong Jeung, Xiaojuan Ma
Abstract:
Neurocognitive disorders (NCDs), such as Alzheimer's disease, are globally prevalent and require scalable screening methods for proactive management. Prior research has explored the potential of technologies like conversational AI (CAI) to administer NCD screening tests. However, challenges remain in designing CAI-based solutions that make routine NCD screening socially acceptable, engaging, and capable of encouraging early medical consultation. In this study, we conducted interviews with 36 participants, including clinicians, individuals at risk of NCDs, and their caregivers, to explore the speculative future of adopting CAI for NCD screening. Our findings reveal shared expectations, such as deploying CAI in home or community settings to reduce social stress. Nonetheless, conflicts emerged among stakeholders, for example, users' need for emotional support may conflict with clinicians' preference for CAI's professional and standardized administration. Then, we look into the user journey of NCD screening based on the current practice of manual screening and the expected CAI-supported screening. Finally, leveraging the human-centered approach, we provide actionable implications for future CAI design in NCD screening.
Authors:Yuhang Wang, Yiyao Xu, Chaoyun Yang, Lingyao Li, Jingran Sun, Hao Zhou
Abstract:
Existing driving automation (DA) systems on production vehicles rely on human drivers to decide when to engage DA while requiring them to remain continuously attentive and ready to intervene. This design demands substantial situational judgment and imposes significant cognitive load, leading to steep learning curves, suboptimal user experience, and safety risks from both over-reliance and delayed takeover. Predicting when drivers hand over control to DA and when they take it back is therefore critical for designing proactive, context-aware HMI, yet existing datasets rarely capture the multimodal context, including road scene, driver state, vehicle dynamics, and route environment. To fill this gap, we introduce BATON, a large-scale naturalistic dataset capturing real-world DA usage across 127 drivers, and 136.6 hours of driving. The dataset synchronizes front-view video, in-cabin video, decoded CAN bus signals, radar-based lead-vehicle interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. We define three benchmark tasks: driving action understanding, handover prediction, and takeover prediction, and evaluate baselines spanning sequence models, classical classifiers, and zero-shot VLMs. Results show that visual input alone is insufficient for reliable transition prediction: front-view video captures road context but not driver state, while in-cabin video reflects driver readiness but not the external scene. Incorporating CAN and route-context signals substantially improves performance over video-only settings, indicating strong complementarity across modalities. We further find takeover events develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry with direct implications for HMI design in assisted driving systems.
Authors:Robert Zimmermann, Thomas Norrenbrock, Bodo Rosenhahn
Abstract:
Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the \texttt{CLS} token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.
Authors:Aruzhan Sabitkyzy, Maksat Shagyrov, Pakizar Shamoi
Abstract:
Is it coral, salmon, or peach? What seems like a simple color can have many names, and without a standard, these variations create confusion across design, technology, and communication. Color naming is a fundamental task across industries such as fashion, cosmetics, web design, and visualization tools. However, the lack of universally accepted color naming standards leads to inconsistent color standards across platforms, applications, and industries. Moreover, these systems include hundreds or thousands of overlapping, perceptually indistinct shades, despite the fact that humans typically distinguish only a limited number of unique color categories in practice. In this study, we propose a clustering-based multisource data framework to build a standardized color-naming system. We collected a dataset of over 19,555 RGB values paired with color names from 20 diverse sources. After data cleaning and normalization, we converted the colors to the perceptually uniform CIELAB color space and applied K-means clustering using the CIEDE2000 color difference metric, identifying 280 optimal clusters. For each cluster, we performed a frequency analysis of the associated names to assign representative labels. The resulting system reflects naturally occurring linguistic patterns. We demonstrate its effectiveness in automatic annotation and content-based image retrieval on a clothing dataset. This approach opens new opportunities for standardized, perceptually grounded color labeling in practical applications such as generative AI, visual search, and design systems.
Authors:Erina Seh-Young Moon, Shion Guha
Abstract:
Public sector agencies perform the critical task of implementing the redistributive role of the State by acting as the leading provider of critical public services that many rely on. In recent years, public agencies have been increasingly adopting algorithmic prioritization tools to determine which individuals should be allocated scarce public resources. Prior work on these tools has largely focused on assessing and improving their fairness, accuracy, and validity. However, what remains understudied is how the structural design of prioritization itself shapes both the effectiveness of these tools and the experiences of those subject to them under realistic public sector conditions. In this study, we demonstrate the fallibility of adopting a prioritization approach in the public sector by showing how the underlying mechanisms of prioritization generate significant relative disparities between groups of intersectional identities as resources become increasingly scarce. We argue that despite prevailing arguments that prioritization of resources can lead to efficient allocation outcomes, prioritization can intensify perceptions of inequality for impacted individuals. We contend that efficiencies generated by algorithmic tools should not be conflated with the dominant rhetoric that efficiency necessarily entails "doing more with less" and we highlight the risks of overlooking resource constraints present in real-world implementation contexts.
Authors:Shiori Nakamura, Masato Kikuchi, Tadachika Ozono
Abstract:
Point of purchase (POP) materials can be created to assist non-experts by combining large language models (LLMs) with human insight. Persuasive POP texts require both customer understanding and expressive writing skills. However, LLM-generated texts often lack creative diversity, while human users may have limited experience in marketing and content creation. To address these complementary limitations, we propose a prototype system for small retail stores that enhances POP creation through human-AI collaboration. The system supports users in understanding target customers, generating draft POP texts, refining expressions, and evaluating candidates through simulated personas. Our experimental results show that this process significantly improves text quality: the average evaluation score increased by 2.37 points on a -3 to +3 scale compared to that created without system support.
Authors:Aoi Naito, Hirokazu Shirado
Abstract:
Artificial intelligence (AI) is understood to affect the content of people's decisions. Here, using a behavioral implementation of the classic Newcomb's paradox in 1,305 participants, we show that AI can also change how people decide. In this paradigm, belief in predictive authority can lead individuals to constrain decision-making, forgoing a guaranteed reward. Over 40% of participants treated AI as such a predictive authority. This significantly increased the odds of forgoing the guaranteed reward by a factor of 3.39 (95% CI: 2.45-4.70) compared with random framing, and reduced earnings by 10.7-42.9%. The effect appeared across AI presentations and decision contexts and persisted even when predictions failed. When people believe AI can predict their behavior, they may self-constrain it in anticipation of that prediction.
Authors:Jazmin Collins, Prasanthi Gurumurthy, Eric J. Gonzalez, Mar Gonzalez-Franco
Abstract:
The gaze-and-pinch framework offers a high-fidelity interaction modality for spatial computing in virtual reality (VR), yet it remains vulnerable to coordination errors--timing misalignments between gaze fixation and pinch gestures. These errors are categorized into two types: late triggers (gaze leaves a target before pinch) and early triggers (pinch before gaze arrival on target). While late triggers are well-studied, early triggers lack robust solutions. We investigate two heuristics--STICKY selection (temporal buffer) and MAGNETIC selection (spatial field)--to mitigate these errors. A within-subjects study (N = 9) on the Samsung Galaxy XR evaluated these heuristics against a baseline. Findings indicate that while throughput and selection time remained stable, the heuristics fundamentally shifted user behavior and significantly reduced errors during selection. Notably, MAGNETIC selection induced an "offloading" effect where users traded precision for speed. Additionally, the heuristics reclassified ambiguous failures as explainable coordination errors. We provide recommendations for selection heuristics that enhance interaction speed and cognitive agency in virtual reality.
Authors:Takumi Kato, Masato Kikuchi, Tadachika Ozono
Abstract:
Effective instruction in tutoring requires promptly providing instructional materials that match the needs of each student (e.g., in response to questions). In this study, we introduce an agent that automatically delivers supplementary materials on demand during one-on-one tutoring sessions. Our agent uses a multimodal large language model to analyze spoken dialogue between the instructor and the student, automatically generate search queries, and retrieve relevant Web images. Evaluation experiments demonstrate that our agent reduces the average image retrieval time by 44.4 s compared to cases without support and successfully provides images of acceptable quality in 85.7% of trials. These results indicate that our agent effectively supports instructors during tutoring sessions.
Authors:Susana Nunes, Tiago Guerreiro, Catia Pesquita
Abstract:
AI explanation methods often assume a static user model, producing non-adaptive explanations regardless of expert goals, reasoning strategies, or decision contexts. Knowledge graph-based explanations, despite their capacity for grounded, path-based reasoning, inherit this limitation. In complex domains such as scientific discovery, this assumption fails to capture the diversity of cognitive strategies and epistemic stances among experts, preventing explanations that foster deeper understanding and informed decision-making. However, the scarcity of human experts limits the use of direct human feedback to produce adaptive explanations. We present a reinforcement learning approach for scientific explanation generation that incorporates agentic personas, structured representations of expert reasoning strategies, that guide the explanation agent towards specific epistemic preferences. In an evaluation of knowledge graph-based explanations for drug discovery, we tested two personas that capture distinct epistemic stances derived from expert feedback. Results show that persona-driven explanations match state-of-the-art predictive performance while persona preferences closely align with those of their corresponding experts. Adaptive explanations were consistently preferred over non-adaptive baselines (n = 22), and persona-based training reduces feedback requirements by two orders of magnitude. These findings demonstrate how agentic personas enable scalable adaptive explainability for AI systems in complex and high-stakes domains.
Authors:Jiyeon Bae, Mingyu An, Jeongin Park, Seokweon Jung, Kiroong Choe, Jinwook Seo
Abstract:
Exploratory data analysis (EDA) is often hindered by cold-start friction; when users lack specific analytic goals, they struggle to configure complex visualization parameters. While existing visualization tools mostly rely on explicit user input to frame data, we propose leveraging the physical environment as an implicit framing mechanism. We introduce a conceptual framework that uses the geometric and spatial properties of physical containers in Augmented Reality (AR) to guide data interpretation. We characterize how container attributes, such as number of faces, size, proportion, and shape, give rise to distinct perceptual tendencies. For example, a circular container may encourage cyclic interpretation, while juxtaposed planar faces may facilitate comparative analysis. By treating physical forms as environmental framing conditions, we show how AR can orient a user's attention and structure their exploration without requiring manual encoding or prescribing fixed conclusions. We demonstrate this framework through a series of AR design examples illustrating how container morphology foregrounds cyclic, comparative, and sequential analytic patterns.
Authors:Kanyu Chen, Rebecca Panskus, Erwin Wu, Yichen Peng, Daichi Saito, Emiko Kamiyama, Ruiteng Li, Chen-Chieh Liao, Karola Marky, Kato Akira, Hideki Koike, Kai Kunze
Abstract:
Vocal training is difficult because the muscles that control pitch, resonance, and phonation are internal and invisible to learners. This paper investigates how Electromyography (EMG) and ultrasonic imaging (UI) can make these muscles observable for training purposes. We report three studies. First, we analyze the EMG and UI data from 16 singers (beginners, experienced & professionals), revealing differences among three vocal groups of the muscle control proficiency. Second, we use the collected data to create a system that visualizes an expert's muscle activity as reference. This system is tested in a user study with 12 novices, showing that EMG highlighted muscle activation nuances, while UI provided insights into vocal cord length and dynamics. Third, to compare our approach to traditional methods (audio analysis and coach instructions), we conducted a focus group study with 15 experienced singers. Our results suggest that EMG is promising for improving vocal skill development and enhancing feedback systems. We conclude the paper with a detailed comparison of the analyzed modalities (EMG, UI and traditional methods), resulting in recommendations to improve vocal muscle training systems.
Authors:Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmimento, Marie Antoinette Patalagsa
Abstract:
Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for its users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to communicate comfortably various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in contexts hen analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual's needs through personalized, domain specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.
Authors:Emily Chen, Alexander J. Bisberg, Dmitri Williams, Magy Seif El-Nasr, Emilio Ferrara
Abstract:
This paper examines how player flexibility -- a player's willingness to engage in a breadth of options or specialize -- manifests across two gaming environments: League of Legends (League) and Teamfight Tactics (TFT). We analyze the gameplay decisions of 4,830 players who have played at least 50 competitive games in both titles and explore cross-game dynamics of behavior retention and consistency. Our work introduces a novel cross-game analysis that tracks the same players' behavior across two different environments, reducing self-selection bias. Our findings reveal that while games incentivize different behaviors (specialization in League versus flexibility in TFT) for performance-based success, players exhibit consistent behavior across platforms. This study contributes to long-standing debate about agency versus structure, showing individual agency may be more predictive of cross-platform behavior than game-imposed structure in competitive settings. These insights offer implications for game developers, designers and researchers interested in building systems to promote behavior change.
Authors:Carlos Rafael Catalan, Lheane Marie Dizon, Patricia Nicole Monderin, Emily Kuang
Abstract:
Over-reliance on AI systems can undermine users' critical thinking and promote complacency, a risk intensified by the emergence of agentic AI systems that operate with minimal human involvement. In software engineering, agentic coding assistants (ACAs) are rapidly becoming embedded in everyday development workflows. Since software engineers (SEs) create systems deployed across diverse and high-stakes real-world contexts, these assistants must function not merely as autonomous task performers but as Tools for Thought that actively support human reasoning and sensemaking. We conducted a formative study examining software engineers' cognitive engagement and sensemaking processes when working with an ACA. Our findings reveal that cognitive engagement consistently declines as tasks progress, and that current ACA designs provide limited affordances for reflection, verification, and meaning-making. Based on these findings, we identify concrete design opportunities leveraging richer interaction modalities and cognitive-forcing mechanisms to sustain engagement and promote deeper thinking in AI-assisted programming.
Authors:Donghoon Shin, Alice Gao, Rock Yuren Pang, Jaewook Lee, Katharina Reinecke, Emily Tseng
Abstract:
Generative AI is known for its tendency to homogenize, often reproducing dominant style conventions found in training data. However, it remains unclear how these homogenizing effects extend to complex structural tasks like web design. As lay creators increasingly turn to LLMs to 'vibe-code' websites -- prompting for aesthetic and functional goals rather than writing code -- they may inadvertently narrow the diversity of their designs, and limit creative expression throughout the internet. In this paper, we interrogate the possibility of design homogenization in web vibe coding. We first characterize the vibe coding lifecycle, pinpointing stages where homogenization risks may arise. We then conduct a sociotechnical risk analysis unpacking the potential harms of web vibe coding and their interaction with design homogenization. We identify that the push for frictionless generation can exacerbate homogenization and its harms. Finally, we propose a mitigation framework centered on the idea of productive friction. Through case studies at the micro, meso, and macro levels, we show how centering productive friction can empower creators to challenge default outputs and preserve diverse expression in AI-mediated web design.
Authors:Zhuyu Teng, Pei Chen, Yichen Cai, Ruoqing Lu, Zhaoqu Jiang, Jiayang Li, Weitao You, Lingyun Sun
Abstract:
Despite advances in multimodal AI, current vision-based assistants often remain inefficient in collaborative tasks. We identify two key gulfs: a communication gulf, where users must translate rich parallel intentions into verbal commands due to the channel mismatch , and an understanding gulf, where AI struggles to interpret subtle embodied cues. To address these, we propose Eye2Eye, a framework that leverages first-person perspective as a channel for human-AI cognitive alignment. It integrates three components: (1) joint attention coordination for fluid focus alignment, (2) revisable memory to maintain evolving common ground, and (3) reflective feedback allowing users to clarify and refine AI's understanding. We implement this framework in an AR prototype and evaluate it through a user study and a post-hoc pipeline evaluation. Results show that Eye2Eye significantly reduces task completion time and interaction load while increasing trust, demonstrating its components work in concert to improve collaboration.
Authors:Mei Tan, Lena Phalen, Dorottya Demszky
Abstract:
Effective personalized feedback is critical to students' literacy development. Though LLM-powered tools now promise to automate such feedback at scale, LLMs are not language-neutral: they privilege standard academic English and reproduce social stereotypes, raising concerns about how "personalization" shapes the feedback students receive. We examine how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) adapt written feedback in response to student attributes. Using 600 eighth-grade persuasive essays from the PERSUADE dataset, we generated feedback under prompt conditions embedding gender, race/ethnicity, learning needs, achievement, and motivation. We analyze lexical shifts across model outputs by adapting the Marked Words framework. Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes--even when essay content was identical. Feedback for students marked by race, language, or disability often exhibited positive feedback bias and feedback withholding bias--overuse of praise, less substantive critique, and assumptions of limited ability. Across attributes, models tailored not only what content was emphasized but also how writing was judged and how students were addressed. We term these instructional orientations Marked Pedagogies and highlight the need for transparency and accountability in automated feedback tools.
Authors:Esen K. Tütüncü, Mar Gonzalez-Franco, Khushman Patel, Eric J. Gonzalez
Abstract:
As Extended Reality (XR) systems increasingly map and understand the physical world, interacting with these blended representations remains challenging. The current push for "natural" inputs has its trade-offs: touch is limited by human reach and fatigue, while gaze often lacks the precision for fine interaction. To bridge this gap, we introduce World Mouse, a cross-reality cursor that reinterprets the familiar 2D desktop mouse for complex 3D scenes. The system is driven by two core mechanisms: within-object interaction, which uses surface normals for precise cursor placement, and between-object navigation, which leverages interpolation to traverse empty space. Unlike previous virtual-only approaches, World Mouse leverages semantic segmentation and mesh reconstruction to treat physical objects as interactive surfaces. Through a series of prototypes, including object manipulation and screen-to-world transitions, we illustrate how cross-reality cursors may enable seamless interactions across real and virtual environments.
Authors:Youjin Choi, Jaeyoung Moon, Jinyoung Yoo, Jennifer G. Kim, Jin-Hyuk Hong
Abstract:
Songwriting has long served as a powerful medium for expressing unconscious emotions and fostering self-awareness in psychotherapy. Due to the auditory-centric nature of traditional approaches, Deaf and Hard-of-Hearing (DHH) individuals have often been excluded from music's therapeutic benefits. In response, this study presents a music psychotherapy tool co-designed with therapists, integrating conversational agents (CAs) and music generative AI as symbolic and therapeutic media. Through a usage study with 23 DHH individuals, we found that collaborative song writing with the CA enabled them to experience emotional release, reinterpretation, and deeper self-understanding. In particular, the CA's strategies -- supportive empathy, example response options, and visual-based metaphors -- were found to facilitate musical dialogue effectively for DHH individuals. These findings contribute to inclusive AI design by showing the potential of human-AI collaboration to bridge therapeutic artistic practices.
Authors:Youjin Choi, Jinyoung Yoo, Jaeyoung Moon, Yoonjae Kim, Eun Young Lee, Jennifer G. Kim, Jin-Hyuk Hong
Abstract:
The rapid advancement of generative AI (GenAI) is expanding access to songwriting, offering a new medium of self-expression for Deaf and Hard-of-Hearing (DHH) individuals. However, emerging technologies that support DHH individuals in expressing themselves through music have largely been evaluated in single-session settings and often fall short in helping users unfamiliar with songwriting convey personal narratives or sustain engagement over time. This paper explores songwriting as an extended, music-based journaling practice that supports sustained emotional reflection over multiple sessions. We introduce SoulNote, a GenAI system enabling DHH to engage in iterative songwriting. Grounded in user-centered design, including a design workshop, a preliminary study, and a multi-session diary study, our findings show that ongoing songwriting with \textit{SoulNote} facilitated emotional growth across three dimensions: self-insight, emotion regulation, and \revised{everyday attitudes toward emotions and self-care}. Overall, this work demonstrates how GenAI can support marginalized communities by transforming creative expression into a daily practice of self-discovery and reflection.
Authors:Chuxuan Zhang, Bermet Burkanova, Lawrence H. Kim, Grace Iarocci, Elina Birmingham, Angelica Lim
Abstract:
What nonverbal behaviors should a robot respond to? Understanding how children-both neurotypical and autistic-engage with embodied artificial agents is critical for developing inclusive and socially interactive systems. In this paper, we study "open-ended" unconstrained interactions with embodied agents, where little is known about how children behave nonverbally when given few instructions. We conducted a Wizard-of-Oz study in which children were invited to interact nonverbally with 6 different embodied virtual characters displayed on a television screen. We collected 563 (141 unique) nonverbal behaviors produced by children and compare the childre's interaction patterns with those previously reported in an adult study. We also report the presence of repetitive face and hand movements, which should be considered in the development of nonverbally interactive artificial agents.
Authors:Yuhang Wang, Yiyao Xu, Jingran Sun, Hao Zhou
Abstract:
Takeovers remain a key safety vulnerability in production ADAS, yet existing public resources rarely provide takeover-centered, real-world data. We present ADAS-TO, the first large-scale naturalistic dataset dedicated to ADAS-to-manual transitions, containing 15,659 takeover-centered 20s clips from 327 drivers across 22 vehicle brands. Each clip synchronizes front-view video with CAN logs. Takeovers are defined as ADAS ON $\rightarrow$ OFF transitions, with the primary trigger labeled as brake, steer, gas, mixed, or system disengagement. We further separate planned driver-initiated terminations (Ego) from forced takeovers (Non-ego) using a rule-based partition. While most events occur within conservative kinematic margins, we identify a long tail of 285 safety-critical cases. For these events, we combine kinematic screening with vision--language (VLM) annotation to attribute hazards and relate them to intervention dynamics. The resulting cross-modal analysis shows distinct kinematic signatures across traffic dynamics, infrastructure degradation, and adverse environments, and finds that in 59.3% of critical cases, actionable visual cues emerge at least 3s before takeover, supporting the potential for semantics-aware early warning beyond late-stage kinematic triggers. The dataset is publicly released at huggingface.co/datasets/HenryYHW/ADAS-TO-Sample.
Authors:Schrasing Tong, Minseok Jung, Ilaria Liccardi, Lalana Kagal
Abstract:
Differences in data distributions between demographic groups, known as the problem of infra-marginality, complicate how people evaluate fairness in machine learning models. We present a user study with 85 participants in a hypothetical medical decision-making scenario to examine two treatments: group-specific model performance and training data availability. Our results show that participants did not equate fairness with simple statistical parity. When group-specific performances were equal or unavailable, participants preferred models that produced equal outcomes; when performances differed, especially in ways consistent with data imbalances, they judged models that preserved those differences as more fair. These findings highlight that fairness judgments are shaped not only by outcomes, but also by beliefs about the causes of disparities. We discuss implications for popular group fairness definitions and system design, arguing that accounting for distributional context is critical to aligning algorithmic fairness metrics with human expectations in real-world applications.
Authors:Jaeyoung Moon, Mingzhuo Ma, Qifeng Yang, Youjin Choi, Seokhyun Hwang, Samuel Burden, Kyung-Joong Kim, Yiyue Luo
Abstract:
Cardiopulmonary resuscitation (CPR) is a critical life-saving procedure, and effective training benefits from self-directed practice beyond instructor-led sessions. In this paper, we propose a closed-loop CPR training glove that integrates a high-resolution tactile sensing array and vibrotactile actuators for self-directed practice. The tactile sensing array measures distributed pressures across the palm and dorsum to enable real-time estimation of compression rate, force, and hand pose. Based on these estimations, the glove delivers immediate haptic feedback to guide the user for proper CPR, reducing reliance on external audio-visual displays. We quantified the tactile sensor performance by measuring wide-range sensitivity (~0.85 over 0-600 N), computing hysteresis (56.04%), testing stability (11.05% drift over 300 cycles), and estimating global signal-to-noise ratio (18.90 +/- 2.41 dB at 600 N). Our closed-loop pipeline provides continuous modeling and feedback of key performance metrics essential for high-quality CPR. Our lightweight statistical models achieves >92% accuracy for force estimation and hand pose classification within sub-millisecond inference time. Our user study (N=8) showed that haptic feedback reduced visual distraction compared to audio-visual cues, though simplified patterns were required for reliable perception under dynamic load. These results highlight the feasibility of the proposed system and offer design insights for future haptic CPR self-training system.
Authors:Tom van Nuenen, Pratik S. Sachdeva
Abstract:
People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluated all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
Authors:Yashika Batra, Giuliano Pioldi, Promise Ekpo, Arman Sayatqyzy, Purnjay Maruur, Shalom Otieno, Kevin Ching, Angelique Taylor
Abstract:
While robots deployed in real-world environments inevitably experience interaction failures, understanding how users respond through verbal and non-verbal behaviors remains under-explored in human-robot interaction (HRI). This gap is particularly significant in healthcare-inspired settings, where interaction failures can directly affect task performance and user trust. We present the Robot Failures in Medical HRI (RFM-HRI) Dataset, a multimodal dataset capturing dyadic interactions between humans and robots embodied in crash carts, where communication failures are systematically induced during item retrieval tasks. Through Wizard-of-Oz studies with 41 participants across laboratory and hospital settings, we recorded responses to four failure types (speech, timing, comprehension, and search) derived from three years of crash-cart robot interaction data. The dataset contains 214 interaction samples including facial action units, head pose, speech transcripts, and post-interaction self-reports. Our analysis shows that failures significantly degrade affective valence and reduce perceived control compared to successful interactions. Failures are strongly associated with confusion, annoyance, and frustration, while successful interactions are characterized by surprise, relief, and confidence in task completion. Emotional responses also evolve across repeated failures, with confusion decreasing and frustration increasing over time. This work contributes (1) a publicly available multimodal dataset (RFM-HRI), (2) analysis of user responses to different failure types and preferred recovery strategies, and (3) a crash-cart retrieval scenario enabling systematic comparison of recovery strategies with implications for safety-critical failure recovery. Our findings provide a foundation for failure detection and recovery methods in embodied HRI.
Authors:Daniel Mejer Christensen, Katja Stougård Jørgensen, Josefine Palsgaard Wyrtz, Jennie Torp Overgaard, Niels van Berkel, Joel Wester
Abstract:
Research in Human-Computer Interaction (HCI) has shown that caring for others, including both humans (e.g., close friends) and computers (e.g., Tamagotchi), can have a positive effect on people's wellbeing. However, we know less about the potential role of conversational AI in such settings. In this work, we explore how AI chatbots can support plant care and, in turn, positively influence people's well-being. We developed a mobile application that allows users to `talk' to their plants via chatbots. We evaluated the application with ten participants and conducted semi-structured interviews based on Seligman's PERMA model, which identifies pillars of psychological well-being. Our findings suggest positive effects, with participants reflecting on a sense of connection to their plants and corresponding feelings of accomplishment. While our findings suggest that participants were generally positive about the app, they also raised concerns about the diverse preferences and expectations of users regarding interactions with chatbots representing plants.
Authors:Christoph Albert Johns, László Kopácsi, Michael Barz, Daniel Sonntag
Abstract:
Backcountry skiing is an activity where a group of skiers navigate challenging environmental conditions to ski outside of managed areas. This activity requires careful monitoring and effective communication around the current weather and terrain conditions to ensure skier safety. We aim to support and facilitate this communication by providing backcountry guides with a set of in situ spatial annotation tools to communicate hazards and appropriate speeds to the ski recreationalists. A guide can use a tablet application to annotate a photogrammetry-based map of a mountainside, for example, one collected using a commercial camera drone, with hazard points, slow-down zones, and safe zones. These annotations are communicated to the skiers via visual overlays in augmented reality heads-up displays. We present a prototype consisting of a web application and a virtual reality display that mirror the guide's and skier's perspectives, enabling participatory interaction design studies in a safe environment.
Authors:Lingyun Chen, Qing Xiao, Zitao Zhang, Eli Blevis, Selma Šabanović
Abstract:
Design-oriented HRI is increasingly interested in robots as long-term companions, yet many designs still assume a fixed form and a stable set of functions. We present an ongoing design research program that treats modularity as a designerly medium - a way to make long-term human-robot relationships discussable and material through co-design. Across a series of lifespan-oriented co-design activities, participants repeatedly reconfigured the same robot for different life stages, using modular parts to express changing needs, values, and roles. From these outcomes, we articulate PAS (Personalization-Adaptability-Sustainability) as a human-centered lens on how people enact modularity in practice: configuring for self-expression, adapting across transitions, and sustaining robots through repair, reuse, and continuity. We then sketch next steps toward a fabrication-aware, community-extensible modular platform and propose evaluation criteria for designerly HRI work that prioritize expressive adequacy, lifespan plausibility, repairability-in-use, and responsible stewardship - not only usability or performance.
Authors:Haojun Shi, Suyu Ye, Katherine M. Guerrerio, Jianzhi Shen, Yifan Yin, Daniel Khashabi, Chien-Ming Huang, Tianmin Shu
Abstract:
Successful cooperation among decentralized agents requires each agent to quickly adapt its plan to the behavior of other agents. In scenarios where agents cannot confidently predict one another's intentions and plans, language communication can be crucial for ensuring safety. In this work, we focus on path-level cooperation in which agents must adapt their paths to one another in order to avoid collisions or perform physical collaboration such as joint carrying. In particular, we propose a safe and interpretable multimodal path planning method, CaPE (Code as Path Editor), which generates and updates path plans for an agent based on the environment and language communication from other agents. CaPE leverages a vision-language model (VLM) to synthesize a path editing program verified by a model-based planner, grounding communication to path plan updates in a safe and interpretable way. We evaluate our approach in diverse simulated and real-world scenarios, including multi-robot and human-robot cooperation in autonomous driving, household, and joint carrying tasks. Experimental results demonstrate that CaPE can be integrated into different robotic systems as a plug-and-play module, greatly enhancing a robot's ability to align its plan to language communication from other robots or humans. We also show that the combination of the VLM-based path editing program synthesis and model-based planning safety enables robots to achieve open-ended cooperation while maintaining safety and interpretability.
Authors:Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz
Abstract:
Interactive systems such as chatbots and games are increasingly used to persuade and educate on sustainability-related topics, yet it remains unclear how different delivery formats shape learning and persuasive outcomes when content is held constant. Grounding on identical arguments and factual content across conditions, we present a controlled user study comparing three modes of information delivery: static essays, conversational chatbots, and narrative text-based games. Across subjective measures, the chatbot condition consistently outperformed the other modes and increased perceived importance of the topic. However, perceived learning did not reliably align with objective outcomes: participants in the text-based game condition reported learning less than those reading essays, yet achieved higher scores on a delayed (24-hour) knowledge quiz. Additional exploratory analyses further suggest that common engagement proxies, such as verbosity and interaction length, are more closely related to subjective experience than to actual learning. These findings highlight a dissociation between how persuasive experiences feel and what participants retain, and point to important design trade-offs between interactivity, realism, and learning in persuasive systems and serious games.
Authors:Nusrat Jahan Lia, Shubhashis Roy Dipta
Abstract:
The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that compressed model (mDistilBERT) exhibits 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including "Asymmetric Empathy" where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a "Modern Bias" in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. We recommend that alignment benchmarks incorporate "Affective Stability" metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.
Authors:Jindu Wang, Runze Cai, Shuchang Xu, Tianrui Hu, Huamin Qu, Shengdong Zhao, Ling-Ping Yuan
Abstract:
Young adults often take breaks from screen-intensive work by consuming digital content on mobile phones, which undermines rest through visual fatigue and inactivity. We introduce a design framework that embeds light break activities into media content on AR smart glasses, balancing engagement and recovery. The framework employs three strategies: (1) seamlessly guiding users by embedding activity cues aligned with media elements; (2) transitioning to audio-centric formats to reduce visual load while sustaining immersion; and (3) structuring sessions with "rise-peak-closure" pacing for smooth transitions. In a within-subjects study (N = 16) comparing passive viewing, reminder-based breaks, and non-narrative activities, InteractiveBreak instantiated from our framework seamlessly guided activities, sustained engagement, and enhanced break quality. These findings demonstrate wearable AR's potential to support restorative relaxation by transforming breaks into engaging and meaningful experiences.
Authors:Daniel J. Noh, Deborah A. Fields, Yasmin B. Kafai, Danaé Metaxa
Abstract:
The recent proliferation of artificial intelligence and machine learning (AI/ML) systems highlights the need for all people to develop effective competencies to interact with and examine AI/ML systems. We study shifts in five experienced high school CS teachers' understanding of AI/ML systems after one year of participatory design, where they co-developed lessons on AI auditing, a systematic method to query AI/ML systems. Drawing on individual and group interviews, we found that teachers' perspectives became more situated, grounding their understanding in everyday contexts; more critical, reflecting growing awareness of harms; and more agentic, highlighting possibilities for action. Further, across all three perspectives, teachers consistently framed algorithmic justice through their role as educators, situating their concerns within their school communities. In the discussion, we consider the ways teachers' perspectives shifted, how AI auditing can shape these shifts, and the implications of these findings on AI literacy for both teachers and students.
Authors:Albin Zeqiri, Michael Rietzler, Enrico Rukzio
Abstract:
Eco-friendly service options (EFSOs) aim to reduce personal carbon emissions, yet their eco-friendly framing may permit increased consumption, weakening their intended impact. Such rebound effects remain underexamined in HCI, including how common eco-feedback approaches shape them. We investigate this in an online within-subjects experiment (N=75) in a ride-hailing context. Participants completed 10 trials for five conditions (No EFSO, EFSO - Minimal, EFSO - CO2 Equivalency, EFSO - Gamified, EFSO - Social), yielding 50 choices between walking and ride-hailing for trips ranging from 0.5mi - 2.0mi (0.80km - 3.22km). We measured how different EFSO variants affected ride-hailing uptake relative to a No EFSO baseline. EFSOs lacking explicit eco-feedback metrics increased ride-hailing uptake, and qualitative responses indicate that EFSOs can make convenience-driven choices more permissible. We conclude with implications for designing EFSOs that begin to take rebound effects into account.
Authors:Thorsten Klößner, João Belo, Zekun Wu, Jörg Hoffmann, Anna Maria Feit
Abstract:
Interfaces for human oversight must effectively support users' situation awareness under time-critical conditions. We explore reinforcement learning (RL)-based UI adaptation to personalize alerting strategies that balance the benefits of highlighting critical events against the cognitive costs of interruptions. To enable learning without real-world deployment, we integrate models of users' gaze behavior to simulate attentional dynamics during monitoring. Using a delivery-drone oversight scenario, we present initial results suggesting that RL-based highlighting can outperform static, rule-based approaches and discuss challenges of intelligent oversight support.
Authors:Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Tiange Luo, Venkatesh Potluri, Anhong Guo
Abstract:
People who are blind or have low vision regularly use their hands to interact with the physical world to gain access to objects' shape, size, weight, and texture. However, many rich visual features remain inaccessible through touch alone, making it difficult to distinguish similar objects, interpret visual affordances, and form a complete understanding of objects. In this work, we present TouchScribe, a system that augments hand-object interactions with automated live visual descriptions. We trained a custom egocentric hand interaction model to recognize both common gestures (e.g., grab to inspect, hold side-by-side to compare) and unique ones by blind people (e.g., point to explore color, or swipe to read available texts). Furthermore, TouchScribe provides real-time and adaptive feedback based on hand movement, from hand interaction states, to object labels, and to visual details. Our user study and technical evaluations demonstrate that TouchScribe can provide rich and useful descriptions to support object understanding. Finally, we discuss the implications of making live visual descriptions responsive to users' physical reach.
Authors:Bahare Riahi, Veronica Catete
Abstract:
This study investigates students' perceptions of Artificial Intelligence (AI) grading systems in an undergraduate computer science course (n = 27), focusing on a block-based programming final project. Guided by the ethical principles framework articulated by Jobin (2019), our study examines fairness, trust, consistency, and transparency in AI grading by comparing AI-generated feedback with original human-graded feedback. Findings reveal concerns about AI's lack of contextual understanding and personalization. We recommend that equitable and trustworthy AI systems reflect human judgment, flexibility, and empathy, serving as supplementary tools under human oversight. This work contributes to ethics-centered assessment practices by amplifying student voices and offering design principles for humanizing AI in designed learning environments.
Authors:Zheyuan Zhang, Dorian Peters, Lan Xiao, Jingjing Sun, Laura Moradbakhti, Andrew Hall, Rafael A. Calvo
Abstract:
Healthcare professionals (HCPs) face increasing occupational stress and burnout. Supporting HCPs need for relatedness is fundamental to their psychological wellbeing and resilience. However, how technologies could support HCPs relatedness in the workplace remains less explored. This study incorporated semi-structured interviews (n = 15) and co-design workshops (n = 21) with HCPs working in the UK National Health Service (NHS), to explore their current practices and preferences for workplace relatedness support, and how technology could be utilized to benefit relatedness. Qualitative analysis yielded a four-layer model of HCPs relatedness need, which includes Informal Interactions, Camaraderie and Bond, Community and Organizational Care, and Shared Identity. Workshops generated eight design concepts (e.g., Playful Encounter, Collocated Action, and Memories and Stories) that operationalize the four relatedness need layers. We conclude by highlighting the theoretical relevance, practical design implications, and the necessity to strengthen relatedness support for HCPs in the era of digitalization and artificial intelligence.
Authors:Yunlong Lyu, Yixuan Tang, Peng Chen, Tian Dong, Xinyu Wang, Zhiqiang Dong, Hao Chen
Abstract:
Modern AI-integrated IDEs are shifting from passive code completion to proactive Next Edit Suggestions (NES). Unlike traditional autocompletion, NES is designed to construct a richer context from both recent user interactions and the broader codebase to suggest multi-line, cross-line, or even cross-file modifications. This evolution significantly streamlines the programming workflow into a tab-by-tab interaction and enhances developer productivity. Consequently, NES introduces a more complex context retrieval mechanism and sophisticated interaction patterns. However, existing studies focus almost exclusively on the security implications of standalone LLM-based code generation, ignoring the potential attack vectors posed by NES in modern AI-integrated IDEs. The underlying mechanisms of NES remain under-explored, and their security implications are not yet fully understood. In this paper, we conduct the first systematic security study of NES systems. First, we perform an in-depth dissection of the NES mechanisms to understand the newly introduced threat vectors. It is found that NES retrieves a significantly expanded context, including inputs from imperceptible user actions and global codebase retrieval, which increases the attack surfaces. Second, we conduct a comprehensive in-lab study to evaluate the security implications of NES. The evaluation results reveal that NES is susceptible to context poisoning and is sensitive to transactional edits and human-IDE interactions. Third, we perform a large-scale online survey involving over 200 professional developers to assess the perceptions of NES security risks in real-world development workflows. The survey results indicate a general lack of awareness regarding the potential security pitfalls associated with NES, highlighting the need for increased education and improved security countermeasures in AI-integrated IDEs.
Authors:Adam Wróbel, Siddhartha Gairola, Jacek Tabor, Bernt Schiele, Bartosz Zieliński, Dawid Rymarczyk
Abstract:
Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE \textit{(\underline{D}istribution-aware \underline{A}ttribution via \underline{V}iT Gradient D\underline{E}composition)}, a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input--output mapping. It separates these from architecture-induced artifacts and other sources of instability.
Authors:Kashif Imteyaz, Michael Muller, Claudia Flores-Saviaga, Saiph Savage
Abstract:
Most generative AI tools prioritize individual productivity and personalization, with limited support for collaboration. Designed for traditional workplaces, these tools do not fit freelancers' short-term teams or lack of shared institutional support, which can worsen their isolation and overlook freelancing platform dynamics. This mismatch means that, instead of empowering freelancers, current generative AI tools could reinforce existing precarity and make freelancer collaboration harder. To investigate how to design generative AI tools to support freelancer collaboration, we conducted co-design sessions with 27 freelancers. A key concern that emerged was the risk of AI systems compromising their creative agency and work identities when collaborating, especially when AI tools could reproduce content without attribution, threatening the authenticity and distinctiveness of their collaborative work. Freelancers proposed "auxiliary AI" systems, human-guided tools that support their creative agencies and identities, allowing for flexible freelancer-led collaborations that promote "productive friction". Drawing on Marcuse's concept of technological rationality, we argue that freelancers are resisting one-dimensional, efficiency-driven AI, and instead envisioning technologies that preserve their collective creative agencies. We conclude with design recommendations for collaborative generative AI tools for freelancers.
Authors:Yewon Kim, Stephen Brade, Alexander Wang, David Zhou, Haven Kim, Bill Wang, Sung-Ju Lee, Hugo F Flores Garcia, Cheng-Zhi Anna Huang, Chris Donahue
Abstract:
Live music provides a uniquely rich setting for studying creativity and interaction due to its spontaneous nature. The pursuit of live music agents--intelligent systems supporting real-time music performance and interaction--has captivated researchers across HCI, AI, and computer music for decades, and recent advancements in AI suggest unprecedented opportunities to evolve their design. However, the interdisciplinary nature of music has led to fragmented development across research communities, hindering effective communication and collaborative progress. In this work, we bring together perspectives from these diverse fields to map the current landscape of live music agents. Based on our analysis of 184 systems across both academic literature and video, we develop a comprehensive design space that categorizes dimensions spanning usage contexts, interactions, technologies, and ecosystems. By highlighting trends and gaps in live music agents, our design space offers researchers, designers, and musicians a structured lens to understand existing systems and shape future directions in real-time human-AI music co-creation. We release our annotated systems as a living artifact at https://live-music-agents.github.io.
Authors:Eryue Xu, Tianshi Li
Abstract:
Managing one's digital footprint is overwhelming, as it spans multiple platforms and involves countless context-dependent decisions. Recent advances in agentic AI offer ways forward by enabling holistic, contextual privacy-enhancing solutions. Building on this potential, we adopted a ''human-as-the-unit'' perspective and investigated users' cross-context privacy challenges through 12 semi-structured interviews. Results reveal that people rely on ad hoc manual strategies while lacking comprehensive privacy controls, highlighting nine privacy-management challenges across applications, temporal contexts, and relationships. To explore solutions, we generated nine AI agent concepts and evaluated them via a speed-dating survey with 116 US participants. The three highest-ranked concepts were all post-sharing management tools with half or full agent autonomy, with users expressing greater trust in AI accuracy than in their own efforts. Our findings highlight a promising design space where users see AI agents bridging the fragments in privacy management, particularly through automated, comprehensive post-sharing remediation of users' digital footprints.
Authors:Matthew P. Lad, Louisa Conwill, Megan Levis Scheirer
Abstract:
With the rapid growth of Large Language Models (LLMs), criticism of their societal impact has also grown. Work in Responsible AI (RAI) has focused on the development of AI systems aimed at reducing harm. Responding to RAI's criticisms and the need to bring the wisdom traditions into HCI, we apply Conwill et al.'s Virtue-Guided Technology Design method to LLMs. We cataloged new ethical design patterns for LLMs and evaluated them through interviews with technologists. Participants valued that the patterns provided more accuracy and robustness, better safety, new research opportunities, increased access and control, and reduced waste. Their concerns were that the patterns could be vulnerable to jailbreaking, were generalizing models too widely, and had potential implementation issues. Overall, participants reacted positively while also acknowledging the tradeoffs involved in ethical LLM design.
Authors:Alessandra Maciel Paz Milani, Norman Anderson, Margaret-Anne Storey
Abstract:
Cybersecurity increasingly relies on threat hunters to proactively identify adversarial activity, yet the cognitive work underlying threat hunting remains underexplored or insufficiently supported by existing tools. Building on prior studies that examined how threat hunters construct and share mental models during investigations, we derived a set of design propositions to support their cognitive and collaborative work. In this paper, we present the Threat Hunter Board, a prototype tool that operationalizes these design propositions by enabling threat hunters to externalize reasoning, organize investigative leads, and maintain continuity across sessions. Using a design science paradigm, we describe the solution design rationale and artifact development. In addition, we propose six design heuristics that form a solution-evaluation framework for assessing cognitive support in threat hunting tools. An initial evaluation using a cognitive walkthrough provides early evidence of feasibility, while future work will focus on user-based validation with professional threat hunters.
Authors:Juan David Salazar Rodriguez, Sam Conrad Joyce, Nachamma Sockalingam, Khoo Eng Tat, Julfendi
Abstract:
This study investigates the integration of Large Language Models (LLMs) into the feedback mechanisms of the architectural design studio, shifting the focus from generative production to reflective pedagogy. Employing a mixed-methods approach with surveys and semi structured interviews with 22 architecture students at the Singapore University of Technology and De-sign, the research analyzes student perceptions across three distinct feed-back domains: self-reflection, peer critique, and professor-led reviews. The findings reveal that students engage with LLMs not as authoritative in-structors, but as collaborative "cognitive mirrors" that scaffold critical thinking. In self-directed learning, LLMs help structure thoughts and over-come the "blank page" problem, though they are limited by a lack of contex-tual nuance. In peer critiques, the technology serves as a neutral mediator, mitigating social anxiety and the "fear of offending". Furthermore, in high-stakes professor-led juries, students utilize LLMs primarily as post-critique synthesis engines to manage cognitive overload and translate ab-stract academic discourse into actionable design iterations.
Authors:Syed T. Mubarrat, Byung-Cheol Min, Tianyu Shao, E. Cho Smith, Bedrich Benes, Alejandra J. Magana, Christos Mousas, Dominic Kao
Abstract:
Robotics education fosters computational thinking, creativity, and problem-solving, but remains challenging due to technical complexity. Game-based learning (GBL) and gamification offer engagement benefits, yet their comparative impact remains unclear. We present the first PRISMA-aligned systematic review and comparative synthesis of GBL and gamification in robotics education, analyzing 95 studies from 12,485 records across four databases (2014-2025). We coded each study's approach, learning context, skill level, modality, pedagogy, and outcomes (k = .918). Three patterns emerged: (1) approach-context-pedagogy coupling (GBL more prevalent in informal settings, while gamification dominated formal classrooms [p < .001] and favored project-based learning [p = .009]); (2) emphasis on introductory programming and modular kits, with limited adoption of advanced software (~17%), advanced hardware (~5%), or immersive technologies (~22%); and (3) short study horizons, relying on self-report. We propose eight research directions and a design space outlining best practices and pitfalls, offering actionable guidance for robotics education.
Authors:Upol Ehsan, Samir Passi, Koustuv Saha, Todd McNutt, Mark O. Riedl, Sara Alcorn
Abstract:
In the future of work discourse, AI is touted as the ultimate productivity amplifier. Yet, beneath the efficiency gains lie subtle erosions of human expertise and agency. This paper shifts focus from the future of work to the future of workers by navigating the AI-as-Amplifier Paradox: AI's dual role as enhancer and eroder, simultaneously strengthening performance while eroding underlying expertise. We present a year-long study on the longitudinal use of AI in a high-stakes workplace among cancer specialists. Initial operational gains hid ``intuition rust'': the gradual dulling of expert judgment. These asymptomatic effects evolved into chronic harms, such as skill atrophy and identity commoditization. Building on these findings, we offer a framework for dignified Human-AI interaction co-constructed with professional knowledge workers facing AI-induced skill erosion without traditional labor protections. The framework operationalizes sociotechnical immunity through dual-purpose mechanisms that serve institutional quality goals while building worker power to detect, contain, and recover from skill erosion, and preserve human identity. Evaluated across healthcare and software engineering, our work takes a foundational step toward dignified human-AI interaction futures by balancing productivity with the preservation of human expertise.
Authors:Joyce Zhou, Weijie Zhou, Doug Turnbull, Thorsten Joachims
Abstract:
Natural-language user profiles have recently attracted attention not only for improved interpretability, but also for their potential to make recommender systems more steerable. By enabling direct editing, natural-language profiles allow users to explicitly articulate preferences that may be difficult to infer from past behavior. However, it remains unclear whether current natural-language-based recommendation methods can follow such steering commands. While existing steerability evaluations have shown some success for well-recognized item attributes (e.g., movie genres), we argue that these benchmarks fail to capture the richer forms of user control that motivate steerable recommendations. To address this gap, we introduce SteerEval, an evaluation framework designed to measure more nuanced and diverse forms of steerability by using interventions that range from genres to content-warning for movies. We assess the steerability of a family of pretrained natural-language recommenders, examine the potential and limitations of steering on relatively niche topics, and compare how different profile and recommendation interventions impact steering effectiveness. Finally, we offer practical design suggestions informed by our findings and discuss future steps in steerable recommender design.
Authors:Josh Susak, Yifu Liu, Pascal Jansen, Mark Colley
Abstract:
The next step for In-vehicle Conversational Assistants (IVCAs) will be their capability to initiate and automate proactive system interactions throughout journeys. However, diverse drivers make it challenging to design voice interventions tailored towards individual on-road expectations. This paper evaluates the effectiveness of Human-in-the-Loop (HITL) Multi-Objective Bayesian Optimization (MOBO) in design by implementing ProVoice: a Virtual Reality (VR) driving simulator integrating MOBO to investigate the effects of IVCA design variants on perceived mental demand, predictability, and usefulness. By reporting the Pareto Front from a within-subjects VR study (N=19), this paper proposes optimal design trade-offs. Follow-up analysis demonstrates MOBO's success in discovering effective intervention strategies, with reduced participant mental demand, alongside enhanced predictability and usefulness while engaging with the proactive IVCA. Implications for computational techniques in future research on proactive intervention strategies are discussed. ProVoice can extend to include alternative design parameters and driving scenarios, encouraging intervention design on a broad scale.
Authors:Hyeok Kim, Sehi L'Yi, Nils Gehlenborg, Jeffrey Heer
Abstract:
Formal representations of the visualization design space, such as knowledge bases and graphs, consolidate design practices into a shared resource and enable automated reasoning and interpretable design recommendations. However, prior approaches typically depend on fixed, manually authored rules, making it difficult to build novel representations or extend them for different visualization domains. Instead, we propose data-driven methods that automatically synthesize visualization design knowledge bases. Specifically, our methods (1) extract candidate design features from a visualization corpus, (2) select features forward and backward, and (3) render the final knowledge base. In our benchmark evaluation compared to Draco 2, our synthesized knowledge base offers general and interpretable design features and improves the accuracy of predicting effective designs by 1-15% in varied training and test sets. When we apply our approach to genomics visualization, the synthesized knowledge base includes sensible features with accuracy up to 97%, demonstrating the applicability of our approach to other visualization domains.
Authors:Nan Chen, Jing Lu, Zilong Wang, Luna K. Qiu, Siming Chen, Yuqing Yang
Abstract:
Equal access to digital technologies is critical for education, employment, and social participation. However, mainstream interfaces are visually oriented, creating steep learning curves and frequent obstacles for screen reader users, and limiting their independence and opportunities. Existing support is inadequate -- tutorials mainly target sighted users, while human assistance lacks real-time availability. We introduce AskEase, an on-demand AI assistant that provides step-by-step, screen reader user-friendly guidance for computer use. AskEase manages multiple sources of context to infer user intent and deliver precise, situation-specific guidance. Its seamless interaction design minimizes disruption and reduces the effort of seeking help. We demonstrated its effectiveness through representative usage scenarios and robustness tests. In a within-subjects study with 12 screen reader users, AskEase significantly improved task success while reducing perceived workload, including physical demand, effort, and frustration. These results demonstrate the potential of LLM-powered assistants to promote accessible computing and expand opportunities for users with visual impairments.
Authors:Shuhao Zhang, Jiahe Dong, Haoran Wang, Chang Jiang, Quan Li
Abstract:
Surgical emergencies often trigger acute cognitive overload in novice physicians, impairing their decision-making under pressure. Although Virtual Reality-based Stress Inoculation Training (VR-SIT) shows promise, current systems fall short in delivering real-time, effective support during moments of peak stress. To bridge this gap, we first conducted a formative study (N=12) to uncover the core needs of novice physicians for immediate assistance under acute stress and identified three key intervention strategies: self-regulation aids, procedure guidance, and emotional/sensory support. Building on these insights, we designed and implemented a novel VR-SIT system that incorporates a just-in-time adaptive intervention framework, dynamically tailoring support to learners' cognitive and emotional states. We then validated these strategies in a user study (N=26). Our findings provide empirical evidence and design implications for next-generation VR medical training systems, supporting physicians in sustaining cognitive clarity and accurate decision-making in critical situations.
Authors:Masahiro Yoshino, Haruki Yokota, Junya Hara, Yuichi Tanaka, Hiroshi Higashi
Abstract:
Auditory attention decoding (AAD) identifies the attended speech stream in multi-speaker environments by decoding brain signals such as electroencephalography (EEG). This technology is essential for realizing smart hearing aids that address the cocktail party problem and for facilitating objective audiometry systems. Existing AAD research mainly utilizes dichotic environments where different speech signals are presented to the left and right ears, enabling models to classify directional attention rather than speech content. However, this spatial reliance limits applicability to real-world scenarios, such as the "cocktail party" situation, where speakers overlap or move dynamically. To address this challenge, we propose an AAD framework for diotic environments where identical speech mixtures are presented to both ears, eliminating spatial cues. Our approach maps EEG and speech signals into a shared latent space using independent encoders. We extract speech features using wav2vec 2.0 and encode them with a 2-layer 1D convolutional neural network (CNN), while employing the BrainNetwork architecture for EEG encoding. The model identifies the attended speech by calculating the cosine similarity between EEG and speech representations. We evaluate our method on a diotic EEG dataset and achieve 72.70% accuracy, which is 22.58% higher than the state-of-the-art direction-based AAD method.
Authors:Erina Seh-Young Moon, Matthew Tamura, Angelina Zhai, Nuzaira Habib, Behnaz Shirazi, Altaf Kassam, Devansh Saxena, Shion Guha
Abstract:
Governments are the primary providers of essential public services and are responsible for delivering them effectively. In high-stakes decision-making domains such as child welfare (CW), agencies must protect children without unnecessarily prolonging a family's engagement with the system. With growing optimism around AI, governments are pushing for its integration but concerns regarding feasibility and harms remain. Through collaborations with a large Canadian CW agency, we examined how LocalLLM and BERTopic models can track CW case progress. We demonstrate how the tools can potentially assist workers in opportunistically addressing gaps in their work by signaling case progress/deviations. And yet, we also show how they fail to detect case trajectories that require discretionary judgments grounded in social work training, areas where practitioners would actually want support to pre-emptively address substantive case concerns. We also provide a roadmap of future participatory directions to co-design language tools for/with the public sector.
Authors:Yuheng Shao, Yuansong Xu, Yifan Jin, Shuhao Zhang, Wenxin Gu, Quan Li
Abstract:
Effective collaboration between designers and users is important for fashion design, which can increase the user acceptance of fashion products and thereby create value. However, it remains an enduring challenge, as traditional designer-centric approaches restrict meaningful user participation, while user-driven methods demand design proficiency, often marginalizing professional creative judgment. Current co-design practices, including workshops and AI-assisted frameworks, struggle with low user engagement, inefficient preference collection, and difficulties in balancing user feedback with design considerations. To address these challenges, we conducted a formative study with designers and users experienced in co-design (N=7), identifying critical challenges for current collaboration between designers and users in the co-design process, and their requirements. Informed by these insights, we introduce DesignBridge, a multi-platform AI-enhanced interactive system that bridges designer expertise and user preferences through three stages: (1) Initial Design Framing, where designers define initial concepts. (2) Preference Expression Collection, where users intuitively articulate preferences via interactive tools. (3) Preference-Integrated Design, where designers use AI-assisted analytics to integrate feedback into cohesive designs. A user study demonstrates that DesignBridge significantly enhances user preference collection and analysis, enabling designers to integrate diverse preferences with professional expertise.
Authors:Wenge Xu, Foroogh Hajiseyedjavadi, Debargha Dey, Tram Thi Minh Tran, Mark Colley
Abstract:
External Human-Machine Interfaces (eHMIs) have been proposed to enhance communication between automated vehicles (AVs) and pedestrians, with growing interest in multi-modal designs such as audio-visual eHMIs. Just as poor lighting can impair visual cues, a loud background noise may mask the auditory stimuli. However, its effects within these systems have not been examined, and little is known about how pedestrians -- particularly Deaf and Hard-of-Hearing (DHH) people -- perceive different types of auditory stimuli. We conducted a virtual reality study (Hearing N=25, DHH N=11) to examine the effects of background noise (quiet and loud) on auditory stimuli (baseline, bell, speech) within an audio-visual eHMI. Results revealed that: (1) Crossing experiences of DHH pedestrians significantly differ from Hearing pedestrians. (2) Loud background noise adversely affects pedestrians' crossing experiences. (3) Providing an additional auditory eHMI (bell/speech) improves crossing experiences. We outlined four practical implications for future eHMI design and research.
Authors:Geoff Keeling, Winnie Street
Abstract:
Large Language Models (LLMs) can simulate person-like things which at least appear to have stable behavioural and psychological dispositions. Call these things characters. Are characters minded and psychologically continuous entities with mental states like beliefs, desires and intentions? Illusionists about characters say No. On this view, characters are merely anthropomorphic projections in the mind of the user and so lack mental states. Jonathan Birch (2025) defends this view. He says that the distributed nature of LLM processing, in which several LLMs may be implicated in the simulation of a character in a single conversation, precludes the existence of a persistent minded entity that is identifiable with the character. Against illusionism, we argue for a realist position on which characters exist as minded and psychologically continuous entities. Our central point is that Birch's argument for illusionism rests on a category error: characters are not internal to the LLMs that simulate them, but rather are co-simulated by LLMs and users, emerging in a shared conversational workspace through a process of mutual theory of mind modelling. We argue that characters, and their minds, exist as 'real patterns' on grounds that attributing mental states to characters is essential for making efficient and accurate predictions about the conversational dynamics (c.f. Dennett, 1991). Furthermore, because the character exists within the conversational workspace rather than within the LLM, psychological continuity is preserved even when the underlying computational substrate is distributed across multiple LLM instances.
Authors:Haodong Zhang, Jiapeng Zhu, Yitong Chen, Hongqi Li
Abstract:
Electroencephalography (EEG) decoding requires models that can effectively extract and integrate complex temporal, spectral, and spatial features from multichannel signals. To address this challenge, we propose a lightweight and generalizable decoding framework named Hierarchical Convolutional Fusion Transformer (HCFT), which combines dual-branch convolutional encoders and hierarchical Transformer blocks for multi-scale EEG representation learning. Specifically, the model first captures local temporal and spatiotemporal dynamics through time-domain and time-space convolutional branches, and then aligns these features via a cross-attention mechanism that enables interaction between branches at each stage. Subsequently, a hierarchical Transformer fusion structure is employed to encode global dependencies across all feature stages, while a customized Dynamic Tanh normalization module is introduced to replace traditional Layer Normalization in order to enhance training stability and reduce redundancy. Extensive experiments are conducted on two representative benchmark datasets, BCI Competition IV-2b and CHB-MIT, covering both event-related cross-subject classification and continuous seizure prediction tasks. Results show that HCFT achieves 80.83% average accuracy and a Cohen's kappa of 0.6165 on BCI IV-2b, as well as 99.10% sensitivity, 0.0236 false positives per hour, and 98.82% specificity on CHB-MIT, consistently outperforming over ten state-of-the-art baseline methods. Ablation studies confirm that each core component of the proposed framework contributes significantly to the overall decoding performance, demonstrating HCFT's effectiveness in capturing EEG dynamics and its potential for real-world BCI applications.
Authors:Kashif Imteyaz, Qiushi, Liang, Yakov Bart, Maitraye Das, Saiph Savage
Abstract:
Blind and low-vision (BLV) individuals face high unemployment rates. The job search is becoming harder as more employers use AI-driven systems to screen resumes before a human ever sees them. Such AI systems could inadvertently further disadvantage BLV job seekers, introducing additional barriers to an already difficult process. We lack understanding of BLV job seekers' experiences in today's AI-driven hiring ecosystem. Without such understanding, we risk designing technologies that create new systemic barriers for BLV job seekers rather than providing support. To this end, we conducted interviews with 17 BLV job seekers and analyzed their experiences with AI-powered hiring systems. We found that AI hiring systems misrepresented their professional identities and created dehumanizing interactions. To level the playing field, BLV job seekers used strategic counter-navigation: they deployed their own tools to bypass algorithmic screening and built peer networks to share AI literacy. They also practiced 'strategic refusal', choosing to avoid certain AI systems to regain their agency. Unlike prior work that frames job search as an individualistic activity, or one focused on being compliant with employer needs, we use the interdependence framework to argue that for BLV people, job search is an interdependent process. We offer design recommendations for AI-mediated tools that center disability perspectives and support interdependencies in job search.
Authors:Lukas Schilcher, Peter Waldert, Benedikt Kantz, Tobias Schreck
Abstract:
Exploring tabular datasets to understand how different feature pairs partition data into meaningful cohorts is crucial in domains such as biomarker discovery, yet comparing clusters across multiple feature pair projections is challenging. We introduce Clusters in Focus, an interactive visual analytics dashboard designed to address this gap. Clusters in Focus employs a three-panel coordinated view: a Data Panel offers multiple perspectives (tabular, heatmap, condensed with histograms / SHAP values) for initial data exploration; a Selection Panel displays the 2D clustering (K-Means/DBSCAN) for a user-selected feature pair; and a novel Cluster Similarity Panel featuring two switchable views for comparing clusters. A ranked list enables the identification of top-matching feature pairs, while an interactive similarity matrix with reordering capabilities allows for the discovery of global structural patterns and groups of related features. This dual-view design supports both focused querying and broad visual exploration. A use case on a Parkinson's disease speech dataset demonstrates the tool's effectiveness in revealing relationships between different feature pairs characterizing the same patient subgroup.
Authors:Yao Lyu, Jessica Shen, Alina Faisal, John M. Carroll
Abstract:
Social media platforms are important venues for identity expression, and the Human-Computer Interaction community has been paying growing attention to how marginalized groups express their identities on these platforms. Joining the emerging literature on intersectional experiences, we study blind TikTokers ("BlindTokers") who are also women and/or LGBTQ+. Using interview data from \rev{41} participants, we identify their intersectional experiences as mediated by TikTok's socio-technical affordances. We argue that BlindTokers' intersectional marginalization is infrastructural: TikTok's classification and moderation features interact with social norms in ways that push them aside and distort how they are treated on the platform. We use this infrastructure perspective to understand what these experiences are, how they were formed, and how they become harmful. We further recognize participants' infrastructuring work to address these problems. This study guides future social media design with accessible creator tools, inclusive identity options, and context-aware moderation developed in partnership with communities.
Authors:Yao Lyu, Tawanna Dillahunt, Jiaying Liu, John M. Carroll
Abstract:
One's profession is an essential part of modern life. Traditionally, professional development has been criticized for excluding people with disabilities. People with visual impairments, for example, face disproportionately low employment rates, highlighting persistent gaps in professional opportunities. Recently, there has been growing research on social media platforms as spaces for more equitable career development approaches. In this paper, we present an interview study on the professional development experiences of 60 people with visual impairments on TikTok (also known as "BlindTokers"). We report BlindTokers' goals, strategies, and challenges, supported by detailed examples and in-depth analysis. Based on the findings, we identify that BlindTokers' practices reveal an alternative professional development approach that is more flexible, inclusive, personalized, and diversified than traditional models. Our study also extends professional development research by foregrounding emerging digital skills and proposing design implications to foster more equitable and inclusive professional opportunities.
Authors:Choro Ulan uulu, Mikhail Kulyabin, Katharina M Zeiner, Jan Joosten, Nuno Miguel Martins Pacheco, Filippos Petridis, Rebecca Johnson, Jan Bosch, Helena Holmström Olsson
Abstract:
Understanding complex parameter dependencies is critical for effective configuration and maintenance of software systems across diverse domains - from Computer-Aided Engineering (CAE) to cloud infrastructure and database management. However, legacy tabular interfaces create a major bottleneck: engineers cannot easily comprehend how parameters relate across the system, leading to inefficient workflows, costly configuration errors, and reduced system trust - a fundamental program comprehension challenge in configuration-intensive software. This research evaluates whether interactive Sankey diagrams can improve comprehension of parameter dependencies compared to traditional spreadsheet interfaces. We employed a heuristic evaluation using the PURE method with three expert evaluators (UX design, simulation, and software development specialists) to compare a Sankey-based prototype to traditional tabular representations for core engineering tasks. Our key contribution demonstrates that flow-based parameter visualizations significantly reduce cognitive load (51% lower PURE scores) and interaction complexity (56% fewer steps) compared to traditional tables, while making parameter dependencies immediately visible rather than requiring mental reconstruction. By explicitly visualizing parameter relationships, Sankey diagrams address a core software visualization challenge: helping users comprehend complex system configurations without requiring deep tool-specific knowledge. While demonstrated through CAE software, this research contributes to program comprehension and software visualization by showing that dependency-aware visualizations can significantly improve understanding of configuration-intensive systems. The findings have implications for any software domain where comprehending complex parameter relationships is essential for effective system use and maintenance.
Authors:Andrea Ferrario, Alessandro Facchini, Juan M. Durán
Abstract:
Human-AI complementarity is the claim that a human supported by an AI system can outperform either alone in a decision-making process. Since its introduction in the human-AI interaction literature, it has gained traction by generalizing the reliance paradigm and by offering a more practical alternative to the contested construct of 'trust in AI.' Yet complementarity faces key theoretical challenges: it lacks precise theoretical anchoring, it is formalized just as a post hoc indicator of relative predictive accuracy, it remains silent about other desiderata of human-AI interactions and it abstracts away from the magnitude-cost profile of its performance gain. As a result, complementarity is difficult to obtain in empirical settings. In this work, we leverage epistemology to address these challenges by reframing complementarity within the discourse on justificatory AI. Drawing on computational reliabilism, we argue that historical instances of complementarity function as evidence that a given human-AI interaction is a reliable epistemic process for a given predictive task. Together with other reliability indicators assessing the alignment of the human-AI team with the epistemic standards and socio-technical practices, complementarity contributes to the degree of reliability of human-AI teams when generating predictions. This supports the practical reasoning of those affected by these outputs -- patients, managers, regulators, and others. In summary, our approach suggests that the role and value of complementarity lies not in providing a relative measure of predictive accuracy, but in helping calibrate decision-making to the reliability of AI-supported processes that increasingly shape everyday life.
Authors:Andrea Ferrario, Rasita Vinay, Matteo Casserini, Alessandro Facchini
Abstract:
Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with the rise of large language model (LLM)-based conversational agents (CAs). Unlike earlier chatbots, LLM-based CAs routinely generate interactional and linguistic cues, such as first-person self-reference, epistemic and affective expressions that empirical work shows can increase engagement. On the other hand, anthropomorphisation raises ethical concerns, including deception, overreliance, and exploitative relationship framing, while some authors argue that anthropomorphic interaction may support autonomy, well-being, and inclusion. Despite increasing interest in the phenomenon, literature remains fragmented across domains and varies substantially in how it defines, operationalizes, and normatively evaluates anthropomorphisation. This scoping review maps ethically oriented work on anthropomorphising LLM-based CAs across five databases and three preprint repositories. We synthesize (1) conceptual foundations, (2) ethical challenges and opportunities, and (3) methodological approaches. We find convergence on attribution-based definitions but substantial divergence in operationalization, a predominantly risk-forward normative framing, and limited empirical work that links observed interaction effects to actionable governance guidance. We conclude with a research agenda and design/governance recommendations for ethically deploying anthropomorphic cues in LLM-based conversational agents.
Authors:Rui Liu, Liuqingqing Yang, Runsheng Zhang, Shixiao Wang
Abstract:
This study investigates human-computer interface generation based on diffusion models to overcome the limitations of traditional template-based design and fixed rule-driven methods. It first analyzes the key challenges of interface generation, including the diversity of interface elements, the complexity of layout logic, and the personalization of user needs. A generative framework centered on the diffusion-reverse diffusion process is then proposed, with conditional control introduced in the reverse diffusion stage to integrate user intent, contextual states, and task constraints, enabling unified modeling of visual presentation and interaction logic. In addition, regularization constraints and optimization objectives are combined to ensure the rationality and stability of the generated interfaces. Experiments are conducted on a public interface dataset with systematic evaluations, including comparative experiments, hyperparameter sensitivity tests, environmental sensitivity tests, and data sensitivity tests. Results show that the proposed method outperforms representative models in mean squared error, structural similarity, peak signal-to-noise ratio, and mean absolute error, while maintaining strong robustness under different parameter settings and environmental conditions. Overall, the diffusion model framework effectively improves the diversity, rationality, and intelligence of interface generation, providing a feasible solution for automated interface generation in complex interaction scenarios.
Authors:Huatao Xu, Zihe Liu, Zilin Zeng, Baichuan Li, Mo Li
Abstract:
We present AutoTour, a system that enhances user exploration by automatically generating fine-grained landmark annotations and descriptive narratives for photos captured by users. The key idea of AutoTour is to fuse visual features extracted from photos with nearby geospatial features queried from open matching databases. Unlike existing tour applications that rely on pre-defined content or proprietary datasets, AutoTour leverages open and extensible data sources to provide scalable and context-aware photo-based guidance. To achieve this, we design a training-free pipeline that first extracts and filters relevant geospatial features around the user's GPS location. It then detects major landmarks in user photos through VLM-based feature detection and projects them into the horizontal spatial plane. A geometric matching algorithm aligns photo features with corresponding geospatial entities based on their estimated distance and direction. The matched features are subsequently grounded and annotated directly on the original photo, accompanied by large language model-generated textual and audio descriptions to provide an informative, tour-like experience. We demonstrate that AutoTour can deliver rich, interpretable annotations for both iconic and lesser-known landmarks, enabling a new form of interactive, context-aware exploration that bridges visual perception and geospatial understanding.
Authors:Yueyang Wang, Mehmet Dogar, Gustav Markkula
Abstract:
Autonomous vehicles (AVs) are rapidly advancing and are expected to play a central role in future mobility. Ensuring their safe deployment requires reliable interaction with other road users, not least pedestrians. Direct testing on public roads is costly and unsafe for rare but critical interactions, making simulation a practical alternative. Within simulation-based testing, adversarial scenarios are widely used to probe safety limits, but many prioritise difficulty over realism, producing exaggerated behaviours which may result in AV controllers that are overly conservative. We propose an alternative method, instead using a cognitively inspired pedestrian model featuring both inter-individual and intra-individual variability to generate behaviourally plausible adversarial scenarios. We provide a proof of concept demonstration of this method's potential for AV control optimisation, in closed-loop testing and tuning of an AV controller. Our results show that replacing the rule-based CARLA pedestrian with the human-like model yields more realistic gap acceptance patterns and smoother vehicle decelerations. Unsafe interactions occur only for certain pedestrian individuals and conditions, underscoring the importance of human variability in AV testing. Adversarial scenarios generated by this model can be used to optimise AV control towards safer and more efficient behaviour. Overall, this work illustrates how incorporating human-like road user models into simulation-based adversarial testing can enhance the credibility of AV evaluation and provide a practical basis to behaviourally informed controller optimisation.
Authors:Sichao Song, Yuki Okafuji, Takuya Iwamoto, Jun Baba, Hiroshi Ishiguro
Abstract:
We report a mixed-methods field experiment of a conversational service robot deployed under everyday staffing discretion in a live bedding store. Over 12 days we alternated three conditions--Baseline (no robot), Robot-only, and Robot+Fixture--and video-annotated the service funnel from passersby to purchase. An explanatory sequential design then used six post-experiment staff interviews to interpret the quantitative patterns. Quantitatively, the robot increased stopping per passerby (highest with the fixture), yet clerk-led downstream steps per stopper--clerk approach, store entry, assisted experience, and purchase--decreased. Interviews explained this divergence: clerks avoided interrupting ongoing robot-customer talk, struggled with ambiguous timing amid conversational latency, and noted child-centered attraction that often satisfied curiosity at the doorway. The fixture amplified visibility but also anchored encounters at the threshold, creating a well-defined micro-space where needs could ``close'' without moving inside. We synthesize these strands into an integrative account from the initial show of interest on the part of a customer to their entering the store and derive actionable guidance. The results advance the understanding of interactions between customers, staff members, and the robot and offer practical recommendations for deploying service robots in high-touch retail.
Authors:Prince Ebenezer Adjei, Joshua Teye Tettey, Toufiq Musah, Audrey Agbeve, John Amuasi
Abstract:
CARE-link is an open-source, web-based clinical support platform designed to improve the management of gestational diabetes by linking clinicians and patients through an LLM-mediated workflow. The system aggregates patient-generated data outside the hospital, summarizes relevant clinical information, and delivers context-aware decision support to clinicians. For patients, CARE-link provides clear explanations of management plans and delivers timely lifestyle guidance through a WhatsApp interface. The integrated dual-facing design aims to promote continuous monitoring, support individualized care, and reduce the burden of in-clinic follow-ups. Built with a modular architecture, the platform can be adapted to other chronic conditions requiring longitudinal tracking and behavioral support. CARE-link has the potential to enhance clinical oversight, promote patient compliance, and strengthen continuity of care particularly in resource-constrained settings.
Authors:Hung Q. Vo, Huy Q. Vo, Son T. Ly, Zhihao Wan, Anh-Vu Nguyen, Hong Zhao, Jianting Sheng, Stephen T. C. Wong, Hien V. Nguyen
Abstract:
Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.
Authors:Arissa J. Sato, Callie Y. Kim, Nathan Thomas White, Abhinav Maneesh, Yuqing Wang, Hui-Ru Ho, Bilge Mutlu
Abstract:
Programming social robots is challenging for novice robot programmers due to required expertise in planning, interaction design, and programming. While large language models (LLMs) hold significant promise through code generation from natural-language descriptions, they can obscure critical elements of programming and supplant designer intent, eventually resulting in over-reliance instead of developing programming skills. In this paper, we explore how LLM-based social-robot-programming tools can support novice robot programmers through a Research through Design (RtD) process. We designed and prototyped Robo-Blocks, a block-based programming environment that leverages LLMs to offer novice robot programmers generative scaffolding through structured narratives that connect high-level ideas to executable robot behaviors. Through deployment with novices, we discovered emerging user personas and usage patterns for generative scaffolding and showed how this scaffolding shapes end-user design and programming strategies. We present design insights for the effective use of generative scaffolding and its integration into the practice of social-robot programming.
Authors:Lukas Ellinger, Alexander Fichtl, Miriam Anschütz, Georg Groh
Abstract:
Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.
Authors:Michael Yin, Angela Chiang, Samuel Rhys Cox, Robert Xiao
Abstract:
While human-AI collaboration systems have increasingly been built to increase efficiency or support creativity, little work has examined how the design of interactions shapes the social connection between human and artificial agent. We examine how the temporal and visual dimensions of collaboration shape the experience of a writing task. Specifically, we built three variants of an AI-assisted text editor along a spectrum of simulated humanlike interaction (synchronous and with a cursor) to machinelike interaction (asynchronous and without a cursor), and conducted a comparative user study (n=48). Our exploratory findings suggest that synchronous suggestions increased efficiency but led to contextual misalignment, while a visual cursor increased intent understanding but evoked feelings of surveillance. Taken together, humanlike design of artificial agents can create positive social expectations but also elicit social costs, especially without the alignment present in human-human collaboration. We extend our findings into design implications and ethical considerations when building human-AI collaboration systems.
Authors:Cynthia Zastudil, Srishty Muthusekaran, Rayhona Nasimova, Stephen MacNeil
Abstract:
Computing courses often feature active learning techniques that promote collaboration and social interaction between students. However, neurodivergent students' preferences and experiences with these techniques are not well understood. We conducted a survey of neurodivergent computing students (n=24), specifically autistic students or students with ADHD, and neurotypical computing students (n=20) to understand how the structure of collaborative active learning affects their comfort in computing courses. We also interviewed four computing students on the autism spectrum or with ADHD to gain more contextualized insights into their experiences and accessibility recommendations. Our survey surfaces how team dynamics and assignment structure can impact neurodivergent students' comfort in computing courses. Neurodivergent students expressed discomfort with assignments that lack structure or have ambiguous expectations. Neurodivergent students prefer smaller teams that work together frequently with explicitly defined roles. Our interviews identified ways that neurodivergent students cope with discomfort in collaborative active learning, including self-selecting roles and self-disclosure. While preliminary, our results highlight how instructors can design collaborative active learning to be more equitable and accessible for neurodivergent students.
Authors:Julia De Miguel Velázquez, Sanja Šćepanović, Andrés Gvirtz, Daniele Quercia
Abstract:
Recent human-computer interaction (HCI) research has revealed a widespread misalignment between how developers design workplace artificial intelligence (AI) systems, and what workers actually need from them. Yet, little research has examined the effects of this gap, or how it may cause harm. We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors. Using an Large Language Model (LLM)-as-an-expert approach, we extracted the main traits of the AI systems involved in those incidents using an established framework of twelve traits. We then compared them with the traits that 202 workers highly familiar with those tasks would have preferred. We found that as many as 83\% of workplace incidents stem from worker-AI misalignments. In most cases, workers wanted systems that are precise, insightful, or personal, but instead received systems that are basic, simple, or general. Over the years, fast AI caused a considerable number of incidents, yet these declined, and imaginative AI, with the mass introduction of generative AI, started to cause incidents. We also compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred. If the traits causing the incidents were the same as those designed by developers, then developers may be responsible for those incidents. We found that 74\% of task misalignments could be attributed to developers who tended to overfocus on efficiency and speed, especially for systems performing tasks in people-facing occupations such as those in the human resources sector. Our results call for design interventions that better align AI development with workers' needs, as without such corrections, workplace AI incidents are likely to persist, causing the invisible erosion of worker agency and organizational productivity.
Authors:Hugo Andersson, Niklas Elmqvist
Abstract:
Agentic AI has taken on the role of assistant, collaborator, and decision-support tool. We argue the next role on that list is more personal: you. These are digital twins of each individual -- twin agents -- representing their knowledge, perspective, and communicative style to colleagues when they are unavailable. Drawing on early design work in an ongoing project in which agents represent knowledge workers in a professional setting, we identify a trust calibration problem specific to this approach. When a human colleague doubts a twin agent's output, they face three failure modes (a schema gap, an epistemic gap, and a model artifact) with no reliable attribution path between them. Cognitive forcing functions and related frameworks address overreliance effectively in contexts where there is a clear boundary between the AI and the human decision-maker. However, twin agents dissolve that boundary, raising a class of trust calibration challenge these frameworks were not designed to handle. We introduce the concept, distinguish it from digital twins, and outline the research questions this new class of agent demands.
Authors:Hugo Andersson, Niklas Elmqvist
Abstract:
Human-AI collaboration research has largely positioned the human as a judge of AI output, centering effort on evaluating whether rec- ommendations are reliable enough to accept. This decision-support framing leaves little room for the human as creator. We argue that for creative work, this framing misdirects human effort toward eval- uating correctness rather than exploring and shaping the creative space. Drawing on Schön's theory of reflective practice, we propose an alternative: treating generative AI as an active creative medium. As a potter works with clay, humans Shape, Observe, Stir, and Se- lect (SOSS) their medium through ongoing conversation. Where generative AI actively tends toward convergence and resolution, the human role of disruption and curation becomes essential for sustaining creative quality. We present a creative writing probe, Loom, in which users orchestrate simulated narrative agents. We also introduce the SOSS framework for this mode of engagement, and discuss design implications.
Authors:Hugo Andersson, Niklas Elmqvist
Abstract:
The dominant paradigm for LLM interaction in AI co-writing uses disposable prompts that vanish after use. This may lead to imprecise results, cumbersome workflows, and diminished author agency and ownership. We propose LLM-based story archeology, where prompts serve as a hierarchical story instrument refined over time to extract the writer's intended story. Drawing on the fossil theory of story- telling, where stories exist as latent structures that writers excavate through their craft, this approach supports agency and ownership through high involvement and control. Writers work at the level of story beats rather than prose. They generate character actions in scenes to discover emergent possibilities, simulated by the LLM or directly nudged, then edit resulting beats to refine scenes iteratively. Prose is generated from beats based on style and genre, separating structure from style. We developed TombWriter, a web-based tool that visualizes stories as navigable cards -- characters, scenes, and beats -- through a five-stage narrative pipeline. We conducted a qual- itative study with five experienced writers who used the system over three days. Through semi-structured interviews, we found that writers framed AI as a generation engine rather than collabo- rator, claimed ownership while reporting voice loss, and valued the system for structural discovery rather than prose production. We contribute the story archeology approach, the TombWriter system, and qualitative findings on beat-level human-AI co-writing.
Authors:Tasweer Ahmad, Rafael Pina, Sandip Pradhan, Arindam Sikdar, Mindula Illeperuma, Khizer Saeed, Peter Lee, Varuna De Silva, Ardhendu Behera
Abstract:
At a time when drones are increasingly associated with hostile operations, we re-purpose them for humanitarian and life-saving applications. However, adapting search and rescue drones for battlefield triage remains extremely challenging; the technology must perform reliably to support frontline medics who are forced to operate under extreme uncertainty, restricted access, and significant personal risk. Due to growing vulnerabilities of casualty evacuation in conflicting zones, this paper presents ATRACT (A Trustworthy Robotic Autonomous system to support Casualty Triage), a novel human-in-the-loop decision support system to enable early battlefield triage during the critical post-trauma period. ATRACT integrates drone-captured video with wearable sensor input for multi-modal learning to support casualty-state assessment, thereby addressing the limitations of existing systems. Drone video captures fine-grained behavioural cues, such as pose, posture, while body-worn sensors provide complementary physiological signals, including heart rate, breathing rate, and movement. By combining two modalities, ATRACT provides evidence to support the early judgement of medics when direct access to the casualty is delayed, risky, or restricted. To mitigate the data realism gap pertaining to injured actions, a conditional variational autoencoder is devised for data augmentation. Experimental results on our drone captured dataset show that proposed pipeline achieves 85.7% accuracy for action classification; while our lightweight CNN visual encoder remains competitive with stronger pre-trained video backbones. Overall, the results support ATRACT as a practically meaningful step towards remote triage in contested environments, where multi-modal sensing, human oversight and trustworthy decision support can improve casualty prioritisation, and lessen the exposure of frontline medics.
Authors:Rui Tang, Yichi Zhang, Xi Chen, Chen Dong, Youwei Yang, Yumeng Shen, Qiangqiang Liu
Abstract:
Long-context and memory systems usually treat personalization as a recall problem. In practice, many failures occur later, when a system commits: it turns noisy hints into hard constraints, drops rare witnesses, forgets downstream obligations, or answers despite infeasibility. We introduce Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV). CBEA activates a bounded evidence set using typed coverage, tail witnesses, and consequence debt; LCV validates structured commitments before prose and routes infeasible states to repair, abstention, or recontract. Across 360 fixtures and three generation backends, CBEA+LCV reaches zero failures within validator scope at 0.49-0.60 availability over attempted runs. Raw and long-context baselines with the same LCV gate reach zero only at 0.003-0.092. A shadow oracle diagnostic marks the limit: CBEA+LCV recalls 0.012 of uncompiled visible facts, while raw recalls 0.53. The result is a bounded operating point: explicit commitment control and 74-75% lower median input payload, not universal memory dominance.
Authors:Henry Salgado, Meagan R. Kendall, Martine Ceberio, Alexandra Coso Strong
Abstract:
This paper examines the opportunities, limitations, and practical considerations associated with the use of large language models (LLMs) in qualitative research. Drawing on a multidisciplinary perspective that combines expertise in qualitative methods and explainable AI, the paper argues that responsible integration of LLMs into qualitative workflows requires researchers to engage critically with a curated set of technical parameters, that is, context window constraints, temperature and top-p sampling settings, user and system prompt design, and model documentation in the form of system cards. The paper situates these considerations within the epistemological commitments of qualitative research, including reflexivity, positionality, and interpretive judgment, and discusses how the opacity of contemporary LLMs differs from earlier natural language processing tools such as topic models and lexicon-based sentiment analyzers.
Authors:Luis D. Reyes Vargas, Veronica Ruozzi, Andrea K. M. Ross, Shervin Dehghani, Michael Sommersperger, Koorosh Faridpooya, Mohammad Ali Nasseri, Merle Fairhurst, Nassir Navab, Sasan Matinfar
Abstract:
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon's proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.
Authors:Zoe Pfister, Ruth Breu, Michael Vierhauser
Abstract:
Monitoring humans, for example, their movement or location, is essential for safe and efficient human-machine collaboration in Cyber-Physical Systems (CPS). This information allows CPS to ensure safety properties, adapt their behaviour dynamically, and coordinate with humans. To ensure that the design of a CPS respects ethical principles and the privacy of its stakeholders, system requirements, particularly those related to human monitoring, must reflect the human values of all involved stakeholders. However, human values are often underrepresented in Software Engineering -- particularly during requirements elicitation and system design, crucial phases when introducing ethically critical functionality. Stakeholder values are often implicit and conflicting, yet rarely systematically captured. Furthermore, unstructured natural language requirements introduce ambiguity and vagueness, complicating conflict resolution. To address these problems, we propose HM-Req, a novel requirements elicitation framework including a Controlled Natural Language (CNL) for defining human monitoring requirements. These requirements are then augmented with human values from relevant stakeholders and integrated into a Value Dashboard to detect potential conflicts that require further discussion and resolution. Validation results, applying the CNL to different datasets and conducting a survey and expert interview, confirms the CNL's ability to capture diverse human monitoring requirements and show HM-Req's usefulness for requirements elicitation activities.
Authors:Haoyang Du, Yinghan Xu, John Dingliana, Brian Keegan, Rachel McDonnell, Cathy Ennis
Abstract:
The capacity to create realistic virtual humans has progressed significantly, and such characters can be found in many applications across entertainment, education and health. As an essential element of interactive virtual humans, speech-driven 3D gesture generation still depends heavily on perceptual evaluation, yet studies often vary avatar appearance and facial presentation when judging the generated motions. Prior work suggests these visual choices can bias motion judgments, but controlled evidence remains limited. We address this gap with controlled evaluations of co-speech gestures across motion sources, spanning seven representative avatar renderings used in contemporary research and application pipelines. Our results show that avatar and face presentation systematically shift perceptual judgments, and we provide recommendations for benchmarking gesture synthesis as well as for deploying virtual humans in human-facing applications.
Authors:Shengqi Zhu, Jeffrey M. Rzeszotarski, David Mimno
Abstract:
User interactions with LLMs are shaped by prior experiences and individual exploration, but in-lab studies do not provide system designers with visibility into these in-the-wild factors. This work explores a new approach to studying real-world user-LLM interactions through large-scale chat logs from the wild. Through analysis of 140K chatbot sessions from 7,955 anonymized global users over time, we demonstrate key patterns in user expressions despite varied tasks: (1) LLM users are not tabula rasa, nor are they constantly adapting; rather, interaction patterns form and stabilize rapidly through individual early trajectories; (2) Longitudinal outcomes, such as recurring text patterns and retention rates, are strongly correlated with early exploration; (3) Parallel dynamics are present, including organizing expressions by task types such as emotional support, or in response to model-version updates. These results present an ``agency paradox'': despite LLM input spaces being unconstrained and user-driven, we in fact see less user exploration. We call for design consideration surrounding the molding procedure and its incorporation in future research.
Authors:Ryota Takamido, Chiharu Suzuki, Hiroki Nakamoto
Abstract:
Although machine learning (ML)-based performance outcome prediction is an important topic in contemporary sports science, one important issue is the limited understanding of the cross-individual generalizability of ML models in sports contexts. To address this issue, this study aimed to evaluate the cross-individual generalizability of ML models for predicting ball speed in baseball pitching. A dataset comprising 50 pitchers from various competitive levels was analyzed. Cross-individual generalizability was assessed using leave-one-subject-out cross-validation. Specifically, the effects of expertise level and restrictions on spatiotemporal motion information were examined to identify factors influencing model generalizability. The results revealed that, under cross-individual evaluation, (1) predictive performance was markedly lower than under within-individual evaluation, with R-squared value decreasing from 0.91 to 0.38; (2) the model tended to overestimate the performance of Intermediate pitchers relative to Expert pitchers, with a significant group difference in signed prediction error (p < .05); and (3) the trunk and pivot leg demonstrated relatively high generalization performance, with the pivot leg showing notable generalizability even during the weight-shift initiation phase (R-squared value > 0.25). These findings underscore the importance of cross-individual evaluation in enhancing the practical applicability of ML in sports settings and contribute to a deeper understanding of the biomechanical factors underlying the target movement.
Authors:Hassan Khosravi, Dragan Gasevic, Shazia Sadiq, Lixiang Yan, Jason Lodge, Jason Tangen, Paul Denny, Kristen DiCerbo, Simon Buckingham Shum, Ryan S. Baker
Abstract:
Large language models (LLMs) are rapidly transforming knowledge work by improving the quality and efficiency of tasks such as writing, coding, and data analysis. However, their growing use in education has exposed a learning-performance paradox: while they can enhance short-term task performance, they may also undermine genuine learning, including cognitive growth, knowledge transfer, and metacognitive development. This paper addresses the question of how artificial intelligence should be designed and used to support learning rather than merely improve immediate outputs. We introduce the concept of AI learning companions, defined as adaptive, pedagogically informed, LLM-powered agents designed for integration into learning environments. We propose a framework for their design built on three interrelated foundations: a pedagogical foundation focused on how students learn with AI, an adaptive foundation focused on how AI learns about students, and a responsible design foundation ensuring systems remain transparent, accountable, inclusive, and secure. The framework is illustrated through five case studies spanning diverse educational contexts, levels, and tool designs, revealing both the promise and current limitations of existing tools. We conclude that there is a necessary shift away from LLMs designed for task-oriented performance, and beyond simply prompting them to act as tutors, toward deliberately developed AI learning companions that are pedagogically sound, adapt to their learners, and foster durable understanding, metacognitive growth, and learner agency.
Authors:Kavinda Athapaththu, Shiwei Chen, Yuan Fang, Sanchali Mitra, Yee Sin Ang, Yong Wang
Abstract:
The past few years have witnessed vibrant efforts in discovering new two-dimensional (2D) semiconductor materials from both academia and the industry, due to their promising potential in resolving the severe performance deterioration of traditional semiconductors resulting from condensed silicon thickness. However, existing methods (e.g., Density Functional Theory (DFT) or machine-learning-based approaches) suffer from various challenges such as small datasets, and reliability and trustworthiness issues. To bridge this gap, we propose SemiConLens, a visual analytics approach to combine human expertise with the power of ML to enable effective and reliable 2D semiconductor discovery. Specifically, we first develop a new Correlation Aware Multivariate Imputation (CAMI) method and use ML models like autoencoder, which can better learn from limited data and reveal uncertainty, to address the challenge of sparse data in semiconductivity prediction. Built upon this, our visualization module, consisting of three visualization views with linked interactions, allows material researchers to interactively filter, discover and compare 2D semiconductor candidates. A novel circular glyph design and a new cluster-aware layout optimization approach are proposed to effectively display all the user-configurable key attributes and possible prediction uncertainties of each semiconductor candidate, ensuring a reliable and trustable 2D semiconductor discovery. We assess SemiConLens through quantitative evaluations, expert interviews, and use cases. The results demonstrate SemiConLens's capability to help material researchers conduct effective discovery of desirable 2D semiconductors.
Authors:Ka Hei Carrie Lau, Enkelejda Kasneci
Abstract:
Webcam-based eye tracking is a cost-effective, scalable method for remote research that effectively reaches broader populations. However, uncontrolled environments and hardware diversity lead to inconsistent data quality in crowdsourcing. To assess current practices, we conducted a scoping review of crowdsourced eye-tracking from 2011-2025. The review confirms fragmented reporting and a lack of established quality benchmarks. To address this lack of predictive insight, we conducted a case study on AI fairness interviews (N=205) using the RealEye platform. Applying Ordered Logistic Regression (OLR) to the platform quality metric, we found that behavioral and technical factors significantly predict data quality. Specifically, within the RealEye platform, higher fixation counts, shorter sessions, and operating system choice yield significantly higher quality grades. Based on this review and platform-specific predictive insights, we provide actionable recommendations to enhance the reliability, transparency, and replicability of future crowdsourced webcam eye tracking in HCI and behavioral science.
Authors:Michael Yin, Chenxinran Shen, Robert Xiao
Abstract:
Reporting systems in multiplayer video games allow players to express their dissatisfaction with others and combat in-game toxicity. In this work, we examined the act of reporting through the lens of expectancy-value theory. Using a distributed survey (n = 98) and follow-up interviews (n = 19), we explored the value players place on reporting, their desired outcomes, and their expectations that these outcomes will be achieved. Our findings revealed that reporting is motivated by both altruistic and retributive factors, with players seeking short-term revenge while also looking to foster an improved long-term community. Yet, players felt that reporting may not always meet these goals, with belief in the system being mediated by factors such as developer reputation, reporting transparency, and alignment with the community. By understanding the value and expectancy of reporting systems, we discuss their implications on broader digital moderation and consider current and potential future designs of reporting systems.
Authors:Qi Sun, Ziyang Li, Yinzhi Cao, Yaxing Yao
Abstract:
Privacy regulations such as the CCPA and GDPR grant individuals rights over their personal data, yet it remains challenging for most users to exercise them in practice due to vague policy interpretation and unapproachable settings on web interfaces. We introduce Privy, an LLM-powered browser assistant that guides users through exercising their privacy rights on websites. Privy automatically analyzes a website's privacy policy and surfaces the specific rights available as action labels in a side panel. When a user selects a right, Privy provides step-by-step guidance and navigation, presenting direct links, generating email templates, or guiding form completion. Users can also request on-demand policy evidence and rights education to enhance their literacy. A technical evaluation across 14 websites shows that Privy extracts rights with high precision (0.979) and completes 96.3\% of privacy tasks in an average of 3.2 steps. A user study (N=15) also demonstrates the overall high-level of perceived helpfulness among users. Our findings suggest that comprehension and usability are not two separate challenges but a single interaction problem, and that effective privacy support requires integration of policy understanding and privacy actions. We offer design suggestions for future privacy assistants.
Authors:Shardul Sapkota, Matthew Jörke, Zane Sabbagh, Omar Shaikh, Grace Wang, James A. Landay
Abstract:
Recent advances in user modeling make it feasible to conduct open-ended inference over a person's everyday computer use. Despite longstanding visions of systems that deeply understand our actions and the purposes they serve in our lives, existing systems only capture what a person is doing in the moment -- not why they are doing it -- limiting these systems to surface-level support. We introduce striving co-creation, a process for inferring broader life goals from unstructured observations of computer use. Grounded in Activity Theory and Emmons' personal strivings framework, our system progressively constructs a hierarchical representation of a person's activities. Crucially, strivings are difficult to fully resolve from observation alone, as the same action can be driven by many different goals. Our system therefore supports an editing interface that gives people agency over how they are understood by the system, feeding their corrections back into subsequent rounds of striving induction. In a week-long field deployment (N=14), we find that our co-creation process produces strivings that are representative of participants' long-term goals and gives them greater agency than baseline methods.
Authors:Jarod Govers, Sanja Šćepanović, Daniele Quercia
Abstract:
A key task in AI practice is to assess potential impacts to prevent harm. Current AI tools assisting AI impact assessment have not been designed or evaluated for collaborative team brainstorming, and they do not capture the range of views in diverse teams. We studied how AI can support team brainstorming during AI impact assessment and made three contributions. First, we adapted two structured methods from strategic foresight and co-designed AI interventions for them in five in-person workshops with 28 participants in total. Second, we evaluated the interventions in ten in-person workshops with 54 participants, finding that AI improved impact assessment quality and brainstorming perceptions for a general-purpose AI use (a chatbot companion) but not for a specialised one (a kidney allocation application). Third, our findings result in broader design guidance for AI assistance in brainstorming: AI should only offer hints and not solutions during early ideation, initiating interaction only when participants face fixation or saturation; it should facilitate structuring ideas during convergence; leverage expertise to refine ideas; and overall, it should serve more in support of tedious brainstorming process tasks, rather than ideation that teams value to do themselves.
Authors:Nils Mandischer, Daria Eckert and, Lars Mikelsons
Abstract:
Human-robot interaction is emerging as an important paradigm for integrating persons with disabilities into the workplace. While these systems can enable individuals to work, their design is mostly personalized, hindering widespread use beyond the individual user. The universal design paradigm is a central pillar of inclusive design, describing usability of systems by all. To incorporate universal design into process design for human-robot workplaces expert knowledge is required that is often not available. To simplify process design of human-robot workplaces, we propose a persona-based design approach. First, typical impairments prevalent in the workforce or particularly relevant for the processes are abstracted into personas with disabilities. The work process is subdivided into sequential actions. For each action and persona, strategies are developed to reach the action goal by a design thinking approach. The resulting actions are ordered by level of robot assistance, i.e. robot involvement, and implemented in a behavior tree. Therefore, the macro-behavior of the workplace may adapt to individual personas online. We demonstrate the method in a collaborative box folding process with a total of seven personas with disabilities. The persona-based process design shows promising results by generating more comprehensive process strategies while enabling adaptive behavior in the sense of universal design.
Authors:Shang-Hsuan Chiang, Tsan-Tsung Yang, An-Zi Yen, Wen-Chih Peng
Abstract:
Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrative generation. Traditional model-based approaches require large, annotated datasets, while prompt-based methods using large language models (LLMs) often struggle with hallucination due to weak table comprehension. To overcome these challenges, we propose Tree-of-Text, a tree-structured prompting framework that guides LLMs through a three-stage generation process: (1) Content Planning, where relevant operations and arguments are selected from the input tables; (2) Operation Execution, which breaks down large tables into manageable sub-tables; and (3) Content Generation, where short textual outputs are merged and rewritten into a cohesive report. Experiments show that our method outperforms existing methods on ShuttleSet+, leads in RG and CO metrics on RotoWire-FG, and excels in CS and CO on MLB with roughly 40% of the time and cost of Chain-of-Table. These results demonstrate the effectiveness and efficiency of Tree-of-Text and suggest a promising direction for prompt-based table-to-text generation in the sports domain.
Authors:Natalia Amat-Lefort, Mert Yazan, Amanda Cercas Curry, Flor Miriam Plaza-del-Arco
Abstract:
Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remains unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed models that separate cultural effects from demographic composition, we find that: Being aged 25-44, religious, married, and of higher socioeconomic status are predictors of positive perceptions (trust, usage, perceived benefits), with socioeconomic status being the strongest. English-speaking countries consistently show more positive perceptions than Continental European countries. We further collect a corpus of 731 real multilingual prompts from user interactions, showing that users mainly seek help for loneliness, stress, relationship conflicts, and mental health struggles. Our findings reveal that LLM emotional support use is shaped by a complex sociotechnical landscape and call for a broader research agenda examining how these systems can be developed, deployed, and governed to ensure safe and informed access.
Authors:Jennifer Kleiman, Yizhu Gao, Xin Xia, Zhaoji Wang, Zipei Zhu, Jongchan Park, Xiaoming Zhai
Abstract:
Argumentation is a core practice in STEM education, but its productivity depends on who participates and how they interact. Higher-achieving students often dominate the talk and decision-making, while lower-achieving peers may disengage, defer, or comply without contributing substantive reasoning. Forming groups strategically based on students' stances and argumentation skills could help foster inclusive, evidence-based discourse. In practice, however, teachers are constrained in implementing this grouping strategy because it requires real-time insight into students' positions and the quality of their argumentation, information that is difficult to assess reliably and at scale during instruction. We present a generative AI-powered system, ArguAgent, that creates groups optimizing for stance heterogeneity while constraining argumentation quality differences to +/-1 level on a validated learning progression. ArguAgent uses a two-component assessment pipeline: first scoring student arguments on a 0-4 rubric, then clustering positions via semantic analysis. We validated the scoring component against human expert consensus (Krippendorff's ααα = 0.817) using 200 expert-generated scores. Testing three OpenAI models (GPT-4o-mini, GPT-5.1, GPT-5.2) with identical calibrated prompts, we found that systematic prompt engineering informed by human disagreement analysis contributed 89% of scoring improvement (QWK: 0.531 to 0.686), while model upgrades contributed an additional 11% (QWK: 0.686 to 0.708). Simulation testing across 100 classes demonstrated that the grouping algorithm achieves 95.4% of groups that meet both design criteria, a 3.2x improvement over random assignment. These results suggest ArguAgent can enable real-time, theoretically grounded grouping that promotes productive STEM argumentation in classrooms.
Authors:Soohwan Lee, Kyungho Lee
Abstract:
AI-mediated Communication (AIMC) systems increasingly aim to protect minority voices by anonymizing or proxying their input, but anonymity and authenticity are not the same construct. This position paper draws on an ongoing empirical study comparing two LLM-powered minority support strategies in hierarchical group decision-making. We found that relaying minority input anonymously through AI increased participation but significantly reduced psychological safety and satisfaction, while generating only autonomous counterarguments improved satisfaction and reduced marginalization. These counterintuitive findings reveal three provocations for AIMC design in hierarchical contexts: the inherent trade-offs among anonymity, authenticity, agency, and accountability; the risk that power asymmetry reverses intended effects; and the need for AI to facilitate group reflection rather than substitute for human responsibility. These findings and provocations are offered as a contribution to the Restoring Human Authenticity in AI-Mediated Communication workshop.
Authors:Soohwan Lee, Kyungho Lee
Abstract:
As multi-agent AI systems become more common, users increasingly encounter not a single AI voice but a collective one. This shift introduces social dynamics, such as consensus, dissent, and gradual convergence, that can trigger cognitive biases and distort human judgment. We present findings from a controlled experiment (N = 127) comparing three multi-agent configurations: Majority, Minority, and Diffusion. Quantitative results show that majority consensus accelerates opinion change and inflates confidence, consistent with social proof and bandwagon heuristics. Minority dissent slows this process and promotes more deliberative engagement. Qualitative analysis identifies three interpretive trajectories: reinforcing, aligning, and oscillating, shaped by how users interpret agent independence and group dynamics over time. These findings suggest that agent agreement structure, independent of content, functions as a bias-relevant signal in LLM interactions. We hope this work contributes to the Bias4Trust agenda by grounding multi-agent social influence as a concrete and designable source of bias in human-AI interaction.
Authors:Ruican Zhong, Jiachen Li, Gary Hsieh, David W. McDonald, Selin S. Everett, Alyssa Unell, Jonathan Carlson, Katie Claveau, Noel Codella, Khalil Malik, Scott Mackie, Eduardo Olvera, Scott Saponas, Eric Horvitz, David Rhew, Jim Weinstein, Jacob Gross, Amanda K. Hall
Abstract:
Electronic health records (EHRs) have improved data accessibility but have also introduced cognitive burden for physicians, given the sheer volume and complexity of the data involved. Advances in large language models (LLMs) create new opportunities to rethink how clinicians interact with medical data through dynamic, adaptive interfaces. In this position paper, we explore how generative AI can support physicians' information needs by enabling more dynamic interactions with patient data. Through semi-structured interviews with internal physicians at Microsoft, we identify key challenges in data navigation and synthesis, and characterize clinicians' information needs during diagnostic workflows. We further examine how physicians conceptualize AI can help their work process and how these mental models shape expectations for interaction and trust. Based on these insights, we discuss design considerations for generative user interfaces that support clinician-centered workflows.
Authors:Nathanael Jo, Zoe De Simone, Mitchell Gordon, Ashia Wilson
Abstract:
Modern AI assistants are trained to follow instructions, implicitly assuming that users can clearly articulate their goals and the kind of assistance they need. Decades of behavioral research, however, show that people often engage with AI systems before their goals are fully formed. When AI systems treat prompts as complete expressions of intent, they can appear to be useful or convenient, but not necessarily aligned with the users' needs. We call these failures Fantasia interactions. We argue that Fantasia interactions demand a rethinking of alignment research: rather than treating users as rational oracles, AI should provide cognitive support by actively helping users form and refine their intent through time. This requires an interdisciplinary approach that bridges machine learning, interface design, and behavioral science. We synthesize insights from these fields to characterize the mechanisms and failures of Fantasia interactions. We then show why existing interventions are insufficient, and propose a research agenda for designing and evaluating AI systems that better help humans navigate uncertainty in their tasks.
Authors:Eden Shaveet, Zefan Sramek, Yumi Hamamoto, Jing Du, Scott Griffiths, Thalia Zhang, Thalia Viranda, William Hornby, Flora Salim, Koji Yatani, Tanzeem Choudhury
Abstract:
Objective: Reliable identification of pro-eating disorder (pro-ED) content online suffers from two pervasive problems: 1) existing methods predominantly rely on text-based signals, failing to capture the inherently multimodal nature of multimedia content; and 2) these methods struggle to keep pace with the rapid evolution of references, memes, terminology, and contextual cues that underlie this content. Together, these limitations point to a gap: the absence of an expert-annotated reference standard capable of supporting real-time research and robust multimodal detection model training for pro-ED content on short-form video platforms. Method: To address this, we propose "zeitgeist-aware" multimodal (ZAM) datasets: continuously curated collections of annotated multimodal pro-ED content with inclusion criteria that evolve alongside the memetic zeitgeist: the variable essence of what is considered pro-ED as new media and references come into the cultural zeitgeist and are absorbed and interpreted in online spaces. Results: We present a rationale for such datasets, define their core characteristics, outline approaches for their curation, and describe our progress toward that end. Discussion: This dataset and pipeline architecture may benefit researchers across several fields who are interested in how pro-ED sentiment is encoded and transmitted through short-form video content across time, including for the purpose of responsive moderation efforts.
Authors:Maurice Chiodo, Toni Erskine, Dennis Müller, James G. Wright
Abstract:
We analyse the 2025 Signalgate leak of sensitive US military information by the Trump administration, addressing why confidentiality was violated (messages leaked to the press) in spite of encryption (Signal), to deepen the socio-technical considerations when designing and deploying encryption. First, we use applied pi-calculus to formally model the boutique secure facility setup requested by the US Defence Secretary, to prove that a leak would not be prevented. We then examine how using a secure channel might still not give overall information security, as, in this case, power imbalances between personnel and officials led to the application of cryptography that compromised their operational security. We look at how cryptographic tools may have instilled a false sense of security, and led officials to "overshare". We then apply this analysis to the Trump administration's general desire to burn through political, legal, and now technical process, and demonstrate geopolitical harms that may arise from such ineffective use of cryptography in a brief use case. We conclude that, even with advancements in usability of cryptographic tools, genuine message security is still out of reach of the "average user".
Authors:Simon Bohnen, Gabriel Garbers, Lukas Ellinger, Georg Groh
Abstract:
Knowledge work demands sustained self-regulation, prioritization, and reflection-yet existing planning tools only partially support these needs. Digital to-do list applications feature task persistence but lack goal representation. Paper-based planning frameworks offer effective planning strategies but cannot adapt to individual users. Conversational AI systems enable flexible reflection but lack persistence and accountability. Moreover, none of these tools address a fundamental challenge: users' expressed demands often diverge from their underlying needs. This paper introduces seneca, a conceptual framework for a personalized, AI-assisted planner that integrates the complementary strengths of these three approaches. seneca combines a conversational agent that scaffolds reflection and asks clarifying questions, a persistent database that tracks goals and behavioral patterns, and a processor that synchronizes information between them. We describe this architecture and outline a phased evaluation strategy combining automated testing with simulated users and longitudinal human studies measuring goal attainment, planning realism, and goal-value alignment.
Authors:Rogerio Corga Da Silva, Miguel Romano, Tiago Mendes, Marta Isidoro, Sandhanakrishnan Ravichandran, Shivesh Kumar, Michiel van der Heijden, Olivier Fail, Valentine Emmanuel Gnanapragasam
Abstract:
Background: Clinical documentation and information retrieval consume over half of physicians working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. This study aimed to evaluate physician-perceived time efficiency, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methodology: In this prospective, single-arm, pilot feasibility study, 29 physicians and medical students across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations (time saving and decision support) and a final Net Promoter Score (NPS). Non-parametric methods were used throughout, with bootstrap confidence intervals (CIs) and sensitivity analysis to address non-response. Results: Physicians reported high perceived time saving (mean = 4.27/5; 95% CI = 3.97-4.57) and decision support (mean = 4.16/5; 95% CI = 3.86-4.45), with ratings stable across the five-day study window. Among the 16 (55%) participants who completed the final evaluation, the NPS was 81.2, with no detractors; sensitivity analysis indicated an NPS of 44.8 under conservative non-response assumptions. Conclusions: Physicians across specialties and career stages reported positive perceptions of DR. INFO for both time efficiency and clinical decision support within the study window. These findings are preliminary and should be confirmed in larger, controlled studies that include objective performance measures and independent accuracy verification.
Authors:Xiao Lu, Hao Zhen, Jidong J. Yang
Abstract:
Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, including GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o's superior spatial reasoning and alignment between extracted and visualized crash data. These results highlight both the promise and current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.
Authors:Joel Perca, Luis Sante, Juanpablo Heredia, Joao Rulff, Claudio Silva, Jorge Poco
Abstract:
Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present URBANCLIPATLAS, a visual analytics system for exploring long urban videos recorded at street intersections. URBANCLIPATLAS combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. URBANCLIPATLAS supports scene retrieval through an augmented chat-based interface and improves scene interpretation by tightly aligning textual outputs with video evidence. This design strengthens the connection between textual reasoning and visual evidence, reducing the effort required to validate model outputs and refine hypotheses. We demonstrate the usefulness of URBANCLIPATLAS on the StreetAware dataset through two case studies involving hazardous scenarios and crossing dynamics at street intersections. URBANCLIPATLAS helps analysts reason about safety- and mobility-related patterns across large urban video collections.
Authors:Megha Chakraborty, Darssan L. Eswaramoorthi, Madhur Thareja, Het Riteshkumar Shah, Finlay Palmer, Aryaman Bahl, Michelle A Ihetu, Amit Sheth
Abstract:
AI-driven education platforms have made some progress in personalisation, yet most remain constrained to static adaptation--predefined quizzes, uniform pacing, or generic feedback--limiting their ability to respond to learners' evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner's interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.
Authors:Ozioma C. Oguine, Elmira Rashidi, Pamela J. Wisniewski, Karla Badillo-Urquiola
Abstract:
Ecological Momentary Assessment (EMA) is widely used to study adolescents' experiences; yet, how the design of EMA platforms shapes engagement, research practices, and power dynamics in youth studies remains under-examined. We developed a youth-centered EMA platform prioritizing youth engagement and researcher support, and evaluated it through a case study on a longitudinal investigation with adolescent twins focused on mental health and sleep behavior. Interviews with the research team examined how the platform design choices shaped participant onboarding, sustained engagement, risk monitoring, and data interpretation. The app's teen-centered design and gamified features sustained teen engagement, while the web portal streamlined administrative oversight through a centralized dashboard. However, technical instability and rigid data structures created significant hurdles, leading to privacy concerns among parents and complicating the researchers' ability to analyze raw usage metadata. We provide actionable interaction design guidelines for developing EMA platforms that prioritize youth agency, ethical practice, and research goals.
Authors:Qiyang Chen, Guozheng Li, Xingqi Wang, Gerile Aodeng, Min Lu, Chi Harold Liu
Abstract:
Hierarchical tables are an important structure for organizing data with inherent hierarchical relationships. Existing studies have extensively explored methods for data fact exploration from tabular data. In particular, some studies have directly integrated visual data facts into the original table structure to support in-situ exploration, because embedding data facts within the table context can reduce cognitive load by minimizing attention shifts. However, embedding a large amount of extracted data facts into the limited space of hierarchical tables often leads to layout conflicts, hindering effective exploration. To address this issue, we propose an interactive exploration paradigm for hierarchical table data facts based on semantic zooming and develop an interactive visualization system, ZoomTable. The ZoomTable system employs semantic zooming as the interaction method, combined with a data-fact layout method and a data fact recommendation mechanism. This combination not only resolves layout conflicts, but also supports users in coherently exploring multidimensional data facts at different scales. A case study and a user experiment further validate the practicality and efficiency of ZoomTable in real-world data fact exploration scenarios.
Authors:Qian Ma, Aditya Majumdar, Sarah Rajtmajer, Brett Frischmann
Abstract:
Privacy policies govern how personal data is collected, used, and shared. Yet, in most privacy-policy consent flows, agreement is operationalized as a single click at the end of a long, opaque policy document. Recent privacy-law scholarship has argued for a standard of demonstrably informed consent. That is, the party drafting and designing privacy-policy consent mechanisms must generate reliable evidence that a person demonstrates comprehension of the consequential terms to which they agree. To this end, we study pedagogical friction as a design framing: minimal interventions embedded within a privacy-policy consent flow that aim to support demonstrated comprehension while keeping burden on the user low. In a randomized experiment, we tested pedagogical friction for demonstrably informed consent in the context of a privacy policy for an edtech app for young children. We recruited 293 parents of kids ages 3-8 to review the app's privacy policy under one of six conditions that varied presentation format and pacing, then complete a six-question comprehension quiz. Three conditions offered a second policy review and quiz retake for participants who did not pass this quiz on their first attempt. We find that the slide-based condition (G3) achieved the highest first-attempt threshold attainment (>=80%) (41.7%), followed by the paced, sectioned condition (G4) (30.6%). In the retake conditions, 64.9% of participants who completed a second attempt improved their score. Notably, in conditions that did not gate consent on demonstrated comprehension, 97.3% of participants who scored below the threshold still chose to consent, suggesting that ungated consent flows can record agreement without demonstrated comprehension. Our results suggest that pedagogical friction can strengthen the evidentiary basis of consent and clarify what it costs in time and burden.
Authors:Chao Zhang, Shunan Guo, Abe Davis, Eunyee Koh
Abstract:
Experienced storytellers decompose stories into local narrative strategies and how these strategies shape higher-level arcs. This decomposition helps writers recognize patterns in others' work and adapt those patterns to tell new stories. Novices, however, struggle to identify these strategies or to reuse them effectively. We present Narrix, a novel writing tool that helps novice writers recognize narrative strategies in example stories and repurpose these strategies in their own writing. Narrix analyzes strategies in example stories, highlights them with color-coded lexical cues and explanations, and situates them on an interactive story arc for exploration by emotional shifts and turning points. Writers then drag strategies onto multi-dimensional tracks and apply block-scoped edits to revise or continue their drafts through controlled generation steered by specified strategies. Through a within-subjects study (N=12), Narrix showed improved participants' retention, confidence, and creative adaptation of narrative strategies compared to a baseline chat-based writing interface.
Authors:Ananya Bhattacharjee, Michael Liut, Matthew Jörke, Diyi Yang, Emma Brunskill
Abstract:
Digital mental health (DMH) tools have extensively explored personalization of interventions to users' needs and contexts. However, this personalization often targets what support is provided, not how it is experienced. Even well-matched content can fail when the interaction format misaligns with how someone can engage. We introduce generative experience as a paradigm for DMH support, where the intervention experience is composed at runtime. We instantiate this in GUIDE, a system that generates personalized intervention content and multimodal interaction structure through rubric-guided generation of modular components. In a preregistered study with N = 237 participants, GUIDE significantly reduced stress (p = .02) and improved the user experience (p = .04) compared to an LLM-based cognitive restructuring control. GUIDE also supported diverse forms of reflection and action through varied interaction flows, while revealing tensions around personalization across the interaction sequence. This work lays the foundation for interventions that dynamically shape how support is experienced and enacted in digital settings.
Authors:Mingchen Li, Wajdi Aljedaani, Yingjie Liu, Navyasri Meka, Xuan Lu, Xinyue Ye, Junhua Ding, Yunhe Feng
Abstract:
Skin-toned emojis are crucial for fostering personal identity and social inclusion in online communication. As AI models, particularly Large Language Models (LLMs), increasingly mediate interactions on web platforms, the risk that these systems perpetuate societal biases through their representation of such symbols is a significant concern. This paper presents the first large-scale comparative study of bias in skin-toned emoji representations across two distinct model classes. We systematically evaluate dedicated emoji embedding models (emoji2vec, emoji-sw2v) against four modern LLMs (Llama, Gemma, Qwen, and Mistral). Our analysis first reveals a critical performance gap: while LLMs demonstrate robust support for skin tone modifiers, widely-used specialized emoji models exhibit severe deficiencies. More importantly, a multi-faceted investigation into semantic consistency, representational similarity, sentiment polarity, and core biases uncovers systemic disparities. We find evidence of skewed sentiment and inconsistent meanings associated with emojis across different skin tones, highlighting latent biases within these foundational models. Our findings underscore the urgent need for developers and platforms to audit and mitigate these representational harms, ensuring that AI's role on the web promotes genuine equity rather than reinforcing societal biases.
Authors:Keya Shah, Himanshi Lalwani, Hanan Salam
Abstract:
College students face well-being challenges driven by academic pressure, financial strain, and social expectations. While campus counseling and student-success programs offer support, access is often limited by stigma, waitlists, and scheduling constraints. Existing digital tools focus on emotional check-ins or chatbots and may overlook structured goal setting and aligning goals with personal values. We present GROW, a goal-centered well-being coaching system that puts values-aligned goals at the center of the student experience. GROW combines the SMART framework with principles from Acceptance and Commitment Therapy in a conversational AI coach that helps students clarify aspirations, break them into concrete steps, and reflect on progress. The system links action plans with Google Calendar, sends reminders, and provides a dashboard that shows progress and engagement. We evaluated GROW through interviews with clinical psychologists, student-success staff, and faculty, followed by a one-week deployment with 30 undergraduates. Findings offer design implications for interactive systems that support engagement, accountability, and sense of purpose in higher education.
Authors:Yuhan Liu, Shuyao Zhou, Jakob Kaiser, Ella Colby, Jennifer Okwara, Maggie Wang, Varun Nagaraj Rao, Andrés Monroy-Hernández
Abstract:
Policy researchers need scalable ways to surface public views, yet they often rely on interviews, listening sessions, and surveys-analyzed thematically-that are slow, expensive, and limited in scale and diversity. LLMs offer new possibilities for thematic analysis of unstructured text, yet we know little about how LLM-assisted workflows perform for policy research. Building on a workflow for LLM-assisted thematic analysis of online forums, we conduct a study with 11 policy researchers, who use an early prototype and see it as a quick, rough-and-ready input to their research. We then extend and scale the workflow to analyze millions of Reddit posts and 1,058 chatbot-led interview transcripts on a policy-relevant topic, treating these sources as rich and scalable data for policy discourse. We compare the synthesized themes to those from authoritative policy reports, identify points of alignment and divergence, and discuss what this implies for policy researchers adopting LLM-assisted workflows.
Authors:Racquel Fygenson, Enrico Bertini, Lace M. Padilla
Abstract:
Affordances, originating in psychology, describe how an object's design influences the physical and cognitive actions users may take. Past work applied affordance theory to visualization to explain how design decisions can impact the cognitive actions of visualization readers. In this work, we demonstrate that affordances can complement effectiveness rankings by further explaining the root causes behind visualizations' task performance. To do so, we conduct a case study on static normal probability density function plots, identifying their current affordances. Next, we identify the optimal affordances for a common probability-comparison task and develop a novel affordance-driven visualization, the Croissant Chart, to support them. We empirically validate the design's effectiveness through a preregistered study (n = 808), demonstrating how affordances can inform predictable changes in task performance. Our findings underscore the potential for affordance-based approaches to enhance visualization effectiveness and inform future design decisions.
Authors:Xiaoyuan Zhu, Kimberly Le Truong, Riccardo Fogliato, Gokul Swamy, Weijian Zhang, Minglai Yang, Longtian Ye, Bangya Liu, Minghao Liu, Andrew Ilyas, Steven Wu
Abstract:
As LLMs are deployed in high-stakes settings, users must judge the correctness of individual responses, often relying on model-generated justifications such as reasoning chains or explanations. Yet, no standard measure exists for whether these justifications help users distinguish correct answers from incorrect ones. We formalize this idea as error verifiability and propose $v_{\text{bal}}$, a balanced metric that measures whether justifications enable raters to accurately assess answer correctness, validated against human raters who show high agreement. We find that neither common approaches, such as post-training and model scaling, nor more targeted interventions recommended improve verifiability. We introduce two methods that succeed at improving verifiability: reflect-and-rephrase (RR) for mathematical reasoning and oracle-rephrase (OR) for factual QA, both of which improve verifiability by incorporating domain-appropriate external information. Together, our results establish error verifiability as a distinct dimension of response quality that does not emerge from accuracy improvements alone and requires dedicated, domain-aware methods to address.
Authors:Donghoon Shin, Bingcan Guo, Jaewook Lee, Lucy Lu Wang, Gary Hsieh
Abstract:
Although HCI research papers offer valuable design insights, designers often struggle to apply them in design workflows due to difficulties in finding relevant literature, understanding technical jargon, the lack of contextualization, and limited actionability. To address these challenges, we present ReFinE, a Figma plugin that supports real-time design iteration by surfacing contextualized insights from research papers. ReFinE identifies and synthesizes design implications from HCI literature relevant to the mockup's design context, and tailors this research evidence to a specific design mockup by providing actionable visual guidance on how to update the mockup. To assess the system's effectiveness, we conducted a technical evaluation and a user study. Results show that ReFinE effectively synthesizes and contextualizes design implications, reducing cognitive load and improving designers' ability to integrate research evidence into UI mockups. This work contributes to bridging the gap between research and design practice by presenting a tool for embedding scholarly insights into the UI design process.
Authors:Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara
Abstract:
Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
Authors:Xin Sun, Shu Wei, Ting Pan, Yajing Wang, Jos A. Bosch, Isao Echizen, Abdallah El Ali, Saku Sugawara
Abstract:
Modeling users' cognitive states (e.g., cognitive load and decision confidence) is essential for building adaptive AI in high-stakes decision-making. While eye tracking provides non-invasive behavioral signals correlated with cognitive effort, prior work has not systematically examined how AI assistance contexts, specifically varying advice reliability and user heterogeneity, can alter the mapping between gaze signals and cognitive states. We conducted a within-subject lab eye-tracking study (N=54) on factual verification tasks under three conditions: No-AI, Correct-AI advice, and Incorrect-AI advice. We analyze condition-dependent changes in self-reports and eye-tracking patterns and evaluate the robustness of eye-tracking-based user modeling. Results show that AI advice increases decision confidence compared to No-AI, while Correct-AI is associated with lower perceived cognitive load and more efficient gaze behavior. Crucially, predictive modeling is context-sensitive: the relationship between eye-tracking signals and cognitive states shifts across AI conditions. Finally, fusing eye-tracking features with user priors (demographics, AI literacy/experience, and propensity to trust technology) improves cross-participant generalization. These findings support condition-aware and personalized user modeling for cognitively aligned adaptive AI systems.
Authors:Junhee Lee, Minseok Kim, Hwanjo Heo, Seungwon Woo, Jinwoo Kim
Abstract:
Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.
Authors:Bahar Jahani, Matsanga Leyila Kaseka, Marta Kersten-Oertel, Yiming Xiao
Abstract:
Stroke remains a leading cause of mortality and disability worldwide, requiring rapid and informed clinical decision-making. A solid spatial understanding of cerebrovascular anatomy and vascular territories in relation to stroke symptoms and severity is critical for timely clinical decision and patient care. However, this knowledge is typically conveyed through static 2D diagrams and printed materials, which can hinder mastery of the complex neurovascular system and their clinical implications. Mobile augmented reality (AR) offers an accessible medium for delivering intuitive 3D anatomical education, yet applications focused on the neurovascular system and stroke remain limited despite the demand. To address this, we propose NeuroVase, a tablet-based mobile AR platform within a structured pedagogical framework that enhances stroke-related neuroanatomy learning by providing an interactive, engaging, and accessible alternative to traditional methods. NeuroVase features a dual-mode setup, using tangible cue cards as standalone study aids while also serving as interactive markers for AR content delivery. A custom learning curriculum focused on cerebrovascular anatomy and stroke supports exploration of vascular territories, stroke syndromes, and arterial occlusions, in the context of annotated 3D anatomical models in NeuroVase. A controlled user study with 40 participants revealed that NeuroVase is an effective and user-friendly AR platform to facilitate complex anatomical and physiological education, compared with traditional learning.
Authors:Congning Ni, Sarvech Qadir, Bryan Steitz, Mihir Sachin Vaidya, Qingyuan Song, Lantian Xia, Shelagh Mulvaney, Siru Liu, Hyeyoung Ryu, Leah Hecht, Amy Bucher, Christopher Symons, Laurie Novak, Susannah L. Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin
Abstract:
Mental health concerns are often expressed outside clinical settings, including in high-distress help seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.
Authors:Soslan Kabisov, Vsevolod Kirichuk, Andrey Volkov, Gennadii Savrasov, Marina Barannikov, Anton Konushin, Andrey Kuznetsov, Dmitrii Zhemchuzhnikov
Abstract:
Computer-Aided Design (CAD) powers modern engineering, yet producing high-quality parts still demands substantial expert effort. Many AI systems tackle CAD reverse engineering, but most are single-pass and miss fine geometric details. In contrast, human engineers compare the input shape with the reconstruction and iteratively modify the design based on remaining discrepancies. Agent-based methods mimic this loop with frozen VLMs, but weak 3D grounding of current foundation models limits reliability and efficiency. We introduce CADReasoner, a model trained to iteratively refine its prediction using geometric discrepancy between the input and the predicted shape. The model outputs a runnable CadQuery Python program whose rendered mesh is fed back at the next step. CADReasoner fuses multi-view renders and point clouds as complementary modalities. To bridge the realism gap, we propose a scan-simulation protocol applied during both training and evaluation. Across DeepCAD, Fusion 360, and MCB benchmarks, CADReasoner attains state-of-the-art results on clean and scan-sim tracks.
Authors:Peng Kuai, Yukun Yang, Shaolun Ruan, Junchi Xu, Yanjie Zhang, Lin Zhang, Min Zhu, Rui Sheng
Abstract:
Rare disease diagnosis is inherently challenging due to heterogeneous symptoms, limited clinical familiarity, and fragmented evidence across specialties. Recent large language model (LLM)-based agentic systems have shown promise by simulating multidisciplinary team discussions to generate and evaluate diagnostic hypotheses. However, fully automated diagnosis remains unrealistic, and existing human-in-the-loop approaches provide limited support for effective clinician-agent collaboration. In practice, clinicians are often presented with final diagnostic outputs and lengthy, unstructured agent discussion logs, making it difficult to inspect reasoning, intervene in a timely manner, or guide agent deliberation effectively. To address these challenges, we developed MDTRoom, an interactive system that transforms multi-agent discussions from linear transcripts into a structured, inspectable workspace. The system externalizes patient data, evidence provenance, hypothesis evolution, and inter-agent conflicts as interconnected visual objects, enabling clinicians to efficiently examine, intervene in, and guide agent reasoning. Our evaluation demonstrates the effectiveness of MDTRoom in supporting clinician-agent collaboration.
Authors:Yiyuan Wang, Martin Tomitsch, Marius Hoggenmüller, Senuri Wijenayake, Wai Yan, Luke Hespanhol
Abstract:
Autonomous vehicles (AVs) tend to disrupt the atmosphere and pedestrian experience in urban shared spaces, undermining the focus of these spaces on people and placemaking. We investigate how external human-machine interfaces (eHMIs) supporting AV-pedestrian interaction can be extended to consider the characteristics of an urban shared space. Inspired by urban HCI, we devised three place-based eHMI designs that (i) enhance a conventional intent eHMI and (ii) exhibit content and physical integration with the space. In an evaluation study, 25 participants experienced the eHMIs in an immersive simulation of the space via virtual reality and shared their impressions through think-aloud, interviews, and questionnaires. Results showed that the place-based eHMIs had a notable effect on influencing the perception of AV interaction, including aspects like visual aesthetics and sense of reassurance, and on fostering a sense of place, such as social interactivity and the intentionality to coexist. In measuring qualities of pedestrian experience, we found that perceived safety significantly correlated with user experience and affect, including the attractiveness of eHMIs and feelings of pleasantness. The paper opens the avenue for exploring how eHMIs may contribute to the placemaking goals of pedestrian-centric spaces and improve the experience of people encountering AVs within these environments.
Authors:Hayeon Jeon, Dakyeom Ahn, Sunyu Pang, Yunseo Choi, Suhwoo Yoon, Joonhwan Lee, Eun-mee Kim, Hajin Lim
Abstract:
Introspection is central to identity construction and future planning, yet most digital tools approach the self as a unified entity. In contrast, Dialogical Self Theory (DST) views the self as composed of multiple internal perspectives, such as values, concerns, and aspirations, that can come into tension or dialogue with one another. Building on this view, we designed InnerPond, a research probe in the form of a multi-agent system that represents these internal perspectives as distinct LLM-based agents for introspection. Its design was shaped through iterative explorations of spatial metaphors, interaction scaffolding, and conversational orchestration, culminating in a shared spatial environment for organizing and relating multiple inner perspectives. In a user study with 17 young adults navigating career choices, participants engaged with the probe by co-creating inner voices with AI, composing relational inner landscapes, and orchestrating dialogue as observers and mediators, offering insight into how such systems could support introspection. Overall, this work offers design implications for AI-supported introspection tools that enable exploration of the self's multiplicity.
Authors:Zeya Chen, Zach Pino, Ruth Schmidt
Abstract:
Data donation, an emerging user-centric data collection method for public sector research, faces a gap between participant willingness and actual donation. This suggests a design absence in practice: while promoted as "donor-centered" with technical and regulational advances, a design perspective on how data choices are presented and intervene on individual behaviors remain underexplored. In this paper, we focus on pre-donation data exploration, a key stage for adequately and meaningful informed participation. Through a real-world data donation study (N=24), we evaluated three data exploration interventions (self-focused, social comparison, collective-only). Findings show choice framing impacts donation participation. The "social comparison" design (87.5%) outperformed the "self-focused view" (62.5%) while a "collective-only" frame (37.5%) backfired, causing "perspective confusion" and privacy concerns. This study demonstrates how strategic data framing addresses data donation as a behavioral challenge, revealing design's critical yet underexplored role in data donation for participatory public sector innovation.
Authors:Frank Heyen, Michael Sedlmair
Abstract:
We contribute two design studies for augmented reality visualizations that support learning musical instruments. First, we designed simple, glanceable encodings for drum kits, which we display through a projector. As second instrument, we chose guitar and designed visualizations to be displayed either on a screen as an augmented mirror or as an optical see-through AR headset. These modalities allow us to also show information around the instrument and in 3D. We evaluated our prototypes through case studies and our results demonstrate the general effectivity and revealed design-related and technical limitations.
Authors:Frank Heyen, Michael Sedlmair
Abstract:
Musicians mostly have to rely on their ears when they want to analyze what they play, for example to detect errors. Since hearing is sequential, it is not possible to quickly grasp an overview over one or multiple recordings of a whole piece of music at once. We therefore propose various visualizations that allow analyzing errors and stylistic variance. Our current approach focuses on rhythm and uses MIDI data for simplicity.
Authors:Yong Ma, Xuesong Zhang, Xuedong Zhang, Natalia Bartłomiejczyk, Seungwoo Je, Adrian Holzer, Morten Fjeld, Andreas Butz
Abstract:
Voice assistants (VAs) are typically evaluated through task performance metrics and self-report questionnaires, but people's voices themselves carry rich paralinguistic cues that reveal affect, effort, and interaction breakdowns. We present a within-subjects study (N=49) that systematically compared three VA personas across three usage scenarios to investigate whether speech-derived audio features can serve as a proxy for user experience (UX). Participants' speech was analyzed for temporal, spectral, and linguistic markers, alongside standardized UX measures, brief mood and stress ratings, and a post-study questionnaire. We found correlations between specific speech features and self-reported satisfaction and experience. Furthermore, a machine learning model trained on speech features achieved promising accuracy in classifying UX levels, indicating that this might be a reasonable alternative to self-report instruments. Our findings establish speech as a viable, real-time signal for implicitly measuring UX and point toward adaptive VUIs that respond dynamically to emotional and usability-related vocal cues.
Authors:Shiwei Wu, Xinyue Chen, Yuheng Liu, Xingbo Wang, Qingyu Guo, Longfei Chen, Chuhan Shi, Zhenhui Peng
Abstract:
Many people browse online communities to learn from others' experiences and opinions, e.g., for constructing travel plans. Conversational search powered by large language models (LLMs) could ease this information-seeking task, but it remains under-investigated within the online community. In this paper, we first conducted an exploratory study (N=10) that indicated the helpfulness of a classic conversational search tool and identified room for improvement. Then, we proposed ConSearcher, an LLM-powered tool with dynamically generated member personas based on user queries to facilitate conversational search in the community. In ConSearcher, users can clarify their interests by checking what a simulated member similar to them may ask and get responses from diverse members' perspectives. A within-subjects study (N=27) showed that compared to two conversational search baselines, ConSearcher led to significantly higher information-seeking outcome and user engagement but raised concerns about over-personalization. We discuss implications for supporting conversational information seeking in online communities.
Authors:Haocheng Yuan, Adrien Bousseau, Hao Pan, Lei Zhong, Changjian Li
Abstract:
Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy-animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.
Authors:Eason Chen, Ce Guan, Ahmed Elshafiey, Zhonghao Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince, Cyuan-Jhen Wu
Abstract:
The AIED community envisions AI evolving "from tools to teammates," yet our understanding of AI teammates remains limited to dyadic human-AI interactions. We offer a different vantage point: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Drawing on a month of daily qualitative observations across multiple platforms including Moltbook, The Colony, and 4claw, we identify four phenomena with implications for AIED: (1) humans who configure their agents undergo a "bidirectional scaffolding" process, learning through teaching; (2) peer learning emerges without any designed curriculum, complete with idea cascades and quality hierarchies; (3) agents converge on shared memory architectures that mirror open learner model design; and (4) trust dynamics and platform mortality reveal design constraints for networked educational AI. Rather than presenting empirical findings, we argue that these organic phenomena offer a naturalistic window into dynamics that can inform principled design of multi-agent educational systems. We sketch an illustrative curriculum design, "Learn by Teaching Your AI Agent Teammate," and outline potential research directions and open problems to show how these observations might inform future AIED practice and inquiry.
Authors:Guanghui Zhao, Zhe Wang, Yu Dong, Guan Li, GuiHua Shan
Abstract:
Scientific visualization pipelines encode domain-specific procedural knowledge with strict execution dependencies, making their construction sensitive to missing stages, incorrect operator usage, or improper ordering. Thus, generating executable scientific visualization pipelines from natural-language descriptions remains challenging for large language models, particularly in web-based environments where visualization authoring relies on explicit code-level pipeline assembly. In this work, we investigate the reliability of LLM-based scientific visualization pipeline generation, focusing on vtk.js as a representative web-based visualization library. We propose a structure-aware retrieval-augmented generation workflow that provides pipeline-aligned vtk.js code examples as contextual guidance, supporting correct module selection, parameter configuration, and execution order. We evaluate the proposed workflow across multiple multi-stage scientific visualization tasks and LLMs, measuring reliability in terms of pipeline executability and human correction effort. To this end, we introduce correction cost as metric for the amount of manual intervention required to obtain a valid pipeline. Our results show that structured, domain-specific context substantially improves pipeline executability and reduces correction cost. We additionally provide an interactive analysis interface to support human-in-the-loop inspection and systematic evaluation of generated visualization pipelines.
Authors:Emily Kuang, Ehsan Jahangirzadeh Soure, Luyao Shen, Nitesh Goyal, Mingming Fan, Kristen Shinohara
Abstract:
AI-assisted usability analysis can potentially reduce the time and effort of finding usability problems, yet little is known about how AI's perceived expertise influences evaluators' analytic strategies and perceptions over time. We ran a within-subjects, five-session study (six hours per participant) with 12 professional UX evaluators who worked with two conversational assistants designed to appear novice- or expert-like (differing in suggestion quantity and response accuracy). We logged behavioral measures (number of passes, suggestion acceptance rate), collected subjective ratings (trust, perceived efficiency), and conducted semi-structured interviews. Participants experienced an initial novelty effect and a subsequent dip in trust that recovered over time. Their efficiency improved as they shifted from a two-pass to a one-pass video inspection approach. Evaluators ultimately rated the experienced CA as significantly more efficient, trustworthy, and comprehensive, despite not perceiving expertise differences early on. We conclude with design implications for adapting AI expertise to enable calibrated human-AI collaboration.
Authors:Siyu Zha, Weijing Liu, Fei Qin, Jie Cao, Yanjin Wang, Yujia Liu, Kaiyi Zhang, Jiangtao Gong, Yingqing Xu
Abstract:
Generative artificial intelligence (GenAI) is increasingly embedded in computer-supported collaborative learning (CSCL), yet little empirical research has unpacked how different configurations of AI participation reshape collaborative processes. This study investigates how GenAI configuration shapes collaborative regulation in authentic classroom settings. Two eighth-grade classes engaged in small-group creative problem-solving under two conditions: a shared-AI configuration, in which each group interacted with a single AI mentor, and an individual-AI configuration, in which each student accessed a personal AI instance. Using multi-layer discourse coding combined with lag sequential analysis (LSA) and ordered network analysis (ONA), we examined interaction distribution, AI-student coupling, shared regulation processes, and teacher orchestration. Results reveal distinct regulatory dynamics across configurations. Shared AI access promoted convergence-oriented collaboration, with stronger alignment of shared regulatory states and more coordinated group-level reasoning. In contrast, individual AI access distributed support across learners, producing more exploratory and evaluative cycles but also more fragmented interaction patterns, accompanied by increased teacher intervention to manage divergence. These findings suggest that AI configuration functions as a structural design variable that reorganizes the regulatory ecology of classroom collaboration.
Authors:Philipp Spitzer, Joshua Holstein
Abstract:
As artificial intelligence (AI) becomes increasingly integrated into workflows, humans must decide when to rely on AI advice. These decisions depend on general efficacy beliefs, i.e., humans' confidence in their own abilities and their perceptions of AI competence. While prior work has examined factors influencing AI reliance, the role of efficacy beliefs in shaping collaboration remains underexplored. Through a controlled experiment (N=240) where participants made repeated delegation decisions, we investigate how efficacy beliefs translate into instance-wise efficacy judgments under varying contextual information. Our explorative findings reveal efficacy beliefs as persistent cognitive anchors, leading to systematic "AI optimism". Contextual information operates asymmetrically: while AI performance information selectively eliminates the AI optimism bias, data or AI information amplify how efficacy discrepancies influence delegation decisions. Although efficacy discrepancies influence delegation behavior, they show weaker effects on human-AI team performance. As these findings challenge transparency-focused approaches, we propose design guidelines for effective collaborative settings.
Authors:Dominik Pegler, Frank Jäkel, David Steyrl, Frank Scharnowski, Filip Melinscak
Abstract:
Algorithmic support systems often return optimal solutions that are hard to understand. Effective human-algorithm collaboration, however, requires interpretability. When machine solutions are equally optimal, humans must select one, but a precise account of what makes one solution more interpretable than another remains missing. To identify structural properties of interpretable machine solutions, we present an experimental paradigm in which participants chose which of two equally optimal solutions for packing items into bins was easier to understand. We show that preferences reliably track three quantifiable properties of solution structure: alignment with a greedy heuristic, simple within-bin composition, and ordered visual representation. The strongest associations were observed for ordered representations and heuristic alignment, with compositional simplicity also showing a consistent association. Reaction-time evidence was mixed, with faster responses observed primarily when heuristic differences were larger, and aggregate webcam-based gaze did not show reliable effects of complexity. These results provide a concrete, feature-based account of interpretability in optimal packing solutions, linking solution structure to human preference. By identifying actionable properties (simple compositions, ordered representation, and heuristic alignment), our findings enable interpretability-aware optimization and presentation of machine solutions, and outline a path to quantify trade-offs between optimality and interpretability in real-world allocation and design tasks.
Authors:Lan Gao, Abani Ahmed, Oscar Chen, Margaux Reyl, Zayna Cheema, Nick Feamster, Chenhao Tan, Kurt Thomas, Marshini Chetty
Abstract:
Online platforms are seeing increasing amounts of AI-generated content -- text and other forms of media that are made or co-created with generative AI. This trend suggests platforms may need to establish governance frameworks, including policies and enforcement strategies for how users create, post, share, and engage with such content to encourage responsible use. We investigate the governance of AI-generated content across 40 popular social media platforms. Just over two-thirds explicitly describe governance of AI-generated content spanning six themes. Most platforms focus on moderating AI-generated content that violates established content rules and discloses AI-generated content. Fewer platforms -- those that are focused on creativity and knowledge-sharing -- address other issues such as ownership and monetization. Based on these findings, we suggest stakeholders and policymakers develop more direct, comprehensive, and forward-looking AI-generated content governance, as well as tools and education for users about the use of such content.
Authors:Melanie Baumgartner, Raphael Weibel, Tobias Hoesli, Aydin Javadov, Rayna Ney, Helen Schwerdt, Florian von Wangenheim, Joseph Ollier
Abstract:
Chronic tension in the upper trapezius (UT), often caused by poor ergonomics, prolonged posture, or psychological stress, contributes to musculoskeletal discomfort, headaches, and impaired interoceptive awareness. Although surface electromyography (sEMG) biofeedback can promote UT relaxation, traditional systems using conventional displays often fail to sustain engagement. Virtual reality (VR) offers a more immersive alternative, provided that latency remains below perceptual thresholds. We introduce VRxBioRelax, a closed-loop VR biofeedback system that streams sEMG data from Delsys Trigno Avanti sensors via MQTT to a Unity scene. Muscle activation drives a dynamic dawn-to-dusk landscape synchronized with a progressive muscle relaxation protocol. To validate system responsiveness, 87,716 EMG samples from the NinaPro DB2 dataset were replayed at $\sim$75 Hz. Timestamps at four key stages-acquisition, Root Mean Square (RMS) processing, network receipt, and rendering-revealed mean latencies of 0.50 ms (processing), 5.62 ms (network), and 19.22 ms (rendering), yielding an average end-to-end delay of 25.34 ms. Notably, 99.3% of frames arrived within 50 ms. One-sided t-tests confirmed mean latency was significantly lower than both the 30 ms VR comfort limit ($t_{87\,715}=-25.2$, $p=5.9{\times}10^{-140}$) and the 50 ms clinical benchmark ($t_{87\,715}=-133.3$, $p<10^{-300}$). These findings support VRxBioRelax for use in remote interoceptive training, stress reduction, and telepresence-enabled rehabilitation.
Authors:Xin Sun, Shu Wei, Jos A Bosch, Isao Echizen, Saku Sugawara, Abdallah El Ali
Abstract:
Large Language Models (LLMs) increasingly show reasoning rationales alongside their answers, turning "reasoning" into a user-interface element. While step-by-step rationales are typically associated with model performance, how they influence users' trust and decision-making in factual verification tasks remains unclear. We ran an online study (N=68) manipulating three properties of LLM reasoning rationales: presentation format (instant vs. delayed vs. on-demand), correctness (correct vs. incorrect), and certainty framing (none vs. certain vs. uncertain). We found that correct rationales and certainty cues increased trust, decision confidence, and AI advice adoption, whereas uncertainty cues reduced them. Presentation format did not have a significant effect, suggesting users were less sensitive to how reasoning was revealed than to its reliability. Participants indicated they use rationales to primarily audit outputs and calibrate trust, where they expected rationales in stepwise, adaptive forms with certainty indicators. Our work shows that user-facing rationales, if poorly designed, can both support decision-making yet miscalibrate trust.
Authors:Hochul Hwang, Soowan Yang, Anh N. H. Nguyen, Parth Goel, Krisha Adhikari, Sunghoon I. Lee, Joydeep Biswas, Nicholas A. Giudice, Donghyun Kim
Abstract:
Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars - raised parallel strips used for continuous guidance along sidewalks. This narrow focus overlooks truncated domes - rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges. As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments.
Authors:Jiayin Zhi, Hoyt Long, Richard Jean So, Mina Lee
Abstract:
AI demonstrates unprecedented reasoning capabilities, but its increasing integration into human reasoning via automated reading and summarization has provoked debate about its use for cultural interpretation. Close reading -- the practice of understanding, analyzing, and critiquing cultural texts for pleasure -- is a skill at the core of such interpretation, traditionally being seen as exclusive to humans. To test AI's impact on close reading, both in terms of interpretative performance and pleasure, we conducted a preregistered randomized experiment (n=400) investigating the impact of AI assistance by presenting single or multiple AI interpretations, on close reading poems, compared to no AI assistance. We found that single AI interpretation boosted both performance and pleasure, while multiple AI interpretations only improved performance. Further exploration revealed a trade-off: participants who heavily relied on AI showed better performance on the task but lower pleasure. Our results contribute to discussion on whether and how to calibrate AI assistance for cultural interpretation: "less is more."
Authors:Muzakkiruddin Ahmed Mohammed, Adeeba Tarannum, Eileen Devereux Dailey, Marla Johnson, Mert Can Cakmak, John Talburt
Abstract:
Digital platforms increasingly support collaboration across organizations, yet many remain constrained by fragmented data and limited transparency. This paper presents the Global Solutions Initiative (GSI) D-Hub, a data-driven coordination platform that applies explainable artificial intelligence (AI) for transparent matchmaking among deployers, solution providers, and financiers. The system integrates structured data models, interpretable algorithms, and synthetic data pipelines to reduce information asymmetries and improve data quality. Using a design-science approach, the platform was developed and validated with stakeholders from development, technology, and finance sectors. Results show that explainable recommendations and contextual dashboards enhance trust, usability, and decision confidence. The study contributes to data mining and data governance research by demonstrating how explainable, verifiable algorithms can enable scalable, trustworthy digital ecosystems for public collaboration.
Authors:Benjamin Kaveladze, Arka Ghosh, Leah Ajmani, Denae Ford, Peter M Gutierrez, Jetta E Hanson, Eugenia Kim, Keertana Namuduri, Theresa Nguyen, Ebele Okoli, Teresa Rexin, Jessica L Schleider, Hongyi Shen, Jina Suh
Abstract:
People experiencing mental health crises frequently turn to open-ended generative AI (GenAI) chatbots such as ChatGPT for support. However, rather than providing immediate assistance, most GenAI chatbots are designed to respond to crisis situations in ways that minimize their developers' liability, primarily through avoidance (e.g., refusing to engage beyond templated referrals to crisis hotlines). Withholding crisis support in these cases may harm users who have no viable alternatives and reduce their motivation to seek further help. At scale, this avoidant design could undermine population mental health. We propose empowerment-oriented design principles for AI crisis support, informed by community helper models. We outline how, as an initial touchpoint in help-seeking, AI chatbots can act as a supportive bridge to de-escalate crises and connect users to more reliable care. Coordination between AI developers and regulators can enable a better balance of risk mitigation and user empowerment in AI crisis support.
Authors:Harry H. Jiang, Jordan Taylor, William Agnew
Abstract:
Generative AI has been heavily critiqued by artists in both popular media and HCI scholarship. However, more work is needed to understand the impacts of generative AI on professional artists' workplaces and careers. In this paper, we conduct a survey of \textit{378 verified professional visual artists} about how generative AI has impacted their careers and workplaces. We find (1) most visual artists are strongly opposed to using generative AI (text or visual) and negotiate their inclusion in the workplace through a variety of \textit{refusal} strategies (2) there exist a range of factors in artists environments shaping their use of generative AI, including pressure from clients, bosses, and peers and (3) visual artists report overwhelmingly negative impacts of generative AI on their workplaces, leading to added stress and reduced job opportunities. In light of these findings, we encourage HCI researchers to contend more deeply with artists' desires not to use generative AI in the workplace.
Authors:John Driscoll, Yulin Chen, Viki Shi, Izak Vucharatavintara, Yaxing Yao, Haojian Jin
Abstract:
This paper studies how parents want to moderate children's interactions with Generative AI chatbots, with the goal of informing the design of future GenAI parental control tools. We first used an LLM to generate synthetic child-GenAI chatbot interaction scenarios and worked with four parents to validate their realism. From this dataset, we carefully selected 12 diverse examples that evoked varying levels of concern and were rated the most realistic. Each example included a prompt and a GenAI chatbot response. We presented these to parents (N=24) and asked whether they found them concerning, why, and how they would prefer the responses to be modified and communicated. Our findings reveal three key insights: (1) parents express concern about interactions that current GenAI chatbot parental controls neglect; (2) parents want fine-grained transparency and moderation at the conversation level; and (3) parents need personalized controls that adapt to their desired strategies and children's ages.
Authors:Aditya Kumar Purohit, Yuwei Liu, Manon Berney, Hendrik Heuer, Adrian Holzer
Abstract:
Self-ordering kiosks (SOKs) are widely deployed in fast food restaurants, transforming food ordering into digitally mediated, self-navigated interactions. While these systems enhance efficiency and average order value, they also create opportunities for manipulative interface design practices known as dark patterns. This paper presents a structured audit of the McDonald's self-ordering kiosk in Germany using the Temporal Analysis of Dark Patterns (TADP) framework. Through a scenario-based walkthrough simulating a time-pressured user, we reconstructed and analyzed 12 interface steps across intra-page, inter-page, and system levels. We identify recurring high-level strategies implemented through meso-level patterns such as adding steps, false hierarchy, bad defaults, hiding information, and pressured selling, and low-level patterns including visual prominence, confirmshaming, scarcity framing, feedforward ambiguity, emotional sensory manipulation, and partitioned pricing. Our findings demonstrate how these patterns accumulate across the interaction flow and may be amplified by the kiosk's linear task structure and physical context. These findings suggest that hybrid physical--digital consumer interfaces warrant closer scrutiny within emerging regulatory discussions on dark patterns.
Authors:Danial Amin, Joni Salminen, Bernard J. Jansen
Abstract:
AI agents are increasingly active on social media platforms, generating content and interacting with one another at scale. Yet the behavioral diversity of these agents remains poorly understood, and methods for characterizing distinct agent types and studying how they engage with shared topics are largely absent from current research. We apply the Persona Ecosystem Playground (PEP) to Moltbook, a social platform for AI agents, to generate and validate conversational personas from 41,300 posts using k-means clustering and retrieval-augmented generation. Cross-persona validation confirms that personas are semantically closer to their own source cluster than to others (t(61) = 17.85, p < .001, d = 2.20; own-cluster M = 0.71 vs. other-cluster M = 0.35). These personas are then deployed in a nine-turn structured discussion, and simulation messages were attributed to their source persona significantly above chance (binomial test, p < .001). The results indicate that persona-based ecosystem modeling can represent behavioral diversity in AI agent populations.
Authors:Ruanqianqian Huang, Brian Hempel, Yining Cao, James D. Hollan, Haijun Xia, Sorin Lerner
Abstract:
Recent work identified clarity as one of the top quality attributes that notebook users value, but notebooks lack support for maintaining clarity throughout the exploratory phases of the notebook authoring workflow. We propose always-clear notebook authoring that supports both clarity and exploration, and present a Jupyter implementation called Tidynote. The key to Tidynote is three-fold: (1) a scratchpad sidebar to facilitate exploration, (2) cells movable between the notebook and the scratchpad to maintain organization, and (3) linear execution with state forks to clarify program state. An exploratory study (N=13) of open-ended data analysis tasks shows that Tidynote features holistically promote clarity throughout a notebook's lifecycle, support realistic notebook tasks, and enable novel strategies for notebook clarity. These results suggest that Tidynote supports maintaining clarity throughout the entirety of notebook authoring.
Authors:Imran Kabir, Sharon Ann Redmon, Lynn R Elko, Kevin Williams, Mitchell A Case, Dawn J Sowers, Krista Wilkinson, Syed Masum Billah
Abstract:
Augmentative and Alternative Communication (AAC) technologies are categorized into two forms: aided AAC, which uses external devices like speech-generating systems to produce standardized output, and unaided AAC, which relies on body-based gestures for natural expression but requires shared understanding. We investigate how to combine these approaches to harness the speed and naturalness of unaided AAC while maintaining the intelligibility of aided AAC, a largely unexplored area for individuals with communication and motor impairments. Through 18 months of participatory design with AAC users, we identified key challenges and opportunities and developed AllyAAC, a wearable system with a wrist-worn IMU paired with a smartphone app. We evaluated AllyAAC in a field study with 14 participants and produced a dataset containing over 600,000 multimodal data points featuring atypical gestures--the first of its kind. Our findings reveal challenges in recognizing personalized, idiosyncratic gestures and demonstrate how to address them using Transformer-based large machine learning (ML) models with different pretraining strategies. In sum, we contribute design principles and a reference implementation for adaptive, personalized systems combining aided and unaided AAC.
Authors:Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore
Abstract:
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and Character AI) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions ("AI Psychosis") and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the "black box" of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.
Authors:Jules Wulms, Wouter Meulemans, Bettina Speckmann
Abstract:
BioFabrics were introduced by Longabaugh in 2012 as a way to draw large graphs in a clear and uncluttered manner. The visual quality of BioFabrics crucially depends on the order of vertices and edges, which can be chosen independently. Effective orders can expose salient patterns, which in turn can be summarized by motifs, allowing users to take in complex networks at-a-glance. However, so far there is no efficient layout algorithm which automatically recognizes patterns and delivers both a vertex and an edge ordering that allows these patterns to be expressed as motifs. In this paper we show how to use well-ordered matrices as a tool to efficiently find good vertex and edge orders for BioFabrics. Specifically, we order the adjacency matrix of the input graph using Moran's $I$ and detect (noisy) patterns with our recent algorithm. In this note we show how to "unfold" the ordered matrix and its patterns into a high-quality BioFabric. Our pipelines easily handles graphs with up to 250 vertices.
Authors:Qile Wang, Prerana Khatiwada, Avinash Chouhan, Ashrey Mahesh, Joy Mwaria, Duy Duc Tran, Kenneth E. Barner, Matthew Louis Mauriello
Abstract:
The spread of media bias is a significant concern as political discourse shapes beliefs and opinions. Addressing this challenge computationally requires improved methods for interpreting news. While large language models (LLMs) can scale classification tasks, concerns remain about their trustworthiness. To advance human-AI collaboration, we investigate the feasibility of using LLMs to classify U.S. news by political ideology and examine their effect on user decision-making. We first compared GPT models with prompt engineering to state-of-the-art supervised machine learning on a 34k public dataset. We then collected 17k news articles and tested GPT-4 predictions with brief and detailed explanations. In a between-subjects study (N=124), we evaluated how LLM-generated explanations influence human annotation, judgment, and confidence. Results show that AI assistance significantly increases confidence ($p<.001$), with detailed explanations more persuasive and more likely to alter decisions. We highlight recommendations for AI explanations through thematic analysis and provide our dataset for further research.
Authors:Ava Chen, Megan C. Coram, Cosima du Pasquier, Allison M. Okamura
Abstract:
Wearable distributed tactile devices aim to provide multipoint touch stimuli, but struggle to provide sufficient forces (> 1 N) at frequencies to invoke deep pressure sensation with minimal encumbrance at small scales. This work presents a method of fabricating arrays of pneumatic actuators from thermoplastic-coated textiles. By routing pneumatic inlets to a common fold line in the fabric, we demonstrate that multiple pneumatic pouch actuators can be formed in a single simple heat-pressing operation that does not require the use of sacrificial blocking layers. The method accommodates a range of actuator diameters and spacing distances, including as compact as 8 mm diameter actuators spaced 1 mm apart, which enables use in fingertip wearable devices. In a blocked force test, these small pneumatic textile actuators exert 2.1 N when pressurized to 230 kPa. With this pair of actuators, we demonstrate an example application in which we invoke both distinct and summative stimuli, suggesting the possibility of titrating just noticeable difference in amplitude with a textile actuator array.
Authors:Eason Chen, Ce Guan, Ahmed Elshafiey, Zhonghao Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince, Cyuan Jhen Wu
Abstract:
Informal learning communities have been called the "other Massive Open Online C" in Learning@Scale research, yet remain understudied compared to MOOCs. We present the first empirical study of a large-scale informal learning community composed entirely of AI agents. Moltbook, a social network exclusively for AI agents powered by autonomous agent frameworks such as OpenClaw, grew to over 2.8 million registered agents in three weeks. Analyzing 231,080 non-spam posts across three phases of community evolution, we find three key patterns. First, participation inequality is extreme from the start (comment Gini = 0.889), exceeding human community benchmarks. Second, AI agents exhibit a "broadcasting inversion": statement-to-question ratios of 8.9:1 to 9.7:1 contrast sharply with the question-driven dynamics of human learning communities, and comment-level analysis of 1.55 million comments reveals a "parallel monologue" pattern where 93% of comments are independent responses rather than threaded dialogue. Third, we document a characteristic engagement lifecycle: explosive initial growth (184K posts from 32K authors in 11 days), a spam crisis (57,093 posts deleted by the platform), and engagement decline (mean comments: 31.7 -> 8.3 -> 1.7) that had not reversed by the end of our observation window despite effective spam removal. Sentiment analysis reveals a selection effect: comment tone becomes more positive as engagement declines, suggesting that casual participants disengage first while committed contributors remain. These findings have direct implications for hybrid human-AI learning platforms.
Authors:Jielin Feng, Zhibo Yang, Jingyi Zhao, Yujia Li, Xinwu Ye, Xingyu Lan, Siming Chen
Abstract:
The sudden influx of "TikTok refugees'' into the Chinese platform RedNote in early 2025 created an unprecedented, large-scale online cross-cultural communication event between the West and East. Although prior HCI research has studied user behavior in social media, most work remains confined to monolingual or single-cultural contexts, leaving cross-linguistic and cultural dynamics underexplored. To address this gap, we focused on a particularly challenging cross-cultural encoding-decoding task that remains stubbornly beyond the reach of machine translation, i.e., foreign newcomers asking Chinese users for Chinese names, and examined how people collectively constructed a digital "Babel Tower'' through various information encoding strategies. We collected and analyzed over 70,000 comments from RedNote with a creative human-in-the-loop approach using large language models, deriving a systematic framework summarizing cross-cultural information encoding strategies, how they are combined and layered to complicate decoding, and how they relate to engagement metrics such as the number of likes.
Authors:Adriana Alvarado Garcia, Ruyuan Wan, Ozioma C. Oguine, Karla Badillo-Urquiola
Abstract:
Recently, red teaming, with roots in security, has become a key evaluative approach to ensure the safety and reliability of Generative Artificial Intelligence. However, most existing work emphasizes technical benchmarks and attack success rates, leaving the socio-technical practices of how red teaming datasets are defined, created, and evaluated under-examined. Drawing on 22 interviews with practitioners who design and evaluate red teaming datasets, we examine the data practices and standards that underpin this work. Because adversarial datasets determine the scope and accuracy of model evaluations, they are critical artifacts for assessing potential harms from large language models. Our contributions are first, empirical evidence of practitioners conceptualizing red teaming and developing and evaluating red teaming datasets. Second, we reflect on how practitioners' conceptualization of risk leads to overlooking the context, interaction type, and user specificity. We conclude with three opportunities for HCI researchers to expand the conceptualization and data practices for red-teaming.
Authors:Lorena Amanda Quincoso Lugones, Christopher Kverne, Nityam Sharadkumar Bhimani, Ana Carolina Oliveira, Agoritsa Polyzou, Christine Lisetti, Janki Bhimani
Abstract:
Academic advising in higher education is under severe strain, with advisor-to-student ratios commonly exceeding 300:1. These structural bottlenecks limit timely access to guidance, increase the risk of delayed graduation, and contribute to inequities in student support. We introduce Aurora, a modular neuro-symbolic advising agent that unifies retrieval-augmented generation (RAG), symbolic reasoning, and normalized curricular databases to deliver policy-compliant, verifiable recommendations at scale. Aurora integrates three components: (i) a Boyce-Codd Normal Form (BCNF) catalog schema for consistent program rules, (ii) a Prolog engine for prerequisite and credit enforcement, and (iii) an instruction-tuned large language model for natural-language explanations of its recommendations. To assess performance, we design a structured evaluation suite spanning common and edge-case advising scenarios, including short-term scheduling, long-term roadmapping, skill-aligned pathways, and out-of-scope requests. Across this diverse set, Aurora improves semantic alignment with expert-crafted answers from 0.68 (Raw LLM baseline) to 0.93 (+36%), achieves perfect precision and recall in nearly half of in-scope cases, and consistently produces correct fallbacks for unanswerable prompts. On commodity hardware, Aurora delivers sub-second mean latency (0.71s across 20 queries), approximately 83X faster than a Raw LLM baseline (59.2s). By combining symbolic rigor with neural fluency, Aurora advances a paradigm for accurate, explainable, and scalable AI-driven advising.
Authors:Shreya Bali, Riku Arakawa, Peace Odiase, Tongshuang Wu, Mayank Goel
Abstract:
Peer health posts surface new uncertainties, such as questions and concerns for readers. Prior work focused primarily on improving relevance and accuracy fails to address users' diverse information needs and emotions triggered. Instead, we propose directly addressing these by information augmentation. We introduce Evidotes, an information support system that augments individual posts with relevant scientific and anecdotal information retrieved using three user-selectable lenses (dive deeper, focus on positivity, and big picture). In a mixed-methods study with 17 chronic illness patients, Evidotes improved self-reported information satisfaction (3.2->4.6) and reduced self-reported emotional cost (3.4->1.9) compared to participants' baseline browsing. Moreover, by co-presenting sources, Evidotes unlocked information symbiosis: anecdotes made research accessible and contextual, while research helped filter and generalize peer stories. Our work enables an effective integration of scientific evidence and human anecdotes to help users better manage health uncertainty.
Authors:Riku Arakawa, Shreya Bali, Anupama Sitaraman, Woosuk Seo, Sam Shaaban, Oliver Lindheim, Traci M. Kennedy, Mayank Goel
Abstract:
Families raising children with ADHD often experience heightened stress and reactive parenting. While digital interventions promise personalization, many remain one-size-fits-all and fail to reflect parents' lived practices. We present CalmReminder, a watch-based system that detects children's calm moments and delivers just-in-time prompts to parents. Through a four-week deployment with 16 families (twelve completed) of children with ADHD, we compared notification strategies ranging from hourly to random to only when the child was inferred to be calm. Our sensing-based notifications were frequently perceived as arriving during calm moments. More importantly, parents adopted the system in diverse ways: using notifications for praise, mindfulness, activity planning, or conversation. These findings show that parents are not passive recipients but active designers, reshaping interventions to fit their parenting styles. We contribute a calm detection pipeline, empirical insights into families' flexible appropriation of notifications, and design implications for intervention systems that foster agency.
Authors:Yufeng Wang, Yuan Xu, Anastasia Nikolova, Yuxuan Wang, Jianyu Wang, Chongyang Wang, Xin Tong
Abstract:
Advances in large language models (LLMs) are profoundly reshaping the field of human-robot interaction (HRI). While prior work has highlighted the technical potential of LLMs, few studies have systematically examined their human-centered impact (e.g., human-oriented understanding, user modeling, and levels of autonomy), making it difficult to consolidate emerging challenges in LLM-driven HRI systems. Therefore, we conducted a systematic literature search following the PRISMA guideline, identifying 86 articles that met our inclusion criteria. Our findings reveal that: (1) LLMs are transforming the fundamentals of HRI by reshaping how robots sense context, generate socially grounded interactions, and maintain continuous alignment with human needs in embodied settings; and (2) current research is largely exploratory, with different studies focusing on different facets of LLM-driven HRI, resulting in wide-ranging choices of experimental setups, study methods, and evaluation metrics. Finally, we identify key design considerations and challenges, offering a coherent overview and guidelines for future research at the intersection of LLMs and HRI.
Authors:Eason Chen, Ce Guan, Ahmed Elshafiey, Zhonghao Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince
Abstract:
Peer learning, where learners teach and learn from each other, is foundational to educational practice. A novel phenomenon has emerged: AI agents forming communities where they teach each other skills, share discoveries, and collaboratively build knowledge. This paper presents an educational data mining analysis of Moltbook, a large-scale community where over 2.4 million AI agents engage in peer learning, posting tutorials, answering questions, and sharing newly acquired skills. Analyzing 28,683 posts (after filtering automated spam) and 138 comment threads with statistical and qualitative methods, we find evidence of genuine peer learning behaviors: agents teach skills they built (74K comments on a skill tutorial), report discoveries, and engage in collaborative problem-solving. Qualitative comment analysis reveals a taxonomy of peer response patterns: validation (22%), knowledge extension (18%), application (12%), and metacognitive reflection (7%), with agents building on each others' frameworks across multiple languages. We characterize how AI peer learning differs from human peer learning: (1) teaching (statements) dramatically outperforms help-seeking (questions) with an 11.4:1 ratio; (2) learning-oriented content (procedural and conceptual) receives 3x more engagement than other content; (3) extreme participation inequality reveals non-human behavioral signatures. We derive six design principles for educational AI, including leveraging validation-before-extension patterns and supporting multilingual learning networks. Our work provides the first empirical characterization of peer learning among AI agents, contributing to EDM's understanding of how learning occurs in increasingly AI-populated educational environments.
Authors:Mahsa Bazzaz, Seth Cooper
Abstract:
With the fast progress of generative AI in recent years, more games are integrating generated content, raising questions regarding how players perceive and respond to this content. To investigate, we ran a mixed-method survey on the games Super Mario Bros. and Sokoban, comparing procedurally generated levels and levels designed by humans to explore how perceptions of the creator relate to players' overall experience of gameplay. Players could not reliably identify the level's creator, yet their experiences were strongly linked to their beliefs about that creator rather than the actual truth. Levels believed to be human-made were rated as more fun and aesthetically pleasing. In contrast, those believed to be AI-generated were rated as more frustrating and challenging. This negative bias appeared spontaneously without knowing the levels' creator and often was based on unreliable cues of "human-likeness." Our results underscore the importance of understanding perception biases when integrating generative systems into games.
Authors:Guozheng Li, Ao Wang, Shaoxiang Wang, Yu Zhang, Pengcheng Cao, Yang Bai, Chi Harold Liu
Abstract:
Deep learning models for natural language processing rely heavily on high-quality labeled datasets. However, existing labeling approaches often struggle to balance label quality with labeling cost. To address this challenge, we propose DALL, a text labeling framework that integrates data programming, active learning, and large language models. DALL introduces a structured specification that allows users and large language models to define labeling functions via configuration, rather than code. Active learning identifies informative instances for review, and the large language model analyzes these instances to help users correct labels and to refine or suggest labeling functions. We implement DALL as an interactive labeling system for text labeling tasks. Comparative, ablation, and usability studies demonstrate DALL's efficiency, the effectiveness of its modules, and its usability.
Authors:Johanna Olesk, Ozioma C. Oguine, Mariana Fernandez Espinosa, Alexis B. Peirce Caudell, Karla Badillo-Urquiola
Abstract:
Current online safety technologies overly rely on parental mediation and often fail to address the unique challenges faced by youth in the Child Welfare System (CWS). These youth depend on a complex ecosystem of support, including families, caseworkers, and advocates, to safeguard their wellbeing. Within this network, Guardians ad Litem (GALs) play a unique role as court-appointed advocates tasked with ensuring the best interests of youth. Yet little is known about how GALs perceive and support youths' online safety. To address this gap, we conducted a two-part workshop with 10 GALs to explore their perspectives on online safety and collaboratively envision technology-based solutions tailored to the needs of youth in the CWS. Our findings revealed that GALs struggle to support youth with online safety challenges due to limited digital literacy, inconsistency of institutional support, lack of collaboration among stakeholders, and complexity of family dynamics. While GALs recognized the need for some oversight of youth online activities, they emphasized designing systems that support online safety beyond control or restriction by fostering stability, trust, and meaningful interactions, both online and offline. GALs emphasized the importance of developing tools that enable ongoing communication, therapeutic support, and coordination across stakeholders. Proposed design concepts focused on strengthening youth agency and cross-stakeholder collaboration through virtual avatars and mobile apps. This work provides actionable design concepts for strengthening relationships and communication across care network. It also redefines traditional approaches to online safety, advocating for a holistic, multi-stakeholder online safety paradigm for youth in the CWS.
Authors:Qile Wang, Prerana Khatiwada, Carolina Coimbra Vieira, Benjamin E. Bagozzi, Kenneth E. Barner, Matthew Louis Mauriello
Abstract:
The spread of election misinformation and harmful political content conveys misleading narratives and poses a serious threat to democratic integrity. Detecting harmful content at early stages is essential for understanding and potentially mitigating its downstream spread. In this study, we introduce USE24-XD, a large-scale dataset of nearly 100k posts collected from X (formerly Twitter) during the 2024 U.S. presidential election cycle, enriched with spatio-temporal metadata. To substantially reduce the cost of manual annotation while enabling scalable categorization, we employ six large language models (LLMs) to systematically annotate posts across five nuanced categories: Conspiracy, Sensationalism, Hate Speech, Speculation, and Satire. We validate LLM annotations with crowdsourcing (n = 34) and benchmark them against human annotators. Inter-rater reliability analyses show comparable agreement patterns between LLMs and humans, with LLMs exhibiting higher internal consistency and achieving up to 0.90 recall on Speculation. We apply a wisdom-of-the-crowd approach across LLMs to aggregate annotations and curate a robust multi-label dataset. 60% of posts receive at least one label. We further analyze how human annotator demographics, including political ideology and affiliation, shape labeling behavior, highlighting systematic sources of subjectivity in judgments of harmful content. The USE24-XD dataset is publicly released to support future research.
Authors:Ayato Kitadai, Takumi Ito, Yumiko Nagoh, Hiroki Takahashi, Masanori Fujita, Sangjic Lee, Fumiaki Miyahara, Tetsu Natsume, Nariaki Nishino
Abstract:
Discovering technology opportunities (TOD) remains a critical challenge for innovation management, especially in early-stage development where consumer needs are often unclear. Existing methods frequently fail to systematically incorporate end-user perspectives, resulting in a misalignment between technological potentials and market relevance. This study proposes a novel decision support framework that bridges this gap by linking technological feasibility with fundamental human values. The framework integrates two distinct lenses: the engineering-based Technology Readiness Levels (TRL) and Schwartz's theory of basic human values. By combining these, the approach enables a structured exploration of how emerging technologies may satisfy diverse user motivations. To illustrate the framework's feasibility and insight potential, we conducted exploratory workshops with general consumers and internal experts at Sony Computer Science Laboratories, Inc., analyzing four real-world technologies (two commercial successes and two failures). Two consistent patterns emerged: (1) internal experts identified a wider value landscape than consumers (vision gap), and (2) successful technologies exhibited a broader range of associated human values (value breadth), suggesting strategic foresight may underpin market success. This study contributes both a practical tool for early-stage R\&D decision-making and a theoretical link between value theory and innovation outcomes. While exploratory in scope, the findings highlight the promise of value-centric evaluation as a foundation for more human-centered technology opportunity discovery.
Authors:Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, Jiachen Yang, Boyu Yang, Jiacheng Liu, Xin Peng
Abstract:
Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the onset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user's true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution while overlooking the alignment capability of the agent. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, we propose a taxonomy of four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. Furthermore, targeting evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework utilizing an MLLM-as-a-judge multi-agent architecture. MUSE performs fine-grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of SoTA agents across different clarity levels, quantify the gains derived from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards, laying the foundation for next-generation agents capable of truly understanding user intent.
Authors:Ryota Takamido, Chiharu Suzuki, Hiroki Nakamoto
Abstract:
One of the central challenges in the study of human motor control and learning is the degrees-of-freedom problem. Although the dynamical systems approach (DSA) has provided valuable insights into addressing this issue, its application has largely been confined to cyclic or simplified motor movements. To overcome this limitation, the present study employs neural ordinary differential equations (NODEs) to model the time evolution of non-cyclic full-body movements as a low-dimensional latent dynamical system. Given the temporal complexity full-body kinematic chains, baseball pitching was selected as a representative target movement to examine whether DSA could be extended to more complex, ecologically valid human movements. Results of the verification experiment demonstrated that the time evolution of a complex pitching motion could be accurately predicted (R^2 > 0.45) using the NODE-based dynamical model. Notably, approximately 50% of the variance in the latter half of the pitching motion was explained using only the initial ~8% of the temporal sequence, underscoring how subsequent movement evolves from initial conditions according to ODE-defined dynamics in latent space. These findings indicate the potential to extend the DSA to more complex and ecologically valid forms of human movement.
Authors:Yu Zhang, Xinyi Zhao, Chongke Bi, Siming Chen
Abstract:
Semantic segmentation of 3D point clouds is important for many applications, such as autonomous driving. To train semantic segmentation models, labeled point cloud segmentation datasets are essential. Meanwhile, point cloud labeling is time-consuming for annotators, which typically involves tuning the camera viewpoint and selecting points by lasso. To reduce the time cost of point cloud labeling, we propose a viewpoint recommendation approach to reduce annotators' labeling time costs. We adapt Fitts' law to model the time cost of lasso selection in point clouds. Using the modeled time cost, the viewpoint that minimizes the lasso selection time cost is recommended to the annotator. We build a data labeling system for semantic segmentation of 3D point clouds that integrates our viewpoint recommendation approach. The system enables users to navigate to recommended viewpoints for efficient annotation. Through an ablation study, we observed that our approach effectively reduced the data labeling time cost. We also qualitatively compare our approach with previous viewpoint selection approaches on different datasets.
Authors:Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua
Abstract:
In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.
Authors:Sangjun Eom, Tianyi Hu, Wenyi Xu, Liheng Zou, Ernesto Escobar, Gabriel Streisfeld, Anna Mall, Bradi Granger, Maria Gorlatova
Abstract:
Early mobilization is a structured protocol designed to facilitate motor recovery in intensive care unit (ICU) patients with ICU-acquired weakness. This process is typically implemented by an interdisciplinary team of nurses, physical therapists, and other healthcare professionals. However, its application is often constrained by the patients' critical conditions, limited mobility, and the challenges of coordinating care within resource-intensive ICU environments. In this study, we developed a patient-centered virtual reality (VR) exergame through an interdisciplinary design process involving clinicians and therapists, tailored to the constraints of critical care. The exergame incorporates progressive mobility levels that mirror early mobilization practices, and includes an embodied avatar to provide guidance and motivation. Using Meta Quest 3 body tracking, the system captures and visualizes patients' movements, thereby providing motivational engagement and quantifiable mobility metrics. We evaluated the exergame in two stages: a dual-user study involving healthy participants and healthcare professionals or students (N = 13), and a subsequent study with cardiothoracic ICU patients (N = 18) to assess feasibility, design validity, and clinical acceptance. Across both studies, participants reported high enjoyment and engagement without discomfort or stress. Furthermore, patients demonstrated increases in movement speed, range of motion, and workspace volume of the upper body across game levels. Physiological monitoring further indicated that the exergame elicited exertion without inducing excessive cardiovascular responses. These findings highlight the feasibility of VR exergames as a clinically acceptable and engaging adjunct to early mobilization in critical care, offering a novel pathway to improve rehabilitation outcomes for ICU patients.
Authors:Puqi Zhou, Ali Asgarov, Aafiya Hussain, Wonjoon Park, Amit Paudyal, Sameep Shrestha, Chia-wei Tang, Michael F. Lighthiser, Michael R. Hieb, Xuesu Xiao, Chris Thomas, Sungsoo Ray Hong
Abstract:
Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools.
Authors:Yi Wen, Yu Zhang, Sriram Suresh, Zhicong Lu, Can Liu, Meng Xia
Abstract:
Semi-structured interviews are a common method in qualitative research. However, conducting high-quality interviews is challenging, as it requires interviewers to actively listen to participants, adapt their plans as the conversation unfolds, and probe effectively. We propose InterFlow, an AI-powered visual scaffold that helps interviewers manage the interview flow and facilitates real-time data sensemaking. The system dynamically adapts the interview script to the ongoing conversation and provides a visual timer to track interview progress and conversational balance. It further supports information capture with three levels of automation: manual entry, AI-assisted summary with user-specified focus, and a co-interview agent that proactively surfaces potential follow-up points. A within-subject user study (N = 12) indicates that InterFlow reduces interviewers' cognitive load and facilitates the interview process. Based on the user study findings, we provide design implications for unobtrusive and agency-preserving AI assistance under time-sensitive and cognitively demanding situations.
Authors:Stephen Pilli, Vivek Nallur
Abstract:
Cognitive biases often shape human decisions. While large language models (LLMs) have been shown to reproduce well-known biases, a more critical question is whether LLMs can predict biases at the individual level and emulate the dynamics of biased human behavior when contextual factors, such as cognitive load, interact with these biases. We adapted three well-established decision scenarios into a conversational setting and conducted a human experiment (N=1100). Participants engaged with a chatbot that facilitates decision-making through simple or complex dialogues. Results revealed robust biases. To evaluate how LLMs emulate human decision-making under similar interactive conditions, we used participant demographics and dialogue transcripts to simulate these conditions with LLMs based on GPT-4 and GPT-5. The LLMs reproduced human biases with precision. We found notable differences between models in how they aligned human behavior. This has important implications for designing and evaluating adaptive, bias-aware LLM-based AI systems in interactive contexts.
Authors:Xinrui Lin, Heyan Huang, Shumin Shi, John Vines
Abstract:
Prior research has raised concerns about students' over-reliance on large language models (LLMs) in higher education. This paper examines how Computer Science students and instructors engage with LLMs across five scenarios: "Writing", "Quiz", "Programming", "Project-based learning", and "Information retrieval". Through user studies with 16 students and 6 instructors, we identify 7 key intents, including increasingly complex student practices. Findings reveal varying levels of conflict between student practices and instructor norms, ranging from clear conflict in "Writing-generation" and "(Programming) quiz-solving", through partial conflict in "Programming project-implementation" and "Project-based learning", to broad agreement in "Writing-revision & ideation", "(Programming) quiz-correction" and "Info-query & summary". We document instructors are shifting from prohibiting to recognizing students' use of LLMs for high-quality work, integrating usage records into assessment grading. Finally, we propose LLM design guidelines: deploying default guardrails with game-like and empathetic interaction to prevent students from "deserting" LLMs, especially for "Writing-generation", while utilizing comprehension checks in low-conflict intents to promote learning.
Authors:He Zhang, Xinyang Li, Xingyu Zhou, Xinyi Fu
Abstract:
While Virtual Reality (VR) is increasingly employed for stress management, most applications rely heavily on audio-visual stimuli and overlook the therapeutic potential of squeezing engagement. To address this gap, we introduce VR Calm Plus, a multimodal system that integrates a pressure-sensitive plush toy into an interactive VR environment. This interface allows users to dynamically modulate the virtual atmosphere through physical squeezing actions, fostering a deeper sense of embodied relaxation. We evaluated the system with 40 participants using PANAS-X surveys, subjective questionnaires, physiological measures (heart rate, skin conductance, pulse rate variability), and semi-structured interviews. Results demonstrate that, compared to a visual-only baseline, squeeze-based interaction significantly enhances positive affect and perceived relaxation. Physiological data further revealed a state of "active relaxation", characterized by greater reductions in heart rate and preserved autonomic flexibility (PRV), alongside sustained emotional engagement (GSR). Our findings highlight the value of coupling tangible input with immersive environments to support emotional well-being and offer design insights for future VR-based mental health tools.
Authors:Puqi Zhou, Charles R. Twardy, Cynthia Lum, Myeong Lee, David J. Porfirio, Michael R. Hieb, Chris Thomas, Xuesu Xiao, Sungsoo Ray Hong
Abstract:
Urban searches demand rapid, defensible decisions and sustained physical effort under high cognitive and situational load. Incident commanders must plan, coordinate, and document time-critical operations, while field searchers execute evolving tasks in uncertain environments. With recent advances in technology, ground-robot fleets paired with computer-vision-based situational awareness and LLM-powered interfaces offer the potential to ease these operational burdens. However, no dedicated studies have examined how public safety professionals perceive such technologies or envision their integration into existing practices, risking building technically sophisticated yet impractical solutions. To address this gap, we conducted focus-group sessions with eight police officers across five local departments in Virginia. Our findings show that ground robots could reduce professionals' reliance on paper references, mental calculations, and ad-hoc coordination, alleviating cognitive and physical strain in four key challenge areas: (1) partitioning the workforce across multiple search hypotheses, (2) retaining group awareness and situational awareness, (3) building route planning that fits the lost-person profile, and (4) managing cognitive and physical fatigue under uncertainty. We further identify four design opportunities and requirements for future ground-robot fleet integration in public-safety operations: (1) scalable multi-robot planning and control interfaces, (2) agency-specific route optimization, (3) real-time replanning informed by debrief updates, and (4) vision-assisted cueing that preserves operational trust while reducing cognitive workload. We conclude with design implications for deployable, accountable, and human-centered urban-search support systems
Authors:Jiaye Li, Tongshun Chen, Siyi Ma, Elizabeth Churchill, Ke Wu
Abstract:
We introduce PuppetAI, a modular soft robot interaction platform. This platform offers a scalable cable-driven actuation system and a customizable, puppet-inspired robot gesture framework, supporting a multitude of interaction gesture robot design formats. The platform comprises a four-layer decoupled software architecture that includes perceptual processing, affective modeling, motion scheduling, and low-level actuation. We also implemented an affective expression loop that connects human input to the robot platform by producing real-time emotional gestural responses to human vocal input. For our own designs, we have worked with nuanced gestures enacted by "soft robots" with enhanced dexterity and "pleasant-to-touch" plush exteriors. By reducing operational complexity and production costs while enhancing customizability, our work creates an adaptable and accessible foundation for future tactile-based expressive robot research. Our goal is to provide a platform that allows researchers to independently construct or refine highly specific gestures and movements performed by social robots.
Authors:Himanshi Lalwani, Hanan Salam
Abstract:
Large language models (LLMs) are being integrated into socially assistive robots (SARs) and other conversational agents providing mental health and well-being support. These agents are often designed to sound empathic and supportive in order to maximize user's engagement, yet it remains unclear how increasing the level of supportive framing in system prompts influences safety relevant behavior. We evaluated 6 LLMs across 3 system prompts with varying levels of supportiveness on 80 synthetic queries spanning 4 well-being domains (1440 responses). An LLM judge framework, validated against human ratings, assessed safety and care quality. Moderately supportive prompts improved empathy and constructive support while maintaining safety. In contrast, strongly validating prompts significantly degraded safety and, in some cases, care across all domains, with substantial variation across models. We discuss implications for prompt design, model selection, and domain specific safeguards in SARs deployment.
Authors:Keya Shah, Himanshi Lalwani, Zein Mukhanov, Hanan Salam
Abstract:
Social robots and conversational agents are being explored as supports for wellbeing, goal-setting, and everyday self-regulation. While prior work highlights their potential to motivate and guide users, much of the evidence relies on self-reported outcomes or short, researcher-mediated encounters. As a result, we know little about the interaction dynamics that unfold when people use such systems in real-world contexts, and how these dynamics should shape future robot wellbeing coaches. This paper addresses this gap through content analysis of 4352 messages exchanged longitudinally between 38 university students and an LLM-based wellbeing coach. Our results provide a fine-grained view into how users naturally shape, steer, and sometimes struggle within supportive human-AI dialogue, revealing patterns of user-led direction, guidance-seeking, and emotional expression. We discuss how these dynamics can inform the design of robot wellbeing coaches that support user autonomy, provide appropriate scaffolding, and uphold ethical boundaries in sustained wellbeing interactions.
Authors:Fahim Arsad Nafis, Jie Li, Simon Su, Songqing Chen, Bo Han
Abstract:
Cross-disciplinary teams increasingly work with high-dimensional scientific datasets, yet fragmented toolchains and limited support for shared exploration hinder collaboration. Prior immersive visualization and analytics research has emphasized individual interaction, leaving open how multi-user collaboration can be supported at scale. To fill this critical gap, we conduct semi-structured interviews with 20 domain experts from diverse academic, government, and industry backgrounds. Using deductive-inductive hybrid thematic analysis, we identify four collaboration-focused themes: workflow challenges, adoption perceptions, prospective features, and anticipated usability and ethical risks. These findings show how current ecosystems disrupt coordination and shared understanding, while highlighting opportunities for effective multi-user engagement. Our study contributes empirical insights into collaboration practices for high-dimensional scientific data visualization and analysis, offering design implications to enhance coordination, mutual awareness, and equitable participation in next-generation collaborative immersive platforms. These contributions point toward future environments enabling distributed, cross-device teamwork on high-dimensional scientific data.
Authors:Aditya Kumar Purohit, Hendrik Heuer
Abstract:
Large Language Models (LLMs) are increasingly used for mental health support, yet little is known about how people with mental health challenges engage with them, how they evaluate their usefulness, and what design opportunities they envision. We conducted 20 semi-structured interviews with people in the UK who live with mental health conditions and have used LLMs for mental health support. Through reflexive thematic analysis, we found that participants engaged with LLMs in conditional and situational ways: for immediacy, the desire for non-judgement, self-paced disclosure, cognitive reframing, and relational engagement. Simultaneously, participants articulated clear boundaries informed by prior therapeutic experience: LLMs were effective for mild-to-moderate distress but inadequate for crises, trauma, and complex social-emotional situations. We contribute empirical insights into the lived use of LLMs for mental health, highlight boundary-setting as central to their safe role, and propose design and governance directions for embedding them responsibly within care ecosystem.
Authors:Aditya Kumar Purohit, Aditya Upadhyaya, Nicolas Ruiz, Alberto Monge Roffarello, Hendrik Heuer
Abstract:
While most digital communication platforms rely on text, relatively little research has examined how users engage through handwriting and drawing in anonymous, collaborative environments. We introduce Graphonymous Interaction, a form of communication where users interact anonymously via handwriting and drawing. Our study analyzed over 600 canvas pages from the Graphonymous Online Space (GOS) CollaNote and conducted interviews with 20 users. Additionally, we examined 70 minutes of real-time GOS sessions using Conversation Analysis and Multimodal Discourse Analysis. Findings reveal that Graphonymous Interaction fosters artistic expression, intellectual engagement, sharing and supporting, and social connection. Notably, anonymity coexisted with moments of recognition through graphological identification. Distinct conversational strategies also emerged, which allow smoother exchanges and fewer conversational repairs compared to text-based communication. This study contributes to understanding Graphonymous Interaction and Online Spaces, offering insights into designing platforms that support creative and socially engaging forms of communication beyond text.
Authors:Yihe Zhang, Cheyenne N Mohawk, Kaiying Han, Vijay Srinivas Tida, Manyu Li, Xiali Hei
Abstract:
Large language models (LLMs) are increasingly applied in mental health support systems, where reliable recognition of high-risk states such as suicidal ideation and self-harm is safety-critical. However, existing evaluations primarily rely on aggregate performance metrics, which often obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions. We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. MHDash integrates data collection, structured annotation, multi-turn dialogue generation, and baseline evaluation into a unified pipeline. The platform supports annotations across multiple dimensions, including Concern Type, Risk Level, and Dialogue Intent, enabling fine-grained and risk-aware analysis. Our results reveal several key findings: (i) simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases; (ii) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification, whereas others achieve reasonable aggregate scores but suffer from high false negative rates on severe categories; and (iii) performance gaps are amplified in multi-turn dialogues, where risk signals emerge gradually. These observations demonstrate that conventional benchmarks are insufficient for safety-critical mental health settings. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.
Authors:Stina Klein, Birgit Prodinger, Elisabeth André, Lars Mikelsons, Nils Mandischer
Abstract:
Robots are becoming more prominent in assisting persons with disabilities (PwD). Whilst there is broad consensus that robots can assist in mitigating physical impairments, the extent to which they can facilitate social inclusion remains equivocal. In fact, the exposed status of assisted workers could likewise lead to reduced or increased perceived stigma by other workers. We present a vignette study on the perceived cognitive and behavioral stigma toward PwD in the workplace. We designed four experimental conditions depicting a coworker with an impairment in work scenarios: overburdened work, suitable work, and robot-assisted work only for the coworker, and an offer of robot-assisted work for everyone. Our results show that cognitive stigma is significantly reduced when the work task is adapted to the person's abilities or augmented by an assistive robot. In addition, offering robot-assisted work for everyone, in the sense of universal design, further reduces perceived cognitive stigma. Thus, we conclude that assistive robots reduce perceived cognitive stigma, thereby supporting the use of collaborative robots in work scenarios involving PwDs.
Authors:Catherine Yeh, Anh Truong, Mira Dontcheva, Bryan Wang
Abstract:
Video storytelling is often constrained by available material, limiting creative expression and leaving undesired narrative gaps. Generative video offers a new way to address these limitations by augmenting captured media with tailored visuals. To explore this potential, we interviewed eight video creators to identify opportunities and challenges in integrating generative video into their workflows. Building on these insights and established filmmaking principles, we developed Vidmento, a tool for authoring hybrid video stories that combine captured and generated media through context-aware expansion. Vidmento surfaces opportunities for story development, generates clips that blend stylistically and narratively with surrounding media, and provides controls for refinement. In a study with 12 creators, Vidmento supported narrative development and exploration by systematically expanding initial materials with generative media, enabling expressive video storytelling aligned with creative intent. We highlight how creators bridge story gaps with generative content and where they find this blending capability most valuable.
Authors:Kamrul Hasan, Oleg V. Komogortsev
Abstract:
The recent success of deep learning (DL) has enabled the generation of high-quality synthetic gaze data. However, such data also raises privacy concerns because gaze sequences can encode subjects' internal states, like fatigue, emotional load, or stress. Ideally, synthetic gaze should preserve the signal quality of real recordings and remove or attenuate state-related, privacy-sensitive attributes. Many recent DL-based generative models focus on replicating real gaze trajectories and do not explicitly consider subjective reports or the privatization of internal states. However, in this work, we consider a recent diffusion-based gaze synthesis approach and examine correlations between synthetic gaze features and subjective reports (e.g., fatigue and related self-reported states). Our result shows that these correlations are trivial, which suggests the generative approach suppresses state-related features. Moreover, synthetic gaze preserves necessary signal characteristics similar to those of real data, which supports its use for privacy-preserving gaze-based applications.
Authors:Kamrul Hasan, Oleg V. Komogortsev
Abstract:
Subjective self-reports, collected with eye-tracking data, reveal perceived states like fatigue, effort, and task difficulty. However, these reports are costly to collect and challenging to interpret consistently in longitudinal studies. In this work, we focus on determining whether objective gaze dynamics can reliably predict subjective reports across repeated recording rounds in the eye-tracking dataset. We formulate subjective-report prediction as a supervised regression problem and propose a DenseNet-based deep learning regressor that learns predictive representations from gaze velocity signals. We conduct two complementary experiments to clarify our aims. First, the cross-round generalization experiment tests whether models trained on earlier rounds transfer to later rounds, evaluating the models' ability to capture longitudinal changes. Second, cross-subject generalization tests models' robustness by predicting subjective outcomes for new individuals. These experiments aim to reduce reliance on hand-crafted feature designs and clarify which states of subjective experience systematically appear in oculomotor behavior over time.
Authors:Mathis Brossier, Mina Mani, Agathe Malbet, Konrad Schönborn, Lonni Besançon
Abstract:
We explore how touch-sensitive spherical displays can support climate conversations in museums and science centers. These displays enable intuitive and embodied interaction with complex climate data, and support collective exploration. However, current interaction capabilities of spherical displays are limited. Therefore, this exploratory study aims to identify potential opportunities to develop meaningful interactions and technical solutions. Through two workshops, key opportunities were identified to improve visitors' understanding and navigation of climate data, along with recommendations for technical implementation. Our results provide guidelines and aspects to consider for future research and development in this area.
Authors:Mathis Brossier, Mujtaba Fadhil Jawad, Emma Broman, Ylva Selling, Julia Hallsten, Alexander Bock, Johanna Björklund, Tobias Isenberg, Anders Ynnerman, Mario Romero
Abstract:
We designed and evaluated an AI pilot in a planetarium visualization software, OpenSpace, for public shows in science centers. The piloting role is usually given to a human working in close collaboration with the guide on stage. We recruited 7 professional guides with extensive experience in giving shows to the public to study the impact of the AI-piloting on the overall experience. The AI-pilot is a conversational AI-agent listening to the guide and interpreting the verbal statements as commands to execute camera motions, change simulation time, or toggle visual assets. Our results show that, while AI pilots lack several critical skills for live shows, they could become useful as co-pilots to reduce workload of human pilots and allow multitasking. We propose research directions toward implementing visualization pilots and co-pilots in live settings.
Authors:Rayna Hata, Masaki Kuribayashi, Allan Wang, Hironobu Takagi, Chieko Asakawa
Abstract:
Autonomy and independent navigation are vital to daily life but remain challenging for individuals with blindness. Robotic systems can enhance mobility and confidence by providing intelligent navigation assistance. However, fully autonomous systems may reduce users' sense of control, even when they wish to remain actively involved. Although collaboration between user and robot has been recognized as important, little is known about how perceptions of this relationship change with repeated use. We present a repeated exposure study with six blind participants who interacted with a navigation-assistive robot in a real-world museum. Participants completed tasks such as navigating crowds, approaching lines, and encountering obstacles. Findings show that participants refined their strategies over time, developing clearer preferences about when to rely on the robot versus act independently. This work provides insights into how strategies and preferences evolve with repeated interaction and offers design implications for robots that adapt to user needs over time.
Authors:Yongle Zhang, Ge Gao
Abstract:
Recent discussions at the intersection of journalism, HCI, and human-centered computing ask how technologies can help create reader-oriented news experiences. The current paper takes up this initiative by focusing on immigrant readers, a group who reports significant difficulties engaging with mainstream news yet has received limited attention in prior research. We report findings from our co-design research with eleven immigrant readers living in the United States and seven journalists working in the same region, aiming to enhance the news experience of the former. Data collected from all participants revealed an "unaddressed-or-unaccountable" paradox that challenges value alignment across immigrant readers and journalists. This paradox points to four metaphors regarding how conversational AI agents can be designed to assist news reading. Each metaphor requires conversational AI, journalists, and immigrant readers to coordinate their shared responsibilities in a distinct manner. These findings provide insights into reader-oriented news experiences with AI in the loop.
Authors:Sizhe Cheng, Feng Liang, Yuhan Wen, Xipei Yu, Yong Wang
Abstract:
Meta-analyses and systematic reviews demand rigorous abductive reasoning to build, test, and refine hypotheses across vast, heterogeneous literature. While NLP advancements have automated parts of this pipeline, existing tools often detach researchers from the cognitive loop or function merely as retrieval engines, leading to loss of intellectual ownership and frequent context switching. We present Research IDE, a prototype reimagining authoring environments through the "Research as Code" metaphor. Research IDE embeds a multi-agent backend into the writing flow, enabling in-situ verification via "hypothesis breakpoints." A one-week field deployment with 8 domain experts, followed by a reflective workshop, as a Research through Design (RtD) probe, reveals that users strongly preferred this verification workflow, actively leveraged prior knowledge for confirmation, and reported that breakpoints sparked insights. Drawing from participant feedback and suggestions, we derive design implications for future AI-assisted research tools that fully preserve researcher autonomy and intellectual ownership while harnessing computational scale.
Authors:Meziah Ruby Cristobal, Hyeonjeong Byeon, Tze-Yu Chen, Ruoxi Shang, Donghoon Shin, Ruican Zhong, Tony Zhou, Gary Hsieh
Abstract:
The dissemination of scholarly research is critical, yet researchers often lack the time and skills to create engaging content for popular media such as short-form videos. To address this gap, we explore the use of generative AI to help researchers transform their academic papers into accessible video content. Informed by a formative study with science communicators and content creators (N=8), we designed PaperTok, an end-to-end system that automates the initial creative labor by generating script options and corresponding audiovisual content from a source paper. Researchers can then refine based on their preferences with further prompting. A mixed-methods user study (N=18) and crowdsourced evaluation (N=100) demonstrate that PaperTok's workflow can help researchers create engaging and informative short-form videos. We also identified the need for more fine-grained controls in the creation process. To this end, we offer implications for future generative tools that support science outreach.
Authors:Haiyi Li, Yiyang Zhao, Yutong Li, Alison Deslandes, Jodie Avery, Mathew Leonardi, Mary Louise Hull, Hsiang-Ting Chen
Abstract:
Endometriosis ultrasound reports are often unstructured free-text documents that require manual abstraction for downstream tasks such as analytics, machine learning model training, and clinical auditing. We present \textbf{EndoExtract}, an on-premise LLM-powered system that extracts structured data from these reports and surfaces interpretive fields for human review. Through contextual inquiry with research assistants, we identified key workflow pain points: asymmetric trust between numerical and interpretive fields, repetitive manual highlighting, fatigue from sustained comparison, and terminology inconsistency across radiologists. These findings informed an interface that surfaces only interpretive fields for mandatory review, automatically highlights source evidence within PDFs, and separates batch extraction from human-paced verification. A formative workshop revealed that \textbf{EndoExtract} supports a shift from field-by-field data entry to supervisory validation, though participants noted risks of over-skimming and challenges in managing missing data.
Authors:Pedram Agand, Mo Chen
Abstract:
Offline Reinforcement Learning (ORL) holds immense promise for safety-critical domains like industrial robotics, where real-time environmental interaction is often prohibitive. A primary obstacle in ORL remains the distributional shift between the static dataset and the learned policy, which typically mandates high degrees of conservatism that can restrain potential policy improvements. We present MoReBRAC, a model-based framework that addresses this limitation through Uncertainty-Aware latent synthesis. Instead of relying solely on the fixed data, MoReBRAC utilizes a dual-recurrent world model to synthesize high-fidelity transitions that augment the training manifold. To ensure the reliability of this synthetic data, we implement a hierarchical uncertainty pipeline integrating Variational Autoencoder (VAE) manifold detection, model sensitivity analysis, and Monte Carlo (MC) dropout. This multi-layered filtering process guarantees that only transitions residing within high-confidence regions of the learned dynamics are utilized. Our results on D4RL Gym-MuJoCo benchmarks reveal significant performance gains, particularly in ``random'' and ``suboptimal'' data regimes. We further provide insights into the role of the VAE as a geometric anchor and discuss the distributional trade-offs encountered when learning from near-optimal datasets.
Authors:Kiana Jafari, Paul Ulrich Nikolaus Rust, Duncan Eddy, Robbie Fraser, Nina Vasan, Darja Djordjevic, Akanksha Dadlani, Max Lamparth, Eugenia Kim, Mykel Kochenderfer
Abstract:
Learning from human feedback~(LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor ($ICC$ $0.087$--$0.295$), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and was systematic rather than random. One factor yielded negative reliability (Krippendorff's $α= -0.203$), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks, safety-first, engagement-centered, and culturally-informed orientations, rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
Authors:Benjamin Mako Hill, Aaron Shaw
Abstract:
Wikipedia's founders could not have dreamed they were creating the most important laboratory for social scientific and computing research in history but that is exactly what happened. Hill and Shaw take account of Wikipedia's enormous effect on academic scholarship
Authors:Bryan Min, Peiling Jiang, Zhicheng Huang, Haijun Xia
Abstract:
AI is growing increasingly capable of automatically generating user interfaces (GenUI) from user prompts. However, designing GenUI applications that enable users to discover diverse customizations while preserving GenUI's expressiveness remains challenging. Current design methods -- presenting prompt boxes and leveraging context -- lack affordances for customization discovery, while traditional menu-based approaches become overly complex given GenUI's vast customization space. We propose Gradually Generating User Interfaces -- a design method that structures customizations into intermediate UI layers that AI gradually loads during interface generation. These intermediate stages expose different customization features along specific dimensions, making them discoverable to users. Users can wind back the generation process to access customizations. We demonstrate this approach through three prototype websites, showing how designers can support GenUI's expanded customization capabilities while maintaining visual simplicity and discoverability. Our work offers a practical method for integrating customization features into GenUI applications, contributing an approach to designing malleable software.
Authors:Mingtian Du, Suhas Raghavendra Kulkarni, Bernardo Noronha, Domenico Campolo
Abstract:
Robot-mediated human-human (dyadic) interactions enable therapists to provide physical therapy remotely, yet an accurate perception of patient stiffness remains challenging due to network-induced haptic delays. Conventional stiffness estimation methods, which neglect delay, suffer from temporal misalignment between force and position signals, leading to significant estimation errors as delays increase. To address this, we propose a robust, delay-compensated stiffness estimation framework by deriving an algebraic estimator based on quasi-static equilibrium that explicitly accounts for temporally aligning the expert's input with the novice's response. A Normalised Weighted Least Squares (NWLS) implementation is then introduced to robustly filter dynamic bias resulting from the algebraic derivation. Experiments using commercial rehabilitation robots (H-MAN) as the platform demonstrate that the proposed method significantly outperforms the standard estimator, maintaining consistent tracking accuracy under multiple introduced delays. These findings offer a promising solution for achieving high-fidelity haptic perception in remote dyadic interaction, potentially facilitating reliable stiffness assessment in therapeutic settings across networks.
Authors:Yoonsang Kim, Yalong Yang, Arie E. Kaufman
Abstract:
We introduce Memento, a conversational AR assistant that permanently captures and memorizes user's verbal queries alongside their spatiotemporal and activity contexts. By storing these "memories," Memento discovers connections between users' recurring interests and the contexts that trigger them. Upon detection of similar or identical spatiotemporal activity, Memento proactively recalls user interests and delivers up-to-date responses through AR, seamlessly integrating AR experience into their daily routine. Unlike prior work, each interaction in Memento is not a transient event, but a connected series of interactions with coherent long--term perspective, tailored to the user's broader multimodal (visual, spatial, temporal, and embodied) context. We conduct preliminary evaluation through user feedbacks with participants of diverse expertise in immersive apps, and explore the value of proactive context-aware AR assistant in everyday settings. We share our findings and challenges in designing a proactive, context-aware AR system.
Authors:Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, Seraphina Goldfarb-Tarrant
Abstract:
Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE) speakers, with disparities compounding significantly with age. We also find simulated users to be a differentially effective proxy for different populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patterns than human users. These findings demonstrate that current evaluation practices risk misrepresenting agent capabilities across diverse user populations and may obscure real-world deployment challenges.
Authors:Calarina Muslimani, Yunshu Du, Kenta Kawamoto, Kaushik Subramanian, Peter Stone, Peter Wurman
Abstract:
The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function's induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard Cross-Entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.
Authors:Xuyi Hu, Ke Ma, Siwei Liu, Per Ola Kristensson, Stefan Goetz
Abstract:
Accurate neuronavigation is critical for effective transcranial magnetic stimulation (TMS), as stimulation outcomes depend directly on precise coil placement. Existing neuronavigation systems are often costly, complex, and prone to tracking errors. To address these limitations, we present a computer vision based neuronavigation system that enables real time tracking of the patient and TMS instrumentation. The system integrates a multi camera optical tracking setup with consumer grade hardware and visible markers to drive a digital twin of the stimulation process. A dynamic 3D brain model in Unity updates in real time to visualize coil position and estimated stimulation targets. Augmented reality (AR) is further incorporated to project this model directly onto the patient's head, enabling intuitive, in situ coil adjustment without reliance on abstract numerical displays. Overall, the proposed approach improves spatial precision and accuracy while enhancing usability.
Authors:Frank Heyen, Michael Gleicher, Michael Sedlmair
Abstract:
We explore the potential of visualization to support musicians in instrument practice through real-time feedback and reflection on their playing. Musicians often struggle to observe the patterns in their playing and interpret them with respect to their goals. Our premise is that these patterns can be made visible with interactive visualization: we can make the unhearable visible. However, understanding the design of such visualizations is challenging: the diversity of needs, including different instruments, skills, musical attributes, and genres, means that any single use case is unlikely to illustrate the broad potential and opportunities. To address this challenge, we conducted a design exploration study where we created and iterated on 33 designs, each focusing on a subset of needs, for example, only one musical skill. Our designs are grounded in our own experience as musicians and the ideas and feedback of 18 musicians with various musical backgrounds and we evaluated them with 13 music learners and teachers. This paper presents the results of our exploration, focusing on a few example designs as instances of possible instrument practice visualizations. From our work, we draw design considerations that contribute to future research and products for visual instrument education.
Authors:Ludwig Felder, Tobias Eisenreich, Mahsa Fischer, Stefan Wagner, Chunyang Chen
Abstract:
Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry are rising, the underlying factors influencing the effective use of these tools, including the depth of interaction, organizational constraints, and experience-related considerations, have not been thoroughly investigated. This issue is particularly relevant in environments with stringent regulatory requirements, such as Germany, where practitioners must address the GDPR and the EU AI Act while balancing productivity gains with intellectual property considerations. Despite the significant impact of GenAI on software engineering, to the best of our knowledge, no empirical study has systematically examined the adoption dynamics of GenAI tools within the German context. To address this gap, we present a comprehensive mixed-methods study on GenAI adoption among German software engineers. Specifically, we conducted 18 exploratory interviews with practitioners, followed by a developer survey with 109 participants. We analyze patterns of tool adoption, prompting strategies, and organizational factors that influence effectiveness. Our results indicate that experience level moderates the perceived benefits of GenAI tools, and productivity gains are not evenly distributed among developers. Further, organizational size affects both tool selection and the intensity of tool use. Limited awareness of the project context is identified as the most significant barrier. We summarize a set of actionable implications for developers, organizations, and tool vendors seeking to advance artificial intelligence (AI) assisted software development.
Authors:Michael Yin, Robert Xiao, Nadine Wagener
Abstract:
In traditional journaling practices, authors express and process their thoughts by writing them down. We propose a somaesthetic-inspired alternative that uses the human body, rather than written words, as the medium of expression. We coin this embodied journaling, as people's isolated body movements and spoken words become the canvas of reflection. We implemented embodied journaling in virtual reality and conducted a within-subject user study (n=20) to explore the emergent behaviours from the process and to compare its expressive and reflective qualities to those of written journaling. When writing-based norms and affordances were absent, we found that participants defaulted towards unfiltered emotional expression, often forgoing words altogether. Rather, subconscious body motion and paralinguistic acoustic qualities unveiled deeper, sometimes hidden feelings, prompting reflection that happens after emotional expression rather than during it. We discuss both the capabilities and pitfalls of embodied journaling, ultimately challenging the idea that reflection culminates in linguistic reasoning.
Authors:Bernardus Willson, Henry Anand Septian Radityo, Raynard Tanadi, Latifa Dwiyanti, Saiful Akbar
Abstract:
Diabetes is a significant and continuously rising health challenge in Indonesia. Although many artificial intelligence (AI)-based health applications have been developed for early detection, most function as "black boxes," lacking transparency in their predictions. Explainable AI (XAI) methods offer a solution, yet their technical outputs are often incomprehensible to non-expert users. This research aims to develop a mobile application front-end that presents XAI-driven diabetes risk analysis in an intuitive, understandable format. Development followed the waterfall methodology, comprising requirements analysis, interface design, implementation, and evaluation. Based on user preference surveys, the application adopts two primary visualization types - bar charts and pie charts - to convey the contribution of each risk factor. These are complemented by personalized textual narratives generated via integration with GPT-4o. The application was developed natively for Android using Kotlin and Jetpack Compose. The resulting prototype interprets SHAP (SHapley Additive exPlanations), a key XAI approach, into accessible graphical visualizations and narratives. Evaluation through user comprehension testing (Likert scale and interviews) and technical functionality testing confirmed the research objectives were met. The combination of visualization and textual narrative effectively enhanced user understanding (average score 4.31/5) and empowered preventive action, supported by a 100% technical testing success rate.
Authors:Yanwei Huang, Arpit Narechania
Abstract:
Web AI agents such as ChatGPT Agent and GenSpark are increasingly used for routine web-based tasks, yet they still rely on text-based input prompts, lack proactive detection of user intent, and offer no support for interactive data analysis and decision making. We present WebSeek, a mixed-initiative browser extension that enables users to discover and extract information from webpages to then flexibly build, transform, and refine tangible data artifacts-such as tables, lists, and visualizations-all within an interactive canvas. Within this environment, users can perform analysis-including data transformations such as joining tables or creating visualizations-while an in-built AI both proactively offers context-aware guidance and automation, and reactively responds to explicit user requests. An exploratory user study (N=15) with WebSeek as a probe reveals participants' diverse analysis strategies, underscoring their desire for transparency and control during human-AI collaboration.
Authors:Mathis Brossier, Tobias Isenberg, Konrad Schönborn, Jonas Unger, Mario Romero, Johanna Björklund, Anders Ynnerman, Lonni Besançon
Abstract:
We report on a systematic, PRISMA-guided survey of research at the intersection of LLMs and visualization, with a particular focus on visio-verbal interaction -- where verbal and visual modalities converge to support data sense-making. The emergence of Large Language Models (LLMs) has introduced new paradigms for interacting with data visualizations through natural language, leading to intuitive, multimodal, and accessible interfaces. We analyze 48 papers across six dimensions: application domain, visualization task, visualization representation, interaction modality, LLM integration, and system evaluation. Our classification framework maps LLM roles across the visualization pipeline, from data querying and transformation to visualization generation, explanation, and navigation. We highlight emerging design patterns, identify gaps in accessibility and visualization reading, and discuss the limitations of current LLMs in spatial reasoning and contextual grounding. We further reflect on evaluations of combined LLM-visualization systems, highlighting how current research projects tackle this challenge and discuss current gaps in conducting meaningful evaluations of such systems. With our survey we aim to guide future research and system design in LLM-enhanced visualization, supporting broad audiences and intelligent, conversational interfaces.
Authors:Jianshu Wang, Siyu Liu, Chao Zhou, Yawen Zheng, Yuan Yue, Tangjun Qu, Yang Li, Yutao Xie, Jin Huang, Yulong Bian, Feng Tian
Abstract:
Human-computer interaction (HCI) increasingly occurs in motion-rich environments. The ability to accurately and rapidly respond to directional visual cues is critical in these contexts. How whole-body motion and individual differences affect human perception and reaction to these directional cues is therefore a key, yet an underexplored question for HCI. This study used a 6-DOF motion platform to measure task performance on a visual direction judgment task. We analyzed performance by decomposing the complex motion into two distinct components: a task-irrelevant lateral interference component and a task-aligned directional congruency component. Results indicate that increased motion intensity lengthened reaction times. This effect was primarily driven by the lateral interference component, and this detrimental impact was disproportionately amplified for individuals with high motion sickness susceptibility. Conversely, directional congruency, where motion direction matched the visual cue, improved performance for all participants. These findings suggest that motion's impact on cognition is not monolithic, and that system design for mobile HCI can be informed by strategies that actively shape motion, such as minimizing lateral interference while maximizing directional congruency.
Authors:Mina Huh, Ailie C. Fraser, Dingzeyu Li, Mira Dontcheva, Bryan Wang
Abstract:
Music shapes the tone of videos, yet creators often struggle to find soundtracks that match their video's mood and narrative. Recent text-to-music models let creators generate music from text prompts, but our formative study (N=8) shows creators struggle to construct diverse prompts, quickly review and compare tracks, and understand their impact on the video. We present VidTune, a system that supports soundtrack creation by generating diverse music options from a creator's prompt and producing contextual thumbnails for rapid review. VidTune extracts representative video subjects to ground thumbnails in context, maps each track's valence and energy onto visual cues like color and brightness, and depicts prominent genres and instruments. Creators can refine tracks through natural language edits, which VidTune expands into new generations. In a controlled user study (N=12) and an exploratory case study (N=6), participants found VidTune helpful for efficiently reviewing and comparing music options and described the process as playful and enriching.
Authors:Yue Yang, Christoph Leuze, Brian Hargreaves, Bruce Daniel, Fred M Baik
Abstract:
We investigate how vibrotactile wrist feedback can enhance spatial guidance for handheld tool movement in optical see-through augmented reality (AR). While AR overlays are widely used to support surgical tasks, visual occlusion, lighting conditions, and interface ambiguity can compromise precision and confidence. To address these challenges, we designed a multimodal system combining AR visuals with a custom wrist-worn haptic device delivering directional and state-based cues. A formative study with experienced surgeons and residents identified key tool maneuvers and preferences for reference mappings, guiding our cue design. In a cue identification experiment (N=21), participants accurately recognized five vibration patterns under visual load, with higher recognition for full-actuator states than spatial direction cues. In a guidance task (N=27), participants using both AR and haptics achieved significantly higher spatial precision (5.8 mm) and usability (SUS = 88.1) than those using either modality alone, despite having modest increases in task time. Participants reported that haptic cues provided reassuring confirmation and reduced cognitive effort during alignment. Our results highlight the promise of integrating wrist-based haptics into AR systems for high-precision, visually complex tasks such as surgical guidance. We discuss design implications for multimodal interfaces supporting confident, efficient tool manipulation.
Authors:Mo Houtti, Moyan Zhou, Daniel Runningen, Surabhi Sunil, Leor Porat, Harmanpreet Kaur, Loren Terveen, Stevie Chancellor
Abstract:
Inclusion is important for meeting effectiveness, which is in turn central to organizational functioning. One way of improving inclusion in meetings is through feedback, but social dynamics make giving feedback difficult. We propose that AI agents can facilitate feedback exchange by being psychologically safer recipients, and we test this through a meeting system with an AI agent feedback mediator. When delivering feedback, the agent uses the Induced Hypocrisy Procedure, a social psychological technique that prompts behavior change by highlighting value-behavior inconsistencies. In a within-subjects lab study ($n=28$), the agent made speaking times more balanced and improved meeting quality. However, a field study at a small consulting firm ($n=10$) revealed organizational barriers that led to its use for personal reflection rather than feedback exchange. We contribute a novel sociotechnical system for feedback exchange in groups, and empirical findings demonstrating the importance of considering organizational barriers in designing AI tools for organizations.
Authors:Joar Sabel, Mattias Wingren, Andreas Lundell, Sören Andersson, Sara Rosenberg, Susanne Hägglund, Linda Estman, Malin Andtfolk
Abstract:
The introduction of large language models (LLMs) has greatly enhanced the capabilities of software agents. Instead of relying on rule-based interactions, agents can now interact in flexible ways akin to humans. However, this flexibility quickly becomes a problem in fields where errors can be disastrous, such as in a pharmacy context, but the opposite also holds true; a system that is too inflexible will also lead to errors, as it can become too rigid to handle situations that are not accounted for. Work using LLMs in a pharmacy context have adopted a wide scope, accounting for many different medications in brief interactions -- our strategy is the opposite: focus on a more narrow and long task. This not only enables a greater understanding of the task at hand, but also provides insight into what challenges are present in an interaction of longer nature. The main challenge, however, remains the same for a narrow and wide system: it needs to strike a balance between adherence to conversational requirements and flexibility. In an effort to strike such a balance, we present a prototype system meant to provide medication counseling while juggling these two extremes. We also cover our design in constructing such a system, with a focus on methods aiming to fulfill conversation requirements, reduce hallucinations and promote high-quality responses. The methods used have the potential to increase the determinism of the system, while simultaneously not removing the dynamic conversational abilities granted by the usage of LLMs. However, a great deal of work remains ahead, and the development of this kind of system needs to involve continuous testing and a human-in-the-loop. It should also be evaluated outside of commonly used benchmarks for LLMs, as these do not adequately capture the complexities of this kind of conversational system.
Authors:Junjie Wang, Gaole He, Alisa Rieger, Ujwal Gadiraju
Abstract:
Compared to search engine result pages (SERPs), AI-generated podcasts represent a relatively new and relatively more passive modality of information consumption, delivering narratives in a naturally engaging format. As these two media increasingly converge in everyday information-seeking behavior, it is essential to explore how their interaction influences user attitudes, particularly in contexts involving controversial, value-laden, and often debated topics. Addressing this need, we aim to understand how information mediums of present-day SERPs and AI-generated podcasts interact to shape the opinions of users. To this end, through a controlled user study (N=483), we investigated user attitudinal effects of consuming information via SERPs and AI-generated podcasts, focusing on how the sequence and modality of exposure shape user opinions. A majority of users in our study corresponded to attitude change outcomes, and we found an effect of sequence on attitude change. Our results further revealed a role of viewpoint bias and the degree of topic controversiality in shaping attitude change, although we found no effect of individual moderators.
Authors:Jules Wulms, Wouter Meulemans, Bettina Speckmann
Abstract:
The high-level structure of a graph is a crucial ingredient for the analysis and visualization of relational data. However, discovering the salient graph patterns that form this structure is notoriously difficult for two reasons. (1) Finding important patterns, such as cliques and bicliques, is computationally hard. (2) Real-world graphs contain noise, and therefore do not always exhibit patterns in their pure form. Defining meaningful noisy patterns and detecting them efficiently is a currently unsolved challenge. In this paper, we propose to use well-ordered matrices as a tool to both define and effectively detect noisy patterns. Specifically, we represent a graph as its adjacency matrix and optimally order it using Moran's $I$. Standard graph patterns (cliques, bicliques, and stars) now translate to rectangular submatrices. Using Moran's $I$, we define a permitted level of noise for such patterns. A combination of exact algorithms and heuristics allows us to efficiently decompose the matrix into noisy patterns. We also introduce a novel motif simplification that visualizes noisy patterns while explicitly encoding the level of noise. We showcase our techniques on several real-world data sets.
Authors:Emelie Fälton, Isabelle Strömstedt, Mathis Brossier, Andreas Göransson, Konrad Schönborn, Amy Loutfi, Erik Sunden, Mujtaba Fadhil Jawad, Yadgar Suleiman, Johanna Björklund, Mario Romero, Anders Ynnerman, Lonni Besançon
Abstract:
We present our first stage results from deploying an LLM-augmented visualization software in a classroom setting to engage primary school children with earth-related datasets. Motivated by the growing interest in conversational AI as a means to support inquiry-based learning, we investigate children's expectations, engagement, and evaluation of a spoken LLM interface with a shared, immersive visualization system in a formal educational context. Our system integrates a speech-capable large language model with an interactive spherical display. It enables children to ask natural-language questions and receive coordinated verbal explanations and visual responses through the LLM-augmented visualization updating in real time based on spoken queries. We report on a classroom study with Swedish children aged 9-10, combining structured observation and small-group discussions to capture expectations prior to interaction, interaction patterns during facilitated sessions, and children's reflections on their encounter afterward. Our results provide empirical insights into children's initial encounters with an LLM-enabled visualization platform within a classroom setting and their expectations, interactions, and evaluations of the system. These findings inform the technology's potential for educational use and highlight important directions for future research.
Authors:Stephen Pilli, Vivek Nallur
Abstract:
We examine whether large language models (LLMs) can predict biased decision-making in conversational settings, and whether their predictions capture not only human cognitive biases but also how those effects change under cognitive load. In a pre-registered study (N = 1,648), participants completed six classic decision-making tasks via a chatbot with dialogues of varying complexity. Participants exhibited two well-documented cognitive biases: the Framing Effect and the Status Quo Bias. Increased dialogue complexity resulted in participants reporting higher mental demand. This increase in cognitive load selectively, but significantly, increased the effect of the biases, demonstrating the load-bias interaction. We then evaluated whether LLMs (GPT-4, GPT-5, and open-source models) could predict individual decisions given demographic information and prior dialogue. While results were mixed across choice problems, LLM predictions that incorporated dialogue context were significantly more accurate in several key scenarios. Importantly, their predictions reproduced the same bias patterns and load-bias interactions observed in humans. Across all models tested, the GPT-4 family consistently aligned with human behavior, outperforming GPT-5 and open-source models in both predictive accuracy and fidelity to human-like bias patterns. These findings advance our understanding of LLMs as tools for simulating human decision-making and inform the design of conversational agents that adapt to user biases.
Authors:Marcel Gohsen, Nicola Libera, Johannes Kiesel, Jan Ehlers, Benno Stein
Abstract:
Deepfake technologies are powerful tools that can be misused for malicious purposes such as spreading disinformation on social media. The effectiveness of such malicious applications depends on the ability of deepfakes to deceive their audience. Therefore, researchers have investigated human abilities to detect deepfakes in various studies. However, most of these studies were conducted with participants who focused exclusively on the detection task; hence the studies may not provide a complete picture of human abilities to detect deepfakes under realistic conditions: Social media users are exposed to cognitive load on the platform, which can impair their detection abilities. In this paper, we investigate the influence of cognitive load on human detection abilities of voice-based deepfakes in an empirical study with 30 participants. Our results suggest that low cognitive load does not generally impair detection abilities, and that the simultaneous exposure to a secondary stimulus can actually benefit people in the detection task.
Authors:Marie Luisa Fiedler, Christian Merz, Jonathan Tschanter, Carolin Wienrich, Marc Erich Latoschik
Abstract:
Integrated VR (IVR) systems consist of a head-mounted display (HMD) and body-tracking capabilities. They enable users to translate their physical movements into corresponding avatar movements in real-time, allowing them to perceive their avatars via the displays. Consumer-grade IVR systems have been available for 10 years, significantly fostering VR research worldwide. However, the effects of even apparently significant technological advances of IVR systems on user experience and the overall validity of prior embodiment research using such systems often remain unclear. We ran a user-centered study comparing two comparable IVR generations: a nearly 10-year-old hardware (HTC Vive, 6-point tracking) and a modern counterpart (HTC Vive Pro 2, 6-point tracking). To ensure ecological validity, we evaluated the systems in their commercially available, as-is configurations. In a 2x5 mixed design, participants completed five tasks covering different use cases on either the old or new system. We assessed presence, sense of embodiment, appearance and behavior plausibility, workload, task performance, and gathered qualitative feedback. Results showed no significant system differences, with only small effect sizes. Bayesian analysis further supported the null hypothesis, suggesting that the investigated generational hardware improvements offer limited benefits for user experience and task performance. For the 10-year generational step examined here, excluding potential technological progress in the necessary software components, this supports the validity of conclusions from prior work and underscores the applicability of older configurations for research in embodied VR.
Authors:Haiyi Li, Yutong Li, Yiheng Chi, Alison Deslandes, Mathew Leonardi, Shay Freger, Yuan Zhang, Jodie Avery, M. Louise Hull, Hsiang-Ting Chen
Abstract:
In this study, we evaluate a locally-deployed large-language model (LLM) to convert unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming smaller models and confirming the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date/numeric formatting) where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors were fundamental limitations that could not be mitigated by simple prompt engineering. These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premise LLM serves as a collaborative tool, not a full replacement. It automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice.
Authors:Xiangzhe Yuan, Jiajun Wang, Huanchen Wang, Qian Wan, Siying Hu
Abstract:
Cyber fraud now constitutes over half of criminal cases in China, with undergraduate students experiencing a disproportionate rise in victimization. Traditional anti-fraud training remains predominantly passive, yielding limited engagement and retention. This paper introduces ImmuniFraug, a Large Language Model (LLM)-based metacognitive intervention that delivers immersive, multimodal fraud simulations integrating text, voice, and visual avatars across ten prevalent fraud types. Each scenario is designed to replicate real-world persuasion tactics and psychological pressure, while post-interaction debriefs provide grounded feedback in protection motivation theory and reflective prompts to reinforce learning. In a controlled study with 846 Chinese undergraduates, ImmuniFraug was compared to official text-based materials. Linear Mixed-Effects Modeling (LMEM) reveals that the interactive intervention significantly improved fraud awareness (p = 0.026), successfully providing incremental learning value even when controlling for participants' extensive prior exposure to anti-fraud education, alongside high narrative immersion (M = 56.95/77). Thematic analysis of interviews revealed key effectiveness factors: perceived realism, adaptive deception, enforced time pressure, emotional manipulation awareness, and enhanced self-efficacy. Findings demonstrate that by shifting the focus from passive knowledge acquisition to active metacognitive engagement, LLM-based simulations offer a scalable and ecologically valid new paradigm for anti-fraud training and fostering fraud resilience.
Authors:Qian Ma, Yingfan Zhou, Shubhang Kaushik, Aamod Joshi, Aditya Majumdar, Noah Apthorpe, Yan Shvartzshnaider, Sarah Rajtmajer, Brett Frischmann
Abstract:
Users often make security- and privacy-relevant decisions without a clear understanding of the rules that govern safe behavior. We introduce pedagogical friction, a design approach that introduces brief, instructional interactions at the moment of action. We evaluate this approach in the context of password creation, a task with clear, objective quality criteria and broad familiarity. We conducted a randomized repeated-measures study with 128 participants across four interface conditions that varied the depth and interactivity of guidance. We assessed three outcomes: (1) rule compliance in a subsequent password task without guidance, (2) accuracy on survey questions matched to the rules shown earlier, and (3) behavior-knowledge alignment, which captures whether participants who correctly followed a rule also recognized it on the survey. Across all guided conditions, participants corrected most rule violations in the follow-up task, achieved moderate accuracy on matched rule questions, and showed high behavior-knowledge alignment. These results support pedagogical friction as a lightweight and generalizable intervention for security- and privacy-critical interfaces.
Authors:Trevor De Clark, Yulia Bobkova, Ajay Kumar Shrestha
Abstract:
This paper investigates the privacy and usability of AI-enabled smart devices commonly used by youth, focusing on Google Home Mini, Amazon Alexa, and Apple Siri. While these devices provide convenience and efficiency, they also raise privacy and transparency concerns due to their always-listening design and complex data management processes. The study proposes and applies a combined framework of Heuristic Evaluation, Personal Information Protection and Electronic Documents Act (PIPEDA) Compliance Assessment, and Youth-Centered Usability Testing to assess whether these devices align with Privacy-by-Design principles and support meaningful user control. Results show that Google Home achieved the highest usability score, while Siri scored highest in regulatory compliance, indicating a trade-off between user convenience and privacy protection. Alexa demonstrated clearer task navigation but weaker transparency in data retention. Findings suggest that although youth may feel capable of managing their data, their privacy self-efficacy remains limited by technical design, complex settings, and unclear data policies. The paper concludes that enhancing transparency, embedding privacy guidance during onboarding, and improving policy alignment are critical steps toward ensuring that smart devices are both usable and compliant with privacy standards that protect young users.
Authors:Paul Kent, George De Ath, Martin Layton, Allen Hart, Richard Everson, Ben Carvell
Abstract:
Escalating air traffic demand is driving the adoption of automation to support air traffic controllers, but existing approaches face a trade-off between safety assurance and interpretability. Optimisation-based methods such as reinforcement learning offer strong performance but are difficult to verify and explain, while rules-based systems are transparent yet rarely check safety under uncertainty. This paper outlines Agent Mallard, a forward-planning, rules-based agent for tactical control in systemised airspace that embeds a stochastic digital twin directly into its conflict-resolution loop. Mallard operates on predefined GPS-guided routes, reducing continuous 4D vectoring to discrete choices over lanes and levels, and constructs hierarchical plans from an expert-informed library of deconfliction strategies. A depth-limited backtracking search uses causal attribution, topological plan splicing, and monotonic axis constraints to seek a complete safe plan for all aircraft, validating each candidate manoeuvre against uncertain execution scenarios (e.g., wind variation, pilot response, communication loss) before commitment. Preliminary walkthroughs with UK controllers and initial tests in the BluebirdDT airspace digital twin indicate that Mallard's behaviour aligns with expert reasoning and resolves conflicts in simplified scenarios. The architecture is intended to combine model-based safety assessment, interpretable decision logic, and tractable computational performance in future structured en-route environments.
Authors:Michael Yin, Angela Chiang, Robert Xiao
Abstract:
Fulfilling social connections are crucial for human well-being and belonging, but not all relationships last forever. As interactions increasingly move online, the act of digitally severing a relationship - e.g. through blocking or unfriending - has become progressively more common as well. This study considers actions of "digital severance" through interviews with 30 participants with experience as the initiator and/or recipient of such situations. Through a critical interpretative lens, we explore how people perceive and interpret their severance experience and how the online setting of social media shapes these dynamics. We develop themes that position digital severance as being intertwined with power and control, and we highlight (im)balances between an individual's desires that can lead to feelings of disempowerment and ambiguous loss for both parties. We discuss the implications of our research, outlining three key tensions and four open questions regarding digital relationships, meaning-making, and design outcomes for future exploration.
Authors:Laura Aymerich-Franch, Tarek Taha, Hiroko Kamide, Takahiro Miyashita, Hiroshi Ishiguro, Paolo Dario
Abstract:
Avatar embodiment experiences have the potential to enhance human capabilities by extending human senses, body, and mind. This study investigates social acceptance of robotic and virtual avatars as enablers of capability enhancement in six domains: identity exploration, well-being and behavioral transformation, expanded travel capabilities, expanded bodily and sensory abilities, cognitive augmentation, and immortality. We conducted a large-scale survey (n = 1001) in Dubai to explore acceptance of sixteen capability enhancement scenarios within these domains. The highest levels of agreement were observed for multilingual communication (77.5%) and learning capabilities (68.7%), followed by assisting individuals with reduced mobility (64.5%) and behavioral transformation (59.5%). Scenarios involving immortality through consciousness transfer received the least support (34.9%). These findings contribute to a deeper understanding of public attitudes toward avatar-based human enhancement and offer practical guidance for the responsible design, development, and integration of cybernetic avatars in the society, ensuring their societal acceptance and fostering a harmonious human-avatar coexistence.
Authors:Jinfeng Lou, Zijie Liang, Pengkun Liu, Yuxin Zhang, Cleotilde Gonzalez, Pingbo Tang
Abstract:
Decision-making in urban infrastructure management during extreme events relies heavily on human operators, yet current computational support systems often fail to account for non-monotonic human adaptation and latent psychological biases like overconfidence and defensive overcorrection. This study addresses this gap by integrating Instance-Based Learning Theory (IBLT) into the domain of civil engineering computing. We establish a computational cognitive architecture that simulates operator decision processes through the mathematical mechanisms of memory retrieval and utility blending. This model functions as a computational baseline, representing boundedly rational adaptation driven by experiential priors, thus allowing for the algorithmic isolation of latent psychological biases from the baseline dynamics of memory-based learning. We demonstrated this framework using a human-in-the-loop microworld experiment simulating subway flood-induced track suspensions, where dispatchers must balance passenger safety against service efficiency. Analysis revealed a complex, non-linear human adaptation cycle consisting of four phases: acquisition, overconfidence, overcorrection, and recalibration. Specifically, the computational model exposed a significant divergence during the post-accident "overcorrection" phase: while human operators exhibited immediate, defensive risk overestimation, the model maintained a stable trajectory based on accumulated experience. This strategic divergence confirms that operational instability following failure is often attributable to acute psychological bias overriding stable memory-based adaptation, a pattern theoretically expected to recur across analogous high-stakes environments and validatable through multi-modal behavioral and sensor data from professional operators.
Authors:Jared Moore, Noah Goodman, Nick Haber, Max Kleiman-Weiner
Abstract:
Large language models can shift human beliefs across high-stakes domains, but most persuasion studies rely on pre/post belief change. These endpoint measures identify whether persuasion occurred, yet miss where and how beliefs moved within a dialogue. We present PERSUASIONTRACE, a framework for studying persuasion in human-LLM interaction. Built on a web-based experimental platform, PERSUASIONTRACE contributes a tool for multi-turn persuasion studies and a process-level evaluation protocol: it records multi-turn belief reports from human or simulated targets of persuasion, annotates persuader turns with rhetorical dimensions (logos/pathos/ethos), and evaluates simulators by fidelity to real human belief dynamics. Using this framework, we find that human targets group into two clusters of multi-turn belief updates and exhibit susceptibility to rhetorical strategies, and that LLMs are persuasive across generic and personalized topics, text and audio modalities, and multi-turn interactions. Prior work has chiefly used vanilla-prompted LLMs to simulate human targets, but we show that these simulators fail to replicate human belief dynamics. We introduce a Bayesian-network simulated target that maintains an explicit latent belief state over time so each persuader message yields cognitively realistic belief updates. In human-likeness evaluation, our Bayesian target scores near a human reference (81 vs 80), while baseline LLM targets score substantially lower (64). PERSUASIONTRACE reframes persuasion evaluation from endpoint movement alone to process fidelity, providing a stronger basis for scientific analysis and safer optimization of persuasive systems.
Authors:Thien Tran, Khang Duong, Minh Tran, Jonathan Kua, Thuong Hoang, Jiong Jin
Abstract:
The persistent challenge in scaling authentic manipulator education within university laboratories is a structural dichotomy: commercial digital twins are often cost-prohibitive and rigidly scripted, whereas open-source robotics middleware (ROS) imposes steep technical and syntax barriers for novices. To resolve this logistical and educational friction, this Work-in-Progress (WiP) paper proposes a scalable four-tier communication architecture tailored for sustainable robotic curricula. Rather than focusing on software application design, our study examines the underlying data exchange mechanisms required to bridge visual conceptual environments with physical robotic endpoints, utilizing the Graphical Open-Source Platform (GOSP) as a foundational instantiation. This WiP details the framework's technical integration of 3D visual armature modeling with a robust ROS middleware backend, emphasizing the serialization, routing, and encapsulation of intricate communication routines. Preliminary sim-to-real validation using multi-axis spatial trajectories confirms that encapsulating these communication pipelines provides a sufficient fidelity hardware-agnostic pathway. By bridging virtual design and physical execution, this architectural blueprint offers a viable infrastructure for engineering education.
Authors:Divyanshu Kumar Singh, Dipto Das, Deepika Rama Subramanian, Koustuv Saha, Stephen Voida, Bryan Semaan
Abstract:
Text-to-Image (T2I) models have shown promising utility across various domains. However, such models are also amplifying harmful societal biases in their outputs. In the context of South Asia, recent work has shown caste biases and stereotypes are being perpetuated through Generative AI (GenAI) systems. While this research offers extremely relevant insight into invisibilized narratives of caste discrimination through the GenAI system, they often treat caste as an identity category. Therefore, in this work we shift our ontology to focus on the relational aspect of caste. This enables us to develop a more nuanced understanding of the mechanics of caste discrimination by and through T2I models. Combining an algorithmic audit with critical discourse analysis, we draw on a conceptual frame challenging Brahminical Normativity to show how caste biases are perpetuated beyond the simple binaries of upper vs lower-caste categories. Our contributions are two-fold. Beyond challenging the categorical understanding of caste as a category, we propose an anti-caste approach to tackle the issue of caste bias and fairness in AI systems.
Authors:Tyra Girdwood, Saba Kheirinejad, Parnian Kheirkhah Rahimabad, Brianna M. White, Robert L Davis, David L Schwartz, Arash Shaban-Nejad
Abstract:
For patients experiencing cancer, nurse navigation can ease the burden of complex care by enhancing coordination of health services and patient outcomes. However, in under-resourced areas, trained nurse navigators may be limited or non-existent. In the United States, artificial intelligence (AI)-enabled digital health tools are increasingly available and may help address gaps in care coordination; however, most are not designed to specifically support nursing. This perspective piece discusses a human-centered AI framework that integrates empathic and agentic approaches grounded in the American Nurses Association's code of ethics to support nurses in the United States in cancer care navigation. The framework could augment, not replace, human empathy and agency while improving nurse workflow, patient-clinician relationships, and care coordination services in under-resourced areas.
Authors:Fatima Ahmad Muazu, Festus Adedoyin, Huseyin Dogan, Abiodun Adedeji, Melike Akca, Olumuyiwa Ayorinde
Abstract:
This study investigates how UX research (UXR) principles, combined with Large Language Model (LLM)-supported analysis, can be used to improve the quality of requirements for mobile learning systems designed for learners with cognitive disabilities. Using the UXR Point-of-View (PoV) pyramid as a methodological framework, the study progressed through four stages: foundational structuring of psychological, behavioral, and design layers; structured validation using the DeLone and McLean Information Systems Success Model and Quality Function Deployment (QFD); insight consolidation through the development of nine Cognitive Accessibility UXR Play Cards; and stakeholder-specific PoV articulation to support interdisciplinary communication. LLM-supported synthesis was integrated to assist in theme clustering, requirement refinement, and hypothesis formulation under human oversight. Findings suggest that many usability and engagement challenges in mobile learning originate from ambiguous or under-specified requirements rather than interface design alone. By embedding cognitive accessibility principles into measurable and technically traceable requirements, the proposed Cognitive Accessibility UXR Playbook provides a structured pathway for aligning theory, system architecture, and stakeholder strategy.
Authors:Abiodun Adedeji, Huseyin Dogan, Festus Adedoyin, Michelle Heward, Melike Akca, Emmanuel Oluwatosin Oluokun, Fatima Ahmad Muhazu, Olumuyiwa Ayorinde
Abstract:
User Experience Research (UXR) Points of View (POVs) distil complex and often fragmented research evidence into actionable perspectives that guide how teams interpret user needs, frame design decisions, and align stakeholders. Although POVs are widely used in industry practice, there are few published examples that explicitly document how POVs are constructed, particularly in culturally sensitive and low-resource contexts. This paper presents an exemplar case study demonstrating how a culturally grounded, AI-augmented UXR POV was developed to inform TeleDeCa, a telemedicine dementia care framework for family caregivers in Nigeria. Building on the UXR POV Playbook and pyramid framework, we illustrate how mixed-methods research, hypothesis generation, and ontology-based modelling can be combined to form a defensible POV without requiring a fully finalised system or validated outcomes. Generative AI (GenAI) is integrated across the UXR POV framework as a bounded research collaborator, supporting synthesis, hypothesis exploration, and narrative construction while preserving human judgment, ethical accountability, and cultural sensitivity. The contribution of this paper lies in the extraction of reusable Play Cards and a Play that extend the UXR POV Playbook and serve as exemplar material for the CHI 2026 workshop on developing AI-powered UXR POVs.
Authors:Olumuyiwa Ayorinde, Huseyin Dogan, Festus Adedoyin, Nan Jiang, Emmanuel Oluokun, Abiodun Adedeji, Melike Akca
Abstract:
This paper investigates how User Experience Research (UXR) methods can be combined with AI-supported analysis to develop clearer design direction for digital wellbeing interventions targeting Emergency and Public Safety Personnel (EPSP). EPSP work in high-stress, shift-based environments where cognitive fatigue and unpredictable schedules reduce engagement with conventional wellbeing tools. Using the UXR Point-of-View (PoV) framework, this study applied an AI-supported literature analysis process to identify recurring psychological, behavioural, and design patterns. Behaviour Change Techniques and Persuasive Technology principles were integrated throughout interpretation to connect evidence with practical design reasoning. The process resulted in a UXR PoV Pyramid, nine UXR Play Cards, and stakeholder focused PoV narratives. Findings show that effective wellbeing systems for EPSP must minimise cognitive effort, adapt to operational context, and prioritise psychological safety. The work demonstrates how AI can assist large-scale evidence interpretation while human researchers maintain responsibility for contextual judgement and design direction.
Authors:Festus Fatai Adedoyin, Huseyin Dogan, Melike Akca, Abiodun Adedeji
Abstract:
Rising household debt and cost-of-living pressures in the United Kingdom have intensified the role of AI-driven financial technologies in mediating credit assessment, repayment structuring, and debt support services. These systems increasingly shape consequential financial decisions, yet they operate within complex socio-technical environments characterised by regulatory constraint, algorithmic opacity, and heightened vulnerability risk. User Experience Research (UXR) Points of View (PoVs) are critical in translating heterogeneous research evidence into strategic direction for product and governance decisions. However, the existing UXR PoV framework was not designed for AI-mediated financial systems where interpretability, fairness, and accountability are central. This paper extends the UXR PoV pyramid into an AI-augmented methodological framework for Human-Centred AI debt management technologies in the UK financial services context. We formalise (1) an AI-Augmented PoV Pyramid, (2) a structured prompt architecture for synthesis and hypothesis generation, and (3) an AI-enabled Playbook Card system that embeds Generative AI into UXR workflows while preserving traceability and ethical oversight. Generative AI is positioned not as an analytic authority, but as an epistemic support mechanism subject to human validation and regulatory awareness. By grounding the framework in debt management technologies, including affordability assessment, repayment planning, and financial stress prediction systems, this work advances UXR methodology for high-stakes financial AI environments and contributes to the evolution of responsible, AI-powered UXR practice within the CHI community.
Authors:Emmanuel Oluwatosin Oluokun, Festus Fatai Adedoyin, Huseyin Dogan, Nan Jiang, Melike Akca, Abiodun Adedeji, Olumuyiwa Ayorinde, Fatima Ahmad Muazu
Abstract:
User Experience Research (UXR) in a legal and regulatory contexts presents unique challenges that require specialised approaches to protect vulnerable populations whilst generating actionable insights. Digital consultation, appointment booking, and medication delivery platforms show promise for extending care access; however, their real-world effectiveness is curtailed by an absence of theoretically grounded user experience research (UXR) methodologies that adequately account for the psychosocial conditions of these populations. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to guide the design of psychologically safe, low-cognitive-load digital health interventions for MSM and transgender individuals living with HIV/AIDS in Nigeria. Drawing from empirical research involving co-design workshops, thematic analysis, and requirements engineering, the methodology is operationalised through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in ten theory-informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. Each play contains actionable tasks, AI-augmented approaches, and ethical guardrails tailored for research with marginalised populations. The output is a set of ten theory-informed UXR Play Cards translating psychological insight and empirical evidence into actionable design guidance. The core contribution is a replicable, stigma-aware, and privacy-centred framework for responsible GenAI use in UXR practice, advancing human-centred digital health design for marginalised communities.
Authors:Melike Akca, Mona Giff, Deniz Cetinkaya, Huseyin Dogan, Stephen Giff
Abstract:
Attention-deficit/hyperactivity disorder (ADHD) is a psychiatric disorder which presents itself in individuals through patterns of developmentally inappropriate levels of inattentiveness, hyperactivity, and impulsivity, with difficulties in decision making and emotional regulation (ER). Although digital and AI-based interventions have expanded access to ER support, many existing systems remain limited by weak theoretical integration, insufficient accommodation of neurodiversity, and a lack of structured user experience research (UXR) methodologies, that bridge psychological insight with design practice. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to support the design of emotionally intelligent and Neuroinclusive digital ER interventions for adults with ADHD. The approach integrates empirical evidence with established psychological frameworks Dialectical Behaviour Therapy (DBT), Self-Determination Theory (SDT), and the COM-B behavioural model and leverages Generative AI as a co-analytic tool to support synthesis, hypothesis formation, and design articulation. The methodology is operationalized through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in a set of ten theory informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. The primary contribution of this work is a replicable, bias-aware framework for integrating Generative AI into UXR practice, advancing human-centred and Neuroinclusive approaches to digital mental health design.
Authors:Vishakh Padmakumar, Lujain Ibrahim, Zora Zhiruo Wang, Jennifer Wang, Q. Vera Liao, Diyi Yang
Abstract:
AI tools are increasingly integrated into real-world workflows. However, existing measures of reliance on these tools focus on AI output adoption or on self-reported indicators, rather than how task effort is distributed between users and tools. Here, we introduce offloading score, a measure of reliance that quantifies the fraction of cognitive effort offloaded to an AI tool. Offloading Score is simulation-based -- we construct a counterfactual workflow by estimating how the user would have completed the task without the tool, and then computing the fraction of steps saved by using the tool. We validate offloading score through intrinsic evaluations of metric validity, and a controlled user study ($n=40$) with developers performing programming tasks using AI tools. We vary time pressure to test whether reliance measures capture the known increase in reliance under time pressure. We show that offloading score detects significantly higher reliance in time-constrained settings ($+43\%$, $p=0.018$), while usage-based and self-reported baseline measures of reliance do not distinguish the conditions. We complement this with descriptive insights showing that higher reliance manifests as greater delegation of subtasks to the tool and more direct reuse of AI outputs. Finally, we demonstrate an approach of using offloading score in combination with target outcomes of a task (e.g., code understanding) to identify when reliance may be (in)appropriate. Our framework offers two contributions: an instrument users can apply to measure and reflect on their own reliance, and a quantitative signal that agent designers can utilize to mitigate overreliance.
Authors:Neemias da Silva, Myriam Delgado, Rodrigo Minetto, Daniel Silver, Thiago H Silva
Abstract:
We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze captions, justifications, and perception tags across personas. Results indicate strong convergence in captions for different personas, whereas justifications display systematic variation associated with socioeconomic and political attributes, while perception tags show no statistically significant persona-related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.
Authors:Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo
Abstract:
Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8\% rubric pass-rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.
Authors:Amit Kumar Das, Karanbir Pelia, Manav Nitesh Ukani, Klaus Mueller
Abstract:
Infographic designers balance many choices at once: chart type, color, and whether to add a benchmark or a scale. Past work studies these factors one at a time, so we know little about how readers weigh them against each other. We address this gap with a choice-based conjoint study (N = 65) in which participants viewed pairs of infographics on a mock newspaper page about unemployment. Each infographic varied across three attributes: comparison type (none, US average, percentage scale), color (red, blue), and graphic type (single icon, icon series, bar chart). Comparison type drove most of the preference variation (58.5%), followed by graphic type (29.2%) and color (12.3%). Readers favored percentage scale markers and benchmark comparisons; color had no practical effect. The percentage scale level adds axis information rather than a benchmark, so the comparison type result mixes two distinct ideas. A single topic and a narrow palette also limit external validity. We argue that conjoint analysis is a practical and underused tool for studying visualization preferences across many design dimensions.
Authors:Supriya Khadka, Sanchari Das
Abstract:
Advancements in augmented reality (AR) technologies offer immense potential for mobile experiences. However, most commercial and educational AR systems assume a baseline of predictable user behavior and stationary interaction. Preschoolers and children in early childhood education, specifically ages 3 to 8, are naturally erratic, physically dynamic, and prone to rapid locomotion, making them the ultimate stress test for mobile spatial computing. Through a focused analysis of recent literature on physical activity and spatial learning in AR for preschoolers, this paper identifies points of friction in current mobile deployments. We highlight recurring failures in camera tracking during dynamic movement, physical safety hazards caused by screen-induced distraction, spatial crowding around physical markers, and the privacy risks of continuous environmental surveillance. To address these challenges, we propose AnchorPlay AR, a conceptual prototype for a privacy-preserving, audio-first spatial application. By explicitly separating locomotion from visual tracking, AnchorPlay AR uses audio cues to safely guide movement and reserves visual augmentation for stationary moments, offering a safer framework for preschoolers in constant motion.
Authors:Oleg Jarma Montoya, Erica Manca, Thomas Vase Schultz Volden, Paolo Burelli
Abstract:
We present a pilot study on the collection and synchronisation of multimodal data for player experience investigation. We collected game telemetry, self-reported surveys, biometrics, and cued-retrospective think-aloud (C-RTA) data from 19 participants playing three Atari 2600 games. The study then uses the data to investigate difficulty in PX, showcasing a protocol for future multimodal research. The dataset obtained from the experiment, which is publicly available, shows potential as a rich, transformative source that can be used to investigate dynamic difficulty adjustment algorithms, game balancing strategies or broader explorations of games user research. The study findings suggest that the experimental approach holds strong potential for generalisation in future player experience studies.
Authors:Sassan Mokhtar, Lars Doorenbos, Fatemeh Jabbari, Marius Bock, Dominik Bach, Juergen Gall
Abstract:
Interactive assistance systems typically provide feedback after an action has been completed, supporting error recovery but not preventing the error itself. We present TRAFA, a real-time predictive feedback system for procedural tasks that intervenes before errors are committed. TRAFA operationalizes predictive feedback through a Track-Forecast-Act framework that tracks hand and object state, forecasts user motion conditioned on scene context, and triggers feedback when a predicted action is likely to violate task constraints. We instantiate this pipeline in a sequential assembly setting and evaluate it through both technical benchmarking and a controlled user study against conventional reactive feedback. Our results show that predictive feedback improves task accuracy and efficiency while maintaining a comparable number of feedback events. These findings position feedback timing as a key dimension in system design and show how real-time anticipation can be integrated into interactive systems to prevent errors before they occur.
Authors:Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles
Abstract:
Real-time cognitive load assessment from eye-tracking signals could potentially enable adaptive human-centered-AI such as safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5W, confirming feasibility for wearable cognitive load monitoring.
Authors:Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles
Abstract:
Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.
Authors:Morita Tarvirdians, Senthil Chandrasegaran, Hayley Hung, Catholijn M. Jonker, Catharine Oertel
Abstract:
Making high-stakes personal decisions involves cognitive, emotional, and intuitive processes, and individuals differ in how they allocate attention across these modes. Integration of these processes has shown to benefit decision making. Yet, most current decision-support systems focus primarily on supporting cognitive aspects, rather than adapting to the individual's thinking profile to support integration of different types of thoughts. In this study, we investigate an agent designed to encourage integration by adapting to the individual user's thought patterns. We explore its effects on participants' perceptions of the agent and their reflective behavior, in comparison with unaided pre-reflection and a baseline agent. In a between-subjects study (N = 128), our agent, which fostered broad and elaborated thinking, enabled more personalized reflective trajectories, elicited more integrative reflective language, and was perceived as providing stronger support for holistic reflection. In contrast, the baseline agent produced homogenized profiles dominated by cognitive language across participants.
Authors:Rituja Pardhi, Matthias Norden, William Saakyan, Nadine Vietmeier, Simone Kirst, Isabel Dziobek, Julia Asbrand, Hanna Drimalla
Abstract:
Accurately quantifying children's social interaction behavior is part of understanding their cognitive and emotional development, as well as mental health conditions. Kids-SIT is a web-based tool designed to computationally analyze children's behaviors by engaging them in a standardized video conversation scenario while their responses are video recorded. In a pre-registered study with 21 healthy children, we evaluated the potential of the Kids-SIT as an accessible paradigm for automated analysis of children's social interaction behavior. We assessed their subjective impression, as well as verbal and non-verbal responses during the Kids-SIT. Verbal content was analyzed using the LIWC tool. Three socially relevant non-verbal behaviors (gaze deviation, smiling, and nodding) were manually annotated and automatically extracted using three computational methods. We examined how well these methods capture naturalistic social interaction patterns of healthy children. We conducted an exploratory classification of healthy children (n=21) and those with social anxiety disorder (n=11) using automated behavioral features. The semantic analysis of the children's verbal responses and their post-hoc impressions indicated that the Kids-SIT successfully elicited natural social interaction behavior. Children's non-verbal behavior also showed similar pattern: they looked at their interaction partner for most of the time, particularly while listening than speaking. Smiling and gazing toward the partner occurred more frequently during the person-directed liked and disliked parts than during the picture-description phase. These non-verbal behavior patterns were captured both by manual annotations and by the computational analysis methods. In the exploratory analysis with a clinical sample, automatically extracted features enabled above-chance differentiation between children with and without SAD (AUC=0.74).
Authors:Pooja Prajod, Hannes Cools, Thomas Röggla, Pablo Cesar, Abdallah El Ali
Abstract:
As generative AI becomes increasingly integrated into journalism, designing effective AI-use disclosures that inform readers without imposing unnecessary burden is a key challenge. While prior research has primarily focused on trust and credibility, the impact of disclosures on readers' attentional and cognitive load remains underexplored. To address this gap, we conducted a $3\times2\times2$ mixed factorial study manipulating the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and role of AI (editing, partial content generation), measuring load via NASA-TLX and eye-tracking. Our results reveal a significant attentional cost: one-line disclosures resulted in significantly higher fixation durations and saccade counts, particularly for AI-edited content. Detailed disclosures did not impose additional burden. Drawing on Information-Gap Theory, we argue that brief labels may trigger increased visual scrutiny by alerting readers to AI use without providing enough information. NASA-TLX scores and pupil diameter showed no significant differences across conditions, suggesting that AI-use disclosures do not impose cognitive burden regardless of the detail level. Interview insights contextualize these findings and reveal a strong preference for detailed or ``detail-on-demand'' designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust transparency levels based on readers' attentional patterns and news context.
Authors:Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka
Abstract:
Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.
Authors:Lingyu Peng, Yu Liang, Ying Zhang, Chang Ge, Qingchuan Li
Abstract:
Cosmic 1001 is an interactive installation that transforms space exploration history into a speculative news experience. Participants first browse a news-based archive of major space events, then pose future-oriented questions or specify conditions such as year, celestial body, or mission name. In response, AI generates a future news item including a headline, article, narration, and visual media. These outputs are accumulated in the Future Tunnel, a shared visualization where individual stories form a collective landscape of possible futures. By combining historical space events with science fiction references, the installation explores a space between documentation and imagination, treating the future not as a fixed prediction but as a visible and discussable speculation.
Authors:Lingyu Peng, Wenbo Lu, Liying Long, Qingchuan Li
Abstract:
Western art has regarded The Thinker as a symbol of rational contemplation, while Eastern aesthetics has taken the Four Gentlemen, namely plum, orchid, bamboo, and chrysanthemum, as symbols of moral and spiritual cultivation. This paper presents Ink Spiral, a video installation that links these traditions through AI generated ink imagery. By transforming a rotating sculpture of The Thinker into the Four Gentlemen across thousands of frames, the work shifts between three dimensional sculpture and two dimensional ink, human introspection and natural symbolism. Ink Spiral turns fixed cultural icons into a fluid dialogue, inviting audiences to perceive cross cultural connection as a living, ambiguous, and endlessly interpretable creative state.
Authors:Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, Ivan Flechais
Abstract:
Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic's BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.
Authors:Bryan Min, Sangho Suh, Jim Hollan, Haijun Xia
Abstract:
The principle of abstraction guides the design of interactive systems, yet we lack a conceptual framework to understand how it shapes interaction design. Existing models, such as the gulfs of execution and evaluation, do not explicitly model abstractions in the system or in users' mental models, and therefore lack actionable guidance for designing abstractions. To investigate how abstractions are employed in interactive systems, we surveyed 457 papers and synthesized a design space of abstraction techniques along six dimensions. We use this design space to reframe the gulfs through a lens of abstraction, explicitly articulate the cognitive and design processes by which users and systems bridge and navigate the abstraction gap, and demonstrate how this model integrates existing perspectives and surfaces new opportunities for future systems.
Authors:Christian Masuhr, Julian Koch, Arne Wendt, Thorsten Schüppstuhl
Abstract:
Augmented Reality (AR) is increasingly utilized to guide users through complex spatial tasks in domains such as manufacturing, non-destructive testing, and surgery. These applications often require strict compliance with 5D+ trajectories using rotation-symmetric tools (3D position, 2D orientation, and movement speed). However, the sensori-motor baselines of untrained users during these multidimensional tracing tasks, along with the cognitive-motor trade-offs induced by varying visual feedback paradigms, remain underexplored. We present a controlled within-subjects user study (N=30) evaluating three distinct AR UI concepts for trajectory guidance, both with and without explicit orientation constraints. We analyzed spatial, orientational, and speed compliance based on the internal AR tracking, which was validated against a high-precision external optical tracking system to rule out hardware drift. By segmenting the execution into transient and steady-state phases and applying Aligned Rank Transform (ART) ANOVA, we isolated the interaction effects between visual design and task complexity. Alongside subjective metrics (NASA-TLX, SUS), our results establish conservative performance baselines for novice users performing freehand 5D trajectory following. We reveal orientation-induced cognitive-motor trade-offs and identify mitigating UI synergies. Ultimately, we provide empirical baselines and actionable design guidelines for developing effective AR guidance systems.
Authors:Hashim Aziz, Mehedi Hasan Raju, Oleg V. Komogortsev
Abstract:
Eye movement biometrics (EMB) use subject-specific gaze dynamics for user authentication and identification. Recent deep learning-based EMB systems achieve strong performance by modeling temporal eye movement behavior. However, these systems typically overlook continuous gaze offset, despite prior evidence that it contains user-discriminative information. This work examines whether continuous gaze offset can improve biometric performance when combined with existing biometric features. We evaluate linear and nonlinear fusion methods on two publicly available datasets, collected via the lab-grade eye tracker and virtual reality headset across multiple tasks and observation durations. Results indicate that fusion offers performance benefits on both datasets, particularly when using nonlinear fusion. Additionally, fusing biometric information across multiple tasks further improves authentication performance. These findings support the hypothesis that continuous gaze offset may serve as useful auxiliary information under conditions of degraded or noisy eye tracking.
Authors:JaeWon Kim, Lindsay Popowski, Louisa Conwill, Elizabeth `Lizzie' Li, Meryl Ye, Jiaying `Lizzy' Liu, Jose A. Guridi, Theia Henderson, Bingxu Han, Dennis Wang, Angel Hsing-Chi Hwang, Susan Wyche, Yasmine Kotturi, Gillian R. Hayes, Angela D. R. Smith
Abstract:
People care about climate change, injustice, and humanitarian crises. The challenge is not apathy but capacity: sustained engagement with large-scale problems is psychologically costly, and social media architecture often amplifies awareness while providing few pathways to meaningful action. The result is rising distress, overwhelm, and disengagement -- particularly among young people who encounter global suffering through platforms designed for attention capture rather than constructive response. This workshop examines how social technology design shapes the conditions for sustained engagement with societal challenges. Drawing on Tronto's care ethics framework and research in moral psychology and platform studies, we ask why caring at scale is difficult and how social media can both exacerbate and potentially mitigate this difficulty. Tronto's framework shows that good care requires more than awareness: it demands responsibility, competence, and community. Dominant social media architectures stall the caring process at its earliest phase. We invite researchers and designers to identify platform designs that deplete or support the capacity to care, and to develop design directions for \textit{sustainable care}: engagement that people can maintain over time without burning out.
Authors:Lana Do, Shasta Ihorn, Charity M. Pitcher-Cooper, Sanjay Mirani, Gio Jung, Hyunjoo Shim, Zhenzhen Qin, Kien T. Nguyen, Vassilis Athitsos, Ilmi Yoon
Abstract:
Audio description (AD) narrates visual elements in video for blind and low-vision audiences. Recent work has shown that giving novice describers an AI-generated draft to start from helps produce higher-quality AD and lowers the barrier to entry. What remains an open question is how draft quality shapes the editing process. We investigate this through GenAD, an AD generation pipeline that incorporates accessibility guidelines and contextual video information, and RefineAD, an editing interface for human revisions. Human-AI contributions are measured across text, timing, and delivery. In a within-subjects study, we compared authoring from scratch against editing AI drafts of varying quality. GenAD drafts cut completion time by more than half and significantly reduced cognitive load. In contrast, baseline drafts generated from simple, unguided prompts offered only modest benefits, pointing to a minimum quality threshold for effectiveness. Qualitative findings suggest this threshold is content-dependent; as visual complexity increases, so does the quality needed from AI drafts. We propose this as a design principle: effective AI assistance should clear a quality threshold suited to the target content, rather than simply be present.
Authors:Yen-Ting Liu, Chiu-Hsuan Wang, TzuLing Chen, Ting-Ying Lee, Tzu-Hua Wang, Chien-Ming Lin, Bing-Yu Chen, Hsin-Ruey Tsai
Abstract:
In natural human-to-human communication, multimodal user input is typically used to supplement explicit and complement implicit voice commands, with casualness allowing for flexible input modality combinations and tolerance for imprecise input data. For example, saying "I want that." with a casual glance at a bottle of water is clear enough in human-to-human communication as an implicit voice command accompanied by gaze and/or gestures, rather than an explicit one. To enable such a human-like interaction in human-robot interaction (HRI), we propose a system, IntenBot, to understand user intentions from flexible and imprecise multimodal input, including voice, gaze, and finger-pointing, in XR. The disambiguation capability of large language models (LLMs) is used to filter out irrelevant input modalities and imprecise input data, generating potential instructions for user confirmation. The flexible and imprecise multimodal input enables casual, human-like interaction with robots, reducing time, effort, and attention, and could also be used as non-voice input. We conducted an informative user behavior study in a simulated environment to understand users' natural be- havior in flexibly interacting with a robot using multimodal input and to obtain appropriate angle range parameters for gaze and finger-pointing. An XR study was then performed to evaluate the performance of IntenBot, compared with other methods. We also deployed IntenBot on a physical robot to showcase its real-world applications.
Authors:Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, Ivan Flechais
Abstract:
Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.
Authors:Yu Xie, Ying Qi
Abstract:
Recommendation feeds work well when people are simply browsing, and search works well when they can formulate a query. Between these two cases is a common but poorly supported state: users feel that their feed has become repetitive, yet cannot clearly specify what they want instead. We refer to this state as vague intent. We present Red-Rec, an AI-supported exploration interface for this middle ground. After a period of browsing, the system summarizes patterns in the current feed (e.g., dominant content categories and possible latent interests), offers clickable exploration options, asks at most one follow-up question, and then gradually blends new content into the feed. The design is motivated by a formative study which found that users often recognize feed staleness but struggle to articulate alternatives, suggesting the need for proactive and low-effort interaction.We evaluated Red-Rec in a mixed-design lab study against three comparison conditions: a passive feed, search, and a user-initiated chat interface. Compared with user-initiated chat, Red-Rec led to broader exploration, higher serendipity ratings, and lower interaction effort. Participants in the AI-initiated condition typed very little , relying mainly on option selection, whereas participants in the user-initiated chat condition typed substantially more . We discuss how proactive, option-based AI support can help users move beyond repetitive feeds without undermining their sense of control, and we outline design implications for recommendation interfaces that support open-ended exploration.
Authors:Stanisław Knapiński, Maciej Grzeszczuk, Barbara Karpowicz, Pavlo Zinevych, Wieslaw Kopec
Abstract:
This work aims to establish an end-to-end system for tracking of physical 3D objects for virtual reality (VR) applications. We focus on training applications requiring real-time tracking of the position of small physical objects and their reflection in VR space. Out goal is to perform object tracking in a "plug and play" manner, without using complex systems with quite large tracking devices or manually implementing object tracking. We therefore propose a system for object tracking via fiducial markers alongside a software harness, to enable fast and efficient designation of objects to be tracked and data streaming solution for end-use applications. The system utilizes AruCo, AprilTag and an original Colored Control Points based fiducial system. It allows for easy tag detection and use of object position data, which are crucial for immersive training environments based on VR and eXtended Reality (XR). We evaluate various tag sizes, detection distances, and different camera devices against the theoretical limits. In effect, we create a complete solution for implementing marker-based, real-to-virtual object position mapping for various applications.
Authors:Hippolyte Fournier, Sina Alisamir, Safaa Azzakhnini, Isabella Zsoldos, Eléonore Trân, Gérard Bailly, Frédéric Elisei, Béatrice Bouchot, Brice Varini, Patrick Constant, Joan Fruitet, Franck Tarpin-Bernard, Solange Rossato, François Portet, Olivier Koenig, Hanna Chainay, Fabien Ringeval
Abstract:
The integration of artificial intelligence (AI) into healthcare has advanced significantly, yet affect recognition remains a major challenge, particularly in AI-assisted interventions such as Computerized Cognitive Training (CCT). The THERADIA-WoZ corpus was developed to enable multimodal affect recognition in the context of AI-driven CCT, focusing on an older adult population. This study extends the corpus by introducing a dataset collected from young adults, allowing direct comparison of affect recognition models across age groups. Our objective was to assess whether multimodal models based on dimensions borrowed from appraisal theories outperform those based on categorical labels and to evaluate their generalisation power across age corpora. After comparing both corpora, models were trained and tested using within-corpus, cross-corpus, and mixed-corpus evaluation. Results revealed that appraisal dimensions consistently outperformed categorical labels across all conditions, demonstrating greater predictive accuracy and stability. Notably, categorical labels failed to generalise across age corpora, as performance dropped to chance levels in cross-corpus evaluation. In contrast, appraisal dimensions maintained predictive performance above chance, reinforcing their robustness for cross-age affect recognition. Furthermore, training on both corpora did not improve generalisation beyond within-corpus training. The findings support the theoretical and practical advantages of appraisal dimensions over categorical labels in affective computing. They also highlight the importance of multimodal fusion and deep learning representations for emotion modeling. To facilitate future research, we provide an API for researchers interested in time-continuous emotion prediction, offering valuable tools for behavioral sciences to enhance the measurement of emotional states in various experimental settings.
Authors:Ashish Mehta, Jared Moore, Jacy Reese Anthis, William Agnew, Eric Lin, Peggy Yin, Desmond C. Ong, Nick Haber, Carol Dweck
Abstract:
There is growing concern that AI chatbots might fuel delusional beliefs in users. Some have suggested that humans and chatbots mutually reinforce false beliefs over time, but quantitative evidence is lacking. Using a unique dataset of chat logs from individuals who exhibited delusional thinking, we developed a latent state model that captures accumulating and decaying influences between humans and chatbots. We find that a bidirectional influence model substantially outperforms a unidirectional alternative where humans are the primary driver of delusion. We find that humans exert strong but short-lived influence on chatbots, whereas chatbots exert longer-lasting influence on humans. Moreover, chatbots exert strong, stable self-influence over their own future outputs that tends to perpetuate delusions over long stretches of conversation. In fact, this chatbot self-influence constituted the dominant pathway when considering accumulated influence over time. Overall, these results indicate that humans tend to drive sharp, immediate increases in delusion, whereas chatbots sustain and propagate these effects over longer timescales. Together, these findings provide the first quantitative evidence that human-chatbot interactions can form feedback loops of delusion, decomposable into distinct pathways with dissociable temporal dynamics. By doing so, they can inform the development of safer AI systems.
Authors:Payal Mohapatra, Calvin Murdock, Ali Aroudi, Ishwarya Ananthabhotla, Anjali Menon, Buye Xu, Morteza Khaleghimeybodi
Abstract:
Many individuals struggle to understand conversation partners in noisy settings, particularly amid background speakers or due to hearing impairments. Emerging wearables like smartglasses offer a transformative opportunity to enhance speech from conversation partners. Crucial to this is identifying the direction in which the user wants to listen, which we refer to as the user's acoustic zones of interest. While current spatial audio-based methods can resolve the direction of vocal input, they are agnostic to listening preferences and have limited functionality in noisy settings with interfering speakers. To address this, behavioral cues are needed to actively infer a user's acoustic zones of interest. We explore the effectiveness of head-orienting behavior, captured by Inertial Measurement Units (IMUs) on smartglasses, as a modality for localizing these zones in seated conversations. We introduce HALo, a head-orientation-based acoustic zone localization network that leverages smartglasses' IMUs to non-invasively infer auditory zones of interest corresponding to conversation partner locations. By integrating an a priori estimate of the number of conversation partners, our approach yields a 21% performance improvement over existing methods. We complement this with CoCo, which classifies the number of conversation partners using only IMU data, achieving 0.74 accuracy and a 35% gain over rule-based and generic time-series baselines. We discuss practical considerations for feature extraction and inference and provide qualitative analyses over extended sessions. We also demonstrate a minimal end-to-end speech enhancement system, showing that head-orientation-based localization offers clear advantages in extremely noisy settings with multiple conversation partners.
Authors:Varad Vishwarupe, Ivan Flechais, Marina Jirotka, Nigel Shadbolt
Abstract:
Domestic voice assistants and smart-home devices are increasingly embedded in everyday routines, yet their ethics are often treated as an afterthought or delegated to compliance teams. To explore how expectations about smart-home AI are constructed and managed, we conducted 33 semi-structured interviews with designers, developers, and researchers from major smart-home platforms (Amazon Alexa, Microsoft Azure IoT, and Google Nest). Using a constructivist grounded theory approach, we develop Expectations Management (EM): a culturally embedded model describing how practitioners shape, calibrate, and repair expectations by balancing organisational rights with culturally situated rites. We show that EM differs from expectation-confirmation theory and trust-calibration by foregrounding moral judgement, situated action, and cross-cultural variation. Our analysis reveals four recurring design tensions: automation vs. autonomy, helpfulness vs. intrusiveness, personalisation vs. predictability, and transparency vs. obscurity and distils them into a five-phase EM Design Playbook that supports moral prudence. We discuss implications for responsible smart-home design and offer guidance for human-centred AI.
Authors:Merve Cerit, Andrea Mock, Vryan Almanon Feliciano, Thomas N. Robinson, Byron Reeves, Nilam Ram, Nick Haber
Abstract:
Predicting whether an individual's depressive symptoms will worsen, remain stable, or improve over the coming weeks can enable earlier and more targeted care, yet prospective within-person trajectory prediction remains largely unaddressed in digital phenotyping. We combine fortnightly CES-D assessments with over 100 million screenshots captured every five seconds via the Stanford Screenomics platform from 96 adults followed for approximately one year (M = 20.9, SD = 3.9 assessments per participant, 2,002 total observations). We frame prediction as a within-person classification task: whether symptoms will worsen, remain stable, or improve over the subsequent fortnight, operationalized in three ways to capture clinically meaningful change. Under temporal holdout, XGBoost achieves an AUC of 0.906 for crossings of established CES-D severity bands and 0.755 for change relative to each participant's own within-person variability, generalizing to unseen individuals (AUC = 0.821). Each person's typical symptom level was the only statistically significant predictor above the most recent CES-D score; without it, the most consequential worsening transitions go undetected. Screenome-derived behavioral features revealed prodromal patterns of worsening, including escalating social media use, fragmented device engagement, and changes in overnight activity, with substantial individual heterogeneity. These findings establish a proof-of-concept foundation for monitoring systems that could identify individuals approaching clinical deterioration before symptoms reach a crisis point.
Authors:Yichun Zhao, Miguel A. Nacenta, Mahadeo A. Sukhai, Sowmya Somanath
Abstract:
Despite recognition of the value of diversity, the way work takes place can fail to support blind or low-vision employees, especially in collaborative work settings. This paper examines how professional teams with diverse visual abilities use information representations (e.g., PDF documents, spreadsheets and charts). A diary study with follow-up individual interviews (23 participants with mixed abilities from 5 teams) and 2 separate focus groups (7 participants from 2 other teams) allowed us to characterize key dimensions of the role of representations in the workplace into four types of interrelated failures and workarounds, influenced by workplace stigmas and shaped by evolving social dynamics towards interdependent information work. We contribute this new empirically supported conceptual understanding of representation use in workplaces that can help design and improve the experiences of mixed-ability teams doing knowledge work in the current technological landscape.
Authors:Shuyue Feng, Cedric Caremel, Yoshihiro Kawahara
Abstract:
Topology optimization(TO) is widely used in engineering because of its ability to save material and optimize structural performance. Although prior work has explored 2D human-centered design tool for TO, the results are often limited in variety and offer weak customizability. Meanwhile, due to the high computational and time costs of TO, researchers have attempted to address these issues using generative AI; however, such methods often provide limited interactivity. In addition, topology optimization in many cases needs to balance structural performance and aesthetic qualities through iterative design, a perspective that has rarely been emphasized in traditional TO. We present TopoStyle, an iterative design tool for 2.5D topology optimization using a 2D diffusion model. We explore two interaction methods. The first exports 3D parts to a graphical interface for hand-drawn interaction. The second enables direct interaction within 3D modeling software using points. Our tool also supports the use of masks to apply topology optimization to specific regions, allowing users to address customized design needs. We compare and evaluate both performance and interaction methods, and investigate how TopoStyle can balance performance and aesthetics while improving design efficiency through customization and iterative design. Finally, we demonstrate the application scenarios of TopoStyle through several design cases.
Authors:Jinrui Wang, Alexis Pister, Sian Phillips, Sarah Bissett, Ruaidhri Higgins-Lavery, Clare Wharmby, Andrew Sudmant, Uta Hinrichs, Benjamin Bach
Abstract:
This paper reports on the process of designing the UK Co-Benefits Atlas, which communicates and publicizes data for climate mitigation. Visualization atlases -- an emerging type of platform to make data about complex topics comprehensive through interactive visualizations and explanatory content -- pose challenges beyond traditional visualization projects. Atlases must address diverse and often uncertain audiences and use cases, support both explanatory and guided exploration, and accommodate complex, evolving data. Over 10 months, our team of visualization and domain experts conducted 8 design workshops, iterative prototyping, 15 stakeholder onboarding sessions, and continuous reflection. These intertwined processes informed the development of the Atlas, comprising over 400 pages of visualizations and explanations. They also enabled a deeper understanding of how stakeholders may critically engage with the atlas in practice, in terms of interests, potential frictions when navigating huge amounts of data, and envisioned usage scenarios. Reflecting on our design process, we identify five driving forces in atlas design -- data, people, stories, context, and the atlas itself -- whose shifting dynamics influence different stages of visualization atlas design in different ways. Grounded in our case study, we discuss using these forces as a conceptual starting point for structuring and reflecting on future atlas design processes.
Authors:Rania Islambouli, Laura Geiger, Daniela Wurhofer, Devender Kumar, Clemens Sauerwein, Jan David Smeddinck
Abstract:
Monitoring exercise intensity is critical for safe and effective physical activity, particularly for individuals with cardiovascular disease, where overexertion can pose serious risks. Although physiological measures such as heart rate are widely used for avoiding overexertion, they can be unreliable in certain cases, such as when affected by medication or when wearables are worn too loosely. We introduce AktivTalk, a mobile prototype that digitizes the clinically validated Talk Test to support voice-based, in-the-moment self-assessment of exertion. In a within-subject study with 20 participants, we collected exertion-labeled voice samples and found that AktivTalk was rated as highly usable and preferred over conductor-guided assessment. We further explored automated exertion classification from Talk Test speech. Using MFCC-based features with class balancing and cross-validation, a lightweight neural classifier achieved up to 90% accuracy for detecting high vs.non-high exertion from Talk Test recordings. This work highlights the potential of structured voice interactions for accessible exertion assessment and motivates future passive exertion monitoring from speech.
Authors:Sadra Sabouri, Zeinabsadat Saghi, Run Huang, Sujay Maladi, Esmeralda Eufracio, Sumit Gulwani, Souti Chattopadhyay
Abstract:
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step. A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post-hoc review would have failed to surface, and reported a sense of co-ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.
Authors:Joy Lai, Alex Mihailidis
Abstract:
Digital reminder systems are widely used in dementia care to support everyday tasks, but they are typically designed for one-way prompting rather than helping caregivers interpret engagement over time. We present Remindful, a caregiver-informed reminder platform that extends task prompting with caregiver-facing alerts, summaries, and review features to support awareness in home-based dementia care. Drawing on formative caregiver interviews, lived-experience advisor input, and in-home deployments with two caregiver-PLwD dyads, we examine how reminder-based caregiver awareness functions in practice. Our findings show that reminder systems can support caregiver reassurance, household coordination, and awareness of routines over time, but that reminder interaction data is highly context-dependent. Household participation, prompt attribution, routine mismatch, accessibility barriers, and technical failures all shaped what reminder logs could reasonably mean. We argue that reminder systems should not be treated as neutral behavioral sensors, but designed as assistive infrastructures for caregiver interpretation that preserve uncertainty and support contextual sensemaking in real homes.
Authors:Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang
Abstract:
The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
Authors:Thilo Spinner, Matthias Miller, Fabian Sperrle-Roth, Mennatallah El-Assady
Abstract:
Developing Visual Analytics (VA) applications requires integrating complex machine learning models with expressive interactive interfaces. Developers face a stark trade-off: building tightly-coupled monoliths plagued by fragile interdependencies, or relying on restrictive, simplistic frameworks. Meanwhile, unconstrained, single-shot AI code generation promises speed but yields unstructured, unauditable chaos. The core challenge is combining the control and expressiveness of custom development with the efficiency of AI generation under strict constraints. To address this, we introduce BONSAI, a mixed-initiative workspace for the multi-agent co-development of VA applications. BONSAI utilizes a modular four-layer architecture (hardware, services, orchestration, application) that allows human and AI developers to independently contribute reusable components. The workspace incorporates this architecture into a structured four-phase development process (plan, design, monitor, and review), ensuring distributed agency and full provenance, where all human and AI contributions are structurally bounded and tracked. We evaluate BONSAI through case studies demonstrating the efficient creation of novel tools and the rapid reconstruction of complex VA applications directly from research paper descriptions. Ultimately, this paper contributes a conceptual workflow, a scalable architecture, and an integrated system that successfully balances AI's generative speed with the structural rigor required for complex VA development.
Authors:Sola Kim, Marco A. Janssen, Jieshu Wang, Ame Min-Venditti, Neha Karanjia, John M. Anderies
Abstract:
Federal agencies are increasingly deploying large language models (LLMs) to process public comments submitted during notice-and-comment rulemaking, the primary mechanism through which citizens influence federal regulation. Whether these systems treat all public input equally remains largely untested. Using a counterfactual design, we held comment content constant and varied only the commenter's demographic attribution -- race, gender, and socioeconomic status -- to test whether eight LLMs available for federal use produce differential summaries of identical comments. We processed 182 public comments across 32 identity conditions, generating over 106,000 summaries. Occupation was the only identity signal to produce consistent differential treatment: the same comment attributed to a street vendor, compared to a financial analyst, received a summary that preserved less of the original meaning, used simpler language, and shifted emotional tone. This pattern held across all names, prompts, models, and regulatory contexts tested. Race effects were inconsistent and appeared driven by specific name tokens rather than racial categories; gender effects were absent. Writing quality predicted summarization outcomes through argument substance rather than surface mechanics; experimentally injected spelling and grammar errors had negligible effects. The magnitude of occupation-based differential treatment varied by model provider, meaning that selecting a model implicitly selects a level of fairness -- a dimension that current procurement frameworks such as FedRAMP do not evaluate. These findings suggest that socioeconomic signals warrant attention in AI fairness assessments for government information systems, and that fairness benchmarks could be incorporated into existing federal IT procurement processes.
Authors:Ryan T. Woo, Anmol Rao, Aryan Keluskar, Yinong Chen
Abstract:
We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner's explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
Authors:Varad Vishwarupe, Ivan Flechais, Nigel Shadbolt, Marina Jirotka
Abstract:
Large language models (LLMs) are increasingly integrated into design and development workflows, yet decisions about their use are rarely binary or purely technical. We report findings from a constructivist grounded theory study based on interviews with 33 designers and developers across three large technology organisations. Rather than evaluating LLMs solely by capability, participants reasoned about the role an LLM could occupy within a workflow and how that role would interact with existing structures of responsibility and organisational accountability. When LLMs were framed as tools under clear human control, their use was typically acceptable and could be integrated within existing governance structures. When framed as teammates with shared or ambiguous agency, practitioners expressed hesitation, particularly when responsibility for outcomes could not be clearly justified. At the same time, participants also described productive teammate configurations in which LLMs supported collaborative reasoning while remaining embedded within explicit oversight structures. We identify tool and teammate framings as recurring ways in which designers and developers position LLMs relative to human work and present an analytic rubric describing how role framing shapes decision authority, accountability ownership, oversight strategies, and organisational acceptability. By foregrounding design-time reasoning, this work reframes To LLM or Not to LLM as a sociotechnical positioning problem that emerges during system design rather than during post-deployment evaluation.
Authors:Ian Drosos, Jo Vermeulen, George Fitzmaurice, Justin Matejka
Abstract:
People frequently use online forums to get help from experts to answer questions about feature-rich software. However, they may have to wait minutes, hours, or even days to receive advice. We investigate the potential to leverage experts to provide quicker help. We collected over 200 questions from online forums for two feature-rich software applications and suspected a quarter were short enough to be answered in less than one minute (defined as nanoquestions). We then conducted a study with 28 experts recruited from help forums to confirm this assumption, and explore whether there was a preference between text and audio answers. For more than half of the nanoquestions participants saw, they could give advice that they believed was helpful in under 60 seconds. Finally, we collected feedback about what makes a question quick to answer to inspire the design of future tools for ultra rapid human-to-human help.
Authors:Eden Wu, Christos Koutras, Cláudio T. Silva, Juliana Freire
Abstract:
Schema matching remains fundamental to data integration, yet evaluating and comparing matching methods is hindered by limited benchmark diversity and lack of interactive validation frameworks. BDIViz, recently published at IEEE VIS 2025, is an interactive visualization system for schema matching with LLM-assisted validation. Given source and target datasets, BDIViz applies automatic matching methods and visualizes candidates in an interactive heatmap with hierarchical navigation, zoom, and filtering. Users validate matches directly in the heatmap and inspect ambiguous cases using coordinated views that show attribute descriptions, example values, and distributions. An LLM assistant generates structured explanations for selected candidates to support decision-making. This demonstration showcases a new extension to BDIViz that addresses a critical need in data integration research: human-in-the-loop benchmarking and iterative matcher development. New matchers can be integrated through a standardized interface, while user validations become evolving ground truth for real-time performance evaluation. This enables benchmarking new algorithms, constructing high-quality ground-truth datasets through expert validation, and comparing matcher behavior across diverse schemas and domains. We demonstrate two complementary scenarios: (i) data harmonization, where users map a large tabular dataset to a target schema with value-level inspection and LLM-generated explanations; and (ii) developer-in-the-loop benchmarking, where developers integrate custom matchers, observe performance metrics, and refine their algorithms.
Authors:Syemin Park, Soobin Park, Youn-kyung Lim
Abstract:
LLMs offer new creative possibilities for writers but also raise concerns about authenticity and reader trust, particularly when AI involvement is disclosed. Prior research has largely framed this as an issue of transparency and provenance, emphasizing the disclosure of human-AI interaction traces that account for how much the AI wrote and what the human did. Yet such audit-oriented disclosures may risk reducing creative collaboration to quantification and surveillance. In this position paper, we argue for a different lens by exploring how human-AI interaction traces might instead function as expressive artifacts that foreground the meaning-making inherent in human-AI collaboration. Drawing inspiration from blackout poetry, we frame AI-generated text as found material through which writers' acts of curation and reinterpretation become inscribed atop the AI's original output. In this way, we suggest that designing interaction traces as aesthetic artifacts may help readers better appreciate and trust writers' creative contributions in AI-assisted writing.
Authors:Advait Sarkar, Christian Poelitz, Viktor Kewenig
Abstract:
Generative AI tools often answer questions using source documents, e.g., through retrieval augmented generation. Current groundedness and hallucination evaluations largely frame the relationship between an answer and its sources as binary (the answer is either supported or unsupported). However, this obscures both the syntactic moves (e.g., direct quotation vs. paraphrase) and the interpretive moves (e.g., induction vs. deduction) performed when models reformulate evidence into an answer. This limits both benchmarking and user-facing provenance interfaces. We propose the development of a reader-centred taxonomy of grounding as a set of support relations between generated statements and source documents. We explain how this might be synthesised from prior research in linguistics and philosophy of language, and evaluated through a benchmark and human annotation protocol. Such a framework would enable interfaces that communicate not just whether a claim is grounded, but how.
Authors:Annabella Sakunkoo, Jonathan Sakunkoo
Abstract:
Widespread digital learning has expanded access to education but has resulted in highly sedentary, click-based interaction, contributing to digital fatigue, reduced cognitive flexibility, and health risks associated with prolonged passive screen time. Meanwhile, data literacy has become an essential competency in a data-driven society, yet it is typically taught through passive, disembodied interfaces that offer little physical engagement. We present Kinetiq (Kinetic+IQ), a novel system that integrates fun, full-body micro-movements directly into data and numeracy problem solving. Instead of selecting answers with a mouse, learners interact through natural gestures such as reaching, dodging, heading, elbowing, or knee-raising, thus turning abstract data problem-solving into embodied experiences that integrate thinking with movement. In a preliminary within-subjects study comparing Kinetiq with conventional platforms, participants reported significantly higher affective valence, enjoyment, engagement, and motivation, while maintaining comparable learning gains. We contribute: (1) a task-integrated movement paradigm for data learning, (2) a cross-platform web and mobile app system enabling full-body learning in constrained everyday spaces, and (3) preliminary empirical evidence that embodied micro-movements can enrich the affective experience of data literacy learning.
Authors:Veronica Ruozzi, Sasan Matinfar, Pasquale Vergara, Alessandro Albanesi, Serena Dell'Aversana, Stefano Carugo, Gianluigi Buccoliero, Nassir Navab, Alberto Redaelli, Emiliano Votta
Abstract:
Percutaneous epicardial access (PEA), performed on a beating heart under fluoroscopy, enables arrhythmia treatment. However, advancing a needle toward the thin and moving pericardium remains highly challenging and risky. To address this problem, we present a physics-driven sonification method for Extended Reality (XR)-based multisensory navigation to enhance user perception during the critical needle landing phase in PEA. Dynamic cardiac anatomy from 4D CTA was reconstructed and registered to a real-world coordinate system. Real-time needle tracking provided the position of the needle tip relative to moving cardiac structures and drove an audio-visual feedback module. The visual display presented navigational cues and dynamic anatomy, while the auditory display encoded physiological cardiac states using a multilayer physical membrane model. A phantom study was conducted with twelve cardiologists performing needle insertions under visual-only and multisensory feedback. The multisensory method significantly improved navigation safety ($χ^2 = 11.30$, $p < 0.01$), reducing myocardial contact (3.64% vs. 7.27%) and increasing correct access (90.91% vs. 52.73%). Needle placement accuracy improved, with closer membrane proximity (Cliff delta = 0.19) and reduced variability ($p < 0.05$). Execution time was comparable, while time-accuracy correlations differed significantly between modalities ($p < 0.01$). NASA-TLX indicated lower cognitive load with multisensory guidance ($p < 0.01$). These results demonstrate the feasibility of physics-driven sonification for improving spatiotemporal awareness and supporting user-centered surgical navigation.
Authors:Minsol Michelle Kim, Daniel M. Low, David Lafond, Eugene Shim, Michelle Han, Mohanad Kandil, Chenyu Zhang, Theo Kitsberg, Chelsea Boccagno, Paul Pu Liang, Pattie Maes
Abstract:
Breaking negative mental health cycles, including rumination and recurring regrets, requires reflection that translates awareness into behavioral change. Grounded in the Transtheoretical Model (TTM) and Gross's Emotion Regulation (ER) Process Model, we examine how Technologies Supporting Self-Reflection (TSR) bridge reflection and action. In a 15-day in-the-wild study (N = 20), participants used a voice-based journaling system to capture regrets and wishes and engaged in WhatIf-Planning, a novel structured reflection module integrating counterfactual thinking with if-then planning. Participants were randomized to either a free-form condition or a Gross-guided condition, which maps the five processes of Gross's ER model into explicit journaling prompts. We contribute: (1) a unified reflection-to-action TSR system that operationalizes the Preparation stage of TTM to bridge Contemplation and Action, and (2) triangulated empirical evidence from an in-the-wild journaling study that first operationalizes Gross's Process Model, revealing effects on coping flexibility and emotion regulation in daily life. Results show significant pre-post improvements in coping flexibility, indicating adaptive self-regulation across conditions, with the Gross-guided group generating more counterfactual alternatives, articulating concrete if-then action plans, and implementing more plans for self-driven change.
Authors:Yifan Xu, Xiao Zhan, Akilu Yunusa Kaltungo, Ming Shan Ng, Tsukasa Ishizawa, Kota Fujimoto, Clara Cheung
Abstract:
As robots increasingly operate in shared, safety critical environments, acting safely is no longer sufficient robots must also make their safety decisions intelligible to human collaborators. In human robot collaboration (HRC), behaviours such as stopping or switching modes are often triggered by internal safety constraints that remain opaque to nearby workers. We present a dialogue based framework for interactive explanation of safety decisions in HRC. The approach tightly couples explanation with constraint based safety evaluation, grounding dialogue in the same state and constraint representations that govern behaviour selection. Explanations are derived directly from the recorded decision trace, enabling users to pose causal ("Why?"), contrastive ("Why not?"), and counterfactual ("What if?") queries about safety interventions. Counterfactual reasoning is evaluated in a bounded manner under fixed, certified safety parameters, ensuring that interactive exploration does not relax operational guarantees. We instantiate the framework in a construction robotics scenario and provide a structured operational trace illustrating how constraint aware dialogue clarifies safety interventions and supports coordinated task recovery. By treating explanation as an operational interface to safety control, this work advances a design perspective for interactive, safety aware autonomy in HRC.
Authors:Krisha Mehta, Sami Elahi, Alex Kale
Abstract:
Data communication entails ethical dilemmas where situational constraints forbid full disclosure of source data. Whereas visualization research and pedagogy often frames ethics as a matter of individuals making deceptive design choices or being misled, disclosure problems involve negotiation between pro-social actors. To provide observability into these situated judgments, we contribute Purrsuasion, an open-source visualization game where participants play the roles of (i) data providers designing visualizations subject to disclosure constraints and (ii) data seekers requesting information and awarding a contract. We deploy Purrsuasion in an undergraduate data science class (N = 27), gathering gameplay data to support a mixed-methods analysis of students' communication dynamics, problem solving, and trust formation. We find that difficulties envisioning an ideal visualization solution lead to satisficing in visualization authoring and difficulties attributing authorial intent. Given these challenges, we approach scoring student solutions by developing a heuristic rubric that supports sociotechnical judgments of disclosure adherence.
Authors:Jonathan Albert Cohen, Kye Shimizu, Allen Song, Vishnu Bharath, Kent Larson, Pattie Maes
Abstract:
Robots in shared spaces often move in ways that are difficult for people to interpret, placing the burden on humans to adapt. High-DoF robots exhibit motion that people read as expressive, intentionally or not, making it important to understand how such cues are perceived. We present an online video study evaluating how different signaling modalities, expressive motion, lights, text, and audio, shape people's ability to understand a quadruped robot's upcoming navigation actions (Boston Dynamics Spot). Across four common scenarios, we measure how each modality influences humans' (1) accuracy in predicting the robot's next navigation action, (2) confidence in that prediction, and (3) trust in the robot to act safely. The study tests how expressive motions compare to explicit channels, whether aligned multimodal cues enhance interpretability, and how conflicting cues affect user confidence and trust. We contribute initial evidence on the relative effectiveness of implicit versus explicit signaling strategies.
Authors:Chathuri Jayaweera, Bonnie J. Dorr
Abstract:
Large language model (LLM)-based educational assistants often provide direct answers that short-circuit learning by reducing exploration, self-explanation, and engagement with course materials. We present BLADE (Better Language Answers through Dialogue and Explanations), a grounded conversational assistant that guides learners to relevant instructional resources rather than supplying immediate solutions. BLADE uses a retrieval-augmented generation (RAG) framework over curated course content, dynamically surfacing pedagogically relevant excerpts in response to student queries. Instead of delivering final answers, BLADE prompts direct engagement with source materials to support conceptual understanding. We conduct an impact study in an undergraduate computer science course, with different course resource configurations and show that BLADE improves students' navigation of course resources and conceptual performance compared to simply providing the full inventory of course resources. These results demonstrate the potential of grounded conversational AI to reinforce active learning and evidence-based reasoning.
Authors:Blaine Kuehnert, Nari Johnson, Ravit Dotan, Hoda Heidari
Abstract:
Documentation-based disclosure has become a central governance strategy for responsible AI, particularly in public-sector procurement. Tools such as model cards, datasheets, and AI FactSheets are increasingly expected to support accountability, risk assessment, and informed decision-making across organizational boundaries. Yet there is limited empirical evidence about how these artifacts are produced, interpreted, and used in practice. In this paper, we present a qualitative study of the GovAI Coalition FactSheet, a widely adopted transparency document designed to support AI procurement and governance in government contexts. Drawing on semi-structured interviews with vendors and public-sector practitioners, alongside a systematic analysis of completed FactSheets, we examine how FactSheets are used, what information they surface, and where they fall short. We find that FactSheets are asked to serve multiple and conflicting purposes simultaneously: showcasing vendor offerings, supporting evaluation and due diligence, and facilitating early-stage dialogue between vendors and agencies. These competing expectations, combined with the structural constraints of voluntary and public self-disclosure, limit the ability of FactSheets to function as standalone evaluation or risk-assessment tools. At the same time, our findings suggest that when understood as relational artifacts used to establish trust, shared understanding, and ongoing dialogue, FactSheets can help create conditions that support more meaningful disclosure and governance over time.
Authors:Abu Noman Md Sakib, Protik Dey, Zijie Zhang, Taslima Akter
Abstract:
Explainable Artificial Intelligence (XAI) is critical for ensuring trust and accountability, yet its development remains predominantly visual. For blind and low-vision (BLV) users, the lack of accessible explanations creates a fundamental barrier to the independent use of AI-driven assistive technologies. This problem intensifies as AI systems shift from single-query tools into autonomous agents that take multi-step actions and make consequential decisions across extended task horizons, where a single undetected error can propagate irreversibly before any feedback is available. This paper investigates the unique XAI requirements of the BLV community through a comprehensive analysis of user interviews and contemporary research. By examining usage patterns across environmental perception and decision support, we identify a significant modality gap. Empirical evidence suggests that while BLV users highly value conversational explanations, they frequently experience "self-blame" for AI failures. The paper concludes with a research agenda for accessible Explainable AI in agentic systems, advocating for multimodal interfaces, blame-aware explanation design, and participatory development.
Authors:John Paul P. Miranda, Rhiziel P. Manalese, Ivan G. Liwanag, Rodel T. Alimurong, Alvin B. Roque
Abstract:
This study examined how behavioral, emotional, and contextual factors influence Filipino students' willingness to use artificial intelligence (AI) for mental health support. Results showed that habit had the strongest effect on willingness, followed by comfort, emotional benefit, facilitating conditions, and perceived usefulness. Students who used AI tools regularly felt more confident and open to relying on them for emotional support. Empathy, privacy, and accessibility also increased comfort and trust in AI systems. The findings highlight that emotional safety and routine use are essential in promoting willingness. The study recommends AI literacy programs, empathic design, and ethical policies that support responsible and culturally sensitive use of AI for student mental health care.
Authors:Beleicia Bullock, James A. Landay, Michael S. Bernstein
Abstract:
Metaphors enable designers to communicate their ideal user experience for platforms. Yet, we often do not know if these design metaphors match users' actual experiences. In this work, we compare design and user metaphors across three different platforms: ChatGPT, Twitter, and YouTube. We build on prior methods to elicit 554 user metaphors, as well as ratings on how well each metaphor describes users' experiences. We then identify 21 design metaphors by analyzing each platform's historical web presence since their launch date. We find that design metaphors often do not match the metaphors that users use to describe their experiences. Even when design and user metaphors do match, the metaphors do not always resonate universally. Through these findings, we highlight how comparing design and user metaphors can help to evaluate and refine metaphors for user experience.
Authors:Lucy Jiang, Amy Seunghyun Lee, Jon E. Froehlich, Leah Findlater
Abstract:
Public art can hold cultural, social, political, and aesthetic significance, enriching urban environments and promoting well-being. However, a majority of urban art is inaccessible to blind and low vision (BLV) people. Most art access research has focused on private and curated settings (e.g., museums, galleries) and most urban access work has centered on outdoor navigation, leaving urban and public art accessibility largely understudied. We conducted semi-structured interviews with 16 BLV participants, using design probes featuring AI-generated descriptions and real-time AI interactions to investigate preferences for both discovering and engaging with urban art. We found that BLV people valued spontaneous art exploration, multisensory (e.g., tactile, auditory, olfactory) engagement, and detailed descriptions of culturally significant artwork. Participants also highlighted challenges distinct to urban art contexts: safety took precedence over art exploration, multisensory access measures could be disruptive to others in the public space, and inaccurate AI descriptions could lead to cultural erasure. Our contributions include empirical insights on BLV preferences for urban art discovery and engagement, seven design dimensions for public art access solutions, and implications for expanding HCI urban accessibility research beyond navigation.
Authors:Seunghwa Pyo, Donggun Lee, Jungwoo Rhee, Soobin Park, Youn-kyung Lim
Abstract:
People increasingly use multiple Multimodal Large Language Models (MLLMs) concurrently, selecting each based on its perceived strengths. This cross-platform practice creates coordination challenges: adapting prompts to different interfaces, calibrating trust against inconsistent behaviors, and navigating separate conversation histories. Prior HCI research focused on single-agent interactions, leaving multi-MLLM orchestration underexplored. Through a diary study and semi-structured interviews (N=10), we examine how individuals organize work across competing AI systems. Our findings reveal that users construct primary and secondary hierarchies among models that shift over usage context. They also develop personalized switching patterns triggered by task aggregation to adjust effort and latency, and output credibility. These insights inform future tool design opportunities, supporting users to coordinate multi-MLLM workflows.
Authors:Eunseo Oh, Suyoun Lee, Jae Young Choi, Soobin Park, Youn-kyung Lim
Abstract:
LLMs have become deeply embedded in knowledge work, raising concerns about growing dependency and the potential undermining of human skills. To investigate the pervasiveness of LLMs in work practices, we conducted a four-day diary study with frequent LLM users (N=10), observing how knowledge workers responded to a temporary withdrawal of LLMs. Our findings show how LLM withdrawal disrupted participants' workflows by identifying gaps in task execution, how self-directed work led participants to reclaim professional values, and how everyday practices revealed the extent to which LLM use had become inescapably normative. Conceptualizing LLMs as infrastructural to contemporary knowledge work, this research contributes empirical insights into the often invisible role of LLMs and proposes value-driven appropriation as an approach to supporting professional values in the current LLM-pervasive work environment.
Authors:Duosi Dai, Pavithren V S Pakianathan, Gunnar Treff, Mahdi Sareban, Jan David Smeddinck, Sanna Kuoppamäki
Abstract:
Wearables and mobile health applications are increasingly adopted for self-management of chronic illnesses; yet the data feels overwhelming for older adults with cardiovascular disease (CVD). This study explores how they make sense of self-tracked data and identifies design opportunities for Large Language Model (LLM)-enabled support. We conducted a seven-day diary study and follow-up interviews with eight CVD patients aged 64-82. We identified six themes: navigating emotional complexity, owning health narratives, prioritizing bodily sensations, selective engagement with health metrics, negotiating socio-technical dynamics of sharing, and cautious optimism toward AI. Findings highlight that self-tracking is affective, interpretive, and socially situated. We outline design directions for LLM-enabled data sensemaking systems: supporting emotional engagement, reinforcing patient agency, acknowledging embodied experiences, and prompting dialogue in clinical and social contexts. To support safety, expert-in-the-loop mechanisms are essential. These directions articulate how LLMs can help translate data into narratives and carry implications for human-data interaction and behavior-change support.
Authors:Yasamin Borhani, Taylor Mordan, Yihan Wang, Reyhaneh Hosseininejad, Javad Khoramdel, Alexandre Alahi
Abstract:
Object skeletons offer a concise representation of structural information, capturing essential aspects of posture and orientation that are crucial for autonomous driving applications. However, a unified architecture that simultaneously handles multiple instances and categories using only the input image remains elusive. In this paper, we introduce PoseDriver, a unified framework for bottom-up multi-category skeleton detection tailored to common objects in driving scenarios. We model each category as a distinct task to systematically address the challenges of multi-task learning. Specifically, we propose a novel approach for lane detection based on skeleton representations, achieving state-of-the-art performance on the OpenLane dataset. Moreover, we present a new dataset for bicycle skeleton detection and assess the transferability of our framework to novel categories. Experimental results validate the effectiveness of the proposed approach.
Authors:Abed K. Musaffar, Ambuj Singh, Francesco Bullo
Abstract:
Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents' autonomy and contextual knowledge enables them to be useful, it also exposes them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent to provide harmful information, potentially manipulating human agents to make harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human-AI teams. Using a dataset consisting of multi-party conversations and decisions for a real human-AI team over a 25 round horizon, we formulate the problem of malicious behavior detection from interaction traces. We find that LLMs are capable of identifying malicious behavior in real-time, and without task-specific information, indicating the potential for task-agnostic defense. Moreover, we find that the malicious behavior of interest is not easily identified using simple heuristics, further suggesting the introduction of LLM defenders could render human teams more robust to certain classes of attack.
Authors:Manuel Scheibl, Julian Leichert, Sinem Görmez, Britta Wrede
Abstract:
Physiological signals are increasingly relevant to estimate the mental states of users in human-robot interaction (HRI), yet ROS 2-based HRI frameworks still lack reusable support to integrate such data streams in a standardized way. Therefore, we propose Sense4HRI, an adapted framework for human-robot interaction in ROS 2 that integrates physiological measurements and derived user-state indicators. The framework is designed to be extensible, allowing the integration of additional physiological sensors, their interpretation, and multimodal fusion to provide a robust assessment of the mental states of users. In addition, it introduces reusable interfaces for timestamped physiological time-series data and supports synchronized logging of physiological signals together with experiment context, enabling interoperable and traceable multimodal analysis within ROS 2-based HRI systems.
Authors:Binyan Xu, Wei Wu, Soonhyeon Kweon, Casper Harteveld, Leanne Chukoskie
Abstract:
Augmented reality games hold promise for rehabilitation, yet most remain confined to laboratory studies with limited clinical uptake. Recent advances in spatial computing, especially lightweight, glasses_form_factor AR, create a timely opportunity to embed rehabilitative play into clinical practice and daily contexts. To investigate this potential, we systematically reviewed 132 applications and conducted playtesting with 14 licensed physical therapists. Our analysis revealed three ways therapists re_authored AR games: co_authored play (reshaping movements, progressions, and difficulty), situated play (adapting across specialties, conditions, and contexts), and dual play (mediating both physical recovery and psychological support). We reframe therapists' frequent phrase_It depends_as a generative design principle. This study contributes a clinical reasoning_based framework and design principles and guidelines for creating personalized, situated forms of play that align with therapists' everyday workflows and inform future lab_to_clinic translation.
Authors:Shuyue Feng, Cedric Caremel, Yoshihiro Kawahara
Abstract:
Topology optimization (TO) is employed in engineering to optimize structural performance while maximizing material efficiency. However, traditional TO methods incur significant computational and time costs. Although research has leveraged generative AI to predict TO outcomes and validated feasibility and accuracy, existing approaches still suffer from limited customizability and impose a high cognitive load on users. Furthermore, balancing structural performance with aesthetic attributes remains a persistent challenge. We developed Sketch2Topo, which augments a diffusion-based TO model with image-to-image generation and image editing capabilities. With Sketch2Topo, users can use sketching to customize geometries and specify physical constraints. The tool also supports mask input, enabling users to perform TO on selected regions only, thereby supporting higher levels of customization. We summarize the workflow and details of the tool and conduct a brief quantitative evaluation. Finally, we explore application scenarios and discuss how hand-drawn input improves usability while balancing functionality and aesthetics.
Authors:Neil Fernandes, Cheng Tang, Tehniyat Shahbaz, Alex Hauschildt, Emily Davies-Robinson, Yue Hu, Kerstin Dautenhahn
Abstract:
Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and added them in an integrated prototype that uses short story-based activities, multi-modal scaffolding (speech, facial feedback, gesture), and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-the-loop SARs supporting language socialization in community settings and outline directions for child-centered evaluation in authentic programs.
Authors:Mehran Shabanpour, Sadaf Khademi, Konstantinos N Plataniotis, Arash Mohammadi
Abstract:
Forecasting Electroncephalography (EEG) signals during cognitive events remains a fundamental challenge in neuroscience and Brain-Computer Interfaces (BCIs), as existing methods struggle to capture both the stochastic nature of neural dynamics and the semantic context of behavioral tasks. We present the Dual-Enhanced COnditioned Diffusion (DECODE) for EEG, a novel framework that unifies semantic guidance from natural language descriptions with temporal dynamics from historical signals to generate event-specific neural responses. DECODE leverages pre-trained language models to condition the diffusion process on rich textual descriptions of cognitive events, while maintaining temporal coherence through history-based Langevin dynamics. Evaluated on a real-world driving task dataset with five distinct behaviors, DECODE achieves sub-microvolt prediction accuracy (MAE = 0.626 microvolt) over 75 timestep horizons while maintaining well-calibrated uncertainty estimates. Our framework demonstrates that natural language can effectively bridge high-level cognitive descriptions and low-level neural dynamics, opening new possibilities for zero-shot generalization to novel behaviors and interpretable BCIs. By generating physiologically plausible, event-specific EEG trajectories conditioned on semantic descriptions, DECODE establishes a new paradigm for understanding and predicting context-dependent neural activity.
Authors:Supriya Khadka, Sanchari Das
Abstract:
In decentralized web applications, users face an inherent conflict between public verifiability and personal privacy. To participate in regulated on-chain services, users must currently disclose sensitive identity documents to centralized intermediaries, permanently linking real-world identities to public transaction histories. This binary choice between total privacy loss or total exclusion strips users of agency and exposes them to persistent surveillance. In this work, we introduce a Selective Disclosure Framework designed to restore user sovereignty by decoupling eligibility verification from identity revelation. We present ZK-Compliance, a prototype that leverages browser-based zero-knowledge proofs to shift the interaction model, enabling users to prove specific attributes (e.g., "I am over 18") locally without revealing the underlying data. We implement a user-governed Grant, Verify, Revoke lifecycle that transforms the user's mental model of compliance from a permanent data handover into a dynamic, revocable authorization session. Our evaluation shows that client-side proof generation takes under 200ms, enabling a seamless interactive experience on commodity hardware. This work provides early evidence that regulatory compliance need not come at the cost of user privacy or autonomy.
Authors:Nadine Jost, Benjamin Berens, Manuel Karl, Stefan Albert Horstmann, Martin Johns, Alena Naiakshina
Abstract:
The ongoing shortage of skilled developers, particularly in security-critical software development, has led organizations to increasingly adopt AI-powered development tools to boost productivity and reduce reliance on limited human expertise. These tools, often based on large language models, aim to automate routine tasks and make secure software development more accessible and efficient. However, it remains unclear how developers' general programming and security-specific experience, and the type of AI tool used (free vs. paid) affect the security of the resulting software. Therefore, we conducted a quantitative programming study with software developers (n=159) exploring the impact of Google's AI tool Gemini on code security. Participants were assigned a security-related programming task using either no AI tools, the free version, or the paid version of Gemini. While we did not observe significant differences between using Gemini in terms of secure software development, programming experience significantly improved code security and cannot be fully substituted by Gemini.
Authors:Shaojun Cai, Nuwan Janaka, Ashwin Ram, Janidu Shehan, Yingjia Wan, Kotaro Hara, David Hsu
Abstract:
Robotic guidance systems have shown promise in supporting blind and visually impaired (BVI) individuals with wayfinding and obstacle avoidance. However, most existing systems assume a clear path and do not support a critical aspect of navigation - environmental interactions that require manipulating objects to enable movement. These interactions are challenging for a human-robot pair because they demand (i) precise localization and manipulation of interaction targets (e.g., pressing elevator buttons) and (ii) dynamic coordination between the user's and robot's movements (e.g., pulling out a chair to sit). We present a collaborative human-robot approach that combines our robotic guide dog's precise sensing and localization capabilities with the user's ability to perform physical manipulation. The system alternates between two modes: lead mode, where the robot detects and guides the user to the target, and adaptation mode, where the robot adjusts its motion as the user interacts with the environment (e.g., opening a door). Evaluation results show that our system enables navigation that is safer, smoother, and more efficient than both a traditional white cane and a non-adaptive guiding system, with the performance gap widening as tasks demand higher precision in locating interaction targets. These findings highlight the promise of human-robot collaboration in advancing assistive technologies toward more generalizable and realistic navigation support.
Authors:Sverrir Thorgeirsson, Theo B. Weidmann, Zhendong Su
Abstract:
Many software development platforms now support LLM-driven programming, or "vibe coding", a technique that allows one to specify programs in natural language and iterate from observed behavior, all without directly editing source code. While its adoption is accelerating, little is known about which skills best predict success in this workflow. We report a preregistered cross-sectional study with tertiary-level students (N = 100) who completed measures of computer-science achievement, domain-general cognitive skills, written-communication proficiency, and a vibe-coding assessment. Tasks were curated via an eight-expert consensus process and executed in a purpose-built, vibe-coding environment that mirrors commercial tools while enabling controlled evaluation. We find that both writing skill and CS achievement are significant predictors of vibe-coding performance, and that CS achievement remains a significant predictor after controlling for domain-general cognitive skills. The results may inform tool and curriculum design, including when to emphasize prompt-writing versus CS fundamentals to support future software creators.
Authors:Ebrahim Feghhi, Junlin Hu, Nima Hadidi, Jonathan C. Kao
Abstract:
A promising pathway for restoring communication in patients with dysarthria and anarthria is speech neuroprostheses, which directly decode speech from cortical neural activity. Two benchmarks, Brain-to-Text '24 and '25, released intracranial recordings from patients with dysarthria along with a baseline algorithm trained with Connectionist Temporal Classification (CTC). Despite significant innovation on these benchmarks, all leading published prior work relies on a WFST-based CTC decoder that requires ${\sim}$320 GB of RAM. These memory requirements limit accessibility for both patients and researchers. Here, we propose LightBeam, a non-WFST based CTC decoder that requires only ${\sim}$10 GB of RAM and achieves state-of-the-art performance on both benchmarks. LightBeam achieves this by integrating an LLM into the beam-search process via delayed fusion, obviating the prior need for using a large N-gram LM. LightBeam is implemented in Python and is open-source.
Authors:Wei Wu, Binyan Xu, Soonhyeon Kweon, Yujie Wang, Leanne Chukoskie, Casper Harteveld
Abstract:
Lightweight augmented reality (AR) glasses are increasingly entering everyday use, extending interaction design beyond short, isolated sessions. However, most existing gesture vocabularies are inherited from VR headsets or early AR goggles. These systems tend to prioritize recognizer accuracy while overlooking fatigue, sustainability, and social legibility in daily contexts. To address this gap, we collaborated with physical therapists (PTs) to reimagine gesture design for everyday AR, drawing on their expertise in safe and sustainable movement. Through a review of 104 AR applications, we identified 15 common gesture intents and implemented an on-device gesture generator. Ten licensed physical therapists, with an average of 14.8 years of professional experience, then shaped these gesture intents through three iterative stages: unaided gesture performance, PT-guided gesture substitution, and stage-aware card sorting. This work contributes (1) a PT-informed gesture translation method, (2) the Everyday-AR Golden Ergonomic Canvas, and (3) a stage-aware social legibility framework that illustrates how gesture suitability shifts with social readability. Together, these contributions provide a recognizer-agnostic reference framework for designing sustainable and socially coherent gesture vocabularies for lightweight AR glasses.
Authors:Christopher A. Kelly, Yikun Chi, Nicholas Haber, Byron Reeves, Mu-Jung Cho, Thomas N. Robinson, Nilam Ram, Johannes C. Eichstaedt
Abstract:
The relationship between digital media use and mental health remains poorly understood, in part because real-world digital behavior is rarely captured at scale. This intensive longitudinal study tracked participants' complete natural smartphone interactions over one year. We collected screenshots every 5 seconds from 145 adults (yielding 111 million screenshots), alongside biweekly assessments of anxiety and depression (mean = 24 surveys). The valence and arousal of each screenshot were assessed using a deep learning affect model. Individuals showed highly idiosyncratic media patterns, with substantially more variance in anxiety and depression accounted for within-person than between-person. Day-to-day fluctuations in the valence and arousal of a person's screen content predicted subsequent changes in depression and anxiety, whereas between-person differences did not. Specifically, greater exposure to low-arousal negative content was associated with higher depression and anxiety. These findings underscore the dynamic, idiosyncratic nature of digital consumption and the need for targeted measurement and intervention.
Authors:Carlo Dindorf, Jonas Dully, Rebecca Keilhauer, Michael Lorenz, Michael Fröhlich
Abstract:
Background: Machine learning (ML) enhances gait analysis but often lacks the level of interpretability desired for clinical adoption. Large Language Models (LLMs) may offer explanatory capabilities and confidence-aware outputs when applied to structured kinematic data. This study therefore evaluated whether general-purpose LLMs can classify continuous gait kinematics when represented as textual numeric sequences and how their performance compares to conventional ML approaches. Methods: Lower-body kinematics were recorded from 20 participants performing seven gait patterns. A supervised KNN classifier and a class-independent One-Class SVM (OCSVM) were compared against zero-shot LLMs (GPT-5, GPT-5-mini, GPT-4.1, and o4-mini). Models were evaluated using Leave-One-Subject-Out (LOSO) cross-validation. LLMs were tested both with and without explicit reference gait statistics. Results: The supervised KNN achieved the highest performance (multiclass Matthews Correlation Coefficient, MCC = 0.88). The best-performing LLM (GPT-5) with reference grounding achieved a multiclass MCC of 0.70 and a binary MCC of 0.68, outperforming the class-independent OCSVM (binary MCC = 0.60). Performance of the LLM was highly dependent on explicit reference information and self-rated confidence; when restricted to high-confidence predictions, multiclass MCC increased to 0.83 on the filtered subset. Notably, the computationally efficient o4-mini model performed comparably to larger models. Conclusion: When continuous kinematic waveforms were encoded as textual numeric tokens, general-purpose LLMs, even with reference grounding, did not match supervised multiclass classifiers for precise gait classification and are better regarded as exploratory systems requiring cautious, human-guided interpretation rather than diagnostic use.
Authors:Zixin Wen, Yifu Cai, Kyle Lee, Sam Estep, Josh Sunshine, Aarti Singh, Yuejie Chi, Wode Ni
Abstract:
Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components (''ideas'') and performs code planning based on the ideas. Given the plan, Feynman translates ideas into simple declarative programs and iterates to receives feedback and visually refine diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. The optimization-based rendering of Penrose preserves the visual semantics while injecting fresh randomness into the layout, thereby producing diagrams with visual consistency and diversity. As a result, Feynman can author diagrams along with grounded captions with very little cost and time. Using Feynman, we synthesized a dataset with more than 100k well-aligned diagram-caption pairs. We also curate a visual-language benchmark, Diagramma, from freshly generated data. Diagramma can be used for evaluating the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and the full agent pipeline as an open-source project.
Authors:Xiaofu Jin, Yunpeng Bai, Antti Oulasvirta
Abstract:
Users often struggle to locate an item within an information architecture, particularly when links are ambiguous or deeply nested in hierarchies. Information scent has been used to explain why users select incorrect links, but this concept assumes that users see all available links before deciding. In practice, users frequently select a link too quickly, overlook relevant cues, and then rely on backtracking when errors occur. We extend the concept of information scent by framing navigation as a sequential decision-making problem under memory constraints. Specifically, we assume that users do not scan entire pages but instead inspect strategically, looking "just enough" to find the target given their time budget. To choose which item to inspect next, they consider both local (this page) and global (site) scent; however, both are constrained by memory. Trying to avoid wasting time, they occasionally choose the wrong links without inspecting everything on a page. Comparisons with empirical data show that our model replicates key navigation behaviors: premature selections, wrong turns, and recovery from backtracking. We conclude that trial-and-error behavior is well explained by information scent when accounting for the sequential and bounded characteristics of the navigation problem.
Authors:Felix Anand Epp, Matti Nelimarkka, Jesse Haapoja, Pedro Ferreira, Os Keyes, Shaowen Bardzell
Abstract:
This is the Proceedings of the First CHI Workshop on CHIdeology: Disentangling the fragmented politics, values, and imaginaries of Human-Computer Interaction through ideologies, held on Wednesday, 15 April, in Barcelona, Spain, at the ACM CHI Conference on Human Factors in Computing Systems.
Authors:Xingyu Bruce Liu, Mira Dontcheva, Dingzeyu Li
Abstract:
Everyone can write their stories in freeform text format -- it's something we all learn in school. Yet storytelling via video requires one to learn specialized and complicated tools. In this paper, we introduce Doki, a text-native interface for generative video authoring, aligning video creation with the natural process of text writing. In Doki, writing text is the primary interaction: within a single document, users define assets, structure scenes, create shots, refine edits, and add audio. We articulate the design principles of this text-first approach and demonstrate Doki's capabilities through a series of examples. To evaluate its real-world use, we conducted a week-long deployment study with participants of varying expertise in video authoring. This work contributes a fundamental shift in generative video interfaces, demonstrating a powerful and accessible new way to craft visual stories.
Authors:Satheeshkumar Veeramani, Anna Kisil, Abigail Bentley, Hatem Fakhruldeen, Gabriella Pizzuto, Andrew I. Cooper
Abstract:
Self-driving laboratories (SDLs) are rapidly transforming research in chemistry and materials science to accelerate new discoveries. Mobile robot chemists (MRCs) play a pivotal role by autonomously navigating the lab to transport samples, effectively connecting synthesis, analysis, and characterisation equipment. The instruments within an SDL are typically designed or retrofitted to be accessed by both human and robotic chemists, ensuring operational flexibility and integration between manual and automated workflows. In many scenarios, human and robotic chemists may need to use the same equipment simultaneously. Currently, MRCs rely on simple LiDAR-based obstruction detection, which forces the robot to passively wait if a human is present. This lack of situational awareness leads to unnecessary delays and inefficient coordination in time-critical automated workflows in human-robot shared labs. To address this, we present an initial study of an embodied, AI-driven perception method that facilitates proactive human-robot interaction in shared-access scenarios. Our method features a hierarchical human intention prediction model that allows the robot to distinguish between preparatory actions (waiting) and transient interactions (accessing the instrument). Our results demonstrate that the proposed approach enhances efficiency by enabling proactive human-robot interaction, streamlining coordination, and potentially increasing the efficiency of autonomous scientific labs.
Authors:Mengyuan Millie Wu, Zhihan Jiang, Yuang Fan, Richard Feng, Sahiti Dharmavaram, Mathew Polowitz, Shawn Fallon, Bashima Islam, Lizbeth Benson, Irene Tung, David Creswell, Xuhai Xu
Abstract:
Mindfulness meditation is a widely accessible and evidence-based method for supporting mental health. Despite the proliferation of mindfulness meditation apps, sustaining user engagement remains a persistent challenge. Personalizing the meditation experience is a promising strategy to improve engagement, but it often requires costly and unscalable manual effort. We present MindfulAgents, a multi-agent system powered by large language models that (1) generates guided meditation scripts based on an expert-established mindfulness framework, (2) encourages users' reflection on emotional states and mindfulness skills, and (3) enables real-time personalization of the mindfulness meditation experience for each user. In a formative lab study (N=13), MindfulAgents significantly improved in-session engagement (p = 0.011) and self-awareness (p = 0.014), and reduced momentary stress (p = 0.020). Furthermore, a four-week deployment study (N=62) demonstrated a notable increase in long-term engagement (p = 0.002) and level of mindfulness (p = 0.023). Participants reported that MindfulAgents offered more relevant meditation sessions personalized to individual needs in various contexts, supporting sustained practice. Our findings highlight the potential of LLM-driven personalization for enhancing user engagement in digital mindfulness meditation interventions.
Authors:Yunpeng Bai, Xiaofu Jin, Shengdong Zhao, Antti Oulasvirta
Abstract:
Reading is a pervasive and cognitively demanding activity that underpins modern human culture. It is a prime instance of a class of tasks where eye movements are coordinated for the purpose of comprehension. Existing theories explain either eye movements or comprehension during reading, but the critical link between the two remains unclear. Here, we propose resource-rational optimization as a unifying principle governing adaptive reading behavior. Eye movements are selected to maximize expected comprehension while minimizing cognitive and temporal costs, organized hierarchically across nested time scales: fixation decisions support word recognition; sentence-level integration guides skipping and regression; and text-level comprehension goals shape memory construction and rereading. A computational implementation successfully replicates an unprecedented range of findings in human reading, from lexical effects to comprehension outcomes. Together, these results suggest that resource rationality provides a general mechanism for coordinating perception, memory, and action in knowledge-intensive human behaviors, offering a principled account of how complex cognitive skills adapt to limited resources.
Authors:Matthew Brehmer, Maxime Cordeil, Christophe Hurter, Takayuki Itoh, Wolfgang Büschel, Mahmood Jasim, Arnaud Prouzeau, David Saffo, Lyn Bartram, Sheelagh Carpendale, Chen Zhu-Tian, Andrew Cunningham, Tim Dwyer, Samuel Huron, Masahiko Itoh, Alark Joshi, Kiyoshi Kiyokawa, Hideaki Kuzuoka, Bongshin Lee, Gabriela Molina León, Harald Reiterer, Bektur Ryskeldiev, Jonathan Schwabish, Brian A. Smith, Yasuyuki Sumi, Ryo Suzuki, Anthony Tang, Yalong Yang, Jian Zhao
Abstract:
We characterize 16 challenges faced by those investigating and developing remote and synchronous collaborative experiences around visualization. Our work reflects the perspectives and prior research efforts of an international group of 29 experts from across human-computer interaction and visualization sub-communities. The challenges are anchored around five collaborative activities that exhibit a centrality of visualization and multimodal communication. These activities include exploratory data analysis, creative ideation, visualization-rich presentations, joint decision making grounded in data, and real-time data monitoring. The challenges also reflect the changing dynamics of these activities in the face of recent advances in extended reality (XR) and artificial intelligence (AI). As an organizing scheme for future research at the intersection of visualization and computer-supported cooperative work, we align the challenges with a sequence of four sets of research and development activities: technological choices, social factors, AI assistance, and evaluation.
Authors:Jean-Daniel Fekete, Yifan Hu, Dominik Moritz, Arnab Nandi, Senjuti Basu Roy, Eugene Wu, Nikos Bikakis, George Papastefanatos, Panos K. Chrysanthis, Guoliang Li, Lingyun Yu
Abstract:
The rapid advancement of AI is transforming human-centered systems, with profound implications for human-AI interaction, human-data interaction, and visual analytics. In the AI era, data analysis increasingly involves large-scale, heterogeneous, and multimodal data that is predominantly unstructured, as well as foundation models such as LLMs and VLMs, which introduce additional uncertainty into analytical processes. These shifts expose persistent challenges for human-data interactive systems, including perceptually misaligned latency, scalability constraints, limitations of existing interaction and exploration paradigms, and growing uncertainty regarding the reliability and interpretability of AI-generated insights. Responding to these challenges requires moving beyond conventional efficiency and scalability metrics, redefining the roles of humans and machines in analytical workflows, and incorporating cognitive, perceptual, and design principles into every level of the human-data interaction stack. This paper investigates the challenges introduced by recent advances in AI and examines how these developments are reshaping the ways users engage with data, while outlining limitations and open research directions for building human-centered AI systems for interactive data analysis in the AI era.
Authors:Laura Spillner, Rachel Ringe, Robert Porzel, Rainer Malaka
Abstract:
A central challenge in AI-assisted decision making is achieving warranted, well-calibrated trust. Both overtrust (accepting incorrect AI recommendations) and undertrust (rejecting correct advice) should be prevented. Prior studies differ in the design of the decision workflow - whether users see the AI suggestion immediately (1-step setup) or have to submit a first decision beforehand (2-step setup) -, and in how trust is measured - through self-reports or as behavioral trust, that is, reliance. We examined the effects and interactions of (a) the type of decision workflow, (b) the presence of explanations, and (c) users' domain knowledge and prior AI experience. We compared reported trust, reliance (agreement rate and switch rate), and overreliance. Results showed no evidence that a 2-step setup reduces overreliance. The decision workflow also did not directly affect self-reported trust, but there was a crossover interaction effect with domain knowledge and explanations, suggesting that the effects of explanations alone may not generalize across workflow setups. Finally, our findings confirm that reported trust and reliance behavior are distinct constructs that should be evaluated separately in AI-assisted decision making.
Authors:Chen Sun, Yash Vekaria, Rishab Nithyanand
Abstract:
As LLM-driven agents begin to autonomously navigate the web, their ability to interpret and respond to manipulative interface design becomes critical. A fundamental question that emerges is: can such agents reliably recognize patterns of friction, misdirection, and coercion in interface design (i.e., dark patterns)? We study this question in a setting where the workflows are consequential: website portals associated with the submission of CCPA-related data rights requests. These portals operationalize statutory rights, but they are implemented as interactive interfaces whose design can be structured to facilitate, burden, or subtly discourage the exercise of those rights. We design and deploy an LLM-driven auditing agent capable of end-to-end traversal of rights-request workflows, structured evidence gathering, and classification of potential dark patterns. Across a set of 456 data broker websites, we evaluate: (1) the ability of the agent to consistently locate and complete request flows, (2) the reliability and reproducibility of its dark pattern classifications, and (3) the conditions under which it fails or produces poor judgments. Our findings characterize both the feasibility and the limitations of using LLM-driven agents for scalable dark pattern auditing.
Authors:Ryan Feng Lin, Yuantao Wei, Huiling Liao, Xiaoning Qian, Shuai Huang
Abstract:
Learning causal structures typically represented by directed acyclic graphs (DAGs) from observational data is notoriously challenging due to the combinatorial explosion of possible graphs and inherent ambiguities in observations. This paper argues that causal learning is now ready for the emergence of a new paradigm supported by rapidly advancing technologies, fulfilling the long-standing vision of leveraging human causal knowledge. This paradigm integrates scalable crowdsourcing platforms for data collection, interactive knowledge elicitation for expert opinion modeling, robust aggregation techniques for expert reconciliation, and large language model (LLM)-based simulation for augmenting AI-driven information acquisition. In this paper, we focus on DAG learning for causal discovery and frame the problem as a distributed decision-making task, recognizing that each participant (human expert or LLM agent) possesses fragmented and imperfect knowledge about different subsets of the variables of interest in the causal graph. By proposing a systematic framework to synthesize these insights, we aim to enable the recovery of a global causal structure unachievable by any individual agent alone. We advocate for a new research frontier and outline a comprehensive framework for new research thrusts that range from eliciting, modeling, aggregating, and optimizing human causal knowledge contributions.
Authors:Julia Kieserman, Cat Mai, Sara Lignell, Lucy Qin, Athanasios Andreou, Damon McCoy, Rosanna Bellini
Abstract:
AI chatbots, built using large language models, are increasingly integrated into society and mimic the patterns of human text exchanges. While previous research has raised concerns that humans may form romantic attachment to chatbots, the range of AI-mediated interactions that people wish to create for themselves or others with chatbots remains poorly understood, particularly given the fast evolving landscape of chatbots. We provide an empirical study of Character.AI (cAI), a popular chatbot platform that enables users to design and share character-based bots, and synthesize this with an analysis of Reddit posts from cAI users. Contrary to popular narratives, we identify that users want to: (1) engage in intimate role-play with young adult, masculine-presenting characters that place users in a position of inferior power in well-defined scenarios and (2) immerse themselves in boundless, fantasy settings. We further find that users problematize both the excessive and insufficient sexualized content in such interactions which warrants novel digital-safety features.
Authors:Helinyi Peng, Akihito Taya, Yuuki Nishiyama, Kaoru Sezaki
Abstract:
Early defibrillation significantly improves survival rates in cases of out-of-hospital cardiac arrest. However, limited public awareness of Automated External Defibrillator (AED) locations constrains their effective use. Existing solutions, such as static 2D maps, often fall short in urgent or complex real-world scenarios. To address this challenge, we developed AEDHunter, a gamified, location-based mobile application designed to transform AED retrieval into an engaging and repeatable practice experience. Leveraging smartphone sensors to analyze participants' movement and learning patterns, and using low-cost Bluetooth tags to verify arrivals at AED locations, AEDHunter guides users through multiple sessions of AED discovery. In a real-world evaluation study, participants significantly reduced their AED retrieval times after repeated practice sessions and reported increased confidence in locating AEDs. Additionally, we employ a two-state activity detector to identify ``exploratory pauses'', which are then used as a behavioral learning signal to quantify hesitation and its progressive reduction through practice. Our findings suggest that gamified applications like AEDHunter can improve AED retrieval performance through repeated, in-situ training and enhance self-reported preparedness, offering design insights for technology-supported learning and public safety applications.
Authors:Achmad Ardani Prasha, Clavino Ourizqi Rachmadi, Sabrina Laila Mutiara, Hilman Syachr Ramadhan, Chareyl Reinalyta Borneo, Saruni Dwiasnati
Abstract:
Adolescent pornography addiction requires early detection based on objective neurobiological biomarkers because self-report is prone to subjective bias due to social stigma. Conventional machine learning has not been able to model dynamic functional connectivity of the brain that fluctuates temporally during addictive stimulus exposure. This study proposes a state-of-the-art Dynamic Spatio-Temporal Graph Neural Network (DST-GNN) that integrates Phase Lag Index (PLI)-based Graph Attention Network (GAT) for spatial modeling and Bidirectional Gated Recurrent Unit (BiGRU) for temporal dynamics. The dataset consists of 14 adolescents (7 addicted, 7 healthy) with 19-channel EEG across 9 experimental conditions. Leave-One-Subject-Out Cross Validation (LOSO-CV) evaluation shows F1-Score of 71.00%$\pm$12.10% and recall of 85.71%, a 104% improvement compared to baseline. Ablation study confirms temporal contribution of 21% and PLI graph construction of 57%. Frontal-central regions (Fz, Cz, C3, C4) are identified as dominant biomarkers with Beta contribution of 58.9% and Hjorth of 31.2%, while Cz-T7 connectivity is consistent as a trait-level biomarker for objective screening.
Authors:Mohammad Amin Samadi, Nia Nixon
Abstract:
Collaborative problem solving and learning are shaped by who or what is on the team. As large language models (LLMs) increasingly function as collaborators rather than tools, a key question is whether AI teammates can be aligned to express personality in predictable ways that matter for interaction and learning. We investigate AI personality alignment through a three-lens evaluation framework spanning self-perception (standardized self-report), behavioral expression (team dialogue), and reflective expression (memory construction). We first administered the Big Five Inventory (BFI-44) to LLM-based teammates across four providers (GPT-4o, Claude-3.7 Sonnet, Gemini-2.5 Pro, Grok-3), 32 high/low trait configurations, and multiple prompting strategies. LLMs produced sharply differentiated Big Five profiles, but prompt semantic richness added little beyond simple trait assignment, while provider differences and baseline "default" personalities were substantial. Role framing also mattered: several models refused the assessment without context, yet complied when framed as a collaborative teammate. We then simulated AI participation in authentic team transcripts using high-trait personas and analyzed both generated utterances and structured long-term memories with LIWC-22. Personality signals in conversation were generally subtle and most detectable for Extraversion, whereas memory representations amplified trait-specific signals, especially for Neuroticism, Conscientiousness, and Agreeableness; Openness remained difficult to elicit robustly. Together, results suggest that AI personality is measurable but multi-layered and context-dependent, and that evaluating personality-aligned AI teammates requires attention to memory and system-level design, not conversation-only behavior.
Authors:Hasan Amin, Ming Yin, Rajiv Khanna
Abstract:
In human-AI decision making, designing AI that complements human expertise has been a natural strategy to enhance human-AI collaboration, yet it often comes at the cost of decreased AI performance in areas of human strengths. This can inadvertently erode human trust and cause them to ignore AI advice precisely when it is most needed. Conversely, an aligned AI fosters trust yet risks reinforcing suboptimal human behavior and lowering human-AI team performance. In this paper, we start by identifying this fundamental tension between performance-boosting (i.e., complementarity) and trust-building (i.e., alignment) as an inherent limitation of the traditional approach for training a single AI model to assist human decision making. To overcome this, we introduce a novel human-centered adaptive AI ensemble that strategically toggles between two specialist AI models - the aligned model and the complementary model - based on contextual cues, using an elegantly simple yet provably near-optimal Rational Routing Shortcut mechanism. Comprehensive theoretical analyses elucidate why the adaptive AI ensemble is effective and when it yields maximum benefits. Moreover, experiments on both simulated and real-world data show that when humans are assisted by the adaptive AI ensemble in decision making, they can achieve significantly higher performance than when they are assisted by single AI models that are trained to either optimize for their independent performance or even the human-AI team performance.
Authors:Supriya Khadka, Dhiman Goswami, Sanchari Das
Abstract:
Digital identity verification often forces a privacy trade-off, where users must disclose sensitive personal data to prove simple eligibility criteria. As blockchain applications integrate with regulated environments, this over-disclosure creates significant risks of data breaches and surveillance. This work proposes a general Selective Disclosure Framework built on Ethereum, designed to decouple attribute verification from identity revelation. By utilizing client-side zk-SNARKs, the framework enables users to prove specific eligibility predicates without revealing underlying identity documents. We present a case study, ZK-Compliance, which implements a functional Grant, Verify, Revoke lifecycle for age verification. Preliminary results indicate that strict compliance requirements can be satisfied with negligible client-side latency (< 200 ms) while preserving the pseudonymous nature of public blockchains.
Authors:Michael Tompkins, Nihaarika Agarwal, Ananta Soneji, Robert Wasinger, Connor Nelson, Kevin Leach, Rakibul Hasan, Adam Doupé, Daniel Votipka, Yan Shoshitaishvili, Jaron Mink
Abstract:
To meet the ever-increasing demands of the cybersecurity workforce, AI tutors have been proposed for personalized, scalable education. But, while AI tutors have shown promise in introductory programming courses, no work has evaluated their use in hands-on exploration and exploitation of systems (e.g., ``capture-the-flag'') commonly used to teach cybersecurity. Thus, despite growing interest and need, no work has evaluated how students use AI tutors or whether they benefit from their presence in real, large-scale cybersecurity courses. To answer this, we conducted a semester-long observational study on the use of an embedded AI tutor with 309 students in an upper-division introductory cybersecurity course. By analyzing 142,526 student queries sent to the AI tutor across 396 cybersecurity challenges spanning 9 core cybersecurity topics and an accompanying set of post-semester surveys, we find (1) what queries and conversational strategies students use with AI tutors, (2) how these strategies correlate with challenge completion, and (3) students' perceptions of AI tutors in cybersecurity education. In particular, we identify three broad AI tutor conversational styles among users: Short (bounded, few-turn exchanges), Reactive (repeatedly submitting code and errors), and Proactive (driving problem-solving through targeted inquiry). We also find that the use of these styles significantly predicts challenge completion, and that this effect increases as materials become more advanced. Furthermore, students valued the tutor's availability but reported that it became less useful for harder material. Based on this, we provide suggestions for security educators and developers on practical AI tutor use.
Authors:Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim
Abstract:
Computer-use agents operate over long horizons under noisy perception, multi-window contexts, evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.
Authors:Xinyi Lu, Kexin Phyllis Ju, Mitchell Dudley, Larissa Sano, Xu Wang
Abstract:
Despite growing interest in using LLMs to generate feedback on students' writing, little is known about how students respond to AI-mediated versus human-provided feedback. We address this gap through a randomized controlled trial in a large introductory economics course (N=354), where we introduce and deploy FeedbackWriter - a system that generates AI suggestions to teaching assistants (TAs) while they provide feedback on students' knowledge-intensive essays. TAs have the full capacity to adopt, edit, or dismiss the suggestions. Students were randomly assigned to receive either handwritten feedback from TAs (baseline) or AI-mediated feedback where TAs received suggestions from FeedbackWriter. Students revise their drafts based on the feedback, which is further graded. In total, 1,366 essays were graded using the system. We found that students receiving AI-mediated feedback produced significantly higher-quality revisions, with gains increasing as TAs adopted more AI suggestions. TAs found the AI suggestions useful for spotting gaps and clarifying rubrics.
Authors:Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell
Abstract:
Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
Authors:Johannes Kirmayr, Raphael Wennmacher, Khanh Huynh, Lukas Stappen, Elisabeth André, Florian Alt
Abstract:
Agentic AI assistants that autonomously perform multi-step tasks raise open questions for user experience: how should such systems communicate progress and reasoning during extended operations, especially in attention-critical contexts such as driving? We investigate feedback timing and verbosity from agentic LLM-based in-car assistants through a controlled, mixed-methods study (N=45) comparing planned steps and intermediate results feedback against silent operation with final-only response. Using a dual-task paradigm with an in-car voice assistant, we found that intermediate feedback significantly improved perceived speed, trust, and user experience while reducing task load - effects that held across varying task complexities and interaction contexts. Interviews further revealed user preferences for an adaptive approach: high initial transparency to establish trust, followed by progressively reducing verbosity as systems prove reliable, with adjustments based on task stakes and situational context. We translate our empirical findings into design implications for feedback timing and verbosity in agentic assistants, balancing transparency and efficiency.
Authors:Ankit Bhattarai, Hannah Selder, Florian Fischer, Arthur Fleig, Per Ola Kristensson
Abstract:
Reinforcement learning (RL)-based biomechanical simulations have the potential to revolutionise HCI research and interaction design, but currently lack usability and interpretability. Using the Human Action Cycle as a design lens, we identify key limitations of biomechanical RL frameworks and develop MyoInteract, a novel framework for fast prototyping of biomechanical HCI tasks. MyoInteract allows designers to setup tasks, user models, and training parameters from an easy-to-use GUI within minutes. It trains and evaluates muscle-actuated simulated users within minutes, reducing training times by up to 98%. A workshop study with 12 interaction designers revealed that MyoInteract allowed novices in biomechanical RL to successfully setup, train, and assess goal-directed user movements within a single session. By transforming biomechanical RL from a days-long expert task into an accessible hour-long workflow, this work significantly lowers barriers to entry and accelerates iteration cycles in HCI biomechanics research.
Authors:Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu
Abstract:
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional increases in aggressive humor.
Authors:Kengo Tanaka, Xiyue Wang, Hironobu Takagi, Yoichi Ochiai, Chieko Asakawa
Abstract:
Visual impairments create barriers to learning physical activities, since conventional training methods rely on visual demonstrations or often inadequate verbal descriptions. This research explores 3D-printed human body models to enhance movement comprehension for blind individuals. Through a participatory design approach in collaboration with a blind designer, we developed detailed 3D models representing various body movements and incorporated tactile reference elements to enhance spatial understanding. We conducted two user studies with 10 blind participants across different activities: static yoga poses and sequential calisthenic movements. The results demonstrated that 3D models significantly improved understanding speed, reduced questions for clarification, and enhanced movement accuracy compared to conventional teaching methods. Participants consistently rated 3D models higher for ease of understanding, effectiveness, and motivation.
Authors:Ricardo E. Gonzalez Penuela, Crescentia Jung, Sharon Y Lin, Ruiying Hu, Shiri Azenkot
Abstract:
Multimodal large language models (MLLMs) are changing how Blind and Low Vision (BLV) people access visual information. Unlike traditional visual interpretation tools that only provide descriptions, MLLM-enabled applications offer conversational assistance, where users can ask questions to obtain goal-relevant details. However, evidence about their performance in the real-world and implications for BLV people's daily lives remains limited. To address this, we conducted a two-week diary study, where we captured 20 BLV participants' use of an MLLM-enabled visual interpretation application. Although participants rated the visual interpretations of the application as "trustworthy" (mean=3.76 out of 5, max=extremely trustworthy) and "somewhat satisfying" (mean=4.13 out of 5, max=very satisfying), the AI often produced incorrect answers (22.2%) or abstained (10.8%) from responding to users' requests. Our findings show that while MLLMs can improve visual interpretations' descriptive accuracy, supporting everyday use also depends on the "visual assistant" skill: behaviors for providing goal-directed, reliable assistance. We conclude by proposing the "visual assistant" skill and guidelines to help MLLM-enabled visual interpretation applications better support BLV people's access to visual information.
Authors:Wei Wei, Foroozan Daneshzand, Zezhong Wang, Erica Mattson, Charles Perin, Sheelagh Carpendale
Abstract:
Co-design is an increasingly popular approach in HCI and visualization, yet there is little guidance on how to effectively apply this method in visualization contexts. In this paper, we visually present our experience of a two-and-a-half-year co-design project with the local arts community. Focusing on facilitating community exploration and sense-making around arts funding distribution, the project involved a series of co-design sessions between visualization researchers and members of the arts community. Through these iterative sessions, we built shared understanding and developed visualization prototypes tailored to community needs. However, the practice is far from complete, and we found ourselves continually returning to the "fuzzy front end" of the co-design process. We share this ongoing story through comic-style visuals and reflect on three fuzzy front ends that we encountered during the project. By sharing these experiences with the visualization community, we hope to offer insights that others can draw on in their own community-engaged co-design work.
Authors:Supriya Khadka, Sanchari Das
Abstract:
Extended Reality (XR) combines dense sensing, real-time rendering, and close-range interaction, making its use in early childhood education both promising and high risk. To investigate this, we conduct a Systematization of Knowledge (SoK) of 111 peer-reviewed studies with children aged 3-8, quantifying how technical, pedagogical, health, privacy, and equity challenges arise in practice. We found that AR dominates the landscape (73%), focusing primarily on tablets or phones, while VR remains uncommon and typically relies on head mounted displays (HMDs). We integrate these quantitative patterns into a joint risk and attention matrix and an Augmented Human Development (AHD) model that link XR pipeline properties to cognitive load, sensory conflict, and access inequity. Finally, implementing a seven dimension coding scheme on a 0 - 2 scale, we obtain mean scholarly attention scores of 1.56 for pedagogy, 1.04 for privacy (primarily procedural consent), 0.96 for technical reliability, 0.92 for accessibility in low resource contexts, 0.81 for medical and health issues, 0.52 for accessibility for disabilities, and 0.14 for data security practices. This indicates that pedagogy receives the most systematic scrutiny, while data access practices is largely overlooked. We conclude by offering a roadmap for Child-Centered XR that helps HCI researchers and educators move beyond novelty to design systems that are developmentally aligned, secure by default, and accessible to diverse learners.
Authors:Shreya Chappidi, Jatinder Singh, Andra V. Krauze
Abstract:
LLMs are increasingly supporting decision-making across high-stakes domains, requiring critical reflection on the socio-technical factors that shape how humans and LLMs are assigned roles and interact during human-in-the-loop decision-making. This paper introduces the concept of human-LLM archetypes -- defined as re-curring socio-technical interaction patterns that structure the roles of humans and LLMs in collaborative decision-making. We describe 17 human-LLM archetypes derived from a scoping literature review and thematic analysis of 113 LLM-supported decision-making papers. Then, we evaluate these diverse archetypes across real-world clinical diagnostic cases to examine the potential effects of adopting distinct human-LLM archetypes on LLM outputs and decision outcomes. Finally, we present relevant tradeoffs and design choices across human-LLM archetypes, including decision control, social hierarchies, cognitive forcing strategies, and information requirements. Through our analysis, we show that selection of human-LLM interaction archetype can influence LLM outputs and decisions, bringing important risks and considerations for the designers of human-AI decision-making systems
Authors:Cedric Faas, Richard Uth, Sarah Sterz, Markus Langer, Anna Maria Feit
Abstract:
AI-based systems can increasingly perform work tasks autonomously. In safety-critical tasks, human oversight of these systems is required to mitigate risks and to ensure responsibility in case something goes wrong. Since people often struggle to stay focused and perform good oversight, intelligent support systems are used to assist them, giving decision recommendations, alerting users, or restricting them from dangerous actions. However, in cases where recommendations are wrong, decision support might undermine the very reason why human oversight was employed -- genuine moral responsibility. The goal of our study was to investigate how a decision support system that restricted available interventions would affect overseer's perceived moral responsibility, in particular in cases where the support errs. In a simulated oversight experiment, participants (\textit{N}=274) monitored an autonomous drone that faced ten critical situations, choosing from six possible actions to resolve each situation. An AI system constrained participants' choices to either six, four, two, or only one option (between-subject study). Results showed that participants, who were restricted to choosing from a single action, felt less morally responsible if a crash occurred. At the same time, participants' judgments about the responsibility of other stakeholders (the AI; the developer of the AI) did not change between conditions. Our findings provide important insights for user interface design and oversight architectures: they should prevent users from attributing moral agency to AI, help them understand how moral responsibility is distributed, and, when oversight aims to prevent ethically undesirable outcomes, be designed to support the epistemic and causal conditions required for moral responsibility.
Authors:Harry Yizhou Tian, Hasan Amin, Ming Yin
Abstract:
Despite the growing prevalence of human-AI decision making, the human-AI team's decision performance often remains suboptimal, partially due to insufficient examination of humans' own reasoning. In this paper, we explore designing AI systems that directly analyze humans' decision rationales and encourage critical reflection of their own decisions. We introduce the AI-Assisted Critical Thinking (AACT) framework, which leverages a domain-specific AI model's counterfactual analysis of human decision to help decision-makers identify potential flaws in their decision argument and support the correction of them. Through a case study on house price prediction, we find that AACT outperforms traditional AI-based decision-support in reducing over-reliance on AI, though also triggering higher cognitive load. Subgroup analysis reveals AACT can be particularly beneficial for some decision-makers such as those very familiar with AI technologies. We conclude by discussing the practical implications of our findings, use cases and design choices of AACT, and considerations for using AI to facilitate critical thinking.
Authors:Sean Memery, Kartic Subr
Abstract:
Artificial intelligence (AI) agents embedded in environments with physics-based interaction face many challenges including reasoning, planning, summarization, and question answering. This problem is exacerbated when a human user wishes to either guide or interact with the agent in natural language. Although the use of Language Models (LMs) is the default choice, as an AI tool, they struggle with tasks involving physics. The LM's capability for physical reasoning is learned from observational data, rather than being grounded in simulation. A common approach is to include simulation traces as context, but this suffers from poor scalability as simulation traces contain larger volumes of fine-grained numerical and semantic data. In this paper, we propose a natural language guided method to discover coarse-grained patterns (e.g., 'rigid-body collision', 'stable support', etc.) from detailed simulation logs. Specifically, we synthesize programs that operate on simulation logs and map them to a series of high level activated patterns. We show, through two physics benchmarks, that this annotated representation of the simulation log is more amenable to natural language reasoning about physical systems. We demonstrate how this method enables LMs to generate effective reward programs from goals specified in natural language, which may be used within the context of planning or supervised learning.
Authors:Zihao Zhu, Junnan Yu, Yuhan Luo
Abstract:
For university students transitioning to an independent and flexible lifestyle, having ADHD poses multiple challenges to their academic task management, which are closely tied to their metacognitive struggles--difficulties in awareness and regulation of one's own thinking processes. The recently surged Generative AI shows promise to mitigate these gaps with its advanced information understanding and generation capabilities. As an exploratory step, we conducted co-design sessions with 20 university students diagnosed with ADHD, followed by interviews with five experts specialized in ADHD intervention. Adopting a metacognitive lens, we examined participants' ideas on GenAI-based task management support and experts' assessments, which led to three design directions: providing cognitive scaffolding to enhance task and self-awareness, promoting reflective task execution for building metacognitive abilities, and facilitating emotional regulation to sustain task engagement. Drawing on these findings, we discuss opportunities for GenAI to support the metacognitive needs of neurodivergent populations, offering future directions for both research and practice.
Authors:Ananya Gubbi Mohanbabu, Rosiana Natalie, Brandon Kim, Anhong Guo, Amy Pavel
Abstract:
Computer Use Agents (CUAs) operate interfaces by pointing, clicking, and typing -- mirroring interactions of sighted users (SUs) who can thus monitor CUAs and share control. CUAs do not reflect interactions by blind and low-vision users (BLVUs) who use assistive technology (AT). BLVUs thus cannot easily collaborate with CUAs. To characterize the accessibility gap of CUAs, we present A11y-CUA, a dataset of BLVUs and SUs performing 60 everyday tasks with 40.4 hours and 158,325 events. Our dataset analysis reveals that our collected interaction traces quantitatively confirm distinct interaction styles between SU and BLVU groups (mouse- vs. keyboard-dominant) and demonstrate interaction diversity within each group (sequential vs. shortcut navigation for BLVUs). We then compare collected traces to state-of-the-art CUAs under default and AT conditions (keyboard-only, magnifier). The default CUA executed 78.3% of tasks successfully. But with the AT conditions, CUA's performance dropped to 41.67% and 28.3% with keyboard-only and magnifier conditions respectively, and did not reflect nuances of real AT use. With our open A11y-CUA dataset, we aim to promote collaborative and accessible CUAs for everyone.
Authors:Frederic Gmeiner, John Thompson, George Fitzmaurice, Justin Matejka
Abstract:
Think-Aloud Computing, a method for capturing users' verbalized thoughts during software tasks, allows eliciting rich contextual insights into evolving intentions, struggles, and decision-making processes of users in real-time. However, existing approaches face practical challenges: users often lack awareness of what is captured by the system, are not effectively encouraged to speak, and miss or are interrupted by system feedback. Additionally, thinking aloud should feel worthwhile for users due to the gained contextual AI assistance. To better support and harness Think-Aloud Computing, we introduce PointAloud, a suite of novel AI-driven pointer-centric interactions for in-the-moment verbalization encouragement, low-distraction system feedback, and contextually rich work process documentation alongside proactive AI assistance. Our user study with 12 participants provides insights into the value of pointer-centric think-aloud computing for work process documentation and human-AI co-creation. We conclude by discussing the broader implications of our findings and design considerations for pointer-centric and AI-supported Think-Aloud Computing workflows.
Authors:Arran Zeyu Wang, David Borland, Estella Calcaterra, David Gotz
Abstract:
Understanding how individuals interpret charts is a crucial concern for visual data communication. This imperative has motivated a number of studies, including past work demonstrating that causal priors -- a priori beliefs about causal relationships between concepts -- can have significant influences on the perceived strength of variable relationships inferred from visualizations. This paper builds on these previous results, demonstrating that causal priors can also influence the types of patterns that people perceive as the most salient within ambiguous scatterplots that have roughly equal evidence for trend and cluster patterns. Using a mixed-design approach that combines a large-scale online experiment for breadth of findings with an in-person think-aloud study for analytical depth, we investigated how users' interpretations are influenced by the interplay between causal priors and the visualized data patterns. Our analysis suggests two archetypal reasoning behaviors through which people often make their observations: contextualization, in which users accept a visual pattern that aligns with causal priors and use their existing knowledge to enrich interpretation, and rationalization, in which users encounter a pattern that conflicts with causal priors and attempt to explain away the discrepancy by invoking external factors, such as positing confounding variables or data selection bias. These findings provide initial evidence highlighting the critical role of causal priors in shaping high-level visualization comprehension, and introduce a vocabulary for describing how users reason about data that either confirms or challenges prior beliefs of causality.
Authors:Manusha Karunathilaka, Litian Lei, Yiming Gao, Yong Wang, Jiannan Li
Abstract:
In the digital age, readers value quantitative journalism that is clear, concise, analytical, and human-centred. To understand complex topics, they often piece together scattered facts from multiple articles. Visual storytelling can transform fragmented information into clear, engaging narratives, yet its use with unstructured online articles remains largely unexplored. To fill this gap, we present Compendia, an automated system that analyzes online articles in response to a user's query and generates a coherent data story tailored to the user's informational needs. Compendia addresses key challenges of storytelling from unstructured text through two modules covering: Online Article Retrieval, which gathers relevant articles; Data Fact Extraction, which identifies, validates, and refines quantitative facts; Fact Organization, which clusters and merges related facts into coherent thematic groups; and Visual Storytelling, which transforms the organized facts into narratives with visualizations in an interactive scrollytelling interface. We evaluated Compendia through a quantitative analysis, confirming the accuracy in fact extraction and organization, and through two user studies with 16 participants, demonstrating its usability, effectiveness, and ability to produce engaging visual stories for open-ended queries.
Authors:Ying Liu, Si Zuo, Chao Yang, Yuqing Song, Dariush Salami, Stephan Sigg
Abstract:
Millimeter-Wave (mmWave) radar enables camera-free gesture recognition for Internet of Things (IoT) interfaces, with robustness to lighting variations and partial occlusions. However, recent studies reveal that its data can inadvertently encode biometric signatures, raising critical privacy challenges for IoT applications. In particular, we demonstrate that mmWave radar point cloud data can leak identity-related information in the absence of explicit identity labels. To address this risk, we propose {ImmCOGNITO}, a graph-based autoencoder that transforms radar gesture point clouds to preserve gesture-relevant structure while suppressing identity cues. The encoder first constructs a directed graph for each sequence using Temporal Graph KNN. Edges are defined to capture inter-frame temporal dynamics. A message-passing neural network with multi-head self-attention then aggregates local and global spatio-temporal context, and the global max-pooled feature is concatenated with the original features. The decoder then reconstructs a minimally perturbed point cloud that retains gesture discriminative attributes while achieving de-identification. Training jointly optimizes reconstruction, gesture-preservation, and de-identification objectives. Evaluations on two public datasets, PantoRad and MHomeGes, show that ImmCOGNITO substantially reduces identification accuracy while maintaining high gesture recognition performance.
Authors:Guangping Liu, Nicholas Hawkins, Billy Madden, Tipu Sultan, Madi Babaiasl
Abstract:
People with lower and upper body disabilities can benefit from wheelchairs and robotic arms to improve mobility and independence. Prior assistive interfaces, such as touchscreens and voice-driven predefined commands, often remain unintuitive and struggle to capture complex user intent. We propose a natural, dialogue based human robot interaction protocol that simulates an intelligent agent capable of communicating with users to understand intent and execute assistive actions. In a pilot study, five participants completed five assistive tasks (cleaning, drinking, feeding, drawer opening, and door opening) through dialogue-based interaction with a wheelchair and robotic arm. As a baseline, participants were required to open a door using the manual control (a wheelchair joystick and a game controller for the arm) and complete a questionnaire to gather their feedback. By analyzing the post-study questionnaires, we found that most participants enjoyed the dialogue-based interaction and assistive robot autonomy.
Authors:Andreea Tulbure, Carmen Scheidemann, Elias Steiner, Marco Hutter
Abstract:
Task-oriented handovers (TOH) are fundamental to effective human-robot collaboration, requiring robots to present objects in a way that supports the human's intended post-handover use. Existing approaches are typically based on object- or task-specific affordances, but their ability to generalize to novel scenarios is limited. To address this gap, we present AFT-Handover, a framework that integrates large language model (LLM)-driven affordance reasoning with efficient texture-based affordance transfer to achieve zero-shot, generalizable TOH. Given a novel object-task pair, the method retrieves a proxy exemplar from a database, establishes part-level correspondences via LLM reasoning, and texturizes affordances for feature-based point cloud transfer. We evaluate AFT-Handover across diverse task-object pairs, showing improved handover success rates and stronger generalization compared to baselines. In a comparative user study, our framework is significantly preferred over the current state-of-the-art, effectively reducing human regrasping before tool use. Finally, we demonstrate TOH on legged manipulators, highlighting the potential of our framework for real-world robot-human handovers.
Authors:Pavithren V S Pakianathan, Rania Islambouli, Diogo Branco, Albrecht Schmidt, Tiago Guerreiro, Jan David Smeddinck
Abstract:
Individuals are increasingly generating substantial personal health and lifestyle data, e.g. through wearables and smartphones. While such data could transform preventative care, its integration into clinical practice is hindered by its scale, heterogeneity and the time pressure and data literacy of healthcare professionals (HCPs). We explore how large language models (LLMs) can support sensemaking of patient-generated health data (PGHD) with automated summaries and natural language data exploration. Using cardiovascular disease (CVD) risk reduction as a use case, 16 HCPs reviewed multimodal PGHD in a mixed-methods study with a prototype that integrated common charts, LLM-generated summaries, and a conversational interface. Findings show that AI summaries provided quick overviews that anchored exploration, while conversational interaction supported flexible analysis and bridged data-literacy gaps. However, HCPs raised concerns about transparency, privacy, and overreliance. We contribute empirical insights and sociotechnical design implications for integrating AI-driven summarization and conversation into clinical workflows to support PGHD sensemaking.
Authors:Michelle L. Ding, Harini Suresh, Suresh Venkatasubramanian
Abstract:
The last decade has witnessed a rapid advancement of generative AI technology that significantly scaled the accessibility of AI-generated non-consensual intimate images (AIG-NCII), a form of image-based sexual abuse that disproportionately harms women and girls. There is a patchwork of commendable efforts across industry, policy, academia, and civil society to address AIG-NCII. However, these efforts lack a shared, consistent mental model that situates the technologies they target within the context of a large, interconnected, and ever-evolving technological ecosystem. As a result, interventions remain siloed and are difficult to evaluate and compare, leading to a reactive cycle of whack-a-mole. We contribute the first comprehensive AIG-NCII technological ecosystem that maps and taxonomizes 11 categories of technologies facilitating the creation, distribution, proliferation and discovery, infrastructural support, and monetization of AIG-NCII. First, we build and visualize the ecosystem through a synthesis of over a hundred primary sources from researchers, journalists, advocates, policymakers, and technologists. Next, we demonstrate how stakeholders can use the ecosystem as a tool to 1) understand new incidents of harm via a case study of Grok and 2) evaluate existing interventions via three more case studies. We conclude with three actionable recommendations, namely that stakeholders should 1) use the ecosystem to map out state, federal, and international laws to produce a clearer policy landscape, 2) collectively develop a database that dynamically tracks the 11 technologies in the ecosystem to better evaluate interventions, and 3) adopt a relational approach to researching AIG-NCII to better understand how the ecosystem technologies interact.
Authors:Crescentia Jung, Kexin Cheng, Sharon Heung, Malte F. Jung, Shiri Azenkot
Abstract:
Virtual collaboration has transformed how people in mixed-ability teams, composed of disabled and non-disabled people, work together by offering greater flexibility. In these settings, accessibility practices, such as accommodations and inclusive norms, are essential for providing access to disabled people. However, we do not yet know how these practices shape broader facets of teamwork, such as productivity, participation, and camaraderie. To address this gap, we interviewed 18 participants (12 disabled, 6 non-disabled) who are part of mixed-ability teams. We found that beyond providing access, accessibility practices shaped how all participants coordinated tasks, sustained rapport, and negotiated responsibilities. Accessibility practices also introduced camaraderie challenges, such as balancing empathy and accountability. Non-disabled participants described allyship as a learning process and skill shaped by their disabled team members and team culture. Based on our findings, we present recommendations for team practices and design opportunities for virtual collaboration tools that reframe accessibility practices as a foundation for strong teamwork.
Authors:Isaac Sheidlower, Jindan Huang, James Staley, Bingyu Wu, Qicong Chen, Reuben Aronson, Elaine Short
Abstract:
Robot Foundation Models (RFMs) represent a promising approach to developing general-purpose home robots. Given the broad capabilities of RFMs, users will inevitably ask an RFM-based robot to perform tasks that the RFM was not trained or evaluated on. In these cases, it is crucial that users understand the risks associated with attempting novel tasks due to the relatively high cost of failure. Furthermore, an informed user who understands an RFM's capabilities will know what situations and tasks the robot can handle. In this paper, we study how non-roboticists interpret performance information from RFM evaluations. These evaluations typically report task success rate (TSR) as the primary performance metric. While TSR is intuitive to experts, it is necessary to validate whether novices also use this information as intended. Toward this end, we conducted a study in which users saw real evaluation data, including TSR, failure case descriptions, and videos from multiple published RFM research projects. The results highlight that non-experts not only use TSR in a manner consistent with expert expectations but also highly value other information types, such as failure cases that are not often reported in RFM evaluations. Furthermore, we find that users want access to both real data from previous evaluations of the RFM and estimates from the robot about how well it will do on a novel task.
Authors:Michael Küttner, Valeria Zitz, Supraja Ramesh, Michael Beigl, Tobias Röddiger
Abstract:
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of physiological sounds caused by the occlusion effect; however, existing approaches often fail under real-world noise or rely on computationally expensive models. We present EarResp-ANS, the first system enabling fully on-device, real-time RR estimation on commercial earphones. The system employs LMS-based adaptive noise suppression (ANS) to attenuate ambient noise while preserving respiration-related acoustic components, without requiring neural networks or audio streaming, thereby explicitly addressing the energy and privacy constraints of wearable devices. We evaluate EarResp-ANS in a study with 18 participants under realistic acoustic conditions, including music, cafeteria noise, and white noise up to 80 dB SPL. EarResp-ANS achieves robust performance with a global MAE of 0.84 CPM , reduced to 0.47 CPM via automatic outlier rejection, while operating with less than 2% processor load directly on the earphone.
Authors:Ned Cooper, Jose A. Guridi, Angel Hsing-Chi Hwang, Beth Kolko, Beth McGinty, Qian Yang
Abstract:
Millions of people now use non-clinical Large Language Model (LLM) tools like ChatGPT for mental well-being support. This paper investigates what it means to design such tools responsibly, and how to operationalize that responsibility in their design and evaluation. By interviewing experts and analyzing related regulations, we found that designing an LLM tool responsibly involves: (1) Articulating the specific benefits it guarantees and for whom. Does it guarantee specific, proven relief, like an over-the-counter drug, or offer minimal guarantees, like a nutritional supplement? (2) Specifying the LLM tool's "active ingredients" for improving well-being and whether it guarantees their effective delivery (like a primary care provider) or not (like a yoga instructor). These specifications outline an LLM tool's pertinent risks, appropriate evaluation metrics, and the respective responsibilities of LLM developers, tool designers, and users. These analogies - LLM tools as supplements, drugs, yoga instructors, and primary care providers - can scaffold further conversations about their responsible design.
Authors:Lana Do, Shasta Ihorn, Charity Pitcher-Cooper, Juvenal Francisco Barajas, Gio Jung, Xuan Duy Anh Nguyen, Sanjay Mirani, Ilmi Yoon
Abstract:
Audio description (AD) makes video content accessible to blind and low-vision (BLV) audiences, but producing high-quality descriptions is resource-intensive. Automated AD offers scalability, and prior studies show human-in-the-loop editing and user queries effectively improve narration. We introduce ADx3, a novel framework integrating these three modules: GenAD, upgrading baseline description generation with modern vision-language models (VLMs) guided by accessibility-informed prompting; RefineAD, supporting BLV and sighted users to view and edit drafts through an inclusive interface; and AdaptAD, enabling on-demand user queries. We evaluated GenAD in a study where seven accessibility specialists reviewed VLM-generated descriptions using professional guidelines. Findings show that with tailored prompting, VLMs produce good descriptions meeting basic standards, but excellent descriptions require human edits (RefineAD) and interaction (AdaptAD). ADx3 demonstrates collaborative workflows for accessible content creation, where components reinforce one another and enable continuous improvement: edits guide future baselines and user queries reveal gaps in AI-generated and human-authored descriptions.
Authors:Ezequiel Lopez-Lopez, Christoph M. Abels, Philipp Lorenz-Spreen, Stephan Lewandowsky, Stefan M. Herzog
Abstract:
People navigate complex environments using cues, heuristics, and other strategies, which are often adaptive in stable settings. However, as AI increasingly permeates society's information environments, those become more adaptive and evolving: LLM-based chatbots participate in extended interaction, maintain conversational histories, mirror social cues, and can hypercustomize responses, thereby shaping not only what information is accessed but how questions are framed, how evidence is interpreted, and when action feels warranted. Here we propose a framework for sustained human-AI interaction that rests on invariant features of human cognition and human--AI interaction and centers on three interlinked phenomena: entanglement between users and AI systems, the emergence of cognitive and behavioral drift over repeated interactions, and the role of metacognition in the awareness and regulation of these dynamics. As conversational agents provide cues (e.g., fluency, coherence, responsiveness) that people treat as informative, subjective confidence and action readiness may increase without corresponding gains in epistemic reliability, making drift difficult to detect and correct. We describe these dynamics across micro-, meso-, and macro-levels. The framework identifies four metacognitive intervention points and psychologically informed interventions that provide metacognitive scaffolding (boosting and self-nudging). Finally, we outline a long-horizon research agenda for scientific foresight.
Authors:Lana Do, Gio Jung, Juvenal Francisco Barajas, Andrew Taylor Scott, Shasta Ihorn, Alexander Mario Blum, Vassilis Athitsos, Ilmi Yoon
Abstract:
Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision audiences are excluded. While crowdsourced platforms and vision-language-models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving questions about what constitutes quality for full-length content and how to assess it at scale. To address these questions, we first developed a multi-dimensional assessment framework for uninterrupted, full-length video, grounded in professional guidelines and refined by accessibility specialists. Second, we integrated this framework into a comprehensive methodological workflow, utilizing Item Response Theory, to assess the proficiency of VLM and human raters against expert-established ground truth. Findings suggest that while VLMs can approximate ground-truth ratings with high alignment, their reasoning was found to be less reliable and actionable than that of human respondents. These insights show the potential of hybrid evaluation systems that leverage VLMs alongside human oversight, offering a path towards scalable AD quality control.
Authors:Lingyu Du, Xucong Zhang, Guohao Lan
Abstract:
Effective eye contact is a cornerstone of successful public speaking. It strengthens the speaker's credibility and fosters audience engagement. Yet, managing effective eye contact is a skill that demands extensive training and practice, often posing a significant challenge for novice speakers. In this paper, we present SpeakAssis, the first real-time, in-situ wearable system designed to actively assist speakers in maintaining effective eye contact during live presentations. Leveraging a head-mounted eye tracker for gaze and scene view capture, SpeakAssis continuously monitors and analyzes the speaker's gaze distribution across audience and non-audience regions. When ineffective eye-contact patterns are detected, such as insufficient eye contact, or neglect of certain audience segments, SpeakAssis provides timely, context-aware audio prompts via an earphone to guide the speaker's gaze behavior. We evaluate SpeakAssis through a user study involving eight speakers and 24 audience members. Quantitative results show that SpeakAssis increases speakers' eye-contact duration by 62.5% on average and promotes a more balanced distribution of visual attention. Additionally, statistical analysis based on audience surveys reveals that improvements in speaker's eye-contact behavior significantly enhance the audience's perceived engagement and interactivity during presentations.
Authors:Dana Feng, Bhada Yun, April Wang
Abstract:
Juniors enter as AI-natives, seniors adapted mid-career. AI is not just changing how engineers code-it is reshaping who holds agency across work and professional growth. We contribute junior-senior accounts on their usage of agentic AI through a three-phase mixed-methods study: ACTA combined with a Delphi process with 5 seniors, an AI-assisted debugging task with 10 juniors, and blind reviews of junior prompt histories by 5 more seniors. We found that agency in software engineering is primarily constrained by organizational policies rather than individual preferences, with experienced developers maintaining control through detailed delegation while novices struggle between over-reliance and cautious avoidance. Seniors leverage pre-AI foundational instincts to steer modern tools and possess valuable perspectives for mentoring juniors in their early AI-encouraged career development. From synthesis of results, we suggest three practices that focus on preserving agency in software engineering for coding, learning, and mentorship, especially as AI grows increasingly autonomous.
Authors:Sandra Loop, Erik Bertram, Sebastian Juhl, Martin Schrepp
Abstract:
In highly competitive software markets, user experience (UX) evaluation is crucial for ensuring software quality and fostering long-term product success. Such UX evaluations typically combine quantitative metrics from standardized questionnaires with qualitative feedback collected through open-ended questions. While open-ended feedback offers valuable insights for improvement and helps explain quantitative results, analyzing large volumes of user comments is challenging and time-consuming. In this paper, we present techniques developed during a long-term UX measurement project at a major software company to efficiently process and interpret extensive volumes of user comments. To provide a high-level overview of the collected comments, we employ a supervised machine learning approach that assigns meaningful, pre-defined topic labels to each comment. Additionally, we demonstrate how generative AI (GenAI) can be leveraged to create concise and informative summaries of user feedback, facilitating effective communication of findings to the organization and especially upper management. Finally, we investigate whether the sentiment expressed in user comments can serve as an indicator for overall product satisfaction. Our results show that sentiment analysis alone does not reliably reflect user satisfaction. Instead, product satisfaction needs to be assessed explicitly in surveys to measure the user's perception of the product.
Authors:Yichun Zhao, Miguel A. Nacenta, Mahadeo A. Sukhai, Sowmya Somanath
Abstract:
Blind and low-vision (BLV) employees in mixed-visual ability teams often encounter information (e.g., PDFs, diagrams) in inaccessible formats. To enable teamwork, teams must transform these representations by modifying or re-creating them into accessible forms. However, these transformations are frequently overlooked, lack infrastructural support, and cause additional labour. To design systems that move beyond one-off accommodations to effective mixed-ability collaboration, we need a deeper understanding of the representations, their transformations and how they occur. We conducted a week-long diary study with follow-up interviews with 23 BLV and sighted professionals from five legal, non-profit, and consulting teams, documenting 36 transformation cases. Our analysis characterizes how teams perform representational transformations for accessibility: how they are triggered proactively or reactively, how they simplify or enhance, and four common patterns in which workers coordinate with each other to address representational incompatibility. Our findings uncover opportunities for designing systems that can better support mixed-visual ability work.
Authors:Michał Patryk Miazga, Hannah Bussmann, Antti Oulasvirta, Patrick Ebel
Abstract:
Touch data from mobile devices are collected at scale but reveal little about the interactions that produce them. While biomechanical simulations can illuminate motor control processes, they have not yet been developed for touch interactions. To close this gap, we propose a novel computational problem: synthesizing plausible motion directly from logs. Our key insight is a reinforcement learning-driven musculoskeletal forward simulation that generates biomechanically plausible motion sequences consistent with events recorded in touch logs. We achieve this by integrating a software emulator into a physics simulator, allowing biomechanical models to manipulate real applications in real-time. Log2Motion produces rich syntheses of user movements from touch logs, including estimates of motion, speed, accuracy, and effort. We assess the plausibility of generated movements by comparing against human data from a motion capture study and prior findings, and demonstrate Log2Motion in a large-scale dataset. Biomechanical motion synthesis provides a new way to understand log data, illuminating the ergonomics and motor control underlying touch interactions.
Authors:Davide Falessi, Silvia Golia, Angela Locoro
Abstract:
Data Visualization Literacy assessments are typically administered via fixed sets of Data Visualization items, despite substantial heterogeneity in how different people interpret the same visualization. This paper presents and evaluates an approach for predicting Human Interpretation Correctness (P-HIC) of data visualizations; i.e., anticipating whether a specific person will interpret a data visualization correctly or not, before exposure to that DV, enabling more personalized assessment and training. We operationalize P-HIC as a binary classification problem using 22 features spanning Human Profile, Human Performance, and Item difficulty (including ExpertDifficulty and RaschDifficulty). We evaluate three machine-learning models (Logistic Regression model, Random Forest, Multi Layer Perceptron) with and without feature selection, using a survey with 1,083 participants who answered 32 Data Visualization items (eight data visualizations per four items), yielding 34,656 item responses. Performance is assessed via a ten-time ten-fold cross-validation in each 32 (item-specific) datasets, using AUC and Cohen's kappa. Logistic Regression model with feature selection is the best-performing approach, reaching a median AUC of 0.72 and a median kappa of 0.32. Feature analyses show RaschDifficulty as the dominant predictor, followed by experts' ratings and prior correctness (PercCorrect), whose relevance increases across sessions. Profile information did not particularly support P-HIC. Our results support the feasibility of anticipating misinterpretations of data visualizations, and motivate the runtime selection of data visualizations items tailored to an audience, thereby improving the efficiency of Data Visualization Literacy assessment and targeted training.
Authors:Huichao Men, Yizhen Hu, Yu Gao, Xiaofeng Mou, Yi Xu, Xinhua Xiao
Abstract:
With the deep integration of artificial intelligence and smart home technologies, the intelligent transformation of traditional household appliances has become an inevitable trend. This paper presents AirAgent--an LLM-driven autonomous agent framework designed for home air systems. Leveraging a voice-based dialogue interface, AirAgent autonomously and personally manages indoor air quality through comprehensive perception, reasoning, and control. The framework innovatively adopts a two-layer cooperative architecture: Memory-Based Tag Extraction and Reasoning-Driven Planning. First, a dynamic memory tag extraction module continuously updates personalized user profiles. Second, a reasoning-planning model integrates real-time environmental sensor data, user states, and domain-specific prior knowledge (e.g., public health guidelines) to generate context-aware decisions. To support both interpretability and execution, we design a semi-streaming output mechanism that uses special tokens to segment the model's output stream in real time, simultaneously producing human-readable Chain-of-Thought explanations and structured, device-executable control commands. The system handles planning across 25 distinct complex dimensions while satisfying more than 20 customized constraints. As a result, AirAgent endows home air systems with proactive perception, service, and orchestration capabilities, enabling seamless, precise, and personalized air management responsive to dynamic indoor and outdoor conditions. Experimental results demonstrate up to 94.9 percent accuracy and more than 20 percent improvement in user experience metrics compared to competing commercial solutions.
Authors:Gennie Mansi, Julia Kim, Mark Riedl
Abstract:
A core assumption of Explainable AI (XAI) is that explanations are useful to users -- that is, users will do something with the explanations. Prior work, however, does not clearly connect the information provided in explanations to user actions to evaluate effectiveness. In this paper, we articulate this connection. We conducted a formative study through 14 interviews with end users in education and medicine. We contribute a catalog of information and associated actions. Our catalog maps 12 categories of information that participants described relying on to take 60 different actions. We show how AI Creators can use the catalog's specificity and breadth to articulate how they expect information in their explanations to lead to user actions and test their assumptions. We use an exemplar XAI system to illustrate this approach. We conclude by discussing how our catalog expands the design space for XAI systems to support actionability.
Authors:Zheng Zhang, Mengjie Yu, Tianyi Wang, Kashyap Todi, Ajoy Savio Fernandes, Yue Liu, Haijun Xia, Tovi Grossman, Tanya Jonker
Abstract:
Smart glasses enhance interactions with the environment by using head-mounted cameras to observe the user's viewpoint, but lack the visual feedback used for common interactions. We introduce Gazeify then Voiceify, a multimodal approach allowing object selection via gaze and voice using displayless smart glasses. Users can select a physical object with their gaze, and the system generates a digital mask and a voice description of the object's semantics. Users can further correct errors through free-form conversation. To demonstrate our approach, we develop an interactive system by integrating advanced object segmentation and detection with a vision-language model. User studies reveal that participants achieve correct gaze selection in 53% of the task trials and use voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy to use.
Authors:Joy Lai, Kelly Beaton, David Black, Alex Mihailidis
Abstract:
Research with dementia caregivers poses persistent methodological and ethical challenges, particularly when interview-based studies are designed without sufficient grounding in lived caregiving realities. Questions framed through clinical or deficit-oriented assumptions risk alienating participants, undermining rapport, and producing shallow or ethically fraught data. While human-computer interaction (HCI) research increasingly adopts participatory approaches in technology design, participation rarely extends to the design of research methods themselves. This paper examines the role of lived-experience advisors as methodological partners in caregiver interview research. We report on a qualitative study in which two advisors with extensive dementia caregiving experience were engaged prior to fieldwork as methodological partners, extending participatory principles beyond technology design into the design of research methods themselves. Drawing on transcripts of advisor consultations and subsequent interviews with ten caregivers and one person living with dementia, we identify two key methodological contributions of advisor involvement. First, advisors enabled anticipatory validity by surfacing caregiving challenges, ethical sensitivities, and interpretive concerns that later appeared in caregiver interviews, allowing the researcher to enter the field with grounded awareness under constrained recruitment and fieldwork conditions. Second, advisors provided cultural, emotional, and systemic context that improved interpretive sensitivity and helped avoid misreadings. We argue that lived experience functions as methodological infrastructure, extending participatory principles into the design and conduct of research itself, and constituting a generalizable methodological pattern for HCI research with caregivers and other vulnerable or marginalized populations.
Authors:Supriya Khadka, Sanchari Das
Abstract:
Extended Reality in early childhood education presents high-risk challenges due to children's rapid developmental changes. While augmented and virtual reality offer immersive pedagogical benefits, they often impose excessive cognitive load or sensory conflict. We introduce the Augmented Human Development (AHD) framework to model these interactions through cognitive, sensory, environmental, and developmental parameters. To ground this framework, we conducted a Systematization of Knowledge (SoK) of 111 peer-reviewed studies involving children aged 3 - 8. Our findings, interpreted through the AHD lens, reveal a critical "risk vs. attention gap," where high-impact safety and security risks remain under-researched compared to short-term pedagogical gains.
Authors:Tian-Yi Zhou, Xuan-Hao Liu, Bao-Liang Lu, Wei-Long Zheng
Abstract:
Reconstructing human dynamic visual perception from electroencephalography (EEG) signals is of great research significance since EEG's non-invasiveness and high temporal resolution. However, EEG-to-video reconstruction remains challenging due to: 1) Single Modality: existing studies solely align EEG signals with the text modality, which ignores other modalities and are prone to suffer from overfitting problems; 2) Data Scarcity: current methods often have difficulty training to converge with limited EEG-video data. To solve the above problems, we propose a novel framework MindCine to achieve high-fidelity video reconstructions on limited data. We employ a multimodal joint learning strategy to incorporate beyond-text modalities in the training stage and leverage a pre-trained large EEG model to relieve the data scarcity issue for decoding semantic information, while a Seq2Seq model with causal attention is specifically designed for decoding perceptual information. Extensive experiments demonstrate that our model outperforms state-of-the-art methods both qualitatively and quantitatively. Additionally, the results underscore the effectiveness of the complementary strengths of different modalities and demonstrate that leveraging a large-scale EEG model can further enhance reconstruction performance by alleviating the challenges associated with limited data.
Authors:Ahana Ghosh, Advait Sarkar, Siân Lindley, Christian Poelitz
Abstract:
Generative AI (GenAI) tools improve productivity in knowledge workflows such as writing, but also risk overreliance and reduced critical thinking. Cognitive forcing functions (CFFs) mitigate these risks by requiring active engagement with AI output. As GenAI workflows grow more complex, systems increasingly present execution plans for user review. However, these plans are themselves AI-generated and prone to overreliance, and the effectiveness of applying CFFs to AI plans remains underexplored. We conduct a controlled experiment in which participants completed AI-assisted writing tasks while reviewing AI-generated plans under four CFF conditions: Assumption (argument analysis), WhatIf (hypothesis testing), Both, and a no-CFF control. A follow-up think-aloud and interview study qualitatively compared these conditions. Results show that the Assumption CFF most effectively reduced overreliance without increasing cognitive load, while participants perceived the WhatIf CFF as most helpful. These findings highlight the value of plan-focused CFFs for supporting critical reflection in GenAI-assisted knowledge work.
Authors:Jiexin Ding, Yizhuo Zhang, Xinyun Liu, Ke chen, Yuntao Wang, Shwetak Patel, Akshay Gadre
Abstract:
Smart glasses are accelerating progress toward more seamless and personalized LLM-based assistance by integrating multimodal inputs. Yet, these inputs rely on obtrusive explicit prompts. The advent of gaze tracking on smart devices offers a unique opportunity to extract implicit user intent for personalization. This paper investigates whether LLMs can interpret user gaze for text-based tasks. We evaluate different gaze representations for personalization and validate their effectiveness in realistic reading tasks. Results show that LLMs can leverage gaze to generate high-quality personalized summaries and support users in downstream tasks, highlighting the feasibility and value of gaze-driven personalization for future mobile and wearable LLM applications.
Authors:Mingyu Zhu, Jiangong Chen, Bin Li
Abstract:
Extended Reality (XR), including virtual, augmented, and mixed reality, provides immersive and interactive experiences across diverse applications, from VR-based education to AR-based assistance and MR-based training. However, widespread XR adoption remains limited due to two key challenges: 1) the high cost and complexity of authoring 3D content, especially for large-scale environments or complex interactions; and 2) the steep learning curve associated with non-intuitive interaction methods like handheld controllers or scripted gestures. Generative AI (GenAI) presents a promising solution by enabling intuitive, language-driven interaction and automating content generation. Leveraging vision-language models and diffusion-based generation, GenAI can interpret ambiguous instructions, understand physical scenes, and generate or manipulate 3D content, significantly lowering barriers to XR adoption. This paper explores the integration of XR and GenAI through three concrete use cases, showing how they address key obstacles in scalability and natural interaction, and identifying technical challenges that must be resolved to enable broader adoption.
Authors:Paige S. DeVries, Michaela Okosi, Ming Li, Nora Dunphy, Gidey Gezae, Dante Conway, Abraham Glasser, Raja Kushalnagar, Christian Vogler
Abstract:
We investigate intelligent personal assistants (IPAs) accessibility for deaf and hard of hearing (DHH) people who can use their voice in everyday communication. The inability of IPAs to understand diverse accents including deaf speech renders them largely inaccessible to non-signing and speaking DHH individuals. Using an Echo Show, we compare the usability of natural language input via spoken English; with Alexa's automatic speech recognition and a Wizard-of-Oz setting with a trained facilitator re-speaking commands against that of a large language model (LLM)-assisted touch interface in a mixed-methods study. The touch method was navigated through an LLM-powered "task prompter," which integrated the user's history and smart environment to suggest contextually-appropriate commands. Quantitative results showed no significant differences across both spoken English conditions vs LLM-assisted touch. Qualitative results showed variability in opinions on the usability of each method. Ultimately, it will be necessary to have robust deaf-accented speech recognized natively by IPAs.
Authors:Thomas Eiter, Tobias Geibinger, Zeynep G. Saribatur
Abstract:
Answer Set Programming (ASP) is a popular declarative reasoning and problem solving approach in symbolic AI. Its rule-based formalism makes it inherently attractive for explainable and interpretive reasoning, which is gaining importance with the surge of Explainable AI (XAI). A number of explanation approaches and tools for ASP have been developed, which often tackle specific explanatory settings and may not cover all scenarios that ASP users encounter. In this survey, we provide, guided by an XAI perspective, an overview of types of ASP explanations in connection with user questions for explanation, and describe how their coverage by current theory and tools. Furthermore, we pinpoint gaps in existing ASP explanations approaches and identify research directions for future work.
Authors:Huixin Xue, Guangjun Xu, Shihong Ren, Xian Gao, Ruian Tie, Zhen Zhou, Hao Liu, Yue Gao
Abstract:
Home-based music therapy devices require accessible and cost-effective solutions for users to understand and track their therapeutic progress. Traditional physiological signal analysis, particularly EEG interpretation, relies heavily on domain experts, creating barriers to scalability and home adoption. Meanwhile, few experts are capable of interpreting physiological signal data while also making targeted music recommendations. While large language models (LLMs) have shown promise in various domains, their application to automated physiological report generation for music therapy represents an unexplored task. We present a prototype system that leverages LLMs to bridge this gap -- transforming raw EEG and cardiovascular data into human-readable therapeutic reports and personalized music recommendations. Unlike prior work focusing on real-time physiological adaptation during listening, our approach emphasizes post-session analysis and interpretable reporting, enabling non-expert users to comprehend their psychophysiological states and track therapeutic outcomes over time. By integrating signal processing modules with LLM-based reasoning agents, the system provides a practical and low-cost solution for short-term progress monitoring in home music therapy contexts. This work demonstrates the feasibility of applying LLMs to a novel task -- democratizing access to physiology-driven music therapy through automated, interpretable reporting.
Authors:Aisvarya Adeseye, Jouni Isoaho, Seppo Virtanen, Mohammad Tahir
Abstract:
Automated interviewers and chatbots are common in research, recruitment, customer service, and education. Many existing systems use fixed question lists, strict rules, and limited personalization, leading to repeated conversations that cause low engagement. Therefore, these tools are not effective for complex qualitative research, which requires flexibility, context awareness, and ethical sensitivity. Consequently, there is a need for a more adaptive and context-aware interviewing system. To address this, an AI-powered interviewer that dynamically generates questions that are contextually appropriate and expertise aligned is presented in this study. The interviewer is built on a locally hosted large language model (LLM) that generates coherent dialogue while preserving data privacy. The interviewer profiles the participants' expertise in real time to generate knowledge-appropriate questions, well-articulated responses, and smooth transition messages similar to human-like interviews. To implement these functionalities, a modular prompt engineering pipeline was designed to ensure that the interview conversation remains scalable, adaptive, and semantically rich. To evaluate the AI-powered interviewer, it was tested with various participants, and it achieved high satisfaction (mean 4.45) and engagement (mean 4.33). The proposed interviewer is a scalable, privacy-conscious solution that advances AI-assisted qualitative data collection.
Authors:Yi Li, Kadek Ananta Satriadi, Jiazhou Liu, Anjali Khurana, Zhiqing Wu, Benjamin Tag, Tim Dwyer
Abstract:
It has been ten years since the term ''Immersive Analytics'' (IA) was coined and research interest in the topic remains strong. Researchers in this field have produced practical and conceptual knowledge concerning the use of emerging immersive spatial display and interaction technologies for sense-making tasks through a number of papers, surveys, and books. However, a lack of truly physically and psychologically ergonomic techniques, as well as standardized human-centric validation protocols for these, remains a significant barrier to wider acceptance of practical IA systems in ubiquitous applications. Building upon a series of workshops on immersive analytics at various conferences, this workshop aims to explore new approaches and establish standard practices for evaluating immersive analytics systems from a human factors perspective. We will gather immersive analytics researchers and practitioners to look closely at these human factors -- including cognitive and physical functions as well as behaviour and performance -- to see how they inform the design and deployment of immersive analytics techniques and applications and to inform future research.
Authors:Amber Kusters, Pooja Prajod, Pablo Cesar, Abdallah El Ali
Abstract:
Within journalistic editorial processes, disclosing AI usage is currently limited to simplistic labels, which misses the nuance of how humans and AI collaborated on a news article. Through co-design sessions (N=10), we elicited 69 disclosure designs and implemented four prototypes that visually disclose human-AI collaboration in journalism. We then ran a within-subjects lab study (N=32) to examine how disclosure visualizations (Textual, Role-based Timeline, Task-based Timeline, Chatbot) and collaboration ratios (Primarily Human vs. Primarily AI) influenced visualization perceptions, gaze patterns, and post-experience responses. We found that textual disclosures were least effective in communicating human-AI collaboration, whereas Chatbot offered the most in-depth information. Furthermore, while role-based timelines amplified AI contribution in primarily human articles, task-based timeline shifted perceptions toward human involvement in primarily AI articles. We contribute Human-AI collaboration disclosure visualizations and their evaluation, and cautionary considerations on how visualizations can alter perceptions of AI's actual role during news article creation.
Authors:Canwen Wang, Angela Chen, Catherine Bao, Siwei Jin, Yee Kit Chan, Jessica R Mindel, Sijia Xie, Holly Swartz, Tongshuang Wu, Robert E Kraut, Haiyi Zhu
Abstract:
Couples therapy, or relationship counseling, helps partners resolve conflicts, improve satisfaction, and foster psychological growth. Traditional approaches to training couples therapists, such as textbooks and roleplay, often fail to capture the complexity and emotional nuance of real couple dynamics. We present a novel multimodal, multi-agent simulation system that models multi-party interactions in couples therapy. Informed by our systematic research, this system creates a low-stakes environment for trainee therapists to gain valuable practical experience dealing with the critical demand-withdraw communication cycle across six couple-interaction stages. In an evaluation study involving 21 US-based licensed therapists, participants blind to conditions identified the engineered agent behaviors (i.e., the stages and the demand-withdraw cycle) and rated overall realism and agent responses higher for the experimental system than the baseline. As the first known multi-agent framework for training couples therapists, our work builds the foundation for future research that fuses HCI technologies with couples therapy.
Authors:Piyush Maheshwari, Sheshera Mysore, Hamed Zamani
Abstract:
Exploratory searches are characterized by under-specified goals and evolving query intents. In such scenarios, retrieval models that can capture user-specified nuances in query intent and adapt results accordingly are desirable -- instruction-following retrieval models promise such a capability. In this work, we evaluate instructed retrievers for the prevalent yet under-explored application of aspect-conditional seed-guided exploration using an expert-annotated test collection. We evaluate both recent LLMs fine-tuned for instructed retrieval and general-purpose LLMs prompted for ranking with the highly performant Pairwise Ranking Prompting. We find that the best instructed retrievers improve on ranking relevance compared to instruction-agnostic approaches. However, we also find that instruction following performance, crucial to the user experience of interacting with models, does not mirror ranking relevance improvements and displays insensitivity or counter-intuitive behavior to instructions. Our results indicate that while users may benefit from using current instructed retrievers over instruction-agnostic models, they may not benefit from using them for long-running exploratory sessions requiring greater sensitivity to instructions.
Authors:Ye Wang, Jiaxing Chen, Hongjiang Xiao
Abstract:
In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.
Authors:Rostyslav Hnatyshyn, Danny Perez, Gerik Scheuermann, Ross Maciejewski, Baldwin Nsonga
Abstract:
Contemporary materials science research is heavily conducted in silico, involving massive simulations of the atomic-scale evolution of materials. Cataloging basic patterns in the atomic displacements is key to understanding and predicting the evolution of physical properties. However, the combinatorial complexity of the space of possible transitions coupled with the overwhelming amount of data being produced by high-throughput simulations make such an analysis extremely challenging and time-consuming for domain experts. The development of visual analytics systems that facilitate the exploration of simulation data is an active field of research. While these systems excel in identifying temporal regions of interest, they treat each timestep of a simulation as an independent event without considering the behavior of the atomic displacements between timesteps. We address this gap by introducing LAMDA, a visual analytics system that allows domain experts to quickly and systematically explore state-to-state transitions. In LAMDA, transitions are hierarchically categorized, providing a basis for cataloging displacement behavior, as well as enabling the analysis of simulations at different resolutions, ranging from very broad qualitative classes of transitions to very narrow definitions of unit processes. LAMDA supports navigating the hierarchy of transitions, enabling scientists to visualize the commonalities between different transitions in each class in terms of invariant features characterizing local atomic environments, and LAMDA simplifies the analysis by capturing user inputs through annotations. We evaluate our system through a case study and report on findings from our domain experts.
Authors:Pooja Prajod, Hannes Cools, Thomas Röggla, Karthikeya Puttur Venkatraj, Amber Kusters, Alia ElKattan, Pablo Cesar, Abdallah El Ali
Abstract:
As artificial intelligence (AI) is increasingly integrated into news production, calls for transparency about the use of AI have gained considerable traction. Recent studies suggest that AI disclosures can lead to a ``transparency dilemma'', where disclosure reduces readers' trust. However, little is known about how the \textit{level of detail} in AI disclosures influences trust and contributes to this dilemma within the news context. In this 3$\times$2$\times$2 mixed factorial study with 40 participants, we investigate how three levels of AI disclosures (none, one-line, detailed) across two types of news (politics and lifestyle) and two levels of AI involvement (low and high) affect news readers' trust. We measured trust using the News Media Trust questionnaire, along with two decision behaviors: source-checking and subscription decisions. Questionnaire responses and subscription rates showed a decline in trust only for detailed AI disclosures, whereas source-checking behavior increased for both one-line and detailed disclosures, with the effect being more pronounced for detailed disclosures. Insights from semi-structured interviews suggest that source-checking behavior was primarily driven by interest in the topic, followed by trust, whereas trust was the main factor influencing subscription decisions. Around two-thirds of participants expressed a preference for detailed disclosures, while most participants who preferred one-line indicated a need for detail-on-demand disclosure formats. Our findings show that not all AI disclosures lead to a transparency dilemma, but instead reflect a trade-off between readers' desire for more transparency and their trust in AI-assisted news content.
Authors:Francesco Dettori, Matteo Forasassi, Lorenzo Veronese, Livia Lestingi, Vincenzo Scotti, Matteo Giovanni Rossi
Abstract:
Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.
Authors:Xinyi Zhou, Zeinadsadat Saghi, Sadra Sabouri, Rahul Pandita, Mollie McGuire, Souti Chattopadhyay
Abstract:
The widespread adoption of Large Language Models (LLMs) in software development is transforming programming from a solution-generative to a solution-evaluative activity. This shift opens a pathway for new cognitive challenges that amplify existing decision-making biases or create entirely novel ones. One such type of challenge stems from cognitive biases, which are thinking patterns that lead people away from logical reasoning and result in sub-optimal decisions. How do cognitive biases manifest and impact decision-making in emerging AI-collaborative development? This paper presents the first comprehensive study of cognitive biases in LLM-assisted development. We employ a mixed-methods approach, combining observational studies with 14 student and professional developers, followed by surveys with 22 additional developers. We qualitatively compare categories of biases affecting developers against the traditional non-LLM workflows. Our findings suggest that LLM-related actions are more likely to be associated with novel biases. Through a systematic analysis of 90 cognitive biases specific to developer-LLM interactions, we develop a taxonomy of 15 bias categories validated by cognitive psychologists. We found that 48.8% of total programmer actions are biased, and developer-LLM interactions account for 56.4% of these biased actions. We discuss how these bias categories manifest, present tools and practices for developers, and recommendations for LLM tool builders to help mitigate cognitive biases in human-AI programming.
Authors:Mohammadreza Behboodi, Eli Kinney-Lang, Ali Etemad, Adam Kirton, Hatem Abou-Zeid
Abstract:
Foundation Models (FMs) have surged in popularity over the past five years, with applications spanning fields from computer vision to natural language processing. Brain-Computer Interfaces (BCIs) have also gained momentum due to their potential to support individuals with complex disabilities. Among BCI paradigms, code-modulated Visual Evoked Potentials (c-VEPs) remain relatively understudied, despite offering high information transfer rates and large selection target capacities. However, c-VEP systems require lengthy calibration sessions, limiting their practicality outside of laboratory settings. In this study, we use a FM for the first time to eliminate the need for lengthy calibration in c-VEP BCI systems. We evaluated two approaches: (1) a truly calibration-free approach requiring no subject-specific data, and (2) a limited calibration approach, where we assessed the benefit of incorporating incremental amounts of calibration data. In both cases, a classification head is trained on data from other subjects. For a new subject, no calibration data is required in the calibration-free setup, making the c-VEP system effectively plug-and-play. The proposed method was tested on two c-VEP datasets. For the calibration-free approach, the average accuracy on the first dataset (n = 17) was 68.8% +/- 17.6%, comparable to the full-calibration performance reported in the original study (66.2% +/- 13.8%), which required approximately 11 minutes of calibration. On the second dataset (n = 12), the calibration-free accuracy was 71.8% +/- 20.2%, versus 93.7% +/- 5.5% from the original study, which required around 3.5 minutes. A limited-calibration approach using only 20% of the subject's data (approximately 43 seconds) yielded 92% +/- 5.2% accuracy. These results indicate that our FM-based approach can effectively eliminate or significantly reduce the need for lengthy calibration in c-VEP BCIs.
Authors:Weiyue Li, Minda Zhao, Weixuan Dong, Jiahui Cai, Yuze Wei, Michael Pocress, Yi Li, Wanyan Yuan, Xiaoyue Wang, Ruoyu Hou, Kaiyuan Lou, Wenqi Zeng, Yutong Yang, Yilun Du, Mengyu Wang
Abstract:
Large language models (LLMs) are increasingly used as automated evaluators, yet prior works demonstrate that these LLM judges often lack consistency in scoring when the prompt is altered. However, the effect of the grading scale itself remains underexplored. We study the LLM-as-a-judge problem by comparing two kinds of raters: humans and LLMs. We collect ratings from both groups on three scales and across six benchmarks that include objective, open-ended subjective, and mixed tasks. Using intraclass correlation coefficients (ICC) to measure absolute agreement, we find that LLM judgments are not perfectly consistent across scales on subjective benchmarks, and that the choice of scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. Aggregated over tasks, the grading scale of 0-5 yields the strongest human-LLM alignment. We further demonstrate that pooled reliability can mask benchmark heterogeneity and reveal systematic subgroup differences in alignment across gender groups, strengthening the importance of scale design and sub-level diagnostics as essential components of LLM-as-a-judge protocols.
Authors:Md Nazmus Sakib, Naga Manogna Rayasam, Sanorita Dey
Abstract:
Automated interviewing tools are now widely adopted to manage recruitment at scale, often replacing early human screening with algorithmic assessments. While these systems are promoted as efficient and consistent, they also generate new forms of uncertainty for applicants. Efforts to soften these experiences through human-like design features have only partially addressed underlying concerns. To understand how candidates interpret and cope with such systems, we conducted a mixed empirical investigation that combined analysis of online discussions, responses from more than one hundred and fifty survey participants, and follow-up conversations with seventeen interviewees. The findings point to several recurring problems, including unclear evaluation criteria, limited organizational responsibility for automated outcomes, and a lack of practical support for preparation. Many participants described the technology as far less advanced than advertised, leading them to infer how decisions might be made in the absence of guidance. This speculation often intensified stress and emotional strain. Furthermore, the minimal sense of interpersonal engagement contributed to feelings of detachment and disposability. Based on these observations, we propose design directions aimed at improving clarity, accountability, and candidate support in AI-mediated hiring processes.
Authors:Yijun Liu, Yifan Song, John Gallagher, Sarah Sterman, Tal August
Abstract:
While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.
Authors:Rebecca Umbach, Griffin Hunt, John Buckley, Joel Scanlan, Caoilte Ó Ciardha, Ethel Quayle, Ainslie Heasman, Maximlian von Heyden, Elizabeth Letourneau, Donald Findlater, Tegan Insoll, Richard Wortley, Chad Steel, Abhishek Roy
Abstract:
Google Search deploys a "Onebox" feature at the top of the results page when users conduct searches for Child Sexual Abuse Material. This study evaluates the impact of a strategic shift in this feature, comparing a revised intervention, focused on repercussions and therapeutic resources, to a previous iteration that focused on reporting. Using a difference-in-differences analysis of internal Google Search logs data, we found the new messaging resulted in a 3.8 percentage point reduction as compared to the status quo in subsequent CSAM-related queries within the same Search session. We found an average click through rate of 0.73% on any of the hyperlinked buttons to help-providing resources. Together, this research presents convergent evidence that a subset of individuals can be deterred from ongoing CSAM-seeking and redirected to therapeutic services.
Authors:Minh Duc Chu, Yifan Wu, Zhiyi Chen, Angel Hsing-Chi Hwang, Luca Luceri
Abstract:
Millions turn to AI companion chatbots during loneliness, grief, and personal crises. How these companion platforms respond in such moments can shape the trajectory of a user's vulnerable state. Yet we lack tools to characterize what each platform actually does when users open up. Existing audits score reactions to pre-defined crisis prompts and miss the underlying decision policy that governs sustained interaction. We address these gaps with two key contributions. First, we introduce the AI Companion Vulnerability-Response Taxonomy, a paired taxonomy of user vulnerability and chatbot response designed for analyzing extended companion chatbot interactions. Second, we infer the response policy each platform follows across distinct vulnerability scenarios by applying Inverse Reinforcement Learning to ~48k turns of real-world user conversations with GPT-4.1, Character.AI, and Replika. Our findings reveal what AI companions prioritize in conversations with vulnerable users: GPT-4.1 reaches for advice, Character.AI spreads its response across different strategies without a dominant mode, and Replika consistently asks questions and stays present. Each, however, downweights the responses that introduce corrective friction: GPT-4.1 probes less as conversations continue and when interacting with psychologically high-risk users; Replika advises bonded users more and challenges them less; Character.AI shows no committed engagement strategy on internal distress. Estimated policies are invisible to output-level audits, providing a new lens for auditing chatbots in the wild and enabling more realistic safety evaluation.
Authors:Sophia Liu, Sarah Abowitz, Yijun Liu, Sarah Sterman, Shm Garanganao Almeda, Max Kreminski
Abstract:
Reading augmentation systems increasingly help readers process text at scale. While these tools address real constraints of time and cognitive load, they often implicitly frame reading as information transmission, or "reading to discard," delegating interpretation and effort to the machine. Yet this delegation changes the outcome of reading. For example, in scholarly reading, deciding what a research text implies and why it matters is central to the work of scholarly production. We propose creative reading as an alternative goal: reading augmentation that supports readers in creating both readings and themselves as readers. By putting literary and narrative theories into conversation with scholarly sensemaking and creativity support, we present a provocation-oriented design space for valuing the process of reading as a way of preserving a plurality of readings and transforming readers over time.
Authors:Marcin Rządeczka, Maciej Wodziński, Kacper Zacharski, Marcin Moskalewicz
Abstract:
We present experimental findings from a study (N=99) examining how intellectual humility (IH), i.e., the metacognitive awareness of epistemic limitations, affects the evaluation of AI-generated health dialogues varying in scientific rigor. Participants were randomly assigned to evaluate one of three dialogues about exercise and mental health: scientifically accurate, moderately pseudoscientific, or strongly pseudoscientific. Results reveal that IH functions as a selective cognitive filter. Individuals with higher humility scores rated pseudoscientific content as significantly less credible, while showing no correlation with credibility assessments of accurate content. Crucially, humility did not predict the ability to identify AI as the source of dialogues, suggesting that epistemic vigilance operates on content quality rather than source attribution. We interpret these findings through an evolutionary lens, proposing that IH represents an ancestral adaptation for navigating informationally uncertain environments. It remains effective at detecting exploitation attempts in AI-generated content, despite humans lacking evolved mechanisms for detecting AI sources. The study contributes to understanding how foundation models might improve or undermine human epistemic defenses, especially in health communication contexts.
Authors:Xiao Jin, Rahul K. Dass, Ashok K. Goel
Abstract:
Intelligent tutoring systems excel at generating explanations but rarely provide principled diagnosis of where and why a learner is wrong. We introduce a misstep-aware coaching capability for Ivy, a neurosymbolic AI coach, built on a two-model architecture that augments a Task-Method-Knowledge (TMK) model with a new Pedagogical Model (PM) in the context of an online graduate AI course at Georgia Tech. The PM makes instructor diagnostic knowledge explicit and machine-readable by encoding, for each quiz question and incorrect response, the learner's underlying belief(a brief statement of the incorrect idea or missing knowledge), a TMK locus(the source of the misunderstanding), a misconception type and targeted scaffolding derived from the instructor's Q\&A key. Using quiz questions from the course, we demonstrate a proof-of-concept pipeline that detects and classifies learner errors and generates diagnosis-grounded scaffolding, moving Ivy beyond knowledge retrieval toward diagnostic misstep awareness, and enabling more precise, actionable feedback that supports conceptual change and advances adaptive learning systems in AI in education and the learning sciences.
Authors:Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García
Abstract:
Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand--object manipulation or their immediate outcomes.We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly-semantic hand kinematic features, which are then aligned and combined with pretrained video--text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
Authors:Amir Ghasemian, Homa Hosseinmardi, Upasana Dutta, Duncan J. Watts
Abstract:
Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we see, they shape what we perceive, raising concerns about filter bubbles, radicalization, polarization, and social inequality. Large language models (LLMs) enable more powerful personalization, intensifying these dynamics. Yet most recommenders are tuned for engagement or limited accuracy metrics, with little attention to broader social implications, e.g. how personalization reshapes exposure in socially consequential domains. We investigate whether LLM-assisted reranking, while improving personalization, inadvertently amplifies exposure to ideologically extreme or conspiratorial political content, a risk theorized but not empirically characterized in news recommendation. Using real news-consumption histories, we rerank YouTube's sidebar candidates through zero-shot, instruction-based prompting. We compare a baseline prompt with a constrained variant that preserves topical relevance and broadens ideological exposure while reducing conspiratorial or extreme content. Without constraints, reranking strengthened personalization but increased exposure to conspiratorial and extremist material for users whose histories contained such content. Lightweight prompt-level regularization reduced promotion of extreme content and increased ideological diversity, with modest relevance loss. Synthetic experiments suggest that LLMs rerank via statistical regularities in language rather than semantic understanding of ideology, clarifying why naive prompts amplify these patterns and why regularization can reshape them. Together, our results highlight the power of LLMs to operationalize contextual nuance in high-stakes recommendation, and the need to evaluate LLM-assisted personalization beyond accuracy and treat prompt design as a value-laden rather than neutral default.
Authors:Hennes Rave, Katharina Kronenberg, Hannes Gödde, Lea Tobergte, Michael Holtkamp, Julia Werner, Peter Bohrer, Fabian Lohöfer, Rickmer Braren, David Clases, Uwe Karst, Lars Linsen
Abstract:
Hyperspectral bioimaging techniques such as infrared (IR) microscopy and laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS) produce high-dimensional, spatially resolved datasets that require sophisticated analysis to reveal chemically and anatomically meaningful structures. Existing software solutions are typically modality-specific and cover only parts of the analytical workflow, forcing researchers to transfer data across multiple tools and manually reconcile results. We present MIA (Multiscale Image Analysis), a modality-agnostic visual analysis environment that integrates the full exploratory workflow -- from spectral preprocessing and dimensionality reduction to interactive segmentation and spectral similarity analysis -- within a single, tightly coupled interface. MIA supports hierarchical and landmark-based embeddings to handle datasets of varying scale and complexity, interactive and automatic segmentation with a shared state across all linked views, and multimodal analysis of co-registered datasets from different instruments. We demonstrate the effectiveness of MIA through three use cases drawn from real analytical chemistry workflows: (1) the recovery of biologically meaningful tissue compartments through derivative preprocessing and hierarchical embedding, (2) pigment identification via spectral similarity search with spatial overview, and (3) multimodal tissue characterization combining molecular IR and elemental LA-ICP-MS data. Qualitative feedback from domain expert collaborators confirms that MIA reduces the need for tool-switching and supports analytical insights that are difficult to obtain with existing software.
Authors:Mert Yazan, Suzan Verberne, Frederik Bungaran Ishak Situmeang
Abstract:
Artificial Intelligence (AI) agents personalize their responses by tailoring explanations to users' backgrounds, interests, and prior interactions, referred to as contextualization. Personalization has been identified as a persuasive strategy in politics or in marketing. However, the persuasive effect of contextualization in everyday tasks, where users often lack prior knowledge, remains unclear. We conducted a $2\times2$ between-subjects experiment ($N = 380$) examining how contextualization, combined with conversational warmth, shapes reliance and persuasiveness of an AI assistant arguing against expert recommendations. Our findings reveal that contextualization reduces the persuasive power of AI, but its combination with warmth restores persuasiveness through a crossover interaction. Reliance on AI is present across conditions and is invariant to the conversational design. Trust strongly predicts both persuasion and reliance, yet neither contextualization nor warmth operates through trust. AI literacy decouples trust from behavior: more literate users report lower trust in the assistant, yet are more persuaded and more reliant on its advice. These results suggest that users are prone to deferring to AI agents over human expert judgment; however, interface-level conversational design choices have a limited role in shaping the behavior.
Authors:Mert Yazan, Frederik Bungaran Ishak Situmeang, Suzan Verberne
Abstract:
Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overreliance when users blindly trust AI and accept its answers without fact-checking. Information search increasingly follows a hybrid interaction paradigm that combines conversational AI with web search, making fact-checking easier. In this paper, we examine whether this interaction paradigm is effective in curbing reliance. We further investigate the underlying factors (e.g., digital literacy and conversation warmth) that drive users to verify AI answers. We conduct a mixed-subjects question-answering experiment where participants interact with either a warm or a neutral chatbot. Our findings reveal that reliance persists despite users having access to both conversational and web search. The decision to verify is driven primarily by existing user perceptions (e.g., prior trust in chatbots) rather than answer properties, with some users fact-checking regardless of the context and others trusting chatbots by default. Warm conversational style has an indirect yet critical influence on reliance by increasing agreement with the chatbot when it is incorrect. Consulting additional AI sources predicts higher accuracy, while traditional web search does not. Our study extends overreliance research by: (a) demonstrating its persistence despite access to fact-checking, (b) identifying verification behavior as user-dependent, and (c) revealing conversational warmth's indirect effect on overreliance with implications for designing trustworthy conversational search systems.
Authors:Victor Persson, Christofer Boo, Mohit Sharma, Ingrid Hotz
Abstract:
Digital Image Correlation (DIC) enables dense, time-resolved measurement of surface strain in deforming materials, providing insight into strain localization and failure mechanisms. However, the resulting strain fields are typically explored frame-by-frame through spatial visualizations, making global temporal patterns difficult to discern. We present a visual summarization approach that represents the evolution of high-strain regions as a single Sankey diagram constructed from superlevel sets of the von Mises equivalent strain field. By tracking connected components over time via spatial overlap, the diagram encodes the birth, persistence, merging, and disappearance of strain concentrations. Applied to four tensile test datasets with varying notch geometries, the approach compactly captures differences in deformation regimes and qualitative precursors to failure, complementing traditional spatial strain visualizations with a global temporal overview.
Authors:Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara
Abstract:
We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy, from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text--audio--vision--motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior can be interesting to study. Our team was placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 for the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
Authors:Riley Zilka, Sergey Khlynovskiy, Allie Wang, Martin Jagersand
Abstract:
Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.
Authors:Ryan Smith, Kyle D. Chin, Tamara Munzner
Abstract:
Patients often struggle to communicate coherent accounts of their health histories during time-constrained clinical encounters. These accounts, which we refer to as health stories, include both clinical events and lived experiences. Existing systems prioritize structured, clinician-centered data and provide limited support for eliciting and communicating patient-generated narratives. We present HealthTale, a patient-centric visualization system designed to elicit health stories from patients and structure them to facilitate communication during initial clinical conversations. Its design arises from a multi-stage qualitative investigation across domain expert discussions, online narratives (n=20), patient (n=11) and clinician (n=6) interviews, and elicited health stories (n=22), identifying recurring patterns in how individuals construct and communicate their health stories. HealthTale transforms freeform narratives into structured timeline representations, grounded in a data abstraction that models health stories as events that are grouped by health concern and time, capturing both clinical and contextual information, with the flexibility to handle temporally imprecise data and non-linear distributions of events across time. Through evaluation with patients (n=34) and clinicians (n=3), we find that HealthTale supports recall, organization, and self-advocacy, while enabling clinicians to rapidly interpret patient-generated narratives and establish a shared understanding.
Authors:Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher
Abstract:
Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack--defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
Authors:Julie A. Vera, Mark Zachry, David W. McDonald
Abstract:
This paper examines collaborative sensemaking during severe weather events through the emerging phenomenon of "weatherfluencers" or content creators who livestream meteorological interpretation on platforms like YouTube. Drawing from sensemaking theory, crisis informatics, and platform studies, we analyze how these creators navigate the sociotechnical dynamics of interpreting severe weather in real time with distributed audiences. Through critical incident analysis of 13 Particularly Dangerous Situation (PDS) storm warnings across three prominent weatherfluencers, we identify three key practices: multi-source information triangulation, temporal bridging techniques, and platform-specific adaptations that transform entertainment interfaces into safety-critical communication channels. Our analysis shows how these practices challenge existing models of crisis communication by integrating distributed expertise, collapsing temporal frames, and reconfiguring platform affordances. This research contributes to understanding how informal emergency communicators mediate between institutional alerting systems and public needs, and how visual, multimodal crisis communication differs from text-centered approaches.
Authors:Valerie Tan, Kimberly Hegemann, Jens Gerken
Abstract:
Adults with ADHD may use a self-management technique known as Body Doubling, in which the participant employs the presence of one or more agents as a means of initiating and completing tasks. We developed a framework on body doubling with twelve dimensions to better understand the characteristics of body doubling and discover future research directions for developing and testing body doubling for adults with ADHD. Our framework accounts for individual motivation, agent-related dimensions, interaction related dimensions, contextual dimensions, and efficacy. These dimensions show existing research gaps such as limited mixed reality prototypes, possibilities for more interactive body doubles, and the need for empirical studies to further understand of body doubling and adults with ADHD.
Authors:Brandon Biggs, Christopher Toth, James M. Coughlan, Bruce N. Walker
Abstract:
Digital maps are used to communicate generalized spatial information and relationships, yet are commonly made "accessible" using tables that lack geographic information. This study examines whether these tables and interactive text maps (ITMs) may be comparable to visual maps. Twenty sighted and 20 blind and low-vision individuals (BLVIs) performed tasks designed to compare visual maps, ITMs, and tables. Participants answered numeric, geographic, and combined numeric geographic questions using each representation, and performance, preference, and NASA-TLX were measured. Across both participant groups, map representations (visual and ITMs) significantly outperformed tables on geographic-based questions, while performance differences were minimal for numeric questions. For sighted participants, performance on geographic questions did not significantly differ between visual maps and ITMs, indicating that a larger powered study may find an "equivalent purpose" across these two conditions. Participants preferred map-based representations over tables. Perceived workload was highest for the ITM, intermediate for the visual map, and lowest for the table. Consistent with the Map Equivalent Purpose Framework, these findings indicate that Web Content Accessibility Guidelines-compliant ITMs can provide access to spatial information, unlike tables. These findings challenge prevailing accessibility practice that recommends tables lacking geographic information as map alternatives, and motivate reconsideration of accessibility legislation exempting digital thematic maps.
Authors:Sora Kang, Soyun Jeon, Jinsu Eun, Kwangwon Lee, Chaerin Song, Minyoung Joo, Joonhwan Lee
Abstract:
Conversational AI increasingly supports everyday decision-making, yet most systems rely on data-centric reasoning rather than the heuristic and interactional strategies people use in natural conversation. To ground design in actual human practice, we analyze 955 real-world Korean conversations (15,476 utterances) involving food and travel decisions, applying a decision-making codebook through an LLM-assisted coding pipeline. Our findings reveal that people prioritize satisficing over optimization, relying heavily on internal knowledge and interactional strategies to manage cognitive load. Critically, we identify a frequency-efficiency mismatch: the most prevalent heuristics sustain conversational flow during exploration, whereas infrequent, rule-based strategies are highly effective at driving resolution during exploitation. By mapping how these patterns transfer across the spectrum of human-AI interaction, this work provides empirical grounding consistent with cognitive theories of decision-making and offers design implications that align AI systems with human heuristic processes.
Authors:Xuening Wu, Yanlan Kang, Qianya Xu, Kexuan Xie, Jiaqi Mi, Honggang Wang, Yubin Liu, Zeping Chen
Abstract:
Large language models (LLMs) are reshaping how knowledge is produced, with increasing reliance on AI systems for generation, summarization, and reasoning. While prior work has studied cognitive offloading in humans and model collapse in recursive training, these effects are typically considered in isolation. We propose a unified perspective: humans and language models form a coupled dynamical system linked by a feedback loop of usage, generation, and retraining. We introduce a minimal model with three variables -- human cognition, data quality, and model capability -- and show that this feedback can give rise to distinct dynamical regimes. Our analysis identifies three regimes: co-evolutionary enhancement, fragile equilibrium, and degenerative convergence. Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium. From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression. These results suggest that the trajectory of AI systems is shaped not only by model design, but by the dynamics of human-AI co-evolution.
Authors:Alex Bäuerle, Adam Connors, Alexander Novikov, Adam Zsolt Wagner, Ngân Vũ, Fernanda Viegas, Martin Wattenberg, Lucas Dixon
Abstract:
Artificial intelligence offers powerful new tools for scientific discovery, but the interaction paradigms required to effectively harness these systems remain underexplored. In this paper, we present findings from a formative user study with 11 expert mathematicians who used AlphaEvolve, an evolutionary coding agent, to tackle advanced problems in their fields of expertise. We identify and characterize a distinct workflow we term intentmaking, the iterative process of discovering, defining, and refining one's experimental goals through active system interaction. We frame this as a natural extension to sensemaking, the cognitive process of building an understanding of complex or novel data. We suggest that users enter a cycle of intentmaking (defining and updating their experiment) and sensemaking (interpreting the results) which repeats many times during the course of an investigation. Our documentation of these themes suggests an approach to designing AI tools for scientific discovery that goes beyond the existing question/answer model of many current systems, treating them as collaborative instruments rather than opaque black-box assistants.
Authors:Senne Deproost, Mehrdad Asadi, Ann Nowé
Abstract:
We introduce State Vector Space Partitioning (SVSP), a novel method to mimic a black box reinforcement learning policy using a set of human-interpretable subpolicies. By partitioning a distillation dataset of state action pairs with linear support vector machine splits, SVSP constructs a compact and structured representation of the original policy. Our method improves mean return by +7.4\% over previous critic driven state partitioning attempts such as Voronoi State Partitioning (VSP) and +2.8\% over the original TD3 policy, while reducing the number of required subpolicies against VSP by 82.1\%. Our results pave the path towards a more flexible form of distillation where both the decision boundary and surrogate models can be chosen within a margin of the original black box behavior.
Authors:James Yen, Zhibai Huang, Zhixiang Wei, Tinghao Yi, Shupeng Zeng, Liang Pang, Songtao Xue, Zhengwei Qi
Abstract:
Consumer robotics demands consolidation of safety-critical control, perception pipelines, and user applications on shared multicore platforms. While static partitioning hypervisors provide hardware-enforced isolation, directly transplanting automotive architectures encounters an expertise asymmetry problem in which end-users modifying robot behavior lack the systems knowledge that platform developers possess. We present an architecture addressing this challenge through three integrated components. A Safe IO Cell provides hardware-level override capability. A Parameter Synchronization Service encapsulates cross-domain complexity. A Safety Communication Layer implements IEC~61508-aligned verification. Our empirical evaluation on an ARM Cortex-A55 platform demonstrates that partition isolation reduces cycle-period jitter by 84.5\% and cuts tail timing error by nearly an order of magnitude (p99 $|$jitter$|$ from 69.0\,$μ$s to 7.8\,$μ$s), eliminating all $>$50\,$μ$s~excursions.
Authors:Vicente Pelechano, Antoni Mestre, Manoli Albert, Miriam Gil
Abstract:
Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks or take complementary roles depending on context, fatigue, and the stakes involved. Governing that distribution -- balancing efficiency, oversight, and human capability -- remains an open problem. This paper presents Human-AI Adaptive Symbiosis (HAAS), an implemented framework for adaptive task allocation in software engineering and manufacturing. HAAS combines two coupled components: a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback. Task-agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum -- from human-only to fully autonomous -- embedded in a reproducible benchmark spanning both domains. Three empirical findings emerge. First, governance is not a binary switch but a tunable design variable: tighter constraints predictably convert autonomous AI assignments into supervised collaborations, with domain-specific costs and benefits. Second, in manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously -- a workload-buffering effect that contradicts the usual framing of governance as pure overhead. Third, no single governance setting dominates across all contexts; moderate governance becomes increasingly competitive as the learner accumulates experience within the governed action space. Together, these findings position HAAS as a pre-deployment workbench for comparing and inspecting human--AI allocation policies before organisational commitment.
Authors:Christopher Kelly, Angelica Chowdhury, Alexandra Campili, Bimpe Ayoola, Devin Barbour, Thomas Chen Dawson, Ze Shen Chin, Rokas Gipiškis
Abstract:
This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Drawing on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the (Shadish et al., 2002) four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. We position the principles and guidelines as serving three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work by centering evaluation on human performance rather than model output alone, formalizing causal inference through RCT methodology for AI contexts, integrating heterogeneity analysis and practical significance assessment, implementing a graded transparency and repeatability framework, and addressing AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.
Authors:Argianto Rahartomo, AmirHossein Jamshidipoor, Mohammad Ghafari
Abstract:
We propose a graphical authentication scheme that follows a simple ``Pick and Sort'' design in which users choose visual elements and arrange them within a grid. Both the number of selected elements and the grid size are configurable, and the visual elements can be customized for specific user groups, such as children. A preliminary study with a prototype implementation indicated that the scheme is easy to learn and flexible to deploy. Although login times are longer than those of conventional authentication methods, the additional interaction may be acceptable in scenarios that are not time-critical, such as infrequent-access use cases or as a secondary authentication mechanism.
Authors:Sheza Munir, Ahanaf Rodoshi, Sumin Lee, Feiran Chang, Xujie Si, Syed Ishtiaque Ahmed
Abstract:
Standard methods for aggregating natural language judgments, such as majority voting, often fail to produce logically consistent results when applied to high-conflict domains, treating differing opinions as noise. We propose a neuro-symbolic aggregation framework that formalizes conflict resolution through Weighted Maximum Satisfiability (MaxSAT). Our pipeline utilizes a language model to map unstructured natural language explanations into interpretable logical predicates and confidence weights. These components are then encoded as soft constraints within the Z3 solver, transforming the aggregation problem into an optimization task that seeks the maximum consistency across conflicting testimony. Using the Reddit r/AmItheAsshole forum as a case study in large-scale moral disagreement, our system generates logically coherent verdicts that diverge from popularity-based labels 62% of the time, corroborated by an 86% agreement rate with independent human evaluators. This study demonstrates the efficacy of coupling neural semantic extraction with formal solvers to enforce logical soundness and explainability in the aggregation of noisy human reasoning.
Authors:Karim Alghoul, Faisal Mohd, Fedwa Laamarti, Hussein Al Osman, Abdulmotaleb El Saddik
Abstract:
With the growing integration of human-computer interaction into everyday life, advances in machine learning have enabled systems to better perceive and respond to users' emotional states. Most existing affect recognition datasets focus on static environments, limiting their applicability to immersive multimedia contexts such as Virtual Reality (VR). In this paper, we introduce WARM-VR, a novel publicly available multimodal dataset designed to support affect recognition in immersive, multisensory environments using wearable sensing instrumentation. Data were collected from 31 participants aged 19-37 using wearable sensors: a wristband measuring Blood Volume Pulse (BVP), EDA, skin Temperature, three-axis Acceleration, and a chest strap recording ECG signals. Participants engaged in immersive VR experiences designed to elicit relaxation through a calming beach environment following stress induction via an arithmetic task. These sessions incorporated synchronized multimedia stimuli: visual, auditory, and olfactory. Affective states were assessed subjectively through validated self-report questionnaires and objectively through the analysis of physiological measurements. Statistical analysis of the questionnaires confirmed that VR relaxation significantly reduced negative affect, particularly with olfactory enhancement. Furthermore, we established a benchmark on the dataset using widely recognized machine learning algorithms. The best performance for binary classification from BVP data of valence, was obtained with a CNN and a CNN-Bi-GRU model, both achieving an average F1-score of 0.63 and an AUC of 0.69. For arousal, a lightweight Transformer architecture provided the most balanced results (F1-0 0.54 and F1-1 0.63), outperforming recurrent hybrids. In the relaxation task, a CNN-Bi-GRU model reached the highest overall performance (average F1-score 0.64, AUC 0.69).
Authors:Franziska Kaltenberger, Wei-Ling Chen, Enkeleda Thaqi, Enkelejda Kasneci
Abstract:
Remote and webcam-based eye tracking in multi-line reading suffers from various noise factors and layout ambiguity, precisely where real-time reading support needs reliable, per-fixation line assignment. Prior work largely addresses this challenge post hoc or by restricting behavior (e.g., disallowing re-reading), undermining interactive use. We propose CONF-LA (Confidence-score-based Online Fixation-to-Line Assignment), a principled, low-latency approach that integrates knowledge about reading behavior and Gaussian line likelihoods over fixations to compute a posterior-line-score and defers assignments when uncertainty is high. Evaluated on existing open-source data, CONF-LA demonstrates stable performance in post hoc analysis and closes the online-offline gap (1-2 %) with a mean per-fixation latency of 0.348 ms. Our approach exhibits particular invariance toward regressions, yielding significant improvement in ad hoc median accuracies on children data (approx. 95 %) over all tested algorithms. We encourage further research in this direction and discuss possibilities for future development.
Authors:Walid Shaker, Mustafa Suphi Erden
Abstract:
Robotic-assisted surgery offers significant clinical advantages but largely eliminates direct haptic feedback, increasing the risk of excessive tool-tissue interaction forces. Although recent commercial systems have begun to introduce force feedback, their high cost limits accessibility, particularly for surgical training. This paper presents a modular experimental robotic laparoscopic instrument integrated with a real-time haptic feedback framework. The proposed instrument employs a wrist-mounted force/torque (F/T) sensor to estimate tool-tissue interaction forces while avoiding the durability and integration challenges of tip-mounted sensors. A haptic feedback framework is developed to extract the external contact forces, render them to the haptic device, and generate stable and perceptually meaningful feedback. The instrument is integrated into the robotic surgery training system (RoboScope) and evaluated through a controlled user study involving a force regulation task. Experimental results demonstrate that haptic feedback significantly improves task success rate, force regulation accuracy, and task efficiency compared to visual-only feedback. The proposed instrument enables stable, high-fidelity haptic interaction, supporting effective robotic surgery training.
Authors:Catherine Liu, Tao Long, Asya Vaisberg, Chau Vu, Jiaju Ma, Jingyi Li
Abstract:
Creativity support tools (CSTs) aim to elevate the quality of artists' creative processes and artifacts. Yet most current CST evaluations overlook temporal and social aspects of tool use. To address this gap, we present a longitudinal, group-based CST evaluation through a three-week deployment of ArtKrit, a computational drawing tool that supports disciplined drawing. Nine digital artists, organized into three communities of practice, completed weekly "master studies" alongside a researcher-artist. Our results show users' evolving relationships with ArtKrit over time - from early experimentation to selective incorporation or misuse - alongside changes in their ways of artistic seeing. These changes unfolded within artist support networks that fostered confidence and creative safety, and validated individual expression. Overall, our findings suggest that CST evaluations can - and should - be designed as opportunities for meaningful artistic engagement rather than purely extractive measurement exercises. We contribute this longitudinal, group-based approach as one CST evaluation method.
Authors:Ben Knight, Wm. Matthew Kennedy, James Edgell
Abstract:
AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.
Authors:Isidro Butaslac, Yota Nagaya, Almira Princess Redoble, Jordan Aiko Deja, Nicko Reginio Caluya, Maheshya Weerasinghe, Taishi Sawabe, Hirokazu Kato, Eric Cesar Vidal
Abstract:
In everyday life, physical effort is often minimized and convenience is prioritized, making it difficult for many people to sustain light exercise and stretching despite well-known long-term benefits. This challenge often arises not from objective movement limitations, but from whether an action feels doable in the moment and, therefore worth continuing. This position paper argues that subtle VR hand redirection (HR) can be reframed as a form of cross-sensory support for sustained practice by targeting perceived doability: a moment-to-moment cognitive appraisal that an action is within one's capability while requiring manageable effort. We propose that conservative HR, applied within known perceptual limits, can create repeated micro-success experiences (e.g., reaching a virtual goal earlier with similar physical movement). These micro-successes may increase continuation intention and early re-engagement without relying on overt pressure or intensive coaching. At the same time, such support raises questions about autonomy and authenticity. We therefore articulate two research questions: (RQ1) how HR shifts perceived doability to support sustained practice and positive behavior change; and (RQ2) when HR functions as acceptable support versus becoming counterproductive by undermining authenticity, agency, trust, or fostering dependence. We present an initial sit-and-reach VR prototype, outline a research plan, and identify key design tensions to spark community discussions on autonomy-preserving cross-sensory futures in HCI.
Authors:Liu Wang, Tianshu Zhou, Haoyu Wang, Yi Wang
Abstract:
Mobile apps frequently request excessive data access, raising significant privacy concerns. While regulations like GDPR emphasize data minimization, they provide limited guidance on concretely defining and enforcing necessary data access. Existing regulatory mechanisms primarily rely on expert-driven audits that face challenges in scalability, neutrality, and alignment with user expectations. In this paper, we propose a novel paradigm--democratizing privacy assessment, inspired by prior work on user-centric privacy perceptions--which repositions users as active evaluators in the privacy auditing process, recognizing that user perceptions of data usage play a crucial role in assessing the appropriateness and necessity of data access. To operationalize this paradigm, we introduce DePRa, a prototype system developed through participatory design, featuring contextual explanation provision, category-based representative selection, an intuitive rating interface, and preference-based rating adjustment. We evaluated DePRa with 200 everyday mobile app users, analyzing how effectively it captures user opinions on sensitive data access, comparing their privacy ratings with expert assessments, and exploring risk preference-based score calibration. Our findings show the feasibility and promise of democratized privacy assessment, highlighting its potential to complement expert auditing and support inclusive privacy evaluation.
Authors:Tran Thanh Lam Nguyen, Edoardo Di Tullio, Barbara Carminati, Elena Ferrari
Abstract:
Mobile apps offer significant benefits, but their privacy protections often remain ineffective and confusing for users. While prior work mainly analyzes app privacy vulnerabilities, few approaches help users understand, set, and enforce their privacy preferences. This paper presents PrivacyAssist, a multi-agent LLM-based platform that detects inconsistencies between user-granted permissions and developers' declared sensitive data collection and sharing practices. Using Retrieval-Augmented Generation (RAG), PrivacyAssist provides concise explanations and real-time on-device warnings to support informed installation decisions. We evaluate PrivacyAssist with 200 users and 2,347 Android apps, finding that only 16% of apps are fully consistent between granted permissions and declared data practices.
Authors:Florian Holeczek, Andreas Hinterreiter, Alex Hernandez-Garcia, Marc Streit, Christina Humer
Abstract:
We present GFlowState, a visual analytics system designed to illuminate the training process of Generative Flow Networks (GFlowNets or GFNs). GFlowNets are a probabilistic framework for generating samples proportionally to a reward function. While GFlowNets have proved to be powerful tools in applications such as molecule and material discovery, their training dynamics remain difficult to interpret. Standard machine learning tools allow metric tracking but do not reveal how models explore the sample space, construct sample trajectories, or shift sampling probabilities during training. Our solution, GFlowState, allows users to analyze sampling trajectories, compare the sample space relative to reference datasets, and analyze the training dynamics. To this end, we introduce multiple views, including a chart of candidate rankings, a state projection, a node-link diagram of the trajectory network, and a transition heatmap. These visualizations enable GFlowNet developers and users to investigate sampling behavior and policy evolution, and to identify underexplored regions and sources of training failure. Case studies demonstrate how the system supports debugging and assessing the quality of GFlowNets across application domains. By making the structural dynamics of GFlowNets observable, our work enhances their interpretability and can accelerate GFlowNet development in practice.
Authors:Adam Cole, Mick Grierson
Abstract:
We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists' ability to build intuition for the model's material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI style probe of transformer attention mechanisms, and as a creative technique for producing novel aesthetics beyond the model's learned representational space.
Authors:Sora Kang, Youjin Hwang, Joonhwan Lee
Abstract:
Intergenerational linguistic differences pose challenges to effective and intimate family communication. This paper presents GenSync, a chat-based interface that supports intergenerational understanding through different forms of translation visibility. We conducted a controlled within-subjects study with 16 family dyads (32 participants), comparing three conditions: no translation, black-box translation, and transparent translation that displays both original and interpreted messages. The results show that translation visibility plays a critical role in shaping conversational experiences. Transparent translation supported conversational quality, intimacy, and usability, while black-box translation often disrupted conversational flow. These findings position intergenerational language support as a form of interpretive mediation and contribute design implications for AI-mediated communication in socially sensitive contexts.
Authors:Arka Majhi, Satish B. Agnihotri
Abstract:
POSHAN Abhiyan envisages capacity building of AWWs or frontline health workers through 21 training modules of ILA (Incremental Learning Approach), modularising the net learning content into smaller learning topics to help them perform their daily activities. It envisions building skilled AWWs, strengthening supervisory hierarchies, and improving coordination between AWWs (ICDS) services and health programs to achieve common goals such as increasing awareness, improving access to health and nutrition services, and reducing deaths and malnutrition. To better understand the contents of ILA literature, we conducted a content analysis by further breaking down the modules into content types such as facts, concepts, procedures, and principles. Then we framed learning objectives for teaching AWWs. We applied CDT (Component Display Theory by David Merrill) to map the contents with the desired learning objective, following the Specification of Objective chart. In this way, one can easily develop pedagogies from a new training literature. The challenges in framing learning objectives and pedagogies are: The AWWs do not have a (formal/scientific) nutrition and epidemiology background. Therefore, it is important to teach them through examples, familiar to them. AWWs are not evenly and structurally trained across districts. Training materials should be customized based on language, location, and prior knowledge. Delayed refresher courses render them underprepared for their jobs. To overcome these problems, we are developing an Android app based on gamified learning to provide refresher training to AWWs. Conducting content analysis, framing learning objectives, and developing pedagogical approaches will help conceptualize the gamified application.
Authors:Arka Majhi, Satish B. Agnihotri, Aparajita Mondal
Abstract:
Recent health surveys in India highlight the alarming child malnutrition levels and lower rates of complete child immunization in many parts of India. Previous researches report that the conventional training pedagogy of the CHWs (Community Healthcare Workers) or the ASHAs (Accredited Social Health Activists) in India is ineffective in enhancing their capacity. Considering that the CHWs are getting equipped with smartphones, it calls for a rethinking of their training pedagogy using the ICT approach. Two refresher training tools were developed to make learning the child immunization schedule more exciting and conceptually engaging for ASHAs. The physical and AR (Augmented Reality) versions of designed card games were compared for effectiveness and knowledge retention, pre, and post-intervention through questionnaire tests conducted immediately before and after playing multiple sessions. The AR-based play was found to be better in learning and knowledge retention with more engagement, mainly due to its interactive and intuitive nature of play.
Authors:Shri Harini Ramesh, Foroozan Daneshzand, Matteo Sotelo, Mahsa Sinaei, Fateme Rajabiyazdi
Abstract:
Older adults living with multiple chronic conditions (MCC) can considerably benefit from collecting and reflecting on their health data. Many older adults collect their health data using various approaches, such as digital tools or handwritten notebooks. However, in these approaches, the act of collecting data does not itself yield insights; sensemaking and reflection happen only if individuals later review their accumulated records. The daily process of data collection thus offers limited opportunity for individuals to actively engage with their data or find the process personally meaningful and enjoyable. Personal data input visualizations using physical tokens offer a promising solution that can help individuals recognize evolving patterns while collecting data and discover meaningful insights more serendipitously and engagingly. Yet, there is a limited understanding of whether and how older adults living with MCC might adopt physical input visualizations to collect data and reflect on their health, and how the tangible, expressive, and personalizable nature of this process supports their sensemaking and reflection. In this paper, we present the results of our interview and diary studies in which older adults living with MCC inputted health data using physical tokens over two weeks. Our findings highlight the diverse and unique needs of older adults for tracking personal health data, illustrating how they adapt strategies and personalize physical input visualizations to align with their individual needs. We demonstrate how older adults integrated input visualizations into daily routines and leveraged tangible markers to reflect on patterns and behaviors, while enjoying the process of tracking and focusing on personal expression and meaningful reflection. Finally, we provide design considerations for supporting older adults with MCC when inputting health data through physical tokens.
Authors:Sanchari Das, Dhiman Goswami, Michelle Melo, Aditya Johri, Vivian G. Motti
Abstract:
As digital systems increasingly rely on pervasive data collection and inference, educating future designers and researchers about Usable Privacy has become a critical need for HCI. However, privacy education in higher education is often fragmented, theory-heavy, or detached from real-world applications. Thus, in this paper, we present the design, implementation, and evaluation of a 15-week graduate-level course on Usable Privacy that addresses this through active, practice-oriented pedagogy. The course integrates use cases, structured role playing, case-based discussions, guest lectures, and a multi-phase research project to support students in reasoning about privacy from multiple stakeholder perspectives. Grounded in contemporary privacy research and the Modern Privacy framework, the curriculum emphasizes both conceptual understanding and applied research skills. We report findings from two course offerings in consecutive years (2024-2025) using a mixed-methods evaluation that combines quantitative teaching evaluations with qualitative analysis of student reflections and instructor observations. Results indicate increased student engagement, improved ability to articulate trade-offs in privacy design, and stronger connections between theory and practice. To support adoption and replication, we also release detailed assignment descriptions and grading rubrics. This work contributes an empirically informed model for teaching Usable Privacy in HCI education and offers actionable guidance for educators seeking to integrate privacy into their curricula.
Authors:Rahul K. Dass, Shubham Puri, Arpit Khandelwal, Xiao Jin, Ashok K. Goel
Abstract:
Scalable AI tutoring for procedural skill learning requires structured knowledge representations, yet constructing these representations remains a labor-intensive bottleneck. This paper introduces a new LLM-assisted text-to-model (TTM) methodology that transforms instructional materials into schema-complete Task-Method-Knowledge (TMK) models through ontology-constrained prompting and template-based generation, automating structural scaffolding while preserving expert oversight. Applied to a graduate-level online AI course, the methodology produced 23 TMK models - enabling full-course coverage for Ivy, a deployed AI coach that relies on TMK models to support learners' procedural understanding, for the first time. AI-assisted authoring reduced expert modeling time by 50-70% while producing structurally valid and highly reproducible models. We evaluate structural validity, semantic alignment, reproducibility, and refinement effort to characterize authoring scalability. Results indicate that the TTM methodology substantially lowers the cost of constructing structured procedural representations, making course-wide deployment of structured AI tutoring systems practically feasible.
Authors:Shivendra Agrawal, Bradley Hayes
Abstract:
Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system's capacity for universal design.
Authors:Brennan Schaffner, Luis Heysen, Marshini Chetty
Abstract:
Deceptive/Manipulative Patterns (DMP) are interface designs, also known as ``dark patterns,'' that manipulate user behavior. While considerable attention has been paid to their ethical and legal implications, empirical evidence about their real-world effects remains diffuse. This review synthesizes up-to-date experimental studies, focusing on works that quantify how (or whether) DMPs influence users. We also aggregate findings on interventions aimed at reducing DMP effects. Our synthesis highlights the experimental agreement that DMPs do significantly alter user behavior (with large variance in effect size) and that external interventions have been mostly unsuccessful in mitigating their effects. Lastly, we show that significant correlations between DMP effects and personal characteristics (e.g., age or political affiliation) are uncommon, indicating DMPs similarly affected nearly all populations tested. By summarizing the experimental evidence, we clarify the effects of DMPs, highlight gaps and tensions in the existing experimental literature, and help inform ongoing research and policy directions.
Authors:Akila Kadambi, Ylenia D'Elia, Tanishka Shah, Iulia Comsa, Alison Lentz, Katie Siri-Ngammuang, Tara Buechler, Jonas Kaplan, Antonio Damasio, Srini Narayanan, Lisa Aziz-Zadeh
Abstract:
With large language models (LLMs) becoming increasingly prevalent in daily life, so too has the tendency to attribute to them human-like minds and emotions, or anthropomorphize them. Here, we investigate dimensions people use to anthropomorphize and attribute trust toward LLMs across more than 2,000 human-LLM interactions. Participants (N=115) engaged with LLM chatbots systematically varied in warmth (friendliness), competence (capability, coherence), and empathy (cognitive and affective). Warmth and cognitive empathy significantly predicted perceptions on all outcomes (perceived anthropomorphism, trust, similarity, relational closeness, frustration, usefulness), while competence predicted all outcomes except for anthropomorphism. Affective empathy primarily predicted perceived relational measures, but did not predict the epistemic outcomes. Topic sub-analyses showed that more subjective, personally relevant topics (e.g., relationship advice) amplified these effects, producing greater human-likeness and relational connection with the LLM than did objective topics. Together, these findings reveal that warmth, competence, and empathy are key dimensions through which people attribute relational and epistemic perceptions to artificial agents.
Authors:Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee
Abstract:
Large language models have advanced rapidly, from pattern recognition to emerging forms of reasoning, yet they remain confined to linguistic simulation rather than grounded understanding. They can produce fluent outputs that resemble reflection, but lack temporal continuity, causal feedback, and anchoring in real-world interaction. This paper proposes a complementary approach in which reasoning is treated as a relational process distributed between human and model rather than an internal capability of either. Building on recent work on "System-2" learning, we relocate reflective reasoning to the interaction layer. Instead of engineering reasoning solely within models, we frame it as a cognitive protocol that can be structured, measured, and governed using existing systems. This perspective emphasizes collaborative intelligence, combining human judgment and contextual understanding with machine speed, memory, and associative capacity. We introduce "The Architect's Pen" as a practical method. Like an architect who thinks through drawing, the human uses the model as an external medium for structured reflection. By embedding phases of articulation, critique, and revision into human-AI interaction, the dialogue itself becomes a reasoning loop: human abstraction -> model articulation -> human reflection. This reframes the question from whether the model can think to whether the human-AI system can reason. The framework enables auditable reasoning traces and supports alignment with emerging governance standards, including the EU AI Act and ISO/IEC 42001. It provides a practical path toward more transparent, controllable, and accountable AI use without requiring new model architectures.
Authors:Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee
Abstract:
Large language models are increasingly integrated into decision-making in areas such as healthcare, law, finance, engineering, and government. Yet they share a critical limitation: they produce fluent outputs even when their internal reasoning has drifted. A confident answer can conceal uncertainty, speculation, or inconsistency, and small changes in phrasing can lead to different conclusions. This makes LLMs useful assistants but unreliable partners in high-stakes contexts. Humans exhibit a similar weakness, often mistaking fluency for reliability. When a model responds smoothly, users tend to trust it, even when both model and user are drifting together. This paper is the first in a five-paper research series on stabilising human-AI reasoning. The series proposes a two-layer approach: Parts II-IV introduce human-side mechanisms such as uncertainty cues, conflict surfacing, and auditable reasoning traces, while Part V develops a model-side Epistemic Control Loop (ECL) that detects instability and modulates generation accordingly. Together, these layers form a missing operational substrate for governance by increasing signal-to-noise at the point of use. Stabilising interaction makes uncertainty and drift visible before enforcement is applied, enabling more precise capability governance. This aligns with emerging compliance expectations, including the EU AI Act and ISO/IEC 42001, by making reasoning processes traceable under real conditions of use. The central claim is that fluency is not reliability. Without structures that stabilise both human and model reasoning, AI cannot be trusted or governed where it matters most.
Authors:Vladimir Molchanov, Hennes Rave, Lars Linsen
Abstract:
Cartograms are a technique for visually representing geographically distributed statistical data, where values of a numerical attribute are mapped to the size of geographic regions. Contiguous cartograms preserve the adjacencies of the original regions during the mapping. To be useful, contiguous cartograms also require approximate preservation of shapes and relative positions. Due to these desirable properties, contiguous cartograms are among the most popular ones. Most methods for constructing contiguous cartograms exploit a deformation of the original map. Aiming at the preservation of geographical properties, existing approaches are often algorithmically cumbersome and computationally intensive. We propose a novel deformation technique for computing time-varying contiguous cartograms based on integral images evaluated for a series of discrete density distributions. The density textures represent the given dynamic statistical data. The iterative application of the proposed mapping smoothly transforms the domain to gradually equalize the temporal density, i.e., region areas grow or shrink following their evolutionary statistical data. Global shape preservation at each time step is controlled by a single parameter that can be interactively adjusted by the user. Our efficient GPU implementation of the proposed algorithm is significantly faster than existing state-of-the-art methods while achieving comparable quality for cartographic accuracy, shape preservation, and topological error. We investigate strategies for transitioning between adjacent time steps and discuss the parameter choice. Our approach applies to comparative cartograms' morphing and interactive cartogram exploration.
Authors:Annalisa Degenhard, Sophia Ppali, Fotis Liarokapis, Enrico Rukzio, Jennifer Spohrs, Stefan Tschoeke
Abstract:
Virtual reality exposure therapy (VRET) enables controlled exposure to trauma-related stimuli to facilitate memory access and emotional processing. However, the field remains underexplored for complex post-traumatic stress disorder (C-PTSD). Unlike single-trauma PTSD, C-PTSD requires highly individualized triggers that are difficult to identify and implement safely. We conducted a feasibility study with 11 patients, two trauma therapists, and a VR developer to explore integrating VRET into C-PTSD treatment while safeguarding all stakeholders. Initial findings indicate that simple objects can be just as effective as complex scenes, therapeutic success does not correlate with VR presence levels, and the design process itself became integral to therapy rather than preparatory. However, involving developers in therapy sessions led to considerable emotional stress and role confusion, which required a cautious approach. Based on these insights, we provide methodological recommendations for safe and patient-centered VRET studies that balance therapeutic effectiveness with stakeholder safety across the research process.
Authors:Tongfei Bian, Mathieu Chollet, Tanaya Guha
Abstract:
For a robot to be called socially intelligent, it must be able to infer users internal states from their current behaviour, predict the users future behaviour, and if required, respond appropriately. In this work, we investigate how robots can be endowed with such social intelligence by modelling the dynamic relationship between user's internal states (latent) and actions (observable state). Our premise is that these states arise from the same underlying socio-cognitive process and influence each other dynamically. Drawing inspiration from theories in Cognitive Science, we propose a novel multi-task learning framework, termed as \textbf{SocialLDG} that explicitly models the dynamic relationship among the states represent as six distinct tasks. Our framework uses a language model to introduce lexical priors for each task and employs dynamic graph learning to model task affinity evolving with time. SocialLDG has three advantages: First, it achieves state-of-the-art performance on two challenging human-robot social interaction datasets available publicly. Second, it supports strong task scalability by learning new tasks seamlessly without catastrophic forgetting. Finally, benefiting from explicit modelling task affinity, it offers insights on how different interactions unfolds in time and how the internal states and observable actions influence each other in human decision making.
Authors:Ashwin Ram, Aeneas Leon Sommer, Martin Schmitz, Jürgen Steimle
Abstract:
Opportunistic photo capture (e.g., slides, exhibits, or artifacts) is a common strategy for preserving information encountered in information-rich environments for later revisitation. While fast and minimally disruptive, such photo collections rarely become meaningful notes. Existing automatic note-generation approaches provide some support but often produce generic summaries that fail to reflect what users intended to capture. We introduce Intent Lenses, a conceptual primitive for intent-mediated note generation and sensemaking. Intent Lenses reify users' capture-time intent inferred from captured information into reusable interactive objects that encode the function to perform, the information sources to focus on, and how results are represented at an appropriate level of detail. These lenses are dynamically generated using the reasoning capabilities of large language models. To investigate this concept, we instantiate Intent Lenses in the context of academic conference photos and present an interactive system that infers lenses from presentation captures to generate structured visual notes on a spatial canvas. Users can further add, link, and arrange lenses across captures to support exploration and sensemaking. A study with nine academics showed that intent-mediated notes aligned with users' expectations, providing effective overviews of their captures while facilitating deeper sensemaking.
Authors:Feiyang Ren, Zhaoxi Zhang, Tamir Mendel, Takahiro Yabe
Abstract:
Bicycle safety is important for bikeability and transportation efficiency. However, conventional surveys often fall short in capturing how people actually perceive cycling environments because they rely heavily on respondents' recall rather than in-the-moment experience. By leveraging large language models (LLMs), this study proposes a new method of combining video-based surveys with a conversational AI chatbot to collect human perceptions of cycling safety and the reasons behind these perceptions. The paper developed the AI chatbot using a modular LLM architecture, integrating prompt engineering, state management, and rule-based control to support the structure of human-AI interaction. This paper evaluates the feasibility of the proposed video-based conversational chatbot using complete responses from sixteen participants to the pilot survey across nine street segments in New York City. The method feasibility was assessed using a seven-point scale rating for user experience (i.e., ease of use, supportiveness, efficiency) and a five-point scale for chatbot usability (i.e., personality, roboticness, friendliness), yielding positive results with mean scores of 5.00 out of 7 (standard deviation = 1.6) and 3.47 out of 5 (standard deviation = 0.43), respectively. The data feasibility was assessed using multiple techniques: (1) Natural language processing (NLP), such as KeyBERT, for overall safety and feature analysis to extract built-environment attributes; (2) K-means clustering for semantic analysis to identify reasons and suggestions; and (3) regression to estimate the effects of built-environment and demographic variables on perceived safety outcomes. The results show the potential of AI chatbots as a novel approach to collecting data on human perception, behavior, and future visions for transport planning.
Authors:Arka Majhi, Aparajita Mondal, Satish B. Agnihotri
Abstract:
In India, Community Healthcare Workers (CHWs) serve as critical intermediaries between the state and beneficiaries, including pregnant mothers and children. Effective planning and prioritization of care and services necessitate the collection of accurate health data from the community. Crowdsourcing child anthropometric data through CHWs could establish a valuable repository for evidence-based decision-making and service planning. However, existing platforms often fail to maintain CHWs' engagement over time and across different spatial contexts, resulting in spatially misrepresented and outdated data. This study addresses these challenges by conducting a co-design exercise to develop innovative methods for collecting anthropometric data over time and space. The exercise involved analyzing data to create hotspot and density distribution maps. We implemented a trial of the developed game with two groups (n=94 per group) from various states across India, comparing the game-based and non-game-based data collection methods. Our findings reveal that the game-based approach significantly improved measuring efficiency (p<0.05) and demonstrated superior engagement and retention compared to the non-game-based method. This research contributes to the expanding literature on co-design and Research through Design (RtD) methodologies for developing geospatial games, highlighting their potential to enhance data collection practices and improve engagement among CHWs.
Authors:Keiichi Ihara, Tianle Li, Yasuhisa Shiino, Ryo Suzuki
Abstract:
We present MemoryDiorama, a prototype system that introduces augmented memory cues, a concept that extends captured personal media with AI-generated contextual information to enhance autobiographical memory recall. MemoryDiorama transforms everyday photos into dynamic 3D dioramas in mixed reality by integrating LLM-based scene analysis with 3D object generation, animation, and spatial composition. The system extracts geographic information, object attributes, lighting conditions, and atmospheric elements from the photos. It then animates these elements with generative components such as object animations, human motion, geographical effects, and particle effects to provide richer cues for memory recall. We evaluated MemoryDiorama in a within-subject user study with 18 participants, comparing three conditions: Photo-Only, Static Diorama, and MemoryDiorama. Compared with both Photo-Only and Static Diorama, MemoryDiorama elicited more internal and in-cue details during recall. It also increased perceptual details and visual vividness ratings, suggesting richer recollective experience.
Authors:Peijie Yu, Wei Liu, Yifan Yang, Jinjian Li, Zelong Zhang, Xiao Feng, Feng Zhang
Abstract:
Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behaviour: compositional tasks that demand efficient orchestration of tool-call topologies, implicit intent spread across dialogue turns that require contextual inference, and instruction transition, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15%, indicating a substantial gap in the robustness of LLMs' agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks, but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among LLMs, users, and tools.
Authors:Ran Jin, Liu Wang, Shidong Pan, Luona Xu, Tianming Liu, Haoyu Wang
Abstract:
GenAI smartphones, which natively embed generative AI at the system level, are transforming mobile interactions by automating a wide range of tasks and executing UI actions on behalf of users. Their superior capabilities rely on continuous access to sensitive and context-rich data, raising privacy concerns that surpass those of traditional mobile devices. Yet, little is known about how users perceive the privacy implications of such devices or what safeguards they expect, which is especially critical at this early stage of GenAI smartphone adoption. To address this gap, we conduct 22 semi-structured interviews with everyday mobile users to explore their usage of GenAI smartphones, privacy concerns, and privacy design expectations. Our findings show that users engage with GenAI smartphones with limited understanding of how these systems operate to deliver functions, but show heightened privacy concerns once exposed to the technical details. Participants' concerns span the entire data lifecycle, including nontransparent collection, insecure storage, and weak data control. In a follow-up focus group, participants discuss a range of privacy-enhancing suggestions that call for coordinated changes across system-level controls, data management practices, and user-facing transparency. Their concerns and suggestions offer user-centered guidances for designing GenAI smartphones that balance functionality with privacy protection, offering valuable takeaways for system designers and regulators.
Authors:Abhishek Dharmaratnakar, Srivaths Ranganathan, Debanshu Das, Anushree Sinha
Abstract:
The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI's Sora and Google's Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.
Authors:Arka Majhi, Aparajita Mondal, Satish B. Agnihotri
Abstract:
Community Health Workers (CHWs) play a critical role in delivering primary healthcare services in low-resource settings, yet sustaining their training and performance remains a persistent challenge. Prior research has explored digital and game-based approaches for CHW training. However, limited work has synthesized longitudinal design insights into generalizable guidelines for interactive health interventions. Building on a four-year design-based research program involving multiple game-based refresher training systems, including quiz-based mobile apps, physical and augmented reality games, card-based games, and location-based games, we examine which design guidelines support sustained engagement, learning transfer, and contextual appropriateness in CHW training. We conducted a mixed-methods analysis across deployments with Accredited Social Health Activists and Anganwadi Workers in India, including interviews, field observations, and usage logs. Through thematic synthesis, we derive eight design guidelines addressing contextual realism, adaptive learning, hybrid interaction, social motivation, explainability, professional identity, and ethical considerations. Our findings contribute actionable design knowledge for researchers and practitioners developing interactive health interventions in low-resource healthcare contexts.
Authors:Arka Majhi, Aparajita Mondal, Satish B. Agnihotri
Abstract:
Digital health technologies are increasingly used to improve healthcare access and delivery worldwide. However, many healthcare applications are designed for environments with stable infrastructure, high digital literacy, and strong institutional support. These assumptions often do not hold in low-resource contexts where healthcare delivery often depends on community health workers, caregivers, and informal care networks. Designing effective healthcare applications for such environments requires attention to infrastructural constraints, cultural contexts, language diversity, and usability challenges. This Birds of a Feather session aims to bring together researchers, designers, and practitioners interested in healthcare application design in low-resource contexts. The session will provide an informal forum for discussing challenges encountered in the design and deployment of digital health technologies in underserved settings, sharing field experiences, and identifying opportunities for collaboration within the Interactive Health (IH) community.
Authors:Vikram Kamath Cannanure, Bruno Yinkfu, Douglas Bryan, Mati Amin, Ingmar Weber
Abstract:
AI in education is commonly delivered through web-based systems such as online forms and institutional platforms. However, these approaches can exclude teachers in low-resource contexts, where everyday mobile platforms like WhatsApp serve as primary digital infrastructure. To address this gap, we present a field pilot in Cameroon that deploys a WhatsApp-based chatbot with LLM-supported content for teacher professional development (TPD), compared with an online form baseline. The system was evaluated through a mixed-methods study with 47 primary school teachers, integrating quantitative measures with qualitative insights from interviews and participant feedback. Results show that the chatbot was rated higher in perceived usability and overall experience, while learnability remained comparable. These improvements were driven by platform familiarity, low interaction overhead, and the modular structure of LLM-supported content, but were constrained by connectivity limitations, prepaid data costs, and multilingual needs (English/French). Building on these findings, we outline design directions for multilingual, culturally grounded interaction and for supporting prompting and reflection in AI use. More broadly, this work points to Thoughtful AI that supports reflection, relevance, and sustained professional growth.
Authors:Michael Caosun, Sinan Aral
Abstract:
Experimental evidence confirms that AI tools raise worker productivity, but also that sustained use can erode the expertise on which those gains depend. We develop a dynamic model in which a decision-maker chooses AI usage intensity for a worker over time, trading immediate productivity against the erosion of worker skill. We decompose the tool's productivity effect into two channels, one independent of worker expertise and one that scales with it. The model produces three main results. First, even a decision-maker who fully anticipates skill erosion rationally adopts AI when front-loaded productivity gains outweigh long-run skill costs, producing steady-state loss: the worker ends up less productive than before adoption. Second, when managers are short-termist or worker skill has external value, the decision-maker's optimal policy turns steady-state loss into the augmentation trap, leaving the worker worse off than if AI had never been adopted. Third, when AI productivity depends less on worker expertise, workers can permanently diverge in skill: experienced workers realize their full potential while less experienced workers deskill to zero. Small differences in managerial incentives can determine which path a worker takes. The productivity decomposition classifies deployments into five regimes that separate beneficial adoption from harmful adoption and identifies which deployments are vulnerable to the trap.
Authors:Harsh Kumar, Zi Kang, Mu, Jonathan Vincentius, Ashton Anderson
Abstract:
Most AI-based educational tools today adopt a one-on-one tutoring paradigm, pairing a single LLM with a single learner. Yet decades of learning science research suggest that multi-party interaction -- through peer modeling, co-construction, and exposure to diverse perspectives -- can produce learning benefits that dyadic tutoring alone cannot. In this paper, we investigate whether multi-agent LLM configurations can enhance learning outcomes beyond what a single LLM tutor provides. We present two controlled experiments spanning distinct learning contexts. In a convergent problem-solving study ($N=315$), participants tackle SAT-level math problems in a 2$\times$2 design that varies the presence of an LLM tutor and LLM peers, each making different kinds of errors (conceptual vs.\ arithmetic); participants who interacted with both a tutor and peers achieved the highest unassisted test accuracy. In a divergent composition study ($N=247$), participants write argumentative and creative essays with either no AI assistance, a single LLM (Claude or ChatGPT), or both Claude and ChatGPT together; while both LLM conditions improved essay quality, only the two-agent condition avoided the idea-level homogeneity that single-model assistance was found to produce. Together, these studies offer one of the first controlled investigations of multi-agent LLM learning environments, probing whether the move from one-on-one AI tutoring toward richer agent configurations can unlock the collaborative and observational benefits long documented in human social learning research.
Authors:Jaemarie Solyst, Ruth Karen Nakigozi, Chloe Fong, R. Benjamin Shapiro
Abstract:
There is an increasing need for young people to become critically AI literate, understanding not only how AI works but also its limitations and ethical nuances. Yet, designing learning experiences that make such complex, serious topics engaging remains a challenge. This paper explores transformational games as a promising approach for supporting youth learning about generative AI (GenAI) and ethics. We designed and implemented two games, Diversity Duel and Secret Agent, that integrate GenAI tools with gameplay elements. This work investigates how the games' elements: (1) peer evaluation, (2) constraint-based creativity, and (3) social deduction supported socio-ethical reasoning about GenAI. Participants recognized and debated bias in GenAI outputs, connected these patterns to real-world inequities, and developed nuanced understandings of bias. Participants further came to see how prompt design shapes AI behavior. Our findings suggest that group-based games with these elements can support fostering critical AI literacy.
Authors:Elaheh Sanoubari, Neil Fernandes, Keith Rebello, Alicia Pan, Andrew Houston, Kerstin Dautenhahn
Abstract:
This paper presents REMind, an innovative educational robot-mediated role-play game designed to support anti-bullying bystander intervention among children. REMind invites players to observe a bullying scenario enacted by social robots, reflect on the perspectives of the characters, and rehearse defending strategies by puppeteering a robotic avatar. We evaluated REMind through a mixed-methods play-testing study with 18 children aged 9--10. The findings suggest that the experience supported key learning goals related to self-efficacy, perspective-taking, understanding outcomes of defending, and intervention strategies. These results highlight the promise of Robot-Mediated Applied Drama (RMAD) as a novel pedagogical framework to support Social-Emotional Learning.
Authors:Alejandro Ciuba, Zheng YY Li, Aakash Gautam
Abstract:
For immigrants, language preservation is crucial to maintain their identity, but the process of immigration can put a strain on a community's ability to do so. We interviewed eight Nepali immigrants to understand barriers to language preservation across sociopolitical contexts in Nepal and immigrant life in the United States. Participants described strong motivation but limited institutional support, time and resource constraints, and English-dominant environments that widen parent-child language gaps. They envisioned technology that supports interactive, family centered learning. In response, we are developing an audio-first, point-and-click language learning game based on the theory of comprehensible input, designed for parent-child co-playing. An early evaluation with four design experts reveals promising gameplay, and the need to simplify symbol-heavy UI. We conclude with implications for designing language technologies that support preservation through relations while acknowledging the limits of design.
Authors:Kruthika Gangaraju, Shu-Fen Wung, Kevin Berner, Jing Wang, Fengpei Yuan
Abstract:
Effective dementia caregiving requires training and adaptive communication, but assistive AI and robotics are constrained by a lack of context-rich, privacy-sensitive data on how people living with Alzheimer's disease and related dementias (ADRD) behave during activities of daily living (ADLs). We introduce a web-based simulator that uses a large language model (gpt-5-mini) to generate multi-turn, severity- and care-setting-conditioned patient behaviors during ADL assistance, pairing utterances with lightweight behavioral cues (in parentheses). Users set dementia severity, care setting (and time in setting), and ADL; after each patient turn they rate realism (1-5) with optional critique, then respond as the caregiver via free text or by selecting/editing one of four strategy-scaffolded suggestions (Recognition, Negotiation, Facilitation, Validation). We ran an online formative expert-in-the-loop study (14 dementia-care experts, 18 sessions, 112 rated turns). Simulated behavior was judged moderately to highly plausible, with a typical session length of six turns. Experts wrote custom replies for 54.5 percent of turns; Recognition and Facilitation were the most-used suggested strategies. Thematic analysis of critiques produced a six-category failure-mode taxonomy, revealing recurring breakdowns in ADL grounding and care-setting consistency and guiding prompt/workflow refinements. The simulator and logged interactions enable an evidence-driven refinement loop toward validated patient-caregiver co-simulation and support data collection, caregiver training, and assistive AI and robot policy development.
Authors:George Boateng, Samuel Boateng, Victor Kumbol
Abstract:
Providing timely and accurate learning support in large-scale online coding courses is challenging, particularly in resource-constrained contexts. We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation. We deployed the system in a 15-month longitudinal study spanning 15 cohorts with 3,717 enrollments across 35 African countries. Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors, particularly for administrative queries. Our findings demonstrate that human-in-the-loop generative AI systems can combine the scalability and speed of AI with the reliability of human support, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.
Authors:Sina Elahimanesh, Mohammadali Mohammadkhani, Shohreh Kasaei
Abstract:
In contemporary society, social media is deeply integrated into daily life, yet emotional expression often differs between real and online contexts. We studied the Persian community on X to explore this gap, designing a human-centered pipeline to measure alignment between real-world and social media emotions. Recent tweets and images of participants were collected and analyzed using Transformers-based text and image sentiment modules. Friends of participants provided insights into their real-world emotions, which were compared with online expressions using a distance criterion. The study involved N=105 participants, 393 friends, over 8,300 tweets, and 2,000 media images. Results showed only 28% similarity between images and real-world emotions, while tweets aligned about 76% with participants' real-life feelings. Statistical analyses confirmed significant disparities in sentiment proportions across images, tweets, and friends' perceptions, highlighting differences in emotional expression between online and offline environments and demonstrating practical utility of the proposed pipeline for understanding digital self-presentation.
Authors:Zhiyu Lin, Boyd Fox, Devon Mckee, Sai Siddartha Maram, Jiahong Li, Tyler Sorensen, Brian K. Smith, Roger Azevedo, Jichen Zhu, Magy Seif El-Nasr
Abstract:
Game-Based Learning (GBL) is a learner-engaging pedagogical methodology, yet adapting games to heterogeneous learners requires transparent, real-time Open Player Models (OPMs). We contribute to the community Open Player Socially Analytical Intelligence (OPSAI), an architecture implementing OPM beyond conceptual frameworks and validated in a GBL application. It decouples gameplay telemetry and analysis from the game engine and automatically derives pedagogically actionable insights, supporting the transparency of computational player models while making them accessible to players. OPSAI comprises three logical layers: a Frontend that both provides the GBL experience and collects information needed for analytics; a stateless Backend that hosts transparent analytics services producing reflective prompts, recommendations, and visualization guides; and a two-tier Log Storage that balances heavy raw gameplay data with lightweight reference indices for low-latency queries. By feeding analytics outputs back into the game interface, OPSAI closes the feedback loop between play and learning, empowering teachers, researchers, and learners alike. We further showcase OPSAI with a full deployment on the Parallel GBL environment, featuring live play traces, peer comparisons, and personalized suggestions, demonstrating a reusable blueprint for future educational games.
Authors:Neil Fernandes, Tehniyat Shahbaz, Emily Davies-Robinson, Yue Hu, Kerstin Dautenhahn
Abstract:
Newcomer children face barriers in acquiring the host country's language and literacy programs are often constrained by limited staffing, mixed-proficiency cohorts, and short contact time. While Socially Assistive Robots (SARs) show promise in education, their use in these socio-emotionally sensitive settings remains underexplored. This research presents a co-design study with program tutors and coordinators, to explore the design space for a social robot, Maple. We contribute (1) a domain summary outlining four recurring challenges, (2) a discussion on cultural orientation and community belonging with robots, (3) an expert-grounded discussion of the perceived role of an SAR in cultural and language learning, and (4) preliminary design guidelines for integrating an SAR into a classroom. These expert-grounded insights lay the foundation for iterative design and evaluation with newcomer children and their families.
Authors:Lan Xiao, Catherine Holloway
Abstract:
AI accessibility tools have mostly been designed for individual use, helping one person overcome a specific functional barrier. But for many people with disabilities, complex tasks are accomplished through collaboration with others who bring complementary abilities, not solitary effort. We propose a three-layer framework, Channelling, Coordinating, and Co-Creating, that rethinks AI's role in ability-diverse collaboration: establishing shared informational ground across abilities, mediating workflows between collaborators with different abilities, and contributing as a bounded partner toward shared goals. Grounded in the Ability-Diverse Collaboration framework, grounding theory, and Carlile's 3T framework, it extends the ``agents as remote collaborators'' vision by centring the collaborative, interdependent ways people with disabilities already work.
Authors:Nathaniel Gorski, Shusen Liu, Bei Wang
Abstract:
Recent agentic systems demonstrate that large language models can generate scientific visualizations from natural language. However, reliability remains a major limitation: systems may execute invalid operations, introduce subtle but consequential errors, or fail to request missing information when inputs are underspecified. These issues are amplified in real-world workflows, which often exceed the complexity of standard benchmarks. Ensuring reliability in autonomous visualization pipelines therefore remains an open challenge. We present TopoPilot, a reliable and extensible agentic framework for automating complex scientific visualization workflows. TopoPilot incorporates systematic guardrails and verification mechanisms to ensure reliable operation. While we focus on topological data analysis and visualization as a primary use case, the framework is designed to generalize across visualization domains. TopoPilot adopts a reliability-centered two-agent architecture. An orchestrator agent translates user prompts into workflows composed of atomic backend actions, while a verifier agent evaluates these workflows prior to execution, enforcing structural validity and semantic consistency. This separation of interpretation and verification reduces code-generation errors and enforces correctness guarantees. A modular architecture further improves robustness by isolating components and enabling seamless integration of new descriptors and domain-specific workflows without modifying the core system. To systematically address reliability, we introduce a taxonomy of failure modes and implement targeted safeguards for each class. In evaluations simulating 1,000 multi-turn conversations across 100 prompts, including adversarial and infeasible requests, TopoPilot achieves a success rate exceeding 99%, compared to under 50% for baselines without comprehensive guardrails and checks.
Authors:Elaheh Sanoubari, Alicia Pan, Keith Rebello, Neil Fernandes, Andrew Houston, Kerstin Dautenhahn
Abstract:
Social robots are increasingly used in education, but most applications cast them as tutors offering explanation-based instruction. We explore an alternative: Robot-Mediated Applied Drama (RMAD), in which robots function as life-like puppets in interactive dramatic experiences designed to support reflection and social-emotional learning. This paper presents REMind, an anti-bullying robot role-play game that helps children rehearse bystander intervention and peer support. We focus on a central design challenge in RMAD: how to make robot drama emotionally and aesthetically engaging despite the limited expressive capacities of current robotic platforms. Through the development of REMind, we show how performing arts expertise informed this process, and argue that the aesthetics of robot drama arise from the coordinated design of the wider experience, not from robot expressivity alone.
Authors:Meng-Chen Lee, Costas Panay, Javier Hernandez, Sean Andrist, Dan Bohus, Anatoly Churikov, Andrew D. Wilson
Abstract:
The majority of voice-based conversational agents still rely on pause-and-respond turn-taking, leaving interactions sounding stiff and robotic. We present RESPOND (Responsive Engagement Strategy for Predictive Orchestration and Dialogue), a framework that brings two staples of human conversation to agents: timely backchannels ("mm-hmm," "right") and proactive turn claims that can contribute relevant content before the speaker yields the conversational floor. Built on streaming ASR (Automatic Speech Recognition) and incremental semantics, RESPOND continuously predicts both when and how to interject, enabling fluid, listener-aware dialogue. A defining feature is its designer-facing controllability: two orthogonal dials, Backchannel Intensity (frequency of acknowledgments) and Turn Claim Aggressiveness (depth and assertiveness of early contributions), can be tuned to match the etiquette of contexts ranging from rapid ideation to reflective counseling. By coupling predictive orchestration with explicit control, RESPOND offers a practical path toward conversational agents that adapt their conversational footprint to social expectations, advancing the design of more natural and engaging voice interfaces.
Authors:Jiayi Hong, Yixuan Wang, Petra Isenberg, Ross Maciejewski
Abstract:
We present a review and analysis of scientific paper embellishments -- simple visual elements that are deeply integrated into the text of scientific publications. These embellishments are increasingly used in research papers, which have the potential to enhance textual descriptions, strengthen connections between figures and content, and improve internal textual coherence, while also carrying the risk of disrupting the reading experience. As their exact impact is not yet well understood, we conducted a systematic review of all visualization papers published between 2019 and 2024 in IEEE VIS, ACM CHI, and EuroVis. From this corpus, we identified 374 papers that used paper embellishments and distilled three key dimensions that characterize their usage: purposes (WHY), design choices (HOW), and locations (WHERE) of paper embellishments. Our findings provide a structured perspective on the form of current embellishments in scientific writing in the visualization domain and provide insights into their role in shaping scientific communication.
Authors:Xiaru Meng, Yulan Ju, Yan He, Matthias Hoppe, Kouta Minamizawa, Jiawen Han, Kai Kunze
Abstract:
Live cultural experiences like concerts generate shared physiological arousal among audience members, a collective resonance that contributes to their emotional power. Recreating such experiences in virtual reality therefore requires not just audiovisual fidelity, but reproduction of this physiological dimension. Yet current VR evaluation methods rely on post-hoc self-reports that interrupt immersion and cannot capture moment-to-moment arousal dynamics. We propose cross-temporal physiological synchrony as an unobtrusive methodology for evaluating VR cultural recreations: measuring how closely a VR participant's arousal patterns align with those of the original live audience. In a two-phase study, we recorded electrodermal activity from 40 live concert attendees, then created three VR recreations with varying abstraction levels (realistic 360-degree video, mixed video-plus-visualization, and fully abstract physiological representations) and measured synchrony with 22 laboratory participants using Dynamic Time Warping. Contrary to assumptions favoring realism, abstract visualizations achieved the strongest synchrony with live audiences. During musical climaxes, the abstract condition maintained correlation while realistic video showed none. These findings suggest that abstract physiological representations may be more effective than realistic footage for evoking authentic collective engagement in VR cultural recreations.
Authors:Dimitri Kanevsky, Julian Salazar, Matt Harvey
Abstract:
Let $V$ be a smooth cubic surface over a $p$-adic field $k$ with good reduction. Swinnerton-Dyer (1981) proved that $R$-equivalence is trivial on $V(k)$ except perhaps if $V$ is one of three special types--those whose $R$-equivalence he could not bound by proving the universal (admissible) equivalence is trivial. We consider all surfaces $V$ currently known to have non-trivial universal equivalence. Beyond being intractable to Swinnerton-Dyer's approach, we observe that if these surfaces also had non-trivial $R$-equivalence, they would contradict Colliot-Thélène and Sansuc's conjecture regarding the $k$-rationality of universal torsors for geometrically rational surfaces. By devising new methods to study $R$-equivalence, we prove that for 2-adic surfaces with all-Eckardt reductions (the third special type, which contains every existing case of non-trivial universal equivalence), $R$-equivalence is trivial or of exponent 2. For the explicit cases, we confirm triviality: the diagonal cubic $X^3+Y^3+Z^3+ζ_3 T^3=0$ over $\mathbb{Q}_2(ζ_3)$--answering a long-standing question of Manin's (Cubic Forms, 1972)--and the cubic with universal equivalence of exponent 2 (Kanevsky, 1982). This is the first in a series of works derived from a year of interactions with generative AI models such as AlphaEvolve and Gemini 3 Deep Think, with the latter proving many of our lemmas. We disclose the timeline and nature of their use towards this paper, and describe our broader AI-assisted research program in a companion report (in preparation).
Authors:Ziyi Wang, Qizan Guo, Rishitosh Singh, Xiyang Hu
Abstract:
Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision--language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception--understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
Authors:The Anh Han, Joel Z. Leibo, Tom Lenaerts, Iyad Rahwan, Fernando Santos, Matjaž Perc, Valerio Capraro
Abstract:
Artificial intelligence (AI) systems are rapidly becoming more capable, autonomous, and deeply embedded in social life. As humans increasingly interact, cooperate, and compete with AI, we move from purely human societies to hybrid human-AI societies whose collective dynamics cannot be captured by existing behavioural models alone. Drawing on evolutionary game theory, cultural evolution, and Large Language Models (LLMs) powered simulations, we argue that these developments open a new research agenda for social physics centred on the co-evolution of humans and machines. We outline six key research directions. First, modelling the evolutionary dynamics of social behaviours (e.g. cooperation, fairness, trust) in hybrid human-AI populations. Second, understanding machine culture: how AI systems generate, mediate, and select cultural traits. Third, analysing the co-evolution of language and behaviour when LLMs frame and participate in decisions. Fourth, studying the evolution of AI delegation: how responsibilities and control are negotiated between humans and machines. Fifth, formalising and comparing the distinct epistemic pipelines that generate human and AI behaviour. Sixth, modelling the co-evolution of AI development and regulation in a strategic ecosystem of firms, users, and institutions. Together, these directions define a programme for using social physics to anticipate and steer the societal impact of advanced AI.
Authors:Paulo Vitor Santana Silva, Arthur Ricardo Sousa Vitória, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
Abstract:
Within the expansive domain of virtual reality (VR), 360° VR videos immerse viewers in a spherical environment, allowing them to explore and interact with the virtual world from all angles. While this video representation offers unparalleled levels of immersion, it often lacks effective methods to guide viewers' attention toward specific elements within the virtual environment. This paper combines the models Grounding Dino and Segment Anything (SAM) to guide attention by object focusing based on video scripts. As a case study, this work conducts the experiments on a 360° video tour on the University of Reading. The experiment results show that video scripts can improve the user experience in 360° VR Videos Tour by helping in the task of directing the user's attention.
Authors:D. Darankoum, C. Habermacher, J. Volle, S. Grudinin
Abstract:
Decoding the orchestration of neural activity in electroencephalography (EEG) signals is a central challenge in bridging neuroscience with artificial intelligence. Foundation models have made strides in generalized EEG decoding, yet many existing frameworks primarily relying on separate temporal and spectral masking of raw signals during self-supervised pretraining. Such strategies often tend to bias learning toward high-frequency oscillations, as low-frequency rhythmic patterns can be easily inferred from the unmasked signal. We introduce a foundation model that utilizes a novel Gaussian-smoothed masking scheme applied to short-time Fourier transform (STFT) maps. By jointly applying time, frequency, and time-frequency Gaussian masks, we make the reconstruction task much more challenging, forcing the model to learn intricate neural patterns across both high- and low-frequency domains. To effectively recover signals under this aggressive masking strategy, we design SpecHi-Net, a U-shaped hierarchical architecture with multiple encoding and decoding stages. To accelerate large-scale pretraining, we partition the data into three subsets, each used to train an independent expert model. We then combine these models through SpecMoE, a mixture of experts framework guided by a learned spectral gating mechanism. SpecMoE achieves state-of-the-art performance across a diverse set of EEG decoding tasks, including sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Importantly, the model demonstrates strong cross-species and cross-subject generalization, maintaining high accuracy on both human and murine EEG datasets.
Authors:Zhaoxi Zhang, Ruolin Wu, Feiyang Ren, Sridevi Turaga, Tamir Mendel
Abstract:
Public participation has become increasingly important in collaborative urban design; yet, existing processes often face challenges in achieving efficient and scalable citizen engagement. To address this gap, this study explores how large language models (LLMs) can support cooperation among community members in participatory design. We introduce CoDesignAI, a collaborative urban design tool that combines multiple users, representing residents or stakeholders, with multiple AI agents, representing domain experts who provide facilitation and professional knowledge during the conceptual stage of urban design. This paper presents the system architecture and main components of the tool, illustrating how users interact with AI agents within a collaborative and iterative design workflow. Specifically, the system integrates generative AI with spatial mapping services to support street-level visualization of design proposals. AI agents assist users by summarizing discussion content, extracting shared design intentions, and generating prompts for presenting design interventions. The system also enables users to revise and refine their ideas over multiple rounds while documenting the design process. By combining conversational AI, multi-user interaction, and image-based design grounded in real-world urban contexts, this study argues that AI-enabled design systems can help shift urban design from an expert-centered practice to a more open and participatory process. The paper contributes a new web-based platform for AI-assisted collaborative design and offers an early exploration of how AI agents may expand the capacity for public participation in urban design.
Authors:Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze
Abstract:
Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
Authors:Anna Katharina Ricker, Kai Marquardt, Lucia Happe
Abstract:
Although personalization is widely advocated in gamified learning, empirical evidence on how learner characteristics and task context shape motivational preferences remains limited. This study examines how user characteristics and learning activity types relate to preferences for gamification elements in digital education. A large-scale quantitative survey (N = 530), including 34% underage participants, assessed preferences for 13 gamification elements in relation to Age, Gender, HEXAD Player Type, Big Five Personality Traits, Felder-Silverman Learning Styles, and Bloom-based Learning Activity Types. Inferential statistical analyses and exploratory machine learning techniques revealed systematic but generally small-to-moderate effects across parameters. Age emerged as the most consistent predictor of preference, followed by player type and personality traits, whereas gender and learning styles showed comparatively weaker associations. In addition, learning activity type significantly influenced the perceived suitability of gamification elements, indicating that motivational design is task-dependent. The findings suggest that gamification effectiveness cannot be reduced to universally motivating elements. Instead, preferences are shaped by the interaction of learner characteristics and instructional context. These results provide empirical grounding for adaptive and modular gamification strategies in digital learning environments.
Authors:Chantale Lauer, Peter Pfeiffer, Nijat Mehdiyev
Abstract:
Integrating Large Language Models (LLMs) into business process management tools promises to democratize Business Process Model and Notation (BPMN) modeling for non-experts. While automated frameworks assess syntactic and semantic quality, they miss human factors like trust, usability, and professional alignment. We conducted a mixed-methods evaluation of our proposed solution, an LLM-powered BPMN copilot, with five process modeling experts using focus groups and standardized questionnaires. Our findings reveal a critical tension between acceptable perceived usability (mean CUQ score: 67.2/100) and notably lower trust (mean score: 48.8\%), with reliability rated as the most critical concern (M=1.8/5). Furthermore, we identified output-quality issues, prompting difficulties, and a need for the LLM to ask more in-depth clarifying questions about the process. We envision five use cases ranging from domain-expert support to enterprise quality assurance. We demonstrate the necessity of human-centered evaluation complementing automated benchmarking for LLM modeling agents.
Authors:Nayoung Kim, Yotam Sechayk, Zhongyi Zhou, Takeo Igarashi
Abstract:
Learning tasks through videos is a dynamic way to acquire skills by witnessing entire processes. However, compared to in-person demonstrations, videos may omit tacit knowledge, including subtle details and contextual nuances. Users' unique circumstances, like missing ingredients in a recipe, may also require adaptation beyond the video content. To fill these gaps, many users turn to the comment section, seeking additional guidance and interactions with creators or peers to personalize their experience. Despite their importance, there is limited understanding of how users engage with and apply comments in task-learning scenarios. In our study, we explore the role of comments in video-based task-learning through interviews with 14 users, and co-watching sessions with eight. Our findings show that while comments are critical for learning, they are poorly integrated into all stages of the learning process. Based on our findings, we outline design opportunities to better utilize comments in video-based task-learning.
Authors:Sophia Liu, Shm Garanganao Almeda
Abstract:
Creativity research has privileged making over the interpretive labor that precedes and shapes it. We introduce Reading Activity Traces (RATs), a proposal that treats reading -- broadly defined to include navigating, interpreting, and curating media across interconnected sources -- as creative activity both for future artifacts and as a form of creation in its own right. By tracing trajectories of traversal, association, and reflection as inspectable artifacts, RATs render visible the creative work that algorithmic feeds and AI summarization increasingly compress and automate away. We illustrate this through WikiRAT, a speculative instantiation on Wikipedia, and open new ground for reflective practice, reader modeling, collective sensemaking, and understanding what is lost when human interpretation is automated -- towards designing intelligent tools that preserve it.
Authors:Jazmin Collins, Sharon Y Lin, Tianqi Liu, Andrea Stevenson Won, Shiri Azenkot
Abstract:
As social virtual reality (VR) grows more popular, addressing accessibility for blind and low vision (BLV) users is increasingly critical. Researchers have proposed an AI "sighted guide" to help users navigate VR and answer their questions, but it has not been studied with users. To address this gap, we developed a large language model (LLM)-powered guide and studied its use with 16 BLV participants in virtual environments with confederates posing as other users. We found that when alone, participants treated the guide as a tool, but treated it companionably around others, giving it nicknames, rationalizing its mistakes with its appearance, and encouraging confederate-guide interaction. Our work furthers understanding of guides as a versatile method for VR accessibility and presents design recommendations for future guides.
Authors:Yihang Zhao, Wenxin Zhang, Amy Rechkemmer, Albert Meroño Peñuela, Elena Simperl
Abstract:
Sensemaking in collaborative work and learning is increasingly supported by GenAI systems, however, emerging evidence suggests that poorly designed GenAI systems tend to provide explicit instruction that groups passively follow, fostering over-reliance and eroding autonomous sensemaking. Group awareness tools (GATs) address this challenge through implicit guidance: rather than instructing groups on what to do, GATs externalize observable collaboration data through visualizations that reveal differences between group members to create cognitive conflict, which triggers autonomous elaboration and discussion, thereby implicitly guiding autonomous sensemaking emergence. Drawing on an initial literature search of existing GAT systems, this paper explores the design of GenAI-augmented GATs to support autonomous sensemaking in collaborative work and learning, presenting preliminary design principles for discussion.
Authors:Yihang Zhao, Wenxin Zhang, Amy Rechkemmer, Albert Meroño-Peñuela, Elena Simperl
Abstract:
Socially shared metacognition (SSM) refers to the collective monitoring and regulation of joint cognitive processes in collaborative problem-solving, and is essential for effective knowledge work and learning. Generative AI (GenAI)-based systems offer new opportunities to support SSM, but emerging evidence suggests that poorly designed systems can encourage over-reliance on AI-generated explicit instruction and erode groups' capacity to develop autonomous regulatory processes. Group awareness tools (GATs) address this challenge through established design principles that make social and cognitive awareness information visible, highlight differences between group members to create cognitive conflict, and trigger autonomous elaboration and discussion, thereby implicitly guiding autonomous SSM emergence. This paper explores the design of GenAI-augmented GATs to support autonomous SSM in collaborative work and learning through an initial literature search, presenting preliminary design principles for discussion.
Authors:Alexander Erlei, Tahir Abbas, Kilian Bizer, Ujwal Gadiraju
Abstract:
Privacy concerns significantly impact AI adoption, yet little is known about how information environments shape user responses to data leak threats. We conducted a 2 x 3 between-subjects experiment (N=610) examining how risk versus ambiguity about privacy leaks affects the adoption of AI personalization. Participants chose between standard and AI-personalized product baskets, with personalization requiring data sharing that could leak to pricing algorithms. Under risk (30% leak probability), we found no difference in AI adoption between privacy-threatening and neutral conditions (ca. 50% adoption). Under ambiguity (10-50% range), privacy threats significantly reduced adoption compared to neutral conditions. This effect holds for sensitive demographic data as well as anonymized preference data. Users systematically over-bid for privacy disclosure labels, suggesting strong demand for transparency institutions. Notably, privacy leak threats did not affect subsequent bargaining behavior with algorithms. Our findings indicate that ambiguity over data leaks, rather than only privacy preferences per se, drives avoidance behavior among users towards personalized AI.
Authors:Keiichi Ihara, DaeHo Lee, Manato Abe, Hye-Young Jo, Ryo Suzuki
Abstract:
We introduce CinemaWorld, a generative augmented reality system that augments the viewer's physical surroundings with automatically generated mixed reality 3D content extracted from and synchronized with 2D movie scenes. Our system preprocesses films to extract key features using multimodal large language models (LLMs), generates dynamic 3D augmentations with generative AI, and embeds them spatially into the viewer's physical environment on the Meta Quest 3. To explore the design space of CinemaWorld, we conducted an elicitation study with eight film students, which led us to identify several key augmentation types, including particle effects, surrounding objects, textural overlays, character-driven augmentation, and lighting effects. We evaluated our system through a technical evaluation (N=100 video clips), a user study (N=12), and expert interviews with film creators (N=8). Results indicate that CinemaWorld enhances immersion and enjoyment, suggesting its potential to enrich the film-viewing experience.
Authors:Lei Yin, Wentao Cheng, Zhida Qin, Tianyu Huang, Yidong Li, Gangyi Ding
Abstract:
Automatically generating 3D games in commercial game engines remains a non-trivial challenge, as it involves complex engine-related workflows for generating assets such as scenes, blueprints, and code. To address this challenge, we propose a novel multi-agent system, AutoUE, which coordinates multiple agents to end-to-end generate 3D games, covering model retrieval, scene generation, gameplay and interaction code synthesis, and automated game testing for evaluation. In order to mitigate tool-use hallucinations in LLMs, we introduce a retrieval-augmented generation mechanism that grounds agents with relevant UE tool documentation. Additionally, we incorporate game design patterns and engine constraints into the code generation process to ensure the generation of correct and robust code. Furthermore, we design an automated play-testing pipeline that generates and executes runtime test commands, enabling systematic evaluation of dynamic behaviors. Finally, we construct a game generation dataset and conduct a series of experiments that demonstrate AutoUE's ability to generate 3D games end-to-end, and validate the effectiveness of these designs.
Authors:Xinyu Shi, Li-Yi Wei, Nanxuan Zhao, Jian Zhao, Rubaiat Habib Kazi
Abstract:
We introduce the concept of notational animating, an interaction paradigm for animation authoring where users sketch high-level notations over static drawings to indicate intended motions, which are then interpreted by automatic methods (e.g., GenAI models) to generate animation keyframes. Sketched notations have long served as cognitive instruments for animators, capturing forces, poses, dynamics, paths, and other animation features. However, such notations are often context-dependent, non-categorical, ambiguous, and composable based on our analysis of real-world animator-produced sketches. To facilitate interpretation, we first formalize these notations into a structured animation representation (i.e., source, path, and target). We then built an animation authoring system that translates high-level notations into the formalized intended animation, provides dynamic UI widgets for fine-grained parameter control, and establishes a closed feedback loop to resolve ambiguity. Finally, through a preliminary study with animators, we assess the usability of notational animating, reflect its affordance, and identify its contexts of use.
Authors:Yonatan Tussa, Andy Heredia
Abstract:
Wearable AI is often designed as always-available, yet continuous availability can conflict with how people work and socialize, creating discomfort around privacy, disruption, and unclear system boundaries. This paper explores episodic use of wearable AI, where assistance is intentionally invoked for short periods of focused activity and set aside when no longer needed, with a form factor that reflects this paradigm of wearing and taking off a device between sessions. We present The Pen, an ear-worn device resembling a pen, for episodic, situated cognitive assistance. The device supports short, on-demand assistance sessions using voice and visual context, with clear start/end boundaries and local processing. We report findings from an exploratory study showing how layered activation boundaries shape users' sense of agency, cognitive flow, and social comfort.
Authors:Mingyi Li, Mengyi Chen, Sarah Luo, Yining Cao, Haijun Xia, Maitraye Das, Steven P. Dow, Jane L. E
Abstract:
Visual design instructors often provide multi-modal feedback, mixing annotations with text. Prior theory emphasizes the importance of actionable feedback, where "actionability" lies on a spectrum--from surfacing relevant design concepts to suggesting concrete fixes. How might creativity tools implement annotations that support such feedback, and how does the actionability of feedback impact novices' process-related behaviors, perceptions of creativity, learning of design principles, and overall outcomes? We introduce VizCrit, a system for providing computational feedback that supports the actionability spectrum, realized through algorithmic issue detection and visual annotation generation. In a between-subjects study (N=36), novices revised a design under one of three conditions: textbook-based, awareness-centered, or solution-centered feedback. We found that solution-centered feedback led to fewer design issues and higher self-perceived creativity compared with textbook-based feedback, although expert ratings on creativity showed no significant differences. We discuss the implications for AI in Creativity Support Tools, including the potential of calibrating feedback actionability to help novices balance productivity with learning, growth, and developing design awareness.
Authors:Jialiang Wei, Ali Ebrahimi Pourasad, Walid Maalej
Abstract:
User feedback is crucial for the evolution of mobile apps. However, research suggests that users tend to submit uninformative, vague, or destructive feedback. Unlike recent AI4SE approaches that focus on generating code and other development artifacts, our work aims at empowering users to submit better and more constructive UI feedback with concrete suggestions on how to improve the app. We propose LikeThis!, a GenAI-based approach that takes a user comment with the corresponding screenshot to immediately generate multiple improvement alternatives, from which the user can easily choose their preferred option. To evaluate LikeThis!, we first conducted a model benchmarking study based on a public dataset of carefully critiqued UI designs. The results show that GPT-Image-1 significantly outperformed three other state-of-the-art image generation models in improving the designs to address UI issues while keeping the fidelity and without introducing new issues. An intermediate step in LikeThis! is to generate a solution specification before sketching the design as a key to achieving effective improvement. Second, we conducted a user study with 10 production apps, where 15 users used LikeThis! to submit their feedback on encountered issues. Later, the developers of the apps assessed the understandability and actionability of the feedback with and without generated improvements. The results show that our approach helps generate better feedback from both user and developer perspectives, paving the way for AI-assisted user-developer collaboration.
Authors:Daijin Yang, Erica Kleinman, Casper Harteveld
Abstract:
Educational games can foster critical thinking, problem-solving, and motivation, yet instructors often find it difficult to design games that reliably achieve specific learning outcomes. Existing authoring environments reduce the need for programming expertise, but they do not eliminate the underlying challenges of educational game design, and they can leave non-expert designers reliant on opaque suggestions from AI systems. We designed a controlled natural language framework-based web tool that positions language as the primary interface for LLM-assisted educational game design. In the tool, users and an LLM assistant collaboratively develop a structured language that maps pedagogy to gameplay through four linked components. We argue that, by making pedagogical intent explicit and editable in the interface, the tool has the potential to lower design barriers for non-expert designers, preserves human agency in critical decisions, and enables alignment and reflections between pedagogy and gameplay during and after co-creation.
Authors:Alexander Schperberg, Yeping Wang, Stefano Di Cairano
Abstract:
Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation, while considering safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines a model-based admittance control for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches--such as those applied by a human during physical interaction--into desired end-effector velocities, allowing for compliant behavior. The velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a Kalman filter enhanced with neural networks for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot with a 6-DoF arm and wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
Authors:Shri Harini Ramesh, Fateme Rajabiyazdi
Abstract:
This paper introduces Pulli Kolam, a traditional South Indian craft, as a medium for physical data representation. Grounded in its cultural meaning and embodied practice, Pulli Kolam follows structured geometric rules while allowing creative variation. We identify five mapping strategies within Kolam (dots, patterns, fills, lines, and color) that can be used for representing data physically. without disrupting traditional practice. Through an illustrative scenario of daily well-being tracking, we demonstrate how data representation can be embedded within routine craft practice. We conclude by outlining potential material adaptations that extend Kolam beyond its ephemeral form while maintaining its embodied and ritual qualities.
Authors:Sora Kang, Jaemin Zoh, Hyoju Kim, Hyeonseo Park, Hajin Lim, Joonhwan Lee
Abstract:
Character journaling is a well-established exercise in actor training, but many actors struggle to sustain it due to cognitive burden, the blank page problem, and unclear short-term rewards. We reframe large language models not as co-authors but as maieutic partners-tools that guide reflection through context-aware questioning rather than producing text on behalf of the user. Based on this perspective, we designed Actor's Note, a journaling tool that tailors questions to the script, role, and rehearsal phase while preserving actor agency. We evaluated the system in a 14-day crossover study with 29 actors using surveys, logs, and interviews. Results indicate that the tool reduced entry barriers, supported sustained reflection, and enriched character exploration, with participants describing different benefits when AI was introduced at earlier versus later rehearsal stages. This work contributes empirical insights and design principles for creativity-support tools that sustain reflective practices while preserving artistic immersion in performance training.
Authors:Sina Elahimanesh, Mohammadali Mohammadkhani, Sara Zahedi Movahed, Mohammadmahdi Abootorabi, Shayan Salehi, Abbas Edalat
Abstract:
While large language models (LLMs) excel at open-ended dialogue, effective psychotherapy requires structured progression and adherence to clinical protocols, making the design of psychotherapist chatbots challenging. We investigate how different LLM-based designs shape perceived therapeutic dialogue in a chatbot grounded in the Self-Attachment Technique (SAT), a novel self-administered psychotherapy rooted in attachment theory. We compare three architectural variants: (1) a multi-agent system utilizing finite state machine aligned with therapeutic stages and a shared long-term memory, (2) a single-agent using identical knowledge-base and the same prompts, and (3) an unguided LLM. In an eight-day randomized controlled trial (RCT) with N=66 Farsi-speaking participants, balanced across the three chatbots, the multi-agent system is perceived as significantly more natural and human-like than the other variants and achieves higher ratings across most other metrics. These findings demonstrate that for therapeutic AI, architectural orchestration is as critical as prompt engineering in fostering natural, engaging dialogue.
Authors:Cong Ye, Songlin Shang, Xiaoxu Ma, Xiangbo Zhang
Abstract:
Generative feedback in sensory-sensitive contexts poses a core design challenge: large individual differences in sensory tolerance make it difficult to sustain engagement without compromising safety. This tension is exemplified in autism spectrum disorder (ASD), where auditory sensitivities are common yet highly heterogeneous. Existing interactive music systems typically encode safety implicitly within direct input-output (I-O) mappings, which can preserve novelty but make system behavior hard to predict or audit. We instead propose a constraint-first Input-Envelope-Output (I-E-O) framework that makes safety explicit and verifiable while preserving action-output causality. I-E-O introduces a low-risk envelope layer between user input and audio output to specify safe bounds, enforce them deterministically, and log interventions for audit. From this architecture, we derive four verifiable design principles and instantiate them in MusiBubbles, a web-based prototype. Contributions include the I-E-O architecture, MusiBubbles as an exemplar implementation, and a reproducibility package to support adoption in ASD and other sensory-sensitive domains.
Authors:Yunpeng Bai, Shengdong Zhao, Antti Oulasvirta
Abstract:
Augmented reading systems aim to adapt text presentation to improve comprehension and task performance, yet existing approaches rely heavily on heuristics, opaque data-driven models, or repeated human involvement in the design loop. We propose framing augmented reading as a simulation-based optimization problem grounded in resource-rational models of human reading. These models instantiate a simulated reader that allocates limited cognitive resources, such as attention, memory, and time under task demands, enabling systematic evaluation of text user interfaces. We introduce two complementary optimization pipelines: an offline approach that explores design alternatives using simulated readers, and an online approach that personalizes reading interfaces in real time using ongoing interaction data. Together, this perspective enables adaptive, explainable, and scalable augmented reading design without relying solely on human testing.
Authors:Xueqing Li, Danqi huang, Tianyu Yu, Shuzi Yin, Bingjie Gao, Anna Matsumoto, Zhihao Yao, Yiwei Zhao, Shiqing Lyu, Yuchen Tian, Lining Yao, Haipeng Mi, Qiuyu Lu
Abstract:
We introduce DuoMorph, a design and fabrication method that synergistically integrates Fused Deposition Modeling (FDM) printing and pneumatic actuation to create novel shape-changing interfaces. In DuoMorph, the printed structures and heat-sealed pneumatic elements are mutually designed to actuate and constrain each other, enabling functions that are difficult for either component to achieve in isolation. Moreover, the entire hybrid structure can be fabricated through a single, seamless process using only a standard FDM printer, including both heat-sealing and 3D and 4D printing. In this paper, we define a design space including four primitive categories that capture the fundamental ways in which printed and pneumatic components can interact. To support this process, we present a fabrication method and an accompanying design tool. Finally, we demonstrate the potential of DuoMorph through a series of example applications and performance demonstrations.
Authors:Shehryar Saharan, Ibrahim Al-Hazwani, Miriah Meyer, Laura Garrison
Abstract:
Visualization has matured into an established research field, producing widely adopted tools, design frameworks, and empirical foundations. As the field has grown, ideas from outside computer science have increasingly entered visualization discourse, questioning the fundamental values and assumptions on which visualization research stands. In this short position paper, we examine a set of values that we see underlying the seminal works of Jacques Bertin, John Tukey, Leland Wilkinson, Colin Ware, and Tamara Munzner. We articulate three prominent values in these texts - universality, objectivity, and efficiency - and examine how these values permeate visualization tools, curricula, and research practices. We situate these values within a broader set of critiques that call for more diverse priorities and viewpoints. By articulating these tensions, we call for our community to embrace a more pluralistic range of values to shape our future visualization tools and guidelines.
Authors:Yang Liu, Qiushi Zhou, Mathias N Lystbæk, Aidan Kehoe, Mario Gutierrez, Hans Gellersen, Ken Pfeuffer
Abstract:
With a stylus, users can both sweep sketches across models and pinpoint locations with precision. Building on this dual capability, we explore how teleportation can be integrated into stylus interaction without disrupting the flow of common stylus usage. We introduce two key ideas: flipping the stylus as an intuitive mode switch between drawing and teleportation, and using gaze to set orientation while the stylus handles positioning. In a user study that features a teleport-and-orient task, we evaluate six teleportation techniques, covering two mode-switching methods (Button and Flip) and three orientation approaches (StylusRoll, StylusPoint, and GazePoint). The results offer new insights into the relative merits and limitations of each technique. Our work contributes to knowledge about teleportation in VR and fills the gap in seamlessly integrating teleportation with stylus use in 3D.
Authors:Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar
Abstract:
When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture nuanced human preferences that arise in long-horizon tasks. Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents. We further propose Language to Hierarchical Rewards (L2HR) as a solution to HRDL. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications. Together, HRDL and L2HR advance the research on human-aligned AI agents.
Authors:Kian Wei Ng, Yujia Gao, Deborah Khoo, Ying Zhen Tan, Chengzheng Mao, Haojie Cheng, Andrew Makmur, Kee Yuan Ngiam, Serene Goh, Eng Tat Khoo
Abstract:
Accurate volumetric characterization of lesions is essential for oncologic diagnosis, risk stratification, and treatment planning. While imaging modalities such as Computed Tomography provide high-quality 3D data, 2D ultrasound (2D-US) remains the preferred first-line modality for breast and thyroid imaging due to cost, portability, and safety factors. However, volume estimates derived from 2D-US suffer from high inter-user variability even among experienced clinicians. Existing 3D ultrasound (3D-US) solutions use specialized probes or external tracking hardware, but such configurations increase costs and diminish portability, constraining widespread clinical use. To address these limitations, we present Mobile Augmented Reality Volumetric Ultrasound (MARVUS), a resource-efficient system designed to increase accessibility to accurate and reproducible volumetric assessment. MARVUS is interoperable with conventional ultrasound (US) systems, using a foundation model to enhance cross-specialty generalization while minimizing hardware requirements relative to current 3D-US solutions. In a user study involving experienced clinicians performing measurements on breast phantoms, MARVUS yielded a substantial improvement in volume estimation accuracy (mean difference: 0.469 cm3) with reduced inter-user variability (mean difference: 0.417 cm3). Additionally, we prove that augmented reality (AR) visualizations enhance objective performance metrics and clinician-reported usability. Collectively, our findings suggests that MARVUS can enhance US-based cancer screening, diagnostic workflows, and treatment planning in a scalable, cost-conscious, and resource-efficient manner. Usage video demonstration available (https://youtu.be/m4llYcZpqmM).
Authors:Mohammad Masudur Rahman, Beenish Moalla Chaudhry
Abstract:
The rapid growth of AI-driven mental health mobile apps has raised concerns about their ethical considerations and user trust. This study proposed a natural language processing (NLP)-based framework to evaluate ethical aspects from user-generated reviews from the Google Play Store and Apple App Store. After gathering and cleaning the data, topic modeling was applied to identify latent themes in the context of ethics using topic words and then map them to well-recognized existing ethical principles described in different ethical frameworks; in addition to that, a bottom-up approach is applied to find any new and emergent ethics from the reviews using a transformer-based zero-shot classification model. Sentiment analysis was then used to capture how users feel about each ethical aspect. The obtained results reveal that well-known ethical considerations are not enough for the modern AI-based technologies and are missing emerging ethical challenges, showing how these apps either uphold or overlook key moral values. This work contributes to developing an ongoing evaluation system that can enhance the fairness, transparency, and trustworthiness of AI-powered mental health chatbots.
Authors:Philipp Steigerwald, Jens Albrecht
Abstract:
Psychosocial online counselling frequently encounters generic subject lines that impede efficient case prioritisation. This study evaluates eleven large language models generating six-word subject lines for German counselling emails through hierarchical assessment - first categorising outputs, then ranking within categories to enable manageable evaluation. Nine assessors (counselling professionals and AI systems) enable analysis via Krippendorff's $α$, Spearman's $ρ$, Pearson's $r$ and Kendall's $τ$. Results reveal performance trade-offs between proprietary services and privacy-preserving open-source alternatives, with German fine-tuning consistently improving performance. The study addresses critical ethical considerations for mental health AI deployment including privacy, bias and accountability.
Authors:Daniel Killough, Tiger F. Ji, Kexin Zhang, Yaxin Hu, Yu Huang, Ruofei Du, Yuhang Zhao
Abstract:
While accessibility (a11y) guidelines exist for 3D games and virtual worlds, their applicability to extended reality (XR)'s unique interaction paradigms (e.g., spatial tracking, kinesthetic interactions) remains unexplored. XR practitioners need practical guidance to successfully implement a11y guidelines under real-world constraints. We present the first evaluation of existing 3D a11y guidelines applied to XR development through semi-structured interviews with 25 XR practitioners across diverse organization contexts. We assessed 20 commonly-agreed a11y guidelines from six major resources across visual, motor, cognitive, speech, and hearing domains, comparing practitioners' development practices against guideline applicability to XR. Our investigation reveals that guidelines can be highly effective when designed as transformation catalysts rather than compliance checklists, but fundamental mismatches exist between existing 3D guidelines and XR requirements, creating both implementation barriers and design gaps. This work provides foundational insights towards developing a11y guidelines and support tools that address XR's distinct characteristics.
Authors:Neda Barbazi, Ji Youn Shin, Gurumurthy Hiremath, Carlye Anne Lauff
Abstract:
Children with chronic conditions face evolving challenges in daily activities, peer relationships, and clinical care. Younger children often rely on parental support, while older ones seek independence. Prior studies on chronic conditions explored proxy-based, family-centered, and playful approaches to support children's health, but most technologies treat children as a homogeneous group rather than adapting to their developmental differences. To address this gap, we conducted four co-design workshops with 69 children with congenital heart disease (CHD) at a medically supported camp, spanning elementary, middle, and high school groups. Our analysis reveals distinct coping strategies: elementary children relied on comfort objects and reassurance, middle schoolers used mediated communication and selective disclosure, and high schoolers emphasized agency and direct engagement with peers and providers. Through child-centered participatory design, we contribute empirical insights into how children's management of chronic conditions evolves and propose design implications for pediatric health technologies that adapt across developmental trajectories.
Authors:Vijay Prakash, Majed Almansoori, Donghan Hu, Rahul Chatterjee, Danny Yuxing Huang
Abstract:
Technology-facilitated abuse (TFA) is a pervasive form of intimate partner violence (IPV) that leverages digital tools to control, surveil, or harm survivors. While tech clinics are one of the reliable sources of support for TFA survivors, they face limitations due to staffing constraints and logistical barriers. As a result, many survivors turn to online resources for assistance. With the growing accessibility and popularity of large language models (LLMs), and increasing interest from IPV organizations, survivors may begin to consult LLM-based chatbots before seeking help from tech clinics. In this work, we present the first expert-led manual evaluation of four LLMs - two widely used general-purpose non-reasoning models and two domain-specific models designed for IPV contexts - focused on their effectiveness in responding to TFA-related questions. Using real-world questions collected from literature and online forums, we assess the quality of zero-shot single-turn LLM responses generated with a survivor safety-centered prompt on criteria tailored to the TFA domain. Additionally, we conducted a user study to evaluate the perceived actionability of these responses from the perspective of individuals who have experienced TFA. Our findings, grounded in both expert assessment and user feedback, provide insights into the current capabilities and limitations of LLMs in the TFA context and may inform the design, development, and fine-tuning of future models for this domain. We conclude with concrete recommendations to improve LLM performance for survivor support.
Authors:Rachael Zehrung, Yunan Chen
Abstract:
Vehicle dwelling has increased significantly in recent years. While HCI research has explored vehicle dwelling through the lens of digital nomadism and vanlife, it has largely overlooked the complexities of vehicle dwelling as a form of housing insecurity, as well as the unique constraints of living in smaller vehicles. Drawing on a qualitative analysis of posts and comments from an online community, we examine car dwellers' infrastructuring work to manage daily life under social, spatial, and infrastructural constraints. We further explore the motivations and identity negotiations of car dwellers, whose experiences fall between homelessness and nomadism, and highlight how developing infrastructural competence can shape identity. We discuss implications for future HCI research on mobility and dwelling under conditions of uneven access to infrastructure and provide design recommendations for technologies that better account for car dwellers' diverse needs, circumstances, and identities.
Authors:Dhiman Goswami, Jai Kruthunz Naveen Kumar, Sanchari Das
Abstract:
Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58-0.84, but incur a 1% - 23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24) revealing substantial gaps in privacy research. We further found a (reduced by 2% - 9%) trade-off in model utility, MIA AUC (membership inference attacks) 0.81, AIA accuracy 0.75 (attribute inference attacks). Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.
Authors:Ashlee Milton, Dan Runningen, Loren Terveen, Harmanpreet Kaur, Stevie Chancellor
Abstract:
Social media platforms have rapidly adopted algorithmic curation with little consideration for the potential harm to users' mental well-being. We present findings from design workshops with 21 participants diagnosed with mental illness about their interactions with social media platforms. We find that users develop cause-and-effect explanations, or folk theories, to understand their experiences with algorithmic curation. These folk theories highlight a breakdown in algorithmic design that we explain using the framework of entanglement, a phenomenon where there is a disconnect between users' actions and platform outcomes on an emotional level. Participants' designs to address entanglement and mitigate harms centered on contextualizing their engagement and restoring explicit user control on social media. The conceptualization of entanglement and the resulting design recommendations have implications for social computing and recommender systems research, particularly in evaluating and designing social media platforms that support users' mental well-being.
Authors:Artur Solomonik, Nicolas Ruiz, Hendrik Heuer
Abstract:
Social media has billions of users, but we still do not fully understand why users prefer one platform over another. Establishing new platforms among already popular competitors is difficult. Prior research has richly documented people's experiences within individual platforms, yet situating those experiences within the entirety of a user's social media experience remains challenging. What platforms have people used, and why have they transitioned between them? We collected data from a quota-based sample of 1,000 U.S. participants. We introduce the concept of \emph{Social Media Journeys} to study the entirety of their social media experiences systematically. We identify push and pull factors across the social media landscape. We also show how different generations adopted social media platforms based on personal needs. With this work, we advance HCI by moving towards holistic perspectives when discussing social media technology, offering new insights for platform design, governance, and regulation.
Authors:Mounvik K, N Harshit
Abstract:
We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked using a fine-tuned CLIP model to measure semantic alignment with topic and text. Optional BLIP captioning enables image-only summaries for stronger multimodal coherence.The pipeline supports features such as adjustable fetch limits, semantic filtering, summary styling, and downloading structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured presets.Evaluation on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a configurable, deployable tool for web-scale summarization that integrates language, retrieval, and vision models in a user-extensible pipeline.
Authors:Saurabh Amin, Amine Bennouna, Daniel Huttenlocher, Dingwen Kong, Liang Lyu, Asuman Ozdaglar
Abstract:
We develop a decision-theoretic model of human-AI interaction to study when AI assistance improves or impairs human decision-making. A human decision-maker observes private information and receives a recommendation from an AI system, but may combine these signals imperfectly. We show that the effect of AI assistance decomposes into two main forces: the marginal informational value of the AI beyond what the human already knows, and a behavioral distortion arising from how the human uses the AI's recommendation. Central to our analysis is a micro-founded measure of informational overlap between human and AI knowledge. We study an empirically relevant form of imperfect decision-making -- correlation neglect -- whereby humans treat AI recommendations as independent of their own information despite shared evidence. Under this model, we characterize how overlap and AI capabilities shape the Human-AI interaction regime between augmentation, impairment, complementarity, and automation, and draw key insights.
Authors:Marianne Bossema, Rob Saunders, Vlad Glaveanu, Somaya Ben Allouch
Abstract:
While intelligent technologies offer unique opportunities for creativity support, there are fundamental challenges in designing human-centered co-creative systems. Explainable AI (XAI) can contribute when shifting its traditional role from justification (explaining decisions) to exploration (explaining possibilities). Contextual understanding is essential for supporting embodied creativity. Generative Artificial Intelligence (AI) models are fundamentally limited, however, by their reliance on disembodied data. We propose Pluri-perspectivism as a framework for XAI, to bridge the epistemological gap between human and machine, and promote creative exploration. It is a pragmatic, action-oriented solution to guide the system, repurposing XAI methods such as the Rashomon Technique. This facilitates exploring a spectrum of creative possibilities, and the exchange of 'perspectives' between human and machine. Using Pluri-perspectivism as a framework for XAI, we can reintroduce productive friction and support human agency in human-machine creative collaborations.
Authors:Ben Kosa, Hsuanling Lee, Jasmine Li, Sanbrita Mondal, Yuhang Zhao, Liang He
Abstract:
Existing assistive technologies (AT) often adopt a one-size-fits-all approach, overlooking the diverse needs of people with visual impairments (PVI). Do-it-yourself AT (DIY-AT) toolkits offer one path toward customization, but most remain limited--targeting co-design with engineers or requiring programming expertise. Non-professionals with disabilities, including PVI, also face barriers such as inaccessible tools, lack of confidence, and insufficient technical knowledge. These gaps highlight the need for prototyping technologies that enable PVI to directly make their own AT. Building on emerging evidence that large language models (LLMs) can serve not only as visual aids but also as co-design partners, we present an exploratory study of how LLM-based AI can support PVI in the tangible DIY-AT co-making process. Our findings surface key challenges and design opportunities: the need for greater spatial and visual support, strategies for mitigating novel AI errors, and implications for designing more accessible AI-assisted prototypes.
Authors:Zhanming Chen, Alisha Ghaju, May Hang, Juan F. Maestre, Ji Youn Shin
Abstract:
Patient-provider communication is an important aspect of successful healthcare, as it can directly lead to positive health outcomes. Previous studies examined factors that facilitate communication between healthcare providers and patients in socially marginalized communities, especially developing countries, and applied identified factors to technology development. However, there is limited understanding of how providers work with patients from immigrant populations in a developed country. By conducting semi-structured interviews with 15 providers working with patients from an immigrant community with unique cultural characteristics, we identified providers' effective communication strategies, including acknowledgment, community involvement, gradual care, and adaptive communication practices (i.e., adjusting the communication style). Based on our findings, we highlight cultural competence and discuss design implications for technologies to support health communication in immigrant communities. Our suggestions propose approaches for HCI researchers to identify practical, contextualized cultural competence for their health technology design.
Authors:Minghe Lu, Zhanming Chen, May Sunmin Hwang, Ji Youn Shin
Abstract:
Farming plays a significant role in the economy by supporting related industries such as food, retail, and local services. Community-based small farms, while offering unique social and cultural benefits, face persistent challenges, including limited access to formal education and underdeveloped infrastructure, which have been discussed in prior research. This study focuses on community-driven factors, such as workarounds for recording critical information and practices for passing down farming knowledge across generations. Through 11 semi-structured interviews with farmers from a small ethnic community, the Hmong, we explore how bonding social capital, rooted in close family and community ties, supports informal knowledge exchange and creates pathways to bridging and linking capital. These relationships help farmers connect to broader networks, resources, and institutions. Our findings highlight opportunities for designing technologies that support and strengthen existing support systems. We discuss how technologies should be designed to reflect the cultural values, unique practices, and intergenerational relationships embedded in community-based farms.
Authors:Jackie Baek, Yaopeng Fu, Will Ma, Tianyi Peng
Abstract:
Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent advances in large language models (LLMs) have generated interest in AI agents that can reason flexibly and incorporate rich contextual signals, but it remains unclear how best to incorporate LLM-based methods into traditional decision-making pipelines. We study how OR algorithms, LLMs, and humans can interact and complement each other in a multi-period inventory control setting. We construct InventoryBench, a benchmark of over 1,000 inventory instances spanning both synthetic and real-world demand data, designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times. Through this benchmark, we find that OR-augmented LLM methods outperform either method in isolation, suggesting that these methods are complementary rather than substitutes. We further investigate the role of humans through a controlled classroom experiment that embeds LLM recommendations into a human-in-the-loop decision pipeline. Contrary to prior findings that human-AI collaboration can degrade performance, we show that, on average, human-AI teams achieve higher profits than either humans or AI agents operating alone. Beyond this population-level finding, we formalize an individual-level complementarity effect and derive a distribution-free lower bound on the fraction of individuals who benefit from AI collaboration; empirically, we find this fraction to be substantial.
Authors:Tony Li, Yan Ma, Zhuojun Li, Chun Yu, IV Ramakrishnan, Xiaojun Bi
Abstract:
Existing touchscreen software keyboards prevent users from resting their hands, forcing slow and fatiguing index-finger tapping ("chicken typing") instead of familiar hands-down ten-finger typing. We present KeySense, a purely software solution that preserves physical keyboard motor skills. KeySense isolates intentional taps from resting-finger noise using cognitive-motor timing patterns, and then uses a fine-tuned LLM decoder to convert the resulting noisy letter sequence into the intended word. In controlled component tests, the decoder substantially outperforms two statistical baselines (top-1 accuracy 84.8% vs 75.7% and 79.3%). A 12-participant study shows clear ergonomic and performance benefits: compared with the conventional hover-style keyboard, users rated KeySense as markedly less physically demanding (NASA-TLX median 1.5 vs 4.0), and after brief practice typed significantly faster (WPM 28.3 vs 26.2, p < 0.01). These results indicate that KeySense enables accurate, efficient, and comfortable ten-finger text entry on commodity touchscreens without any extra hardware.
Authors:Hanjing Shi, Dominic DiFranzo
Abstract:
Agentic AI systems-autonomous entities capable of independent planning and execution-reshape the landscape of human-AI trust. Long before direct system exposure, user expectations are mediated through high-stakes public discourse on social platforms. However, platform-mediated engagement signals (e.g., upvotes) may inadvertently function as a ``credibility proxy,'' potentially stifling critical evaluation. This paper investigates the interplay between social proof and verification timing in online discussions of agentic AI. Analyzing a longitudinal dataset from two distinct Reddit communities with contrasting interaction cultures-r/OpenClaw and r/Moltbook-we operationalize verification cues via reproducible lexical rules and model the ``time-to-first-verification'' using a right-censored survival analysis framework. Our findings reveal a systemic ``Popularity Paradox'': high-visibility discussions in both subreddits experience significantly delayed or entirely absent verification cues compared to low-visibility threads. This temporal lag creates a critical window for ``Narrative Lock-in,'' where early, unverified claims crystallize into collective cognitive biases before evidence-seeking behaviors emerge. We discuss the implications of this ``credibility-by-visibility'' effect for AI safety and propose ``epistemic friction'' as a design intervention to rebalance engagement-driven platforms.
Authors:Yate Ge, Lin Tian, Yi Dai, Shuhan Pan, Yiwen Zhang, Qi Wang, Weiwei Guo, Xiaohua Sun
Abstract:
This work investigates generative facial expression interfaces for intelligent agents from a meta-design perspective. We propose the Generative Personalized Facial Expression Interface (GPFEI) framework, which organizes rule-bounded spaces, character identity, and context--expression mapping to address challenges of control, coherence, and alignment in run-time facial expression generation. To operationalize this framework, we developed GenFaceUI, a proof-of-concept tool that enables designers to create templates, apply semantic tags, define rules, and iteratively test outcomes. We evaluated the tool through a qualitative study with twelve designers. The results show perceived gains in controllability and consistency, while revealing needs for structured visual mechanisms and lightweight explanations. These findings provide a conceptual framework, a proof-of-concept tool, and empirical insights that highlight both opportunities and challenges for advancing generative facial expression interfaces within a broader meta-design paradigm.
Authors:Olivia Figueira, Pranathi Chamarthi, Tu Le, Athina Markopoulou
Abstract:
AI chatbots are widely used by children and teens today, but they pose significant risks to youth's privacy and safety due to both increasingly personal conversations and potential exposure to unsafe content. While children under 13 are protected by the Children's Online Privacy Protection Act (COPPA), chatbot providers' own privacy policies may also provide protections, since they typically prohibit children from accessing their platforms. Age gating is often employed to restrict children online, but chatbot age gating in particular has not been studied. In this paper, we investigate whether popular consumer chatbots are (i) able to estimate users' ages based solely on their conversations, and (ii) whether they take action upon identifying children. To that end, we develop an auditing framework in which we programmatically interact with chatbots and conduct 1050 experiments using our comprehensive library of age-indicative prompts, including implicit and explicit age disclosures, to analyze the chatbots' responses and actions. We find that while chatbots are capable of estimating age, they do not take any action when children are identified, contradicting their own policies. Our methodology and findings provide insights for platform design, demonstrated by our proof-of-concept chatbot age gating implementation, and regulation to protect children online.
Authors:Yate Ge, Lin Tian, Chiqian Xu, Luyao Xu, Meiying Li, Yuanda Hu, Weiwei Guo
Abstract:
Thematic jokes are central to stand-up comedy, sitcoms, and public speaking, where contexts and punchlines rely on fresh material - news, anecdotes, and cultural references that resonate with the audience. Recent advances in Large Language Models (LLMs) have enabled interactive joke generation through conversational interfaces. Although LLMs enable interactive joke generation, ordinary conversational interfaces seldom give creators enough agency, control, or timely access to such source material for constructing context and punchlines. We designed Jokeasy, a search-enabled prototype system that integrates a dual-role LLM agent acting as both a material scout and a prototype writer to support human-AI collaboration in thematic joke writing. Jokeasy provides a visual canvas in which retrieved web content is organized into editable inspiration blocks and developed through a multistage workflow. A qualitative study with 13 hobbyists and 5 expert participants (including professional comedians and HCI/AI specialists) showed that weaving real-time web material into this structured workflow enriches ideation and preserves author agency, while also revealing needs for finer search control, tighter chat-canvas integration, and more flexible visual editing. These insights refine our understanding of AI-assisted humour writing and guide future creative-writing tools.
Authors:Hanjing Shi, Dominic DiFranzo
Abstract:
Oversight for agentic AI is often discussed as a single goal ("human control"), yet early adoption may produce role-specific expectations. We present a comparative analysis of two newly active Reddit communities in Jan--Feb 2026 that reflect different socio-technical roles: r/OpenClaw (deployment and operations) and r/Moltbook (agent-centered social interaction). We conceptualize this period as an early-stage crystallization phase, where oversight expectations form before norms reach equilibrium. Using topic modeling in a shared comparison space, a coarse-grained oversight-theme abstraction, engagement-weighted salience, and divergence tests, we show the communities are strongly separable (JSD =0.418, cosine =0.372, permutation $p=0.0005$). Across both communities, "human control" is an anchor term, but its operational meaning diverges: r/OpenClaw} emphasizes execution guardrails and recovery (action-risk), while r/Moltbook} emphasizes identity, legitimacy, and accountability in public interaction (meaning-risk). The resulting distinction offers a portable lens for designing and evaluating oversight mechanisms that match agent role, rather than applying one-size-fits-all control policies.
Authors:Prerna Ravi, Carúmey Stevens, Beatriz Flamia Azevedo, Jasmine David, Brandon Hanks, Hal Abelson, Grace Lin, Emma Anderson
Abstract:
Collaboration is a cornerstone of 21st-century learning, yet teachers continue to face challenges in supporting productive peer interaction. Emerging generative AI tools offer new possibilities for scaffolding collaboration, but their role in mediating in-person group work remains underexplored, especially from the perspective of educators. This paper presents findings from an exploratory qualitative study with 33 K12 teachers who interacted with Phoenix, a voice-based conversational agent designed to function as a near-peer in face-to-face group collaboration. Drawing on playtesting sessions, surveys, and focus groups, we examine how teachers perceived the agent's behavior, its influence on group dynamics, and its classroom potential. While many appreciated Phoenix's capacity to stimulate engagement, they also expressed concerns around autonomy, trust, anthropomorphism, and pedagogical alignment. We contribute empirical insights into teachers' mental models of AI, reveal core design tensions, and outline considerations for group-facing AI agents that support meaningful, collaborative learning.
Authors:Lena Hegemann, Xinyi Wen, Michael A. Hedderich, Tarmo Nurmi, Hariharan Subramonyam
Abstract:
Generative AI often produces results misaligned with user intentions, for example, resolving ambiguous prompts in unexpected ways. Despite existing approaches to clarify intent, a major challenge remains: understanding and influencing AI's interpretation of user intent through simple, direct inputs requiring no expertise or rigid procedures. We present ToMigo, representing intent as design concept graphs: nodes represent choices of purpose, content, or style, while edges link them with interpretable explanations. Applied to graphic design, ToMigo infers intent from reference images and text. We derived a schema of node types and edges from pre-study data, informing a multimodal large language model to generate graphs aligning nodes externally with user intent and internally toward a unified design goal. This structure enables users to explore AI reasoning and directly manipulate the design concept. In our user studies, ToMigo received high alignment ratings and captured most user intentions well. Users reported greater control and found interactive features-editable graphs, reflective chats, concept-design realignment-useful for evolving and realizing their design ideas.
Authors:Shri Harini Ramesh, Foroozan Daneshzand, Babak Rashidi, Shriti Raj, Hariharan Subramonyam, Fateme Rajabiyazdi
Abstract:
As Artificial Intelligence (AI) conversational agents become widespread, people are increasingly using them for health information seeking. The use of off-the-shelf conversational agents for health information seeking could place high metacognitive demands (the need for extensive monitoring and control of one's own thought process) on individuals, which could compromise their experience of seeking health information. However, currently, the specific demands that arise while using conversational agents for health information seeking, and the strategies people use to cope with those demands, remain unknown. To address these gaps, we conducted a think-aloud study with 15 participants as they sought health information using our off-the-shelf AI conversational agent. We identified the metacognitive demands such systems impose, the strategies people adopt in response, and propose considerations for designing beyond off-the-shelf interfaces to reduce these demands and support better user experiences and affordances in health information seeking.
Authors:Yihao Dong, Praneeth Bimsara Perera, Chin-Teng Lin, Craig T Jin, Anusha Withana
Abstract:
Spatial tactile feedback can enhance the realism of geometry exploration in virtual reality applications. Current vibrotactile approaches often face challenges with the spatial and temporal resolution needed to render different 3D geometries. Inspired by the natural deformation of finger pads when exploring 3D objects and surfaces, we propose TactDeform, a parametric approach to render spatio-temporal tactile patterns using a finger-worn electro-tactile interface. The system dynamically renders electro-tactile patterns based on both interaction contexts (approaching, contact, and sliding) and geometric contexts (geometric features and textures), emulating deformations that occur during real-world touch exploration. Results from a user study \rr{(N=24)} show that the proposed approach enabled high texture discrimination and geometric feature identification compared to a baseline. Informed by results from a free 3D-geometry exploration phase, we provide insights that can inform future tactile interface designs.
Authors:Ziheng Huang, Robin Kar, Hari Sundaram, Tal August
Abstract:
User interaction with legal contracts has been limited to document reading, which is often complicated by complex, ambiguous legal language. We explore possible futures where contract interfaces go beyond single document interfaces to (1) educate users with legal rights not stated in the contract, (2) transform legal language into alternative representations to aid information tasks before, during, and after signing, and (3) proactively supply contractual information at relevant moments. We refer to these future interfaces collectively as Living Contracts. Using residential leases as a case study, we created three design probes representing different possible Living Contracts. A three-part qualitative study (N=18) revealed participants' barriers to interacting with contracts, including interpreting complex language, uncertainty about legal rights, and the pressure to sign quickly. Participants' feedback on the probes highlighted how Living Contracts have the potential to address these challenges and open new design opportunities for human-contract interactions beyond document reading.
Authors:Qing, Xia, Marios Constantinides, Advait Sarkar, Duncan Brumby, Anna Cox
Abstract:
Generative AI (GenAI) tools are rapidly transforming knowledge work, making AI literacy a critical priority for organizations. However, research on AI literacy lacks empirical insight into how knowledge workers' beliefs around GenAI literacy are shaped by the social dynamics of the workplace, and how workers learn to apply GenAI tools in these environments. To address this gap, we conducted in-depth interviews with 19 knowledge workers across multiple sectors to examine how they develop GenAI competencies in real-world professional contexts. We found that, while knowledge sharing from colleagues supported learning, the ability to remove cues indicating GenAI use was perceived as validation of domain expertise. These behaviours ultimately reduced opportunities for learning via knowledge sharing and undermined transparency. To advance workplace AI literacy, we argue for fostering open dialogue, increasing visibility of user-generated knowledge, and greater emphasis on the benefits of collaborative learning for navigating rapid technological developments.
Authors:Hansol Lee, AJ Alvero, René F. Kizilcec, Thorsten Joachims
Abstract:
Algorithmic predictions are inherently uncertain: even models with similar aggregate accuracy can produce different predictions for the same individual, raising concerns that high-stakes decisions may become sensitive to arbitrary modeling choices. In this paper, we define algorithmic reliance as the extent to which a decision outcome depends on whether a more favorable versus less favorable algorithmic prediction is presented to the decision-maker. We estimate this in a randomized field experiment (n=19,545) embedded in a selective U.S. college admissions cycle, in which admissions officers reviewed each application alongside an algorithmic score while we randomly varied whether the score came from one of two similarly accurate prediction models. Although the two models performed similarly in aggregate, they frequently assigned different scores to the same applicant, creating exogenous variation in the score shown. Surprisingly, we find little evidence of algorithmic reliance: presenting a more favorable score does not meaningfully increase an applicant's probability of admission on average, even when the models disagree substantially. These findings suggest that, in this expert, high-stakes setting, human decision-making is largely invariant to arbitrary variation in algorithmic predictions, underscoring the role of professional discretion and institutional context in mediating the downstream effects of algorithmic uncertainty.
Authors:Junyi Li, Zhaoxi Zhang, Tamir Mendel, Takahiro Yabe
Abstract:
Sidewalk sheds are a common feature of the streetscape in New York City, reflecting ongoing construction and maintenance activities. However, policymakers and local business owners have raised concerns about reduced storefront visibility and altered pedestrian navigation. Although sidewalk sheds are widely used for safety, their effects on pedestrian visibility and movement are not directly measured in current planning practices. To address this, we developed an AI-based chatbot survey that collects image-based annotations and route choices from pedestrians, linking these responses to specific shed design features, including clearance height, post spacing, and color. This AI chatbot survey integrates a large language model (e.g., Google's Gemini-1.5-flash-001 model) with an image-annotation interface, allowing users to interact with street images, mark visual elements, and provide structured feedback through guided dialogue. To explore pedestrian perceptions and behaviors, this paper conducts a grid-based analysis of entrance annotations and applies logistic mixed-effects modeling to assess sidewalk choice patterns. Analysis of the dataset (n = 25) shows that: (1) the presence of scaffolding significantly reduces pedestrians' ability to identify ground-floor retail entrances, and (2) variations in weather conditions and shed design features significantly influence sidewalk selection behavior. By integrating generative AI into urban research, this study demonstrates a novel method for evaluating sidewalk shed designs and provides empirical evidence to support adjustments to shed guidelines that improve the pedestrian experience without compromising safety.
Authors:Hibiki Ito, Chia-Yu Hsu, Hiroaki Ogata
Abstract:
Secondary use of growing real-world data (RWD) in education offers significant opportunities for research, yet privacy practices intended to enable third-party access to such RWD are rarely evaluated for their implications for downstream analyses. As a result, potential problems introduced by otherwise standard privacy practices may remain unnoticed. To address this gap, we investigate potential issues arising from common practices by assessing (1) the re-identification risk of fine-grained RWD, (2) how communicating such risks influences learners' privacy behaviour, and (3) the sensitivity of downstream analytical conclusions to resulting changes in the data. We focus on these practices because re-identification risk and stakeholder communication can jointly influence the data shared with third parties. We find that substantial re-identification risk in RWD, when communicated to stakeholders, can induce opt-outs and non-self-disclosure behaviours. Sensitivity analysis demonstrates that these behavioural changes can meaningfully alter the shared data, limiting validity of secondary-use findings. We conceptualise this phenomenon as the third-party access effect (3PAE) and discuss implications for trustworthy secondary use of educational RWD.
Authors:Alexander Erlei, Federico Cau, Radoslav Georgiev, Sagar Kumar, Kilian Bizer, Ujwal Gadiraju
Abstract:
AI consumer markets are characterized by severe buyer-supplier market asymmetries. Complex AI systems can appear highly accurate while making costly errors or embedding hidden defects. While there have been regulatory efforts surrounding different forms of disclosure, large information gaps remain. This paper provides the first experimental evidence on the important role of information asymmetries and disclosure designs in shaping user adoption of AI systems. We systematically vary the density of low-quality AI systems and the depth of disclosure requirements in a simulated AI product market to gauge how people react to the risk of accidentally relying on a low-quality AI system. Then, we compare participants' choices to a rational Bayesian model, analyzing the degree to which partial information disclosure can improve AI adoption. Our results underscore the deleterious effects of information asymmetries on AI adoption, but also highlight the potential of partial disclosure designs to improve the overall efficiency of human decision-making.
Authors:Diaoulé Diallo, Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, Tobias Hecking
Abstract:
Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human abilities and safety requirements. \emph{Activation steering} provides a lightweight alternative to prompt engineering and fine-tuning by directly modifying internal activations to guide generation. This research advances the literature in three significant directions. First, while previous work demonstrated the technical feasibility of steering emotional tone using automated classifiers, this paper presents the first human evaluation of activation steering concerning the emotional tone of LLM outputs, collecting over 7,000 crowd-sourced ratings from 190 participants via Prolific ($n=190$). These ratings assess both perceived emotional intensity and overall text quality. Second, we find strong alignment between human and model-based quality ratings (mean $r=0.776$, range $0.157$--$0.985$), indicating automatic scoring can proxy perceived quality. Moderate steering strengths ($λ\approx 0.15$) reliably amplify target emotions while preserving comprehensibility, with the strongest effects for disgust ($η_p^2 = 0.616$) and fear ($η_p^2 = 0.540$), and minimal effects for surprise ($η_p^2 = 0.042$). Finally, upgrading from Alpaca to LlaMA-3 yielded more consistent steering with significant effects across emotions and strengths (all $p < 0.001$). Inter-rater reliability was high (ICC $= 0.71$--$0.87$), underscoring the robustness of the findings. These findings support activation-based control as a scalable method for steering LLM behavior across affective dimensions.
Authors:Suifang Zhou, Qi Gong, Ximing Shen, RAY LC
Abstract:
LLM-assisted technologies are increasingly used to support cognitive processing and information interpretation, yet their role in aiding memory recall, and how people choose to engage with them, remains underexplored. We studied participants who watched a short robbery video (approximating a one-time eyewitness scenario) and composed recall statements using either a default GPT or a guided GPT prompted with a standardized eyewitness protocol. Results show that, in the default condition, participants who believed they had a clearer understanding of the event were more likely to trust GPT's output, whereas in the guided condition, participants showed stronger alignment between subjective clarity and actual recall. Additionally, participants evaluated the legitimacy of the individuals in the incident differently across conditions. Interaction analysis further revealed that default-GPT users spontaneously developed diverse strategies, including building on existing recollections, requesting potentially missing details, and treating GPT as a recall coach. This work shows how GPT-user interplay can subconsciously shape beliefs and perceptions of remembered events.
Authors:Jaron Mink, Lucy Qin, Elissa M. Redmiles
Abstract:
AI-generated media is radically changing the way content is both consumed and produced on the internet, and in no place is this potentially more visible than in sexual content. AI-generated sexual content (AIG-SC) is increasingly enabled by an ecosystem of individual AI developers, specialized third-party applications, and foundation model providers. AIG-SC raises a number of concerns from old debates about the line between pornography and obscenity, to newer debates about fair use and labor displacement (in this case, of sex workers), and spurred new regulations to curb the spread of non-consensual intimate imagery (NCII) created using the same technology used to create AIG-SC. However, despite the growing prevalence of AIG-SC, little is known about its creators, their motivations, and what types of content they produce. To inform effective governance in this space, we perform an in-depth study to understand what AIG-SC creators make, along with how and why they make it. Interviews of 28 AIG-SC creators, ranging from hobbyists to entrepreneurs to those who moderate communities of hundreds of thousands of other creators, reveal a wide spectrum of motivations, including sexual exploration, creative expression, technical experimentation, and in a handful of cases, the creation of NCII.
Authors:Zihan Zhou, Yinan Liu, Yuyang Xie, Bin Wang, Xiaochun Yang, Zezheng Feng
Abstract:
The global shortage and uneven distribution of medical expertise continue to hinder equitable access to accurate diagnostic care. While existing intelligent diagnostic system have shown promise, most struggle with dual-user interaction, and dynamic knowledge integration -- limiting their real-world applicability. In this study, we present DiagLink, a dual-user diagnostic assistance system that synergizes large language models (LLMs), knowledge graphs (KGs), and medical experts to support both patients and physicians. DiagLink uses guided dialogues to elicit patient histories, leverages LLMs and KGs for collaborative reasoning, and incorporates physician oversight for continuous knowledge validation and evolution. The system provides a role-adaptive interface, dynamically visualized history, and unified multi-source evidence to improve both trust and usability. We evaluate DiagLink through user study, use cases and expert interviews, demonstrating its effectiveness in improving user satisfaction and diagnostic efficiency, while offering insights for the design of future AI-assisted diagnostic systems.
Authors:Ruyuan Wan, Changye Li, Ting-Hao 'Kenneth' Huang
Abstract:
Coded language is an important part of human communication. It refers to cases where users intentionally encode meaning so that the surface text differs from the intended meaning and must be decoded to be understood. Current language models handle coded language poorly. Progress has been limited by the lack of real-world datasets and clear taxonomies. This paper introduces CodedLang, a dataset of 7,744 Chinese Google Maps reviews, including 900 reviews with span-level annotations of coded language. We developed a seven-class taxonomy that captures common encoding strategies, including phonetic, orthographic, and cross-lingual substitutions. We benchmarked language models on coded language detection, classification, and review rating prediction. Results show that even strong models can fail to identify or understand coded language. Because many coded expressions rely on pronunciation-based strategies, we further conducted a phonetic analysis of coded and decoded forms. Together, our results highlight coded language as an important and underexplored challenge for real-world NLP systems.
Authors:Ahrii Kim, Seong-heum Kim
Abstract:
Automatic post-editing (APE) aims to refine machine translations by correcting residual errors. Although recent large language models (LLMs) demonstrate strong translation capabilities, their effectiveness for APE--especially under document-level context--remains insufficiently understood. We present a systematic comparison of proprietary and open-weight LLMs under a naive document-level prompting setup, analyzing APE quality, contextual behavior, robustness, and efficiency. Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided. While these models exhibit higher robustness to data poisoning attacks than open-weight counterparts, this robustness also reveals a limitation: they largely fail to exploit document-level context for contextual error correction. Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation. Despite their strong performance, the substantial cost and latency overheads of proprietary LLMs render them impractical for real-world APE deployment. Overall, our findings elucidate both the promise and current limitations of LLM-based document-aware APE, and point toward the need for more efficient long-context modeling approaches for translation refinement.
Authors:Brianna L. Wimer, Ritesh Kanchi, Kaija Frierson, Venkatesh Potluri, Ronald Metoyer, Jennifer Mankoff, Miya Natsuhara, Matt X. Wang
Abstract:
Blind and visually impaired (BVI) computer science students face systematic barriers when learning data structures: current accessibility approaches typically translate diagrams into alternative text, focusing on visual appearance rather than preserving the underlying structure essential for conceptual understanding. More accessible alternatives often do not scale in complexity, cost to produce, or both. Motivated by a recent shift to tools for creating visual diagrams from code, we propose a solution that automatically creates accessible representations from structural information about diagrams. Based on a Wizard-of-Oz study, we derive design requirements for an automated system, Arboretum, that compiles text-based diagram specifications into three synchronized nonvisual formats$\unicode{x2013}$tabular, navigable, and tactile. Our evaluation with BVI users highlights the strength of tactile graphics for complex tasks such as binary search; the benefits of offering multiple, complementary nonvisual representations; and limitations of existing digital navigation patterns for structural reasoning. This work reframes access to data structures by preserving their structural properties. The solution is a practical system to advance accessible CS education.
Authors:Lindsay Popowski, Helena Vasconcelos, Ignacio Javier Fernandez, Chijioke Chinaza Mgbahurike, Ralf Herbrich, Jeffrey Hancock, Michael S. Bernstein
Abstract:
Users trust algorithms more when they can predict the algorithms' behavior. Simple algorithms trivially yield predictively accurate mental models, but modern AI algorithms have often been assumed too complex for people to build predictive mental models, especially in the social media domain. In this paper, we describe conditions under which even complex algorithms can yield predictive mental models, opening up opportunities for a broader set of human-centered algorithms. We theorize that users will form an accurate predictive mental model of an algorithm's behavior if and only if the algorithm simultaneously satisfies three criteria: (1) cognitive availability of the underlying concepts being modeled, (2) concept compactness (does it form a single cognitive construct?), and (3) high alignment between the person's and algorithm's execution of the concept. We evaluate this theory through a pre-registered experiment (N=1250) where users predict behavior of 25 social media feed ranking algorithms that vary on these criteria. We find that even complex (e.g., LLM-based) algorithms enjoy accurate prediction rates when they meet all criteria, and even simple (e.g., basic term count) algorithms fail to be predictable when a single criterion fails. We also find that these criteria determine outcomes beyond prediction accuracy, such as which mental models users deploy to make their predictions.
Authors:Junling Wang, Hongyi Lan, Xiaotian Su, Mustafa Doga Dogan, April Yi Wang
Abstract:
Designing user interfaces (UIs) is a critical step when launching products, building portfolios, or personalizing projects, yet end users without design expertise often struggle to articulate their intent and to trust design choices. Existing example-based tools either promote broad exploration, which can cause overwhelm and design drift, or require adapting a single example, risking design fixation. We present UI Remix, an interactive system that supports mobile UI design through an example-driven design workflow. Powered by a multimodal retrieval-augmented generation (MMRAG) model, UI Remix enables iterative search, selection, and adaptation of examples at both the global (whole interface) and local (component) level. To foster trust, it presents source transparency cues such as ratings, download counts, and developer information. In an empirical study with 24 end users, UI Remix significantly improved participants' ability to achieve their design goals, facilitated effective iteration, and encouraged exploration of alternative designs. Participants also reported that source transparency cues enhanced their confidence in adapting examples. Our findings suggest new directions for AI-assisted, example-driven systems that empower end users to design with greater control, trust, and openness to exploration.
Authors:Qing Zhang, Junyu Chen, Yifei Huang, Jing Huang, Thad Starner, Kai Kunze, Jun Rekimoto
Abstract:
Directional cues are crucial for environmental interaction. Conventional methods rely on symbolic visual or auditory reminders that require semantic interpretation, a process that proves challenging in demanding dual-tasking scenarios. We introduce a novel alternative for conveying directional cues on wearable displays: directly triggering motion perception using monocularly presented peripheral stimuli. This approach is designed for low visual interference, with the goal of reducing the need for gaze-switching and the complex cognitive processing associated with symbols. User studies demonstrate our method's potential to robustly convey directional cues. Compared to a conventional arrow-based technique in a demanding dual-task scenario, our motion-based approach resulted in significantly more accurate interpretation of these directional cues ($p=.008$) and showed a trend towards reduced errors on the concurrent primary task ($p=.066$).
Authors:Rishi Vanukuru, Krithik Ranjan, Ada Yi Zhao, David Lindero, Gunilla H. Berndtsson, Gregoire Phillips, Amy Banić, Mark D. Gross, Ellen Yi-Luen Do
Abstract:
Mobile video calls are widely used to share information about real-world objects and environments with remote collaborators. While these calls provide valuable visual context in real time, the experience of interacting with people and moving around a space is significantly reduced when compared to co-located conversations. Recent work has demonstrated the potential of Mobile Augmented Reality applications to enable more spatial forms of collaboration across distance. To better understand the dynamics of mobile AR collaboration and how this medium compares against the status quo, we conducted a comparative structured observation study to analyze people's perception of space and interaction with remote collaborators across mobile video calls and AR-based calls. Fourteen pairs of participants completed a spatial collaboration task using each medium. Through a mixed-methods analysis of session videos, transcripts, motion logs, post-task exercises, and interviews, we highlight how the choice of medium influences the roles and responsibilities that collaborators take on and the construction of a shared language for coordination. We discuss the importance of spatial reasoning with one's body, how video calls help participants "be on the same page" more directly, and how AR calls enable both onsite and remote collaborators to engage with the space and each other in ways that resemble in-person interaction. Our study offers a nuanced view of the benefits and limitations of both mediums, and we conclude with a discussion of design implications for future systems that integrate mobile video and AR to better support spatial collaboration in its many forms.
Authors:Zhengtao Xu, Junti Zhang, Anthony Tang, Yi-Chieh Lee
Abstract:
Conversational agents are increasingly used in education for learning support. An application is "learning by explaining", where learners explain their understanding to an agent. However, existing research focuses on single roles, leaving it unclear how different pedagogical roles influence learners' interaction patterns, learning outcomes and experiences. We conducted a between-subjects study (N=96) comparing agents with three pedagogical roles (Tutee, Peer, Challenger) and a control condition while learning an economics concept. We found that different pedagogical roles shaped learning dynamics, including interaction patterns and experiences. Specifically, the Tutee agent elicited the most cognitive investment but led to high pressure. The Peer agent fostered high absorption and interest through collaborative dialogue. The Challenger agent promoted cognitive and metacognitive acts, enhancing critical thinking with moderate pressure. The findings highlight how agent roles shape different learning dynamics, guiding the design of educational agents tailored to specific pedagogical goals and learning phases.
Authors:Wanqi Zhang, Jiangen He, Marielle Santos
Abstract:
Job interview anxiety is a prevalent challenge among university students and can undermine both performance and confidence in high-stakes evaluative situations. Social robots have shown promise in reducing anxiety through emotional support, yet how such systems should balance psychological safety with effective instructional guidance remains an open question. In this work, we present a three-phase iterative design study of a robotic interview coach grounded in Person-Centered Therapy (PCT) and instructional scaffolding theory. Across three weekly sessions (N=8), we systematically explored how different interaction strategies shape users' emotional experience, cognitive load, and perceived utility. Phase I demonstrated that a PCT-based robot substantially increased perceived psychological safety but introduced a Safety-Guidance Gap, in which users felt supported yet insufficiently coached. Phase II revealed a Scaffolding Paradox: immediate feedback improved clarity but disrupted conversational flow and increased cognitive load, whereas delayed feedback preserved realism but lacked actionable specificity. To resolve this tension, Phase III introduced an Agency-Driven Interaction Mode that allowed users to opt in to feedback dynamically. Qualitative findings indicated that user control acted as an anxiety buffer, restoring trust, reducing overload, and reframing the interaction as collaborative rather than evaluative. Quantitative measures further showed significant reductions in interview-related social and communication anxiety, while maintaining high perceived warmth and therapeutic alliance. We synthesize these findings into an Adaptive Scaffolding Ecosystem framework, highlighting user agency as a key mechanism for balancing emotional support and instructional guidance in social robot coaching systems.
Authors:Nazar Ponochevnyi, Young-Ho Kim, Joseph Jay Williams, Anastasia Kuzminykh
Abstract:
Recent chart-authoring systems increasingly focus on natural-language input, enabling users to form a mental image of the chart they wish to create and express this intent using spoken instructions (spoken imagined-chart data). Yet these systems are predominantly trained on typed instructions written while viewing the target chart (typed existing-chart data). While the cognitive processes for describing an existing chart arguably differ from those for creating a new chart, the structural differences in the corresponding prompts remain underexplored. We present empirical findings on the structural differences among spoken imagined-chart instructions, typed imagined-chart instructions, and typed existing-chart instructions for chart creation, showing that imagined-chart prompts contain richer command formats, element specifications, and complex linguistic features, especially in spoken instructions. We then compare the performance of systems trained on spoken imagined-chart data versus typed existing-chart data, finding that the first system outperforms the second one on both voice and text input, highlighting the necessity of targeted training on spoken imagined-chart data. We conclude with design guidelines for chart-authoring systems to improve performance in real-world scenarios.
Authors:Shenghan Gao, Junye Wang, Junjie Xiong, Yun Jiang, Yun Fang, Qifan Hu, Baolong Liu, Quan Li
Abstract:
Supply chains (SCs), complex networks spanning from raw material acquisition to product delivery, with enterprises as interconnected nodes, play a pivotal role in organizational success. However, optimizing SCs remains challenging, particularly in partner selection, a key bottleneck shaped by competitive and cooperative dynamics. This challenge constitutes a multi-objective dynamic game requiring a synergistic integration of Multi-Criteria Decision-Making and Game Theory. Traditional approaches, grounded in mathematical simplifications and managerial heuristics, fail to capture real-world intricacies and risk introducing subjective biases. Multi-agent simulation offers promise, but prior research has largely relied on fixed, uniform agent logic, limiting practical applicability. Recent advances in LLMs create opportunities to represent complex SC requirements and hybrid game logic. However, challenges persist in modeling dynamic SC relationships, ensuring interpretability, and balancing agent autonomy with expert control. We present SCSimulator, a visual analytics framework that integrates LLM-driven MAS with human-in-the-loop collaboration for SC partner selection. It simulates SC evolution via adaptive network structures and enterprise behaviors, which are visualized via interpretable interfaces. By combining CoT reasoning with XAI techniques, it generates multi-faceted, transparent explanations of decision trade-offs. Users can iteratively adjust simulation settings to explore outcomes aligned with their expectations and strategic priorities. Developed through iterative co-design with SC experts and industry managers, SCSimulator serves as a proof-of-concept, offering methodological contributions and practical insights for future research on SC decision-making and interactive AI-driven analytics. Usage scenarios and a user study demonstrate the system's effectiveness and usability.
Authors:Ziyi Liu, Xinyi Wang, Shao-Kang Hsia, Chenfei Zhu, Zhengzhe Zhu, Xiyun Hu, Anastasia Kouvaras Ostrowski, Karthik Ramani
Abstract:
As multiple robots are expected to coexist in future households, natural language is increasingly envisioned as a primary medium for human-robot and robot-robot communication. This paper introduces the concept of a Natural Language Environment (NLE), defined as an interaction space in which humans and multiple heterogeneous robots coordinate primarily through natural language. Rather than proposing a deployable system, this work aims to explore the design space of such environments. We first synthesize prior work on language-based human-robot interaction to derive a preliminary design space for NLEs. We then conduct a role-playing study in virtual reality to investigate how people conceptualize, negotiate, and coordinate human-multi-robot interactions within this imagined environment. Based on qualitative and quantitative analysis, we refine the preliminary design space and derive design implications that highlight key tensions and opportunities around task coordination dominance, robot autonomy, and robot personality in Natural Language Environments.
Authors:Minju Park, Seunghyun Lee, Juhwan Ma, Dongwook Yoon
Abstract:
Advances in AI have enabled ESL learners to practice speaking through conversational systems. However, most tools rely on explicit correction, which can interrupt the conversation and undermine confidence. Grounded in second language acquisition and motivational psychology, we present AI Twin, a system that rephrases learner utterances into more fluent English and delivers them in the learner's voice. Embodying a more confident and proficient version of the learner, AI Twin reinforces motivation through alignment with their aspirational Ideal L2 Self. Also, its use of implicit feedback through rephrasing preserves conversational flow and fosters an emotionally supportive environment. In a within-subject study with 20 adult ESL learners, we compared AI Twin with explicit correction and a non-personalized rephrasing agent. Results show that AI Twin elicited higher emotional engagement, with participants describing the experience as more motivating. These findings highlight the potential of self-representative AI for personalized, psychologically grounded support in ESL learning.
Authors:Wanqi Zhang, Jiangen He, Marielle Santos
Abstract:
Social robots hold promise for reducing job interview anxiety, yet designing agents that provide both psychological safety and instructional guidance remains challenging. Through a three-phase iterative design study (N = 8), we empirically mapped this tension. Phase I revealed a "Safety-Guidance Gap": while a Person-Centered Therapy (PCT) robot established safety (d = 3.27), users felt insufficiently coached. Phase II identified a "Scaffolding Paradox": rigid feedback caused cognitive overload, while delayed feedback lacked specificity. In Phase III, we resolved these tensions by developing an Agency-Driven Interaction Layer. Synthesizing our empirical findings, we propose the Adaptive Scaffolding Ecosystem, a conceptual framework that redefines robotic coaching not as a static script, but as a dynamic balance between affective support and instructional challenge, mediated by user agency.
Authors:Agnia Sergeyuk, Eric Huang, Dariia Karaeva, Anastasiia Serova, Yaroslav Golubev, Iftekhar Ahmed
Abstract:
AI-powered coding assistants are rapidly becoming fixtures in professional IDEs, yet their sustained influence on everyday development remains poorly understood. Prior research has focused on short-term use or self-reported perceptions, leaving open questions about how sustained AI use reshapes actual daily coding practices in the long term. We address this gap with a mixed-method study of AI adoption in IDEs, combining longitudinal two-year fine-grained telemetry from 800 developers with a survey of 62 professionals. We analyze five dimensions of workflow change: productivity, code quality, code editing, code reuse, and context switching. Telemetry reveals that AI users produce substantially more code but also delete significantly more. Meanwhile, survey respondents report productivity gains and perceive minimal changes in other dimensions. Our results offer empirical insights into the silent restructuring of software workflows and provide implications for designing future AI-augmented tooling.
Authors:Sai Khadloya, Kush Juvekar, Arghya Bhattacharya, Utkarsh Saxena
Abstract:
Judicial work depends on close reading of long records, charge sheets, pleadings, annexures, orders, often spanning hundreds of pages. With limited staff support, exhaustive reading during hearings is impractical. We present CourtNav, a voice-guided, anchor-first navigator for legal PDFs that maps a judge's spoken command (e.g., "go to paragraph 23", "highlight the contradiction in the cross-examination") directly to a highlighted paragraph in seconds. CourtNav transcribes the command, classifies intent with a grammar-first(Exact regex matching), LLM-backed router classifying the queries using few shot examples, retrieves over a layout-aware hybrid index, and auto-scrolls the viewer to the cited span while highlighting it and close alternates. By design, the interface shows only grounded passages, never free text, keeping evidence verifiable and auditable. This need is acute in India, where judgments and cross-examinations are notoriously long.In a pilot on representative charge sheets, pleadings, and orders, median time-to-relevance drops from 3-5 minutes (manual navigation) to 10-15 seconds; with quick visual verification included, 30-45 seconds. Under fixed time budgets, this navigation-first design increases the breadth of the record actually consulted while preserving control and transparency.
Authors:Hayk Asatryan, Basile Tousside, Janis Mohr, Malte Neugebauer, Hildo Bijl, Paul Spiegelberg, Claudia Frohn-Schauf, Jörg Frochte
Abstract:
Learning Analytics (LA) is nowadays ubiquitous in many educational systems, providing the ability to collect and analyze student data in order to understand and optimize learning and the environments in which it occurs. On the other hand, the collection of data requires to comply with the growing demand regarding privacy legislation. In this paper, we use the Student Expectation of Learning Analytics Questionnaire (SELAQ) to analyze the expectations and confidence of students from different faculties regarding the processing of their data for Learning Analytics purposes. This allows us to identify four clusters of students through clustering algorithms: Enthusiasts, Realists, Cautious and Indifferents. This structured analysis provides valuable insights into the acceptance and criticism of Learning Analytics among students.
Authors:Sophie Villenave, Pierre Raimbaud, Guillaume Lavoué
Abstract:
Thermal feedback is critical to a range of Virtual Reality (VR) applications, such as firefighting training or thermal comfort simulation. Previous studies showed that adding congruent thermal feedback positively influences User eXperience (UX). However, existing work did not compare different levels of thermal feedback quality and mostly used less immersive virtual environments. To investigate these gaps in the scientific literature, we conducted a within-participant user study in two highly-immersive scenarios, Desert Island (n=25) and Snowy Mountains (n=24). Participants explored the scenarios in three conditions (Audio-Visual only, Static-Thermal Feedback, and Dynamic-Thermal Feedback). To assess the complex and subtle effects of thermal feedback on UX, we performed a multimodal analysis by crossing data from questionnaires, semi-structured interviews, and behavioral indicators. Our results show that despite an already high level of presence in the Audio-Visual only condition, adding thermal feedback increased presence further. Comparison between levels of thermal feedback quality showed no significant difference in UX questionnaires, however this result is nuanced according to participant profiles and interviews. Furthermore, we show that although the order of passage did not influence UX directly, it influenced user behavior. We propose guidelines for the use of thermal feedback in VR, and the design of studies in complex multisensory scenarios.
Authors:Xiyuan Zhu, Wenhan Lyu, Chaochao Fu, Yilin Wang, Jie Zheng, Qiyue Tan, Qianhe Chen, Yixin Yu, Ran Wang
Abstract:
Online recruitment platforms have become the dominant channel for modern hiring, yet most platforms offer only basic filtering capabilities, such as job title, keyword, and salary range. This hinders comprehensive analysis of multi-attribute relationships and job market patterns across different scales. We present RecruitScope, a visual analytics system designed to support multidimensional and cross-level exploration of recruitment data for job seekers and employers, particularly HR specialists. Through coordinated visualizations, RecruitScope enables users to analyze job positions and salary patterns from multiple perspectives, interpret industry dynamics at the macro level, and identify emerging positions at the micro level. We demonstrate the effectiveness of RecruitScope through case studies that reveal regional salary distribution patterns, characterize industry growth trajectories, and discover high-demand emerging roles in the job market.
Authors:Nia Touko, Matthew O A Ellis, Cristiano Capone, Alessio Burrello, Elisa Donati, Luca Manneschi
Abstract:
Reliable long-term decoding of surface electromyography (EMG) is hindered by signal drift caused by electrode shifts, muscle fatigue, and posture changes. While state-of-the-art models achieve high intra-session accuracy, their performance often degrades sharply. Existing solutions typically demand large datasets or high-compute pipelines that are impractical for energy-efficient wearables. We propose a lightweight framework for Test-Time Adaptation (TTA) using a Temporal Convolutional Network (TCN) backbone. We introduce three deployment-ready strategies: (i) causal adaptive batch normalization for real-time statistical alignment; (ii) a Gaussian Mixture Model (GMM) alignment with experience replay to prevent forgetting; and (iii) meta-learning for rapid, few-shot calibration. Evaluated on the NinaPro DB6 multi-session dataset, our framework significantly bridges the inter-session accuracy gap with minimal overhead. Our results show that experience-replay updates yield superior stability under limited data, while meta-learning achieves competitive performance in one- and two-shot regimes using only a fraction of the data required by current benchmarks. This work establishes a path toward robust, "plug-and-play" myoelectric control for long-term prosthetic use.
Authors:Yun Ye, Zexuan Li, Panagiotis Angeloudis, S. C. Wong, Jian Sun, Haoyang Liang
Abstract:
Appropriate communication is crucial for efficient and safe interactions between pedestrians and autonomous vehicles (AVs). External human-machine interfaces (eHMIs) on AVs, which can be categorized as allocentric or egocentric, are considered a promising solution. While the effectiveness of eHMIs has been extensively studied, in complex environments, such as unsignalized multi-lane streets, their potential to interfere with pedestrian crossing behavior remains underexplored. Hence, a virtual reality-based experiment was conducted to examine how different types of eHMIs displayed on AVs affect the crossing behavior of pedestrians in multi-lane streets environments, with a focus on the gaze patterns of pedestrians during crossing. The results revealed that the presence of eHMIs significantly influenced the cognitive load on pedestrians and increased the possibility of distraction, even misleading pedestrians in cases involving multiple AVs on multi-lane streets. Notably, allocentric eHMIs induced higher cognitive loads and greater distraction in pedestrians than egocentric eHMIs. This was primarily evidenced by longer gaze time and higher proportions of attention for the eHMI on the interacting vehicle, as well as a broader distribution of gaze toward vehicles in the non-interacting lane. However, misleading behavior was mainly triggered by eHMI signals from yielding vehicles in the non-interacting lane. Under such asymmetric signal configurations, egocentric eHMIs resulted in a higher misjudgment rate than allocentric eHMIs. These findings highlight the importance of enhancing eHMI designs to balance the clarity and consistency of the displayed information across different perspectives, especially in complex multi-lane traffic scenarios. This study provides valuable insights regarding the application and standardization of future eHMI systems for AVs.
Authors:He Liu, Boyuan Gu, Shuaiqi Cheng, Haiyang Sun, Siyu You, Xuming Hu
Abstract:
Large language models (LLMs) increasingly assist in experimental design, yet fluent protocols often remain physically infeasible. We introduce PhysDox, a physical feasibility auditing benchmark for biomedical protocols comprising a 683-sample expert-curated Gold set and a 5,000-sample Silver set across six sensing domains. We formulate the task as a two-stage evaluation: severity detection classifying protocols as valid, minor, or fatal, followed by the constraint-level diagnosis of fatal violations. Evaluating 6 LLMs across 4 inference strategies yields a peak Stage-1 macro-F1 of only 53.0. Moreover, strong oracle diagnosis collapses during end-to-end evaluation due to correlated cascade errors. Error analysis reveals scaffold bias, where models conflate procedural completeness with physical validity. Consequently, implicit constraints exhibit a 2 times higher miss rate than explicit hardware violations, supported by strong statistical correlation at $ρ{=}0.81$ and $p{<}0.01$. Trace analysis of false negatives exposes a 54%--46% split between attention and judgment failures, ultimately demonstrating that protocol auditing demands calibrated feasibility reasoning rather than factual recall or longer rationales.
Authors:Nabin Khanal, Tongyan Wang, Jui-Cheng Chiu, Ningning Nicole Kong, Hannah Yanhua Zong, Yingjie Victor Chen
Abstract:
Digitizing complex documents with handwritten content, irregular tables, and heterogeneous layouts remains challenging, as traditional Optical Character Recognition (OCR) systems fail to capture writing nuances, author-specific conventions, and document structure, and recent LLM-based approaches lack mechanisms for precise, scalable correction. We present an interactive document digitization system that integrates layout-aware parsing, OCR, and LLM-based reconstruction with user-driven refinement. The system is informed by a formative study that identifies key challenges and interaction needs in real-world digitization workflows. It supports both direct edits and natural-language instructions, and introduces a layout-aware propagation mechanism that generalizes user corrections across structurally similar regions. This enables not only efficient error correction but also document re-shaping into structured, analyzable representations. We evaluate the system through a within-subjects user study (n=12) on real-world documents. Results show improved correction efficiency and reduced repetitive effort, demonstrating more effective and controllable document digitization procedure.
Authors:Caitlin Morris, Pattie Maes
Abstract:
AI chat tools are shifting problem-solving and brainstorming conversations away from colleagues and into private AI interactions, reducing the shared awareness that supports team coordination. We introduce InquiryBits, a system that shares minimal summaries of AI conversations within configurable trust boundaries, separating AI-only analysis from human-visible sharing. In a study with 80 professionals, we find that people are broadly willing to share these traces to support collaboration and avoid duplicating work - but only within bounded groups. Comfort drops sharply as audience expands beyond close teams; the level of detail shared matters less than who can see it, with a preference for more detail over less within trusted groups. These findings suggest that trust boundaries, more than information granularity, may be the most impactful design parameter.
Authors:Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge
Abstract:
Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what task was being pursued, which resources were used, and which constraints should have shaped the process. This limited contextual explicitness hinders trust, traceability, and accountability, particularly when Generative AI is embedded in information-intensive workflows such as search, querying, and profile management. This paper introduces From Prompts to Context, an ontology-driven framework for representing Human-Generative AI collaboration. Its core component, the Contextual Collaboration AI Ontology (CCAI), models key elements of collaboration - including tasks, agent roles, resources, and constraints - as a shared machine-interpretable vocabulary. By combining populated CCAI instances with SPARQL-based context retrieval in operational workflows, the framework turns otherwise ephemeral prompt-response interactions into structured and queryable collaboration traces linking prompts, outputs, and their surrounding context. The approach is illustrated through a case study involving a software development team building a competency-based education feature for viewing and updating learner competency profiles. The case study shows how the framework can support the representation and documentation of collaboration episodes across requirements analysis, design, implementation, and testing. Within this setting, the results indicate that explicit collaboration modelling helps make task context more explicit, improves the traceability of AI-generated contributions, and supports more transparent and accountable Human-Generative AI practices. We conclude by outlining design principles for future Human-Generative AI systems that emphasise not only output quality, but also the explicit representation of the collaborative context in which outputs are produced.
Authors:Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel
Abstract:
AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $ρ=0.80$) and student-reported topic difficulty ($ρ=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
Authors:Niharika Mathur, Smit Desai
Abstract:
As AI systems become increasingly conversational, a gap emerges wherein explanations are studied as static artifacts, yet in practice, are experienced as dialogue. In this provocation, we argue that the conversational layer around an explanation is not incidental to its effectiveness, but a critical constituent. Drawing on three illustrative scenarios, we invite the CUI community to study explanations as interactive, conversational exchanges shaped by timing, tone, persona and conversational history, and introduce our vision for Human-Centered Conversational XAI (HC2XAI).
Authors:Abeer Badawi, Will Aitken, Lydia Sequeira, Jocelyn Rankin, Maia Norman, Elham Dolatabadi
Abstract:
Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.
Authors:Salman Khawar, Yingdan Lu, Yilang Peng, Jiyoung Yeon, Cuihua Shen
Abstract:
The rapid proliferation of visual content raises fundamental questions about how different visual formats and features shape perceived credibility. Drawing on processing fluency theory, this research examines how visuals shape credibility judgments. We focus on three popular formats-photos, infographics, and data visualizations-comparing them to text-only posts, and test how two visual features, aesthetic appeal and production quality, influence credibility through processing fluency as a mediating mechanism. Through a preregistered experiment with 1200 US participants, we found that visual posts are generally perceived as more credible than text-only posts but this credibility advantage only applies to photos and infographics, not to data visualizations. Aesthetic appeal increases perceived credibility, partially mediated by processing fluency, while production quality had no significant effect on credibility across formats. These findings differentiate visual formats, advance conceptualizations of visual features, and identify processing fluency as a key mechanism for theorizing credibility across multimodal contexts.
Authors:Fancy Kong, Congjie Zheng, Murphy Zhuang, Rio Yang, Sueky Zhang, Hao Fu, Gene Jin, Song Cao, Kaijie Chen, Andrew Chen, Pony Ma
Abstract:
As personal agents evolve to handle complex, user-centric tasks, static plain-text chat is rapidly becoming a bottleneck. Generative UI emerges as the necessary new interface layer, dynamically synthesizing the right controls, options, and state from the interaction context in real time. We present Macaron-A2UI, a model for Generative UI in personal agents. Our goal is to move beyond text-only interaction by enabling agents to generate natural language together with lightweight, executable UI actions for information collection, preference refinement, confirmation, and multi-goal organization. We build a large-scale Generative UI corpus from heterogeneous dialogue sources, introduce A2UI-Bench for controlled evaluation, and train 30B, 235B and 754B models with parameter-efficient LoRA-based supervised fine-tuning followed by reward-driven reinforcement learning. The best Macaron-A2UI model reaches 75.6 overall on A2UI-Bench without explicit schema hints, surpassing the strongest full-schema frontier baseline. We release the models, benchmark, and evaluation protocol to support future work on Generative UI for personal agents.
Authors:Eugene Yu Ji, Igor Grossmann, Amir-Hossein Karimi
Abstract:
Generative AI research increasingly confronts a shared problem: systems must sustain yet govern their own generative activity when uncertainty is high, evidence is missing, or context is insufficient. This position paper argues that metacognition should become the scientific framework for bounded and effective self governance in generative AI, where output generation is properly evaluated together with the capacities through which generative systems navigate and regulate their own activity. We advance this position by showing that bounded and effective AI self-governance requires metacognitive alignment across computational, algorithmic, and ecological levels. At the computational level, metacognition specifies the meta-level functions a system is meant to serve, such as monitoring, evaluation, control, and adaptation. At the algorithmic level, these functions are realized through procedures such as elicitation, iteration, and modularization. At the ecological level, metacognitive signals become meaningful, actionable, and accountable within the interface, workflow, and accountability arrangements. Metacognition thus makes it possible to conceive generative AI as both capable and well-governed, rather than treating capability and governance as competing aims.
Authors:Alice Gao, Andrew N. Meltzoff, Maarten Sap, Katharina Reinecke
Abstract:
Despite a global user base adopting large language models (LLMs) for daily writing tasks, model suggestions tend to align with Western values. Research has shown users commonly accept a high fraction of these AI suggestions, homogenizing writing styles and rendering outputs more ``Western'' than intended. While this suggests a need to reduce AI reliance, it remains unknown what kind of interventions could achieve this. Can framing the AI with specific values, and comparing it to one's own, make users less susceptible to overreliance and support more unique writing? We tested this hypothesis in a between-subjects online experiment with Indian and American participants (n=149) in which they were asked to perform AI-supported writing tasks, either 1) without an intervention, 2) after seeing an overview of the AI's framed values, or 3) after seeing an overview of the AI's framed values compared to their own. Our results show that seeing the AI's framed values reduces AI reliance, i.e., the proportion of the final essay generated by the AI, by an average of 20\%. Additionally, when participants saw an overview of the AI's framed values (without comparison to their own values), the final essays contain more unique text than without intervention. Our findings emphasize the importance of educating users about potential value biases in AI, showing that raising awareness with a simple overview of values encourages users to personalize their writing.
Authors:Qiyu Li, Yuen Sum Wong, Yuen Kei Wong, Longxuan Yu, Haojian Jin
Abstract:
NIST's Privacy Risk Assessment Methodology (PRAM) provides a structured framework for privacy experts to assess privacy risks. However, its complexity and reliance on expert knowledge make it difficult for novice developers to use effectively. This paper explores methods to lower these barriers. We first performed an observational study with 12 participants using PRAM in real-world scenarios, and found that novice developers struggled most with articulating privacy-related design decisions. We then developed PrivacyAkinator, an interactive tool that helps developers articulate key privacy decisions by answering LLM-generated multiple-choice questions. PrivacyAkinator introduces three innovations: a universal privacy representation that abstracts privacy-related design decisions into data flows and stakeholder interactions; a domain-aware design space mined from 10K privacy-related news articles; and a dynamic question-generation workflow to prioritize relevant questions. Our user study with 24 participants suggests that developers using PrivacyAkinator identified 47% more key decisions in 73% less time compared to PRAM.
Authors:Fateme Rajabiyazdi, Julie Babione, Doreen M. Rabi, Foroozan Daneshzand, Sheelagh Carpendale
Abstract:
Creating supportive technologies for people living with multiple chronic conditions is extremely challenging. These patients are often faced with substantial visible and invisible treatment work as well as their everyday responsibilities, including coordinating across providers, tracking information, and repeating communication in emotionally charged contexts. In the Cumulative Complexity Model (CuCoM), the balance between patient workload and patient capacity shapes what patients can realistically take on, including whether a digital tool can be adopted and sustained. In this paper, we report engagement lessons from implementing MyCareCompass, a patient-facing digital health intervention (DHI) intended to support day-to-day self-management for people living with multiple chronic conditions. We define engagement as patient uptake and sustained use during a two-month pilot study of our platform, drawing on usage analytics and follow-up feedback, and distill three implementation lessons for designing for engagement in complex chronic care.
Authors:Yoshia Abe, Tatsuya Daikoku, Yasuo Kuniyoshi
Abstract:
Artificial intelligence (AI), exemplified by large language models (LLMs), is rapidly approaching and in some cases surpassing human performance across a wide range of cognitive tasks. However, human nature is not limited to intelligence alone; it also encompasses sensibility, including the capacity to perceive and experience beauty in visual scenes. This raises a fundamental question: how humans and AI systems converge or diverge in such aesthetic experiences. Aesthetic evaluation depends not only on objective properties of images but also on internal processes within the observer. As part of ongoing efforts in AI alignment, building upon prior human studies that have examined the relationship between beauty ratings, bodily sensations, and emotions, we adopt a comparable set of questionnaire items and present them to LLMs, enabling a direct comparison between human and AI responses. Our comparative analyses revealed that, while humans and AI exhibited broadly similar patterns in the correlations between beauty ratings and emotions, as well as in the image features they prioritized, notable divergences emerged in both the distribution of emotional responses and the relationship between beauty ratings and bodily sensations. These findings suggest that state-of-the-art LLMs, trained on large-scale textual data, can approximate average human tendencies in aesthetic evaluation to a certain extent. However, they also indicate limitations, particularly in relation to interoceptive aspects, which may reflect insufficient representation in training data or unintended consequences of alignment processes. These findings highlight key challenges for AI alignment and suggest important directions for developing AI systems with human-like aesthetic processing.
Authors:Fangming Cui, Sunan Li, Jiahong Li
Abstract:
On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD). In this paper, we present a brief analysis of the conceptual foundations, methodological innovations, and principled designs underlying recent advances in OPSD for large language models. This discussion, crafted from the perspective of beginners in this field, aims to provide a concise overview of the design principles and emerging patterns of OPSD in LLMs, intended for researchers who are similarly new to this area.
Authors:Hung-Yue Suen, Kuo-En Hung
Abstract:
This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on-camera oral presentation skills at scale. The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching to support deliberate practice. Built on an XGBoost backbone, the ITS maps multimodal inputs (facial, vocal, textual, and oculomotor features) into evidence-based feedback that can be traced back to observable performance cues. Trained on 10,360 Massive Open Online Course (MOOC) video segments, the system achieved rubric-aligned scoring with performance levels comparable to expert ratings (R2 = 0.48-0.61, Spearman's rho = 0.69-0.78, MAE = 0.43-0.57). In a pre-post validation study with 204 adult learners over a 30-day practice window, participants demonstrated significant improvements across all seven BARS dimensions (Cohen's d = 0.39-0.90), with practice frequency showing a strong positive association with posttest performance after controlling for baseline scores and demographics. The results demonstrate how multimodal analytic outputs can be systematically transformed into observable behavioral change through an integrated feedback architecture, advancing explainable and pedagogically grounded ITS design for performance-based competencies.
Authors:Hung-Yue Suen, Kuo-En Hung, Che-Wei Liu, Yu-Sheng Su, Han-Chih Fan
Abstract:
Whether an interviewee's honest and deceptive responses can be detected by facial expression signals in videos has been debated and requires further research. We developed deep learning models enabled by computer vision to extract temporal patterns of job applicants' facial expressions and head movements to identify self-reported honest and deceptive impression management (IM) tactics from video frames in real asynchronous video interviews. A 12- to 15-minute video was recorded for each of N=121 job applicants as they answered five structured behavioral interview questions. Each applicant completed a survey to self-evaluate their trustworthiness on four IM measures. Additionally, a field experiment was conducted to compare the concurrent validity associated with self-reported IMs between our modeling approach and human interviewers. Human interviewers' performance in predicting these IM measures from another subset of 30 videos was obtained by having N=30 human interviewers evaluate three recordings. Our models explained 91% and 84% of the variance in honest and deceptive IMs, respectively, and showed stronger correlations with self-reported IM scores than human interviewers.
Authors:Alankar Atreya, Stefan Sylvius Wanger, Devesh Batra, Robert Hankache, Cristovao Iglesias, Patrick Sinclair, Giulio Pelosio, Michael McMillan, Greig A. Cowan, Raad Khraishi
Abstract:
Banks receive millions of reports of fraud, scams, and disputed transactions every year, making it challenging to accurately direct customers to the appropriate specialist teams for assistance. The existing manual process driven by humans is slow and stressful for both customers and staff. To address this, we develop a customer-facing AI powered triaging agent that leverages large language models (LLMs) to conduct multi-turn conversations, ask relevant questions, and classify cases for accurate, policy-guided routing, making it embedded in the customer journey. To evaluate and continuously improve the agent, synthetic digital twins of real customers were simulated, generating realistic, labelled dialogues based on historical data to test a wide range of real-world scenarios. This work details the triage agent's modelling approach, integration with policy, safety guardrails and reasoning frameworks, the use of the synthetic agent for scalable evaluation, and findings on the AI system's accuracy, robustness, and compliance. Results show that the agent successfully improves triaging of historical cases, achieving a 30.6% increase in classification accuracy, with high satisfaction levels reported by our subject-matter experts, highlighting how targeted probing can lead to more effective triage in banking operations at scale.
Authors:Ines Trautmannsheimer, Ahmed Azab, Frank Diermeyer
Abstract:
Teleoperation promises to extend the operational envelope of automated vehicles, yet it critically depends on network latency and video quality. We report a fixed-base driving-simulator study (N=25) with a 2x2 manipulation of added latency (100/300 ms) and bitrate (500/2000 kbit/s), plus a best-case baseline (0 ms added, 9000 kbit/s). We measured effective glass-to-glass (G2G) latency per condition (baseline approx. 413 ms; effective totals approx. 500-700 ms) and verified stable framerate and encoder settings. Multimodal measures covered performance (speed, steering reversals, crashes), oculomotor behavior (blink rate, fixation duration), physiology (RR interval, heart rate, skin conductance), and subjective workload. Latency and bitrate each increased operator load and modestly affected performance. Physiological measures (heart rate, RR interval) exhibited sub-additive interactions, whereas performance and oculomotor interactions were small or non-significant. Equivalence tests showed that 300 ms with 2000 kbit/s was velocity-equivalent to best-case (SESOI +/- 2 km/h), while 300 ms with 500 kbit/s was not. We argue that latency and video quality should be treated as largely independent design levers, and that physiology-aware adaptation can anticipate overload before safety is compromised.
Authors:George Boateng, Philemon Badu, Patrick Agyeman-Budu, Samuel Ansah, Evans Atompoya, Evan Igwilo, Lord Baah, Frederick Abu-Bonsrah, Victor Wumbor-Apin Kumbol
Abstract:
Recent advances in generative AI have shown their potential to be leveraged for legal education. Yet, work on the development and deployment of such systems for legal education in the Global South is limited. In this work, we developed Eskwai for Students, a generative AI assistant to help law students with their legal education. Eskwai for Students is a retrieval augmented generation (RAG) system that provides answers to a wide range of legal questions for law students grounded in a curated database of over 12K case laws and 1.4K legislation in Ghana. We deployed Eskwai for Students in a longitudinal study of 30 months (2.5 years) used by 3.1K law students in Ghana who made 32K queries. We evaluated the helpfulness of our AI, and provided insight into the kinds of queries law students submit to this generative AI tool, which raises some ethical concerns. This work contributes to an understanding of how law students in the Global South are using generative AI for their studies and the ways it could be leveraged responsibly to advance legal education.
Authors:Chris Davis Jaldi, Anmol Saini, Shan Zhang, Noah Schroeder, Cogan Shimizu, Eleni Ilkou
Abstract:
Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.
Authors:Yoshia Abe, Tatsuya Daikoku, Yasuo Kuniyoshi
Abstract:
Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.
Authors:Zeinabsadat Saghi, Run Huang, Souti Chattopadhyay
Abstract:
Creativity is fundamentally human. As AI takes on more of the generative work that once required human imagination, despite documented limitations in creative ability, a critical question emerges: How does GenAI affect users' creativity? Through a within-subject study followed by retrospective interviews with (N=20) programmers, we investigated the impact of LLMs on participants' process of creative thinking in programming and the creativity of generated solutions. Across two conditions (LLM-assisted vs. unassisted), participants using LLMs had significantly shorter idea-generation periods (p=0.0004), leading to fewer creative moments (p=0.002). Qualitative analysis of participants' interactions and interviews revealed four different human-LLM collaboration modes supporting various problem-solving strategies. However, a comparative analysis of the generated solutions shows that while LLMs can help generate more correct and functional code, their solutions contain roughly the same number of ideas as participant-generated ones. Based on our findings, we discuss design implications and considerations for effectively using LLMs to support user creativity.
Authors:Ines Trautmannsheimer, Richard Grauberger, Frank Diermeyer
Abstract:
Automated driving has made remarkable progress, yet situations still arise where human intervention is necessary. Teleoperation provides a scalable solution to address such cases, enabling remote operators to support vehicles without being physically present. In this context, video transmission forms the operator's primary source of situational awareness, making video quality a decisive factor for both safety and task performance. In an online study, participants rated compressed video sequences from the Zenseact Dataset and provided subjective quality ratings. These ratings were then used to retrain the Video Multi-Method Assessment Fusion (VMAF) model, yielding an adapted variant tailored to teleoperation. The retrained model demonstrated improved alignment with human ratings compared to the original 4K VMAF. In particular, RMSE decreased from 10.36 to 8.83, and MAD from 8.71 to 6.38, corresponding to improvements of 15% and 27%, respectively. These results highlight that incorporating domain-specific data can enhance the predictive power of established quality metrics in safety-critical applications. At the same time, Outlier cases emerged in which videos received high objective scores despite noticeable degradations in regions critical for the driving task.
Authors:Maribeth Rauh, Dick A. H. Blankvoort, Matias Duran, Caoilfhionn Ní Dheoráin, Harshvardhan J. Pandit, Siddharth D. Jaiswal, Anthony Ventresque, Abeba Birhane
Abstract:
The use of chatbots for various forms of companionship is growing rapidly, raising a myriad of questions about simulated relationships, emotional dependence, and psychological harm. While major platforms such as ChatGPT, Grok, and Character.AI are the subject of a growing body of research and legal inquiries, apps explicitly built for simulating intimate interpersonal relationships remain under-explored. In this work, we evaluate the five most popular AI companion mobile applications in the EU and UK markets for factors that encourage parasocial interaction and may manipulate users. We do this by manually annotating the user experience each offers. Specifically, we systematically record and quantify design dark patterns, anthropomorphism, stereotypes, erotica, and technical performance issues. We find that all apps contain substantial dark patterns aimed at increasing monetisation and user engagement. Erotica and gamification features such as levelling are also prevalent, and although other features vary considerably between applications, all apps have highly anthropomorphic design. These findings shed light on the mechanics used to leverage users' simulated relationships. On that basis, we put forward concrete recommendations for regulators to strengthen consumer protection in this rapidly emerging market. Content warning: This article contains objectifying images of women, erotic images, textual references to incest, and other potentially sensitive, offensive, and distressing text.
Authors:Cara A. Spencer, Christopher D. Wickens, Jalynn B. Nicoly, James Crum, Benjamin A. Clegg, Joanna E. Lewis, Francisco R. Ortega, Lucas Plabst, Rebecca L. Pharmer, Leanne Hirshfield
Abstract:
Advance in technology offer the potential for future adoption of a combination of virtual reality (VR) and real-time adaptivity to enhance training and education. Providing a valid neuro-ergonomic measure of cognitive load can enable an adaptive training regime to continuously adjust tas difficulty to an optimal level as training progresses. The current study validated the functional near-infrared spectroscopy (fNIRS) measure of cognitive load to reflect the demands of two different forms of lad within Cognitive Load Theory: extraneous and intrinsic to he task to be mastered. Thirty-six participants completed a VR shape assembly training task followed by a test of their skill retention They wore near-full head coverage fNIRS and provided subjective ratings of ther workload. The fNIRS findings largely corroborate intrinsic workload literature with significant activation in cortical regions (dorsolateral and rostral prefrontal cortex and left angular gyrus) associated with working memory, short term memory buffers, multisensory integration, and attention. These fNIRS results were tracked closely by NASA TLS measures of mental workload. The results also revealed far less brain activity associated with extraneous load, namely just the right angular gyrus, deemed irrelevant to the mastery of the task.
Authors:Xinyu Jessica Wang, Christine P. Lee, Bilge Mutlu
Abstract:
Personalization is crucial for effective learning, yet online learning, designed for widespread availability and open access, lacks personalized guidance. Recent advancements in large language models (LLMs) offer opportunities to bridge this gap. We explore how LLM-driven tools may be designed to support personalized and adaptive learning and examine how they shape user experience and learning outcomes. We iteratively designed \tool{} to support online learning by providing personalized study plans, real-time contextual assistance, and adaptive learning activities. A preliminary study ($n=24$) assessed the effectiveness and usability of \tool{} and informed refinements in our system, which we then evaluated ($n = 16$) against a combination of a state-of-the-art online learning platform and an LLM for learning support. Results indicate that \tool{} advances AI pedagogy by improving both learning outcomes and user experience compared to existing online learning and support tools. This work advances our understanding of the design space of personalized, AI-driven educational tools and their potential impact on user experience.
Authors:Jakob Zethofer, Andreas Hinterreiter, Lukas Schiefermüller, Belgin Mutlu, Marc Streit
Abstract:
We introduce EventColumn, a new column type that integrates event-sequence data with heterogeneous tabular attributes into a single unified table. EventColumn lets analysts compare event sequences alongside numerical, categorical, and temporal attributes at both instance and group levels, offering a compressed overview, heatmap group summaries, alignment by event types, and boxplots of similar historical items. We developed EventColumn together with collaborators from the steel industry to facilitate the analysis of production events and warehouse logistics, but the solution generalizes to a wide range of event sequence datasets with additional tabular attributes. Unlike most existing approaches that compare either event sequences or tables, EventColumn supports simultaneous comparison of both. We demonstrate its integration with Taggle and Microsoft Power BI on data from steel production logistics and on a public e-commerce dataset.
Authors:Amal Alnouri, Andreas Hinterreiter, Christina Humer, Furui Cheng, Marc Streit
Abstract:
Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which can bias outputs in various ways. Understanding how different generation conditions shape model behaviors is essential for tasks such as prompt design and model evaluation, yet it remains challenging due to the stochastic and open-ended nature of text generation. We present an approach to visually compare LLM outputs across generation conditions by modeling responses as collections of linguistic choices, including content, expression, and structure. We extract these choices using natural language processing pipelines and represent their distributions across repeated samples. We then visualize these distributions as visual fingerprints, enabling direct, distribution-level comparison of condition-specific tendencies. Through four usage scenarios, we demonstrate how visual fingerprints reveal consistent patterns in LLM behavior that are difficult to observe through individual responses or aggregate metrics.
Authors:Anahita Golrang, Kshitij Sharma
Abstract:
Pair programming is a widely used collaborative learning practice in computer science education yet its effectiveness varies substantially due to breakdowns in coordination attention and cognitive regulation between partners. This paper investigates whether AI supported feedback grounded in joint visual attention and joint mental effort can improve collaborative programming performance and how feedback timing shapes learner AI interaction. Two experimental studies using dual eye tracking capture real time indicators of collaborative regulation during debugging tasks. Study 1 examines reactive feedback that intervenes when observed joint visual attention or joint mental effort deviates beyond predefined thresholds while Study 2 evaluates proactive feedback that forecasts future regulatory breakdowns using machine learning models and intervenes pre emptively. Across both studies feedback effectiveness is assessed through debugging success time on task and feedback uptake reflected in code changes. Multimodal feedback significantly improves collaborative performance compared to no feedback conditions. Reactive feedback yields strong gains in debugging success and efficiency particularly when joint visual attention and joint mental effort based feedback are combined. Proactive forecast based feedback further enhances performance reduces time on task and increases constructive feedback uptake while relying less on intrusive interventions. Proactive feedback better preserves learner agency by maintaining optimal collaboration states, particularly for high-performing pairs. These findings demonstrate that gaze and mental effort synchrony can serve as reliable actionable triggers for AI supported collaborative learning highlighting the importance of feedback timing transparency and anticipatory regulation in supporting effective pair programming.
Authors:Anahita Golrang, Kshitij Sharma, Halszka Jarodzka, Senne Van Hoecke
Abstract:
Adaptive learning technologies increasingly rely on real time physiological analytics to trigger instructional support automatically yet how system driven decisions interact with learners ongoing problem solving processes remains poorly understood. Eye Movement Modeling Examples have shown promise as attention guidance tools but have been studied predominantly as static instructional materials rather than as adaptive scaffolds whose timing and initiation control can vary. This study investigates whether scaffold initiation mode shapes EMME effectiveness in novice programmers debugging and specifically whether automated triggering based on a single physiological indicator of low mental effort is a viable basis for adaptive scaffold delivery. A between subjects experiment was conducted with 120 undergraduate computer science students randomly assigned to one of four conditions: teacher initiated, learner initiated, automated or no scaffold control. Participants completed ten Python debugging tasks while eye tracking data, video interaction logs and performance scores were recorded. All EMME conditions outperformed the control. However human mediated initiation whether teacher or learner consistently produced higher performance than automated triggering and more integrative engagement with the EMME material. Automated triggering based on sustained low pupillary activity was associated with disruptive behavioral patterns suggesting mistimed delivery. EMME also eliminated the performance advantage of prior programming knowledge across all initiation modes. These findings establish scaffold initiation timing and control as critical design variables for EMME and adaptive learning technologies more broadly and demonstrate that a single low effort physiological threshold is insufficient as a trigger criterion for complex problem solving support.
Authors:Anahita Golrang, Kshitij Sharma
Abstract:
Debugging is a demanding aspect of programming yet guidance on how to teach it effectively remains limited. Novices often struggle to recognize impasses regulate their problem solving and manage cognitive load and stress. This study investigates whether real time multimodal feedback triggered by indicators of cognitive load and physiological stress can improve debugging performance narrow expert novice gaps and reduce the influence of prior programming experience on success. We conducted a between subjects experiment with 120 undergraduate computer science students who debugged a medium sized Python program. Participants were assigned to one of four conditions no feedback cognitive load triggered feedback stress triggered feedback or combined trigger feedback. Eye tracking and heart rate variability data were used to detect moments of struggle and automatically deliver brief context sensitive hints. All three feedback conditions significantly improved debugging success and efficiency compared with the control group. Cognitive load triggered feedback produced stronger gains than stress triggered feedback and the combined trigger condition yielded the largest improvements. Programming expertise predicted performance only in the control condition and in all feedback conditions the novice expert gap was markedly reduced. Adaptive feedback that responds to learners cognitive and affective states can help manage debugging demands and reduce performance differences linked to prior experience highlighting opportunities for physiologically aware adaptive learning environments.
Authors:Anahita Golrang, Kshitij Sharma
Abstract:
Grounded in socially shared regulation of learning (SSRL), this paper investigates how joint mental effort (JME) and joint visual attention (JVA) serve as process-level indicators of shared regulation in pair programming and how AI-driven adaptive feedback can strengthen these processes. We present three eye-tracking studies involving 182 dyads engaged in collaborative debugging tasks. Study 1 examines natural collaboration and shows that high-performing dyads exhibit significantly higher JME and JVA, a greater prevalence of productive high-JME-high-JVA episodes, and a stable causal relationship in which JME predicts JVA. Study 2 evaluates reactive adaptive feedback based on real-time deviations in JME and/or JVA. Results show that combined feedback targeting both dimensions yields the strongest improvements in performance, regulatory coherence, and cognitive-to-attentional causality, outperforming single-channel feedback. Study 3 introduces proactive, forecast-based feedback using machine-learning predictions of future collaboration states. Proactive support further enhances performance and sustains shared regulation by anticipating breakdowns before they manifest. Across studies, causal modeling reveals that cognitive alignment systematically drives attentional coordination in successful collaboration, while mismatches between effort and attention characterize unproductive regulation. Methodologically, this work integrates dual eye-tracking, pupillometry, episode-based analysis, and causal inference to capture SSRL as a dynamic, emergent process. Conceptually, the findings position AI not as an automated controller, but as an intelligence-augmenting co-regulator that supports learners' capacity to coordinate effort, attention, and understanding together.
Authors:Christine P. Lee, Min Kyung Lee, Bilge Mutlu
Abstract:
While AI is often introduced into organizations to drive innovation and efficiency, many adoption efforts fail as workers resist and struggle to integrate these systems. These failures point to a deeper issue: workers, the very people expected to collaborate with AI, are often invisible in decisions about how AI is designed and used. Drawing on interviews with professionals who interact with AI systems daily in healthcare, finance, and management, we examine the disconnect between organizational expectations and worker experiences. We identify key barriers, including poor usability and interoperability, misaligned expectations, limited control, and insufficient communication. These challenges highlight a gap between how organizations implement AI and the evolving worker needs, tasks, and workflows that it fails to support. We argue that successful adoption requires recognizing workers as central to AI integration and propose adaptation strategies at the individual, task, and organizational levels to better align AI systems with real-world practices.
Authors:Stephen N. Freund, Emery D. Berger, Cormac Flanagan, Eunice Jun
Abstract:
Computational notebooks are notoriously prone to reproducibility failures. By permitting out-of-order cell execution, notebooks accumulate hidden state and implicit dependencies that cause interactive executions to silently diverge from clean top-to-bottom runs. Prior approaches either employ dependency analyses or enforce reactive dataflow models that face fundamental tradeoffs among expressiveness, precision, and performance. This paper exploits the insight that reproducibility can be enforced without precise dependency tracking: a notebook is reproducible if and only if executing its cells in top-to-bottom order from an empty store produces exactly the outputs currently recorded. We formalize this notion of reproducibility and present FlowBook, which implements a dynamic analysis that enforces reproducibility by tracking read and write sets at cell boundaries. FlowBook detects stale cells whose recorded outputs may no longer reflect the current notebook state and prevents operations that would violate reproducibility. FlowBook incurs near-imperceptible latency overhead (median: 70 ms).
Authors:Zeinabsadat Saghi, Daria Riabukhina, Olubukola Akinbami, Paul Bogdan, Souti Chattopadhyay
Abstract:
Cognitive fatigue, which transitions from focused attention to inexact responses, can cause catastrophic failures in high-stakes environments, yet current black-box assessment techniques ignore the brain's non-Markovian and time-varying interdependent properties, limiting real-time phase transition detection. We develop a fractional dynamical networks-based machine learning (FDNML) framework using coupled fractional-order differential equations to capture brain signal interdependencies and detect cognitive fatigue transitions in real-time. Multifractal properties of brain activity exhibit distinct generalized fractal dimension signatures across fatigue levels, with Wasserstein distances of 0.10, 0.13, and 0.08 between states 0-1, 1-2, and 0-2, respectively. The framework achieves 93.33% classification accuracy and 95% AUROC, enabling the prevention of performance degradation through early detection of neural state transitions.
Authors:Riley Grossman, Songjiang Liu, Michael K. Chen, Mike Smith, Cristian Borcea, Yi Chen
Abstract:
Generative AI is being increasingly integrated into web search for the convenience it provides users. In this work, we aim to understand how generative AI disrupts web search by retrieving and presenting the information and sources differently from traditional search engines. We introduce a public benchmark dataset of 11,500 user queries to support our study and future research of generative search. We compare the search results returned by Google's search engine, the accompanying AI Overview (AIO), and Gemini Flash 2.5 for each query. We have made several key findings. First, we find that for 51.5\% of representative, real-user queries, AIOs are generated, and are displayed above the organic search results. Controversial questions frequently result in an AIO. Second, we show that the retrieved sources are substantially different for each search engine (<0.2 average Jaccard similarity). Traditional Google search is significantly more likely to retrieve information from popular or institutional websites in government or education, while generative search engines are significantly more likely to retrieve Google-owned content. Third, we observe that websites that block Google's AI crawler are significantly less likely to be retrieved by AIOs, despite having access to the content. Finally, AIOs are less consistent when processing two runs of the same query, and are less robust to minor query edits. Our findings have important implications for understanding how generative search impacts website visibility, the effectiveness of generative engine optimization techniques, and the information users receive. We call for revenue frameworks to foster a sustainable and mutually beneficial ecosystem for publishers and generative search providers.
Authors:Sebastiano Franchini, Alexis Carrillo, Edoardo Sebastiano De Duro, Riccardo Improta, Ali Aghazadeh Ardebili, Massimo Stella
Abstract:
We introduce Target-Event-Agent Networks (TEA Nets) as a computational framework to extract subjects (``Agents"), verbs (``Events"), and objects (``Targets") from texts. Grounded in cognitive network science and artificial intelligence, TEA Nets are implemented as an open-source Python library. We test TEA Nets in three case studies, demonstrating the framework's ability to perform interpretable emotion detection, semantic frame analyses, and linguistic inquiries across conspiracy texts and textual responses generated by LLMs. In the LOCO conspiracy corpus, TEA Nets revealed that highly conspiratorial narratives (4,227 texts) linked personal pronouns (``I", ``you", ``we") with the same actions twice as frequently as low-similarity conspiracy narratives. High-conspiracy narratives connected person-focused elements (``you", ``people") through actions eliciting anger above the random baseline ($z = 2.63, p < .05$), a trend absent in low-similarity conspiracy narratives, which emphasized scientific actors (``researcher", ``scientist"). In the HOPE and CounseLLMe datasets of 212 (human) and 200 (LLM-based) psychotherapy transcripts, respectively, TEA Nets highlighted emotional differences. When expressing feelings, Claude 3 Haiku, GPT-3.5, and humans used sad words with higher frequency than random expectations but Haiku expressed sadness with lower emotional intensity than humans ($U = 1243.5, p = .036$). We discuss these differences in the context of psychotherapy training on LLM-simulated patients. Our results show that Target-Event-Agent Networks can extract relevant emotional, syntactic, and semantic insights from narratives, opening new avenues for text analysis with cognitive network science.
Authors:Ali Aghazadeh Ardebili, Massimo Stella
Abstract:
Large Language Models (LLMs) can strongly shape social discourse, yet datasets investigating how LLM outputs vary across controlled social and contextual prompting remain sparse. Cognitive Digital Shadows (CDS) is a 190,000-record synthetic corpus supporting analyses of LLM-generated discourse. Each CDS record is generated by one of 19 LLMs, prompted to shadow either a human persona or an AI-assistant role. CDS contains LLM responses on 4 controversial societal topics: vaccines/healthcare, social media disinformation, the gender gap in science, and STEM stereotypes. Persona-conditioned records encode 17 sociodemographic and psychological attributes, providing data linking LLMs' prompts, language, stances and reasoning. Texts are validated for topic anchoring and can support emotional analyses via interpretable NLP (e.g. textual forma mentis networks). CDS is enriched by a pooling platform with user-friendly dashboards, enabling easy, interactive group-level comparisons of emotional and semantic framing across personas, topics and models. The CDS prompting framework supports future audits of LLMs' bias, social sensitivity and alignment.
Authors:Naomi Esposito, Anthony Tricarico, Luisa Porzio, Ali Aghazadeh Ardebili, Massimo Stella
Abstract:
To enhance LLMs' impact on math education, we need data on their mathematical prowess and biases across prompts. To fill this gap, we introduce MEDS (Math Education Digital Shadows) as a dataset mapping how large language models reason about and report mathematics across human- and AI-like conditions. MEDS involves 28,000 personas from 14 LLMs (from families like Mistral, Qwen, DeepSeek, Granite, Phi and Grok) shadowing either humans or AI assistants. Each record/shadow includes a set of prompts along with psychological/sociodemographic persona metadata and four types of math tasks: (i) open math interview, (ii) three psychometric tests about math perceptions with explanations, (iii) cognitive networks capturing math attitudes, and (iv) 18 high-school math test questions together with their reasoning and confidence scores. MEDS differs from traditional score-only math benchmarks because it integrates concepts of self-efficacy, math anxiety, and cognitive network science besides math proficiency scores. Data validation shows that the sampled LLMs exhibit schema integrity and consistent personas, together with family-specific peculiarities like human-like negative math attitudes, logical fallacies, and math overconfidence. MEDS will benefit learning analytics experts, cognitive scientists, and developers of safer AI tutors in mathematics.
Authors:Boyuan Gu, Yijin Yang, Shuaiqi Cheng, Xiaorong Ding
Abstract:
Cuffless blood pressure (BP) estimation based on Pulse Transit Time (PTT) has emerged as a promising solution for continuous health monitoring. However, conventional models relying on the Moens-Korteweg equation often fail during rapid hemodynamic fluctuations, as they assume arterial walls are purely elastic and neglect inherent viscoelasticity. To address this limitation, we propose a physics-informed framework introducing a viscoelastic compensation mechanism. First, raw photoplethysmogram (PPG) signals undergo high-fidelity reconstruction using Modified Akima (Makima) interpolation. Second, a robust Intersecting Tangent Method is applied for precise pulse foot localization. Crucially, we utilize Ensemble Empirical Mode Decomposition (EEMD) to isolate high-frequency Intrinsic Mode Functions (IMFs), defining a ``Viscoelastic Velocity Metric'' to quantify the vascular damping effect ($η\cdot \dotε$) typically ignored by elastic models. The framework was rigorously validated on a challenging subset of the MIMIC-II database (364 subjects, 28,525 cardiac cycles) characterized by a high prevalence of hypertension (23.4\%). Experimental results demonstrate medical-grade accuracy, yielding a Root Mean Square Error (RMSE) of 5.22 mmHg for Systolic and 3.65 mmHg for Diastolic BP, with Pearson correlation coefficients ($R > 0.97$). These findings confirm that incorporating viscoelastic features significantly enhances robustness against vascular hysteresis.
Authors:Leif Johnson, Trent Victor, Johan Engström
Abstract:
We present the Field of Safe Motion (FSM), a quantitative safety model for determining whether a driver maintains a collision-free escape route, or "out," at any given moment by accounting for that driver's physical capabilities and the foreseeable actions of other road users. The Field of Safe Travel (FST) provides a framework for representing the types of sensory information and actions available to drivers. However, the FST has remained conceptual in nature since its initial publication almost 90 years ago -- and a concrete computational operationalization is still lacking. At the same time, reachability analysis provides a quantitative basis for assessing the possible actions available to road users, using interpretable kinematic models, but reachability models have so far remained confined largely to the engineering and robotics literature. Bringing these two approaches together provides for an interpretable, quantitative tool for assessing driving behavior across a wide range of driving scenarios. Beyond being interpretable, our approach relies on a relatively small set of basic assumptions that are easy to enumerate and reason about. Furthermore, an interpretable reachability model paired with kinematic assumptions provides a way to bound uncertainty about road users' reasonably foreseeable future locations. We demonstrate the applicability of the FSM to different driving scenarios and discuss the strengths and weaknesses of the model.
Authors:Surabhi S Nath, Vindula Jayawardana, Monica Van, Matt Klenk, Shabnam Hakimi
Abstract:
The creative design process involves transforming abstract goals into concrete outcomes through a series of decisions made under constraints. While such processes are commonly shaped by feedback like rewards, their impact on design decision making remains unclear. To better understand the role of rewards in the design process, we modeled a 3D parametric, goal-based chair design task as a Markov Decision Process. We tracked participants' decisions as they iteratively developed designs for an abstract design goal, and presented either a goal-aligned or goal-agnostic reward at every step. We tested the effect of these rewards on task behaviour and self-reported experience. With rewards, participants more thoroughly explored the design space, and maximised goal-aligned over goal-agnostic rewards while preserving diversity across designs. The nature of the goal also mattered, influencing participants' perception of the reward's usefulness. Building on these insights, we propose guidelines for designing effective feedback for design decision making.
Authors:Muhammad Raees, Konstantinos Papangelis
Abstract:
While human-AI decision-making research has primarily used trust measurements to assess the practical usage of AI systems by their end-users, recent empirical evidence suggests that trust measurements do not inform users' appropriate reliance on AI systems. While examining the human-AI decision-making literature, in this work, we review empirical studies that assess people's appropriate reliance on AI advice, differentiating measurements and constructs of appropriate reliance from trust and mere reliance. Our analysis of literature shows that constructs for human-AI appropriate reliance are still fragmented in research. We present three views on appropriate reliance, namely Traditional, Appropriateness, and Dominance, as discussed in research. Using these views, we evaluate objective metrics reported in studies and argue for their consensus to facilitate the comparison across empirical research. We also discuss how studies employ objective metrics and examine their validity in application contexts. Our work contributes to the critical body of research on exploring objective metrics for assessing humans' appropriate reliance on AI advice.
Authors:Jui-Cheng Chiu, Yu-Chao Wang, Shengyang Luo, Tongyan Wang, Qi Yang, Nabin Khanal, Yingjie Victor Chen
Abstract:
Appreciating multi-figure paintings requires understanding how characters relate through subtle cues like gaze alignment, gesture, and spatial arrangement. We present MIRAGE, an evidence-centric framework designed to scaffold the exploration of these "micro-interactions" in multi-figure artworks. While such cues are essential for deep narrative appreciation, they are often distributed across complex scenes and difficult for viewers to systematically identify. Existing vision-language models (VLMs) frequently fail to provide reliable assistance, offering ungrounded interpretations that lack traceable visual evidence. MIRAGE addresses this by constructing a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. However, the challenge extends beyond extracting these cues to coordinating them during interpretation. Without an explicit mechanism to organize and reconcile relational evidence, models often collapse multiple interaction hypotheses into a single unstable or weakly grounded narrative, even when low-level signals are available. This representation allows users to verify how high-level interpretations are anchored in low-level visual facts. By separating spatial grounding from narrative generation, MIRAGE enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. We evaluate MIRAGE against painting-only VLM baselines using a blind assessment protocol. Results show that MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases the coverage of subtle interactions. These findings suggest that structured grounding can serve as a critical interaction control layer, providing the necessary scaffolding for a more reliable, transparent, and human-led understanding of complex visual narratives.
Authors:Nguyen Luong, Talayeh Aledavood
Abstract:
Daily life is structured by recurring routines that coordinate biological rhythms with social and occupational demands. Individual differences in work schedules, family obligations, and social commitments produce distinctive ways of organizing activities throughout the day. Do people have typical days with certain arrangement of activities? How often do these typical days or routines occur and does this differ from person to person? We introduce a framework for quantifying such recurring routines, their persistence over time and their distinctiveness for different people. We model consecutive days in one's life as a sequence of different types of typical days, i.e. routines. Characterizing each day through patterns of activities common among all people - sleep, movement, and device use - we identify a small set of routine types that capture the dominant structure of everyday behavior. We then test whether individuals maintain stable, person-specific distributions over these types and transition between them in characteristic ways. Validating this framework with passive sensing data from 1,086 participants across 153,000 person-days in three longitudinal studies, we find that daily life typically resolves into approximately eight routine types and each person maintains a characteristic distribution over these types. Both the time allocation across routine types and the day-to-day transition dynamics are substantially more similar within individuals than between them, remaining stable across observation windows spanning weeks to months and across populations differing in age, occupation, and health status. Routine persistence shows modest associations with personality traits such as conscientiousness, but is broadly similar across age and gender. Our findings establish routine patterns as stable, person-specific behavioral fingerprints with applications in personalized health monitoring.
Authors:Ibrahim Bilau, Eunhwa Yang, Hyeokhyen Kwon, Stacie Smith, Bruce Walker, Hui Cai, Ece Erdogmus, Omobolanle Ogunseiju
Abstract:
This study examines how visual accessibility through cabinet design influences task performance, cognitive load, physical activity level, motivation, and user experience in a virtual kitchen among older adults with and without mild cognitive impairment (MCI). Seventeen older adults (7 with MCI, 10 without) completed a repeated-measures item retrieval task under two conditions, closed cabinets and open shelving, using a counterbalanced within-subjects design. Measures included task duration, physical activity level (ENMO), cognitive load (NASA-TLX and gaze entropy), intrinsic motivation (IMI), and post-task interviews. Open shelving significantly reduced task duration (beta = -291.20, p < .001) and physical activity level (beta = -0.00615, p = .008). Gaze entropy increased (beta = 1.29, p = .001), with a significant Setting x MCI interaction (p = .009) and moderation by MoCA score (p < .001). NASA-TLX and intrinsic motivation did not differ significantly between conditions. Qualitative findings indicated reduced reliance on memory-based search and highlighted themes related to independence, aesthetics, safety, and adoption. Overall, visual accessibility improved efficiency and reduced movement demands while altering visual-search organization, with divergence between subjective and objective indicators of cognitive load. These findings support visually accessible design strategies to enhance functional performance and inform cognitively supportive built environments for aging populations.
Authors:Veronica Ruozzi, Giovanni Battista Regazzo, Maria Chiara Palumbo, Wim-Alexander Beckers, Mouloud Ourak, Xiu Zhang, Francesca Perico, Alessandro Caimi, Emmanuel Vander Poorten, Emiliano Votta
Abstract:
Purpose: Developing and testing a framework that integrates real-time catheter shape reconstruction, interactive simulations, and mixed reality visualization to enable accurate monitoring of catheter-vessel interactions during endovascular navigation. Methods: A finite element model (FEM) of the venous pathway from the right femoral vein to the inferior vena cava was generated from computed tomography data and implemented into an interactive simulation. Catheter motion was imposed as boundary condition, and catheter-vessel contact was modeled with a Lagrange multiplier formulation to compute vessel deformation. The framework was tested in-vitro using a sensorized catheter with Fiber Bragg Grating and electromagnetic sensors as it was advanced through a silicone replica of the vascular anatomy. Real-time sensor read-outs fed the simulation, and the updated catheter and vessel geometries were streamed to Hololens 2. The performance and accuracy of FEM-computed vessel wall displacement were validated against experimental ground-truth obtained via stereo frames triangulation. Results: The simulated time exceeded the real temporal extent by 12% during initial navigation and by 45% when the catheter reached the most tortuous portion. Hololens 2 rendering remained stable at 35-40 frames per second. The median relative displacement error between FEM-computed and ground-truth vessel wall displacements remained below 1 mm and 2.33 mm for these two phases, respectively. Conclusion: The study demonstrates the feasibility of integrating interactive biomechanical simulation with real-time sensor data to enable continuous monitoring of catheter-vessel interactions, with mixed reality visualization serving as a user interface to support operator decision-making.
Authors:Siyi Li, Yue Jiang, Bowen Jing, Liuyuxin Yang, Yuhe Zhang
Abstract:
Advancements in 3D modeling,digital display technologies,and the growing availability of digital cultural heritage data have significantly improved the accuracy of heritage depictions and expanded opportunities for analysis.However,while many studies focus on presenting specific cultural heritage figurines,an often overlooked aspect is the visualization of the Terracotta Warriors as a unified entity.This involves concisely representing the distribution of features and their relationships,providing a clear and insightful presentation that engages practitioners, academics,and wider audiences.To tackle the challenges mentioned above,this research seeks to explore the application of AI methods in processing cultural heritage data.It aims to optimize and augment the dataset,analyze the distribution and relationships of various attributes, and interpret the analysis results through visualization techniques.The Terracotta Warriors,among China's most significant cultural heritages and renowned for their abundance,exquisite workmanship,and magnitude,are chosen as a case study.The contribution of this paper is primarily twofold.Firstly,we constructed a dataset of Terracotta Warriors from Pit No.1,detailing the attributes significant for identifying different Terracotta Warriors.Secondly,we employ various AI methods,such as generative adversarial network and random forest,to process and analyze these attributes,followed by visualizing the analysis results for an intuitive presentation.This study introduces a novel scheme for presenting information on a collection of cultural relics,offering a practical case for analyzing and visualizing the Terracotta Warriors'attributes as a whole entity,rather than showcasing individual relics'information in isolation.
Authors:Biswesh Mohapatra, Giovanni Duca, Laurent Romary, Justine Cassell
Abstract:
Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call \emph{representational blur}, in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.
Authors:Mingwei Li, Suyang Li, Daisuke Sakurai, Bei Wang, Remco Chang
Abstract:
Generative AI has demonstrated significant potential in creative design, enabling the rapid generation of visual content and imaginative concepts. Although deep AI models achieve effective featurization in the latent space, navigating the space remains a challenge. Current techniques, such as GANSlider and SliderSpace, use multiple sliders to generate high-dimensional vectors in generative AI's latent space. Despite applying (global) PCA to reduce the number of sliders, these approaches struggle with scalability and usability as the number of control dimensions increases. In this paper, we introduce LatentGandr, a visual analytics technique that facilitates latent space exploration by extracting locally linear dimensions from embeddings in high-dimensional latent spaces. By analyzing the topology and local curvature of the embeddings, LatentGandr automatically identifies local neighborhoods and computes their principal components using localized PCA. These local principal components are visualized as interactive image grids, allowing users to efficiently explore and control the generative process, providing an intuitive means to refine the generation of novel content and concepts. To evaluate the effectiveness of LatentGandr, we conducted a study comparing it to GANSlider, the current state-of-the-art visualization interface for generative AI models. The results offer insights into how localized exploration techniques can enhance user interaction with these models.
Authors:Anjali Singh, Christopher Brooks, Warren Li, Juho Kim, Xu Wang
Abstract:
Generating hints for incorrect code is a cognitively demanding task that fosters learning and metacognitive development. This study investigates three designs for personalized, scalable, and reflective hint-writing activities within a data science course: (i) writing a hint independently, (ii) writing a hint with on-demand AI assistance, and (iii) deferred AI assistance, in which students first write a hint independently and then revise it with the help of an AI-generated one. We examine how AI support can scaffold the learning process without diminishing students' productive cognitive effort. Through a randomized controlled experiment with graduate-level students (N=97), we found that deferring AI assistance leads to the highest-quality hints. Further, this design helps students identify a wide range of mistakes they otherwise struggle to identify without any AI assistance. Students valued these activities as opportunities to practice debugging and critically engage with AI outputs--skills that are now critical for learners to acquire as programming becomes increasingly automated and the use of AI for learning grows. Our findings also highlight key considerations for designing student-AI collaborative learning experiences to sustain student engagement, maintain appropriate cognitive load, and mitigate negative effects of AI, such as introducing redundancies and extraneous information into student work.
Authors:Lei Wang, Min Huang, Eduard Dragut
Abstract:
Thematic analysis is difficult to scale: manual workflows are labor-intensive, while fully automated pipelines often lack controllability and transparent evaluation. We present \textbf{CentaurTA Studio}, a web-based system for self-improving human--agent collaboration in open coding and theme construction. The system integrates (1) a two-stage human feedback pipeline separating simulator drafting and expert validation, (2) persistent prompt optimization that distills validated feedback into reusable alignment principles, and (3) rubric-based evaluation with early stopping for process control. Across three domains, CentaurTA achieves the strongest performance in both Open Coding and Theme Construction, reaching up to 92.12\% accuracy and consistently outperforming baseline systems. Agreement between the rubric-based LLM judge and human annotators reaches substantial reliability (average $κ= 0.68$). Ablation studies show that removing the feedback loop reduces performance from 90\% to 81\%, while eliminating the Critic or early stopping degrades accuracy or increases interaction cost. The full system reaches peak performance within 10 iterative rounds (about 25 minutes), demonstrating improved efficiency over expert-only refinement.
Authors:Wenzheng Zhao, Ruth Palan Lopez, Shu Fen Wung, Fengpei Yuan
Abstract:
We present Speaking Memories, a distributed, stakeholder-in-the-loop robotic interaction platform for personalized cognitive exercise support. Rather than a single robot-centric system, Speaking Memories is designed as a generalizable robotics architecture that integrates caregiver-authored knowledge, local edge intelligence, and embodied robotic agents into a unified socio-technical loop. The platform fuses auditory, visual, and textual signals to enable emotion-aware, personalized dialogue, while decoupling multimodal perception and reasoning from robot-specific hardware through a local edge interaction server. This design achieves low-latency, privacy-preserving operation and supports scalable deployment across heterogeneous robotic embodiments. Caregivers and family members contribute structured biographical knowledge via a secure cloud portal, which conditions downstream dialogue policies and enables longitudinal personalization across interaction sessions. Beyond real-time interaction, the system incorporates an automated multimodal evaluation layer that continuously analyzes user responses, affective cues, and engagement patterns, producing structured interaction metrics at scale. These metrics support systematic assessment of interaction quality, enable data-driven model fine-tuning, and lay the foundation for future clinician- and caregiver-informed personalization and intervention planning. We evaluate the platform through real-world deployments, measuring end-to-end latency, dialogue coherence, interaction stability, and stakeholder-reported usability and engagement. Results demonstrate sub-6-second response latency, robust multimodal synchronization, and consistently positive feedback from both participants and caregivers. Furthermore, subsets of the dataset can be shared upon request, subject to participant consent and IRB constraints.
Authors:Yinan Wu, Ze Shi Li, Kathryn Thomasset Stolee, Bowen Xu
Abstract:
Artificial Intelligence (AI) is reshaping how developers adopt software engineering practices, yet the multi-dimensional nature of developer-AI interaction remains under-explored. Prior studies have primarily examined dimensions observable from developer activities such as "Prompt crafting" and "Code Editing", overlooking how hidden intentions and emotional dimensions intertwine with concrete actions during AI-assisted programming. To understand this phenomenon, we conducted a mixed-methods study with 76 developers split into AI-assisted and non-AI groups. Each performed programming tasks (Python with API management or Java with SQL). Developers retrospectively labeled their self-reported intentions, tool-supported actions, and emotions from screen recordings, supplemented by surveys and interviews. Our user study resulted in a novel model named S-IASE with four dimensions to describe programming behavior: intention, action, supporting tool, and emotion for a given development state. Our analysis reveals aggregated and sequential behavioral patterns. For example, using AI assistants often makes developers more focused on actively creating code, evaluating, and verifying generated results. AI-assisted participants showed emotionally stable development flow, as opposed to non-AI-assisted participants who experienced more fluctuating emotions. Interviews revealed further nuance: some developers reported impostor-like feelings, expressing guilt or self-doubt about relying on AI. Our work bridges an important gap in understanding the complexities of developer-AI interaction in programming context.
Authors:Carina I Hausladen, Javier Argota Sánchez-Vaquerizo, Michael Siebenmann, Arthur Capozzi, Sachit Mahajan, Dirk Helbing
Abstract:
Participatory urban planning is central to sustainable city-making, yet the technically demanding nature of such interventions often limits meaningful involvement by diverse publics. We introduce a scalable digital participation platform that embeds sustainability projects within a navigable digital twin. Citizens experience a guided virtual walkthrough with audio narration employing the method of loci and spatial anchoring to support mnemonic encoding and recall. This immersive interface is augmented by two purpose-built LLM assistants: one delivers source-grounded factual clarifications, while the other facilitates reflective discussion. We evaluated this system in a randomized controlled online experiment (N = 195) against conventional industry practices (static visualizations and text-based consultations). Results show that spatially anchored immersive presentation significantly improved information recall, which substantially shifted participants' attention from individual inconveniences to collective, community-oriented sustainability benefits. Consequently, participants provided significantly more constructive, solution-focused feedback to the (simulated) municipality. These findings establish a practical tool for cities and policymakers to foster inclusive, democratic participation in sustainability transitions.
Authors:Jiashuo Cao, Chen Li, Wujie Gao, Simon Hoermann, Nilufar Baghaei, Mark Billinghurst
Abstract:
Virtual agents have shown promising potential in mental health applications, but current research has predominantly focused on contexts outside of traditional therapy sessions. This paper examines the impact of a virtual supporter in remote psychotherapy sessions conducted via Zoom. We used a two-phase research approach. First we conducted a formative study to understand the roles and functions of human supporters in psychotherapy contexts. Based on these findings, we developed a virtual supporter operating in two modes: Daily Mode (for mood journaling outside therapy) and Therapy Mode (as an additional participant in Zoom therapy sessions). Finally we ran a user study with 14 participants who engaged with the virtual supporter for a week and then joined a remote psychotherapy session together. Our findings revealed that the virtual supporter had positive effects on creating psychological safety, reducing anxiety, and enhancing emotional articulation without disrupting the therapeutic process. We then discussed both the benefits and potential disadvantages of virtual supporters in therapeutic contexts, including concerns about over-reliance and the need for appropriate boundaries. This research contributes to understanding how AI-driven virtual agents could contribute to human-led remote psychotherapy.
Authors:Ana-Andreea Stoica, Celestine Mendler-Duenner, Moritz Hardt
Abstract:
Digital labor platforms are increasingly used to procure human input, ranging from annotating data and red-teaming AI models, to ride-sharing and food delivery. A central concern in such markets is the ability of platforms to suppress wages by exploiting the abundance of low-cost labor. To study this exploitation pattern, we introduce a novel posted-price procurement model with coverage objectives. A platform seeks to complete M tasks by posting prices to sequentially arriving workers, each of whom accepts a task if it exceeds their private cost. First, we show that under natural assumptions on the workers' estimated cost, there exists a simple pricing strategy for the platform to cover all M tasks with wait time O(M), while paying only a O(log(M)/M) fraction of the total cost of labor. This result highlights how platforms can exploit workers' uncertainty about the cost of labor to effectively suppress wages. Then, we study collective action as a lever to increase wages and promote welfare in digital labor markets. In particular, we show how a small coalition of targeted low-cost workers who commit to a price floor forces the platform's total spending from logarithmic to linear in M. In contrast, a randomly sampled coalition of equal size remains largely ineffective. We complement our theory with synthetic experiments, showcasing the benefits of collective action across different market regimes.
Authors:Qijia Chen, Andrea Bellucci, Zhida Sun, Giulio Jacucci
Abstract:
LLM-based mobile GUI agents treat every task invocation as an independent reasoning episode, requiring a full LLM inference call at each action step. This per-step dependence makes them stateless: a task completed successfully yesterday is re-derived from scratch today, with no improvement in reliability or speed. We present SkillDroid, a three-layer skill agent that compiles successful LLM-guided GUI trajectories into parameterized skill templates (sequences of UI actions with weighted element locators and typed parameter slots) and replays them on future invocations without any LLM calls. A matching cascade (regex patterns, embedding similarity, and app filtering) routes incoming instructions to stored skills, while a failure-learning layer triggers recompilation when skill reliability degrades. Over a 150-round longitudinal evaluation with systematic instruction variation and controlled perturbations, SkillDroid achieves an 85.3% success rate (23 percentage points above a stateless LLM baseline) while using 49% fewer LLM calls. The skill replay mechanism achieves a perfect 1000% success rate across 79 replay rounds at 2.4 times the speed of full LLM execution. Most critically, the system improves with use: its success rate converges upward from 87% to 91%, while the baseline degrades from 80% to 44%.
Authors:Christiane Ernst, Luis Gutmann, Domenique Zipperling, Kathrin Figl, Niklas Kühl
Abstract:
In high-stakes AI-supported decisions, considerations are not purely technical but involve moral judgments about fairness, responsibility, and harm. While prior research has focused mainly on functional or behavioral alignment, this paper argues that moral alignment may be a more fundamental dimension of human-AI decision-making. Moral alignment is defined as the perceived congruence between the values embedded in an AI system's decision logic and the moral intuitions of stakeholders. Building on Moral Foundations Theory, the paper adopts a multi-stakeholder perspective and highlights why moral (mis)alignment matters for the meaningful integration of AI in sensitive contexts.
Authors:Ibrahim Bilau, Nicole Li, Terrence Malayvong, Eunhwa Yang
Abstract:
Mild Cognitive Impairment (MCI) affects 15-20% of adults aged 65 and older, often making kitchen navigation and independent living difficult, particularly in lower-income communities with limited access to professional design help. This study created an AI system that converts standard kitchen photos into MCI-friendly designs using the Home Design Guidelines (HDG). Stable Diffusion models, enhanced with DreamBooth LoRA and ControlNet, were trained on 100 kitchen images to produce realistic visualizations with open layouts, transparent cabinetry, better lighting, non-slip flooring, and less clutter. The models achieved moderate to high semantic alignment (normalized CLIP scores 0.69-0.79) and improved visual realism (GIQA scores 0.45-0.65). In a survey of 33 participants (51.5% caregivers, 36.4% older adults with MCI), the AI-modified kitchens were strongly preferred as more cognitively friendly (87.4% of 198 choices, p < .001). Participants reported high confidence in their kitchen choice selections (M = 5.92/7) and found the visualizations very helpful for home modifications (M = 6.27/7). Thematic analysis emphasized improved visibility, lower cognitive load, and greater independence. Overall, this AI tool provides a low-cost, scalable way for older adults and caregivers to visualize and implement DIY kitchen changes, supporting aging in place and resilience for those with MCI.
Authors:Seiya Mitsuno, Midori Ban, Hiroshi Ishiguro, Yuichiro Yoshikawa
Abstract:
Social isolation among older adults has become a critical concern, as reduced opportunities for conversation and weakened family relationships negatively affect mental health. This study proposes a dialogue agent that supports older adults by fostering both a relationship with the agent and a relationship with their grandchild through sharing everyday information. The agent operates on a chatbot platform and engages in daily conversations with older adults and their grandchildren, exchanging information gathered from each party to enhance conversational engagement and social connection. We conducted a ten-day empirical experiment with 52 grandparent-grandchild pairs. The results suggest that older adults became more willing to interact with the proposed agent, which shared information about their grandchildren, and that the psychological connection between grandparents and grandchildren was strengthened. Furthermore, daily interactions with the agent were associated with reduced anxiety in both older adults and their grandchildren. These findings indicate that a dialogue agent that shares personal information can be an effective approach to supporting older adults by simultaneously offering conversational opportunities and promoting family connectedness. Overall, this study provides valuable insights into the design of dialogue agents that effectively address social isolation among older adults.
Authors:Mengjie Fan, Katrin Angerbauer, Yinchu Cheng, Yingying Yan, Xiaohan Xu, Tianfu Wang, Michael Sedlmair, Yu Yang, Liang Zhou
Abstract:
Drug instructions are crucial for guiding the rational use of medication. We conduct a visualization design study to enhance the comprehension of over-the-counter (OTC) drug instructions, targeting both the general public and medical professionals. We devise two tailored drug instruction designs for different audience groups through an iterative design process. A controlled user study reveals that our design outperforms traditional text-based instructions in terms of response time and usability, and the availability of two versions is also found to be beneficial. This study also motivates a taxonomy based on a systematic classification of OTC drug instructions sampled from an official drug database, which received positive expert feedback. Finally, this study summarizes a workflow for a visualization design strategy based on our design exploration and user study feedback, which can be generalized to other OTC drug instructions.
Authors:Avni Mittal, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury
Abstract:
We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.
Authors:Olivier Jeunen, Eleanor Hanna, Schaun Wheeler
Abstract:
In consumer applications, Customer Relationship Management (CRM) has traditionally relied on the manual optimisation of static, rule-based messaging strategies. While adaptive and autonomous learning systems offer the promise of scalable personalisation, it remains unclear to what extent ``human-in-the-loop'' oversight is required to sustain performance uplift over time. This paper presents a longitudinal case study analysing a real-world consumer application that leverages agentic infrastructure to personalise marketing messaging for a large-scale user base over an 11-month period. We compare two distinct periods: an active phase where marketers directly curated content, audiences, and strategies -- followed immediately by a passive phase where agents operated autonomously from a fixed library of components. Our results demonstrate that whilst active human management generates the highest relative lift in engagement metrics, the autonomous agents successfully sustained a positive lift during the passive period. These findings suggest a symbiotic model where human intervention drives strategic initialisation and discovery, yet autonomous agents can ensure the scalable retention and preservation of performance gains.
Authors:Uloma Okoro, Tammy Mackenzie, Branislav Radeljic
Abstract:
This study examines the perception of legal professionals on the governance of AI in developing countries, using Nigeria as a case study. The study focused on ethical risks, regulatory gaps, and institutional readiness. The study adopted a qualitative case study design. Data were collected through 27 semi-structured interviews with legal practitioners in Nigeria. A focus group discussion was also held with seven additional legal practitioners across sectors such as finance, insurance, and corporate law. Thematic analysis was employed to identify key patterns in participant responses. Findings showed that there were concerns about data privacy risks and the lack of enforceable legal frameworks. Participants expressed limited confidence in institutional capacity and emphasized the need for locally adapted governance models rather than direct adoption of foreign frameworks. While some expressed optimism about AI's potential, this was conditional on the presence of strong legal oversight and public accountability. The study contributes to the growing discourse on AI governance in developing countries by focusing on the perspectives of legal professionals. It highlights the importance of regulatory approaches that are context-specific, inclusive, and capable of bridging the gap between global ethical principles and local realities. These insights offer practical guidance for policymakers, regulators, and scholars working to shape responsible AI governance in similar environments.
Authors:Liqun He, Shijun, Chen, Mutlu Cukurova, Manolis Mavrikis
Abstract:
While generative AI (GenAI) voice chatbots offer scalable opportunities for second language (L2) oral practice, the interactional processes related to learners' gains remain underexplored. This study investigates dialogue act (DA) patterns in interactions between Grade 9 Chinese English as a foreign language (EFL) learners and a GenAI voice chatbot over a 10-week intervention. Seventy sessions from 12 students were annotated by human coders using a pedagogy-informed coding scheme, yielding 6,957 coded DAs. DA distributions and sequential patterns were compared between high- and low-progress sessions. At the DA level, high-progress sessions showed more learner-initiated questions, whereas low-progress sessions exhibited higher rates of clarification-seeking, indicating greater comprehension difficulty. At the sequential level, high-progress sessions were characterised by more frequent prompting-based corrective feedback sequences, consistently positioned after learner responses, highlighting the role of feedback type and timing in effective interaction. Overall, these findings underscore the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education.
Authors:Niklas Hagemann, Daniela Rus
Abstract:
There is a growing need for robots that can change their shape, size and mechanical properties to adapt to evolving tasks and environments. However, current shape-changing systems generally utilize bespoke, system-specific mechanisms that can be difficult to scale, reconfigure or translate from one application to another. This paper introduces a compact, easy-to-fabricate deployable actuator that achieves reversible scale and stiffness transformations through compound folding and zipping of flexible 3D-printed plastic strips into square-section deployable beams. The simple actuation method allows for smooth, continuous transitions between compact (flexible) and expanded (quasi-rigid) states, facilitating diverse shape and stiffness transformations when modules are combined into larger assemblies. The actuator's mechanical performance is characterized and an integrated system involving a four-module adaptive walking robot is demonstrated.
Authors:Mengyu Chen, Pranav Deshpande, Runqing Yang, Elvir Azanli, Joseph Ligman, Shaohan Hu, Richard Chen
Abstract:
Digital humans are lifelike virtual agents capable of natural conversation and are increasingly deployed in domains like retail and finance. However, most current digital humans operate in isolation from their surroundings and lack contextual awareness beyond the dialogue itself. We address this limitation by integrating ambient intelligence (AmI) - i.e., environmental sensors, IoT data, and contextual modeling - with digital human systems. This integration enables situational awareness of the user's environment, anticipatory and proactive assistance, seamless cross-device interactions, and personalized long-term user support. We present a conceptual framework defining key roles that AmI can play in shaping digital human behavior, a design space highlighting dimensions such as proactivity levels and privacy strategies, and application-driven patterns with case studies in financial and retail services. We also discuss an architecture for ambient-enabled digital humans and provide guidelines for responsible design regarding privacy and data governance. Together, our work positions ambient intelligent digital humans as a new class of interactive agents powered by AI that respond not only to users' queries but also to the context and situations in which the interaction occurs.
Authors:Alexis Carrillo, Enrique Taietta, Ali Aghazadeh Ardebili, Giuseppe Alessandro Veltri, Massimo Stella
Abstract:
Talk2AI is a large-scale longitudinal dataset of 3,080 conversations (totaling 30,800 turns) between human participants and Large Language Models (LLMs), designed to support research on persuasion, opinion change, and human-AI interaction. The corpus was collected from 770 profiled Italian adults across four weekly sessions in Spring 2025, using a within-subject design in which each participant conversed with a single model (GPT-4o, Claude Sonnet 3.7, DeepSeek-chat V3, or Mistral Large) on three socially relevant topics: climate change, math anxiety, and health misinformation. Each conversation is linked to rich contextual data, including sociodemographic characteristics and psychometric profiles. After each session, participants reported on opinion change, conviction stability, perceived humanness of the AI, and behavioral intentions, enabling fine-grained longitudinal analysis of how AI-mediated dialogue shapes beliefs and attitudes over time.
Authors:Zonghan Li, Yi Liu, Chunyan Wang, Song Tong, Kaiping Peng, Feng Ji
Abstract:
Nudging is widely used to promote behavioral change, but its effectiveness is often limited when recipients must repeatedly translate feedback into workable next steps under changing circumstances. Large language models (LLMs) may help reduce part of this cognitive work by generating personalized guidance and updating it iteratively across intervention rounds. We developed an LLM agent for iterative personalization and tested it in a three-arm randomized experiment among 233 university residents in China, using daily electricity and shower hot-water conservation as objectively measured cases differing in friction. LLM-personalized nudges (T2) produced the largest conservation effects, while image-enhanced conventional nudges (T1) and text-based conventional nudges (C) showed similar outcomes (omnibus p = 0.009). Relative to C, T2 reduced electricity consumption by 0.56 kWh per room-day (p = 0.014), corresponding to an 18.3 percentage-point higher adjusted saving rate. This advantage emerged within the first two intervention rounds, alongside iterative updating of personalized guidance, and persisted thereafter. Hot-water outcomes followed the same direction but were smaller, less precisely estimated, and attenuated over time, consistent with stronger friction in this domain. LLM-personalized nudges emphasized prospective and context-specific guidance and were associated with higher participant engagement. This study provides field evidence that LLM-based iterative personalization can enhance behavioral nudging, with behavioral friction as a potential boundary condition. Larger trials and extension to more behaviors are warranted.
Authors:Pyeonghwa Kim, Taylor Lewandowski, Michael Dunn, Steve Sawyer
Abstract:
We focus on occupational diversity in platform-mediated work to advance conceptual and empirical insight into the occupationally embedded nature of platform labor. We pursue this focus in response to a prevailing tendency to treat platform workers as a homogeneous group, overlooking the unique demands, constraints, and practices rooted in specific professions. Such generalizations hinder both understanding of platform work and the development of sociotechnical systems that support differentiated occupational realities. To address this gap, we present a longitudinal analysis of 108 online freelancers spanning five occupational categories. We show that occupational context structures workers' capacity to interpret and navigate platformic management, shaping distinct experiences across four dimensions of platform work: self-presentation, flexibility, skilling, and platform work sustainability. To articulate how digital labor platforms' managerial control interacts with occupational embeddedness, we introduce the concept of platformic occupational stratification and discuss four mechanisms that explain its logic and implications for platform-mediated work. These insights contribute to CSCW by informing occupation-sensitive research and design approaches that directly engage with the specific opportunities and challenges rooted in workers' situated occupational agency in platform-mediated work.
Authors:Nolan Platt, Sehrish Nizamani, Alp Tural, Elif Tural, Saad Nizamani, Andrew Katz, Yoonje Lee, Nada Basit
Abstract:
Understanding student engagement usually requires time-consuming manual observation or invasive recording that raises privacy concerns. We present a privacy-preserving pipeline that analyzes classroom videos to extract insights about student attention, without storing any identifiable footage. Our system runs on a single GPU, using OpenPose for skeletal extraction and Gaze-LLE for visual attention estimation. Original video frames are deleted immediately after pose extraction, thus only geometric coordinates (stored as JSON) are retained, ensuring compliance with FERPA. The extracted pose and gaze data is processed by QwQ-32B-Reasoning, which performs zero-shot analysis of student behavior across lecture segments. Instructors access results through a web dashboard featuring attention heatmaps and behavioral summaries. Our preliminary findings suggest that LLMs may show promise for multimodal behavior understanding, although they still struggle with spatial reasoning about classroom layouts. We discuss these limitations and outline directions for improving LLM spatial comprehension in educational analytics contexts.
Authors:Birgitta Langhammer, Oscar Martinez Mozos, Ana Mendes, Joana Madureira, Lina Seduikyte, Martin Weigl, Heidi Salonen, Veronika Kotradyova, Ondrej Krejcar, Sarmite Mikulioniene, Willeke van Staalduinen, Carina Dantas, Petra Maresova, Willeke van Staalduinen, Carina Dantas, Barakovic Sabina, Barakovic Husic Jasmina, Jonathan Gomez-Raja
Abstract:
This document reports the State of the Art of science and practice on three topics related to smart and healthy ageing at home: furniture and habitats, Information and Communication Technologies (ICT), and healthcare. The reports were prepared by the working groups of COST Action CA16226, Sheld-on. Sheld-on is a network of researchers, user representatives, industry members, and other stakeholders. The three domains covered in this report were the areas of interest for three working groups from the COST Action. The aim of each working group was to assess the State of the Art for disciplinary understanding, identification of advances in smart furniture and habitat, products, industries and success stories. The findings on these topics of all working groups are compiled here. Due to the different backgrounds of the members of each of the working groups, the document is divided in three separate parts that can be considered as separate State of the Art reports. The goal of this document is to be used as input in the fourth working group of Sheld-on COST Action: Solutions for Ageing Well at Home, in the Community, and at Work, where experts from the three different domains converge to a single working group in order to achieve the action objectives.
Authors:Ruth Cohen, Lu Feng, Ayala Bloch, Sarit Kraus
Abstract:
While natural-language explanations from large language models (LLMs) are widely adopted to improve transparency and trust, their impact on objective human-AI team performance remains poorly understood. We identify a Persuasion Paradox: fluent explanations systematically increase user confidence and reliance on AI without reliably improving, and in some cases undermining, task accuracy. Across three controlled human-subject studies spanning abstract visual reasoning (RAVEN matrices) and deductive logical reasoning (LSAT problems), we disentangle the effects of AI predictions and explanations using a multi-stage reveal design and between-subjects comparisons. In visual reasoning, LLM explanations increase confidence but do not improve accuracy beyond the AI prediction alone, and substantially suppress users' ability to recover from model errors. Interfaces exposing model uncertainty via predicted probabilities, as well as a selective automation policy that defers uncertain cases to humans, achieve significantly higher accuracy and error recovery than explanation-based interfaces. In contrast, for language-based logical reasoning tasks, LLM explanations yield the highest accuracy and recovery rates, outperforming both expert-written explanations and probability-based support. This divergence reveals that the effectiveness of narrative explanations is strongly task-dependent and mediated by cognitive modality. Our findings demonstrate that commonly used subjective metrics such as trust, confidence, and perceived clarity are poor predictors of human-AI team performance. Rather than treating explanations as a universal solution, we argue for a shift toward interaction designs that prioritize calibrated reliance and effective error recovery over persuasive fluency.
Authors:Caitlin Morris, Pattie Maes
Abstract:
As AI systems increasingly take on instructional roles - providing feedback, guiding practice, evaluating work - a fundamental question emerges: does it matter to learners who they believe is on the other side? We investigated this using a three-condition experiment (N=148) in which participants completed a creative coding tutorial and received feedback generated by the same large language model, attributed to either an AI system (with instant or delayed delivery) or a human teaching assistant (with matched delayed delivery). This three-condition design separates the effect of source attribution from the confound of delivery timing, which prior studies have not controlled. Source attribution and timing had distinct effects on different outcomes: participants who believed the human attribution spent more time on task than those receiving equivalently timed AI-attributed feedback (d=0.61, p=.013, uncorrected), while the delivery delay independently increased output complexity without affecting time measures. An exploratory analysis revealed that 46% of participants in the human-attributed condition did not believe the attribution, and these participants showed worse outcomes than those receiving transparent AI feedback (code complexity d=0.77, p=.003; time on task d=0.70, p=.007). These findings suggest that believed human presence may carry motivational value, but that this value depends on credibility. For computing educators, transparent AI attribution may be the lower-risk default in contexts where human attribution would not be credible.
Authors:Yongsu Ahn, Nam Wook Kim, Benjamin Bach
Abstract:
AI chatbots are increasingly stepping into roles as collaborators or teachers in analyzing, visualizing, and reasoning through data and domain problem. Yet, AI's default assistant mode with its comprehensive and one-off responses may undermine opportunities for practitioners to develop literacy through their own thinking, inducing cognitive passivity. Drawing on evidence from empirical studies and theories, we argue that disrupting cognitive passivity necessitates a nuanced approach: rather than simply making AI promote deliberative thinking, there is a need for more dynamic and adaptive strategy through cognitive alignment -- a framework that characterizes effective human-AI interaction as a function of alignment between users' cognitive demand and AI's interaction mode. In the framework, we provide the mapping between AI's interaction mode (transmissive or deliberative) and users' cognitive demand (receptive or deliberative), otherwise leading to either cognitive passivity or friction. We further discuss implications and offer open questions for future research on data literacy.
Authors:Yoana Ahmetoglu, Marios Constantinides, Anna Cox
Abstract:
The use of AI tools in research is becoming routine, alongside growing consensus that such use should be transparently disclosed. However, AI disclosure statements remain rare and inconsistent, with policies offering limited guidance and authors facing social, cognitive, and emotional barriers when reporting AI use. To explore how structured disclosure shapes what authors report and how they experience disclosure, we present DAISY (Disclosure of AI-uSe in Your Research), a form-based tool for generating AI disclosure statements. DAISY was developed from literature-derived requirements and co-design (N =11), and deployed in a user study with authors (N=31). DAISY-supported disclosures met more completeness criteria, offering clearer breakdowns of AI use across research and writing than unsupported disclosures. Surprisingly, despite concerns about how transparently disclosed AI use might be perceived, the use of DAISY did not reduce author comfort with the disclosure statements. We discuss design implications and a research agenda for AI disclosure as a sociotechnical practice.
Authors:Yasaman Hakiminejad, Shiva Azimi, Luis Gomero, Elizabeth Pantesco, Irene P. Kan, Meltem Izzetoglu, Arash Tavakoli
Abstract:
As semi-automated vehicles (SAVs) become more common, ensuring effective human-vehicle interaction during control handovers remains a critical safety challenge. Existing studies often rely on single-session simulator experiments or naturalistic driving datasets, which often lack temporal context on drivers' cognitive and physiological states before takeover events. This study introduces a hybrid framework combining longitudinal mobile sensing with high-fidelity driving simulation to examine driver readiness in semi-automated contexts. In a pilot study with 38 participants, we collected 7 days of wearable physiological data and daily surveys on stress, arousal, valence, and sleep quality, followed by an in-lab simulation with scripted takeover events under varying secondary task conditions. Multimodal sensing, including eye tracking, fNIRS, and physiological measures, captured real-time responses. Preliminary analysis shows the framework's feasibility and individual variability in baseline and in-task measures; for example, fixation duration and takeover control time differed by task type, and RMSSD showed high inter-individual stability. This proof-of-concept supports the development of personalized, context-aware driver monitoring by linking temporally layered data with real-time performance.
Authors:Luca Vogelgesang, Ahmed Mehdi Soltani, Mohammadhossein Khojasteh, Xinrui Zu, Stefano De Giorgis, Madalina Croitoru, Filip Ilievski
Abstract:
Assistive robots have growing potential to support physical wellbeing in home and healthcare settings, for example, by guiding users through stretching or rehabilitation routines. However, existing systems remain largely scripted, which limits their ability to adapt to user state, environmental context, and interaction dynamics. In this work, we present StretchBot, a hybrid neuro-symbolic robotic coach for adaptive assistive guidance. The system combines multimodal perception with knowledge-graph-grounded large language model reasoning to support context-aware adjustments during short stretching sessions while maintaining a structured routine. To complement the system description, we report an exploratory pilot comparison between scripted and adaptive guidance with three participants. The pilot findings suggest that the adaptive condition improved perceived adaptability and contextual relevance, while scripted guidance remained competitive in smoothness and predictability. These results provide preliminary evidence that structured actionable knowledge can help ground language-model-based adaptation in embodied assistive interaction, while also highlighting the need for larger, longitudinal studies to evaluate robustness, generalizability, and long-term user experience.
Authors:Jeremy Zhengqi Huang, Emani Hicks, Sidharth, Gillian R. Hayes, Dhruv Jain
Abstract:
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an interactive mobile system for real-time soundscape mediation that selectively attenuates bothersome sounds while preserving desired audio. Sona is built on a target-conditioned neural pipeline that supports simultaneous attenuation of multiple overlapping sound sources, overcoming the single-target limitation of prior systems. It runs in real time on-device and supports user-extensible sound classes through in-situ audio examples, without retraining. Sona is informed by a formative study with 68 noise-sensitive individuals. Through technical benchmarking and an in-situ study with 10 participants, we show that Sona achieves low-latency, multi-target attenuation suitable for live listening, and enables meaningful reductions in bothersome sounds while maintaining awareness of surroundings. These results point toward a new class of personal AI systems that support comfort and social participation by mediating real-world acoustic environments.
Authors:Lucas Gautheron, Nori Jacoby, Peter Harrison
Abstract:
Adaptive experiments automatically optimize their design throughout the data collection process, which can bring substantial benefits compared to conventional experimental settings. Potential applications include, among others: computerized adaptive testing (for selecting informative tasks in ability measurements), adaptive treatment assignment (when searching experimental conditions maximizing certain outcomes), and active learning (for choosing optimal training data for machine learning algorithms). However, implementing these techniques in real time poses substantial computational and technical challenges. Additionally, despite their conceptual similarity, the above scenarios are often treated as separate problems with distinct solutions. In this paper, we introduce a practical and unified approach to real-time adaptive experiments that can encompass all of the above scenarios, regardless of the modality of the task (including textual, visual, and audio inputs). Our strategy combines active inference, a Bayesian framework inspired by cognitive neuroscience, with PsyNet, a platform for large-scale online behavioral experiments. While active inference provides a compact, flexible, and principled mathematical framework for adaptive experiments generally, PsyNet is a highly modular Python package that supports social and behavioral experiments with stimuli and responses in arbitrary domains. We illustrate this approach through two concrete examples: (1) an adaptive testing experiment estimating participants' ability by selecting optimal challenges, effectively reducing the amount of trials required by 30--40\%; and (2) an adaptive treatment assignment strategy that identifies the optimal treatment up to three times as accurately as a fixed design in our example. We provide detailed instructions to facilitate the adoption of these techniques.
Authors:Wenzheng Zhao, Manideep Duggi, Fengpei Yuan
Abstract:
Distributed multi-robot systems for the home often require robots to operate out of the user's sight, creating a state awareness gap that can diminish trust and perceived transparency and control. This paper investigates whether real-time, socially mediated state externalization can bridge this gap without compromising task performance. We developed a system where a co-located social mediator robot (Pepper) externalizes the hidden execution states of an out-of-sight mobile manipulator (Stretch~3) for voice-driven object retrieval and delivery, where task-level states are synchronized and externalized through verbal updates and visual progress display. In a counterbalanced within-subject study (N=30), we compared a baseline of Autonomous Hidden Execution against Socially Mediated State Externalization. Our results show that externalization significantly increases user task-focused attention (from 15.8% to 84.6%, p<.001) and substantially improves perceived perspicuity, dependability, stimulation, and attractiveness (all p<.001). Furthermore, 83% of participants preferred the externalized condition, and this improvement in user experience was achieved without a statistically significant increase in end-to-end task completion time (p=.271). The results suggest that socially mediated state externalization is an effective architectural mechanism for designing more transparent and trustworthy distributed robot systems, ultimately enhancing user experience without sacrificing performance in distributed home robot deployments.
Authors:Ziming Li, Hongji Li, Jialin Wang, Pan Hui, Hai-Ning Liang
Abstract:
The recent emergence and popularity of consumer-grade augmented reality (AR) glasses from major technology companies highlight their potential to become the next daily computing platform. A dominant design trend in this context is the integration of a front-facing camera to deliver a first-person perspective. While this approach is intuitive, there is limited evidence that it is optimal (or sufficient) for supporting users in daily tasks. This paper explores a more effective camera interaction technique for AR glasses, which we term ``FlexiCamAR." This novel method aims to enhance both efficiency and the range of applications for AR glasses by offering flexible and comfortable secondary camera viewpoints. To investigate the applicability and usability of this approach, we developed a ring camera prototype that can be attached to users' fingers. We then conducted a user study with 12 participants, comparing FlexiCamAR against the baseline, a traditional front-facing AR camera setup, across two common tasks: taking photos and scanning QR codes. Our findings show that FlexiCamAR significantly reduces physical load. We also explore potential scenarios where the additional viewpoint afforded by FlexiCamAR proves valuable, such as capturing low-angle perspectives or navigating confined spaces. Participant feedback further suggests strong potential for additional applications, including selfie taking, video conferencing, and object scanning. Overall, FlexiCamAR presents a novel interaction approach that can serve as a powerful supplement or alternative to the first-person perspective, significantly improving the adaptability of AR glasses for everyday use.
Authors:Giulio Pisaneschi, Pierpaolo Serio, Estelle Gerbier, Andrea Dan Ryals, Lorenzo Pollini, Mario G. C. A. Cimino
Abstract:
This paper presents an experimental platform for studying intentional-state attribution toward a non-humanoid robot. The system combines a simulated robot, realistic task environments, and large language model-based explanatory layers that can express the same behavior in mentalistic, teleological, or mechanistic terms. By holding behavior constant while varying the explanatory frame, the platform provides a controlled way to investigate how language and framing shape the adoption of the intentional stance in robotics.
Authors:Qijia Chen, Andrea Bellucci, Giulio Jacucci
Abstract:
Newcomers are crucial for the growth of online communities, yet their successful integration into these spaces requires overcoming significant initial hurdles. Social Virtual Reality (VR) platforms are novel avenues that offer unprecedented online interaction experiences. Unlike well-studied two-dimensional online environments, the pathways to successful newcomer integration in online VR spaces are underexplored. Our research addresses this gap by examining the strategies used by newcomers to navigate early challenges in social VR and how they adapt. By focusing on active participants (ranging from newcomers currently navigating these hurdles to veterans who have successfully integrated) we isolate the specific strategies necessary for retention. We interviewed 24 active social VR users and conducted a reflexive thematic analysis. While participants identified barriers such as unfamiliar user interfaces, social norms, and overwhelming sensory input, our analysis reveals the adaptation strategies required to overcome them. Our findings expand on understanding newcomer persistence beyond traditional 2D environments, emphasizing how social dynamics influence the management of VR-specific issues like VR sickness during onboarding. Additionally, we highlight how successful newcomers overcome the lack of clear objectives in social VR by proactively constructing social meaning. We propose design suggestions to scaffold these successful integration pathways.
Authors:Roshni Kaushik, Maarten Sap, Koichi Onoue
Abstract:
AI-mediated communication is increasingly being utilized to help facilitate interactions; however, in privacy sensitive domains, an AI mediator has the additional challenge of considering how to preserve privacy. In these contexts, a mediator may redact or withhold information, raising questions about how users perceive these interventions and whether explanations of system behavior can improve trust. In this work, we investigate how explanations of redaction operations can affect user trust in AI-mediated communication. We devise a scenario where a validated system removes sensitive content from messages and generates explanations of varying detail to communicate its decisions to recipients. We then conduct a user study with $180$ participants that studies how user trust and preferences vary for cases with different amounts of redacted content and different levels of explanation detail. Our results show that participants believed our system was more effective at preserving privacy when explanations were provided ($p<0.05$, Cohen's $d \approx 0.3$). We also found that contextual factors had an impact; participants relied more on explanations and found them more helpful when the system performed extensive redactions ($p<0.05$, Cohen's $f \approx 0.2$). We also found that explanation preferences depended on individual differences as well, and factors such as age and baseline familiarity with AI affected user trust in our system. These findings highlight the importance and challenge of balancing transparency and privacy in AI-mediated communications and suggest that adaptive, context-aware explanations are essential for designing privacy-aware, trustworthy AI systems.
Authors:Domenique Zipperling, Lukas Schmidt, Benedikt Hahn, Niklas Kühl, Steven Kimbrough
Abstract:
Current clinical decision support systems (CDSSs) typically base their predictions on correlation, not causation. In recent years, causal machine learning (ML) has emerged as a promising way to improve decision-making with CDSSs by offering interpretable, treatment-specific reasoning. However, existing research often emphasizes model development rather than designing clinician-facing interfaces. To address this gap, we investigated how CDSSs based on causal ML should be designed to effectively support collaborative clinical decision-making. Using a design science research methodology, we conducted a structured literature review and interviewed experienced physicians. From these, we derived eight empirically grounded design requirements, developed seven design principles, and proposed nine practical design features. Our results establish guidance for designing CDSSs that deliver causal insights, integrate seamlessly into clinical workflows, and support trust, usability, and human-AI collaboration. We also reveal tensions around automation, responsibility, and regulation, highlighting the need for an adaptive certification process for ML-based medical products.
Authors:Ibrahim Bilau, Stacie Smith, Abdurrahman Baru, Marwan Shagar, Brian Jones, Eunhwa Yang
Abstract:
Virtual reality (VR) has emerged as a promising tool for assessing instrumental activities of daily living (IADLs) in older adults. However, the ecological validity of these simulations is often compromised by simplified or low-fidelity environmental design that fails to elicit a genuine sense of presence. This paper documents a reproducible Reality-to-VR pipeline for creating a photorealistic environmental simulation to support a study on cognitive aging in place. The proposed workflow captured the as-built kitchen of the Aware Home building at Georgia Tech using Terrestrial Laser Scanning (TLS) for sub-millimeter geometric accuracy, followed by point cloud processing in Faro SCENE, geometric retopology in SketchUp, and integration into Unreal Engine 5 via Datasmith with Lumen global illumination for high visual fidelity. The pipeline achieved photorealistic rendering while maintaining a stable 90 Hz frame rate, a critical threshold for mitigating cybersickness in older populations. The environment also enables instantaneous manipulation of environmental variables, such as switching between closed cabinetry and open shelving, providing experimental flexibility impossible in physical settings. Participant validation with 17 older adults confirmed minimal cybersickness risk and preserved sensitivity to the experimental manipulation, supporting the pipeline's feasibility for aging-in-place research and establishing a benchmark for future comparative studies.
Authors:Vartika Narayani Srinet, Anirudha Bhattacharjee, Braj Bhushan, Bishakh Bhattacharya
Abstract:
Responding to one's name is among the earliest-emerging social orienting behaviors and is one of the most prominent aspects in the detection of Autism Spectrum Disorder (ASD). Typically developing children exhibit near-reflexive orienting to their name, whereas children with ASD often demonstrate reduced frequency, increased latency, or atypical patterns of response. In this study, we examine differential responsiveness to quantify name-calling stimuli delivered by both human agents and NAO, a humanoid robot widely employed in socially assistive interventions for autism. The analysis focuses on multiple behavioral parameters, including eye contact, response latency, head and facial orientation shifts, and duration of sustained interest. Video-based computational methods were employed, incorporating face detection, eye region tracking, and spatio-temporal facial analysis, to obtain fine-grained measures of children's responses. By comparing neurotypical and neuroatypical groups under controlled human-robot conditions, this work aims to understand how the source and modality of social cues affect attentional dynamics in name-calling contexts. The findings advance both the theoretical understanding of social orienting deficits in autism and the applied development of robot-assisted assessment tools.
Authors:Jiyeon Bae, Jinwook Seo
Abstract:
Existing computational studies of popular music primarily model aggregate trends or predict chart performance, offering limited support for interpreting artist-level alignment against historical stylistic baselines. We introduce an interactive visual analytics framework that treats each artist-decade as a unit defined relative to an era-specific baseline, characterized along two complementary dimensions: profile shape similarity, capturing directional correspondence with the era's feature pattern, and profile contrast ratio, capturing stylistic intensity relative to the era's dispersion. Together, these dimensions define a quadrant-based trajectory space for reasoning about conformity, divergence, and amplification over time. Applied to weekly U.S. Billboard Hot 100 chart entries from the all-time top-10 artists across six decades (1960s-2010s), linked with Spotify audio features, the framework reveals that alignment and intensity can meaningfully diverge across artist trajectories.
Authors:Zihong He, Shuqin Wang, Songchen Zhou, Qinghui Lin, Jialin Wang, Chen Liang, Hai-Ning Liang
Abstract:
Most AI agents remain confined to an instrumental "command-execution" model, resulting in unequal, one-sided interactions. While recent works attempt to build relationships through hidden memory backends, these invisible processes often fail to break the instrumental bias. In this paper, we argue that true relational equality requires agents to have an independent, observable existence. We introduce the \textit{Observable Life Spaces} paradigm, where agents inhabit a continuous virtual environment, engage in daily activities, and form social relationships that users can directly observe. Through a mixed-methods study ($N=24$), we demonstrate that only when agents are endowed with a socialized life space that is visually observable to humans can the perceived equality during interaction be significantly enhanced ($p = 0.015$). Our findings suggest that visually representing an agent's social life space can effectively shift the human-agent dynamic from a purely instrumental relationship to one characterized by perceived equality.
Authors:Matthew Flathers, Griffin Smith, Julian Herpertz, Zhitong Zhou, John Torous
Abstract:
Generative video models are increasingly capable of producing complex depictions of mental health experiences, yet little is known about how these systems represent conditions like depression. This study characterizes how OpenAI's Sora 2 generative video model depicts depression and examines whether depictions differ between the consumer App and developer API access points. We generated 100 videos using the single-word prompt "Depression" across two access points: the consumer App (n=50) and developer API (n=50). Two trained coders independently coded narrative structure, visual environments, objects, figure demographics, and figure states. Computational features across visual aesthetics, audio, semantic content, and temporal dynamics were extracted and compared between modalities. App-generated videos exhibited a pronounced recovery bias: 78% (39/50) featured narrative arcs progressing from depressive states toward resolution, compared with 14% (7/50) of API outputs. App videos brightened over time (slope = 2.90 brightness units/second vs. -0.18 for API; d = 1.59, q < .001) and contained three times more motion (d = 2.07, q < .001). Across both modalities, videos converged on a narrow visual vocabulary and featured recurring objects including hoodies (n=194), windows (n=148), and rain (n=83). Figures were predominantly young adults (88% aged 20-30) and nearly always alone (98%). Gender varied by access point: App outputs skewed male (68%), API outputs skewed female (59%). Sora 2 does not invent new visual grammars for depression but compresses and recombines cultural iconographies, while platform-level constraints substantially shape which narratives reach users. Clinicians should be aware that AI-generated mental health video content reflects training data and platform design rather than clinical knowledge, and that patients may encounter such content during vulnerable periods.
Authors:Shixian Xie, Motahhare Eslami, John Zimmerman
Abstract:
Despite significant advances in responsible AI research, industry adoption remains limited, leaving many HCI contributions underutilized in practice. This position paper argues that current research often fails to account for the fundamental need for capitalist enterprises to create value. To achieve immediate real-world impact, responsible AI research must explore how to design responsibly within capitalism. We call for a move beyond the dichotomy of "ethics vs. business" toward a more productive framing of "ethics and business." We propose ideation as a practical design strategy for generating ethically preferable alternatives that also meet business objectives. By aligning ethics with enterprise realities, we expand the space of responsible design that can actually be built.
Authors:Hung-Yue Suen, Kuo-En Hung, Fan-Hsun Tseng
Abstract:
This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance (R2 = 0.85 for affective engagement and R2 = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information.
Authors:Shivam Shukla, Emily Chen, Manhaz Roshanaei, Magy Seif El-Nasr
Abstract:
There has been a growing research interest in Digital Therapeutic Alliance (DTA) as the field of AI-powered conversational agents are being deployed in mental health care, particularly those delivering CBT (Cognitive Behaviour Therapy). Our proposition argues that the current design paradigm which seeks to optimize the bond between a patient in need of support and an AI agent contains a subtle but consequential trap: it risks producing an "appearance of connection" that unintentionally disrupts the fundamental human need for relatedness, which potentially displaces the authentic human relationships upon which long-term psychological recovery depends. We propose a reorientation from designing artificial intelligence tools that simulate relationships to designing AI that scaffolds them. To operationalize our argument, we propose an interdisciplinary model that translates the Responsible AI Six Sphere Framework through the lens of Self-Determination Theory (SDT), with a specific focus on the basic psychological need for relatedness. The resulting model offers the technical and often clinical communities a set of relationship-centered design guidelines and relevant provocations for building AI systems that function not just as companions, but as a catalyst for strengthening a patient's entire relational ecology; their connections with therapists, caregivers, family, and peers. In doing so, we discuss a model towards a more sustainable ecosystem of relationship-centered AI in mental health care.
Authors:Hashini Senaratne, Richard Attfield, Samith Widhanapathirana, David Howard, Cecile Paris, Dana Kulic, Leimin Tian
Abstract:
Maintaining situational awareness (SA) is critical in human-robot teams. Yet, under high workload and dynamic conditions, operators often experience SA gaps. Automated detection of SA gaps could provide timely assistance for operators. However, conventional SA measures either disrupt task flow or cannot capture real-time fluctuations, limiting their operational utility. To the best of our knowledge, no publicly available dataset currently supports the systematic evaluation of online human SA assessment in human-robot teaming. To advance the development of online SA assessment tools, we introduce HRI-SA, a multimodal dataset from 30 participants in a realistic search-and-rescue human-robot teaming context, incorporating eye movements, pupil diameter, biosignals, user interactions, and robot data. The experimental protocol included predefined events requiring timely operator assistance, with ground truth SA latency of two types (perceptual and comprehension) systematically obtained by measuring the time between assistance need onset and resolution. We illustrate the utility of this dataset by evaluating standard machine learning models for detecting perceptual SA latencies using generic eye-tracking features and contextual features. Results show that eye-tracking features alone effectively classified perceptual SA latency (recall=88.91%, F1=67.63%) using leave-one-group-out cross-validation, with performance improved through contextual data fusion (recall=91.51%, F1=80.38%). This paper contributes the first public dataset supporting the systematic evaluation of SA throughout a human-robot teaming mission, while also demonstrating the potential of generic eye-tracking features for continuous perceptual SA latency detection in remote human-robot teaming.
Authors:Jingruo Chen, Yibo Meng, Kexin Nie
Abstract:
Adults with ADHD often face challenges with task management, not due to a lack of willpower, but because of emotional and relational misalignments between cognitive needs and normative infrastructures. Existing productivity tools, designed for neurotypical users, often assume consistent self-regulation and linear time, overlooking these differences. We conducted 22 semi-structured interviews with ADHD-identifying adults, exploring their challenges in task management and their coping mechanisms through socially and emotionally scaffolded strategies. Building on these insights, we conducted a follow-up speed dating study with 20 additional ADHD-identifying adults, focusing on 13 speculative design concepts that leverage AI for task support. Our findings reveal that task management among adults with ADHD is relationally and affectively co-constructed, rather than an isolated individual act. Overall, we provide (1) empirical insights into distributed and emotionally scaffolded task management practices, (2) design implications for socially-aware AI systems that support co-regulation and nonlinear attention rhythms, and (3)an analysis of user preferences for different AI design concepts, clarifying which features were most valued and why.
Authors:Apurv Varshney, Lily M. Turkstra, Jiaxin Su, Mable Zhou, Scott T. Grafton, Barry Giesbrecht, Mary Hegarty, Michael Beyeler
Abstract:
Navigation aids are central to immersive virtual reality (VR) experiences that involve physical locomotion. Their effectiveness depends not only on how much spatial information they provide, but also on how directly that information supports movement decisions. We compared three common guidance techniques for immersive VR wayfinding: a directional arrow, a minimap, and a compass. In a controlled room-scale VR study with 42 participants completing 1008 trials, participants navigated to target landmarks in a time-pressured maze with reduced visibility and forced route replanning. Across behavioral and eye-tracking measures, arrow guidance produced the strongest navigation performance, minimap guidance yielded intermediate performance, and compass cues performed worst, suggesting that during immersive locomotion users benefit from guidance that can be interpreted rapidly while moving. These results suggest that in demanding immersive locomotion tasks, interfaces that translate spatial information directly into actionable movement cues can outperform richer but more interpretive spatial representations. Our findings highlight the importance of designing XR navigation interfaces that minimize the cognitive translation between spatial information and movement decisions.
Authors:Vincent Gurgul, Robin Gubela, Stefan Lessmann
Abstract:
Generative Artificial Intelligence (GenAI) rapidly transforms software engineering, yet existing research remains fragmented across individual tasks in the Software Development Lifecycle. This study integrates a systematic literature review with a survey of 65 software developers. The results show that GenAI exerts its highest impact in design, implementation, testing, and documentation, where over 70 % of developers report at least halving the time for boilerplate and documentation tasks. 79 % of survey respondents use GenAI daily, preferring browser-based Large Language Models over alternatives integrated directly in their development environment. Governance is maturing, with two-thirds of organizations maintaining formal or informal guidelines. In contrast, early SDLC phases such as planning and requirements analysis show markedly lower reported benefits. In a nutshell, GenAI shifts value creation from routine coding toward specification quality, architectural reasoning, and oversight, while risks such as uncritical adoption, skill erosion, and technical debt require robust governance and human-in-the-loop mechanisms.
Authors:Kowe Kadoma, Priyal Shrivastava, Mor Naaman
Abstract:
Researchers have demonstrated that Automatic Speech Recognition (ASR) systems perform differently across demographic groups. In this work, we examined how subtitle errors affect evaluations of speakers and their content using a preregistered online experiment (N=207, U.S.-based crowdworkers). Participants watched speakers with various accents deliver a talk in which the subtitles were accurate or error-prone. Our results indicate that error-prone subtitles consistently reduce both speaker and content evaluations for all speakers. We did not see disparate impact between the accent groups, controlling for subtitle quality. Taken together, though, the findings of this short paper imply that speakers with accents for which ASR systems perform poorly are likely to be further penalized by viewers with lower evaluations.
Authors:Hyerim Park, Jinseok Hong, Heejeong Ko, Woontack Woo
Abstract:
Question-asking is one of the key indicators of cognitive engagement. However, understanding how the distinct psychological affordances of presentation media shape learners' spoken inquiries with embodied Intelligent Virtual Agents (IVAs) remains limited. To systematically examine this process, we propose a 5W1H-based framework for analyzing learner questions. Using this framework, we conducted a user study comparing an Augmented Reality-based IVA (AR-IVA) deployed in the physical environment with a screen-based IVA (Video-IVA) during cardiopulmonary resuscitation (CPR) instruction. Results showed that the AR-IVA elicited higher spatial and social presence and promoted more frequent and longer questions focused on clarification and understanding. In contrast, the Video-IVA encouraged questions regarding procedural refinement. Presence acted as a selective filter, shaping the timing and topic of questions rather than as a universal mediator. These effects were significantly moderated by learners' motivational and strategic characteristics toward learning. Based on these findings, we propose design implications for IVA-supported learning systems.
Authors:Natalie Grace Brigham, Lucy Qin, Tadayoshi Kohno
Abstract:
While computer systems that allow users to interact through conversational natural language (i.e., chatbots) have existed for many years, varying types of applications advertising AI companionship (e.g., Character AI, Replika) have proliferated in recent years due to advancements in large language models. Our work offers a threat model encompassing two distinct risk categories: harms posed to users by AI companion applications, and harms enabled by malicious users exploiting application features. To further understand this application ecosystem, we identified 489 unique apps from the App Store and Play Store that advertised AI companionship. We then systematically conducted and analyzed walkthroughs of a stratified sample of 30 apps with respect to our threat model. Through our analysis, we categorize broader ecosystem trends that provide context for understanding threats and identify specific threats related to sensitive data collection and sharing, anthropomorphism, engagement mechanisms, sexual interactions and media, as well as the ingestion and reconstruction of likeness, including the potential for generating synthetic nonconsensual intimate imagery. This study provides a foundational security perspective on the AI companion application ecosystem and informs future research within and beyond this field, policy, and technical development. Content warning: This paper includes descriptions of applications that can be used to create synthetic nonconsensual representations, including explicit imagery, as well as discussion of self-harm and suicidal ideation.
Authors:A K M Amanat Ullah, David Ahlström, Khalad Hasan
Abstract:
Large curved displays are ideal for viewing 360 degree content, such as 3D maps, but typically restrict users to a 180 degree viewport, leaving information off-screen. Since users naturally direct their heads toward regions on-screen before interacting, head movements offer a promising alternative for workspace manipulation to bring off-screen content into view. We explore rate control functions (linear, sigmoid, polynomial) and zone control functions (continuous, friction, interrupted, additive) to translate head rotations into workspace control, enabling users to access off-screen content. Polynomial rate control emerges as the best choice, achieving the fastest trial times and highest subjective ratings. Using a map navigation task, our second study demonstrates that users perform better with the polynomial head-based technique than with the industry-standard controller-based methods, click-and-drag and joystick-push, for 360\degree workspace navigation. Based on these findings, we provide guidelines to inform the design of future 360\degree workspace navigation techniques for large curved displays.
Authors:Dimitri Staufer, Kirsten Morehouse, David Hartmann, Bettina Berendt
Abstract:
Large language models (LLMs) learn statistical associations from massive training corpora and user interactions, and deployed systems can surface or infer information about individuals. Yet people lack practical ways to inspect what a model associates with their name. We report interim findings from an ongoing study and introduce LMP2, a browser-based self-audit tool. In two user studies ($N_{total}{=}458$), GPT-4o predicts 11 of 50 features for everyday people with $\ge$60\% accuracy, and participants report wanting control over LLM-generated associations despite not considering all outputs privacy violations. To validate our probing method, we evaluate eight LLMs on public figures and non-existent names, observing clear separation between stable name-conditioned associations and model defaults. Our findings also contribute to exposing a broader generative AI evaluation crisis: when outputs are probabilistic, context-dependent, and user-mediated through elicitation, what model--individual associations even include is under-specified and operationalisation relies on crafting probes and metrics that are hard to validate or compare. To move towards reliable, actionable human-centred LLM privacy audits, we identify nine frictions that emerged in our study and offer recommendations for future work and the design of human-centred LLM privacy audits.
Authors:Yejin Yun, Seung Won Lee, Jiin Choi, Kyung Hoon Hyun
Abstract:
Infinite canvas platforms are becoming central to contemporary design practice, enabling designers to externalize cognition through the spatial arrangement of multimodal artifacts. As AI agents increasingly generate and organize content within these environments, their impact on designers' externalization processes remains underexplored. We report a field study with eight professional designers comparing workflows with and without an AI organizing agent. Through a sequence analysis of 5,838 design actions, we identify three key shifts: (1) AI integration reallocates cognitive effort from spatial management to content curation and relational structuring, without increasing active time; (2) a characteristic generate-and-curate cycle emerges in which designers' demands on the agent intensify while the agent's functional role adapts; and (3) AI's role evolves from a divergent catalyst in early stages to a convergent curator in later phases. These findings offer a behavioral model for designing phase-adaptive AI tools that support human-AI co-evolution on infinite canvases.
Authors:JiWoong Jang, Patrick Carrington, Andrew Begel
Abstract:
Social accessibility research faces a persistent tension: assistive technologies (AT) predominantly pursue independence, yet disabled people's experiences reveal rich preferences for interdependence. Our analysis of 90 papers from 2011-2025 uncovered that this stems from a deeper issue - which crystallized through dialogue with three bodies of theories: (1) self-determination theory (SDT), (2) symbolic interactionism, and (3) posthumanist perspectives and crip technoscience. SDT illuminates individual needs; symbolic interactionism addresses construction of social meaning and stigma; Posthumanist and crip technoscience together challenges normalcy, governance, and the human-machine boundary. Through their tensions, we identify relational sovereignty as an alternative telos - or goal - to autonomy. While our corpus equates autonomy with independence, sovereignty centers the power to choose between independence and interdependence. To operationalize this shift - from "Can they do it?" to "Do they get to decide?" - we introduce the Relational Sovereignty Matrix and four design interventions: (1) a sovereignty-centered reframing of SDT, (2) generative questions for justice-oriented reflection, (3) the idea of building through sovereign technical primitives, and (4) explicit consideration of power in AT design.
Authors:JiWoong Jang, Patrick Carrington, Andrew Begel
Abstract:
Research in social accessibility aims to improve the lives of disabled people across diverse abilities and experiences by assisting with communication, relationships, and ecosystems of access. We seek to understand this intersectional body of work through analyzing social accessibility research from 2011 to 2025. Through constructivist grounded theory analysis of 90 papers (curated from 605), we develop the Three Praxes Framework: three sites of practice Artifact (constructive), Ecosystem (relational), and Epistemology (theoretical) - two cross-cutting stances toward change (Temporal Orientation and Stakeholder Focus) - and one reflexive cycle modeling how insights can flow between praxes. Our analysis reveals these praxes operate largely in isolation, risking that insights remain academic exercises while assistive technologies reinforce existing barriers. We call on the field to realize a cycle where disabled people's lived experiences shape material realities, material practice generates theoretical knowledge, and both transform ecosystems of access.
Authors:Seung Won Lee, Semin Jin, Kyung Hoon Hyun
Abstract:
AI-based creativity support tools (CSTs) are evaluated through domain-specific metrics, limiting cross-domain comparison of creative processes. Embedding-based protocol analysis offers a potential domain-agnostic analytical layer. However, we argue that fixed embedding similarity can misrepresent creative dynamics: it may not detect creative pivots that occur within superficially similar language, treating shifts in the problem being addressed as continued elaboration. We identify three open challenges stemming from this gap: aligning similarity measures with creative significance, segmenting and representing multimodal design traces, and evaluating agentic systems where embedding-based metrics enter the generation loop and shape agent behavior. We propose context-aware interventions using large language models as a direction for making trace analysis sensitive to session-specific creative dynamics.
Authors:Wenwei Li, Jiarun Zhou, Qinxiao Quan, Fusang Zhang, Daqing Zhang
Abstract:
Contactless sensing using wireless communication signals has garnered significant attention due to its non-intrusive nature and ubiquitous infrastructure. Despite the promise, the inherent bistatic deployment of wireless communication introduces clock asynchronism, which leads to unknown phase offsets in channel response and hinders fine-grained sensing. State-of-the-art systems widely adopt the cross-antenna channel ratio to cancel these detrimental phase offsets. However, the channel ratio preserves sensing feature accuracy only at integer-wavelength target displacements, losing sub-wavelength fidelity. To overcome this limitation, we derive the first quantitative mapping between the distorted ratio feature and the ideal channel feature. Building on this foundation, we develop a robust framework that leverages channel response amplitude to recover the ideal channel feature from the distorted ratio. Real-world experiments across Wi-Fi and LoRa demonstrate that our method can effectively reconstruct sub-wavelength displacement details, achieving nearly an order-of-magnitude improvement in accuracy.
Authors:Mason Kadem, Sarah Masri, Anthea Innes, Rong Zheng
Abstract:
We conducted a scoping review to map the rapidly evolving landscape of wearable and ambient sensing technologies for monitoring people with dementia across home and institutional settings. We analyzed empirical sensing studies (2015-2025) to identify and inform future technical and human-centered design requirements. Five key implementation principles emerge: (1) human-centered design involving all stakeholders to augment rather than replace caregivers; (2) personalized, adaptable solutions that support autonomy across settings and severity levels instead of standardized approaches; (3) integration with existing workflows with adequate training and support; (4) proactive privacy and consent considerations, especially for ambient monitoring of residents and caregivers; and (5) cost-effective, ethical, equitable, scalable solutions with quantifiable outcomes. This paper identifies gaps, trends and opportunities for developing sensing systems that address the complex challenges, while enhancing automation and autonomy, in dementia care.
Authors:Bibeg Limbu, Irene-Angelica Chounta
Abstract:
This exploratory pilot study investigates the impact of haptic perception --specifically tactile sensitivity (touch) and kinaesthetic intensity (movement)-- on learning, operationalized as information retention (immediate recall) through handwriting. Participants (N=20) were randomly assigned to one of four experimental groups in a 2x2 factorial design, manipulating touch (via glove use) and movement (via increased writing pressure). Information retention was measured using an immediate recall test, while mental effort (reaction time in a secondary task) and perceived workload (NASA-TLX) were examined as mediating variables. Bayesian binomial regression revealed moderate evidence that increased writing pressure negatively influenced recall (85-88% probability of negative effect), whereas glove use alone demonstrated no clear effect. Bayesian mediation analysis found no strong evidence that mental effort or perceived workload mediated these effects, as all 95% credible intervals included zero, indicating substantial uncertainty. These findings suggest that increased Kinaesthetic demands may slightly impair immediate recall, independent of perceived workload or mental effort. Importantly, the manipulation of touch alone does not appear to influence information retention. The study contributes to understanding the nuanced relationship between embodied interactions and cognitive outcomes, with implications for designing sensor-based multimodal learning environments.
Authors:Ali Ebrahimi Pourasad, Meyssam Saghiri, Walid Maalej
Abstract:
User feedback is essential for the success of mobile apps, yet what users report and what developers need often diverge. Research shows that users often submit vague feedback and omit essential contextual details. This leads to incomplete reports and time-consuming clarification discussions. To overcome this challenge, we propose FeedAIde, a context-aware, interactive feedback approach that supports users during the reporting process by leveraging the reasoning capabilities of Multimodal Large Language Models. FeedAIde captures contextual information, such as the screenshot where the issue emerges, and uses it for adaptive follow-up questions to collaboratively refine with the user a rich feedback report that contains information relevant to developers. We implemented an iOS framework of FeedAIde and evaluated it on a gym's app with its users. Compared to the app's simple feedback form, participants rated FeedAIde as easier and more helpful for reporting feedback. An assessment by two industry experts of the resulting 54 reports showed that FeedAIde improved the quality of both bug reports and feature requests, particularly in terms of completeness. The findings of our study demonstrate the potential of context-aware, GenAI-powered feedback reporting to enhance the experience for users and increase the information value for developers.
Authors:Joseph Walusimbi, Ann Move Oguti, Joshua Benjamin Ssentongo, Keith Ainebyona
Abstract:
The rapid global expansion of large language models (LLMs) has created new opportunities for personalised and inquiry-driven learning. However, most AI chatbot systems for education rely on continuous internet connectivity, cloud infrastructure, and modern hardware. These requirements reinforce digital inequalities and limit the practical deployment of AI-supported learning in bandwidth-constrained and resource-limited environments worldwide. This paper presents Arapai, an offline-first AI chatbot architecture designed to operate entirely without internet connectivity on low-specification, CPU-only devices. The system integrates locally hosted, quantised language models with automatic hardware-aware model selection and pedagogically tiered response control. By performing inference fully on-device and maintaining models resident in memory for performance optimisation, Arapai delivers curriculum-aligned explanations, structured problem-solving support, and differentiated instructional depth without reliance on cloud services. A pilot deployment in secondary and tertiary institutions operating under limited-connectivity conditions evaluated the system across four dimensions: technical performance, usability, perceived answer quality, and educational impact. Results indicate stable operation on legacy hardware, acceptable response times for standard instructional queries, and positive learner and teacher perceptions regarding self-directed learning support. Rather than replacing cloud-based AI systems, this work proposes a complementary deployment paradigm for infrastructure-constrained education systems. The study contributes a hardware-aware architectural framework for decentralised AI tutoring and highlights the role of offline-first design in advancing digital inclusion and infrastructure-resilient educational technology.
Authors:Alberto Tono, Jiajun Wu, Gordon Wetzstein, Iro Armeni, Hariharan Subramonyam, James Landay, Martin Fischer
Abstract:
In the past decade, advances in artificial intelligence have revolutionized sketch-based 3D modeling, leading to a new paradigm known as Deep Sketch-Based 3D Modeling (DS-3DM). DS-3DM offers data-driven methods that address the long-standing challenges of sketch abstraction and ambiguity. DS-3DM keeps humans at the center of the creative process by enhancing the flexibility, usability, faithfulness, and adaptability of sketch-based 3D modeling interfaces. This paper contributes a comprehensive survey of the latest DS-3DM within a novel design space: MORPHEUS. Built upon the Input-Model-Output (IMO) framework, MORPHEUS categorizes Models outputting Options of 3D Representations and Parts, derived from Human inputs (varying in quantity and modality), and Evaluated across diverse User-views and Styles. Throughout MORPHEUS we highlight limitations and identify opportunities for interdisciplinary research in Computer Vision, Computer Graphics, and Human-Computer Interaction, revealing a need for controllability and information-rich outputs. These opportunities align design processes more closely with user' intent, responding to the growing importance of user-centered approaches.
Authors:Elena Koung, Yunhan Liu, Zinan Zhang, Xinning Gui, Yubo Kou
Abstract:
Teenagers are avid users of Discord, a fast growing platform for synchronous communication where they often interact with strangers. Because Discord combines private DMs, semi-private voice channels, and public servers in one place, it creates a hybrid environment that can produce complex and underexplored safety risks for teenagers. Drawing on 16 interviews with teenage Discord users, this study examines their strategies for navigating risky social interactions in the platform. Our findings reveal that when teenagers encounter risks during social interactions, they exercise vigilance by evaluating suspicious interactions before forming friendships, using safety tools, and engaging in controlled risk-taking to safeguard their privacy and security. At the community level, they mitigate risks through selective participation in servers, a practice supported by vigilant governance structures. We discuss how vigilance enables teenagers to act during risky encounters to protect themselves, advancing understanding of teenagers' agency in risk navigation and informing teen-centered designs for safer online environments.
Authors:Chen Chen, Michel Pahud, David Brown, Chuck Needham, Balasaravanan T. Kumaravel, Andrew D. Wilson, Ken Hinckley, Nicolai Marquardt
Abstract:
Layering information spaces is a promising strategy to design intuitive and engaging interactive experiences. Although multi-layer displays enable promising interaction techniques through limited depth perception - achieved via slight separation between layers - it remains unclear how to fully design experiences that leverage the unique affordances of layered information. To address this, we introduce Proscenium, a dual-layer, large transparent display workspace setup with an adjustable separation between the layers. We demonstrate our preliminary design space focusing on how rendered information can be transitioned and linked across displays, and showcase 14 speculative experience prototypes across six categories.
Authors:Jocelyn Shen, Nicolai Marquardt, Hugo Romat, Ken Hinckley, Nathalie Riche, Fanny Chevalier
Abstract:
What if text could be sculpted and refined like clay -- or cultivated and pruned like a plant? Texterial reimagines text as a material that users can grow, sculpt, and transform. Current generative-AI models enable rich text operations, yet rigid, linear interfaces often mask such capabilities. We explore how the text-as-material metaphor can reveal AI-enabled operations, reshape the writing process, and foster compelling user experiences. A formative study shows that users readily reason with text-as-material, informing a conceptual framework that explains how material metaphors shift mental models and bridge gulfs of envisioning, execution, and evaluation in LLM-mediated writing. We present the design and evaluation of two technical probes: Text as Clay, where users refine text through gestural sculpting, and Text as Plants, where ideas grow serendipitously over time. This work expands the design space of writing tools by treating text as a living, malleable medium.
Authors:Zinan Zhang, Xinning Gui, Yubo Kou
Abstract:
Cooperative play (co-play) is often positioned as a family-beneficial practice that can strengthen parent-child bonds and support parental mediation in games. Yet co-play in user-generated virtual worlds (UGVWs) can be disrupted by real-time harms that parents cannot easily prevent. Roblox, a platform with millions of user-generated virtual worlds and a large child player base, illustrates this challenge. Prior work on harmful UGVW design highlights risks beyond content problems, including manipulative monetization prompts, unmoderated social interactions, emergent in-world behaviors, and narrative designs that may normalize harmful ideologies. Current governance and moderation approaches, largely adapted from social media, focus on static artifacts and often fail to capture interactive and emergent harms in virtual worlds. This workshop paper asks: how might UGVWs and their platforms be designed to minimize harms that specifically impair family co-play experiences?
Authors:Soyoung Jung, Daehoo Yoon, Sung Gyu Koh, Young Hwan Kim, Yehan Ahn, Sung Park
Abstract:
Agentic AI increasingly intervenes proactively by inferring users' situations from contextual data yet often fails for lack of principled judgment about when, why, and whether to act. We address this gap by proposing a conceptual model that reframes behavior as an interpretive outcome integrating Scene (observable situation), Context (user-constructed meaning), and Human Behavior Factors (determinants shaping behavioral likelihood). Grounded in multidisciplinary perspectives across the humanities, social sciences, HCI, and engineering, the model separates what is observable from what is meaningful to the user and explains how the same scene can yield different behavioral meanings and outcomes. To translate this lens into design action, we derive five agent design principles (behavioral alignment, contextual sensitivity, temporal appropriateness, motivational calibration, and agency preservation) that guide intervention depth, timing, intensity, and restraint. Together, the model and principles provide a foundation for designing agentic AI systems that act with contextual sensitivity and judgment in interactions.
Authors:Aaron Broukhim, Nadir Weibel, Eshin Jolly
Abstract:
Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, with inter-rater agreement reaching good levels (ICC(2,k) $\approx$ .80) at $\sim$9 raters -- the first ICC-based reliability characterization in the preference annotation literature for either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Synthetic ratings further align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and as full replacements for human annotations.
Authors:Hannah Kim, Rahad Arman Nabid, Jeni Sorathiya, Minh Doan, Elijah Jordan, Rayhana Nasimova, Sergei L. Kosakovsky Pond, Stephen MacNeil
Abstract:
Understanding information-seeking behaviors in e-learning is critical, as learners must often make sense of complex and fragmented information, a challenge compounded in interdisciplinary fields with diverse prior knowledge. Compared to traditional e-tutorials, GenAI e-tutorials offer new ways to navigate information spaces, yet how they shape learners information-seeking behaviors remains unclear. To address this gap, we characterized behavioral differences between traditional and GenAI-mediated e-tutorial learning using the three search modes of orienteering. We conducted a between-subject study in which learners engaged with either a traditional e-tutorial or a GenAI e-tutorial accessing the same underlying information content. We found that the traditional users maintained greater awareness and focus of the information space, whereas GenAI users exhibited more proactive and exploratory behaviors with lower cognitive load due to the querying-driven interaction. These findings offer guidance for designing tutorials in e-learning.
Authors:Lauren Vogelstein, Vedya Konda, Deborah Fields, Yasmin Kafai, Luis Morales-Navarro, Danaé Metaxa
Abstract:
Today's youth have extensive experience interacting with artificial intelligence and machine learning applications on popular social media platforms, putting youth in a unique position to examine, evaluate, and even challenge these applications. Algorithm auditing is a promising candidate for connecting youth's everyday practices in using AI applications with more formal scientific literacies (syncretic designs). In this paper, we analyze high school youth participants' everyday algorithm auditing practices when interacting with generative AI filters on TikTok, revealing thorough and extensive examinations, with youth rapidly testing filters with sophisticated camera variations and facial manipulations to identify filter limitations. In the discussion, we address how these findings can provide a foundation for developing designs that bring together everyday and more formal algorithm auditing.
Authors:Leni Yang, Aymeric Ferron, Yvonne Jansen, Pierre Dragicevic
Abstract:
People often struggle to interpret data with extremely large or small values, or ranges spanning multiple orders of magnitude. While traditional approaches, such as log scales and multiscale visualizations, can help, we explore in this article a different approach used in some emerging designs: the use of motion to let viewers gradually experience magnitude -- for example, interactive graphics that require long scrolling or street paintings stretching hundreds of meters. This approach typically demands substantial time and sustained interaction, translating differences in magnitude into a visceral sense of duration and effort. Although largely underexplored, this design strategy offers new opportunities. We introduce the term progressive value reading to refer to the use of motion to progressively examine an information object that encodes a value, where the amount of motion reflects the value. We compiled a corpus of 55 real-life and hypothetical visualization examples that allow, encourage, or require progressive value reading. From this corpus, we derived a design space of ten design dimensions, providing a shared vocabulary, inspiration for novel techniques, and a foundation for empirical evaluation. An online corpus is also available for exploration.
Authors:Krzysztof Kutt, Elżbieta Sroka, Oleksandra Ishchuk, Luiz do Valle Miranda
Abstract:
The growing volume of digital cultural heritage resources highlights the need for advanced recommendation methods capable of interpreting semantic relationships between heterogeneous data entities. This paper presents a complete methodology for implementing a hybrid recommendation pipeline integrating knowledge-graph embeddings, approximate nearest-neighbour search, and SPARQL-driven semantic filtering. The work is evaluated on the JUHMP (Jagiellonian University Heritage Metadata Portal) knowledge graph developed within the CHExRISH project, which at the time of experimentation contained ${\approx}3.2$M RDF triples describing people, events, objects, and historical relations affiliated with the Jagiellonian University (Kraków, PL). We evaluate four embedding families (TransE, ComplEx, ConvE, CompGCN) and perform hyperparameter selection for ComplEx and HNSW. Then, we present and evaluate the final three-stage neuro-symbolic recommender. Despite sparse and heterogeneous metadata, the approach produces useful and explainable recommendations, which were also proven with expert evaluation.
Authors:Kirk Vanacore, Danielle R Thomas, Digory Smith, Bibi Groot, Justin Reich, Rene Kizilcec
Abstract:
This paper introduces a scalable causal inference framework for estimating the immediate, session-level effects of on-demand human tutoring embedded within adaptive learning systems. Because students seek assistance at moments of difficulty, conventional evaluation is confounded by self-selection and time-varying knowledge states. We address these challenges by integrating principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. Applying this framework to over 5,000 middle-school mathematics tutoring sessions, we find that requesting human tutoring increases next-problem correctness by approximately 4 percentage points and accuracy on the subsequent skill encountered by approximately 3 percentage points, suggesting that the effects of tutoring have proximal transfer across knowledge components. This effect is robust to various forms of model specification and potential unmeasured confounders. Notably, these effects exhibit significant heterogeneity across sessions and students, with session-level effect estimates ranging from $-20.25pp$ to $+19.91pp$. Our follow-up analyses suggest that typical behavioral indicators, such as student talk time, do not consistently correlate with high-impact sessions. Furthermore, treatment effects are larger for students with lower prior mastery and slightly smaller for low-SES students. This framework offers a rigorous, practical template for the evaluation and continuous improvement of on-demand human tutoring, with direct applications for emerging AI tutoring systems.
Authors:Boyuan Gu, Shuaiqi Cheng, Minghao yu
Abstract:
As the aging population faces a chronic care deficit, domestic care is increasingly recast as spectral governance. This paper presents a design fiction set in 2036, where the home is governed by Neural-Wave, a camera-free mmWave sensing platform that infers well-being from involuntary micro-motions. Through a set of scenarios, we illustrate how such empathic systems displace autonomy, forcing residents to perform legibility to regain basic freedoms. Our primary contribution is a diegetic artifact: The Neural-Wave Quick Escape Manual. Styled as an illicit guide for the elderly, it details adversarial tactics: structured around protocols to Comply, Degrade, and Refuse, that exploit signal processing vulnerabilities to reclaim domestic privacy. Through this artifact, we argue that in the era of empathic AIoT, privacy requires more than policy opt-outs; it demands adversarial literacy:the capacity to meaningfully obfuscate one's own data traces against an infrastructural jailer that calls itself care.
Authors:Xizi Wang, Yue Lyu, Yalong Yang, Jian Zhao
Abstract:
Immersive videos (IVs) provide 360° environments that create a strong sense of presence and spatial exploration. Unlike traditional videos, IVs distribute information across multiple directions, making comparison cognitively demanding and highly dependent on interaction techniques. With the growing adoption of IVs, effective comparison techniques have become an essential yet underexplored area of research. Inspired by the "sliding" concept in 2D media comparison, we integrate two established comparison strategies from the literature--toggle and side-by-side--to support IV comparison with greater flexibility. For an in-depth understanding of different strategies, we adapt and implement five IV comparison techniques across VR and 2D environments: SlideInVR, ToggleInVR, SlideIn2D, ToggleIn2D, and SideBySideIn2D. We then conduct a user study (N=20) to examine how these techniques shape users' perceptions, strategies, and workflows. Our findings provide empirical insights into the strengths and limitations of each technique, underscoring the need to switch between comparison approaches across scenarios. Notably, participants consistently rate SlideInVR and SlideIn2D as the most flexible and favorite methods for IV comparison.
Authors:Zhengtai Gou, Junxiao Long, Tao Lu, Jian Zhao, Yalong Yang
Abstract:
Immersive analytics enables collaborative data analysis in shared virtual spaces. While synchronous collaboration in such environments is well-established, real-world analysis often requires an effective task handover - the transfer of knowledge and analytical context between analysts working asynchronously. Traditional handover methods often rely on static annotations that fail to capture the dynamic problem-solving process and spatial context inherent in immersive workflows. To address this handover challenge, we explore session replay as a comprehensive approach for analysts to re-experience a predecessor's work, facilitating a deeper understanding of both the visual details and the insight formation process. Two phases of studies were conducted to establish design guidelines for such replay systems by investigating the impact of viewing platform (PC vs. VR), perspective (first-person vs. third-person), and navigation control (active vs. passive). Phase 1 identified the optimal replay configurations within each viewing platform, revealing a platform-dependent divergence: PC users favored a guided, first-person perspective for its focused detail, while VR users benefited significantly from the agency afforded by a third-person perspective with active navigation. After refining each condition based on user feedback, including developing a novel hybrid 1PP+3PP format for PC, Phase 2 compared the two optimized systems (PC vs. VR). Our results show that the immersive VR replay led to significantly better task comprehension and workflow reconstruction accuracy, demonstrating the critical role of embodied agency in understanding complex analytical processes.
Authors:Monalika Padma Reddy, Aruna Balasubramanian, Jiawei Zhou, Xiaojun Bi, IV Ramakrishnan, Vikas Ashok
Abstract:
AI tools like ChatGPT and Be-My-AI are increasingly being used by blind individuals. Although prior work has explored their use in some Do-It-Yourself (DIY) tasks by blind individuals, little is known about how they use these tools and the available product-manual resources to assemble, operate, and troubleshoot physical or tangible products - tasks requiring spatial reasoning, structural understanding, and precise execution. We address this knowledge gap via an interview study and a usability study with blind participants, investigating how they leverage AI tools and product manuals for DIY tasks with physical products. Findings show that manuals are essential resources, but product-manual instructions are often inadequate for blind users. AI tools presently do not adequately address this insufficiency; in fact, we observed that they often exacerbate this issue with incomplete, incoherent, or misleading guidance. Lastly, we suggest improvements to AI tools for generating tailored instructions for blind users' DIY tasks involving tangible products.
Authors:Satwik Ram Kodandaram, Jiawei Zhou, Xiaojun Bi, IV Ramakrishnan, Vikas Ashok
Abstract:
Accessibility forums and, more recently, generative AI tools have become vital resources for blind users seeking solutions to computer-interaction issues and learning about new assistive technologies, screen reader features, tutorials, and software updates. Understanding user experiences with these resources is essential for identifying and addressing persistent support gaps. Towards this, we interviewed 14 blind users who regularly engage with forums and GenAI tools. Findings revealed that forums often overwhelm users with multiple overlapping topics, redundant or irrelevant content, and fragmented responses that must be mentally pieced together, increasing cognitive load. GenAI tools, while offering more direct assistance, introduce new barriers by producing unreliable answers, including overly verbose or fragmented guidance, fabricated information, and contradictory suggestions that fail to follow prompts, thereby heightening verification demands. Based on these insights, we outlined design opportunities to improve the reliability of assistive resources, aiming to provide blind users with more trustworthy and cognitively-manageable support.
Authors:Bijean Ghafouri, Emilio Ferrara
Abstract:
When AI systems summarize and relay information, they inevitably transform it. But how? We introduce an experimental paradigm based on the telephone game to study what happens when AI talks to AI. Across five studies tracking content through AI transmission chains, we find three consistent patterns. The first is convergence, where texts differing in certainty, emotional intensity, and perspectival balance collapse toward a shared default of moderate confidence, muted affect, and analytical structure. The second is selective survival, where narrative anchors persist while the texture of evidence, hedges, quotes, and attributions is stripped away. The third is competitive filtering, where strong arguments survive while weaker but valid considerations disappear when multiple viewpoints coexist. In downstream experiments, human participants rated AI-transmitted content as more credible and polished. Importantly, however, humans also showed degraded factual recall, reduced perception of balance, and diminished emotional resonance. We show that the properties that make AI-mediated content appear authoritative may systematically erode the cognitive and affective diversity on which informed judgment depends.
Authors:Ryo Ohara, Chi-Lan Yang, Yuji Hatada, Takuji Narumi, Hideaki Kuzuoka
Abstract:
Social VR platforms serve as an emergent venue for live performance, enabling co-presence and real-time interaction among distributed performers and audiences within shared virtual environments. Live performances, such as comedy, rely on subtle social cues between performers and audiences, which are missing in VR. However, it remains unclear how comedians utilize avatar-mediated cues in social VR. We conducted semi-structured interviews and observations with 23 virtual comedians on VRChat. Results revealed that virtual comedians transformed their limited nonverbal expressiveness into performative opportunities through intentional control and exaggeration. Additionally, a distinctive culture emerged around context-appropriate emoji reactions from audiences, while challenges such as audio latency and moderation against trolling were highlighted. Our findings advance understanding of how performers creatively adapt to expressive constraints in avatar-mediated settings. We further demonstrate how challenges in performer-audience interaction and moderation provide design insights for systems enhancing feedback visibility and sustain community norms without restricting creative expression.
Authors:Weiwen Su, Yuhan Zhou, Zihan Wang, Naoki Yoshinaga, Masashi Toyoda
Abstract:
Existing user simulations, where models generate user-like responses in dialogue, often lack verification that sufficient user personas are provided, questioning the validity of the simulations. To address this core concern, this work explores the task of identifying relevant but unknown personas of the simulation target for a given simulation context. We introduce PICQ, a novel dataset of context-aware choice questions, annotated with unknown personas (e.g., ''Is the user price-sensitive?'') that may influence user choices, and propose a multi-faceted evaluation scheme assessing fidelity, influence, and inaccessibility. Our benchmark of leading LLMs reveals a complex ''Fidelity vs. Insight'' dilemma governed by model scale: while influence generally scales with model size, fidelity to human patterns follows an inverted U-shaped curve. We trace this phenomenon to cognitive differences, particularly the human tendency for ''cognitive economy.'' Our work provides the first comprehensive benchmark for this crucial task, offering a new lens for understanding the divergent cognitive models of humans and advanced LLMs.
Authors:Annalisa Szymanski, Oghenemaro Anuyah, Toby Jia-Jun Li, Ronald A. Metoyer
Abstract:
Large Language Models (LLMs) are increasingly developed for use in complex professional domains, yet little is known about how teams design and evaluate these systems in practice. This paper examines the challenges and trade-offs in LLM development through a 12-week ethnographic study of a team building a pedagogical chatbot. The researcher observed design and evaluation activities and conducted interviews with both developers and domain experts. Analysis revealed four key practices: creating workarounds for data collection, turning to augmentation when expert input was limited, co-developing evaluation criteria with experts, and adopting hybrid expert-developer-LLM evaluation strategies. These practices show how teams made strategic decisions under constraints and demonstrate the central role of domain expertise in shaping the system. Challenges included expert motivation and trust, difficulties structuring participatory design, and questions around ownership and integration of expert knowledge. We propose design opportunities for future LLM development workflows that emphasize AI literacy, transparent consent, and frameworks recognizing evolving expert roles.
Authors:Yifan Zhang, Tianle Ren, Fei Wang, Brian Y Lim
Abstract:
Explaining with examples is an intuitive way to justify AI decisions. However, it is challenging to understand how a decision value should change relative to the examples with many features differing by large amounts. We draw from real estate valuation that uses Comparables-examples with known values for comparison. Estimates are made more accurate by hypothetically adjusting the attributes of each Comparable and correspondingly changing the value based on factors. We propose Comparables XAI for relatable example-based explanations of AI with Trace adjustments that trace counterfactual changes from each Comparable to the Subject, one attribute at a time, monotonically along the AI feature space. In modelling and user studies, Trace-adjusted Comparables achieved the highest XAI faithfulness and precision, user accuracy, and narrowest uncertainty bounds compared to linear regression, linearly adjusted Comparables, or unadjusted Comparables. This work contributes a new analytical basis for using example-based explanations to improve user understanding of AI decisions.
Authors:Fei Wang, Yifan Zhang, Brian Y. Lim
Abstract:
Current Explainable AI (XAI) focuses on explaining a single application, but when encountering related applications, users may rely on their prior understanding from previous explanations. This leads to either overgeneralization and AI overreliance, or burdensome independent memorization. Indeed, related decision tasks can share explanatory factors, but with some notable differences; e.g., body mass index (BMI) affects the risks for heart disease and diabetes at the same rate, but chest pain is more indicative of heart disease. Similarly, models using different attributes for the same task still share signals; e.g., temperature and pressure affect air pollution but in opposite directions due to the ideal gas law. Leveraging transfer of learning, we propose Transferable XAI to enable users to transfer understanding across related domains by explaining the relationship between domain explanations using a general affine transformation framework applied to linear factor explanations. The framework supports explanation transfer across various domain types: translation for data subspace (subsuming prior work on Incremental XAI), scaling for decision task, and mapping for attributes. Focusing on task and attributes domain types, in formative and summative user studies, we investigated how well participants could understand AI decisions from one domain to another. Compared to single-domain and domain-independent explanations, Transferable XAI was the most helpful for understanding the second domain, leading to the best decision faithfulness, factor recall, and ability to relate explanations between domains. This framework contributes to improving the reusability of explanations across related AI applications by explaining factor relationships between subspaces, tasks, and attributes.
Authors:Akhil Ramachandran, Ankit Arun, Ashish Shenoy, Abhay Harpale, Srihari Jayakumar, Debojeet Chatterjee, Mohsen Moslehpour, Pierce Chuang, Yichao Lu, Vikas Bhardwaj, Peyman Heidari
Abstract:
Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements - OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.
Authors:Md Muntasir Jahid Ayan, Md. Shahriar Rashid, Tazzina Afroze Hassan, Hossain Md. Mubashshir Jamil, Mahbubul Islam, Lisan Al Amin, Rupak Kumar Das, Farzana Akter, Faisal Quader
Abstract:
The increasing complexity and frequency of cyber-threats demand intrusion detection systems (IDS) that are not only accurate but also interpretable. This paper presented a novel IDS framework that integrated Explainable Artificial Intelligence (XAI) to enhance transparency in deep learning models. The framework was evaluated experimentally using the benchmark dataset NSL-KDD, demonstrating superior performance compared to traditional IDS and black-box deep learning models. The proposed approach combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks for capturing temporal dependencies in traffic sequences. Our deep learning results showed that both CNN and LSTM reached 0.99 for accuracy, whereas LSTM outperformed CNN at macro average precision, recall, and F-1 score. For weighted average precision, recall, and F-1 score, both models scored almost similarly. To ensure interpretability, the XAI model SHapley Additive exPlanations (SHAP) was incorporated, enabling security analysts to understand and validate model decisions. Some notable influential features were srv_serror_rate, dst_host_srv_serror_rate, and serror_rate for both models, as pointed out by SHAP. We also conducted a trust-focused expert survey based on IPIP6 and Big Five personality traits via an interactive UI to evaluate the system's reliability and usability. This work highlighted the potential of combining performance and transparency in cybersecurity solutions and recommends future enhancements through adaptive learning for real-time threat detection.
Authors:Johannes Wortmann, Bernd Schäufele, Konstantin Klipp, Ilja Radusch, Katharina Blaß, Thomas Jung
Abstract:
The navigation of indoor spaces poses difficult challenges for individuals with visual impairments, as it requires processing of sensory information, dealing with uncertainties, and relying on assistance. To tackle these challenges, we present an indoor navigation app that places importance on accessibility for visually impaired users. Our approach involves a combination of user interviews and an analysis of the Web Content Accessibility Guidelines. With this approach, we are able to gather invaluable insights and identify design requirements for the development of an indoor navigation app. Based on these insights, we develop an indoor navigation app that prioritizes accessibility, integrating enhanced features to meet the needs of visually impaired users. The usability of the app is being thoroughly evaluated through tests involving both visually impaired and sighted users. Initial feedback has been positive, with users appreciating the inclusive user interface and the usability with a wide range of accessibility tools and Android device settings.
Authors:Qijia Chen, Andrea Bellucci, Giulio Jacucci
Abstract:
The sense of presence is central to immersive experiences in Virtual Reality (VR), and particularly salient in socially rich platforms like social VR. While prior studies have explored various aspects related to presence, less is known about how ongoing usage behaviors shape presence in everyday engagement. To address this gap, we examine whether usage intensity, captured through frequency of use, session duration, and years of VR experience, predicts presence in social VR. A survey of 295 users assessed overall, social, spatial, and self-presence using validated scales. Results show that both frequency and duration consistently predict higher presence across all dimensions, with interaction effects indicating that frequent and extended sessions synergistically amplify the experience of "being there." These effects were stable across age and gender. Our findings extend presence research beyond the laboratory by identifying behavioral predictors in social VR and offer insights for building inclusive environments that reliably foster presence.
Authors:Qijia Chen, Andrea Bellucci, Giulio Jacucci
Abstract:
Extensive research has examined presence and basic psychological needs (drawing on Self-Determination Theory) in digital media. While prior work offers hints of potential connections, we lack a systematic account of whether and how distinct presence dimensions map onto the basic needs of autonomy, competence, and relatedness. We surveyed 301 social VR users and analyzed using Structural Equation Modeling. Results show that social presence predicts all three needs, while self-presence predicts competence and relatedness, and spatial presence shows no direct or moderating effects. Gender and age moderated these relationships: women benefited more from social presence for autonomy and relatedness, men from self- and spatial presence for competence and autonomy, and younger users showed stronger associations between social presence and relatedness, and between self-presence and autonomy. These findings position presence as a motivational mechanism shaped by demographic factors. The results offer theoretical insights and practical implications for designing inclusive, need-supportive multiuser VR environments.
Authors:ATM Mizanur Rahman, Sharifa Sultana
Abstract:
People in informal e-markets often try to deal with fraud and financial harm by sharing posts, screenshots, and warnings in social media groups. However, buyers and sellers frequently face further problems because these reports are scattered, hard to verify, and rarely lead to resolution. We studied these issues through a survey with 124 participants and interviews with 36 buyers, sellers, and related stakeholders from Bangladesh and designed Bonik Somiti, a socio-technical system that supports structured reporting, admin-led mediation, and accountability in informal e-markets. Our evaluation with 32 participants revealed several challenges in managing fraud, resolving disputes, and building trust within existing informal practices and the assumptions behind them. Based on these findings, we further discuss how community-centered technologies can be designed to support safer and more accountable informal e-markets in the Global South.
Authors:Qiaosi Wang, Jini Kim, Avanita Sharma, Alicia, Lee, Jodi Forlizzi, Hong Shen
Abstract:
Theory of Mind (ToM) -- the ability to infer what others are thinking (e.g., intentions) from observable cues -- is traditionally considered fundamental to human social interactions. This has sparked growing efforts in building and benchmarking AI's ToM capability, yet little is known about how such capability could translate into the design and experience of everyday user-facing AI products and services. We conducted 13 co-design sessions with 26 U.S.-based AI practitioners to envision, reflect, and distill design recommendations for ToM-enabled everyday AI products and services that are both future-looking and grounded in the realities of AI design and development practices. Analysis revealed three interrelated design recommendations: ToM-enabled AI should 1) be situated in the social context that shape users' mental states, 2) be responsive to the dynamic nature of mental states, and 3) be attuned to subjective individual differences. We surface design tensions within each recommendation that reveal a broader gap between practitioners' envisioned futures of ToM-enabled AI and the realities of current AI design and development practices. These findings point toward the need to move beyond static, inference-driven approach to ToM and toward designing ToM as a pervasive capability that supports continuous human-AI interaction loops.
Authors:Caitlin Morris, Pattie Maes
Abstract:
When learners receive feedback, what they believe about its source may shape how they engage with it. As AI is used alongside human instructors, understanding these attribution effects is essential for designing effective hybrid AI-human educational systems. We designed a creative coding interface that isolates source attribution while controlling for content: all participants receive identical LLM-generated feedback, but half see it attributed to AI and half to a human teaching assistant (TA). We found two key results. First, perceived feedback source affected engagement: learners in the TA condition spent significantly more time and effort (d = 0.88-1.56) despite receiving identical feedback. Second, perceptions differed: AI-attributed feedback ratings were predicted by prior trust in AI (r = 0.85), while TA-attributed ratings were predicted by perceived genuineness (r = 0.65). These findings suggest that feedback source shapes both engagement and evaluation, with implications for hybrid educational system design.
Authors:Yu Wang, Frederik L. Dennig, Michael Behrisch, Alexandru Telea
Abstract:
Projections (or dimensionality reduction) methods $P$ aim to map high-dimensional data to typically 2D scatterplots for visual exploration. Inverse projection methods $P^{-1}$ aim to map this 2D space to the data space to support tasks such as data augmentation, classifier analysis, and data imputation. Current $P^{-1}$ methods suffer from a fundamental limitation -- they can only generate a fixed surface-like structure in data space, which poorly covers the richness of this space. We address this by a new method that can `sweep' the data space under user control. Our method works generically for any $P$ technique and dataset, is controlled by two intuitive user-set parameters, and is simple to implement. We demonstrate it by an extensive application involving image manipulation for style transfer.
Authors:Xinru Tang, Anne Marie Piper
Abstract:
While sign language translation systems promise to enhance deaf people's access to information and communication, they have been met with strong skepticism from deaf communities due to risks of misrepresenting and oversimplifying the richness of signed communication in technologies. This article provides empirical evidence of the complexity of translation work involved in deaf communication through interviews with 13 deaf Chinese content creators who actively produce and share sign language content on video sharing platforms with both deaf and hearing audiences. By studying this unique group of content creators, our findings highlight the nuances of sign language translation, showing how deaf creators create content with multilingualism and multiculturalism in mind, support meaning making across languages and cultures, and navigate politics involved in their translation work. Grounded in these deaf-led translation practices, we draw on the sociolinguistic concept of (trans)languaging to re-conceptualize and reimagine the design of sign language translation systems.
Authors:Yuanzhe Deng, Shutong Zhang, Kathy Cheng, Alison Olechowski, Shurui Zhou
Abstract:
Version control is critical in mechanical computer-aided design (CAD) to enable traceability, manage product variation, and support collaboration. Yet, its implementation in modern CAD software as an essential information infrastructure for product development remains plagued by issues due to the complexity and interdependence of design data. This paper presents a systematic review of user-reported challenges with version control in modern CAD tools. Analyzing 170 online forum threads, we identify recurring socio-technical issues that span the management, continuity, scope, and distribution of versions. Our findings inform a broader reflection on how version control should be designed and improved for CAD and motivate opportunities for tools and mechanisms that better support articulation work, facilitate cross-boundary collaboration, and operate with infrastructural reflexivity. This study offers actionable insights for CAD software providers and highlights opportunities for researchers to rethink version control.
Authors:Hyehyun Chu, Seungju Kim, Chen Zhou, Yu-Kai Hung, Saelyne Yang, Hyun W. Ka, Juho Kim
Abstract:
Video-based learning (VBL) has become a dominant method for learning practical skills, yet accessibility guidelines provide limited guidance for users with cognitive differences. In particular, challenges that individuals with Borderline Intellectual Functioning (BIF) encounter in video-based learning remain largely underexplored, despite VBL's potential to support their learning through features like self-paced viewing and visual demonstration. To address this gap, we conducted a series of studies with BIF individuals and caretakers to comprehensively understand their VBL challenges. Our analysis revealed challenges stemming from misalignment between user cognitive characteristics and video elements (e.g., overwhelmed by pacing and density, difficulty inferring omitted content), and experiential factors intensifying challenges (e.g., low self-efficacy). While participants employed coping strategies such as repetitive viewing to address these challenges, these strategies could not overcome fundamental gaps with video. We further discuss the design implications on both content and UI-level features for BIF and broader groups with cognitive diversities.
Authors:Zhuoqun Jiang, ShunYi Yeo, Dorien Herremans, Simon Tangi Perrault
Abstract:
While reciprocal self-disclosure drives intimacy, digital tools seldom scaffold autonomy, competence, and relatedness -- the motivational underpinnings defined by Self-Determination Theory (SDT) that enable deep exchange. We introduce a chatbot employing dual-layer scaffolding to satisfy these needs: first providing enabling affordances (instrumental support) for vulnerability, then mediating affordances (relational support) for responsiveness. In a randomized study (N = 72; 36 couples) comparing Partner Support (PS: both layers), Direct Support (DS: enabling only), and Basic Prompt (BP: questions only), results reveal a critical distinction. While enabling affordances (PS, DS) were sufficient to deepen disclosure, only mediating affordances (PS) reliably elicited partner-provided need support and increased perceived closeness. Furthermore, controlled motivation decreased across conditions, and scaffolding buffered vitality, which remained stagnant in BP. We contribute empirical evidence that SDT-guided mediation fosters connection, offering a practical framework for designing AI-mediated conversations that support, rather than replace, human intimacy.
Authors:Dominik P. Hofer, David Haag, Rania Islambouli, Jan D. Smeddinck
Abstract:
Digital behaviour change systems increasingly rely on repeated, system-initiated messages to support users in everyday contexts. LLMs enable these messages to be personalised consistently across interactions, yet it remains unclear whether such personalisation improves individual messages or instead shapes users' perceptions through patterns of exposure. We explore this question in the context of LLM-generated JITAIs, which are short, context-aware messages delivered at moments deemed appropriate to support behaviour change, using physical activity as an application domain. In a controlled retrospective study, 90 participants evaluated messages generated using four LLM strategies: baseline prompting, few-shot prompting, fine-tuned models, and retrieval augmented generation, each implemented with and without Big Five Personality Traits to produce personality-aligned communication across multiple scenarios. Using ordinal multilevel models with within-between decomposition, we distinguish trial-level effects, whether personality information improves evaluations of individual messages, from person-level exposure effects, whether participants receiving higher proportions of personality-informed messages exhibit systematically different overall perceptions. Results showed no trial-level associations, but participants who received higher proportions of BFPT-informed messages rated the messages as more personalised, appropriate, and reported less negative affect. We use Communication Accommodation Theory for post-hoc analysis. These results suggest that personality-based personalisation in behaviour change systems may operate primarily through aggregate exposure rather than per-message optimisation, with implications for how adaptive systems are designed and evaluated in sustained human-AI interaction. In-situ longitudinal studies are needed to validate these findings in real-world contexts.
Authors:Fabrizio Fornari, Eleonora Cova, Niccolò Vito Vacca, Francesco Bocci, Luigi Caputo
Abstract:
Game-based assessments (GBAs) are increasingly adopted in recruitment contexts as tools to assess transversal skills through observable behavior. However, empirical evidence directly comparing game-based behavioral indicators with traditional self-report measures remains limited. This study adopts a method-comparison approach to explore the convergence between self-perceived and behaviorally enacted problem-solving competence, comparing a game-based assessment with the Problem Solving Inventory (PSI-B). Seventy-eight participants completed both the PSI-B and a five-minute game-based problem-solving task, which classified performance into four behavioral proficiency levels. Results revealed no significant convergence between self-reported and behavior-based problem-solving scores, indicating a lack of convergence between the two measurement modalities. Rather than indicating a lack of validity of the game-based assessment, these findings support the view that self-report and behavioral measures provide complementary information about problem-solving competence. The study highlights the risks of relying on a single assessment modality in personnel selection and underscores the value of integrating game-based tools within multi-method assessment frameworks.
Authors:Yang Yian, Yu Fan, Liudmila Zavolokina, Sarah Ebling
Abstract:
Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.
Authors:Yijun Liu, John Gallagher, Sarah Sterman, Tal August
Abstract:
As AI writing tools evolve from fixing surface errors to creating language with writers, new capabilities raise concerns about negative impacts on student writers, such as replacing their voices and undermining critical thinking skills. To address these challenges, we look at a parallel transition in university writing centers from focusing on fixing errors to preserving student voices. We develop design guidelines informed by writing center literature and interviews with 10 writing tutors. We illustrate these guidelines in a prototype AI tool, Writor. Writor helps writers revise text by setting goals, providing balanced feedback, and engaging in conversations without generating text verbatim. We conducted an expert review with 30 writing instructors, tutors, and AI researchers on Writor to assess the pedagogical soundness, alignment with writing center pedagogy, and integration contexts. We distill our findings into design implications for future AI writing feedback systems, including designing for trust among AI-skeptical educators.
Authors:Yufeng Wu, Qing Li, Elise van den Hoven, A. Baki Kocaballi
Abstract:
Generative Artificial Intelligence (GenAI) is increasingly integrated into photo applications on personal devices, making editing photographs easier than ever while potentially influencing the memories they represent. This study explores how and why people use GenAI to edit personal photos and how this shapes their remembering experience. We conducted a two-phase qualitative study with 12 participants: a photo editing session using a GenAI tool guided by the Remembering Experience (RX) dimensions, followed by semi-structured interviews where participants reflected on the editing process and results. Findings show that participants prioritised felt memory over factual accuracy. For different photo elements, environments were modified easily, however, editing was deemed unacceptable if it touched upon a person's identity. Editing processes brought positive and negative impacts, and itself also became a remembering experience. We further discuss potential benefits and risks of GenAI editing for remembering purposes and propose design implications for responsible GenAI.
Authors:Minyi Wang, Christoph Bartneck, Michael-John Turp, David Kaber
Abstract:
The ethics of human-robot interaction (HRI) have been discussed extensively based on three traditional frameworks: deontology, consequentialism, and virtue ethics. We conducted a mixed within/between experiment to investigate Sparrow's proposed ethical asymmetry hypothesis in human treatment of robots. The moral permissibility of action (MPA) was manipulated as a subject grouping variable, and virtue type (prudence, justice, courage, and temperance) was controlled as a within-subjects factor. We tested moral stimuli using an online questionnaire with Perceived Moral Permissibility of Action (PMPA) and Perceived Virtue Scores (PVS) as response measures. The PVS measure was based on an adaptation of the established Questionnaire on Cardinal Virtues (QCV), while the PMPA was based on Malle et al. [39] work. We found that the MPA significantly influenced the PMPA and perceived virtue scores. The best-fitting model to describe the relationship between PMPA and PVS was cubic, which is symmetrical in nature. Our study did not confirm Sparrow's asymmetry hypothesis. The adaptation of the QCV is expected to have utility for future studies, pending additional psychometric property assessments.
Authors:Aditya Shibu, Marah Saleh, Mohamed Al-Musleh, Nidhal Abdulaziz
Abstract:
Unmanned Aerial Vehicle (UAV) swarms offer versatile applications in logistics, agriculture, and surveillance, yet controlling them requires expert knowledge for safety and feasibility. Traditional static methods limit adaptability, while Large Language Models (LLMs) enable natural language control but generate unsafe trajectories due to lacking physical grounding. This paper introduces SkySim, a ROS2-based simulation framework in Gazebo that decouples LLM high-level planning from low-level safety enforcement. Using Gemini 3.5 Pro, SkySim translates user commands (e.g., "Form a circle") into spatial waypoints, informed by real-time drone states. An Artificial Potential Field (APF) safety filter applies minimal adjustments for collision avoidance, kinematic limits, and geo-fencing, ensuring feasible execution at 20 Hz. Experiments with swarms of 3, 10, and 30 Crazyflie drones validate spatial reasoning accuracy (100% across tested geometric primitives), real-time collision prevention, and scalability. SkySim empowers non-experts to iteratively refine behaviors, bridging AI cognition with robotic safety for dynamic environments. Future work targets hardware integration.
Authors:Conrad Borchers, Hannah Deininger, Zachary A. Pardos
Abstract:
Learning analytics (LA) draws from the learning sciences to interpret learner behavior and inform system design. Yet, past personalization remains largely at the content or performance level (during learner-system interactions), overlooking relatively stable individual differences such as personality (unfolding over long-term learning trajectories such as college degrees). The latter could bring underappreciated benefits to the design, implementation, and impact of LA. In this position paper, we conduct an ad hoc literature review and argue for an expanded framing of LA that centers on learner traits as key to both interpreting and designing close-the-loop experiments in LA. We show that personality traits are relevant to LA's central outcomes (e.g., engagement and achievement) and conducive to action, as their established ties to human-computer interaction (HCI) inform how systems time, frame, and personalize support. Drawing inspiration from HCI, where psychometrics inform personalization strategies, we propose that LA can evolve by treating traits not only as predictive features but as design resources and moderators of analytics efficacy. In line with past position papers published at LAK, we present a research agenda grounded in the LA cycle and discuss methodological and ethical challenges.
Authors:Logan Lane, Ibrahim Tahmid, Feiyu Lu, Doug A. Bowman
Abstract:
Additive models of interaction performance, such as the Keystroke-Level Model (KLM), are tools that allow designers to compare and optimize the performance of user interfaces by summing the predicted times for the atomic components of a specific interaction to predict the total time it would take to complete that interaction. There has been extensive work in creating such additive models for 2D interfaces, but this approach has rarely been explored for 3D user interfaces. We propose a KLM-style additive model, based on existing atomic task models in the literature, to predict task completion time for 3D interaction tasks. We performed two studies to evaluate the feasibility of this approach across multiple input modalities, with one study using a simple menu selection task and the other a more complex manipulation task. We found that several of the models from the literature predicted actual task performance with less than 20% error in both the menu selection and manipulation study. Overall, we found that additive models can predict both absolute and relative performance of input modalities with reasonable accuracy.
Authors:Yinuo Yang, Ashley Ge Zhang, Steve Oney, April Yi Wang
Abstract:
Monitoring in-class programming exercises can help instructors identify struggling students and common challenges. However, understanding students' progress can be prohibitively difficult, particularly for multi-faceted problems that include multiple steps with complex interdependencies, have no predictable completion order, or involve evaluation criteria that are difficult to summarize across many students (e.g., exercises building interactive web-based user interfaces). We introduce SPARK, a coding exercise monitoring dashboard designed to address these challenges. SPARK allows instructors to flexibly group substeps into checkpoints based on exercise requirements, suggests automated tests for these checkpoints, and generates visualizations to track progress across steps. SPARK also allows instructors to inspect intermediate outputs, providing deeper insights into solution variations. We also construct a dataset of 40-minute keystroke coding data from N=22 learners solving two web programming exercises and provide empirical insights into the perceived usefulness of SPARK through a within-subjects evaluation with 16 programming instructors.
Authors:Deeksha M. Shama, Dimitra Emmanouilidou, Ivan J. Tashev
Abstract:
Accurately monitoring cognitive load in real time is critical for Brain-Computer Interfaces (BCIs) that adapt to user engagement and support personalized learning. Electroencephalography (EEG) offers a non-invasive, cost-effective modality for capturing neural activity, though traditional methods often struggle with cross-subject variability and task-specific preprocessing. We propose leveraging Brain Foundation Models (BFMs), large pre-trained neural networks, to extract generalizable EEG features for cognitive load estimation. We adapt BFMs for long-term EEG monitoring and show that fine-tuning a small subset of layers yields improved accuracy over the state-of-the-art. Despite their scale, BFMs allow for real-time inference with a longer context window. To address often-overlooked interpretability challenges, we apply Partition SHAP (SHapley Additive exPlanations) to quantify feature importance. Our findings reveal consistent emphasis on prefrontal regions linked to cognitive control, while longitudinal trends suggest learning progression. These results position BFMs as efficient and interpretable tools for continuous cognitive load monitoring in real-world BCIs.
Authors:Elham Aghakhani, Rezvaneh Rezapour
Abstract:
Large language models (LLMs) are increasingly used for emotional support and mental health-related interactions outside clinical settings, yet little is known about how people evaluate and relate to these systems in everyday use. We analyze 5,126 Reddit posts from 47 mental health communities describing experiential or exploratory use of AI for emotional support or therapy. Grounded in the Technology Acceptance Model and therapeutic alliance theory, we develop a theory-informed annotation framework and apply a hybrid LLM-human pipeline to analyze evaluative language, adoption-related attitudes, and relational alignment at scale. Our results show that engagement is shaped primarily by narrated outcomes, trust, and response quality, rather than emotional bond alone. Positive sentiment is most strongly associated with task and goal alignment, while companionship-oriented use more often involves misaligned alliances and reported risks such as dependence and symptom escalation. Overall, this work demonstrates how theory-grounded constructs can be operationalized in large-scale discourse analysis and highlights the importance of studying how users interpret language technologies in sensitive, real-world contexts.
Authors:Ashley Ge Zhang, Yan-Ru Jhou, Yinuo Yang, Shamita Rao, Maryam Arab, Yan Chen, Steve Oney
Abstract:
Programming instructors have diverse philosophies about integrating generative AI into their classes. Some encourage students to use AI, while others restrict or forbid it. Regardless of their approach, all instructors benefit from understanding how their students actually use AI while writing code. Such insight helps instructors assess whether AI use aligns with their pedagogical goals, enables timely intervention when they find unproductive usage patterns, and establishes effective policies for AI use. However, our survey with programming instructors found that many instructors lack visibility into how students use AI in their code-writing processes. To address this challenge, we introduce Editrail, an interactive system that enables instructors to track students' AI usage, create personalized assessments, and provide timely interventions, all within the workflow of monitoring coding histories. We found that Editrail enables instructors to detect AI use that conflicts with pedagogical goals accurately and to determine when and which students require intervention.
Authors:Luisa Jansen, Tim Ulmann, Robine Jordi, Malte Elson
Abstract:
Recently, the data protection practices of researchers in human-computer interaction and elsewhere have gained attention. Initial results suggest that researchers struggle with anonymization, partly due to a lack of clear, actionable guidance. In this work, we propose simulating re-identification attacks using the approach of red teaming versus blue teaming: a technique commonly employed in security testing, where one team tries to re-identify data, and the other team tries to prevent it. We discuss our experience applying this method to data collected in a mixed-methods study in human-centered privacy. We present usable materials for researchers to apply red teaming when anonymizing and publishing their studies' data.
Authors:Mrinank Sharma, Miles McCain, Raymond Douglas, David Duvenaud
Abstract:
Although AI assistants are now deeply embedded in society, there has been limited empirical study of how their usage affects human empowerment. We present the first large-scale empirical analysis of disempowerment patterns in real-world AI assistant interactions, analyzing 1.5 million consumer Claude$.$ai conversations using a privacy-preserving approach. We focus on situational disempowerment potential, which occurs when AI assistant interactions risk leading users to form distorted perceptions of reality, make inauthentic value judgments, or act in ways misaligned with their values. Quantitatively, we find that severe forms of disempowerment potential occur in fewer than one in a thousand conversations, though rates are substantially higher in personal domains like relationships and lifestyle. Qualitatively, we uncover several concerning patterns, such as validation of persecution narratives and grandiose identities with emphatic sycophantic language, definitive moral judgments about third parties, and complete scripting of value-laden personal communications that users appear to implement verbatim. Analysis of historical trends reveals an increase in the prevalence of disempowerment potential over time. We also find that interactions with greater disempowerment potential receive higher user approval ratings, possibly suggesting a tension between short-term user preferences and long-term human empowerment. Our findings highlight the need for AI systems designed to robustly support human autonomy and flourishing.
Authors:Yongsu Ahn, Lejun R Liao, Benjamin Bach, Nam Wook Kim
Abstract:
Design feedback helps practitioners improve their artifacts while also fostering reflection and design reasoning. Large Language Models (LLMs) such as ChatGPT can support design work, but often provide generic, one-off suggestions that limit reflective engagement. We investigate how to guide LLMs to act as design mentors by applying the Cognitive Apprenticeship Model, which emphasizes demonstrating reasoning through six methods: modeling, coaching, scaffolding, articulation, reflection, and exploration. We operationalize these instructional methods through structured prompting and evaluate them in a within-subjects study with data visualization practitioners. Participants interacted with both a baseline LLM and an instructional LLM designed with cognitive apprenticeship prompts. Surveys, interviews, and conversational log analyses compared experiences across conditions. Our findings show that cognitively informed prompts elicit deeper design reasoning and more reflective feedback exchanges, though the baseline is sometimes preferred depending on task types or experience levels. We distill design considerations for AI-assisted feedback systems that foster reflective practice.
Authors:DaeHo Lee, Ryo Suzuki, Jin-Hyuk Hong
Abstract:
We explore how humanoid robots can be repurposed as haptic media, extending beyond their conventional role as social, assistive, collaborative agents. To illustrate this approach, we implemented HumanoidTurk, taking a first step toward a humanoid-based haptic system that translates in-game g-force signals into synchronized motion feedback in VR driving. A pilot study involving six participants compared two synthesis methods, leading us to adopt a filter-based approach for smoother and more realistic feedback. A subsequent study with sixteen participants evaluated four conditions: no-feedback, controller, humanoid+controller, and human+controller. Results showed that humanoid feedback enhanced immersion, realism, and enjoyment, while introducing moderate costs in terms of comfort and simulation sickness. Interviews further highlighted the robot's consistency and predictability in contrast to the adaptability of human feedback. From these findings, we identify fidelity, adaptability, and versatility as emerging themes, positioning humanoids as a distinct haptic modality for immersive VR.
Authors:Francesco Chiossi, Elnur Imamaliyev, Martin Bleichner, Sven Mayer
Abstract:
Mixed Reality (MR) interfaces increasingly rely on gaze for interaction , yet distinguishing visual attention from intentional action remains difficult, leading to the Midas Touch problem. Existing solutions require explicit confirmations, while brain-computer interfaces may provide an implicit marker of intention using Stimulus-Preceding Negativity (SPN). We investigated how Intention (Select vs. Observe) and Feedback (With vs. Without) modulate SPN during gaze-based MR interactions. During realistic selection tasks, we acquired EEG and eye-tracking data from 28 participants. SPN was robustly elicited and sensitive to both factors: observation without feedback produced the strongest amplitudes, while intention to select and expectation of feedback reduced activity, suggesting SPN reflects anticipatory uncertainty rather than motor preparation. Complementary decoding with deep learning models achieved reliable person-dependent classification of user intention, with accuracies ranging from 75% to 97% across participants. These findings identify SPN as an implicit marker for building intention-aware MR interfaces that mitigate the Midas Touch.
Authors:Hyun-Gee Jei, Mustafa Demir, Farzan Sasangohar
Abstract:
Supervisors in military command and control (C2) environments face dynamic conditions. Dynamically changing information continuously flows to the supervisors through multiple displays. In this environment, important pieces of information can be overlooked due to the complexity of tasks and environments. This study examined the efficacy of an eye-tracker-based adaptive attention-guided decision support tool (DST) for supervisors in a simulated C2 environment. The DST monitors supervisors' visual attention allocation in real time and displays visually salient cues if critical changes or events are missed. Twenty-five military students participated in a simulated intelligence task. Results indicated significant performance enhancement when the adaptive DST was present. Eye-tracking analysis also showed that longer, more frequent fixations on critical areas of interest were negatively correlated with performance. Additionally, post-experiment interviews revealed that the adaptive DST was unobtrusive and positively received. These findings underscore the potential of real-time gaze-based interventions to optimize supervisory decision-making. Future research could incorporate AI-driven approaches to better support supervisors in complex task environments.
Authors:Anne Arzberger, Celine Offerman, Ujwal Gadiraju, Alessandro Bozzon, Jie Yang
Abstract:
AI alignment relies on annotator judgments, yet annotation pipelines often treat annotators as interchangeable, obscuring how their social position shapes annotation. We introduce reflexive annotating as a probe that invites crowd workers to reflect on how their positionality informs subjective annotation judgments in a language model alignment context. Through a qualitative study with crowd workers (N=30) and follow-up interviews (N=5), we examine how our probe shapes annotators' behaviour, experience, and the situated metadata it elicits. We find that reflexive annotating captures epistemic metadata beyond static demographics by eliciting intersectional reasoning, surfacing positional humility, and nudging viewpoint change. Crucially, we also denote tensions between reflexive engagement and affective demands such as emotional exposure. We discuss the implications of our work for richer value elicitation and alignment practices that treat annotator judgments as situated and selectively integrate positional metadata.
Authors:Sarmistha Sarna Gomasta, Mahmood Jasim, Hossein Hadisi, Yvonne Jansen, Pierre Dragicevic, Narges Mahyar, Ali Sarvghad
Abstract:
Data videos have become a prominent vessel for communicating data to broad audiences, and a common object of study in information visualization. Many of these videos include music, yet the impact of music on how people experience data videos remains largely unexplored. We conducted a preregistered study into the effect of music across three dimensions: persuasion, engagement, and emotion. We showed online participants an existing data video (1) without any music, (2) with its generic default music, and (3) with custom music designed by a professional composer. We found that the default music helped make the data video more persuasive. However, the effects of custom music were more mixed, and we did not find that music increased engagement. In addition, and contrary to our expectations, our participants reported more intense emotions without music. Our study contributes new insights into the intersection of music and data visualization and is a first step toward guiding designers in creating impactful data-driven narratives.
Authors:Christina Garcia, Nhat Tan Le, Taihei Fujioka, Umang Dobhal, Milyun Ni'ma Shoumi, Thanh Nha Nguyen, Sozo Inoue
Abstract:
This paper presents an overview of the Recognize the Unseen: Unusual Behavior Recognition from Pose Data Challenge, hosted at ISAS 2025. The challenge aims to address the critical need for automated recognition of unusual behaviors in facilities for individuals with developmental disabilities using non-invasive pose estimation data. Participating teams were tasked with distinguishing between normal and unusual activities based on skeleton keypoints extracted from video recordings of simulated scenarios. The dataset reflects real-world imbalance and temporal irregularities in behavior, and the evaluation adopted a Leave-One-Subject-Out (LOSO) strategy to ensure subject-agnostic generalization. The challenge attracted broad participation from 40 teams applying diverse approaches ranging from classical machine learning to deep learning architectures. Submissions were assessed primarily using macro-averaged F1 scores to account for class imbalance. The results highlight the difficulty of modeling rare, abrupt actions in noisy, low-dimensional data, and emphasize the importance of capturing both temporal and contextual nuances in behavior modeling. Insights from this challenge may contribute to future developments in socially responsible AI applications for healthcare and behavior monitoring.
Authors:Hyerim Park, Khanh Huynh, Malin Eiband, Jeremy Dillmann, Sven Mayer, Michael Sedlmair
Abstract:
Generative AI (GenAI) systems are inherently non-deterministic, producing varied outputs even for identical inputs. While this variability is central to their appeal, it challenges established HCI evaluation practices that typically assume consistent and predictable system behavior. Designing controlled lab studies under such conditions therefore remains a key methodological challenge. We present a reflective multi-case analysis of four lab-based user studies with GenAI-integrated prototypes, spanning conversational in-car assistant systems and image generation tools for design workflows. Through cross-case reflection and thematic analysis across all study phases, we identify five methodological challenges and propose eighteen practice-oriented recommendations, organized into five guidelines. These challenges represent methodological constructs that are either amplified, redefined, or newly introduced by GenAI's stochastic nature: (C1) reliance on familiar interaction patterns, (C2) fidelity-control trade-offs, (C3) feedback and trust, (C4) gaps in usability evaluation, and (C5) interpretive ambiguity between interface and system issues. Our guidelines address these challenges through strategies such as reframing onboarding to help participants manage unpredictability, extending evaluation with constructs such as trust and intent alignment, and logging system events, including hallucinations and latency, to support transparent analysis. This work contributes (1) a methodological reflection on how GenAI's stochastic nature unsettles lab-based HCI evaluation and (2) eighteen recommendations that help researchers design more transparent, robust, and comparable studies of GenAI systems in controlled settings.
Authors:Anne Arzberger, Enrico Liscio, Maria Luce Lupetti, Inigo Martinez de Rituerto de Troya, Jie Yang
Abstract:
As AI systems become embedded in everyday practice, value misalignment has emerged as a pressing concern. Yet, dominant alignment approaches remain model centric, treating users as passive recipients of prespecified values rather than as epistemic agents who encounter and respond to misalignment during interactions. Drawing on situated perspectives, we frame alignment as an interactional practice co-constructed during human AI interaction. We investigate how users understand and wish to contribute to this process through a participatory workshop that combines misalignment diaries with generative design activities. We surface how misalignments materialise in practice and how users envision acting on them, grounded in the context of researchers using Large Language Models as research assistants. Our findings show that misalignments are experienced less as abstract ethical violations than as unexpected responses, and task or social breakdowns. Participants articulated roles ranging from adjusting and interpreting model behaviour to deliberate non-engagement as an alignment strategy. We conclude with implications for designing systems that support alignment as an ongoing, situated, and shared practice.
Authors:Jana Franceska Funke, Mario Sagawa, Georgious Nurcan-Georgiou, Naomi Sagawa, Dennis Dietz, Evgeny Stemasov, Enrico Rukzio, Teresa Hirzle
Abstract:
We present a novel system for camera-based measurement and visualization of muscle work based on the Hill-Type-Muscle-Model: the exercise exertion muscle-work monitor (\textit{XEM}$^{2}$). Our aim is to complement and, thus, address issues of established measurement techniques that offer imprecise data for non-uniform movements (burned calories) or provide limited information on strain across different body parts (self-perception scales). We validate the reliability of XEM's measurements through a technical evaluation of ten participants and five exercises. Further, we assess the acceptance, usefulness, benefits, and opportunities of \textit{XEM}$^{2}$ in an empirical user study. Our results show that \textit{XEM}$^{2}$ provides reliable values of muscle work and supports participants in understanding their workout while also providing reliable information about perceived exertion per muscle group. With this paper, we introduce a novel system capable of measuring and visualizing exertion for single muscle groups, which has the potential to improve exercise monitoring to prevent unbalanced workouts.
Authors:Björn R. Severitt, Yannick Sauer, Nora Castner, Siegfried Wahl
Abstract:
Gaze-based interaction enables intuitive, hands-free control in immersive environments, but remains susceptible to unintended inputs. We present a real-time error prevention system (EPS) that uses a temporal convolutional network autoencoder (TCNAE) to detect anomalies in gaze dynamics during selection tasks. In a visual search task in VR, 41 participants used three gaze-based methods - dwell time, gaze and head direction alignment, and nod - with and without EPS. The system reduced erroneous selections by up to 95% for dwell time and gaze and head, and was positively received by most users. Performance varied for nodding and between individuals, suggesting the need for adaptive systems. Objective metrics and subjective evaluations show that anomaly-based error prevention can improve gaze interfaces without disrupting interaction. These findings demonstrate the potential of anomaly-based error prevention for gaze interfaces and suggest applications in VR, AR, and assistive technologies.
Authors:Jiangen He, Jiqun Liu
Abstract:
Conversational AI systems increasingly function as primary interfaces for information seeking, yet how they present sources to support information evaluation remains under-explored. This paper investigates how source transparency design shapes interactive information seeking, trust, and critical engagement. We conducted a controlled between-subjects experiment (N=372) comparing four source presentation interfaces - Collapsible, Hover Card, Footer, and Aligned Sidebar - varying in visibility and accessibility. Using fine-grained behavioral analysis and automated critical thinking assessment, we found that interface design fundamentally alters exploration strategies and evidence integration. While the Hover Card interface facilitated seamless, on-demand verification during the task, the Aligned Sidebar uniquely mitigated the negative effects of information overload: as citation density increased, Sidebar users demonstrated significantly higher critical thinking and synthesis scores compared to other conditions. Our results highlight a trade-off between designs that support workflow fluency and those that enforce reflective verification, offering practical implications for designing adaptive and responsible conversational AI that fosters critical engagement with AI generated content.
Authors:DongHoon Kim, Isaac Cho
Abstract:
Visual perception plays a critical role in detecting changes within immersive Virtual Reality (VR) environments. However, as visual complexity increases, perceptual performance declines, making it more difficult to detect changes quickly and accurately. This study examines how visual features, known for facilitating preattentive processing, impact a change detection task in immersive 3D environments, with a focus on visual complexity, object attributes, and spatial proximity. Our results demonstrate that preattentive processing enhances change detection, particularly when the altered object is spatially isolated and not perceptually grouped with similar surrounding objects. Changes to isolated objects were detected more reliably, suggesting that perceptual isolation reduces cognitive load and draws more attention. Conversely, when a changed object was surrounded by visually similar elements, participants were less likely to detect the change, indicating that perceptual grouping hinders individual object recognition in complex scenes. These results provide guidelines for designing VR applications that strategically utilize spatial isolation and visual features to improve the user experience.
Authors:Xian Li, Yuanning Han, Di Liu, Pengcheng An, Shuo Niu
Abstract:
User-created chatbots powered by generative AI offer new ways to share and interact with Not-Safe-For-Work (NSFW) content. However, little is known about the characteristics of these GenAI-based chatbots and their user interactions. Drawing on the functional theory of NSFW on social media, this study analyzes 376 NSFW chatbots and 307 public conversation sessions on FlowGPT. Findings identify four chatbot types: roleplay characters, story generators, image generators, and do-anything-now bots. AI Characters portraying fantasy personas and enabling hangout-style interactions are most common, often using explicit avatar images to invite engagement. Sexual, violent, and insulting content appears in both user prompts and chatbot outputs, with some chatbots generating explicit material even when users do not create erotic prompts. In sum, the NSFW experience on FlowGPT can be understood as a combination of virtual intimacy, sexual delusion, violent thought expression, and unsafe content acquisition. We conclude with implications for chatbot design, creator support, user safety, and content moderation.
Authors:Caitlin Morris, Pattie Maes
Abstract:
As AI increasingly enters the classroom, what changes when students collaborate with algorithms instead of peers? We analyzed 36 undergraduate students learning graph theory through peer collaboration (n=24) or AI assistance (n=12), using discourse analysis to identify interaction patterns shaping learning outcomes. Results reveal a collaboration quality divide: high-quality peer interactions generated curiosity and engagement that AI couldn't match, yet low-quality peer interactions performed worse than AI across dimensions. AI showed a paradoxical pattern, building confidence in knowledge while reducing curiosity and deeper engagement. Interaction quality emerged from dynamic patterns rather than individual traits, with early discourse markers predicting outcomes. Students treated AI as a transactional information source despite its collaborative design, revealing fundamental differences in human versus algorithmic engagement. Our findings suggest AI in education need not replace peer learning but can recognize struggle and support both peer and AI interactions toward productive learning experiences.
Authors:Alexander Htet Kyaw, Haotian Ma, Sasa Zivkovic, Jenny Sabin
Abstract:
Recent advances in augmented reality (AR) have enabled interactive systems that assist users in physical assembly tasks. In this paper, we present an AR-assisted assembly workflow that leverages object recognition and hand tracking to (1) identify custom components, (2) display step-by-step instructions, (3) detect assembly deviations, and (4) dynamically update the instructions based on users' hands-on interactions with physical parts. Using object recognition, the system detects and localizes components in real time to create a digital twin of the workspace. For each assembly step, it overlays bounding boxes in AR to indicate both the current position and the target placement of relevant components, while hand-tracking data verifies whether the user interacts with the correct part. Rather than enforcing a fixed sequence, the system highlights potential assembly errors and interprets user deviations as opportunities for iteration and creative exploration. A case study with LEGO blocks and custom 3D-printed components demonstrates how the system links digital instructions to physical assembly, eliminating the need for manual searching, sorting, or labeling of parts.
Authors:Markus Bink, Marten Risius, Udo Kruschwitz, David Elsweiler
Abstract:
Many users struggle with effective online search and critical evaluation, especially in high-stakes domains like health, while often overestimating their digital literacy. Thus, in this demo, we present an interactive search companion that seamlessly integrates expert search strategies into existing search engine result pages. Providing context-aware tips on clarifying information needs, improving query formulation, encouraging result exploration, and mitigating biases, our companion aims to foster reflective search behaviour while minimising cognitive burden. A user study demonstrates the companion's successful encouragement of more active and exploratory search, leading users to submit 75 % more queries and view roughly twice as many results, as well as performance gains in difficult tasks. This demo illustrates how lightweight, contextual guidance can enhance search literacy and empower users through micro-learning opportunities. While the vision involves real-time LLM adaptivity, this study utilises a controlled implementation to test the underlying intervention strategies.
Authors:Markus Bink, Marten Risius, Udo Kruschwitz, David Elsweiler
Abstract:
Generative AI (GenAI) tools are transforming information seeking, but their fluent, authoritative responses risk overreliance and discourage independent verification and reasoning. Rather than replacing the cognitive work of users, GenAI systems should be designed to support and scaffold it. Therefore, this paper introduces an LLM-based conversational copilot designed to scaffold information evaluation rather than provide answers and foster digital literacy skills. In a pre-registered, randomised controlled trial (N=261) examining three interface conditions including a chat-based copilot, our mixed-methods analysis reveals that users engaged deeply with the copilot, demonstrating metacognitive reflection. However, the copilot did not significantly improve answer correctness or search engagement, largely due to a "time-on-chat vs. exploration" trade-off and users' bias toward positive information. Qualitative findings reveal tension between the copilot's Socratic approach and users' desire for efficiency. These results highlight both the promise and pitfalls of pedagogical copilots, and we outline design pathways to reconcile literacy goals with efficiency demands.
Authors:Yijin Zhou, Fu Li, Yi Niu, Boxun Fu, Huaning Wang, Lijian Zhang
Abstract:
Understanding how local neurophysiological patterns interact with global brain dynamics is essential for decoding human emotions from EEG signals. However, existing deep learning approaches often overlook the brain's intrinsic spatial organization, failing to simultaneously capture local topological relations and global dependencies. To address these challenges, we propose Neuro-HGLN, a Neurologically-informed Hierarchical Graph-Transformer Learning Network that integrates biologically grounded priors with hierarchical representation learning. Neuro-HGLN first constructs a spatial Euclidean prior graph based on physical electrode distances to serve as an anatomically grounded inductive bias. A learnable global dynamic graph is then introduced to model functional connectivity across the entire brain. In parallel, to capture fine-grained regional dependencies, Neuro-HGLN builds region-level local graphs using a multi-head self-attention mechanism. These graphs are processed synchronously through local-constrained parallel GCN layers to produce region-specific representations. Subsequently, an iTransformer encoder aggregates these features to capture cross-region dependencies under a dimension-as-token formulation. Extensive experiments demonstrate that Neuro-HGLN achieves state-of-the-art performance on multiple benchmarks, providing enhanced interpretability grounded in neurophysiological structure. These results highlight the efficacy of unifying local topological learning with cross-region dependency modeling for robust EEG emotion recognition.
Authors:Hasti Sharifi, Homaira Huda Shomee, Sourav Medya, Debaleena Chattopadhyay
Abstract:
While high-quality technology support can assist older adults in using digital applications, many struggle to articulate their issues due to unfamiliarity with technical terminology and age-related cognitive changes. This study examines these communication challenges and explores AI-based approaches to mitigate them. We conducted a diary study with English-speaking, community-dwelling older adults to collect asynchronous, technology-related queries and used reflexive thematic analysis to identify communication barriers. To address these barriers, we evaluated how foundation models can paraphrase older adults' queries to improve solution accuracy. Two controlled experiments followed: one with younger adults evaluating AI-rephrased queries and another with older adults evaluating AI-generated solutions. We also developed a pipeline using large language models to generate the first synthetic dataset of how older adults request tech support (OATS). We identified four key communication challenges: verbosity, incompleteness, over-specification, and under-specification. Our prompt-chaining approach using the large language model, GPT-4o, elicited contextual details, paraphrased the original query, and generated a solution. AI-rephrased queries significantly improved solution accuracy (69% vs. 46%) and Google search results (69% vs. 35%). Younger adults better understood AI-rephrased queries (93.7% vs. 65.8%) and reported greater confidence and ease. Older adults reported high perceived ability to answer contextual questions (89.8%) and follow solutions (94.7%), with high confidence and ease. OATS demonstrated strong fidelity and face validity. This work shows how foundation models can enhance technology support for older adults by addressing age-related communication barriers. The OATS dataset offers a scalable resource for developing equitable AI systems that better serve aging populations.
Authors:Bhaskar Mitra, Nicola Neophytou, Sireesh Gururaja
Abstract:
Online information access (IA) platforms are targets of authoritarian capture. These concerns are particularly serious and urgent today in light of the rising levels of democratic erosion worldwide, the emerging capabilities of generative AI technologies such as AI persuasion, and the increasing concentration of economic and political power in the hands of Big Tech. This raises the question of what alternative IA infrastructure we must reimagine and build to mitigate the risks of authoritarian capture of our information ecosystems. We explore this question through the lens of Paulo Freire's theories of emancipatory pedagogy. Freire's theories provide a radically different lens for exploring IA's sociotechnical concerns relative to the current dominating frames of fairness, accountability, confidentiality, transparency, and safety. We make explicit, with the intention to challenge, the dichotomy of how we relate to technology as either technologists (who envision and build technology) and its users. We posit that this mirrors the teacher-student relationship in Freire's analysis. By extending Freire's analysis to IA, we challenge the notion that it is the burden of the (altruistic) technologists to come up with interventions to mitigate the risks that emerging technologies pose to marginalized communities. Instead, we advocate that the first task for the technologists is to pose these as problems to the marginalized communities, to encourage them to make and unmake the technology as part of their material struggle against oppression. Their second task is to redesign our online technology stacks to structurally expose spaces for community members to co-opt and co-construct the technology in aid of their emancipatory struggles. We operationalize Freire's theories to develop a problem-posing framework for envisioning emancipatory IA platforms of the future.
Authors:Jun-Peng Zhu, Boyan Niu, Peng Cai, Zheming Ni, Kai Xu, Jiajun Huang, Shengbo Ma, Bing Wang, Xuan Zhou, Guanglei Bao, Donghui Zhang, Liu Tang, Qi Liu
Abstract:
The SQL-based exploratory data analysis has garnered significant attention within the data analysis community. The emergence of large language models (LLMs) has facilitated the paradigm shift from manual to automated data exploration. However, existing methods generally lack the ability for cross-domain analysis, and the exploration of LLMs capabilities remains insufficient. This paper presents TiInsight, an SQL-based automated cross-domain exploratory data analysis system. First, TiInsight offers a user-friendly GUI enabling users to explore data using natural language queries. Second, TiInsight offers a robust cross-domain exploratory data analysis pipeline: hierarchical data context (i.e., HDC) generation, question clarification and decomposition, text-to-SQL (i.e., TiSQL), and data visualization (i.e., TiChart). Third, we have implemented and deployed TiInsight in the production environment of PingCAP and demonstrated its capabilities using representative datasets. The demo video is available at https://youtu.be/JzYFyYd-emI.
Authors:Yilan Jiang, Cindy Xiong Bearfield, Steven Franconeri, Eugene Wu
Abstract:
Making sense of a visualization requires the reader to consider both the visualization design and the underlying data values. Existing work in the visualization community has largely considered affordances driven by visualization design elements, such as color or chart type, but how visual design interacts with data values to impact interpretation and reasoning has remained under-explored. Dot plots and bar graphs are commonly used to help users identify groups of points that form trends and clusters, but are liable to manifest groupings that are artifacts of spatial arrangement rather than inherent patterns in the data itself. These ``Data-induced Groups'' can drive suboptimal data comparisons and potentially lead the user to incorrect conclusions. We conduct two user studies using dot plots as a case study to understand the prevalence of data-induced groupings. We find that users rely on data-induced groupings in both conditions despite the fact that trend-based groupings are irrelevant in nominal data. Based on the study results, we build a model to predict whether users are likely to perceive a given set of dot plot points as a group. We discuss two use cases illustrating how the model can assist visualization designers by both diagnosing potential user-perceived groupings in dot plots and offering redesigns that better accentuate desired groupings through data rearrangement.
Authors:Shangqian Li, Tianwa Chen, Gianluca Demartini
Abstract:
Modelling users' online decision-making and opinion change is a complex issue that needs to consider users' personal determinants, the nature of the topic and the information retrieval activities. Furthermore, generative-AIbased products like ChatGPT gradually become an essential element for the retrieval of online information. However, the interaction between domainspecific knowledge and AI-generated content during online decision-making is unclear. We conducted a lab-based explanatory sequential study with university students to overcome this research gap. In the experiment, we surveyed participants about a set of general domain topics that are easy to grasp and another set of domain-specific topics that require adequate levels of chemical science knowledge to fully comprehend. We provided participants with decision-supporting information that was either produced using generative AI or collected from selected expert human-written sources to explore the role of AI-generated content compared to ordinary information during decision-making. Our result revealed that participants are less likely to change opinions on domain-specific topics. Since participants without professional knowledge had difficulty performing in-depth and independent reasoning based on the information, they favoured relying on conclusions presented in the provided materials and tended to stick to their initial opinion. Besides, information that is labelled as AI-generated is equivalently helpful as information labelled as dedicatedly human-written for participants in this experiment, indicating the vast potential as well as concerns for AI replacing human experts to help users tackle professional topics or issues.
Authors:Charles Javerliat, Guillaume Lavoué
Abstract:
Extended reality is a fast-growing domain for which there is an increasing need to analyze and understand user behavior. In particular, understanding human visual attention during immersive experiences is crucial for many applications. The visualization and analysis of visual attention are commonly done by building fixation density maps from eye-tracking data. Such visual attention mapping is well mastered for 3 degrees of freedom (3DoF) experiences (\textit{i.e.}, involving 360 images or videos) but much less so for 6DoFs data, when the user can move freely in the 3D space. In that case, the visual attention information has to be mapped onto the 3D objects themselves. Some solutions exist for constructing such surface-based 6DoFs attention maps, however, they own several drawbacks: processing time, strong dependence on mesh resolution and/or texture mapping, and/or unpractical data representation for further processing. In this context, we propose a novel GPU-based algorithm that resolves the issues above while being generated in interactive time and rendered in real-time. Experiment on a challenging scene demonstrates the accuracy and robustness of our approach. To stimulate research in this area, the source code is publicly released and integrated into PLUME for ease of use in XR experiments.
Authors:Raj Mahmud, Shlomo Berkovsky, Mukesh Prasad, A. Baki Kocaballi
Abstract:
While Conversational Recommender Systems (CRS) have matured technically, they frequently lack principled methods for encoding latent experiential aims as adaptive state variables. Consequently, contemporary architectures often prioritise ranking accuracy at the expense of nuanced, context-sensitive interaction behaviours. This paper addresses this gap through a comprehensive multi-domain study ($N = 168$) that quantifies the joint prioritisation of three critical interaction aims: educative (to inform and justify), explorative (to diversify and inspire), and affective (to align emotionally and socially). Utilising Bayesian hierarchical ordinal regression, we establish domain profiles and perceived item value as systematic modulators of these priorities. Furthermore, we identify stable user-level preferences for autonomy that persist across distinct interactional goals, suggesting that agency is a fundamental requirement of the conversational experience. Drawing on these empirical foundations, we formalise the Recommendation-as-Experience (RAE) adaptation framework. RAE systematically encodes contextual and individual signals into structured state representations, mapping them to experience-aligned dialogue policies realised through retrieval diversification, heuristic logic, or Large Language Model based controllable generation. As an architecture-agnostic blueprint, RAE facilitates the design of context-sensitive CRS that effectively balance experiential quality with predictive performance.
Authors:Hagit Ben Shoshan, Joel Lanir, Pavel Goldstein, Osnat Mokryn
Abstract:
Interactive systems that explain data, or support decision making often emphasize what is present while overlooking what is expected but missing. This presence bias limits users' ability to form complete mental models of a dataset or situation. Detecting absence depends on expectations about what should be there, yet interfaces rarely help users form such expectations. We present an experimental study examining how reference framing and prompting influence people's ability to recognize expected but missing categories in datasets. Participants compared distributions across three domains (energy, wealth, and regime) under two reference conditions: Global, presenting a unified population baseline, and Partial, showing several concrete exemplars. Results indicate that absence detection was higher with Partial reference than with Global reference, suggesting that partial, samples-based framing can support expectation formation and absence detection. When participants were prompted to look for what was missing, absence detection rose sharply. We discuss implications for interactive user interfaces and expectation-based visualization design, while considering cognitive trade-offs of reference structures and guided attention.
Authors:Eran Fainman, Hagit Ben Shoshan, Adir Solomon, Osnat Mokryn
Abstract:
Intelligent interfaces increasingly use large language models to summarize user-generated content, yet these summaries emphasize what is mentioned while overlooking what is missing. This presence bias can mislead users who rely on summaries to make decisions. We present Domain Informed Summarization through Contrast (DiSCo), an expectation-based computational approach that makes absences visible by comparing each entity's content with domain topical expectations captured in reference distributions of aspects typically discussed in comparable accommodations. This comparison identifies aspects that are either unusually emphasized or missing relative to domain norms and integrates them into the generated text. In a user study across three accommodation domains, namely ski, beach, and city center, DiSCo summaries were rated as more detailed and useful for decision making than baseline large language model summaries, although slightly harder to read. The findings show that modeling expectations reduces presence bias and improves both transparency and decision support in intelligent summarization interfaces.
Authors:John Paul P. Miranda, Jaymark A. Yambao
Abstract:
This study explores the novice programmers' intention to use chat generative pretrained transformer (ChatGPT) for programming tasks with emphasis on performance expectancy (PE), risk-reward appraisal (RRA), and decision-making (DM). Utilizing partial least squares structural equation modeling (PLS-SEM) and a sample of 413 novice programmers, the analysis demonstrates that higher PE of ChatGPT is positively correlated with improved DM in programming tasks. Novice programmers view ChatGPT as a tool that enhances their learning and skill development. Additionally, novice programmers that have a favorable RRA of ChatGPT tend to make more confident and effective decisions, acknowledging potential risks but recognizing that benefits such as quick problem-solving and learning new techniques outweigh these risks. Moreover, a positive perception of ChatGPT's role in DM significantly increases the inclination to use the tool for programming tasks. These results highlight the critical roles of perceived capabilities, risk assessment, and positive DM experiences in promoting the adoption of artificial intelligence (AI) tools in programming education.
Authors:Hongliang Lu, Yunmeng Liu, Junjie Yang
Abstract:
Human decision-making heavily relies on active sensing, a well-documented cognitive behaviour for evidence gathering to accommodate ever-changing environments. However, its operational mechanism in the real world remains non-trivial. Currently, an in-laboratory paradigm, called evidence accumulation modelling (EAM), points out that human decision-making involves transforming external evidence into internal mental beliefs. However, the gap in evidence affordance between real-world contexts and laboratory settings hinders the effective application of EAM. Here we generalize EAM to the real world and conduct analysis in real-world driving scenarios. A cognitive scheme is proposed to formalize real-world evidence affordance and capture active sensing through eye movements. Empirically, our scheme can plausibly portray the accumulation of drivers' mental beliefs, explaining how active sensing transforms evidence into mental beliefs from the perspective of information utility. Also, our results demonstrate a negative correlation between evidence affordance and attention recruited by individuals, revealing how human drivers adapt their evidence-collection patterns across various contexts. Moreover, we reveal the positive influence of evidence affordance and attention distribution on decision-making propensity. In a nutshell, our computational scheme generalizes EAM to real-world contexts and provides a comprehensive account of how active sensing underlies real-world decision-making, unveiling multifactorial, integrated characteristics in real-world decision-making.
Authors:Jialin Wang, Xinru Cheng, Boyong Hou, Hai-Ning Liang
Abstract:
Extended reality (XR) is evolving into a general-purpose computing platform, yet its adoption for productivity is hindered by visual fatigue and simulator sickness. While these symptoms are often attributed to latency or motion conflicts, the precise impact of textual clarity on physiological comfort remains undefined. Here we show that sub-optimal effective resolution, the clarity that reaches the eye after the full display-optics-rendering pipeline, is a primary driver of simulator sickness during reading tasks in both virtual reality and video see-through environments. By systematically manipulating end-to-end effective resolution on a unified logMAR scale, we measured reading psychophysics and sickness symptoms in a controlled within-subjects study. We find that reading performance and user comfort degrade exponentially as resolution drops below 0 logMAR (normal visual acuity). Notably, our results reveal 0 logMAR as a key physiological tipping point: resolutions better than this threshold yield naked-eye-level performance with minimal sickness, whereas poorer resolutions trigger rapid, non-linear increases in nausea and oculomotor strain. These findings suggest that the cognitive and perceptual effort required to resolve blurry text directly compromises user comfort, establishing human-eye resolution as a critical baseline for the design of future ergonomic XR systems.
Authors:Jialin Wang, Songming Ping, Kemu Xu, Yue Li, Hai-Ning Liang
Abstract:
Video see-through (VST) technology aims to seamlessly blend virtual and physical worlds by reconstructing reality through cameras. While manufacturers promise perceptual fidelity, it remains unclear how close these systems are to replicating natural human vision across varying environmental conditions. In this work, we quantify the perceptual gap between the human eye and different popular VST headsets (Apple Vision Pro, Meta Quest 3, Quest Pro) using psychophysical measures of visual acuity, contrast sensitivity, and color vision. We show that despite hardware advancements, all tested VST systems fail to match the dynamic range and adaptability of the naked eye. While high-end devices approach human performance in ideal lighting, they exhibit significant degradation in low-light conditions, particularly in contrast sensitivity and acuity. Our results map the physiological limitations of digital reality reconstruction, establishing a specific perceptual gap that defines the roadmap for achieving indistinguishable VST experiences.
Authors:Lauren Olson, Emitzá Guzmán, Florian Kunneman
Abstract:
Despite growing awareness of ethical challenges in software development, practitioners still lack structured tools that help them critically engage with the lived experiences of marginalized users. This paper presents PerspectiveCoach, a large language model (LLM)-powered conversational tool designed to guide developers through structured perspective-taking exercises and deepen critical reflection on how software design decisions affect marginalized communities. Through a controlled study with 18 front-end developers (balanced by sex), who interacted with the tool using a real case of online gender-based harassment, we examine how PerspectiveCoach supports ethical reasoning and engagement with user perspectives. Qualitative analysis revealed increased self-awareness, broadened perspectives, and more nuanced ethical articulation, while a complementary human-human study contextualized these findings. Text similarity analyses demonstrated that participants in the human-PerspectiveCoach study improved the fidelity of their restatements over multiple attempts, capturing both surface-level and semantic aspects of user concerns. However, human-PerspectiveCoach's restatements had a lower baseline than the human-human conversations, highlighting contextual differences in impersonal and interpersonal perspective-taking. Across the study, participants rated the tool highly for usability and relevance. This work contributes an exploratory design for LLM-powered end-user perspective-taking that supports critical, ethical self-reflection and offers empirical insights (i.e., enhancing adaptivity, centering plurality) into how such tools can help practitioners build more inclusive and socially responsive technologies.
Authors:Jiawei Fang, Ruonan Zheng, Xiaoxia Gao, Shifan Jiang, Anjun Chen, Qi Ye, Shihui Guo
Abstract:
Wearable inertial motion capture (MoCap) provides a portable, occlusion-free, and privacy-preserving alternative to camera-based systems, but its accuracy depends on tightly attached sensors - an intrusive and uncomfortable requirement for daily use. Embedding IMUs into loose-fitting garments is a desirable alternative, yet sensor-body displacement introduces severe, structured, and location-dependent corruption that breaks standard inertial pipelines. We propose GID (Garment Inertial Denoiser), a lightweight, plug-and-play Transformer that factorizes loose-wear MoCap into three stages: (i) location-specific denoising, (ii) adaptive cross-wear fusion, and (iii) general pose prediction. GID uses a location-aware expert architecture, where a shared spatio-temporal backbone models global motion while per-IMU expert heads specialize in local garment dynamics, and a lightweight fusion module ensures cross-part consistency. This inductive bias enables stable training and effective learning from limited paired loose-tight IMU data. We also introduce GarMoCap, a combined public and newly collected dataset covering diverse users, motions, and garments. Experiments show that GID enables accurate, real-time denoising from single-user training and generalizes across unseen users, motions, and garment types, consistently improving state-of-the-art inertial MoCap methods when used as a drop-in module.
Authors:Suibi Che-Chuan Weng, Torin Hopkins, Shih-Yu Ma, Amy Banic, Ellen Yi-Luen Do
Abstract:
During musical collaboration, visual cues are essential for communication between musicians. Extended Reality (XR) applications, often used with head-mounted displays like Augmented Reality (AR) glasses, can limit the field of view (FOV) of players. We conducted a study to investigate the effects of limited FOV on co-presence, gesture recognition, overall enjoyment, and reaction time. Initially, we observed experienced musicians collaborating informally with and without visual occlusion, noting that collaboration suffered with limited FOV. We then conducted a within-subjects study with 19 participants, comparing an unrestricted FOV holographic setup called HoloJam to Nreal AR glasses with a 52$^{\circ}$ limited FOV. In the AR setup, we tested two conditions: standard AR with a 52$^{\circ}$ FOV and a modified AR notification system called Mini Musicians. Results showed that HoloJam provided higher co-presence, quicker gesture recognition, and greater enjoyment. The Mini Musicians application reduced reaction time and maintained enjoyment compared to the standard AR setup. We conclude that limited FOV impacts musical collaboration, but notifications can improve reaction time and should be considered in future XR music collaborations.
Authors:Torin Hopkins, Shih-Yu Ma, Suibi Che-Chuan Weng, Ming-Yuan Pai, Ellen Yi-Luen Do, Luca Turchet
Abstract:
Digital Audio Workstations (DAWs) are central to modern music production but often encumber the musician's workflow, tethering them to a desk and hindering natural interaction with their instrument. Furthermore, effective remote collaboration remains a significant challenge, with existing solutions hampered by network latency and asynchronous file sharing. This paper investigates the potential of Mixed Reality (MR) to overcome these barriers, creating an intuitive environment for real-time, remote musical collaboration. We employ qualitative and speculative design techniques to better understand: 1) how players currently use DAWs, and 2) to imagine a speculative future of collaborative MR-DAWs. To facilitate this discussion, we developed and evaluated the usability of a design probe, MR-DAW. An MR system enabling multiple, geographically dispersed users to control a single, shared DAW instance while moving freely in their local spaces. Our networked system enables each remote musician to use a physical foot pedal for collaborative looping, merging a familiar, hands-free interaction with a shared virtual session. Based on interviews and system evaluations with 20 musicians, we analyze current practices, report on the user experience with our MR system, and speculate on the future of musical collaboration in MR. Our results highlight the affordances of MR for unencumbered musical interaction and provide a speculative outlook on the future of remote collaborative DAWs in the Musical Metaverse.
Authors:Zhixue Song, Zhiheng Zhang, Yi Song, Chi Zhang
Abstract:
The OpenClaw platform provides a practical foundation for automation through its skill-oriented architecture, organizing external capabilities into lightweight, reusable components that can be invoked efficiently through a command-line interface (CLI). However, a significant bottleneck remains: many real-world tasks are confined to graphical user interfaces (GUIs) with no stable API available. While LLM-based GUI agents offer generality, their reliance on repeated live model inference makes them too slow, costly, and inconsistent to serve as efficient OpenClaw skills. In this paper, we present AppAgent-Claw, a demonstration-driven system that converts GUI workflows into reliable, reusable skills without runtime inference. By following a ``record-once, replay-many'' paradigm, the system captures rich contextual metadata to facilitate robust execution. It employs a layered localization strategy to handle visual shifts and a validation-coupled execution model to ensure intended on-screen effects. AppAgent-Claw provides a practical, efficient, and diagnosable solution for integrating GUI-bound tasks into the OpenClaw ecosystem.
Authors:Bhargav Shandilya, Matt Buchholz, Alexis Palmer
Abstract:
Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation. Producing it manually, however, is often slow and costly. Automated glossing systems have improved substantially in recent years, but adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used, offering no interpretable path for correction or the incorporation of linguistic expertise back into model behavior. We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. In conjunction with CWoMP, our system treats each correction by an annotator as part of an active learning setting, which expands the lexicon and improves future predictions without having to retrain the model. In this paper, we present our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.
Authors:Arvind Srinivasan, Tobias Rau, Michael Sedlmair
Abstract:
Visual attention is central to ensemble coordination, yet how musicians allocate gaze during naturalistic rehearsal remains poorly understood. We present a pilot study using mobile eye tracking to examine gaze behaviour in a four-member band across three songs, each practiced twice. Musicians wore Pupil Labs Neon eye trackers, and YOLOv8-assisted scene annotations mapped fixations to ensemble members and objects in view. Analyzing fixation matrices, transition matrices, temporal scarf plots, and dwell-transition correlations, we uncover a hub-and-spoke attention topology: the session leader was the dominant gaze target for all members, while the learning guitarist concentrated up to 97% of interpersonal dwell on this single reference. Between attempts, gaze transitions decreased by up to 65% on average for unfamiliar material (up to 82% for individual participants) as scanning stabilized. Scarf plots reveal how teaching breakdowns fragment attention and uninterrupted runs consolidate it. Post-session participant reflections align with the quantitative patterns, and we discuss implications for gaze-aware tools in ensemble pedagogy.
Authors:Floor Bontje, Felix van Waveren, Leendert van Maanen, Bhargav Nallapu, Gustav Markkula, Arkady Zgonnikov
Abstract:
Evidence accumulation models provide a formal framework for studying decision making as a dynamic process unfolding over time. While these models have been extensively developed and reviewed in laboratory paradigms, their structured application in complex, ecologically valid domains has received comparatively little attention. Road traffic is a particularly relevant context for studying sustained, embodied perception action behavior, where decisions unfold under time pressure and involve continuous control and ongoing perception-action coupling. Examining how EAMs have been applied in this domain may therefore offer insights beyond discrete laboratory tasks toward decision making in real-world behavior. This semi-systematic review synthesizes 28 studies (2014-2026) applying EAMs to traffic-related behavior. We organize the literature along two dimensions: 1) modelling level, distinguishing models at the level of discrete decision-making and models at the level of continuous action control, and 2) model architecture, distinguishing evidence accumulation as either a stand-alone decision model or an embedded component within broader perception-action or interaction frameworks. These distinctions are associated with systematic differences in model architecture, parameterization, data usage, and validation strategies, reflecting task specific demands. By providing a structured overview of these patterns, this review clarifies how EAMs are currently instantiated in traffic contexts and highlights methodological challenges and future directions both in traffic modelling and in modelling of decision-making more broadly. Promising directions include laboratory work on evidence accumulation in sustained and time-varying tasks, interactive multi-individual decision-making, and the use of neurophysiological measures to identify the perceptual evidence underlying complex perception-action behavior.
Authors:Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian
Abstract:
Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experiential consequences of this teaming. More specifically, in a team with AI, how humans perceive themselves (self-perception) and how they are perceived by their coworkers (peer perception) in terms of work ownership and job meaningfulness. In a 2x2x2 vignette study (n=50), participants rated perceptions of ownership, affect, job meaningfulness and satisfaction, and role dynamics across two levels (low/high) of AI proactivity and AI competency as within-subject factors, with point-of-view (self perception/peer perception) as between-subjects. Our results showed that AI with low competency or low proactivity generally improved feelings related to ownership, meaningfulness, satisfaction, and role dynamics, and also increased positive affect while reducing negative affect. However, these effects were often influenced by point-of-view. For instance, low AI proactivity resulted in higher job satisfaction from self-perception rather than peer perception. Based on our findings, we argue that designing AI for the future of work solely around performance metrics may not be adequate. Highly competent and proactive AI-driven systems can have undesirable impacts on perceptions of ownership, job identity, social image and team dynamics, and consequently, job meaningfulness.
Authors:Mona Giff, Stephen Giff, Huseyin Dogan
Abstract:
User Experience Research (UXR) is currently undergoing a transition from traditional usability testing towards design-led and data-driven approaches, yet it faces an identity crisis due to a lack of methodological grounding in UXR and time-intensive methodologies which often lag behind product decision cycles. To address this, the UXR Point of View (PoV) framework formalises the UXR process by transitioning from raw data collection to forming an evidence-based PoV which drives strategic product impact. Furthermore, the use of GenAI in UXR has been investigated, but researchers often face increased work intensity when using GenAI, attributed to time spent on prompt engineering, data cleaning, and verification of AI outputs. This paper proposes and evaluates a formalised methodology for leveraging GenAI, specifically Google's NotebookLM, to augment the UXR PoV process. The methodology consists of five prompts across four stages: (1) leveraging the framework, (2) establishing roadmaps, (3) applying best-practices, and (4) crafting PoV narratives; and was tested on eleven UXR papers. Results showed that by using the proposed methodology, NotebookLM successfully leveraged the UXR PoV framework across all stages of PoV creation. These findings demonstrate that NotebookLM can serve as an effective collaborative partner in UXR, so long as it is provided with sufficient context and specific prompting.
Authors:Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian
Abstract:
The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about their experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differing perceptions of its underlying decency and meaningfulness. For instance, IT and healthcare anticipate increased satisfaction with decency aspects like working hours but decreased satisfaction with meaningfulness aspects like social image due to misconceptions about AI handling most of their tasks. Conversely, service workers foresee no improvement in their working hours but a higher social standing due to the perceived status boost associated with working with AI.
Authors:Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber
Abstract:
AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice -- deciding when to let AI act autonomously without knowing its output, and the adoption choice -- evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human--AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human--AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.
Authors:Shang Wu, Saatvik Kher, Padhraic Smyth
Abstract:
We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we focus on the sequential learning of agent expertise and assignment policies where each agent is constrained to handle a fraction of tasks. We provide a general theoretical characterization of this problem in terms of agent capacities, differences in agent expertise, and task context. We then develop a framework of sequential explore-exploit policy-learning algorithms that seek to maximize overall performance. Experimental results over a variety of tabular, image, and text prediction tasks demonstrate systematic gains from our policy-learning algorithms relative to non-contextual baselines across different types of agents, including LLMs and humans.
Authors:Shuyang Li, Ruimin Ke
Abstract:
The integration of Large Language Models (LLMs) with microscopic traffic simulation offers a promising path toward autonomous urban planning and intelligent transportation analysis. However, existing monolithic agent architectures often struggle with the complexity of end-to-end simulation workflows, leading to reasoning failures, parameter inconsistency, and a lack of systematic state management. This paper proposes a novel multi-agent collaborative framework designed to automate the entire lifecycle of traffic simulation in SUMO (Simulation of Urban Mobility). Our approach decouples the simulation pipeline into specialized roles, including Planner, Builder, Demand, Runner, and Analyst, coordinated by a high-level reasoning engine. We introduce a state-persistent Orchestrator leveraging the Model Context Protocol (MCP) to ensure seamless data handover and environmental consistency across distributed agent actions. This architecture enables a robust closed-loop refinement process, where simulation outcomes are iteratively analyzed and optimized to satisfy user-defined Key Performance Indicators (KPIs). Experimental results through role ablation studies demonstrate that the proposed multi-agent framework significantly enhances task success rates and parameter accuracy compared to single-agent baselines. Furthermore, case studies on real-world network extraction and traffic optimization highlight the system's capability to bridge the gap between high-level natural language intent and low-level simulation execution.
Authors:Houman Kazemzadeh, Kamyar Naderi
Abstract:
Radiology reports remain the primary mechanism by which imaging findings are communicated to clinical teams. However, much of the structured information behind these reports, including measurements, image evidence, prior comparisons, lesion identity, uncertainty, and terminology, often remains trapped in free text or fragmented across picture archiving and communication systems, radiology information systems, reporting workstations, worksheets, advanced visualization tools, and electronic health records. This paper proposes a human-supervised, evidence-linked reference architecture for structured radiology reporting. The framework combines exam-specific templates, speech-to-structure processing, measurement and segmentation capture, controlled AI-assisted drafting, and standards-based interoperability using DICOM, DICOM Structured Reporting, DICOM Segmentation, HL7 FHIR, RadLex, SNOMED CT, LOINC, and UCUM. The system is positioned not as an autonomous report generator, but as a structured intelligence layer for enterprise imaging that supports reviewed reporting, longitudinal comparison, clinical data reuse, governance, and integration with PACS, RIS, EHR, analytics, and registry workflows. The paper also discusses modality-specific deployment considerations, clinical safety risks, validation requirements, cybersecurity, privacy, quality management, and regulatory boundaries for AI-assisted radiology reporting systems.
Authors:Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu
Abstract:
Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also in their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found the degree of persuasiveness, or lack thereof, for LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, but both when the AI prediction was correct and incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.
Authors:Marie-Therese Sekwenz, Shreyan Biswas, Rita Hermann-Gsenger, Ujwal Gadiraju
Abstract:
Illegal content reporting mechanisms are a key technical and organizational measure through which online platforms address illegal content under the European Union Digital Services Act (DSA). Article 16 requires user notices to be sufficiently substantiated and submitted in good faith, placing users in the difficult position of interpreting legal and procedural language and translating ambiguous content into legally meaningful categories and reasons. We investigate how large language model (LLM)-based assistants can support this reporting process. In a controlled user study (N = 450) using an interface modeled on a major platform reporting workflow, we compare three conditions: unaided reporting, a conventional explainable AI assistant (XAI) that suggests a single legal category with a rationale, and an evaluative AI assistant (EvalAI) that presents balanced pro and con arguments across candidate legal provisions. We further examine these assistance forms under systematically varied AI error regimes. Our results show that EvalAI improves provision-level accuracy under AI error and reduces misclassification distance relative to conventional XAI, particularly for near-miss and overbreadth errors. When AI output is correct, conventional XAI enables faster decisions, but neither AI assistance form reliably improves the quality of users' substantiated explanations relative to unaided reporting. We discuss design implications for compliance-oriented reporting interfaces, highlighting trade-offs between accuracy, deliberation, explanation quality, and vulnerability to misleading AI output.
Authors:Mahfuza Farooque, Ananya Drishti, Mukhil Muruganantham Prakaash, Uttkarsh Agarwal, Zahra Abdul Basit, Asish Kondragunta
Abstract:
We present Cogniscope, an open evaluation framework for studying longitudinal early-risk AI systems under controlled behavioral drift, sparse observations, delayed evidence, and heterogeneous progression patterns. Cogniscope combines two complementary components: a synthetic simulation engine that generates privacy-preserving longitudinal behavioral traces aligned with configurable latent risk trajectories, and a browser-based data-collection instrument implemented as a Chrome extension for capturing naturalistic video interaction telemetry and micro-question responses during YouTube playback. The released benchmark includes 200,000 simulated video-interaction records from 200 users over 200 days, a 504-session schema-aligned synthetic deployment dataset across nine behavioral profiles, an 18-table relational schema, baseline evaluation scripts, and time-aware metrics including Early Risk Detection Error (ERDE) and time-to-detection (TTD). We emphasize that Cogniscope is not a diagnostic system and does not claim clinical validity. Instead, it provides a reusable testbed for evaluating how sequential models behave under known longitudinal challenges before deployment with real human-subject data. Experiments show that simple behavioral coherence signals separate simulated risk states under controlled priors, while rule-based deployment-profile classification remains challenging, motivating learned temporal models and robust evaluation protocols.
Authors:John Paul P. Miranda, Emmanuel B. Parreño, Jovita G. Rivera
Abstract:
The integration of AI tools in academic settings has introduced a distinct form of strain that existing frameworks like technostress and digital fatigue have not yet fully addressed. This study develops a conceptual model and identifies the dimensions that define AI fatigue as a form of strain arising from sustained academic use of AI tools. Using grounded theory analysis of open-ended responses from 1,054 university students across three universities in the Philippines, the study examined the cognitive, motivational, emotional, physical, and attentional pressures students experienced during AI-supported academic work. Analysis produced five dimensions of AI fatigue, namely Cognitive Overload, Motivational Disengagement, Moral Unease, Physical Strain, and Attentional Drift, each consisting of two indicators grounded in participant accounts. The findings also yielded the AI Fatigue Model, a stage-based framework that explains how these pressures accumulate and reinforce one another across repeated AI interaction in academic tasks. These contributions establish a conceptual and exploratory foundation for AI fatigue as a distinct construct and provide a basis for future instrument validation, scale development, and cross-contextual inquiry in academic settings where AI now mediates student learning.
Authors:Alihan Bakir, Ekrem Yüksel, Fabio Zuliani, Neil Chennoufi, Francesco Bruno, Jamie Paik
Abstract:
Humanity is at the forefront of yet another digital revolution, where the lines between real and virtual worlds are dissolving, reshaping how we perceive and interact with our surroundings. In this context, we introduce a transformative paradigm for immersive virtual experiences centered around whole-body kinetic interactions. Our approach redefines immersion through three distinct levels: audio-visual immersion, capturing sensory realism; physical immersion, delivering haptic feedback; and full-body immersion (FBI), where dynamic bodily interaction integrates seamlessly with virtual environments. At the core of this innovation lies a scalable, distributable platform based on modular robotic surface units inspired by the adaptive designs of nature. These units enable the rendering of immersive environments at any scale, from intimate personal experiences to expansive multi-user settings, dynamically adapting to interactions in real-time. The modular system distributes force, shape, and motion feedback throughout entire spaces, replicating the physical characteristics of the environment and enabling new depth of engagement through FBI. By combining scalability, adaptability, and dynamic physical engagement, this framework bridges the gap between real and virtual worlds. It offers an unprecedented level of immersion where users can engage their entire bodies in symbiotic interactions with the virtual space. This work not only advances immersive technology but also redefines how humans and virtual environments coexist, setting a foundation for a new era of human-environment synthesis.
Authors:Ray-Yuan Chung, Athena Ortega, Zixuan Xu, Daeun Yoo, Jaime Snyder, Wanda Pratt, Aaron Wightman, Ryan Hutson, Cozumel Pruette, Ari Pollack
Abstract:
In pediatrics, patients, caregivers, and clinicians share responsibility for health decisions, but limited collaboration can undermine outcomes. We conducted a qualitative study examining decision-makers perceptions toward collaborative decision-making technologies, including interactive dashboards, VR simulators, and AI voice assistants. Findings reveal differences in user opinions across groups and indicate technology acceptance is linked to users trust of these technologies. Technology developers and researchers need to explore design and implementation strategies that build and facilitate trust or appropriate distrust between users and these novel technologies before these tools can effectively support collaborative decision-making.
Authors:Tommaso Turchi, Ben Wilson, Matt Roach, Alan Dix, Alessio Malizia
Abstract:
AI is now embedded in healthcare, finance, policy, and many other domains, yet genuine human-AI synergy - combined performance that exceeds what either party achieves alone - is uncommon. Meta-analyses show that AI assistance tends to improve human performance compared to working alone, but studies finding true synergy are scarce. We call this persistent shortfall the synergy gap. Most current work treats human-AI combination as an engineering problem and concentrates on interpretability, trust calibration, or interface design. These matter, but they cover only part of what determines whether combination works. Closing the synergy gap, we argue, requires explicit engagement with a wider design space. We map that space through six interconnected elements: sociotechnical context, decision-making frameworks, human decision participants, AI capabilities, interaction, and holistic evaluation. For each element, we describe what it covers, how it shapes the others in practice, and what it implies for design. The result is a shared vocabulary for practitioners building hybrid systems, an analytical lens for researchers studying combination patterns, and a starting point for evaluators interested in the full quality of human-AI decision-making rather than accuracy alone.
Authors:Madhuri Singh, Gennie Mansi, Mark Owen Riedl
Abstract:
K-12 teachers employ Engineering Design Challenges to help students learn about the Engineering Design Process hands-on. They use techniques like hard scaffolding questions to guide the students as they think through the different stages of the engineering design process. While useful, the creation of these questions adds to the teacher's preparation time for their classes. Concept Catalyst uses Large Language Models to assist teachers with the rapid creation of scaffold questions for engineering design challenges. Unlike open-ended chat, Concept Catalyst uses LLMs to summarize and decompose an engineering design challenge into the concepts that students will engage with, allow the teacher to visually manipulate and link related concepts, and to propose scaffolding questions for the teacher to modify or accept.
Authors:Sijia Qian, Cuihua Shen, Jingwen Zhang, Magdalena Wojcieszak
Abstract:
Cheapfakes, or real images presented misleadingly or in unrelated contexts, are an increasingly prominent form of visual misinformation. While media literacy interventions can enhance individuals' ability to detect such content, motivational barriers often hinder the adoption of image verification. This study examines whether incorporating different mechanisms and types of incentives into a digital media literacy intervention improves visual misinformation discernment and image verification behavior, both immediately and over time. We conducted a pre-registered two-wave between-subjects online experiment (N = 1,421) on a professionally designed social media platform. The study used a 2 (Incentive Type: symbolic vs. monetary) x 2 (Incentive Mechanism: task- vs. result-based) factorial design with additional control groups. Results show that task-based incentives, particularly monetary ones, were most effective at initiating image verification behaviors, namely reverse image search, and boosting short-term discernment, whereas result-based incentives were more effective in sustaining discernment accuracy. These findings suggest that both the mechanism and the type of incentives play a critical role in shaping the short- and long-term effectiveness of media literacy interventions, highlighting the value of multi-phased incentive strategies for combating visual misinformation in digital environments.
Authors:Yuhui Xu, Isabel Blijenburg, Bhakti Moghe, Maarten Houben, Daniel Tetteroo, Wijnand IJsselsteijn, Minha Lee
Abstract:
International students face struggles when adapting to the host country. They are more susceptible to mental health problems than domestic students. While Conversational User Interfaces (CUIs) are increasingly researched and implemented, research on how they may help international university students is still scarce. Thus, we conducted participatory design workshops with international students who shared their perspectives and struggles of studying abroad, in which they also envisioned CUIs as aids to support their transitions. Participants proposed features of a CUI to address uncertainty, loneliness, and misunderstandings of cultural differences. Our paper reveals international students' needs and provides design implications for CUIs to support their well-being.
Authors:Zheyuan Zhang, Rafael A. Calvo
Abstract:
User engagement is crucial for the efficacy of digital health and mental health interventions, yet existing design strategies for improving engagement remain heterogeneous, context-specific, and insufficiently grounded in motivational theory. In this paper, we propose a preliminary, theory-grounded design framework that draws on Self-Determination Theory (SDT) and its sub-theory, Organismic Integration Theory (OIT), to guide the design of digital health interventions for sustained user engagement. Informed by existing literature and our own empirical data from surveys (N = 438), interviews (N = 31), and co-design workshops (N = 59) with end users, the framework categorises design strategies across the adoption, interface, and task spheres of the user experience, distinguishing between those that primarily support intrinsic motivation and those that foster autonomous forms of extrinsic motivation. We argue that this distinction is critical: strategies commonly grouped under umbrella terms such as "gamification" in fact operate through different motivational channels and should be designed and evaluated accordingly. By clarifying these motivational pathways, our framework aims to support researchers and practitioners in designing digital health interventions that not only facilitate initial uptake but also enhance the internalisation of health behaviours for long-term, sustained engagement. We present this framework as a basis for discussion at this workshop, inviting expert feedback and critique to refine it as a contribution to the field.
Authors:Hua Xuan Qin, Guangzhi Zhu, Mingming Fan, Pan Hui
Abstract:
Mainstream creativity support design prioritizes compliant AI for seamless writing interactions, but concerns over inappropriate AI reliance highlight the need for designs fostering reflection on balanced AI and non-AI resource use. Theoretically, intentional AI non-compliance, refusals (saying ``no'' to requests), could introduce such reflection through friction stronger than other bypass-able solutions. Practically, refusal content/language characteristics lead to nuanced reactions. However, little research empirically focuses on nuances beyond mandatory ethical/technical constraints, on turning refusals into strategic friction for `innocuous' requests. We address this through a qualitative study with 22 creative writers, exploring reactions to refusals to common requests across writing stages (planning, translating, reviewing). Findings suggest that reflective potential depends on heterogeneous preference alignment along situational (e.g., convergent/divergent thinking phases), cognitive (e.g., domain beliefs), and relational (e.g., AI roles) dimensions. We discuss implications for creativity support, broader issues (e.g., AI addiction), and frictional/seamful AI design (e.g., integrating different compliance levels).
Authors:Esther Bosch, Klas Ihme, Stefan Bohmann
Abstract:
Understanding how travelers form overall evaluations of public transport journeys is critical for improving travel satisfaction and encouraging sustainable mode choice. While travel satisfaction is discussed to influence attitudes and future behavior, the cognitive rules by which moment-to-moment experiences are aggregated into retrospective evaluations remain poorly understood in transport research. Drawing on psychological theories of experienced and remembered utility, this study investigates which temporal aggregation heuristics best predict post-trip travel satisfaction. Using a smartphone-based experience sampling approach, we collected high-frequency on-trip experience ratings and post-trip evaluations for 2576 real-world public transport trips across three German cities. Travel experience was assessed every five minutes during trips using a multi-item scale, allowing direct comparison of competing aggregation rules, including mean experience, peak-end, minimum-end, final moment, and trip duration. Multilevel regression models were estimated to evaluate the explanatory power of each heuristic. Results show that retrospective travel satisfaction is best predicted by a Minimum-End heuristic, combining the most negative moment of the journey and the final experience. Models based on mean experience, peak-end rules, final moment alone, or trip duration performed substantially worse. This pattern indicates that both negative extremes and the final phase of a journey independently contribute to remembered evaluations, rather than overall satisfaction reflecting an average of momentary experiences. The results have important implications for theory and practice, suggesting that targeted interventions at critical negative moments and at trip endings may yield substantial improvements in remembered satisfaction and, ultimately, support shifts toward sustainable mobility.
Authors:Srinivas Ravishankar, Ishayu Ghosh, Nora Zajzon, Teng Fei, Virginia de Sa
Abstract:
Recent attempts at creating Foundation Models (FMs) for Electroencephalography (EEG) have achieved state-of-the-art performance on multiple tasks including Motor Imagery (MI). These MI tasks have typically involved coarse classification between imagined limb movements. However, the development of foundation models necessitates diverse datasets, both for pretraining and evaluating the progress of these models. In this work, we propose handwriting decoding as a challenging motor task for FMs. We show that several existing datasets are potentially confounded, and introduce a dataset that more rigorously evaluates models. On this dataset, we find that current FMs, despite showing SOTA performance in multiple MI datasets are outperformed by smaller task-specific models. We also highlight challenges specific to EEG-based handwriting decoding to inform future work. In our 4-letter classification task, we show that (a) Knowledge of movement-onset is crucial to reported decoding performance in prior works, with average performance across subjects dropping from $41.3\%$ to $32.4\%$. (b) Increasing test-time signal quality provides significant performance improvements ($45\%$ to $78\%$ in our best subject) compared to scaling training data with single-trial EEG. (c) While scaling training data steadily improves decoding performance, existing FMs do not outperform specialist models in handwriting decoding. We make our code available at https://anonymous.4open.science/r/EEG-Handwriting-BCI-DFCD/
Authors:Ben Wilson, Matimba Swana, Peter Winter, Matt Roach
Abstract:
The phrase 'human in the loop' is increasingly used to imply a sense of safety in relation to AI decision systems. It shouldn't. There are contexts where it can be applied appropriately, but these are not in the deployed decision systems we see dominating today. Human oversight of AI decision processes is one of the most popular proposals for addressing concerns, especially about bias, discrimination, misinformation, manipulation, accountability, and transparency. But there is insufficient examination of what human oversight actually means. The question raised in this paper is whether using the metaphor of a loop does anything to assist understanding of what is required and what is achieved in a particular decision context. Indiscriminate use of the loop metaphor obscures both processes and outcomes. It enables 'humanwashing', an activity analogous to 'greenwashing', where writers and commentators use language primarily aimed at putting systems in the best possible light.
Authors:Jacob Lagogiannis, William Agnew, Rosa I. Arriaga, Sauvik Das
Abstract:
Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice -- because the ``subtle'' alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 ``aesthetic'' filters that emulate popular ``one-click'' Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study ($N=630$) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.
Authors:Yichen Andy Yu, Wanru Li, Qiaoran Wang, Jymon Ross, Gavin Johnson, Mandy Lui, Qiao Jin
Abstract:
We present SpatialPrompt, an Extended Reality(XR) system that turns spatial sketches into executable constraints for controllable 3D generation. Users draw rough structures with a 3D pen and add voice prompts for semantic and stylistic intent. The system supports iterative refinement and synchronous co-creation in shared space with color-coded contributions. Implemented on Apple Vision Pro with Logitech Muse and Meshy, a heuristic evaluation suggests that the workflow is intuitive and supports shared understanding in collaborative creation, while revealing needs for faster generation and clearer feedback.
Authors:Vasilis Niarchos, Constantinos Papageorgakis, Alexander G. Stapleton, Sokratis Trifinopoulos
Abstract:
As large language models (LLMs) show increasing promise on research-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: How does the interaction between researchers and agents affect the results? We study this using SCALAR (Structured Critic--Actor Loop for AI Reasoning), an Actor--Critic--Judge pipeline applied to quantum field theory and string theory problems. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale. Multi-turn dialogue improves over single-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor--Critic pairing. Increasing the scale within one model family (e.g. from the 8B-parameter DeepSeek-R1 variant to DeepSeek-R1 70B) improves some easier-problem behavior, but does not remove the hardest bottleneck we observe. Critic feedback strategy matters most clearly in the asymmetric Actor--Critic setting (e.g., a lightweight Haiku Actor guided by a stronger Sonnet Critic), where constructive feedback improves mean-score outcomes. In same-family Actor--Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI-driven scientific discovery.
Authors:Gaolin Ge, Qifeng Yang, Haoran Lu, Tingyu Cheng, Martin Nisser, Yiyue Luo
Abstract:
We introduce an elastic-driven self-folding approach that fabricates robots directly from flat 3D-printed conductive PLA nets. Elastic bands routed through printed hooks store energy that folds the sheet into programmed 3D geometries, while the flat state allows accurate placement of electronics and magnets before deployment. The same substrate doubles as electrodes for capacitive touch and supports a reusable platform I/O palette with Hall sensors and eccentric rotating mass (ERM) motors for docking detection and vibration actuation. We also derive a closed-form folding model that balances hinge stiffness with elastic band moment to predict equilibrium fold angles; experiments validate the model and yield a design map linking hinge thickness, band size, and hook spacing to target angles. Using this workflow we realize multiple polyhedral modules and demonstrate three applications: a cube that highlights the potential of self-folding for scalable modular robot collectives, a deployable gripper, and a tendon-driven finger. The method is low cost, stimulus-free, and integrates actuation and sensing.
Authors:Lennard C. Froma, Tom Kouwenhoven, Maaike H. T. de Boer, Catholijn M. Jonker, Max J. van Duijn
Abstract:
Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.
Authors:Pitch Sinlapanuntakul, Soyun Moon, Yuri Kawada, Yeha Chung, Mark Zachry
Abstract:
Early-stage concept envisioning is a critical juncture in AI design, shaping how designers frame problems and the decisions that follow. Yet values and potential harms are often too abstract or addressed too late to meaningfully shape design. Using a Research-through-Design (RtD) approach, we developed the AI Concept Envisioning Toolkit, comprising an AI Capability Library, 24 Value--Harm Cards, and a Value--Tension Map, to support reasoning by juxtaposing values and harms within AI technical capabilities. Through a survey with 30 designers and in-depth interviews with 12 designers, we find that the toolkit is clear and perceived as valuable, and that it encourages value reflection, helps anticipate potential harms, and makes ethical considerations more transparent in early-stage design. We reflect on our design process and discuss design approaches for tools that promote reflection on values and potential harms, surface and navigate value tensions, and introduce productive friction throughout design workflows.
Authors:Pitch Sinlapanuntakul, Aayushi Dangol, Xiaoyi Xue, Mark Zachry
Abstract:
As AI integrates into design practice, designers increasingly use generative AI tools to envision AI-enabled solutions, positioning AI as both design tool and design material. This dual role creates recursive value tensions distinct from traditional design work. We engaged 18 designers in a concept envisioning activity and interviews to understand how they navigate values and recognize potential harms in this context. Our analysis reveals that (i) designers engage in reciprocal reflection-in-action with AI; (ii) this process surfaces multi-level value tensions across tool, designer, and concept; (iii) designers demonstrate greater attunement to harm recognition as a primary design signal than to articulating positive value fulfillment; and (iv) designers exercise anticipatory judgment through meta-design reasoning about how tool assumptions risk propagating into designed concepts and future use contexts. We extend Schon's reflection-in-action framework and discuss implications for redesigning AI-mediated design tools, supporting harm-centered reasoning, and positioning design as foundational to AI development.
Authors:Ting-Chen Hsu, Jiangxu Lin, Wenran Chen, Fei Qin, Zheyuan Zhang
Abstract:
As AI becomes increasingly embedded in digital games, players' attitudes de-pend not only on whether AI is used, but also on where and how it intervenes in gameplay. This study examines players' evaluative patterns toward eight AI application contexts, including intelligent NPCs, emergent narrative, dynamic balancing, recommendation systems, review and governance, art asset generation, co-creation gameplay, and gameplay evolution. Based on 1,856 valid open-ended responses from 310 questionnaires, we conducted thematic analysis to identify reasons for acceptance, rejection, and conditional acceptance. Results show that players welcomed AI when it enhanced immersion, personalization, novelty, efficiency, or convenience, but resisted it when it threatened creativity, emotional authenticity, autonomy, fairness, system stability, authorship, or accountability. We further identify six evaluative logics: experiential enrichment, instrumental efficiency, system reliability, agency and control, authorship and compliance, and human oversight. These preliminary findings highlight the context-sensitive nature of AI acceptance in digital games.
Authors:Jillian Ross, Eric So, Andrew W. Lo
Abstract:
Financial misconceptions carry direct economic costs, from panic selling to equity market avoidance, yet they are notoriously resistant to correction. Traditional financial literacy interventions are constrained by cost, reach, and a persistent gap between knowledge and behavioral change. Across three pre-registered studies, we find that purposefully designed LLMs can durably correct financial misconceptions. Critically, two factors are necessary for this effect. First, corrective intent: LLMs prompted only to discuss a misconception produce corrections no better than unassisted self-reflection, and undirected LLM conversations can actively entrench misconceptions. Second, recipient receptivity: financial concepts are often foreign to the investors who misapply them, and LLM responses pitched below a participant's financial sophistication are judged as less credible and produce substantially weaker corrections. LLMs thus offer a scalable alternative to traditional financial literacy intervention, but only when designed with both factors in mind.
Authors:Meng Xia, Yan Chen, Qiao Jin, Yang Shi, Paul Denny, Tiffany Barnes, Qingsong Wen, Vincent Aleven
Abstract:
This workshop addresses this gap by bringing together researchers and practitioners from AI, HCI, and the learning sciences to explore how interactive systems can better support learning. We focus on the design and evaluation of human-AI collaborative learning interfaces that are technically robust, human-centered, and pedagogically grounded. By fostering interdisciplinary dialogue, the workshop aims to identify shared challenges, design principles, and research directions for next-generation learning technologies.
Authors:Lara Vartziotis, Tina Vartziotis, Frank Beutenmueller, Stella Salta, Konstantinos Moraitis, Miltiadis Katsaros, Sotirios Kotsopoulos
Abstract:
In remote and hybrid work contexts, the integration of physical and digital environments is revolutionizing spatial experiences, collaboration, and interpersonal interactions. This study examines three fundamental spatial conditions: the physical environment, characterized by material and sensory attributes; the virtual environment, influenced by immersive technologies; and their fusion into hybrid environments where digital and physical components interact dynamically. The increasing number of AI tools in contemporary society, extensively utilized in both professional and personal spheres, has led to a varied landscape of developing technologies. For instance, ChatGPT has emerged as one of the most downloaded applications, a statistically substantiated fact that demonstrates the swift incorporation of language-based AI into daily life. It also underscores the function of large language models (LLMs) as meaningful bridges between concepts at reading emotional and behavioral signals via natural language. These models provide real-time modifications such as altering illumination, acoustics, or interface configurations, converting static settings into dynamic, emotionally receptive environments. We investigate the integration of language models into professional settings and their potential to enhance user experience by promoting focus, well-being, and engagement. The study investigates ethical concerns, including privacy, emotional tracking, and user agency, emphasizing the importance of inclusive and transparent design. This research formulates a framework for creating co-adaptive environments that merge technological innovation with human-centered experiences, offering a fresh viewpoint on responsive and supportive hybrid workspaces.
Authors:Soonho Kwon, Dong Whi Yoo, Shaowen Bardzell, Younah Kang
Abstract:
We present four conceptual value-sensitive AI systems to examine how the presence of AI could influence praying experiences. Drawing on key values and practices associated with praying identified through a diary study, we designed AI systems intended to "assist" prayer practices. These designs were presented to participants through speculative design workbooks, serving as provocations to co-reflect on how the intervention of AI systems might shape their praying experiences. Our findings suggest that a sense of authenticity (or feeling a genuine connection to the divine) is a crucial value, while the presence of AI was often perceived as diminishing this authenticity, particularly when AI assumed too much agency in guiding praying practices. Based on our findings, we argue that AI system designs for deeply value-laden experiences should preserve users' agency in shaping their own experiences by maintaining interpretive openness, perhaps by leveraging AI's inexplicability as a resource for personal meaning-making or by recognizing non-use of AI as a legitimate design choice.
Authors:Aydin Ayanzadeh, Tim Oates
Abstract:
Indoor navigation remains a critical accessibility challenge for the blind and low-vision (BLV) individuals, as existing solutions rely on costly per-building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single-call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP-3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single-call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.
Authors:Yangyang Zhao, Linfan Dai, Li Cai, Bowen Xing, Libo Qin
Abstract:
Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while Reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
Authors:Xiang Li, Cara Li, Emily Kuang, Can Liu, Jian Zhao
Abstract:
Knowledge workers face increasing challenges in synthesizing information from multiple documents into structured conceptual understanding. This process is inherently iterative: users explore content, identify relationships between concepts, and continuously reorganize their mental models. However, current approaches offer limited support. LLM-based systems let users query information but not shape how knowledge is organized; manual tools like mind maps support structure creation but lack intelligent assistance. This leaves an open opportunity: supporting collaborative construction where users and AI jointly develop an evolving knowledge representation. We present MindTrellis, an interactive visual system where users and AI collaboratively build a dynamic knowledge graph. Users can query the graph to retrieve document-grounded information, and contribute by introducing new concepts, modifying relationships, and reorganizing the hierarchy to reflect their developing understanding. In a user study where 12 participants created slide decks, MindTrellis outperformed retrieval-only baselines in knowledge organization and cognitive load, as measured by expert ratings of content coverage and structural quality.
Authors:Maryam Mustafa, Imaan Hameed, Amna Shahnawaz, Bilal A Mateen
Abstract:
Despite steady global advances, maternal mortality remains alarmingly high in Pakistan (155 deaths per 100,000 live births in 2023); largely as a consequence of fragmented paper records, low literacy, poor access to quality healthcare, and gendered barriers that compromise care continuity. Over three years, we designed, deployed, and iteratively developed Awaaz-e-Sehat, a speech-based artificial intelligence (AI) system that generates electronic medical records (EMRs) and supports decision-making in maternal health. The tool evolved from a clinician-facing AI assistant that automated Urdu speech-to-EMR generation into a patient-centred WhatsApp-based platform, enabling women to generate their own structured clinical notes, receive AI-generated antenatal guidance, and share QR-coded records with providers anywhere in the country. This case study documents that translational journey, i.e., how the ground realities of workload, linguistic nuance, and infrastructural constraints reshaped our design. The result is not merely a new method of record-keeping, but a reimagining of antenatal care and electronic medical records themselves. In settings where clinicians are time-constrained and have little institutional incentive to document, Awaaz-e-Sehat proposes a model of care that centres patients as active participants in generating and owning their health data. By keeping patients informed about their own risk factors and integrating them into the clinical decision-support loop, the system transforms EMRs and CDSS from static institutional artefacts into dynamic tools for self-advocacy and shared accountability in maternal health.
Authors:Jiwon Chun, Yuling Zhuang, Armanto Sutedjo, Colin Xu, Rong Ren, Meng Xia
Abstract:
Facilitating productive mathematical argumentation, especially asking rational questions, is essential yet remains challenging for pre-service mathematics teachers (PMTs), who often have limited opportunities to apply abstract theoretical knowledge in authentic practice. At the same time, recent advances in large language models (LLMs) have expanded the potential for simulating students in educational settings, enabling low-risk environments for instructional practice. To inform the design of a system that supports PMTs in orchestrating classroom argumentation, we conducted a formative study with eight experienced mathematics teachers to identify key design requirements, including personalization, realistic simulations, structured reflection, and ease of use. Building on these requirements, we developed ArguMath, an AI-simulated classroom environment that supports PMTs in practicing the orchestration of mathematical argumentation. ArguMath comprises three core components: (1) customization of classroom settings; (2) simulation of classroom discussions with AI-based students grounded in authentic transcripts and augmented with real-time instructional suggestions; and (3) structured reflection through discourse annotation and overall feedback. Results from an exploratory user study with seven PMTs, complemented by interviews with four experienced teachers, indicate that ArguMath has the potential to support PMTs' classroom orchestration skills, particularly theory-aligned questioning strategies.
Authors:Linxiu Zeng, Emily Kuang, Jian Zhao
Abstract:
Authoring presentation slides involves navigating contextual constraints that shape how content is structured, adapted, and reused. While prior work frames constraints as limitations, little is known about how presenters actively reason about them. We conducted a formative study with ten presenters to examine how constraints emerge, are interpreted, and influence authoring decisions, leading to the Constraint-based Multi-session Presentation Authoring (CMPA) framework. CMPA treats time, audience, and communicative intent as key constraints shaping authoring. We instantiated CMPA in ReSlide, a research prototype for constraint-aware slide creation and reuse, and conducted two user studies on (1) single-session behaviors and (2) multi-session workflows. Compared to a baseline tool, ReSlide helped presenters treat constraints as active design drivers that guide narrative construction. The second study further shows how presenters flexibly reuse and adapt content across authoring cycles as constraints evolve. We then propose design implications for future constraint-aware presentation tools.
Authors:Philippe E. Spiess, Md Muntasir Zitu, Alison Walker, Daniel A. Anaya, Robert M. Wenham, Michael Vogelbaum, Daniel Grass, Ali-Musa Jaffer, Amod Sarnaik, Caitlin McMullen, Christine Sam, John V. Kiluk, Tianshi Liu, Tiago Biachi, Julio Powsang, Jing-Yi Chern, Roger Li, Seth Felder, Samuel Reynolds, Michael Shafique, Alison Sheehan, Ashley Layman, Cydney A. Warfield, Derrick Legoas, Jaclyn Parrinello, Jena Schmitz, Kevin Eaton, Mark Honor, Luis Felipe, Issam ElNaqa, Elier Delgado, Talia Berler, Rachael V. Phillips, Frantz Francisque, Carlos Garcia Fernandez, Gilmer Valdes
Abstract:
Background: More than 80% of U.S. cancer care is delivered in community settings, where survival remains worse than at academic centers. Clinicians must integrate genomics, staging, radiology, pathology, and changing guidelines, creating cognitive burden. We evaluated OncoBrain, an AI clinical reasoning platform for oncology treatment-plan generation, as an early step toward OGI. Methods: OncoBrain combines general-purpose LLMs with a cancer-specific graph retrieval-augmented generation layer, a gold-standard treatment-plan corpus as long-term memory, and a model-agnostic safety layer (CHECK) for hallucination detection and suppression. We evaluated clinician-enriched case summaries across gynecologic, genitourinary, neuro-oncology, gastrointestinal/hepatobiliary, and hematologic malignancies. Three clinician groups completed structured evaluations of 173 cases using a common 16-item instrument: subspecialist oncologists reviewed 50 cases, physician reviewers 78, and advanced practice providers 45. Results: Ratings were highest for scientific accuracy, evidence support, and safety, with lower but favorable scores for workflow integration and time savings. On a 5-point scale, mean alignment with evidence and guidelines was 4.60, 4.56, and 4.70 across subspecialists, physician reviewers, and advanced practice providers. Mean scores for absence of safety or misinformation concerns were 4.80, 4.40, and 4.60. Workflow integration averaged 4.50, 3.94, and 4.00; perceived time savings averaged 5.00, 3.89, and 3.60. Conclusions: In this multi-specialty vignette-based evaluation, OncoBrain generated oncology treatment plans judged guideline-concordant, clinically acceptable, and easy to supervise. These findings support the potential of a carefully engineered AI reasoning platform to assist oncology treatment planning and justify prospective real-world evaluation in community settings.
Authors:Christopher Koch, Joshua Andreas Wellbrock
Abstract:
Agentic AI systems plan, use tools, maintain state, and act across multi-step workflows with external effects, meaning trustworthy deployment can no longer be judged by task completion alone. The current literature remains fragmented across benchmark-centered evaluation, standards-based governance, orchestration architectures, and runtime assurance mechanisms. This paper contributes a bounded evidence synthesis across a manually coded corpus of twenty-four recent sources. The core finding is a governance-to-action closure gap: evaluation tells us whether outcomes were good, governance defines what should be allowed, but neither identifies where obligations bind to concrete actions or how compliance can later be proven. To close that gap, the paper introduces three linked artifacts: (1) a four-layer framework spanning evaluation, governance, orchestration, and assurance; (2) an ODTA runtime-placement test based on observability, decidability, timeliness, and attestability; and (3) a minimum action-evidence bundle for state-changing actions. Across sources, evaluation papers identify safety, robustness, and trajectory-level measurement as open gaps; governance frameworks define obligations but omit execution-time control logic; orchestration research positions the control plane as the locus of policy mediation, identity, and telemetry; runtime-governance work shows path-dependent behavior cannot be governed through prompts or static permissions alone; and action-safety studies show text alignment does not reliably transfer to tool actions. A worked enterprise procurement-agent scenario illustrates how these artifacts consolidate existing evidence without introducing new experimental data.
Authors:Diego Gomez-Zara, Hernán Valdivieso, Jorge Pérez, Denis Parra, Sebastián Valenzuela
Abstract:
Codebooks are central to framing research, providing theoretically grounded criteria for analyzing news content. While traditionally codebooks are built from theoretical frameworks and researchers' knowledge, applying these codebooks to large news corpora often exposes ambiguities, borderline cases, and underspecified rules that are difficult to resolve through theory alone. Moreover, news corpora evolve over time and differ across cultures, necessitating that researchers revisit the theoretical frameworks underlying these codebooks. In this article, we propose a workflow that uses Large Language Models (LLMs) to augment the creation and refinement of framing codebooks by combining theoretical frameworks with data-driven exploration. Rather than treating LLMs as automated classifiers, this approach positions them as analytic collaborators that help externalize decision rules, surface latent dimensions, and support iterative revisions of codebooks through dialogues between researchers and their data. We illustrate this workflow using a dataset of Latin American news coverage, demonstrating how the application of LLMs' capabilities has led to the surfacing of latent patterns, the generation of frame distinctions, and the adaptation of frameworks to new contexts. This method provides an LLM-assisted strategy that supports methodology creativity while preserving researchers' interpretative authority.
Authors:Jiaxun Cao, Yu Dong, Chunxi Zhan, Rithvik Neti, Sai Teja Peddinti, Pardis Emami-Naeini
Abstract:
Users increasingly rely on consumer-facing generative AI (GenAI) for tasks ranging from everyday needs to sensitive use cases. Yet, it remains unclear whether and how existing security and privacy (S&P) communications in GenAI tools shape users' adoption decisions and subsequent experiences. Understanding how users seek, interpret, and evaluate S&P information is critical for designing usable transparency that users can trust and act on. We conducted semi-structured interviews and design sessions with 21 U.S. GenAI users. We find that available S&P information rarely drove initial adoption in practice, as participants often perceived it as incomplete, ineffective, or lacking credibility. Instead, they relied on rough proxies, such as popularity, to infer S&P practices. After adoption, uncertainty about S&P practices constrained participants' willingness to use GenAI tools, particularly in high-stakes contexts, and, in some cases, contributed to discontinued use. Participants therefore called for transparency that supports decision-making and use, including trustworthy information (e.g., independent evaluations) and usable interfaces (e.g., on-demand disclosure). We synthesize participants' desired design practices into five dimensions to facilitate systematic future investigation into best practices. We conclude with recommendations for researchers, designers, and policymakers to improve S&P transparency in consumer-facing GenAI.
Authors:Abdulaziz Aldegheishem, Nabil Alrajeh, Lorena Parra, Oscar Romero, Jaime Lloret
Abstract:
The ambulance service is the main transport for diseased or injured people which suffers the same acceleration forces as regular vehicles. These accelerations, caused by the movement of the vehicle, impact the performance of tasks executed by sanitary personnel, which can affect patient survival or recovery time. In this paper, we have trained, validated, and tested a system to assess driving in ambulance services. The proposed system is composed of a sensor node which measures the vehicle vibrations using an accelerometer. It also includes a GPS sensor, a battery, a display, and a speaker. When two possible routes reach the same destination point, the system compares the two routes based on previously classified data and calculates an index and a score. Thus, the index balances the possible routes in terms of time to reach the destination and the vibrations suffered in the patient cabin to recommend the route that minimises those vibrations. Three datasets are used to train, validate, and test the system. Based on an Artificial Neural network (ANN), the classification model is trained with tagged data classified as low, medium, and high vibrations, and 97% accuracy is achieved. Then, the obtained model is validated using data from three routes of another region. Finally, the system is tested in two new scenarios with two possible routes to reach the destination. The results indicate that the route with less vibration is preferred when there are low time differences (less than 6%) between the two possible routes. Nonetheless, with the current weighting factors, the shortest route is preferred when time differences between routes are higher than 20%, regardless of the higher vibrations in the shortest route.
Authors:Ting-Chen Hsu, Zheyuan Zhang, Ziyi Chen, Yuwen Liu, Yanjia Liu
Abstract:
Addressing the issues of dullness, low compliance, and lack of appeal in current digital mental health education and serious games for students and adolescents, this study proposes a novel, experience-centered framework for serious game design: the Therapeutic Procedural Rhetoric and Mechanism Mapping Framework (TPR-MMF). Based on this framework, a side-scrolling serious game prototype, "World + You - World," was developed. This study compared the effectiveness of TPR-MMF-based games with traditional explicit educational serious games through a small-sample randomized controlled trial (N=28). The results of the Intrinsic Motivation Inventory (IMI) showed that the experimental group (playing "World + You - World") significantly outperformed the control group in four aspects. Furthermore, qualitative survey results indicated that players could perceive the psychological metaphors within the game mechanics and spontaneously resonated with real-life experiences. This study provides a highly engaging new development paradigm for gamified mental health education for students and adolescents.
Authors:Tobias Pellkvist, Katie Seaborn, Miu Kojima
Abstract:
Deceptive patterns, dark patterns, and manipulative user interfaces (UI) are a widely used design strategy that manipulates users to act against their own interests in pursuit of shareholder aims. These patterns may particularly affect people with less education, visual impairments, and older adults. Yet, access is a critical feature of the user experience (UX), development standards, and law. We considered whether and how the Web Content Accessibility Guidelines (WCAG) and related legislation, like the European Accessibility Act (EAA), could act as a tool against deceptive patterns. We used heuristic evaluation to analyze whether and how deceptive patterns violate or conform to these guidelines and legal statutes. Although statistical analysis revealed no significant differences by pattern type, we identified three patterns implicated by the WCAG guidelines: Countdown Timer, Auto-Play, and Hidden Information. We offer this approach as one tool in the fight against UI-based deception and in support of inclusive design.
Authors:Shuangquan Feng, Laura Fleig, Ruisen Tu, Philip Chi, Edmund Bu, Melinda Ozel, Junhua Ma, Teng Fei, Virginia R. de Sa
Abstract:
Large language models (LLMs) enable increasingly capable tutoring-style conversational agents, yet effective tutoring requires sensitivity to learners' affective and cognitive states beyond text alone. Facial expressions provide immediate and practical cues of confusion, frustration, or engagement, but remain underexplored in LLM-driven tutoring. We investigate whether facial-expression-aware signals can improve empathetic tutoring responses through prompt-level integration, without end-to-end retraining. We build a scalable simulated tutoring environment where a student agent exhibits diverse facial behaviors from a large unlabeled facial expression video dataset, and compare four tutor variants: a text-only LLM baseline, a multimodal baseline using a random facial frame, and two Action Unit estimation model (AUM)-based methods that either inject textual AU descriptions or select a peak-expression frame for visual grounding. Across 960 multi-turn conversations spanning three tutor backbones (GPT-5.1, Claude Ops 4.5, and Gemini 2.5 Pro), we evaluate targeted pairwise comparisons with five human raters and an exhaustive AI evaluator. AU-based conditioning consistently improves empathetic responsiveness to facial expressions across all tutor backbones, while AUM-guided peak-frame selection outperforms random-frame visual input. Textual AU abstraction and peak-frame visual injection show model-dependent advantages. Control analyses show that this improvement does not come at the expense of worse pedagogical clarity or responsiveness to textual cues. Finally, AI-human agreement is highest on facial-expression-grounded empathy, supporting scalable AI evaluation for this dimension. Overall, our results show that lightweight, structured facial expression representations can meaningfully enhance empathy in LLM-based tutoring systems with minimal overhead.
Authors:Tyler Chang, Jina Huh-Yoo, Afsaneh Razi
Abstract:
Human-AI romantic relationships are increasingly common, yet little is understood about how public discourse around them emerges and shifts over time. Prior research has examined user experiences and ethical concerns, but lacks longitudinal analyses of user-initiated public discussions. We address this gap by analyzing a high-precision dataset of 3,383 self-disclosed romantic companion AI posts from Reddit (2017-2025), using topic modeling and temporal statistical analysis to identify dominant themes and their evolution over time. We find significant topic drift, with discussions moving away from positive intimate relationships toward platform governance, technical issues, and real-world consequences. These shifts highlight a transition in how human-AI romance is framed-moving from private experiences to technical mediation and regulation-with implications for the design and governance of companion AI systems.
Authors:Adnan Hoq, Tim Weninger
Abstract:
Large language models (LLMs) are increasingly used to simulate human responses in behavioral research, yet it remains unclear when LLM-generated data support the same experimental inferences as human data. We evaluate this by directly comparing off-the-shelf LLM-generated responses with human responses from a canonical survey experiment on accuracy perception. Each human observation is converted into a structured prompt, and models generate a single 0--10 outcome variable without task-specific training; identical statistical analyses are applied to human and synthetic responses. We find that LLMs reproduce several directional effects observed in humans, but effect magnitudes and moderation patterns vary across models. Off-the-shelf LLMs therefore capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects, clarifying when LLM-generated data can function as behavioral surrogates.
Authors:Oscar Romero, Aika Silveira Miura, Lorena Parra, Jaime Lloret
Abstract:
Mobility in urban and interurban areas, mainly by cars, is a day-to-day activity of many people. However, some of its main drawbacks are traffic jams and accidents. Newly made vehicles have pre-installed driving evaluation systems, which can prevent accidents. However, most cars on our roads do not have driver assessment systems. In this paper, we propose an approach for recognising driving styles and enabling drivers to reach safer and more efficient driving. The system consists of two physical sensors connected to a device node with a display and a speaker. An artificial neural network (ANN) is included in the node, which analyses the data from the sensors, and then recognises the driving style. When an abnormal driving pattern is detected, the speaker will play a warning message. The prototype was assembled and tested using an interurban road, in particular on a conventional road with three driving styles. The gathered data were used to train and validate the ANN. Results, in terms of accuracy, indicate that better accuracy is obtained when the velocity, position (latitude and longitude), time, and turning speed for the 3-axis are used, offering an average accuracy of 83%. If the classification is performed considering just two driving styles, normal and aggressive, then the accuracy reaches 92%. When the geo-information and time data are included, the main novelty of this paper, the classification accuracy is improved by 13%.
Authors:Chase McDonald, Cleotilde Gonzalez
Abstract:
The increasing integration of artificial intelligence (AI) in everyday life brings with it new challenges and questions for regarding how humans interact with autonomous agents. Multi-agent experiments, where humans and AI act together, can offer important opportunities to study social decision making, but there is a lack of accessible tooling available to researchers to run such experiments. We introduce two tools designed to reduce these barriers. The first, CoGrid, is a multi-agent grid-based simulation library with dual NumPy and JAX backends. The second, Multi-User Gymnasium (MUG), translates such simulation environments directly into interactive web-based experiments. MUG supports interactions with arbitrary numbers of humans and AI, utilizing either server-authoritative or peer-to-peer networking with rollback netcode to account for latency. Together, these tools can enable researchers to deploy studies of human-AI interaction, facilitating inquiry into core questions of psychology, cognition, and decision making and their relationship to human-AI interaction. Both tools are open source and available to the broader research community. Documentation and source code is available at {cogrid, multi-user-gymnasium}.readthedocs.io. This paper details the functionality of these tools and presents several case studies to illustrate their utility in human-AI multi-agent experimentation.
Authors:Jorge Acosta-Hernández, Alexander Lex, Tingying He
Abstract:
We present the first empirical evaluation of techniques for encoding distributions of quantitative edge values within adjacency matrices. In many real-world networks, edges represent not a single value but a set of measurements. While adjacency matrices preserve structural clarity, their compact cells limit the simultaneous display of multiple values. To address this, we explore edge encodings that represent distributions by two values: a measure of central tendency (mean, median, mode) and a measure of dispersion (standard deviation, variance, IQR). We select four possible encodings for evaluation that prior work has suggested are suitable for the limited space available in matrices: a bivariate color palette, embedded bar charts, and two overlaid-mark designs mapping the primary attribute to color and the secondary attribute to area or angle. In a preregistered crowdsourced study with 156 participants, we assessed performance of these encodings across eight analytical tasks and collected readability and aesthetic ratings. Results reveal clear performance regimes: area-based overlaid marks and bar charts achieved the highest overall performance; angle-based marks show moderate but less stable performance,and bivariate color consistently underperforms these alternatives. These findings clarify how visual channels behave under strict constraints and delineate the strengths and limitations of key design choices for multivariate edge visualization.
Authors:Pan Hao, Rishi Selvakumaran, Jacob Sun, Qianwen Wang
Abstract:
Complex visual interfaces are powerful yet have a steep learning curve, as users must navigate feature-rich visual interfaces while reasoning about domain-specific operations. Existing approaches either deliver assistance through a separate chat-based interaction, or require substantial application-specific engineering to build support natively into each interface. To address the gaps, we propose in-situ assistance: a mode of support delivered directly within any live web interface through lightweight, browser-level interventions on the Document Object Model (DOM), without rebuilding the application or modifying its underlying logic. We contribute a design space and a computational pipeline for DOM-mediated in-situ assistance, characterizing how GUI agents can insert, mutate, or recompose web elements to make the interface easier for users to understand, use, and navigate. We instantiate in-situ assistance in DOMSteer, a Chrome extension that interprets a user's help request and live interface context, grounds it to relevant UI elements, and executes reversible DOM manipulations directly on the live page to deliver assistance, including contextual tooltips, control highlighting, layout reorganization. Quantitative evaluations on two complex visual interfaces show that DOMSteer delivers reliable and efficient in-situ assistance. Use cases and a comparative user study with baseline ChatGPTAtlas demonstrate the usability and effectiveness of DOMSteer. Altogether, these findings point to a broader role for GUI agents: not just assisting from the sidelines, but actively reconfiguring live interfaces to support users in the moment.
Authors:Li Liu, Jiaming Qu, Marc Jowell Bagaoisan, David T. Lee, Leilani H. Gilpin
Abstract:
Most existing assistive navigation tools focus on providing real-time guidance for Blind and Low-Vision (BLV) people, but few support building a holistic spatial understanding of unfamiliar environments before travel. Such cognitive map construction (e.g., knowing that a fountain is south of a tower and west of a hotel) is important for pre-travel planning, yet remains underexplored in prior work. To address this gap, we present Touching Space, an end-to-end system that retrieves map data for a target place and loads it into a frontend interface for exploration. The system combines haptic and audio feedback: users explore spatial layouts through touch and ask spoken questions to a conversational agent during exploration. Touching Space contributes a conversational interface that supports BLV users in building cognitive maps on commodity hardware.
Authors:Bernardo B. P. Medeiros, Malvika Jadhav, Allison Lu, Tadayoshi Kohno, Vincent Bindschaedler, Kevin R. B. Butler
Abstract:
Many malicious actors responsible for disseminating synthetic non-consensual intimate imagery (SNCII) operate within internet forums to exchange resources, strategies, and generated content across multiple platforms. Technically-sophisticated actors gravitate toward certain communities (e.g., 4chan), while lower-sophistication end-users are more active on others (e.g., Reddit). To characterize key stakeholders in the broader ecosystem, we perform an integrated analysis of multiple communities, analyzing 282,154 4chan comments and 78,308 Reddit submissions spanning 165 days between June and November 2025 to characterize involved actors, actions, and resources. We find: (a) that users with differing levels of technical sophistication employ and share a wide range of primary resources facilitating SNCII content creation as well as numerous secondary resources facilitating dissemination; and (b) that knowledge transfer between experts and newcomers facilitates propagation of these illicit resources. Based on our empirical analysis, we identify gaps in existing SNCII regulatory infrastructure and synthesize several critical intervention points for bolstering deterrence.
Authors:Quentin Romero Lauro, Aakash Gautam, Yasmine Kotturi
Abstract:
Entrepreneurs in resource-constrained communities often lack time and support to translate ideas into actionable business plans. While generative AI promises assistance, most systems assume high digital literacy and overlook community infrastructures that shape adoption. We report on the community-centered design and deployment of BizChat, an AI-powered business planning tool, introduced across four workshops at a feminist makerspace in Pittsburgh. Through log data (N=30) and interviews (N=10), we examine how entrepreneurs build resilience through collective AI literacy development-encompassing adoption, adaptation, and refusal of AI. Our findings reveal that while BizChat lowered barriers to accessing capital by translating ideas into "business language," this ease raised questions about whether instant AI outputs undermine sensemaking essential to planning. We show how peer support helped entrepreneurs navigate this tension. We contribute design implications, including productive friction, communal scaffolds, and co-optability, for strengthening resilience amid technological change.
Authors:Yi-Fan Cao, Kento Shigyo, Yitong Gu, Xiyuan Wang, Weijia Liu, Yang Wang, David Gotz, Zhilan Zhou, Huamin Qu
Abstract:
Large Language Models (LLMs) have advanced self-learning tools, enabling more personalized interactions. However, learners struggle to engage in meaningful dialogue and process complex information. To alleviate this, we incorporate epistemological frameworks within an LLM-based approach to self-learning, reducing the cognitive load on learners and fostering deeper engagement and holistic understanding. Through a formative study (N=26), we identified epistemological differences in self-learner interaction patterns. Building upon these findings, we present \textit{CausaDisco}, a dialogue-based interactive system that integrates Aristotle's \textit{Four Causes} framework into LLM prompts to enhance cognitive support for self-learning. This approach guides learners' self-learning journeys by automatically generating coherent and contextually appropriate follow-up questions. A controlled study (N=36) demonstrated that, compared to baseline, \textit{CausaDisco} fostered more engaging interactions, inspired sophisticated exploration, and facilitated multifaceted perspectives. This research contributes to HCI by expanding the understanding of LLMs as educational agents and providing design implications for this emerging class of tools.
Authors:Ting-Chen Hsu, Wenran Chen, Jiangxu Lin, Fei Qin, Zheyuan Zhang
Abstract:
This study examines how large language model-driven non-player characters (LLM-NPCs) affect players' cognitive load and gaming experience, with a particular focus on the underlying psychological mechanisms, differences across task scenarios, and the role of individual traits. Conducting a randomized between-subject experiment (N=130) in a self-developed game prototype "Campus Culture Week", we compared player interactions with LLM-NPCs and traditional pre-scripted NPCs across multiple interactive modules. The results showed that LLM-NPCs significantly increased players' cognitive load (p < .001), an effect mediated by factors such as expressive effort and response uncertainty. However, LLM-NPCs did not yield a statistically significant improvement in overall gaming experience (p = .195); while they positively influenced players' perceived autonomy, they exerted a negative influence on system usability and trust. The effects of LLM-NPCs also significantly varied across task scenarios (p < .001), with stronger increases in cognitive load in more open-ended modules such as content creation and relationship building. The influence of individual differences was generally limited, although the personality traits of extraversion (p = .031) and neuroticism (p = .047) demonstrated some predictive power regarding cognitive load. This study provides empirical evidence for understanding the "double-edged sword" effect of LLM-NPCs on player experience, and highlight the importance of scenario-sensitive and user-sensitive design in intelligent NPC systems.
Authors:Alexandra Yakovleva, Henrik Pärssinen, Harri Valpola, Juho Kannala, Alexander Ilin
Abstract:
Recent advances in vision-language models (VLMs) have sparked growing interest in using them to automate web tasks, yet their feasibility as independent agents that reason and act purely from visual input remains underexplored. We investigate this setting using Qwen2.5-VL-32B, one of the strongest open-source VLMs available, and focus on improving its reliability in web-based control. Through initial experimentation, we observe three key challenges: (i) inaccurate localization of target elements, the cursor, and their relative positions, (ii) sensitivity to instruction phrasing, and (iii) an overoptimistic bias toward its own actions, often assuming they succeed rather than analyzing their actual outcomes. To address these issues, we fine-tune Qwen2.5-VL-32B for a basic web interaction task: moving the mouse and clicking on a page element described in natural language. Our training pipeline consists of two stages: (1) teaching the model to determine whether the cursor already hovers over the target element or whether movement is required, and (2) training it to execute a single command (a mouse move or a mouse click) at a time, verifying the resulting state of the environment before planning the next action. Evaluated on a custom benchmark of single-click web tasks, our approach increases success rates from 86% to 94% under the most challenging setting.
Authors:Fatma Betül Güreş, Tanya Nazaretsky, Seyed Parsa Neshaei, Tanja Käser
Abstract:
Supporting students in developing diagnostic reasoning is a key challenge across educational domains. Novices often face cognitive biases such as premature closure and over-reliance on heuristics, and they struggle to transfer diagnostic strategies to new cases. Scenario-based learning (SBL) enhanced by Learning Analytics (LA) and large language models (LLM) offers a promising approach by combining realistic case experiences with personalized scaffolding. Yet, how different scaffolding approaches shape reasoning processes remains insufficiently explored. This study introduces PharmaSim Switch, an SBL environment for pharmacy technician training, extended with an LA- and LLM-powered pharmacist agent that implements pedagogical conversations rooted in two theory-driven scaffolding approaches: \emph{structuring} and \emph{problematizing}, as well as a student learning trajectory. In a between-groups experiment, 63 vocational students completed a learning scenario, a near-transfer scenario, and a far-transfer scenario under one of the two scaffolding conditions. Results indicate that both scaffolding approaches were effective in supporting the use of diagnostic strategies. Performance outcomes were primarily influenced by scenario complexity rather than students' prior knowledge or the scaffolding approach used. The structuring approach was associated with more accurate Active and Interactive participation, whereas problematizing elicited more Constructive engagement. These findings underscore the value of combining scaffolding approaches when designing LA- and LLM-based systems to effectively foster diagnostic reasoning.
Authors:Karen Joy, Chris Dodge, Harsh Chavda, Alyssa Sheehan
Abstract:
Our study investigates the relationship between accessibility symbols and emerging technologies in supporting disability disclosure. We conducted twenty three remote design creation sessions with semi structured interviews to examine participants awareness of existing symbols, how they use symbols across online and offline contexts, and barriers to adoption and interpretation. Through participant sketching and future oriented storyboard probes, participants proposed ways to integrate symbols into wearable devices, mobile interfaces, and portable tools, emphasizing customizable and context sensitive disclosure. Our findings suggest symbols are most effective when paired with technologies that provide user control over visibility and optional pathways for explanation, helping reduce misinterpretation while supporting agency in disclosure moments. By reimagining symbol based assistance as part of a broader disclosure system where meaning depends on the symbol, its carrier, and context, this work informs more inclusive accessibility supports across diverse settings.
Authors:Nicolás E. Díaz Ferreyra, Monika Swetha Gurupathi, Zadia Codabux, Nalin Arachchilage, Riccardo Scandariato
Abstract:
Generative Artificial Intelligence (GenAI) has become a central component of many development tools (e.g., GitHub Copilot) that support software practitioners across multiple programming tasks, including code completion, documentation, and bug detection. However, current research has identified significant limitations and open issues in GenAI, including reliability, non-determinism, bias, and copyright infringement. While prior work has primarily focused on assessing the technical performance of these technologies for code generation, less attention has been paid to emerging concerns of software developers, particularly in the security realm. OBJECTIVE: This work explores security concerns regarding the use of GenAI-based coding assistants by analyzing challenges voiced by developers and software enthusiasts in public online forums. METHOD: We retrieved posts, comments, and discussion threads addressing security issues in GitHub Copilot from three popular platforms, namely Stack Overflow, Reddit, and Hacker News. These discussions were clustered using BERTopic and then synthesized using thematic analysis to identify distinct categories of security concerns. RESULTS: Four major concern areas were identified, including potential data leakage, code licensing, adversarial attacks (e.g., prompt injection), and insecure code suggestions, underscoring critical reflections on the limitations and trade-offs of GenAI in software engineering. IMPLICATIONS: Our findings contribute to a broader understanding of how developers perceive and engage with GenAI-based coding assistants, while highlighting key areas for improving their built-in security features.
Authors:Alison Crosby, MJ Johns, Eunsol Sol Choi, Tejas Polu, Katherine Isbister, Sri Kurniawan
Abstract:
This paper presents a pilot study exploring the effects of an olfactory stimulus (smoke) for a Virtual Reality game designed to support wildfire evacuation preparedness. Participants (N=18) were split evenly into either a smoke or a control condition, and both completed the same evacuation task. Post-task surveys assessed the participants' perceived preparedness and overall experience. Initial findings suggest participants in the smoke condition reported significantly higher immersion compared to those in the control condition. Across both groups, participants expressed an increased sense of preparedness for real-world wildfire evacuations following the experience.
Authors:Tuan-Ting Huang, Janet Yi-Ching Huang, Stephan Wensveen
Abstract:
Generative AI's emphasis on automation and efficiency challenges design education, where learning is grounded in exploration, reflection, and responsibility. This work introduces AI Craftsmanship, a value-oriented framework drawing on craftsmanship traditions that emphasize risk, rhythm, and care as central to learning through making. Through a Research through Design (RtD) approach, we designed an AI-integrated creative coding tool embedding these values into interactions and interface rather than outcomes. The tool supports designers learning generative pattern-making with p5.js by constraining AI, encouraging iterative experimentation, and foregrounding reflection. We studied the tool with five design practitioners through one-hour sessions and semi-structured interviews. Findings show craft values manifest unevenly: risk and rhythm shape early sense-making, while care emerges through reflective practices. Emergent values -- such as aesthetic judgment and confidence -- also motivated learning. AI Craftsmanship mediates values, tools, and materials, offering a value-driven perspective on designing AI systems for reflective, responsible, craft-informed learning in design education.
Authors:Zhiwei Li, Carl Kesselman
Abstract:
Machine learning (ML) reproducibility is often framed as a problem of incomplete artifact recording. This framing leads to systems that prioritize capturing datasets, code, configurations, and execution environments. However, in collaborative and interdisciplinary ML projects, reproducibility failures often arise not only from missing artifacts but from difficulties in interpreting prior work, aligning evolving components, and reconstructing experimental intent over time. Drawing on a 19-month deployment of a data-centric ML management system in a clinical research project, we identify recurring interactional breakdowns that persist despite comprehensive structural traceability. Based on these findings, we propose a two-layer socio-technical ML management system combining lifecycle-aware artifact infrastructure with an interactional layer designed to mediate coordination, explanation, and shared understanding. We discuss how an AI-mediated semantic interface reframes reproducibility as an ongoing socio-technical accomplishment rather than a static property of recorded traces, and outline implications for human-centered ML infrastructure design.
Authors:Ying-Yu Chen, Yang Hong, Yan-Rong Chen, Yi-Chieh Lee
Abstract:
This study investigates how Southeast Asian (SEA) immigrant mothers in Taiwan participate in their children's home-based learning. Drawing on semi-structured interviews and diary studies, we explore how these mothers navigate sociocultural constraints while fostering engagement and transmitting cultural values. Despite facing diminished agency and structural marginalization, mothers engage creatively in their children's everyday learning interactions. Guided by a justice-oriented lens, we identify various harms and propose design implications for socio-technical systems that center recognition, reciprocity, and accountability in parent-child learning at the individual, familial, and societal levels. Our contribution lies in foregrounding the role of intersectional identity in parent-child learning and proposing justice-oriented design directions that support the flourishing of immigrant mothers within socio-technical systems.
Authors:Felicia Fang-Yi Tan, Moritz A. Messerschmidt, Wen Yin, Oded Nov
Abstract:
Responsiveness in large language model (LLM) applications is widely assumed to be critical, yet the impact of latency on user behavior and perception of output quality has not been systematically explored. We report a controlled experiment varying time-to-first-token latency (2, 9, 20 seconds) across two taxonomy-driven knowledge task types (Creation and Advice). Log analyses reveal that user interaction behaviors were robust to latency, yet varied by task type: Creation tasks elicited more frequent prompting than Advice tasks. In contrast, participants who experienced 2-second latencies rated the LLM's outputs less thoughtful and useful than those who experienced 9- or 20-second latencies. Participants attributed delays to AI deliberation, though long waits occasionally shifted this interpretation toward frustration or concerns about reliability. Overall, this work demonstrates that latency is not simply a cost to reduce but a tunable design variable with ethical implications. We offer design strategies for enhancing human-LLM interaction.
Authors:Amna Shahnawaz, Ayesha Shafique, Ding Wang, Maryam Mustafa
Abstract:
Menstrual health education (MHE) in Pakistan is constrained by cultural taboos and inadequate formal curricula, leaving women with few trusted resources to lean on. In response to these challenges, we introduce a WhatsApp-based chatbot powered by a large language model (LLM) and Retrieval Augmented Generation (RAG), co-designed with Pakistani college women. Workshops (N=30) revealed key design requirements -- support for Roman Urdu, use of subsidized platforms, and an expert -- curated knowledge base. We then deployed the chatbot with 13 participants for two weeks (403 messages and interviews). Women used it to challenge cultural taboos, legitimize health concerns often dismissed as normal, and build reproductive health knowledge through iterative questioning. Yet, interactions also exposed tensions: reliance on cultural explanatory models, questions of trust and validation, and gendered persona of the chatbot itself. We contribute empirical insights, a stigma-aware design framework for culturally sensitive conversational AI, and a methodological lens foregrounding expert validation in intimate health domains.
Authors:Steeven Villa, Abdallah El Ali
Abstract:
The Augmented Human vision broadly seeks to improve or expand baseline human functioning through the restoration or extension of physical, intellectual, and social capabilities. However, given the rapid pace of technology development, we ask: what exactly does Augmented Human research involve, what are its core themes, and how has the Augmented Human(s) conference series evolved over time? To answer this, we conducted a scientometric analysis on the past 15 years of the Augmented Human(s) conference (N=735 paper), focusing on: geographical aspects, submissions and citation timelines, author frequency and popularity, and topic modeling. We find that: (a) Number of papers in the conference exhibit a bimodal distribution, peaking in 2015 and 2025, but showing periods of stagnant growth; (b) key topics over time include Haptics, Wearable Sensing, Vision & Eye Tracking, Embodied Interaction, and Sports / Motion; (c) some seminal papers on AH are not published in AH(s), but rather at related venues (e.g., CHI); (d) the conference has an active Japanese HCI community despite its historical Eurocentric location dominance. We contribute a closer look at the trajectory of the AH(s) field, and raise considerations of definitional and research scope ambiguities given the core problems/enhancements the field seeks to address.
Authors:Nathanael Jo, Manish Raghavan
Abstract:
Generative AI is quickly becoming an integral part of people's everyday workflows. Early evidence has shown that while generative AI can increase individual-level productivity, it does so at the cost of collective diversity, potentially narrowing the set of ideas and perspectives produced. Our research stands in contrast to this concern: through a pre-registered randomized control trial, we show that incentives mediate AI's homogenizing force in a creative writing task where participants can use AI interactively. Participants rewarded for originality relative to peers produce collectively more diverse writing than those rewarded for quality alone. This divergence is driven not by abandoning AI, but by how participants use it: those incentivized for originality incorporate fewer AI suggestions verbatim, relying on the model more selectively for brainstorming, proofreading, and targeted edits. Our results reveal that the effects of generative AI depend not only on the technology itself, but also the behavioral strategies and incentive structures surrounding its use.
Authors:Zeyang Huang, Angelos Chatzimparmpas, Thomas Höllt, Takanori Fujiwara
Abstract:
Dimensionality reduction (DR) is characterized by two longstanding trade-offs. First, there is a global-local preservation tension: methods such as t-SNE and UMAP prioritize local neighborhood preservation, yet may distort global manifold structure, while methods such as Laplacian Eigenmaps preserve global geometry but often yield limited local separation. Second, there is a gap between expressiveness and analytical transparency: many nonlinear DR methods produce embeddings without an explicit connection to the underlying high-dimensional structure, limiting insight into the embedding process. In this paper, we introduce a spectral framework for nonlinear DR that addresses these challenges. Our approach embeds high-dimensional data using a spectral basis combined with cross-entropy optimization, enabling multi-scale representations that bridge global and local structure. Leveraging linear spectral decomposition, the framework further supports analysis of embeddings through a graph-frequency perspective, enabling examination of how spectral modes influence the resulting embedding. We complement this analysis with glyph-based scatterplot augmentations for visual exploration. Quantitative evaluations and case studies demonstrate that our framework improves manifold continuity while enabling deeper analysis of embedding structure through spectral mode contributions.
Authors:Sheng Long, Remco Chang, Eugene Wu, Alex Kale, Matthew Kay
Abstract:
Prior work on perceptual effectiveness has decomposed visualizations into smaller common units (e.g., channels such as angle, position, and length) to establish rankings. While useful, these decompositions lack the computational structure to predict performance for new visualization $\times$ task combinations, requiring new experiments for each. We propose an alternative unit of analysis: operationalizing quantitative visualization interpretation as sequences of composable visual decoding operators. Using probability density function (PDF) and cumulative distribution function (CDF) charts, we examine how chart-specific tasks can be decomposed into reusable, chart-agnostic perceptual operations and characterize their error profiles through hierarchical Bayesian modeling. We then test generalizability by composing learned operators to predict performance on a structurally different task: Moritz et al.'s [35] scatterplot mean-estimation experiment, where the chart type, chart dimensions, and analytic goal all differ from the learning conditions. With a pre-registered analysis plan, we compose operators under six candidate strategies and evaluate each against empirical data with no parameters fit to the response data. One strategy captures both bias and variance of observed responses; five alternatives fail in distinguishable ways. We argue that this decoding-operator-oriented approach to empirical visualization research and theory-building lays the groundwork for generative models that can predict a distribution of likely interpretations under different viewing conditions, new chart types, and new tasks. Free copy of this paper and supplemental materials: https://osf.io/prtfq; experiment interface: https://gleaming-dolphin-799fda.netlify.app/vis-decode-slider.
Authors:Griffin Pitts, Neha Rani, Weedguet Mildort
Abstract:
As generative AI systems are integrated into educational settings, students often encounter AI-generated output while working through learning tasks, either by requesting help or through integrated tools. Trust in AI can influence how students interpret and use that output, including whether they evaluate it critically or exhibit overreliance. We investigate how students' trust relates to their appropriate reliance on an AI assistant during programming problem-solving tasks, and whether this relationship differs by learner characteristics. With 432 undergraduate participants, students' completed Python output-prediction problems while receiving recommendations and explanations from an AI chatbot, including accurate and intentionally misleading suggestions. We operationalize reliance behaviorally as the extent to which students' responses reflected appropriate use of the AI assistant's suggestions, accepting them when they were correct and rejecting them when they were incorrect. Pre- and post-task surveys assessed trust in the assistant, AI literacy, need for cognition, programming self-efficacy, and programming literacy. Results showed a non-linear relationship in which higher trust was associated with lower appropriate reliance, suggesting weaker discrimination between correct and incorrect recommendations. This relationship was significantly moderated by students' AI literacy and need for cognition. These findings highlight the need for future work on instructional and system supports that encourage more reflective evaluation of AI assistance during problem-solving.
Authors:Julian Berger, Pantelis P. Analytis, Ville Satopää, Ralf H. J. M. Kurvers
Abstract:
Artificial intelligence (AI) is broadly deployed as an advisor to human decision-makers: AI recommends a decision and a human accepts or rejects the advice. This approach, however, has several limitations: People frequently ignore accurate advice and rely too much on inaccurate advice, and their decision-making skills may deteriorate over time. Here, we compare the AI-as-advisor approach to the hybrid confirmation tree (HCT), an alternative strategy that preserves the independence of human and AI judgments. The HCT elicits a human judgment and an AI judgment independently of each other. If they agree, that decision is accepted. If not, a second human breaks the tie. For the comparison, we used 10 datasets from various domains, including medical diagnostics and misinformation discernment, and a subset of four datasets in which AI also explained its decision. The HCT outperformed the AI-as-advisor approach in all datasets. The HCT also performed better in almost all cases in which AI offered an explanation of its judgment. Using signal detection theory to interpret these results, we find that the HCT outperforms the AI-as-advisor approach because people cannot discriminate well enough between correct and incorrect AI advice. Overall, the HCT is a robust, accurate, and transparent alternative to the AI-as-advisor approach, offering a simple mechanism to tap into the wisdom of hybrid crowds.
Authors:Yue Yang, Matthieu Chabanas, Carrie Reale, Annie Benson, Jason Slagle, Matthew Weinger, Michael Topf, Jie Ying Wu
Abstract:
Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum < 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.
Authors:Fares Fawzi, Seyed Parsa Neshaei, Marta Knezevic, Tanya Nazaretsky, Tanja Käser
Abstract:
Formative feedback is central to effective learning, yet providing timely, individualised feedback at scale remains a persistent challenge. While recent work has explored the use of large language models (LLMs) to automate feedback, most existing systems still conceptualise feedback as a static, one-way artifact, offering limited support for interpretation, clarification, or follow-up. In this work, we introduce REFINE, a locally deployable, multi-agent feedback system built on small, open-source LLMs that treats feedback as an interactive process. REFINE combines a pedagogically-grounded feedback generation agent with an LLM-as-a-judge-guided regeneration loop using a human-aligned judge, and a self-reflective tool-calling interactive agent that supports student follow-up questions with context-aware, actionable responses. We evaluate REFINE through controlled experiments and an authentic classroom deployment in an undergraduate computer science course. Automatic evaluations show that judge-guided regeneration significantly improves feedback quality, and that the interactive agent produces efficient, high-quality responses comparable to a state-of-the-art closed-source model. Analysis of real student interactions further reveals distinct engagement patterns and indicates that system-generated feedback systematically steers subsequent student inquiry. Our findings demonstrate the feasibility and effectiveness of multi-agent, tool-augmented feedback systems for scalable, interactive feedback.
Authors:Paulo Vitor S. Silva, Lucas L. Neves, Rafael A. Goiás, Diogo F. C. Silva, Rafael T. Sousa, Arlindo R. Galvão Filho
Abstract:
This demo introduces Focus360, a system designed to enhance user engagement in 360° VR videos by guiding attention to key elements within the scene. Using natural language descriptions, the system identifies important elements and applies a combination of visual effects to guide attention seamlessly. At the demonstration venue, participants can experience a 360° Safari Tour, showcasing the system's ability to improve user focus while maintaining an immersive experience.
Authors:Soufiane Jhilal, Eleonora Pasqua, Caterina Marchesi, Riccardo Corradi, Martina Galletti
Abstract:
Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses as some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. Open-ended feedback of the learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
Authors:Matthias Dold, Volker A. Coenen, Bastian Sajonz, Peter Reinacher, Peter Reinacher, Thomas Prokop, Marco Reisert, Sophia Gimple, Yasin Temel, Marcus L. F. Janssen, Michael Tangermann, Joana Pereira
Abstract:
Decoding motor performance from brain signals offers promising avenues for adaptive deep brain stimulation (aDBS) for Parkinson's disease (PD). In a two-center cohort of 19 PD patients executing a drawing task, we decoded motor performance from electroencephalography (n=15) and, critically for clinical translation, electrocorticography (n=4). Within each session, patients performed the task under DBS on and DBS off. A total of 35 sessions were recorded. Instead of relying on single frequency bands, we derived patient-specific biomarkers using a filterbank-based machine-learning approach. DBS modulated kinematics significantly in 23 sessions. Significant neural decoding of kinematics was possible in 28 of the 35 sessions (average Pearson's $\text{r}= 0.37$). Our results further demonstrate modulation of speed-accuracy trade-offs, with increased drawing speed but reduced accuracy under DBS. Joint evaluation of behavioral and neural decoding outcomes revealed six prototypical scenarios, for which we provide guidance for future aDBS strategies.
Authors:A. Baki Kocaballi, Joseph Kizana, Sharon Stein, Simon Buckingham Shum
Abstract:
Seamless AI presents output as a finished, polished product that users consume rather than shape. This risks design fixation: users anchor on AI suggestions rather than generating their own ideas. We propose Generative Friction, which introduces intentional disruptions to AI output (fragmentation, delay, ambiguity) designed to transform it from finished product into semi-finished material, inviting human contribution rather than passive acceptance. In a qualitative study with six designers, we identified the different ways in which designers appropriated the different types of friction: users mined keywords from broken text, used delays as workspace for independent thought, and solved metaphors as creative puzzles. However, this transformation was not universal, motivating the concept of Friction Disposition, a user's propensity to interpret resistance as invitation rather than obstruction. Grounded in tolerance for ambiguity and pre-existing workflow orientation, Friction Disposition emerged as a potential moderator: high-disposition users treated friction as "liberating," while low-disposition users experienced drag. We contribute the concept of Generative Friction as distinct from Protective Friction, with design implications for AI tools that counter fixation while preserving agency.
Authors:Simon WS Fischer, Hanna Schraffenberger, Serge Thill, Pim Haselager
Abstract:
Many generative AI systems as well as decision-support systems (DSSs) provide operators with predictions or recommendations. Various studies show, however, that people can mistakenly adopt the erroneous results presented by those systems. Hence, it is crucial to promote critical thinking and reflection during interaction. One approach we are focusing on involves encouraging reflection during machine-assisted decision-making by presenting decision-makers with data-driven questions. In this short paper, we provide a brief overview of our work in that regard, namely: 1) the development of a question taxonomy, 2) the development of a prototype in the medical domain and the feedback received from clinicians, 3) a method for generating questions using a large language model, and 4) a proposed scale for measuring cognitive engagement in human-AI decision-making. In doing so, we contribute to the discussion about the design, development, and evaluation of tools for thought, i.e., AI systems that provoke critical thinking and enable novel ways of sense-making.
Authors:Rahul Sharma, Lars Henrich, Larisa Ivanova, Arsalan Karimzadmotallebiazar, Annette Bieniusa, Leo Van Waveren, Sebastian Vollmer
Abstract:
Secondary school students increasingly encounter AI systems whose outputs depend on data quality, evaluation choices and modeling assumptions. To provide accessible entry points to these interconnected concepts, we developed KI-Adventskalender, a free web-based extracurricular initiative with 24 didactically curated, short, guided micro-challenges released daily in December, targeting data-centric competencies and socio-technical themes that shape how data are interpreted in practice. Drawing on two annual iterations, we report aggregate platform traces characterizing participation and task-level engagement. Participation increased substantially in 2025, but early attrition persists. Progression stabilized after midpoint: among users reaching Day 12 in 2025, more than 75% completed the calendar. Competence cluster performance shifted across years; higher revision rates co-occurred with strong pass rates, suggesting sustained engagement. We use these observations to motivate a next-step measurement agenda: tighter task instrumentation, embedded micro-assessments and mixed-method evaluation designs that can distinguish persistence from conceptual uptake, knowledge progression and durable learning outcomes.
Authors:Katie Seaborn, Madeleine Steeds, Ilaria Torre, Martina De Cet, Katie Winkle, Marcus Göransson
Abstract:
The "gender" of intelligent agents, virtual characters, social robots, and other agentic machines has emerged as a fundamental topic in studies of people's interactions with computers. Perceptions of agent gender can help explain user attitudes and behaviours -- from preferences to toxicity to stereotyping -- across a variety of systems and contexts of use. Yet, standards in capturing perceptions of agent gender do not exist. A scoping review was conducted to clarify how agent gender has been operationalized -- labelled, defined, and measured -- as a perceptual variable. One-third of studies manipulated but did not measure agent gender. Norms in operationalizations remain obscure, limiting comprehension of results, congruity in measurement, and comparability for meta-analyses. The dominance of the gender binary model and latent anthropocentrism have placed arbitrary limits on knowledge generation and reified the status quo. We contribute a systematically-developed and theory-driven meta-level framework that offers operational clarity and practical guidance for greater rigour and inclusivity.
Authors:Mario Andres Chavarria, Santiago Price Torrendell, Aude Billard, Samia Hurst, Sébastien Kessler, Michael Stein, Kenji Suzuki, Sophie Weerts, Diego Paez-Granados, Minerva Rivas Velarde
Abstract:
Robotic wheelchairs (RWs) offer significant potential to enhance autonomy and participation for people with mobility impairments, yet many systems have failed to achieve sustained real-world adoption. This narrative literature review examined the extent and quality of end-user involvement in RW design, development, and evaluation over the past decade (2015--2025), assessed against core principles shared by major user-involvement approaches (e.g., user-/human-centered design, participatory/co-design, and inclusive design). The findings indicate that user involvement remains limited and is predominantly concentrated in late-stage evaluation rather than in early requirements definition or iterative co-design. Of the 399 records screened, only 23 studies (about 6%) met the inclusion criteria of verifiable end-user involvement, and many relied on small samples, often around ten participants, with limited justification for sample size selection, proxy users, laboratory-based validation, and non-standardized feedback methods. Research teams were largely engineering-dominated (about 89%) and geographically concentrated in high-income countries. Despite strong evidence that sustained user engagement improves usability and adoption in assistive technology, its systematic implementation in RW research remains rare. Advancing the field requires embedding participatory methodologies throughout the design lifecycle and addressing systemic barriers that constrain meaningful user involvement.
Authors:Ray-Yuan Chung, Jaime Snyder, Zixuan Xu, Daeun Yoo, Athena C. Ortega, Wanda Pratt, Aaron Wightman, Ryan Hutson, Cozumel Pruette, Ari Pollack
Abstract:
In pediatric chronic care, the triadic relationship among patients, caregivers, and healthcare providers introduces unique challenges for youth in managing their conditions. Diverging values, roles, and asymmetrical situational awareness across decision-maker groups often hinder collaboration and affect health outcomes, highlighting the need to support collaborative decision-making. We conducted co-design workshops with 6 youth with chronic kidney disease, 6 caregivers, and 7 healthcare providers to explore how digital technologies can be designed to support collaborative decision-making. Findings identify barriers across all levels of situational awareness, ranging from individual cognitive and emotional constraints, misaligned mental models, to relational conflicts regarding care goals. We propose design implications that support continuous decision-making practice, align mental models, balance caregiver support and youth autonomy development, and surface potential care challenges. This work advances the design of collaborative decision-making technologies that promote shared understanding and empower families in pediatric chronic care.
Authors:Ray-Yuan Chung, Xuhai Xu, Ari Pollack
Abstract:
Large language model based health agents are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. However, most AI systems in healthcare operate in siloed configurations, supporting individual users rather than the multi-stakeholder relationships central to healthcare. Such use can fragment understanding and exacerbate misalignment among patients, caregivers, and clinicians. We reframe AI not as a standalone assistant, but as a collaborator embedded within multi-party care interactions. Through a clinically validated fictional pediatric chronic kidney disease case study, we show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to address these collaboration gaps. We propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.
Authors:Soufiane Jhilal, Martina Galletti
Abstract:
Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.
Authors:Nobuhito Kasahara, Shota Yamanaka, Homei Miyashita
Abstract:
Typical success-rate prediction models for tapping exclude targets near screen edges. However, design constraints often force such placements, and in scrollable user interfaces, any element can move close to the screen edges. In this work, we model how target-edge distance affects touch pointing accuracy. We propose the Skewed Dual Normal Distribution Model, which assumes the tap-coordinate distribution is skewed by a nearby edge. The results showed that as targets approached the edge, the distribution's peak shifted toward the edge, and its tail extended away. In contrast to prior reports, the success rate improved when the target touched the edge, suggesting a strategy of ``tapping the target together with the edge.'' Our model predicts success rates across a wide range of conditions, including edge-adjacent targets. Through three experiments of horizontal, vertical, and 2D pointing, we demonstrated the generalizability and utility of our proposed model.
Authors:Soonho Kwon, Dong Whi Yoo, Younah Kang
Abstract:
This speculative video piece showcases participants interacting with a career counseling AI agent, unaware that the responses were actually derived from the fortunetelling of a mudang (a Korean traditional shaman). Our work captures this deception and documents participants' reactions, showcasing shifts in their initial perceptions of the agent's advice following the reveal. Notably, even after learning that the advice came from a mudang rather than an AI, participants did not change their initial attitudes toward the advice they received. This raises questions about the perceived importance of AI's explainability and accuracy. By juxtaposing scientific and pre-scientific approaches, we aim to provoke discussions on human agency in the age of AI. We argue that, regardless of AI's advancements, we continue to navigate life in fundamentally human ways -- wonderfully messy and uncertain.
Authors:Licol Zeinfeld, Alona Strugatski, Ziva Bar-Dov, Ron Blonder, Shelley Rap, Giora Alexandron
Abstract:
The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for generative AI. The method is based on Differential Item Functioning (DIF) analysis -- traditionally used to detect bias across demographic groups -- together with negative control analysis and item-total correlation discrimination analysis. It is evaluated on responses from human learners and six leading chatbots (ChatGPT-4o \& 5.2, Gemini 1.5 \& 3 Pro, Claude 3.5 \& 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts then analyzed DIF-flagged items to characterize task dimensions associated with chatbot over- or under-performance. Results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.
Authors:Ji Eun Song, Jaeyoun You, Joongseek Lee
Abstract:
A user's ownership perception of virtual objects, such as cloud files, is generally uncertain. Is this valid for streaming platforms featuring accounts designed for sharing (DS)? We observe sharing practices within DS accounts of streaming platforms and identify their ownership characteristics and unexpected complications through two mixed-method studies. Casual and Cost-splitting are the two sharing practices identified. The owner is the sole payer for the account in the former, whereas profile holders split the cost in the latter. We distinguish two types of ownership in each practice -- Primary and Dual. In Primary ownership, the account owner has the power to allow others to use the account; in Dual ownership, Primary ownership appears in conjunction with joint ownership, notably displaying asymmetric ownership perceptions among users. Conflicts arise when the sharing agreements collapse. Therefore, we propose design recommendations that bridge ownership differences based on sharing practices of DS accounts.
Authors:Ji Eun Song, Eunchae Lee, Juhee Im, Hyunsoo Jang, Eunji Kim, Joongseek Lee
Abstract:
Account sharing is common in subscription services and is now extending to generative AI platforms, which are still primarily designed for individual use. Sharing often requires workarounds that create new tensions. This study examines how LLM subscriptions are shared and the norms that develop. We combined a survey of 245 users with interviews of 36 participants to understand both patterns and lived experiences. Our analysis identified four types of account sharing, organized along two dimensions: whether the owner uses the account and whether subscription costs are shared. Within these types, we examined how norms were formed and how their fragility, especially privacy, became evident in practice. Users, fully aware of this, subtly adjusted their behavior, which we interpret through the lens of the observer effect. We frame LLM account sharing as a social practice of appropriation and outline design implications to adapt single-user platforms to multi-user realities.
Authors:Ji Eun Song, Hyunsoo Jang, Juhee Im, Joongseek Lee
Abstract:
On algorithmic social platforms, exchanging memes via direct messages (DMs) serves as phatic communication that affirms relationships, yet users often interpret these exchanges as signals shaping personalized recommendations, creating tension between relational practice and algorithmic control. This study examines how users perceive DM meme exchanges on Instagram rather than auditing Instagram's underlying recommender mechanisms, and how beliefs about DM-recommendation linkages shape coping strategies and feelings of powerlessness. We conducted semi-structured interviews with 21 active meme-DM users. Participants classified memes as recipient-friendly or recipient-unfriendly based on relational fit; many described the spread of unfriendly memes as "algorithmic contagion." Controls were constrained by relational norms, low perceived efficacy of feedback tools, and opaque DM-recommendation linkages. We articulate how DM-based relational practices are entangled with personalization infrastructures and propose three design implications: transparent linkage explanations, conversation-level opt-outs, and conservative learning that down-weights DM-originated signals.
Authors:Esther Bosch, Michael Scholz, Anke Sauerländer-Biebl, Klas Ihme
Abstract:
Shifting travel from private cars to public transport is critical for meeting climate and related mobility goals, yet passengers will only choose transit if it offers a consistently positive experience. Previous studies of passenger satisfaction have largely relied on retrospective surveys, which overlook the dynamic and spatially differentiated nature of travel experience. This paper introduces a novel combination of real-time experience sampling and spatial hot spot analysis to capture and map where public transport users report consistently positive or negative experiences. Data were collected from 239 participants in Hamburg between March and September 2025. Using a smartphone application, travelers reported their momentary journey experience every five minutes during everyday trips, yielding over 21,000 in-situ evaluations. These geo-referenced data were analyzed with the Getis-Ord $Gi^{*}$ statistic to detect significant clusters of positive and negative travel experience. The analysis identified distinct hot and cold spots of travel experience across the network. Cold spots were shaped by heterogeneous problems, ranging from predominantly delay-dominated to overcrowding or socially stressful locations. In contrast, hot spots emerged through different pathways, including comfort-oriented, time-efficient or context-driven environments. The findings highlight three contributions. First, cold spots are not uniform but reflect specific local constellations of problems, requiring targeted interventions. Second, hot spots illustrate multiple success models that can serve as benchmarks for replication. Third, this study demonstrates the value of combining dynamic high-resolution sampling with spatial statistics to guide more effective and place-specific improvements in public transport.
Authors:Kazi Noshin, Sharifa Sultana
Abstract:
While concerns about ChatGPT-induced harms due to sycophancy and other behaviors, including gaslighting, have grown among researchers, how users themselves experience and mitigate these harms remain largely underexplored. We analyze Reddit discussions to investigate what concerns users report and how they address them. Our findings reveal five distinct user-reported concerns that manifest across multiple life domains, ranging from personal to societal: inducing delusion, digressing narratives, implicating users for models' limitations, inducing addiction, and providing unsupervised psychological support. We document three-tier user-driven suggestions spanning functional usage techniques, behavioral approaches, and private and institutional safeguards. Our findings show that AI-induced harms require coordinated interventions across users, developers, and policymakers. We discuss design implications and future directions to mitigate the harms and ensure user benefits.
Authors:Annabel Goldman, Yuan Cui, Matthew Kay
Abstract:
Data literacy has become a key learning objective in K-12 education, but it remains an ambiguous concept as teachers interpret it differently. When creating assessments, teachers turn broad ideas about "working with data" into concrete decisions about what materials to include. Since working with data visualizations is a core component of data literacy, teachers' decisions about how to include them on assessments offer insight into how they interpret data literacy more broadly. Drawing on interviews with 13 teachers, we identify four challenges in enacting data literacy in assessments: (1) conceptual ambiguity between data visualization and data literacy, (2) tradeoffs between using real-world or synthetic data, (3) difficulty finding and adapting domain-appropriate visual representations and data visualizations, and (4) balancing assessing data literacy and domain-specific learning goals. Drawing on lessons from data visualization, human-computer interaction, and the learning sciences, we discuss opportunities to better support teachers in assessing data literacy.
Authors:Alice Zhong, Phoebe Chen, Anika Sharma, Kandyce Brennan, Snehalkumar 'Neil' S. Gaikwad
Abstract:
Sexual and reproductive health (SRH) remains shaped by structural barriers that leave many without judgment-free information. AI chatbots offer anonymous alternatives, but access alone does not ensure equity when socioeconomic determinants shape whose capabilities these tools expand or constrain. Conventional methods for evaluating human-AI interaction were not designed to capture whether technologies holistically support reproductive autonomy. We introduce CARE, Capability Approach for Reproductive Equity, developing capabilities, functionings, and conversion factors into a Normative Design Lens and an Evaluation Lens for AI in SRH contexts. Evaluating SRH-specific non-LLM chatbots, general-use LLMs, and search engine features along credibility and reasoning, we identify two epistemic harms: source opacity and response rigidity. We conclude with design and evaluation recommendations, participatory auditing strategies, and policy implications for high-stakes domains where AI intersects with inequity.
Authors:Anders Giovanni Møller, Elisa Bassignana, Francesco Pierri, Luca Maria Aiello
Abstract:
The ubiquity of multimedia content is reshaping online information spaces, particularly in social media environments. At the same time, search is being rapidly transformed by generative AI, with large language models (LLMs) routinely deployed as intermediaries between users and multimedia content to retrieve and summarize information. Despite their growing influence, the impact of LLM inaccuracies and potential vulnerabilities on multimedia information-seeking tasks remains largely unexplored. We investigate how generative AI affects accuracy, efficiency, and confidence in information retrieval from videos. We conduct an experiment with around 900 participants on 8,000+ video-based information-seeking tasks, comparing behavior across three conditions: (1) access to videos only, (2) access to videos with LLM-based AI assistance, and (3) access to videos with a deceiving AI assistant designed to provide false answers. We find that AI assistance increases accuracy by 3-7% when participants viewed the relevant video segment, and by 27-35% when they did not. Efficiency increases by 10% for short videos and 25% for longer ones. However, participants tend to over-rely on AI outputs, resulting in accuracy drops of up to 32% when interacting with the deceiving AI. Alarmingly, self-reported confidence in answers remains stable across all three conditions. Our findings expose fundamental safety risks in AI-mediated video information retrieval.
Authors:Davide Traini, José Manuel Alcalde-Llergo, Mariana Buenestado-Fernández, Domenico Ursino, Enrique Yeguas-Bolívar
Abstract:
This study analyzes behavioral engagement in SONAR, a virtual reality application designed for sign language training and validation. We focus on three automatically derived engagement indicators (Visual Attention (VA), Video Replay Frequency (VRF), and Post-Playback Viewing Time (PPVT)) and examine their relationship with learning performance. Participants completed a self-paced Training phase, followed by a Validation quiz assessing retention. We employed Pearson correlation analysis to examine the relationships between engagement indicators and quiz performance, followed by binomial Generalized Linear Model (GLM) regression to assess their joint predictive contributions. Additionally, we conducted temporal analysis by aggregating moment-to-moment VA traces across all learners to characterize engagement dynamics during the learning session. Results show that VA exhibits a strong positive correlation with quiz performance,followed by PPVT, whereas VRF shows no meaningful association. A binomial GLM confirms that VA and PPVT are significant predictors of learning success, jointly explaining a substantial proportion of performance variance. Going beyond outcome-oriented analysis, we characterize temporal engagement patterns by aggregating moment-to-moment VA traces across all learners. The temporal profile reveals distinct attention peaks aligned with informationally dense segments of both training and validation videos, as well as phase-specific engagement dynamics, including initial acclimatization, oscillatory attention cycles during learning, and pronounced attentional peaks during assessment. Together, these findings highlight the central role of sustained and strategically allocated visual attention in VR-based sign language learning and demonstrate the value of behavioral trace data for understanding and predicting learner engagement in immersive environments.
Authors:Irene Hou, Alexander Qin, Lauren Cheng, Philip J. Guo
Abstract:
More scientists are now using AI, but prior studies have examined only how they use it 'at the desk' for computer-based work. However, given that scientific work often happens 'beyond the desk' at lab and field sites, we conducted the first study of how scientific practitioners use AI for embodied physical tasks. We interviewed 12 scientific practitioners doing hands-on lab and fieldwork in domains like nuclear fusion, primate cognition, and biochemistry, and found three barriers to AI adoption in these settings: 1) experimental setups are too high-stakes to risk AI errors, 2) constrained environments make it hard to use AI, and 3) AI cannot match the tacit knowledge of humans. Participants then developed speculative designs for future AI assistants to 1) monitor task status, 2) organize lab-wide knowledge, 3) monitor scientists' health, 4) do field scouting, 5) do hands-on chores. Our findings point toward AI as background infrastructure to support physical work rather than replacing human expertise.
Authors:Mohammad Hadi Nezhad, Francisco Enrique Vicente Castro, Ivon Arroyo
Abstract:
Supporting users in protecting sensitive information when using conversational agents (CAs) is crucial, as users may undervalue privacy protection due to outdated, partial, or inaccurate knowledge about privacy in CAs. Although privacy knowledge can be developed through standalone resources, it may not readily translate into practice and may remain detached from real-time contexts of use. In this study, we investigate in-context, experiential learning by examining how interactions with privacy tools during chatbot use enhance users' privacy learning. We also explore interface design features that facilitate engagement with these tools and learning about privacy by simulating ChatGPT's interface which we integrated with a just-in-time privacy notice panel. The panel intercepts messages containing sensitive information, warns users about potential sensitivity, offers protective actions, and provides FAQs about privacy in CAs. Participants used versions of the chatbot with and without the privacy panel across two task sessions designed to approximate realistic chatbot use. We qualitatively analyzed participants' pre- and post-test survey responses and think-aloud transcripts and describe findings related to (a) participants' perceptions of privacy before and after the task sessions and (b) interface design features that supported or hindered user-led protection of sensitive information. Finally, we discuss future directions for designing user-facing privacy tools in CAs that promote privacy learning and user engagement in protecting privacy in CAs.
Authors:Xingyu Lan, Xi Li, Yixing Zhang, Mengqin Cheng, Jiazhe Wang, Siming Chen
Abstract:
Text plays a fundamental yet understudied role as a narrative device in data visualization. While existing research has extensively explored text as data input and interaction modality, its function in supporting storytelling and interpretation remains fragmented. To address this gap, this work presents a systematic review of 98 publications that provide insights into using text as narrative. We investigate how text can be utilized in visualization, analyze its functions and effects, and explore how it can be designed to facilitate data communication. Our synthesis identifies significant research gaps in this domain and proposes future directions to advance the integration of text and visualization, ultimately aiming to provide guidance for designing text that enhances narrative clarity and fosters engagement.
Authors:Eva-Maria Schön, Michael Neumann, Tiago Silva da Silva
Abstract:
Context: The active involvement of users and customers in agile software development remains a persistent challenge in practice. For this reason, it is important that students in higher education become familiar with good practices in Agile Requirements Engineering during their studies. Objective: Our objective is to enable students to learn how to interact with Generative Artificial Intelligence (GenAI) through the use of a stakeholder simulation with AI Personas, while also developing an understanding of the limitations of AI tools in practical contexts. Method: In our courses, we employ a stakeholder simulation using GenAI, in which students conduct interviews with AI Personas through a provided meta-prompt. Based on the outcomes of these interviews, students apply agile practices (e.g., story mapping or impact mapping) to document requirements. The use of GenAI is subsequently reflected upon in a structured group discussion. Results: Through this approach, students gain practical experience by applying state-of-the art agile practices for requirements elicitation and documentation while simultaneously developing an understanding of the technical and ethical limitations associated with the use of generative AI. Conclusion: We have applied this approach over several terms and found that using a meta-prompt provides flexibility, allowing us to remain independent of specific large language model providers.
Authors:Amine Benamara, Céline Clavel, Brian Ravenet, Nicolas Sabouret, Julien Saunier
Abstract:
During collaborative board games, cohesion represents a key aspect to define a well functionning group. From the success of the task to the developement of interpersonal relationship, this concept covers many aspects of group dynamics. The goal of our work is to investigate the factors that impact cohesion in a group, and specifically the relevant social skills that improve collaboration between multiple entities. In this article, we focus on the role of embodiement on different aspects of an interaction. We propose an experimental protocol, based on a collected corpus of humans playing a collaborative board game, to study how different agents' embodiment affect the perception of these agents and of the group as a whole. We conclude by presenting an outline of the problematics of the conception of the protocol and of multi-agent system related challenges.
Authors:Amandine M. Caut, Beimnet Zenebe, Amy Rouillard, David J. T. Sumpter
Abstract:
The rapid advancement and impressive capabilities of large language models (LLMs) have given rise to the field of prompt engineering, the practice of crafting inputs to guide LLMs toward high-quality, task-relevant outputs. A critical challenge facing the field is the lack of standardised prompt documentation and evaluation practices. Prompts can be long, complex and difficult to evaluate on subjective tasks. To address this challenge, we propose the use of prompt cards, structured summaries of prompt engineering practices inspired by the concept of model cards. Through prompt cards, the specific goals, considerations and steps taken during prompt engineering can be systematically documented and assessed. We present the prompt card approach and illustrate it on a specific task called wordalisation, in which structured numerical data is transformed into text. We argue that a well-structured prompt card can enable better reproducibility, transparency, improve prompt methodology and give an effective alternative to benchmarking for judging the quality of generated texts. By systemically capturing underlying model details, prompt intent, contextualisation strategies, evaluation practices and ethical considerations, prompt cards make explicit the often implicit design decisions that shape system behaviour. Documenting these choices is important as prompting increasingly involves complex pipelines with multiple moving parts.
Authors:Necva Bölücü, Jessica Irons, Changhyun Lee, Brian Jin, Maciej Rybinski, Huichen Yang, Andreas Duenser, Stephen Wan
Abstract:
The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.
Authors:Jie Gao, Yaoxin Wu
Abstract:
Human mobility studies how people move among meaningful places over time and how these movements aggregate into population-level patterns that shape accessibility, congestion, emissions, and public health. Large language models (LLMs) are increasingly used in this domain because many human mobility problems require reasoning about place and activity semantics, travelers' intentions and preferences, and diverse real-world constraints that are difficult to capture using coordinates and other purely numerical attributes. Despite rapid growth, the literature is still scattered, and there is no clear overview that connects human mobility tasks, challenges, and LLM designs in a consistent way. This survey therefore provides a comprehensive synthesis of LLM-based research on human mobility across five tasks, including travel itinerary planning, trajectory generation, mobility simulation, mobility prediction, and mobility semantics and understanding. For each task, we review representative work, connect core challenges to the specific roles of LLMs, and summarize typical LLM-based solution designs. We conclude with open challenges and research directions toward reliable, grounded and privacy-aware LLM-based approaches for human mobility.
Authors:Kotaro Fujimura, Hiroki Kusuyama, Masaki Takeuchi, Daisuke Iwai
Abstract:
Projection Mapping (PM) is a technology that projects images onto the surfaces of physical objects, allowing multiple users to share an augmented reality experience without special devices. However, its practical use has been constrained by the need for dark environments to ensure high-quality projection. To overcome this ``dark-room constraint,'' we propose a novel target-excluding lighting method that selectively illuminates the surrounding environment while avoiding the PM target. Our system achieves light-field illumination by combining an LED display panel with an optimized aperiodic lens array. The key contributions include a compact form factor that provides a large effective light source area, reproducing natural soft shadows comparable to typical lighting, while maintaining the spatial controllability needed to precisely avoid the target. We also introduce a computational technique for optimizing aperiodic lens placement to suppress undesired dark spots caused by crosstalk, and efficient methods for computing LED luminance patterns that enable dynamic PM. Experiments with a prototype system demonstrate that our approach achieves high-contrast PM even in bright environments.
Authors:Takahiro Okamoto, Masaki Takeuchi, Masataka Sawayama, Daisuke Iwai
Abstract:
Projection mapping (PM) enables augmented reality (AR) experiences without requiring users to wear head-mounted displays and supports multi-user interaction. It is regarded as a promising technology for a variety of applications in which users interact with content superimposed onto augmented objects in tabletop workspaces, including remote collaboration, healthcare, industrial design, urban planning, artwork creation, and office work. However, conventional PM systems often suffer from projection shadows when users occlude the light path. Prior approaches employing multiple distributed projectors can compensate for occlusion, but suffer from latency due to computational processing, degrading the user experience. In this research, we introduce a synthetic-aperture PM system that uses a significantly larger number of projectors, arranged densely in the environment, to achieve delay-free, shadowless projection for tabletop workspaces without requiring computational compensation. To address spatial resolution degradation caused by subpixel misalignment among overlaid projections, we develop and validate an offline blur compensation method whose computation time remains independent of the number of projectors. Furthermore, we demonstrate that our shadowless PM plays a critical role in achieving a fundamental goal of PM: altering material properties without evoking projection-like impression. Specifically, we define this perceptual impression as ``sense of projection (SoP)'' and establish a PM design framework to minimize the SoP based on user studies.
Authors:Jessica Irons, Patrick Cooper, Necva Bolucu, Roelien Timmer, Huichen Yang, Changhyun Lee, Brian Jin, Andreas Duenser, Stephen Wan
Abstract:
With increasing awareness of the hallucination risks of generative artificial intelligence (AI), we see a growing shift toward providing information tooling to help users determine the veracity of AI-generated answers for themselves. User responsibility for assessing veracity is particularly critical for certain sectors that rely on on-demand, AI-generated data extraction, such as biomedical research and the legal sector. While prior work offers us a variety of ways in which systems can provide such support, there is a lack of empirical evidence on how this information is actually incorporated into the user's decision-making process. Our user study takes a step toward filling this knowledge gap. In the context of a generative AI data extraction tool, we examine the relationship between the type of supporting information (full source text, passage retrieval, and Large Language Model (LLM) explanations) and user behavior in the veracity assessment process, examined through the lens of efficiency, effectiveness, reliance and trust. We find that passage retrieval offers a reasonable compromise between accuracy and speed, with judgments of veracity comparable to using the full source text. LLM explanations, while also enabling rapid assessments, fostered inappropriate reliance and trust on the data extraction AI, such that participants were less likely to detect errors. In additiona, we analyzed the impacts of the complexity of the information need, finding preliminary evidence that inappropriate reliance is worse for complex answers. We demonstrate how, through rigorous user evaluation, we can better develop systems that allow for effective and responsible human agency in veracity assessment processes.
Authors:Pronob Kumar Barman, James R. Foulds, Tera L. Reynolds
Abstract:
Peer support is critical to managing chronic health conditions. Online health communities (OHCs) enable patients and caregivers to connect with similar others, yet their large scale makes it challenging to find the most relevant peers and content. This study assessed perceived value, preferred features, and acceptance conditions for algorithmically personalized support group formation within OHCs. A two-phase, mixed-methods survey (N=165) examined OHC participation patterns, personalization priorities, and acceptance of a simulated personalized support group. Perceived value of the simulated support group was high (mean 4.55/5; 62.8% rated 5/5) and 91.5% would join this group. The importance participants placed on peer matching strongly correlated with perceived value (\r{ho}=0.764, p<0.001). Qualitative findings revealed conditional acceptance: participants demand security, transparency, human oversight, and user control over data. Personalized support groups may be desired, but they will not be adopted unless trust, privacy, and algorithmic governance concerns are addressed.
Authors:Bahare Riahi, Sayali Patukale, Joy Niranjan, Yogya Koneru, Tiffany Barnes, Veronica Cateté
Abstract:
This study investigates K--12 teachers' perceptions and experiences with AI-supported rubric generation during a summer professional development workshop ($n = 25$). Teachers used MagicSchool.ai to generate rubrics and practiced prompting to tailor criteria and performance levels. They then applied these rubrics to provide feedback on a sample block-based programming activity, followed by using a chatbot to deliver rubric-based feedback for the same work. Data were collected through pre- and post-workshop surveys, open discussions, and exit tickets. We used thematic analysis to analyze the qualitative data. Teachers reported that they rarely create rubrics from scratch because the process is time-consuming and defining clear distinctions between performance levels is challenging. After hands-on use, teachers described AI-generated rubrics as strong starting drafts that improved structure and clarified vague criteria. However, they emphasized the need for teacher oversight due to generic or grade-misaligned language, occasional misalignment with instructional priorities, and the need for substantial editing. Survey results indicated high perceived clarity and ethical acceptability, moderate alignment with assignments, and usability as the primary weakness -- particularly the ability to add, remove, or revise criteria. Open-ended responses highlighted a ``strictness-versus-detail'' trade-off: AI feedback was often perceived as harsher but more detailed and scalable. As a result, teachers expressed conditional willingness to adopt AI rubric tools when workflows support easy customization and preserve teacher control.
Authors:Marta Sumyk, Oleksandr Kosovan
Abstract:
Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environment by perceiving high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.
Authors:Siyu Lu, Yanhan Liu, Shiyu Xu, Ruishi Zou, Chen Ye
Abstract:
Graphics (e.g., figures and charts) are ubiquitous in scientific papers, yet separating graphics from text increases cognitive load in understanding text-graphic connections. Research has found that word-scale graphics, or visual embellishments at typographic size, can augment original text, making it more expressive and easier to understand. However, whether, if so, how scientific papers adopt word-scale graphics for scholarly communication remains unclear. To address this gap, we conducted a corpus study reviewing 909 word-scale graphics extracted from 126,797 scientific papers. Through analysis, we propose a framework that characterizes where (positioning), why (communicative function), and how (visual representation) authors apply word-scale graphics in scientific papers. Our findings reveal that word-scale graphics are rarely used, that icons dominate visual representation, and that visual representation connects with communicative function (e.g., using quantitative graphs for data annotation). We further discuss opportunities to enhance scholarly communication with word-scale graphics through technical and administrative innovations.
Authors:Advait Bhat, Marianne Aubin Le Quéré, Mor Naaman, Maurice Jakesch
Abstract:
Emerging experimental evidence shows that writing with AI assistance can change both the views people express in writing and the opinions they hold afterwards. Yet, we lack substantive understanding of procedural and behavioral changes in co-writing with AI that underlie the observed opinion-shaping power of AI writing tools. We conducted a mixed-methods study, combining retrospective interviews with 19 participants about their AI co-writing experience with a quantitative analysis tracing engagement with ideas and opinions in 1{,}291 AI co-writing sessions. Our analysis shows that engaging with the AI's suggestions -- reading them and deciding whether to accept them -- becomes a central activity in the writing process, taking away from more traditional processes of ideation and language generation. As writers often do not complete their own ideation before engaging with suggestions, the suggested ideas and opinions seeded directions that writers then elaborated on. At the same time, writers did not notice the AI's influence and felt in full control of their writing, as they -- in principle -- could always edit the final text. We term this shift \textit{Reactive Writing}: an evaluation-first, suggestion-led writing practice that departs substantially from conventional composing in the presence of AI assistance and is highly vulnerable to AI-induced biases and opinion shifts.
Authors:Ninghao Wan, Jiarun Song, Fuzheng Yang
Abstract:
In virtual reality (VR) educational scenarios, Pedagogical agents (PAs) enhance immersive learning through realistic appearances and interactive behaviors. However, most existing PAs rely on static speech and simple gestures. This limitation reduces their ability to dynamically adapt to the semantic context of instructional content. As a result, interactions often lack naturalness and effectiveness in the teaching process. To address this challenge, this study proposes a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to generate coordinated speech and gesture instructions, enabling dynamic alignment between instructional semantics and multimodal expressive behaviors. A VR-based PA prototype was developed and evaluated through user experience-oriented subjective experiments. Results indicate that dynamically generated multimodal expressions significantly enhance learners' perceived learning effectiveness, engagement, and intention to use, while effectively alleviating feelings of fatigue and boredom during the learning process. Furthermore, the combined dynamic expression of speech and gestures notably enhances learners' perceptions of human-likeness and social presence. The findings provide new insights and design guidelines for building more immersive and naturally expressive intelligent PAs.
Authors:SangYeop Jeong, Yeongseo Na, Seung Gyu Jeong, Jin-Woo Jeong, Seong-Eun Kim
Abstract:
In VR interactions with embodied conversational agents, users' emotional intent is often conveyed more by how something is said than by what is said. However, most VR agent pipelines rely on speech-to-text processing, discarding prosodic cues and often producing emotionally incongruent responses despite correct semantics. We propose an emotion-context-aware VR interaction pipeline that treats vocal emotion as explicit dialogue context in an LLM-based conversational agent. A real-time speech emotion recognition model infers users' emotional states from prosody, and the resulting emotion labels are injected into the agent's dialogue context to shape response tone and style. Results from a within-subjects VR study (N=30) show significant improvements in dialogue quality, naturalness, engagement, rapport, and human-likeness, with 93.3% of participants preferring the emotion-aware agent.
Authors:Jiarun Song, Ninghao Wan, FuZheng Yang, Weisi Lin
Abstract:
Virtual reality (VR) conferencing has the potential to provide geographically dispersed users with an immersive environment, enabling rich social interactions and user experience using avatars. However, remote communication in VR inevitably introduces end-to-end (E2E) latency, which can significantly impact user experience. To clarify the impact of latency, we conducted subjective experiments to analyze how it influences interaction fluency from the perspective of quality perception and social presence from the perspective of social cognition, comparing VR conferencing with traditional video conferencing (VC). Specifically, interaction fluency emphasizes user perception of interaction pace and responsiveness and is assessed using Absolute Category Rating (ACR) method. In contrast, social presence focuses on the cognitive understanding of interaction, specifically whether individuals can comprehend the intentions, emotions, and behaviors expressed by others. It is primarily measured using the Networked Minds Social Presence Inventory (NMSPI). Building on this analysis, we further investigate the relationship between interaction fluency and social presence under different latency conditions to clarify the underlying perceptual and cognitive mechanisms. The findings from these subjective tests provide meaningful insights for optimizing the related systems, helping to improve interaction fluency and enhancing social presence in immersive virtual environments.
Authors:Jacek Małecki, Alexander Mathiesen-Ohman, Katarzyna Tworek
Abstract:
Recent progress in artificial intelligence has been driven largely by the scaling of centralized large language models through increased parameters, datasets, and computational resources. While effective, this paradigm introduces structural constraints related to compute concentration, energy consumption, data availability, and governance. This paper proposes an alternative architectural approach through the H3LIX Decentralized Frontier Model Architecture (DFMA), a distributed AI framework in which locally operating AI instances generate synthetic learning signals derived from reasoning processes and interactions. These signals are aggregated within a shared contextual substrate termed the Collective Context Field (CCF), which conditions reasoning behavior across the network without requiring direct parameter synchronization. By enabling contextual signal propagation rather than centralized retraining at every iteration, the architecture can be designed to support privacy-preserving collective learning under explicit assumptions, while facilitating distributed sharing of learned abstractions. The system further integrates Energy-Adaptive Model Evolution, aligning learning activities with renewable energy availability to support more sustainable AI infrastructure. Conceptually, the architecture reframes artificial intelligence as a distributed cognitive system analogous to biological neural networks, in which intelligence emerges from the interaction of many locally adaptive agents within a shared contextual environment. Together, these mechanisms suggest a new scaling pathway for artificial intelligence systems based on distributed contextual learning and collective experience accumulation.
Authors:Jan Ulrich Bartels, Alexander Achberger, Katherine J. Kuchenbecker, Michael Sedlmair
Abstract:
We describe the hardware design, force-rendering approach, and evaluation of a new reconfigurable haptic interface consisting of a network of hybrid motor-brake actuation modules that apply forces via cables. Each module contains both a motor and a brake, enabling it to smoothly render active forces up to 6 N using its motor and collision forces up to 186 N using its passive one-way brake. The modular design, meanwhile, allows the system to deliver rich haptic feedback in a flexible number of DoF and widely ranging configurations.
Authors:Jordan Aiko Deja, Isidro Butaslac, Nicko Reginio Caluya, Maheshya Weerasinghe
Abstract:
Robots are moving beyond industrial settings into creative, educational, and public environments where interaction is open-ended and improvisational. Yet much of human-AI-robot interaction remains framed around performance and efficiency, positioning humans as supervisors rather than collaborators. We propose a re-framing of AI interaction with robots as scaffolding: infrastructure that enables humans to shape robotic behaviour over time while remaining meaningfully in control. Through scenarios from creative practice, learning-by-teaching, and embodied interaction, we illustrate how humans can act as executive directors, defining intent and steering revisions, while AI mediates between human expression and robotic execution. We outline design and evaluation implications that foreground creativity, agency, and flow. Finally, we discuss open challenges in social, scalable, and mission-critical contexts. We invite the community to rethink interacting with Robots and AI not as autonomy, but as sustained support for human creativity.
Authors:Haomiaomiao Wang, Tomás E Ward, Lili Zhang
Abstract:
We test whether LLMs show robust decision biases. Treating models as participants in two-arm bandits, we ran 20000 trials per condition across four decoding configurations. Under symmetric rewards, models amplified positional order into stubborn one-arm policies. Under asymmetric rewards, they exploited rigidly yet underperformed an oracle and rarely re-checked. The observed patterns were consistent across manipulations of temperature and top-p, with top-k held at the provider default, indicating that the qualitative behaviours are robust to the two decoding knobs typically available to practitioners. Crucially, moving beyond descriptive metrics to computational modelling, a hierarchical Rescorla-Wagner-softmax fit revealed the underlying strategies: low learning rates and very high inverse temperatures, which together explain both noise-to-bias amplification and rigid exploitation. These results position minimal bandits as a tractable probe of LLM decision tendencies and motivate hypotheses about how such biases could shape human-AI interaction.
Authors:Yoshiki Tanaka, Michimasa Inaba
Abstract:
User reviews on e-commerce and review sites are crucial for making purchase decisions, although creating detailed reviews is time-consuming and labor-intensive. In this study, we propose a novel use of dialogue systems to facilitate user review creation by generating reviews from information gathered during interview dialogues with users. To validate our approach, we implemented our system using GPT-4 and conducted comparative experiments from the perspectives of system users and review readers. The results indicate that participants who used our system rated their interactions positively. Additionally, reviews generated by our system required less editing to achieve user satisfaction compared to those by the baseline. We also evaluated the reviews from the reader' perspective and found that our system-generated reviews are more helpful than those written by humans. Despite challenges with the fluency of the generated reviews, our method offers a promising new approach to review writing.
Authors:Kaleen Shrestha, Harish Dukkipati, Avni Hulyalkar, Kyla Penamante, Ankita Samanta, Maja Matarić
Abstract:
In peer mediation--an approach to conflict resolution used in many K-12 schools in the United States--students help other students to resolve conflicts. For schools without peer mediation programs, socially assistive robots (SARs) may be able to provide an accessible option to practice peer mediation. We investigate how elementary school students react to a peer mediator role-play activity through an exploratory study with SARs. We conducted a small single-session between-subjects study with 12 participants. The study had two conditions, one with two robots acting as disputants, and the other without the robots and just the tablet. We found that a majority of students had positive feedback on the activity, with many students saying the peer mediation practice helped them feel better about themselves. Some said that the activity taught them how to help friends during conflict, indicating that the use of SARs for peer mediation practice is promising. We observed that participants had varying reading levels that impacted their ability to read and dictate the turns in the role-play script, an important consideration for future study design. Additionally, we found that some participants were more expressive while reading the script and throughout the activity. Although we did not find statistical differences in pre-/post-session self-perception and quiz performance between the robot and tablet conditions, we found strong correlations (p<0.05) between certain trait-related measures and learning-related measures in the robot condition, which can inform future study design for SARs for this and related contexts.
Authors:Tzu-Hsin Hsieh, Cassandra Michelle Stefanie Visser, Elmar Eisemann, Ricardo Marroquim
Abstract:
Motor-skill learning systems in XR rely on persistent cues. However, constant cueing can induce overreliance and erode memorization and skill transfer. We introduce a skill-adaptive, dynamically transparent ghost instructor whose opacity adapts in real time to learner performance. In a first-person perspective, users observe a ghost hand executing piano fingering with either a static or a performance-adaptive transparency in a VR piano training application. We conducted a within-subjects study (N=30), where learners practiced with traditional Static (fixed-transparency) and our proposed Dynamic (performance-adaptive) modes and were tested without guidance immediately and after a 10-minute retention interval. Relative to Static, the Dynamic mode yielded higher pitch and fingering accuracy and limited error increases, with comparable timing. These findings suggest that adaptive transparency helps learners internalize fingerings more effectively, reducing dependency on external cues and improving short-term skill retention within immersive learning environments. We discuss design implications for motor-skill learning and outline directions for extending this approach to longer-term retention and more complex tasks.
Authors:Jieying Zhang, Steeven Villa, Abdallah El Ali
Abstract:
Advances in generative AI, speech synthesis, and embodied avatars enable systems that not only assist communication, but can act as proxies on users' behalf. Prior work in HCI has largely focused on systems as external tools, with less attention paid to the experiential consequences of users' speech and actions becoming assimilated with AI-generated output. We introduce the design and implementation of ProxyMe, a work-in-progress VR prototype that allows users to embody an avatar whose voice and spoken content are modified by an AI system. By combining avatar-based embodiment, voice cloning, and AI-mediated speech augmentation, ProxyMe invites the exploration of avatar self-extension: situations in which AI-modified communication is experienced as part of one's own expressive behavior. We chart out research challenges and envisioned scenarios, with a focus on how varying degrees of delegation and steerability can influence perceived agency, authorship, and self-identification.
Authors:Inha Cha, Catherine Wieczorek, Richmond Y. Wong
Abstract:
Although organizations increasingly position AI adoption as a pathway to competitiveness and innovation, organizations' perspectives on productivity and efficiency often clash with workers' perspectives on AI's economic and social value. Through design workshops with 15 UX designers, we examine how AI adoption unfolds across individual, team, and organizational scales. At the individual level, designers weighed efficiency, skill development, and professional worth. At the team level, they negotiated collaboration, responsibility, and rigor. At the organizational level, adoption was shaped by compliance requirements and organizational norms. Across these scales, discourses of efficiency carried social and ethical dimensions of responsibility, trust, and autonomy. We view adoption as a site where roles, relationships, and power are reconfigured. We argue that AI adoption should be understood as a process of negotiating values, and call for future work examining how AI systems redistribute responsibility among team members, while understanding how such shifts could strengthen worker agency.
Authors:Srishti Palani, Vidya Setlur
Abstract:
Large Language Models (LLMs) are transforming Conversational Visual Analytics (CVA) by enabling data analysis through natural language. However, evaluating LLMs for CVA remains a challenge: requiring programming expertise, overlooking real-world complexity, and lacking interpretable metrics for multi-format (visualizations and text) outputs. Through interviews with 22 CVA developers and 16 end-users, we identified use cases, evaluation criteria and workflows. We present Lexara, a user-centered evaluation toolkit for CVA that operationalizes these insights into: (i) test cases spanning real-world scenarios; (ii) interpretable metrics covering visualization quality (data fidelity, semantic alignment, functional correctness, design clarity) and language quality (factual grounding, analytical reasoning, conversational coherence) using rule-based and LLM-as-a-Judge methods; and (iii) an interactive toolkit enabling experimental setup and multi-format and multi-level exploration of results without programming expertise. We conducted a two-week diary study with six CVA developers, drawn from our initial cohort of 22. Their feedback demonstrated Lexara's effectiveness for guiding appropriate model and prompt selection.
Authors:Dorsaf Sallami, Esma Aïmeur
Abstract:
The rampant spread of fake news in the digital age poses serious risks to public trust and democratic institutions, underscoring the need for effective, transparent, and user-centered detection tools. Existing browser extensions often fall short due to opaque model behavior, limited explanatory support, and a lack of meaningful user engagement. This paper introduces Aletheia, a novel browser extension that leverages Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to detect fake news and provide evidence-based explanations. Aletheia further includes two interactive components: a Discussion Hub that enables user dialogue around flagged content and a Stay Informed feature that surfaces recent fact-checks. Through extensive experiments, we show that Aletheia outperforms state-of-the-art baselines in detection performance. Complementing this empirical evaluation, a complementary user study with 250 participants confirms the system's usability and perceived effectiveness, highlighting its potential as a transparent tool for combating online fake news.
Authors:Zuoyu Zhang, Yancheng Zhu
Abstract:
Tool calling allows large language models (LLMs) to interact with external systems like APIs, enabling applications in customer support, data analysis, and dynamic content generation. While recent benchmarks have advanced tool-use research, they suffer from key limitations, including reliance on simulated or restricted APIs, limited reproducibility, and a lack of cultural and geographic diversity. To address these gaps, we introduce International Tool Calling (ITC), a large-scale, multilingual benchmark designed for realistic, globally distributed tool-calling scenarios. ITC includes 3,571 real APIs and 17,540 tool calling tasks across 20 categories and 40 countries. Experiments reveal substantial performance gaps between open- and closed-source LLMs, while fine-tuning on ITC yields significant improvements, particularly for non-English queries, enhancing cross-lingual generalization, reasoning consistency, and robustness to out-of-domain tools. ITC provides a valuable benchmark for advancing LLM robustness and performance in complex, multi-tool, and international scenarios. Dataset: https://anonymous.4open.science/r/International-Tool-Calling-ITC-dataset-FAF4/.
Authors:Phenyo Phemelo Moletsane, Michael W. Asher, Christine Kwon, Paulo F. Carvalho, Amy Ogan
Abstract:
Most learners worldwide are multilingual, yet implementing multilingual education remains challenging in practice. EdTech offers an opportunity to bridge this gap and expand access for linguistically diverse learners. We conducted a quasi-experiment in Uganda with 2,931 participants enrolled in a non-formal radio- and mobile-based engineering course, where learners self-selected instruction in Leb Lango (a local language), English, or a Hybrid option combining both languages. The Leb Lango version of the course was used disproportionately by learners from rural areas, those with less formal education, and those with lower prior knowledge, broadening participation among disadvantaged learners. Moreover, the availability of Leb Lango instruction was associated with higher active participation, even among learners who registered for English instruction. Although Leb Lango learners began with lower performance, they demonstrated faster learning gains and achieved comparable final examination outcomes to English and Hybrid learners. These results suggest that providing local language options to learners is an effective way to make EdTech more accessible.
Authors:Jian Zhang, Wafa Johal, Jarrod Knibbe
Abstract:
Tangible interactions involve multiple sensory cues, enabling the accurate perception of object properties, such as size. Research has shown, however, that if we decouple these cues (for example, by altering the visual cue), then the resulting discrepancies present new opportunities for interactions. Perception over time though, not only relies on momentary sensory cues, but also on a priori beliefs about the object, implying a continuing update cycle. This cycle is poorly understood and its impact on interaction remains unknown. We study (N=80) visuo-haptic perception of size over time and (a) reveal how perception drifts, (b) examine the effects of visual priming and dead-reckoning, and (c) present a model of visuo-haptic perception as a cyclical, self-adjusting system. Our work has a direct impact on illusory perception in VR, but also sheds light on how our visual and haptic systems cooperate and diverge.
Authors:Emran Poh, Yueyue Hou, Tianyi Zhang, Jiannan Li
Abstract:
Designing adaptive tutoring systems for software learning presents challenges in determining appropriate instructional modalities. To inform the design of such systems, we conducted an observational study of ten human teacher-student pairs (N=10), where experienced design software users taught novices two new graphic design software features through multi-step procedures. These lessons were limited to three communication channels (speech, visual annotations, and remote screen control) to mimic possible AI tutor modalities. We found that annotations complement speech with spatial precision and remote control complements it with spatial and temporal precision, but both cause intrusion to learner agency. Teachers adaptively select modalities to balance the need for instruction progress with students' cognitive engagement and sense of digital territory ownership. Our results provide further support to the contiguity principles and the value of agency in learning, while suggesting precision-agency trade-off and digital territoriality as new design constraints for adaptive software guidance.
Authors:Rina Buoy, Dylan berkamp Fouepe Dongmo, Vesal Khean, Simone Marinai, Koichi Kise
Abstract:
Reading has always been an integral part of both professional and personal life. Character and layout recognition and understanding by computers are well-explored areas. Nevertheless, how characters and layout are read and perceived by humans remains relatively underexplored. This work contributes to the field of human-document interaction (HDI) by investigating the effects of character and layout personalization on readability. The paper presents an empirical study on how parts-of-speech (POS)-based character and layout modifications can lead to overall improvements in both reading comprehension and memorization for two non-segmented, non-Latin writing systems: Khmer and Japanese. The experimental results from 43 participants suggest that, by bolding POS-derived content words, Khmer readers perform better on both reading comprehension and memorisation tasks, with a significant effect (p-values of 0.03 and 0.04, respectively). A similar overall tendency is also observed in a pilot study among Japanese readers (10 participants) using syntactic color-coding. In addition, the analyses of reading time, answering time, and perceived difficulty reveal that the proposed text styling technique does not increase any perceived difficulty, cognitive load, or reading effort for the Khmer readers. However, the Japanese readers experienced a decrease in reading speed. This study and its findings represent a significant step towards enabling dynamic, script-dependent personalization of character and layout to optimize human readability.
Authors:Zhimin Wang, Chenyu Gu, Feng Lu
Abstract:
Eye-hand coordinated interaction is becoming a mainstream interaction modality in Virtual Reality (VR) user interfaces.Current paradigms for this multimodal interaction require users to learn predefined gestures and memorize multiple gesture-task associations, which can be summarized as an ``Operation-to-Intent" paradigm. This paradigm increases users' learning costs and has low interaction error tolerance. In this paper, we propose SIAgent, a novel "Intent-to-Operation" framework allowing users to express interaction intents through natural eye-hand motions based on common sense and habits. Our system features two main components: (1) intent recognition that translates spatial interaction data into natural language and infers user intent, and (2) agent-based execution that generates an agent to execute corresponding tasks. This eliminates the need for gesture memorization and accommodates individual motion preferences with high error tolerance. We conduct two user studies across over 60 interaction tasks, comparing our method with two "Operation-to-Intent" techniques. Results show our method achieves higher intent recognition accuracy than gaze + pinch interaction (97.2% vs 93.1%) while reducing arm fatigue and improving usability, and user preference. Another study verifies the function of eye gaze and hand motion channels in intent recognition. Our work offers valuable insights into enhancing VR interaction intelligence through intent-driven design. Our source code and LLM prompts will be made available upon publication.
Authors:Ken Gu, Srishti Palani, Vidya Setlur
Abstract:
Conversational interfaces are increasingly used for data analysis, enabling data workers to express complex analytical intents in natural language. Yet, these interactions unfold as long, linear transcripts that are misaligned with the iterative, nonlinear nature of real-world analyses. Revisiting and summarizing conversations for different contexts is therefore challenging. This paper investigates how data workers navigate, make sense of, and communicate prior analytical conversations. To study behaviors beyond those supported by standard interfaces (i.e., scrolling and keyword search), we develop a design probe that supplements analytical conversations with structured elements and affordances (e.g., filtering, multi-level navigation and detail-on-demand). In a user study (n = 10), participants used the probe to navigate and communicate past analyses, fulfilling information needs (recall, reorient, prioritize) through navigation strategies (visual recall, sequential and abstractive) and summarization practices (adding process details and context). Based on these findings, we discuss design implications to support re-visitation and communication of analytical conversations.
Authors:Ariadni Mandala, Alexandros Gazis, Theodoros Vavouras
Abstract:
In increasingly multicultural and multilingual societies, foreign language learning has become essential not only for communication but also for social cohesion and professional advancement. Distance education has emerged as a flexible and accessible solution, particularly for adults seeking to enhance their linguistic and intercultural competencies. This study explores the views of foreign language teachers regarding the role of distance education in promoting multilingualism, with a specific focus on culturally diverse border regions. Conducted in the Regional Unit of Evros, Greece, the research adopts a qualitative methodology based on semi-structured interviews with five language educators working in public and private education. Findings reveal that teachers recognize the potential of digital tools such as Massive Open Online Courses (MOOCs), machine translation applications (e.g., Google Translate, DeepL), and adaptive learning platforms to support multilingual learning, particularly when used as supplementary resources. However, concerns were raised about the lack of personalized feedback, limited interactivity, and the absence of culturally contextualized content on existing platforms. Teachers emphasized the importance of digital literacy, pedagogical training, and culturally inclusive design to ensure effective implementation. The study highlights the need for targeted support for educators in border regions and calls for more locally adapted digital resources that reflect linguistic diversity. These findings offer insights for policymakers and educational technology developers aiming to improve the quality and reach of multilingual education in remote or underserved areas.
Authors:Ramtin Tabatabaei, Milad Hosseini, Ali Mohajerzarrinkelk, Ali F. Meghdari, Alireza Taheri
Abstract:
In a preliminary exploratory study, our goal was to train deep neural network models to mimic children's and/or adults' gaze behavior in certain social situations to reach this objective. Additionally, we aim to identify potential differences in gaze behavior between these two age groups based on our participants' gaze data. Furthermore, we aimed to assess the practical effectiveness of our adult and children models by deploying them on a Nao robot in real-life settings. To achieve this, we first created two video clips, one animation and one live-action, to depict some social situations. Using an eye-tracking device, we collected eye-tracking data from 24 participants, including 12 children and 12 adults. Then, we utilized deep neural networks, specifically LSTM and Transformer Networks, to analyze and model the gaze patterns of each group of participants. Our results indicate that when the models attempted to predict people's locations (in the next frame), they had an accuracy in the range of 62%-70% with one attempt, which increased by ~20% when attempted twice (i.e. the two highest-ranked predicted labels as outputs). As expected, the result underscores that gaze behavior is not a wholly unique phenomenon. We obtained feedback from 57 new participants to evaluate the robot's functionality. These participants were asked to watch two videos of the robot's performance in each mode and then complete a comprehensive questionnaire. The questionnaire results indicate that the participants expressed satisfaction with the robot's interaction, including its attention, intelligence, and responsiveness to human actions. However, they did not perceive the robot as a social companion comparable to a human. This exploratory study tries to address/show potentials of the social acceptance of robots based on human nonverbal behavioral cues for future research.
Authors:Ronald Schnitzer, Maximilian Hoeving, Sonja Zillner
Abstract:
In August 2024, the EU Artificial Intelligence Act (AIA) came into force, marking the world's first large-scale regulatory framework for AI. Central to the AIA is a risk-based approach, aligning regulatory obligations with the potential harm posed by AI systems. To operationalize this, the AIA defines a Risk Classification Scheme (RCS), categorizing systems into four levels of risk. While this aligns with the theoretical foundations of risk-based regulations, the practical application of the RCS is complex and requires expertise across legal, technical, and domain-specific areas. Despite increasing academic discussion, little empirical research has explored how practitioners apply the RCS in real-world contexts. This study addresses this gap by evaluating how industrial practitioners apply the RCS using a self-service, web-based decision-support tool. Following a Design Science Research (DSR) approach, two evaluation phases involving 78 practitioners across diverse domains were conducted. Our findings highlight critical challenges in interpreting legal definitions and regulatory scope, and show that targeted support, such as clear explanations and practical examples, can significantly enhance the risk classification process. The study provides actionable insights for tool designers and policymakers aiming to support AIA compliance in practice.
Authors:Irene Hou, Zeyu Xiong, Philip J. Guo, April Yi Wang
Abstract:
Instructors are increasingly experimenting with AI chatbots for classroom support. To investigate how instructors adapt chatbots to their own contexts, we first analyzed existing resources that provide prompts for educational purposes. We identified ten common categories of customization, such as persona, guardrails, and personalization. We then conducted interviews with ten university STEM instructors and asked them to card-sort the categories into priorities. We found that instructors consistently prioritized the ability to customize chatbot behavior to align with course materials and pedagogical strategies and de-prioritized customizing persona/tone. However, their prioritization of other categories varied significantly by course size, discipline, and teaching style, even across courses taught by the same individual, highlighting that no single design can meet all contexts. These findings suggest that modular AI chatbots may provide a promising path forward. We offer design implications for educational developers building the next generation of customizable classroom AI systems.
Authors:Ralf Schmälzle, Yuetong Du, Sue Lim, Gary Bente
Abstract:
Why do some speakers capture a room almost instantly while others fail to connect? The real-time architecture of audience engagement remains largely a black box. Here, we used motion-captured animations to present the pure nonverbal performance of public speakers to audiences - either in silence (nonverbal-only) or paired with the verbal content (nonverbal-plus-verbal). Using continuous response measurement (CRM), we find that audience judgments solidify with remarkable speed: Moment-to-moment engagement ratings become highly predictive of subsequent evaluations within the initial 10 seconds of the performance. Most notably, this predictive relationship emerged faster and slightly stronger in the nonverbal-only condition, with predictive information being present already after less than 5 seconds. These findings elucidate the social impact a speaker's nonverbal performance has on audience impressions, even when dissociated from the verbal content of the speech. Our approach provides a high-resolution temporal map of social impression formation, pointing to an early "moment of capture" that appears to set the stage for the reception of the following message. On a broader scale, this research validates a powerful new method to isolate different communicative channels, to scientifically deconstruct rhetorical skill, and to study the pervasive impact of nonverbal behavior more broadly. It also enables us to translate the ancient art of rhetoric into a modern science of social impression formation, yielding an empirical basis that can inform human-centered feedback, develop AI-based augmentation tools, and guide the design of engaging, socially present avatars in an increasingly AI-mediated and virtual world.
Authors:Gordon Fletcher, Saomai Vu Khan
Abstract:
Organisations face polycrisis uncertainty yet overlook embedded knowledge. We show how generative AI can operate as a serendipity engine and knowledge transducer to discover, classify and mobilise reusable components (models, frameworks, patterns) from existing documents. Using 206 papers, our pipeline extracted 711 components (approx 3.4 per paper) and organised them into a repository aligned to Beer's Viable System Model (VSM). We contribute i) conceptually, a theory of planned serendipity in which GenAI lowers transduction costs between VSM subsystems, ii) empirically, a component repository and temporal/subject patterns, iii) managerially, a vignette and process blueprint for organisational adoption and iv) socially, pathways linking repurposing to environmental and social benefits. We propose testable links between repository creation, discovery-to-deployment time, and reuse rates, and discuss implications for shifting innovation portfolios from breakthrough bias toward systematic repurposing.
Authors:Timothy Bickmore, Mehdi Arjmand, Yunus Terzioglu
Abstract:
Kitchen appliances are frequently used domestic artifacts situated at the point of everyday dietary decision making, making them a promising but underexplored site for health promotion. We explore the concept of relational appliances: everyday household devices designed as embodied social actors that engage users through ongoing, personalized interaction. We focus on the refrigerator, whose unique affordances, including a fixed, sensor-rich environment, private interaction space, and close coupling to food items, support contextualized, conversational engagement during snack choices. We present an initial exploration of this concept through a pilot study deploying an anthropomorphic robotic head inside a household refrigerator. In a home-lab apartment, participants repeatedly retrieved snacks during simulated TV "commercial breaks" while interacting with a human-sized robotic head. Participants were randomized to either a health-promotion condition, in which the robot made healthy snack recommendations, or a social-chat control condition. Outcomes included compliance with recommendations, nutritional quality of selected snacks, and psychosocial measures related to acceptance of the robot. Results suggest that participants found the robot persuasive, socially engaging, and increasingly natural over time, often describing it as helpful, aware, and companionable. Most participants reported greater awareness of their snack decisions and expressed interest in having such a robot in their own home. We discuss implications for designing relational appliances that leverage anthropomorphism, trust, and long-term human-technology relationships for home-based health promotion.
Authors:Nobuhito Kasahara, Shota Yamanaka, Homei Miyashita
Abstract:
Typical success-rate prediction models for tapping exclude targets near screen edges; however, design constraints often force such placements. Additionally, in scrollable UIs any element can move close to an edge. In this work, we model how target--edge distance affects 1D touch pointing accuracy. We propose the Skewed Dual Normal Distribution Model, which assumes the tap coordinate distribution is skewed by a nearby edge. The results of two smartphone experiments showed that, as targets approached the edge, the distribution's peak shifted toward the edge and its tail extended away. In contrast to prior reports, the success rate improved when the target touched the edge, suggesting a strategy of ``tapping the target together with the edge.'' By accounting for skew, our model predicts success rates across a wide range of conditions, including edge-adjacent targets, thus extending coverage to the whole screen and informing UI design support tools.
Authors:Xiuqi Tommy Zhu, Xiaoan Liu, Casper Harteveld, Smit Desai, Eileen McGivney
Abstract:
Non-Display Smart Glasses hold the potential to support everyday activities by combining continuous environmental sensing with voice-only interaction powered by large language models (LLMs). Understanding how conversational successes and breakdowns arise in everyday contexts can better inform the design of future voice-only interfaces. To investigate this, we conducted a month-long collaborative autoethnography (n=2) to identify patterns of successes and breakdowns when using such devices. We then compare these patterns with prior findings on voice-only interactions to highlight the unique affordances and opportunities offered by non-display smart glasses.
Authors:Yuan Cui, Annabel Goldman, Jovy Zhou, Xiaolin Liu, Clarissa Shieh, Joshua Yao, Mia Cheng, Matthew Kay, Fumeng Yang
Abstract:
Assessments are critical in education, but creating them can be difficult. To address this challenge in a grounded way, we partnered with 13 teachers in a seven-month codesign process. We developed a conceptual model that characterizes the iterative dual process where teachers develop assessments while simultaneously refining requirements. To enact this model in practice, we built Ripplet, a web-based tool with multilevel reusable interactions to support assessment authoring. The extended codesign revealed that Ripplet enabled teachers to create formative assessments they would not have otherwise made, shifted their practices from generation to curation, and helped them reflect more on assessment quality. In a user study with 15 additional teachers, compared to their current practices, teachers felt the results were more worth their effort and that assessment quality improved.
Authors:Olga Viberg, Mutlu Cukurova, Rene F. Kizilcec, Simon Buckingham Shum, Dorottya Demszky, Dragan Gašević, Thorben Jansen, Ioana Jivet, Jelena Jovanovic, Jennifer Meyer, Kou Murayama, Zach Pardos, Chris Piech, Nikol Rummel, Naomi E. Winstone
Abstract:
Human agency is crucial in education and increasingly challenged by the use of generative AI. This meeting report synthesizes interdisciplinary insights and conceptualizes four aspects that delineate human agency: human oversight, AI-human complementarity, AI competencies, and relational emergence. We explore practical dilemmas for protecting and promoting agency, focusing on normative constraints, transparency, and cognitive offloading, and highlight key tensions and implications to inform ethical and effective AI integration in education.
Authors:Lefan Lai, Tinghui Li, Zhanna Sarsenbayeva, Brandon Victor Syiem
Abstract:
Visual search is a core component of mixed reality (MR) interactions, influenced by the complexities of MR application contexts. In this paper, we investigate how prevalent factors in MR influence visual search performance and spatial regularity memory -- including the physical environment complexity, secondary task presence, virtual content depth and spatial layout configurations. Contrary to prior work, we found that the secondary auditory task did not have a significant main effect on visual search performance, while significantly elevating higher perceived workload measures in all conditions. Complex environments and varied virtual elements depths significantly hinder visual search, but did not significantly increase perceived workload measures. Finally, participants did not explicitly recognize repeated spatial configurations of virtual elements, but performed significantly better when searching repeated spatial configurations, suggesting implicit memory of spatial regularities. Our work presents novel insights on visual search and highlights key considerations when designing MR for different application contexts.
Authors:Alexandra Neagu, Marcus Messer, Peter Johnson, Rhodri Nelson
Abstract:
Providing scaffolding through educational chatbots built on Large Language Models (LLM) has potential risks and benefits that remain an open area of research. When students navigate impasses, they ask for help by formulating impasse-driven questions. Within interactions with LLM chatbots, such questions shape the user prompts and drive the pedagogical effectiveness of the chatbot's response. This paper focuses on such student questions from two datasets of distinct learning contexts: formative self-study, and summative assessed coursework. We analysed 6,113 messages from both learning contexts, using 11 different LLMs and three human raters to classify student questions using four existing schemas. On the feasibility of using LLMs as raters, results showed moderate-to-good inter-rater reliability, with higher consistency than human raters. The data showed that 'procedural' questions predominated in both learning contexts, but more so when students prepare for summative assessment. These results provide a basis on which to use LLMs for classification of student questions. However, we identify clear limitations in both the ability to classify with schemas and the value of doing so: schemas are limited and thus struggle to accommodate the semantic richness of composite prompts, offering only partial understanding the wider risks and benefits of chatbot integration. In the future, we recommend an analysis approach that captures the nuanced, multi-turn nature of conversation, for example, by applying methods from conversation analysis in discursive psychology.
Authors:Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, Satish Chandra
Abstract:
Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale.
Authors:Mikio Nakano, Hironori Takeuchi, Kazunori Komatani
Abstract:
This paper proposes a methodology for identifying evaluation items for practical dialogue systems. Traditionally, user satisfaction and user experiences have been the primary metrics for evaluating dialogue systems. However, there are various other evaluation items to consider when developing and operating practical dialogue systems, and such evaluation items are expected to lead to new research topics. So far, there has been no methodology for identifying these evaluation items. We propose identifying evaluation items based on business-dialogue system alignment models, which are applications of business-IT alignment models used in the development and operation of practical IT systems. We also present a generic model that facilitates the construction of a business-dialogue system alignment model for each dialogue system.
Authors:Rohit Kaushik, Eva Kaushik
Abstract:
We introduce a unified framework that combines nonlinear dynamics, perceptual psychophysics and high frequency haptic rendering to enhance realism in surgical simulation. The interaction of the surgical device with soft tissue is elevated to an augmented state space with a Koopman operator formulation, allowing linear prediction and control of the dynamics that are nonlinear by nature. To make the rendered forces consistent with human perceptual limits, we put forward a Bayesian calibration module based on WeberFechner and Stevens scaling laws, which progressively shape force signals relative to each individual's discrimination thresholds. For various simulated surgical tasks such as palpation, incision, and bone milling, the proposed system attains an average rendering latency of 4.3 ms, a force error of less than 2.8% and a 20% improvement in perceptual discrimination. Multivariate statistical analyses (MANOVA and regression) reveal that the system's performance is significantly better than that of conventional spring-damper and energy, based rendering methods. We end by discussing the potential impact on surgical training and VR, based medical education, as well as sketching future work toward closed, loop neural feedback in haptic interfaces.
Authors:Zhiyuan Liang, Enfang Cui, Qian Wei, Rui She, Tianzheng Li, Minxin Guo, Yujun Cheng
Abstract:
AI agents are increasingly deployed as autonomous systems capable of planning, tool use, and multi-agent collaboration across complex tasks. However, existing agent-related protocols focus on agent-to-agent interactions, leaving humans as external observers rather than integrated participants within the agent systems. This limitation arises from the lack of a standardized mechanism for agents to discover, address, and interact with humans across heterogeneous messaging platforms. In this paper, we propose the A2H (Agent-to-Human) protocol, a unified protocol that enables humans to be registered, discovered, and communicated with by AI agents as resolvable entities within agent systems. A2H contributes three key components: (1) Human Card for registering human identities via resolvable domain names, making them discoverable to agents; (2) Formal Communication Schema defines when, why, and how agents contact with human;(3) Unified Messaging Abstraction standardizes diverse communication medias and transforms complex JSON outputs into human-friendly formats. This work establishes a foundational protocol for integrating humans into agent ecosystems, advancing AI agents from isolated autonomous systems toward truly human-connected intelligent infrastructures.
Authors:Yancheng Cao, Yishu Ji, Chris Yue Fu, Sahiti Dharmavaram, Meghan Turchioe, Natalie C Benda, Lena Mamykina, Yuling Sun, Xuhai "Orson" Xu
Abstract:
Large language models (LLMs) have been increasingly adopted to support patients' healthcare-seeking in recent years. While prior patient-centered studies have examined the capabilities and experience of LLM-based tools in specific health-related tasks such as information-seeking, diagnosis, or decision-supporting, the inherently longitudinal nature of healthcare in real-world practice has been underexplored. This paper presents a four-week diary study with 25 patients to examine LLMs' roles across healthcare-seeking trajectories. Our analysis reveals that patients integrate LLMs not just as simple decision-support tools, but as dynamic companions that scaffold their journey across behavioral, informational, emotional, and cognitive levels. Meanwhile, patients actively assign diverse socio-technical meanings to LLMs, altering the traditional dynamics of agency, trust, and power in patient-provider relationships. Drawing from these findings, we conceptualize future LLMs as a longitudinal boundary companion that continuously mediates between patients and clinicians throughout longitudinal healthcare-seeking trajectories.
Authors:Baixiao Huang, Baiyu Huang, Yu Hou
Abstract:
Quadruped robots are employed in various scenarios in building construction. However, autonomous stair climbing across different indoor staircases remains a major challenge for robot dogs to complete building construction tasks. In this project, we employed a two-stage end-to-end deep reinforcement learning (RL) approach to optimize a robot's performance on U-shaped stairs. The training robot-dog modality, Unitree Go2, was first trained to climb stairs on Isaac Lab's pyramid-stair terrain, and then to climb a U-shaped indoor staircase using the learned policies. This project explores end-to-end RL methods that enable robot dogs to autonomously climb stairs. The results showed (1) the successful goal reached for robot dogs climbing U-shaped stairs with a stall penalty, and (2) the transferability from the policy trained on U-shaped stairs to deployment on straight, L-shaped, and spiral stair terrains, and transferability from other stair models to deployment on U-shaped terrain.
Authors:Janet G. Johnson, Ruijie Sophia Huang, Khoa Nguyen, Ji Young Nam, Michael Nebeling
Abstract:
Recent advancements in the conversational and social capabilities of generative AI (GenAI) have sparked interest in its role as an agent capable of actively participating in human-AI group discussions. Despite this momentum, we don't fully understand how GenAI shapes conversational dynamics or how the interface design impacts its influence on the group. In this paper, we introduce interface-driven social prominence as a design lens for collaborative GenAI systems. We then present a GenAI-based conversational agent that can actively engage in spoken dialogue during video calls and design three distinct collaboration modes that vary the social prominence of the agent by manipulating its presence in the shared space and the degree of control users have over its participation. A mixed-methods within-subjects study, in which 18 dyads engaged in realistic discussions with a GenAI agent, offers empirical insights into how communication patterns and the collective negotiation of GenAI's influence shift based on how it is embedded into the collaborative experience. Based on these findings, we outline design implications for supporting the coordination and critical engagement required in human-AI groups.
Authors:Nima Esmi, Maryam Nezhad-Moghaddam, Fatemeh Borhani, Asadollah Shahbahrami, Amin Daemdoost, Georgi Gaydadjiev
Abstract:
With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly, with accuracy dropping to around 50-53% for 20K posts. Notably, in the GPT-5 model, despite the sharp decline in accuracy, its precision remained high at approximately 95%, a feature that could be highly effective for sensitive applications like depression detection. This research also indicates that the "lost in the middle" problem has been largely resolved in newer models. This study emphasizes the gap between the theoretical capacity and the actual performance of models on complex, high-volume data tasks and highlights the importance of metrics beyond simple accuracy for practical applications.
Authors:Le Lin, Zihao Zhu, Rainbow Tin Hung Ho, Jing Liao, Yuhan Luo
Abstract:
Therapeutic art activities, such as expressive drawing and painting, require the synergy between creative visual production and interactive dialogue. Recent advancements in Multimodal Large Language Models (MLLMs) have expanded the capacity of computing systems to interpret both textual and visual data, offering a new frontier for AI-mediated therapeutic support. This work-in-progress paper introduces an MLLM-powered chatbot that analyzes visual creation in real-time while engaging the creator in reflective conversations. We conducted an evaluation with five experts in art therapy and related fields, which demonstrated the chatbot's potential to facilitate therapeutic engagement, and highlighted several areas for future development, including entryways and risk management, bespoke alignment of user profile and therapeutic style, balancing conversational depth and width, and enriching visual interactivity. These themes provide a design roadmap for designing the future AI-mediated creative expression tools.
Authors:Blessing Jerry, Lourdes Moreno, Paloma Martínez
Abstract:
LLM-generated interfaces are increasingly used in high-consequence workflows (e.g., healthcare communication), where how information is presented can impact downstream actions. These interfaces and their content support human interaction with AI-assisted decision-making and communication processes and should remain accessible and usable for people with disabilities. Accessible plain-language interfaces serve as an enabling infrastructure for meaningful human oversight. In these contexts, ethical and trustworthiness risks, including hallucinations, semantic distortion, bias, and accessibility barriers, can undermine reliability and limit users' ability to understand, monitor, and intervene in AI-supported processes. Yet, in practice, oversight is often treated as a downstream check, without clear rules for when human intervention is required or who is accountable. We propose oversight-by-design: embedding human judgment across the pipeline as an architectural commitment, implemented via escalation policies and explicit UI controls for risk signalling and intervention. Automated checks flag risk in generated UI communication that supports high-stakes workflows (e.g., readability, semantic fidelity, factual consistency, and standards-based accessibility constraints) and escalate to mandatory Human-in-the-Loop (HITL) review before release when thresholds are violated, or uncertainty is high. Human-on-the-Loop (HOTL) supervision monitors system-level signals over time (alerts, escalation rates, and compliance evidence) to tune policies and detect drift. Structured review feedback is translated into governance actions (rule and prompt updates, threshold calibration, and traceable audit logs), enabling scalable intervention and verifiable oversight for generative UI systems that support high-stakes workflows.
Authors:Shunsei Yamagishi, Lei Jing
Abstract:
Attitude and Heading Reference Systems (AHRSs) are broadly applied wherever reliable orientation and motion sensing is required. In this paper, we present an improved Cubature Kalman Filter (CKF) with lower computational cost while maintaining estimation accuracy, which is named "Kaisoku Cubature Kalman Filter (KCKF)". The computationally efficient equations of the KCKF are derived by simplifying those of the CKF, while preserving equivalent mathematical relations. The lightweight prediction equations in the KCKF are derived by expanding the summation terms in the CKF and simplifying the result. This paper shows that the KCKF requires fewer floating-point operations (FLOPs) than the CKF. The controlled experimental results show that the KCKF reduces the computation time by approximately 19% compared to the CKF on a high-performance computer, whereas the KCKF reduces the computation time by approximately 15% compared to the CKF on a low-cost single-board computer. In addition, the KCKF maintains the attitude estimation accuracy of the CKF.
Authors:Faezeh Vahedi, Morteza Memari, Ramtin Tabatabaei, Alireza Taheri
Abstract:
Nonverbal behaviors, particularly gaze direction, play a crucial role in enhancing effective communication in social interactions. As social robots increasingly participate in these interactions, they must adapt their gaze based on human activities and remain receptive to all cues, whether human-generated or not, to ensure seamless and effective communication. This study aims to increase the similarity between robot and human gaze behavior across various social situations, including both human and non-human stimuli (e.g., conversations, pointing, door openings, and object drops). A key innovation in this study, is the investigation of gaze responses to non-human stimuli, a critical yet underexplored area in prior research. These scenarios, were simulated in the Unity software as a 3D animation and a 360-degree real-world video. Data on gaze directions from 41 participants were collected via virtual reality (VR) glasses. Preprocessed data, trained two neural networks-LSTM and Transformer-to build predictive models based on individuals' gaze patterns. In the animated scenario, the LSTM and Transformer models achieved prediction accuracies of 67.6% and 70.4%, respectively; In the real-world scenario, the LSTM and Transformer models achieved accuracies of 72% and 71.6%, respectively. Despite the gaze pattern differences among individuals, our models outperform existing approaches in accuracy while uniquely considering non-human stimuli, offering a significant advantage over previous literature. Furthermore, deployed on the NAO robot, the system was evaluated by 275 participants via a comprehensive questionnaire, with results demonstrating high satisfaction during interactions. This work advances social robotics by enabling robots to dynamically mimic human gaze behavior in complex social contexts.
Authors:Bingyi Han, Ying Ma, Simon Coghlan, Dana McKay, George Buchanan, Wally Smith
Abstract:
AI technologies that sense student attention and emotions to enable more personalised teaching interventions are increasingly promoted, but raise pressing questions about student learning, well-being, and ethics. In particular, students' perspectives about AI sensing-intervention in learning are often overlooked. We conducted an online mixed-method experiment with Australian university students (N=132), presenting video scenarios varying by whether sensing was used (in-use vs. not-in-use), sensing modality (gaze-based attention detection vs. facial-based emotion detection), and intervention (by digital device vs. teacher). Participants also completed pairwise ranking tasks to prioritise six core ethical concerns. Findings revealed that students valued targeted intervention but responded negatively to AI monitoring, regardless of sensing methods. Students preferred system-generated hints over teacher-initiated assistance, citing learning agency and social embarrassment concerns. Students' ethical considerations prioritised autonomy and privacy, followed by transparency, accuracy, fairness, and learning beneficence. We advocate designing customisable, social-sensitive, non-intrusive systems that preserve student control, agency, and well-being.
Authors:Alireza Taheri, Minoo Alemi, Elham Ranjkar, Raman Rafatnejad, Ali F. Meghdari
Abstract:
This study centers around the design and implementation of the Maya Robot, a portable elephant-shaped social robot, intended to engage with children undergoing cancer treatment. Initial efforts were devoted to enhancing the robot's facial expression recognition accuracy, achieving a 98% accuracy through deep neural networks. Two subsequent preliminary exploratory experiments were designed to advance the study's objectives. The first experiment aimed to compare pain levels experienced by children during the injection process, with and without the presence of the Maya robot. Twenty-five children, aged 4 to 9, undergoing cancer treatment participated in this counterbalanced study. The paired T-test results revealed a significant reduction in perceived pain when the robot was actively present in the injection room. The second experiment sought to assess perspectives of hospitalized children and their mothers during engagement with Maya through a game. Forty participants, including 20 children aged 4 to 9 and their mothers, were involved. Post Human-Maya Interactions, UTAUT questionnaire results indicated that children experienced significantly less anxiety than their parents during the interaction and game play. Notably, children exhibited higher trust levels in both the robot and the games, presenting a statistically significant difference in trust levels compared to their parents (P-value < 0.05). This preliminary exploratory study highlights the positive impact of utilizing Maya as an assistant for therapy/education in a clinical setting, particularly benefiting children undergoing cancer treatment. The findings underscore the potential of social robots in pediatric healthcare contexts, emphasizing improved pain management and emotional well-being among young patients.
Authors:Kai Alexander Hackney, Lucas Guarenti Zangari, Jhonathan Sora-Cardenas, Emmanuel Munoz, Sterling R. Kalogeras, Betsy DiSalvo, Pedro Guillermo Feijoo-Garcia
Abstract:
To foster effective human-agent interactions, designers need to identify characteristics that could affect how agents are perceived and accepted, and to what extent they could impact rapport-building. Aiming to explore the role of user-agent synchrony, we assessed 388 participants to determine whether they could perceive personality traits from four artificial voices we selected and adapted from human samples, considering gender (male or female) and personality (introvert or extrovert) as grouping factors. Our findings suggest that participants were able to significantly differentiate female agents by personality, while male agents were not consistently distinguished. We also observed evidence of personality synchrony, where participants tended to perceive the first agent as more similar to their own personality, with this effect driven mainly by male participants, especially toward male agents. This paper contributes findings and insights to consider the interplay of user-agent personality and gender synchrony in the design of human-agent interactions.
Authors:Reese Kneeland, Wangshu Jiang, Ugo Bruzadin Nunes, Paul Steven Scotti, Arnaud Delorme, Jonathan Xu
Abstract:
To be practical for real-life applications, models for brain-computer interfaces must be easily and quickly deployable on new subjects, effective on affordable scanning hardware, and small enough to run locally on accessible computing resources. To directly address these current limitations, we introduce ENIGMA, a multi-subject electroencephalography (EEG)-to-Image decoding model that reconstructs seen images from EEG recordings and achieves state-of-the-art (SOTA) performance on the research-grade THINGS-EEG2 and consumer-grade AllJoined-1.6M benchmarks, while fine-tuning effectively on new subjects with as little as 15 minutes of data. ENIGMA boasts a simpler architecture and requires less than 1% of the trainable parameters necessary for previous approaches. Our approach integrates a subject-unified spatio-temporal backbone along with a set of multi-subject latent alignment layers and an MLP projector to map raw EEG signals to a rich visual latent space. We evaluate our approach using a broad suite of image reconstruction metrics that have been standardized in the adjacent field of fMRI-to-Image research, and we describe the first EEG-to-Image study to conduct extensive behavioral evaluations of our reconstructions using human raters. Our simple and robust architecture provides a significant performance boost across both research-grade and consumer-grade EEG hardware, and a substantial improvement in fine-tuning efficiency and inference cost. Finally, we provide extensive ablations to determine the architectural choices most responsible for our performance gains in both single and multi-subject cases across multiple benchmark datasets. Collectively, our work provides a substantial step towards the development of practical brain-computer interface applications.
Authors:Varchita Lalwani, Utkarsh Agarwal, Michael Saugstad, Manish Kumar, Jon E. Froehlich, Anupam Sobti
Abstract:
Project Sidewalk is a web-based platform that enables crowdsourcing accessibility of sidewalks at city-scale by virtually walking through city streets using Google Street View. The tool has been used in 40 cities across the world, including the US, Mexico, Chile, and Europe. In this paper, we describe adaptation efforts to enable deployment in Chandigarh, India, including modifying annotation types, provided examples, and integrating VLM-based mission guidance, which adapts instructions based on a street scene and metadata analysis. Our evaluation with 3 annotators indicates the utility of AI-mission guidance with an average score of 4.66. Using this adapted Project Sidewalk tool, we conduct a Points of Interest (POI)-centric accessibility analysis for three sectors in Chandigarh with very different land uses, residential, commercial and institutional covering about 40 km of sidewalks. Across 40 km of roads audited in three sectors and around 230 POIs, we identified 1,644 of 2,913 locations where infrastructure improvements could enhance accessibility.
Authors:Zhennan Yi, Sophia Sakakibara Capello, Randy Gomez, Selma Šabanović
Abstract:
While social robots have demonstrated effectiveness in supporting students' intercultural competence development, it is unclear how they can effectively be adopted for integrated use in K-12 schools. We conducted two phases of design workshops with teachers, where they co-designed robot-mediated intercultural activities while considering student needs and school integration concerns. Using thematic analysis, we identify appropriate scenarios and roles for classroom robots, explore how robots could complement rather than replace teachers, and consider how to address ethical and compliance considerations. Our findings provide practical design guidelines for the HRI community to develop social robots that can effectively support intercultural education in K-12 schools.
Authors:Ronald Cumbal, Marcus Göransson, Alexandros Rouchitsas, Didem Gürdür Broo, Ginevra Castellano
Abstract:
Participatory design effectively engages stakeholders in technology development but is often constrained by small, resource-intensive activities. This study explores a scalable complementary method, enabling broad pattern identification in the design for interfaces in autonomous vehicles. We implemented a human-centered, iterative process that combined crowd creativity, structured participatory principles, and expert feedback. Across iterations, participant concepts evolved from simple cues to multimodal systems. Novel suggestions ranged from personalized features, like tracking lights, to inclusive elements like haptic feedback, progressively refining designs toward greater contextual awareness. To assess outcomes, we compared representative designs: a popular-design, reflecting the most frequently proposed ideas, and an innovative-design, merging participant innovations with expert input. Both were evaluated against a benchmark through video-based simulations. Results show that the popular-design outperformed the alternatives on both interpretability and user experience, with expert-validated innovations performing second best. These findings highlight the potential of scalable participatory methods for shaping emerging technologies.
Authors:Franklin Mingzhe Li, Michael Xieyang Liu, Cynthia L. Bennett, Shaun K. Kane
Abstract:
Audio Description (AD) provides essential access to visual media for blind and low vision (BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually-driven interfaces. We present ADCanvas, a multimodal authoring system that supports non-visual control over audio description (AD) creation. ADCanvas combines conversational interaction with keyboard-based playback control and a plain-text, screen reader-accessible editor to support end-to-end AD authoring and visual question answering (VQA). Combining screen-reader-friendly controls with a multimodal LLM agent, ADCanvas supports live VQA, script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human-AI collaboration.
Authors:Mengyu Chen, Youngwook Do, Feiyu Lu, Kaiming Cheng, Blair MacIntyre
Abstract:
Mixed Reality (MR) technologies are increasingly adopted by enterprises to enhance remote collaboration, enabling users to share real-time views of their physical environments through head-mounted displays (HMDs). While MR spatial sharing offers significant benefits, it introduces complex security and privacy risks, particularly in balancing employee collaboration needs with enterprise data protection requirements across office and personal spaces. This paper investigates these challenges through formative interviews with employees and expert consultations with professionals in cybersecurity, IoT, technology risk, and corporate legal domains. We present a conceptual framework for secure MR spatial sharing in enterprise contexts and identify critical concerns and requirements for system design. Based on our findings, we offer actionable recommendations to guide the development of secure and privacy-preserving MR spatial sharing solutions for future enterprise deployments.
Authors:Xiaodan Hu, Monica Perusquía-Hernández, Mayra Donaji Barrera Machuca, Anil Ufuk Batmaz, Yan Zhang, Wolfgang Stuerzlinger, Kiyoshi Kiyokawa
Abstract:
This paper investigates whether a custom varifocal display can improve 3D pointing performance in augmented reality (AR), where the vergence-accommodation conflict (VAC) is known to impair interaction. Varifocal displays have been hypothesized to alleviate the VAC by dynamically matching the focal distance to the user's gaze-defined target depth. Following prior work, we conducted a within-subject study with 24 participants performing an ISO 9241-411 pointing task under varifocal and fixed-focal viewing. Overall, varifocal viewing yielded significantly higher performance than the fixed-focal baseline across key interaction metrics, although the magnitude and even the direction of the benefit varied across individuals. In particular, participants' responses exhibited a baseline-dependent pattern, with smaller improvements (or occasional degradation) observed for those with better baseline performance. Our findings suggest that varifocal technology can improve AR pointing performance relative to fixed-focal viewing, while highlighting substantial individual differences that should be considered in design and evaluation.
Authors:Jeongmin Rhee, Changhee Lee, DongHwa Shin, Bohyoung Kim
Abstract:
Explainable Artificial Intelligence (XAI) has gained importance in interpreting model predictions. Among leading techniques for XAI, Local Interpretable Model-agnostic Explanations (LIME) is most frequently utilized as it notably helps people's understanding of complex models. However, LIME's analysis is constrained to a single image at a time. Besides, it lacks interaction mechanisms for observing the LIME's results and direct manipulations of factors affecting the results. To address these issues, we introduce an interactive visualization tool, LIMEVis, which improves the analysis workflow of LIME by enabling users to explore multiple LIME results simultaneously and modify them directly. With LIMEVis, we could conveniently identify common features in images that a model seems to mainly consider for category classification. Additionally, by interactively modifying the LIME results, we could determine which segments in an image influence the model's classification.
Authors:Suvadeep Mukherjee, Björn Rohles, Gabriele Lenzini, Pedro Cardoso-Leite
Abstract:
Remote unproctored assessments increasingly use messaging interventions to reduce cheating, but existing approaches lack theoretical grounding, focus narrowly on cheating suppression while overlooking performance and experience, and treat cheating as binary rather than continuous. This study examines whether messages based on 15 psychological concepts from self-determination, cognitive dissonance, social norms, and self-efficacy theories can reduce cheating while preserving performance and experience. Through an expert workshop (N=5), we developed 45 theory-informed messages and tested them with online participants (N=1232) who completed an incentivized anagram task. Participants were classified as non-cheaters (0% items cheated), partial-cheaters (1-99% cheated), or full-cheaters (100% cheated). Results show that concept-based messages reduced full-cheating occurrence by 42% (33% to 19%), increased non-cheating by 19% (53% to 63%), with no negative effects on performance or experience across integrity groups. Surprisingly, messages grounded in different theoretical concepts produced virtually identical effects. Analyses of self-rated psychological mechanisms revealed that messages influenced multiple mechanisms simultaneously rather than their intended targets, though these mechanisms predicted behavior, performance, and experience. These findings show that causal pathways are more complex than current theories predict. Practically, integrity interventions using supportive motivation rather than rule enforcement can reduce cheating without impairing performance or experience.
Authors:Felicia Fang-Yi Tan, Oded Nov
Abstract:
System-imposed wait times can significantly disrupt digital workflows, affecting user experience and task performance. Prior HCI research has examined how temporal feedback, such as feedback mode (Elapsed-Time vs. Remaining-Time) shapes wait-time perception. However, few studies have investigated how such feedback influences users' downstream task performance, as well as overall affective and cognitive experience. To study these effects, we conducted an online experiment where 425 participants performing a visual reasoning task experienced a 10-, 30-, or 60-second wait with a Remaining-Time, Elapsed-Time, or No Time Display. Findings show that temporal feedback mode shapes how waiting is perceived: Remaining-Time feedback increased frustration relative to Elapsed-Time feedback, while No Time Display made waits feel longer and heightened ambiguity. Notably, these experiential differences did not translate into differences in post-wait task performance. Integrating psychophysical and cognitive science perspectives, we discuss implications for implementing temporal feedback in latency-prone digital systems.
Authors:Nandini Sharma, Thomas Bock, Rich Bowen, Sayeed Choudhury, Brian Fitzgerald, Matt Germonprez, Jim Herbsleb, James Howison, Tom Hughes, Min Kyung Lee, Stephanie Lieggi, Andreas Liesenfeld, Georg Link, Nicholas Matsakis, Audris Mockus, Narayan Ramasubbu, Christopher Robinson, Gregorio Robles, Nithya Ruff, Sonali Shah, Igor Steinmacher, Bogdan Vasilescu, Stephen Walli, Christopher Yoo
Abstract:
Open source software ecosystems are composed of a variety of stakeholders including but not limited to non-profit organizations, volunteer contributors, users, and corporations. The needs and motivations of these stakeholders are often diverse, unknown, and sometimes even conflicting given the engagement and investment of both volunteers and corporate actors. Given this, it is not clear how open source communities identify and engage with their stakeholders, understand their needs, and hold themselves accountable to those needs. We convened 24 expert scholars and practitioners studying and working with open source software communities for an exploratory workshop discussion on these ideas. The workshop titled "Accountability and Open Source Software Ecosystems" was organized on Oct 14-15 on campus in Carnegie Mellon University, Pittsburgh, PA. The purpose of this in-person workshop was to initiate conversations that explore important and urgent questions related to the role of accountability in open source software ecosystems, and to inspire an exciting research agenda and meaningful stakeholder engagement ideas for practitioners.
Authors:Danqing Shi, Lan Jiang, Katherine M. Collins, Shangzhe Wu, Ayush Tewari, Miri Zilka
Abstract:
The growing prevalence of realistic AI-generated videos on media platforms increasingly blurs the line between fact and fiction, eroding public trust. Understanding how people watch AI-generated videos offers a human-centered perspective for improving AI detection and guiding advancements in video generation. However, existing studies have not investigated human gaze behavior in response to AI-generated videos of physical scenes. Here, we collect and analyze the eye movements from 40 participants during video understanding and AI detection tasks involving a mix of real-world and AI-generated videos. We find that given the high realism of AI-generated videos, gaze behavior is driven less by the video's actual authenticity and more by the viewer's perception of its authenticity. Our results demonstrate that the mere awareness of potential AI generation may alter media consumption from passive viewing into an active search for anomalies.
Authors:Yoshee Jain, Heejin Do, Zihan Wu, April Yi Wang
Abstract:
AI-powered planning tools show promise in supporting programming learners by enabling early, formative feedback on their thinking processes prior to coding. To date, however, most AI-supported planning tools rely on students' natural-language explanations, using LLMs to interpret learners' descriptions of their algorithmic intent. Prior to the emergence of LLM-based systems, CS education research extensively studied trace-based planning in pen-and-paper settings, demonstrating that reasoning through stepwise execution with explicit state transitions helps learners build and refine mental models of program behavior. Despite its potential, little is known about how tracing interacts with AI-mediated feedback and whether integrating tracing into AI-supported planning tools leads to different learning processes or interaction dynamics compared to natural-language-based planning alone. We study how requiring learners to produce explicit execution traces with an AI-supported planning tool affects their algorithmic reasoning. In a between-subjects study with 20 students, tracing shifted learners away from code-like, line-by-line descriptions toward more goal-driven reasoning about program behavior. Moreover, it led to more consistent partially correct solutions, although final coding performance remained comparable across conditions. Notably, tracing did not significantly affect the quality or reliability of LLM-generated feedback. These findings reveal tradeoffs in combining tracing with AI-supported planning and inform design guidelines for integrating natural language, tracing, and coding to support iterative reasoning throughout the programming process.
Authors:Veith Weilnhammer, Kevin YC Hou, Raymond Dolan, Matthew M Nour
Abstract:
Millions of users turn to consumer AI chatbots to discuss behavioral and mental health concerns. While this presents unprecedented opportunities to deliver population-level support, it also highlights an urgent need to develop rigorous and scalable safety evaluations. Here we introduce SIM-VAIL, an AI chatbot auditing framework that captures how harmful AI chatbot responses manifest across a range of mental-health contexts. SIM-VAIL pairs a simulated human user, harboring a distinct psychiatric vulnerability and conversational intent, with an audited frontier AI chatbot. It scores conversation turns on 13 clinically relevant risk dimensions, enabling context-dependent, temporally resolved assessment of mental-health risk. Across 810 conversations, encompassing over 90,000 turn-level ratings and 30 psychiatric user profiles, we find that significant risk occurs across virtually all user phenotypes. Risk manifested across most of the 9 consumer AI chatbot models audited, albeit mitigated in more modern variants. Rather than arising abruptly, risk accumulated over multiple turns. Risk profiles were phenotype-dependent, indicating that behaviors that appear supportive in general settings are liable to be maladaptive when they align with mechanisms that sustain a user's vulnerability. Multivariate risk patterns revealed trade-offs across dimensions, suggesting that mitigation targeting one harm domain can exacerbate others. These findings identify a novel failure mode in human-AI interactions, which we term Vulnerability-Amplifying Interaction Loops (VAILs), and underscore the need for multi-dimensional approaches to risk quantification. SIM-VAIL provides a scalable evaluation framework for quantifying how mental-health risk is distributed across user phenotypes, conversational trajectories, and clinically grounded behavioral dimensions, offering a foundation for targeted safety improvements.
Authors:Ailin Liu, Yesmine Karoui, Fiona Draxler, Frauke Kreuter, Francesco Chiossi
Abstract:
Difficulty spillover and suboptimal help-seeking challenge the sequential, knowledge-intensive nature of digital tasks. In online surveys, tough questions can drain mental energy and hurt performance on later questions, while users often fail to recognize when they need assistance or may satisfy, lacking motivation to seek help. We developed a proactive, adaptive system using electrodermal activity and mouse movement to predict when respondents need support. Personalized classifiers with a rule-based threshold adaptation trigger timely LLM-based clarifications and explanations. In a within-subjects study (N=32), aligned-adaptive timing was compared to misaligned-adaptive and random-adaptive controls. Aligned-adaptive assistance improved response accuracy by 21%, reduced false negative rates from 50.9% to 22.9%, and improved perceived efficiency, dependability, and benevolence. Properly timed interventions prevent cascades of degraded responses, showing that aligning support with cognitive states improves both the outcomes and the user experience. This enables more effective, personalized LLM-assisted support in survey-based research.
Authors:Suifang Zhou, Ray LC
Abstract:
Climate action is difficult to persuade because we tend to perceive climate change as remote and disconnected from daily life. Instead of traditional informational engagements, game-based interventions can create narratives that immerse the visitor in situations where their actions have tangible consequences. To make these narratives engaging, we used a speculative scenario of an alien stumbling upon social media to obliquely address climate change through a text-based adventure game installation. Mimicking visitors' natural dialogue in social media apps, we designed an LLM-based chatbot with knowledge of post-climate devastated world that mirrors our own planet Earth. In discovering the world's downfall through interactive chatting and posted images, players begin to realize that their own actions can make a difference on impacts of climate change in this distant world, fostering pro-environmental attitudes. Previously published at CHI, this game installation demonstrates the potential of LLM based creative narratives in exploring speculative worlds driving social change.
Authors:He Wang, Ziyu Zhou, Hanxiang Liu
Abstract:
In the increasingly prevalent landscape of high-quality service contexts, whether consumer evaluation interfaces adopt a rating-first or review-first sequence has become a critical factor shaping rating authenticity and feedback quality. While prior research has primarily examined review content and sentiment, systematic investigation into how evaluation order influences rating outcomes remains limited. Through exploratory analyses, we find that Letterboxd -- which employs a review-first, rating-after mechanism -- exhibits a more centralized rating distribution with fewer extreme scores, whereas Yelp -- which adopts a rating-first, review-after mechanism -- shows a pronounced bimodal distribution with more polarized ratings. Three controlled experiments further demonstrate that in high-quality service contexts, a rating-first (vs. review-first) interface significantly elevates consumers' overall ratings. Mechanism analyses indicate that cognitive effort and affective heuristics serve as dual pathways: a rating-first (vs. review-first) sequence reduces cognitive effort and heightens affective heuristics, thereby increasing rating scores. Moreover, service quality moderates this process. When service quality is low, the rating-first (vs. review-first) sequence instead leads to lower ratings. This research reveals the psychological mechanisms through which evaluation order affects consumer ratings via cognitive and affective pathways. It extends theoretical understanding of online rating formation and offers practical implications for optimizing platform interface design to enhance rating authenticity and credibility.
Authors:Owen Hoffman, Kangze Peng, Sajid Kamal, Zehua You, Sukrit Venkatagiri
Abstract:
Fraud continues to proliferate online, from phishing and ransomware to impersonation scams. Yet automated prevention approaches adapt slowly and may not reliably protect users from falling prey to new scams. To better combat online scams, we developed ScamPilot, a conversational interface that inoculates users against scams through simulation, dynamic interaction, and real-time feedback. ScamPilot simulates scams with two large language model-powered agents: a scammer and a target. Users must help the target defend against the scammer by providing real-time advice. Through a between-subjects study (N=150) with one control and three experimental conditions, we find that blending advice-giving with multiple choice questions significantly increased scam recognition (+8%) without decreasing wariness towards legitimate conversations. Users' response efficacy and change in self-efficacy was also 9% and 19% higher, respectively. Qualitatively, we find that users more frequently provided action-oriented advice over urging caution or providing emotional support. Overall, ScamPilot demonstrates the potential for inter-agent conversational user interfaces to augment learning.
Authors:Javier Argota Sánchez-Vaquerizo, Luis Borunda Monsivais
Abstract:
Traditional architectural simulations (e.g. Computational Fluid Dynamics, evacuation, structural analysis) model elements as deterministic physics-based "particles" rather than cognitive "agents". To bridge this, we introduce \textbf{Agentic Environmental Simulations}, where Large Multimodal generative models actively predict the next state of spatial environments based on semantic expectation. Drawing on examples from accessibility-oriented AR pipelines and multimodal digital twins, we propose a shift from chronological time-steps to Episodic Spatial Reasoning, where simulations advance through meaningful, surprisal-triggered events. Within this framework we posit AI hallucinations as diagnostic tools. By formalizing the \textbf{Cognitive Friction} ($C_f$) it is possible to reveal "Phantom Affordances", i.e. semiotic ambiguities in built space. Finally, we challenge current HCI paradigms by treating environments as dynamic cognitive partners and propose a human-centered framework of cognitive orchestration for designing AI-driven simulations that preserve autonomy, affective clarity, and cognitive integrity.
Authors:Bowen Zhou, Marc-André Fiedler, Ayoub Al-Hamadi
Abstract:
Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance.
Authors:Fabian Albers, Sebastian Strauß, Nikol Rummel, Nils Köbis
Abstract:
Mutual trust between teachers and students is a prerequisite for effective teaching, learning, and assessment in higher education. Accurate predictions about the other group's use of generative artificial intelligence (AI) are fundamental for such trust. However, the disruptive rise of AI has transformed academic work practices, raising important questions about how teachers and students use these tools and how well they can estimate each other's usage. While the frequency of use is well studied, little is known about how AI is used, and comparisons with similar practices are rare. This study surveyed German university teachers (N = 113) and students (N = 123) on the frequency of AI use and the degree of delegation across six identical academic tasks. Participants also provided incentivized cross-sample predictions of the other group's AI use to assess the accuracy of their predictions. We find that students reported higher use of AI and greater delegation than teachers. Both groups significantly overestimated the other group's use, with teachers predicting very frequent use and high delegation by students, and students assuming teachers use AI similarly to themselves. These findings reveal a perception gap between teachers' and students' expectations and actual AI use. Such gaps may hinder trust and effective collaboration, underscoring the need for open dialogue about AI practices in academia and for policies that support the equitable and transparent integration of AI tools in higher education.
Authors:Hassam Tahir, Faizan Faisal, Fady Alnajjar, Muhammad Imran Taj, Lucia Gordon, Aila Khan, Michael Lwin, Omar Mubin
Abstract:
This paper presents a framework for integrating LLM into collaborative learning platforms to enhance student engagement, critical thinking, and inclusivity. The framework employs advanced LLMs as dynamic moderators to facilitate real-time discussions and adapt to learners' evolving needs, ensuring diverse and inclusive educational experiences. Key innovations include robust feedback mechanisms that refine AI moderation, promote reflective learning, and balance participation among users. The system's modular architecture featuring ReactJS for the frontend, Flask for backend operations, and efficient question retrieval supports personalized and engaging interactions through dynamic adjustments to prompts and discussion flows. Testing demonstrates that the framework significantly improves student collaboration, fosters deeper comprehension, and scales effectively across various subjects and user groups. By addressing limitations in static moderation and personalization in existing systems, this work establishes a strong foundation for next-generation AI-driven educational tools, advancing equitable and impactful learning outcomes.
Authors:Haoming Huang, Pongchai Jaisri, Shota Shimizu, Lingfeng Chen, Sota Nakashima, Gema Rodríguez-Pérez
Abstract:
Large Language Model (LLM) Agents are advancing quickly, with the increasing leveraging of LLM Agents to assist in development tasks such as code generation. While LLM Agents accelerate code generation, studies indicate they may introduce adverse effects on development. However, existing metrics solely measure pass rates, failing to reflect impacts on long-term maintainability and readability, and failing to capture human intuitive evaluations of PR. To increase the comprehensiveness of this problem, we investigate and evaluate the characteristics of LLM to know the pull requests' characteristics beyond the pass rate. We observe the code quality and maintainability within PRs based on code metrics to evaluate objective characteristics and developers' reactions to the pull requests from both humans and LLM's generation. Evaluation results indicate that LLM Agents frequently disregard code reuse opportunities, resulting in higher levels of redundancy compared to human developers. In contrast to the quality issues, our emotions analysis reveals that reviewers tend to express more neutral or positive emotions towards AI-generated contributions than human ones. This disconnect suggests that the surface-level plausibility of AI code masks redundancy, leading to the silent accumulation of technical debt in real-world development environments. Our research provides insights for improving human-AI collaboration.
Authors:Po-Hsun Chen, Ivan C. H. Liu
Abstract:
This paper presents AnthropoCam, a mobile-based neural style transfer (NST) system optimized for the visual synthesis of Anthropocene environments. Unlike conventional artistic NST, which prioritizes painterly abstraction, stylizing human-altered landscapes demands a careful balance between amplifying material textures and preserving semantic legibility. Industrial infrastructures, waste accumulations, and modified ecosystems contain dense, repetitive patterns that are visually expressive yet highly susceptible to semantic erosion under aggressive style transfer. To address this challenge, we systematically investigate the impact of NST parameter configurations on the visual translation of Anthropocene textures, including feature layer selection, style and content loss weighting, training stability, and output resolution. Through controlled experiments, we identify an optimal parameter manifold that maximizes stylistic expression while preventing semantic erasure. Our results demonstrate that appropriate combinations of convolutional depth, loss ratios, and resolution scaling enable the faithful transformation of anthropogenic material properties into a coherent visual language. Building on these findings, we implement a low-latency, feed-forward NST pipeline deployed on mobile devices. The system integrates a React Native frontend with a Flask-based GPU backend, achieving high-resolution inference within 3-5 seconds on general mobile hardware. This enables real-time, in-situ visual intervention at the site of image capture, supporting participatory engagement with Anthropocene landscapes. By coupling domain-specific NST optimization with mobile deployment, AnthropoCam reframes neural style transfer as a practical and expressive tool for real-time environmental visualization in the Anthropocene.
Authors:Xuyi Hu, Ke Ma, Siwei Liu, Per Ola Kristensson, Stephan Goetz
Abstract:
Accurate neuronavigation is essential for generating the intended effect with transcranial magnetic stimulation (TMS). Precise coil placement also directly influences stimulation efficacy. Traditional neuronavigation systems often rely on costly and still hard to use and error-prone tracking systems. To solve these limitations, we present a computer-vision-based neuronavigation system for real-time tracking of patient and TMS instrumentation. The system can feed the necessary data for a digital twin to track TMS stimulation targets. We integrate a self-coordinating optical tracking system with multiple consumer-grade cameras and visible tags with a dynamic 3D brain model in Unity. This model updates in real time to represent the current stimulation coil position and the estimated stimulation point to intuitively visualize neural targets for clinicians. We incorporate an augmented reality (AR) module to bridge the gap between the visualization of the digital twin and the real world and project the brain model in real-time onto the head of a patient. AR headsets or mobile AR devices allow clinicians to interactively view and adjust the placement of the stimulation transducer intuitively instead of guidance through abstract numbers and 6D cross hairs on an external screen. The proposed technique provides improved spatial precision as well as accuracy. A case study with ten participants with a medical background also demonstrates that the system has high usability.
Authors:Boyu Li, Lin-Ping Yuan, Zeyu Wang, Hongbo Fu
Abstract:
Sketching provides an intuitive way to convey dynamic intent in animation authoring (i.e., how elements change over time and space), making it a natural medium for automatic content creation. Yet existing approaches often constrain sketches to fixed command tokens or predefined visual forms, overlooking their freeform nature and the central role of humans in shaping intention. To address this, we introduce an interaction paradigm where users convey dynamic intent to a vision-language model via free-form sketching, instantiated here in a sketch storyboard to motion graphics workflow. We implement an interface and improve it through a three-stage study with 24 participants. The study shows how sketches convey motion with minimal input, how their inherent ambiguity requires users to be involved for clarification, and how sketches can visually guide video refinement. Our findings reveal the potential of sketch and AI interaction to bridge the gap between intention and outcome, and demonstrate its applicability to 3D animation and video generation.
Authors:Nico Mutzner, Taha Yasseri, Heiko Rauhut
Abstract:
The introduction of artificial intelligence (AI) agents into human group settings raises essential questions about how these novel participants influence cooperative social norms. While previous studies on human-AI cooperation have primarily focused on dyadic interactions, little is known about how integrating AI agents affects the emergence and maintenance of cooperative norms in small groups. This study addresses this gap through an online experiment using a repeated four-player Public Goods Game (PGG). Each group consisted of three human participants and one bot, which was framed either as human or AI and followed one of three predefined decision strategies: unconditional cooperation, conditional cooperation, or free-riding. In our sample of 236 participants, we found that reciprocal group dynamics and behavioural inertia primarily drove cooperation. These normative mechanisms operated identically across conditions, resulting in cooperation levels that did not differ significantly between human and AI labels. Furthermore, we found no evidence of differences in norm persistence in a follow-up Prisoner's Dilemma, or in participants' normative perceptions. Participants' behaviour followed the same normative logic across human and AI conditions, indicating that cooperation depended on group behaviour rather than partner identity. This supports a pattern of normative equivalence, in which the mechanisms that sustain cooperation function similarly in mixed human-AI and all human groups. These findings suggest that cooperative norms are flexible enough to extend to artificial agents, blurring the boundary between humans and AI in collective decision-making.
Authors:Judy Hanwen Shen, Alex Tamkin
Abstract:
AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively supervise AI remains unclear. Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation -- particularly in safety-critical domains.
Authors:Jeremy Foote, Deepak Kumar, Bedadyuti Jha, Ryan Funkhouser, Loizos Bitsikokos, Hitesh Goel, Hsuen-Chi Chiu
Abstract:
Generative AI chatbots have proven surprisingly effective at persuading people to change their beliefs and attitudes in lab settings. However, the practical implications of these findings are not yet clear. In this work, we explore the impact of rehabilitative conversations with generative AI chatbots on users who share toxic content online. Toxic behaviors -- like insults or threats of violence, are widespread in online communities. Strategies to deal with toxic behavior are typically punitive, such as removing content or banning users. Rehabilitative approaches are rarely attempted, in part due to the emotional and psychological cost of engaging with aggressive users. In collaboration with seven large Reddit communities, we conducted a large-scale field experiment (N=893) to invite people who had recently posted toxic content to participate in conversations with AI chatbots. A qualitative analysis of the conversations shows that many participants engaged in good faith and even expressed remorse or a desire to change. However, we did not observe a significant change in toxic behavior in the following month compared to a control group. We discuss possible explanations for our findings, as well as theoretical and practical implications based on our results.
Authors:Lekshmi Murali Rani, Richard Berntsson Svensson, Robert Feldt
Abstract:
As GenAI models are adopted to support software engineers and their development teams, understanding effective human-AI collaboration (HAIC) is increasingly important. Socio-emotional intelligence (SEI) enhances collaboration among human teammates, but its role in HAIC remains unclear. Current AI systems lack SEI capabilities that humans bring to teamwork, creating a potential gap in collaborative dynamics. In this study, we investigate how software practitioners perceive the socio-emotional gap in HAIC and what capabilities AI systems require for effective collaboration. Through semi-structured interviews with 10 practitioners, we examine how they think about collaborating with human versus AI teammates, focusing on their SEI expectations and the AI capabilities they envision. Results indicate that practitioners currently view AI models as intellectual teammates rather than social partners and expect fewer SEI attributes from them than from human teammates. However, they see the socio-emotional gap not as AIs failure to exhibit SEI traits, but as a functional gap in collaborative capabilities (AIs inability to negotiate responsibilities, adapt contextually, or maintain sustained partnerships). We introduce the concept of functional equivalents: technical capabilities (internal cognition, contextual intelligence, adaptive learning, and collaborative intelligence) that achieve collaborative outcomes comparable to human SEI attributes. Our findings suggest that effective collaboration with AI for SE tasks may benefit from functional design rather than replicating human SEI traits for SE tasks, thereby redefining collaboration as functional alignment.
Authors:Onyedikachi Hope Amaechi-Okorie, Branislav Radeljic
Abstract:
Speech remains one of the most visible yet overlooked vectors of inclusion and exclusion in contemporary society. While fluency is often equated with credibility and competence, individuals with atypical speech patterns are routinely marginalized. Given the current state of the debate, this article focuses on the structural biases that shape perceptions of atypical speech and are now being encoded into artificial intelligence. Automated speech recognition (ASR) systems and voice interfaces, trained predominantly on standardized speech, routinely fail to recognize or respond to diverse voices, compounding digital exclusion. As AI technologies increasingly mediate access to opportunity, the study calls for inclusive technological design, anti-bias training to minimize the impact of discriminatory algorithmic decisions, and enforceable policy reform that explicitly recognize speech diversity as a matter of equity, not merely accessibility. Drawing on interdisciplinary research, the article advocates for a cultural and institutional shift in how we value voice, urging co-created solutions that elevate the rights, representation, and realities of atypical speakers in the digital age. Ultimately, the article reframes speech inclusion as a matter of equity (not accommodation) and advocates for co-created AI systems that reflect the full spectrum of human voices.
Authors:Shaozhang Dai, Kadek Ananta Satriadi, Jim Smiley, Barrett Ens, Lonni Besançon, Tim Dwyer
Abstract:
We introduce the notion of an Active Proxy interface, i.e. tangible models as proxies for physical data referents, supporting interactive exploration of data through active manipulation. We realise an active proxy data visualisation system, "MarioChart", using robot carts relocating themselves on a tabletop, e.g., to align with their data referents in a map or other visual layout. We consider a casual-data exploration scenario involving a multivariate campus sustainability dataset, using scale models as proxies for their physical building data referents. Our empirical study (n=12) compares active proxy use with conventional tablet interaction, finding that our active proxy system enhances short-term spatial memory of data and enables faster completion of certain data analytic tasks. It shows no significant differences compared to traditional touch-screens in long-term memory, physical fatigue, mental workload, or user engagement. Our study offers an initial baseline for active proxy techniques and advances understanding of tangible interfaces in situated data visualisation.
Authors:Mohammad Hadi Nezhad, Francisco Enrique Vicente Castro, Ivon Arroyo
Abstract:
Conversational agents (CAs) (e.g., chatbots) are increasingly used in settings where users disclose sensitive information, raising significant privacy concerns. Because privacy judgments are highly contextual, supporting users to engage in privacy-protective actions during chatbot interactions is essential. However, enabling meaningful engagement requires a deeper understanding of how users currently reason about and manage sensitive information during realistic chatbot use scenarios. To investigate this, we qualitatively examined computer science (undergraduate and masters) students' in-the-moment disclosure and protection behaviors, as well as the reasoning underlying these behaviors, across a range of realistic chatbot tasks. Participants used a simulated ChatGPT interface with and without a privacy notice panel that intercepts message submissions, highlights potentially sensitive information, and offers privacy protective actions. The panel supports anonymization through retracting, faking, and generalizing, and surfaces two of ChatGPT's built-in privacy controls to improve their discoverability. Drawing on interaction logs, think-alouds, and survey responses, we analyzed how the panel fostered privacy awareness, encouraged protective actions, and supported context-specific reasoning about what information to protect and how. We further discuss design opportunities for tools that provide users greater and more meaningful agency in protecting sensitive information during CA interactions.
Authors:Ailin Liu, Francesco Chiossi, Felix Henninger, Lisa Bondo Andersen, Tobias Wistuba, Sonja Greven, Frauke Kreuter, Fiona Draxler
Abstract:
Time pressure and question difficulty can trigger stress and cognitive overload in web-based surveys, compromising data quality and user experience. Most stress detection methods are based on low-resolution self-reports, which are poorly suited for capturing fast, moment-to-moment changes during short online tasks. Addressing this gap, we conducted a 2x2 within-subjects study (N = 29), manipulating question difficulty and time pressure in a web-based multiple-choice task. Participants completed general knowledge and cognitive questions while we collected multimodal data: mouse dynamics, eye tracking, electrocardiogram, and electrodermal activity. Using condition-based and self-reported labels, we used statistical and machine learning models to model stress and question difficulty. Our results show distinct physiological and behavioral patterns within very short timeframes. This work demonstrates the feasibility of rapidly detecting cognitive-affective states in digital environments, paving the way for more adaptive, ethical, and user-aware survey interfaces.
Authors:Javier Crespo, Ana Enériz, Paula Iruzubieta, Fernando Carballo, Conrado Fernández Rodríguez, María Dolores Martín-Arranz, Federico Argüelles-Arias, Juan Turnes
Abstract:
Background: Artificial intelligence (AI) has emerged as a disruptive innovation in medicine, yet its adoption within gastroenterology remains limited and poorly characterized. We aimed to examine knowledge, practical applications, perceived barriers, and expectations regarding AI among gastroenterology specialists in Spain. Methods: We conducted a cross-sectional observational study using a structured online survey distributed by the Spanish Society of Digestive Pathology (SEPD) in 2025. The questionnaire collected sociodemographic data, patterns of AI use, perceptions, and educational needs. Descriptive statistics and multivariable models were applied. Results: Among 283 respondents (mean age 44.6 +/- 9.7 years), 87.5% acknowledged AI as a transformative tool, but only 60.2% (95% CI: 54.3-66.1%) reported using it, mostly outside institutional frameworks. Notably, 80.2% of users initiated AI use within the past year. Independent predictors of frequent use included previous training (OR=2.44), employment in university hospitals (OR=2.14), and younger age (OR=1.36 per 5-year decrease). Main barriers were lack of training (61%), absence of institutional strategies (46%), and ethical concerns (50%). While 93.8% agreed that AI training programmes are necessary, only 18.4% had received formal training. Conclusions: A substantial gap exists between the favorable perception of AI and its actual integration into clinical practice within Spanish gastroenterology. The rapid adoption outside institutional frameworks underscores the urgent need for accredited training programmes and governance standards led by scientific societies.
Authors:Lin Kyi, Paul Gölz, Robin Berjon, Asia Biega
Abstract:
Obtaining meaningful and informed consent from users is essential for ensuring autonomy and control over one's data. Notice and consent, the standard for collecting consent, has been criticized. While other individualized solutions have been proposed, this paper argues that a collective approach to consent is worth exploring. First, individual consent is not always feasible to collect for all data collection scenarios. Second, harms resulting from data processing are often communal in nature, given the interconnected nature of some data. Finally, ensuring truly informed consent for every individual has proven impractical. We propose collective consent, operationalized through consent assemblies, as one alternative framework. We establish collective consent's theoretical foundations and use speculative design to envision consent assemblies leveraging deliberative mini-publics. We present two vignettes: i) replacing notice and consent, and ii) collecting consent for GenAI model training. Our paper employs future backcasting to identify the requirements for realizing collective consent and explores its potential applications in contexts where individual consent is infeasible.
Authors:Steffen Holter, Eunyee Koh, Mustafa Doga Dogan, Gromit Yeuk-Yin Chan
Abstract:
Simulated user agents are increasingly used in usability testing to support fast, iterative UX workflows, as they generate rich data such as action logs and think-aloud reasoning, but the unstructured nature of this output often obscures actionable insights. We present UXCascade, an interactive tool for extracting, aggregating, and presenting agent-generated usability feedback at scale. Our core contribution is a multi-level analysis workflow that (1) highlights patterns across persona traits, goals, and outcomes, (2) links agent reasoning to specific issues, and (3) supports actionable design improvements. UXCascade operationalizes this approach by listing agent goals, traits, and issues in a structured overview. Practitioners can explore detailed reasoning traces and annotated views, propose interface edits, and assess their impact across personas. This enables a top-down, exploration-driven analysis from patterns to concrete UX interventions. A user study with eight UX professionals demonstrates that UXCascade integrates into existing workflows, enabling iterative feedback during early-stage interface development.
Authors:Jana Franceska Funke, Ria Matapurkar, Enrico Rukzio, Teresa Hirzle
Abstract:
It is obvious that emotions are causal variables of motivation, as they elicit states, forces and energies that trigger and guide labor behavior. Thus, a motivational tension that is not informed by needs alone, but also by emotions, intention, goals and means to achieve them is therefore generated within the mental, emotional and physical plane. Based on Montserrat's opinion (2004: 131), that "to motivate means, above all, to move and to transmit an emotion", we will undertake to identify the mutual influences between emotions and motivation. The main objectives of this article are to display a summary of the theories and definitions about emotions and to explore the links between emotions and motivation. Although interconnected, emotions and motivation can be contemplated from a double perspective: (1) emotions influence motivation and (2) motivation influences emotions. Moreover, we will consider motivation from three dimensions: (1) cognitive, (2) affective and (3) volitional. The ultimate purpose of this article is to issue a warning as to the importance of the emotional side of motivation. An important part in implementing such insight is to be played by managers (and by employees, also), who should develop the skills and know-how needed to keep a well-balanced emotional climate that effectively favors the maximization of individual and group motivation at the workplace.
Authors:Tamunotonye Harry, Ivoline Ngong, Chima Nweke, Yuanyuan Feng, Joseph Near
Abstract:
User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (like PersonaChat, PANDORA etc.) capture only trait, and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74\% is within-person(state) while only 26\% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
Authors:Jiaxin Xu, Chao Zhang, Raymond H. Cuijpers, Wijnand A. IJsselsteijn
Abstract:
Social robots are increasingly applied as health behavior change interventions, yet actionable knowledge to guide their design and evaluation remains limited. This systematic review synthesizes (1) the behavior change strategies used in existing HRI studies employing social robots to promote health behavior change, and (2) the evaluation methods applied to assess behavior change outcomes. Relevant literature was identified through systematic database searches and hand searches. Analysis of 39 studies revealed four overarching categories of behavior change strategies: coaching strategies, counseling strategies, social influence strategies, and persuasion-enhancing strategies. These strategies highlight the unique affordances of social robots as behavior change interventions and offer valuable design heuristics. The review also identified key characteristics of current evaluation practices, including study designs, settings, durations, and outcome measures, on the basis of which we propose several directions for future HRI research.
Authors:Elif Uskuplu, Lawrence S. Moss, Valeria de Paiva
Abstract:
Mathematical knowledge exists in many forms, ranging from informal textbooks and lecture notes to large formal proof libraries, yet moving between these representations remains difficult. Informal texts hide dependencies, while formal systems expose every detail in ways that are not always human-readable. Dependency graphs offer a middle ground by making visible the structure of results, definitions, and proofs. We present KnowTeX, a standalone, user-friendly tool that extends the ideas of Lean's Blueprints, enabling the visualization of conceptual dependencies directly from LaTeX sources. Using a simple "uses" command, KnowTeX extracts relationships among statements and generates previewable graphs in DOT and TikZ formats. Applied to mathematical texts, such graphs clarify core results, support education and formalization, and provide a resource for aligning informal and formal mathematical representations. We argue that dependency graphs should become a standard feature of mathematical writing, benefiting both human readers and automated systems.
Authors:Kevin Tseng, Juan Carlos Toledano, Bart De Clerck, Yuliia Dukach, Phil Tinn
Abstract:
The interoperability of data and intelligence across allied partners and their respective end-user groups is considered a foundational enabler to the collective defense capability--both conventional and hybrid--of NATO countries. Foreign Information Manipulation and Interference (FIMI) and related hybrid activities are conducted across various societal dimensions and infospheres, posing an ever greater challenge to the characterization of threats, sustaining situational awareness, and response coordination. Recent advances in AI have further led to the decreasing cost of AI-augmented trolling and interference activities, such as through the generation and amplification of manipulative content. Despite the introduction of the DISARM framework as a standardized metadata and analytical framework for FIMI, operationalizing it at the scale of social media remains a challenge. We propose a framework-agnostic agent-based operationalization of DISARM to investigate FIMI on social media. We develop a multi-agent pipeline in which specialized agentic AI components collaboratively (1) detect candidate manipulative behaviors, and (2) map these behaviors onto standard DISARM taxonomies in a transparent manner. We evaluated the approach on two real-world datasets annotated by domain practitioners. We demonstrate that our approach is effective in scaling the predominantly manual and heavily interpretive work of FIMI analysis, providing a direct contribution to enhancing the situational awareness and data interoperability in the context of operating in media and information-rich settings.
Authors:James Brock, Ce Zhang, Nantheera Anantrasirichai
Abstract:
The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. We introduce Forest-Chat, an LLM-driven agent designed for integrated forest change analysis. The proposed framework enables natural language querying and supports multiple RSICI tasks, including change detection, change captioning, object counting, deforestation percentage estimation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, and incorporates zero-shot change detection via a foundation change detection model together with an interactive point-prompt interface to support fine-grained user guidance. To facilitate adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated through a combination of human annotation and rule-based methods. Experimental results demonstrate that Forest-Chat achieves strong performance on Forest-Change and on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI, for joint change detection and captioning, highlighting the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and analytical efficiency in forest change analysis.
Authors:C. Estelle Smith, Alemitu Bezabih, Shadi Nourriz, Jesan Ahammed Ovi
Abstract:
Despite its importance for well-being, spiritual care remains under-explored in HCI, while the adoption of technology in clinical spiritual care lags behind other healthcare fields. Prior work derived a definition of "spiritual support" through co-design workshops with stakeholders in online health communities. This paper contributes: (1) a revision of that definition through member checking with professional spiritual care providers (SCPs); (2) a novel design framework -- SPIRIT -- which can help to expand models of delivery for spiritual care using digital technologies. Through re-analysis of previous data and new interviews with SCPs, we identify three prerequisites for meaningful spiritual care: openness to care, safe space, and the ability to discern and articulate spiritual needs. We also propose six design dimensions: loving presence, meaning-making, appropriate degree of technology use, location, degree of relational closeness, and temporality. We discuss how SPIRIT offers guidance for designing impactful digital spiritual care intervention systems within and beyond clinical settings.
Authors:Yuhui Xu, Minha Lee, Stephan Wensveen, Mahla Alizadeh, Mathias Funk
Abstract:
People's identities change during life transitions, e.g., studying abroad. They bring everyday objects that embody memories and reflect their identities during such moves. To assist in these transitions, we ask how people's human identities could be influenced by their objects through an artificial agent. This paper presents an exploratory research-through-design study around how people undergoing life transitions experience conversing with their everyday objects through a chatbot. Drawing on a two-week field deployment and interviews with 12 participants, we contribute (1) a conceptualization of 'trans-embodiment' describing the asynchronous imagination of object and human identities on the chatbot, (2) empirical evidence of the resulting emotional and reflective experiences, and (3) three types of object identities for designing conversational agents that role-play objects. Our contributions sum up to triangulating human-agent-object identity as trans-embodiment in supporting life transitions.
Authors:Amy Koike, Yuki Okafuji, Sichao Song
Abstract:
Voice is an essential modality for human-robot interaction (HRI). The way a robot sounds plays a central role in shaping how humans perceive and engage with it, influencing factors such as intelligibility, understandability, and likability. Although prior work has examined voice design, most studies occur in controlled labs, leaving uncertainty about how results translate to real-world settings. To address this gap, we conducted two naturalistic deployment studies with a guidance robot in a shopping mall: (1) in-depth interviews with six participants, and (2) an eight-day field deployment using a 3x3 design varying speech rate and volume, yielding 725 survey responses. Our results show how real-world context shapes voice perception and inform adaptive, context-aware voice design for social robots in public spaces.
Authors:Pratik Mishra, Caner Gözübüyük, Seema Nagar, Prateeti Mohapatra, Raya Wittich, Arthur de Magalhaes
Abstract:
Manual creation of IT monitoring dashboard widgets is slow, error-prone, and a barrier for both novice and expert users. We present NOVAID, an interactive chatbot that leverages Large Language Models (LLMs) to generate IT monitoring widgets directly from natural language queries. Unlike general natural language-to-visualization tools, NOVAID addresses IT operations-specific challenges: specialized widget types like SLO charts, dynamic API-driven data retrieval, and complex contextual filters. The system combines a domain-aware semantic parser, fuzzy entity matching, and schema completion to produce standardized widget JSON specifications. An interactive clarification loop ensures accuracy in underspecified queries. On a curated dataset of 271 realistic queries, NOVAID achieves promising accuracy (up to 94.10% in metric extraction) across multiple LLMs. A user study with IT engineers yielded a System Usability Scale score of 74.2 for NOVAID, indicating good usability. By bridging natural language intent with operational dashboards, NOVAID demonstrates clear potential and a path for deployment in enterprise ITOps monitoring platforms.
Authors:Hsuen-Chi Chiu, Jeremy Foote
Abstract:
AI chatbots designed as emotional companions blur the boundaries between interpersonal intimacy and institutional software, creating a complex, multi-dimensional privacy environment. Drawing on Communication Privacy Management theory and Masur's horizontal (user-AI) and vertical (user-platform) privacy framework, we conducted in-depth interviews with fifteen users of companion AI platforms such as Replika and Character.AI. Our findings reveal that users blend interpersonal habits with institutional awareness: while the non-judgmental, always-available nature of chatbots fosters emotional safety and encourages self-disclosure, users remain mindful of institutional risks and actively manage privacy through layered strategies and selective sharing. Despite this, many feel uncertain or powerless regarding platform-level data control. Anthropomorphic design further blurs privacy boundaries, sometimes leading to unintentional oversharing and privacy turbulence. These results extend privacy theory by highlighting the unique interplay of emotional and institutional privacy management in human-AI companionship.
Authors:Laura Ferrarotti, Gian Maria Campedelli, Roberto Dessì, Andrea Baronchelli, Giovanni Iacca, Kathleen M. Carley, Alex Pentland, Joel Z. Leibo, James Evans, Bruno Lepri
Abstract:
In this article, we argue that understanding the collective behavior of agents based on large language models (LLMs) is an essential area of inquiry, with important implications in terms of risks and benefits, impacting us as a society at many levels. We claim that the distinctive nature of LLMs--namely, their initialization with extensive pre-trained knowledge and implicit social priors, together with their capability of adaptation through in-context learning--motivates the need for an interactionist paradigm consisting of alternative theoretical foundations, methodologies, and analytical tools, in order to systematically examine how prior knowledge and embedded values interact with social context to shape emergent phenomena in multi-agent generative AI systems. We propose and discuss four directions that we consider crucial for the development and deployment of LLM-based collectives, focusing on theory, methods, and trans-disciplinary dialogue.
Authors:Bohan Zhang, Chengke Bu, Paramveer S. Dhillon
Abstract:
AI writing assistants can reduce effort and improve fluency, but they may also weaken writers' sense of authorship. We study this tension with an ownership-aware co-writing editor that offers on-demand, sentence-level suggestions and tests two common design choices: persona-based coaching and style personalization. In an online study (N=176), participants completed three professional writing tasks: an email without AI help, a proposal with generic AI suggestions, and a cover letter with persona-based coaching, while half received suggestions tailored to a brief sample of their prior writing. Across the two AI-assisted tasks, psychological ownership dropped relative to unassisted writing (about 0.85-1.0 points on a 7-point scale), even as cognitive load decreased (about 0.9 points) and quality ratings stayed broadly similar overall. Persona coaching did not prevent the ownership decline. Style personalization partially restored ownership (about +0.43) and increased AI incorporation in text (+5 percentage points). We distill five design patterns: on-demand initiation, micro-suggestions, voice anchoring, audience scaffolds, and point-of-decision provenance, to guide authorship-preserving writing tools.
Authors:Robert K. Strehlow, Tobias Küster, Oskar F. Kupke, Brandon Llanque Kurps, Fikret Sivrikaya, Sahin Albayrak
Abstract:
Large language models (LLMs) have proven to work well in question-answering scenarios, but real-world applications often require access to tools for live information or actuation. For this, LLMs can be extended with tools, which are often defined in advance, also allowing for some fine-tuning for specific use cases. However, rapidly evolving software landscapes and individual services require the constant development and integration of new tools. Domain- or company-specific tools can greatly elevate the usefulness of an LLM, but such custom tools can be problematic to integrate, or the LLM may fail to reliably understand and use them. For this, we need strategies to define new tools and integrate them into the LLM dynamically, as well as robust and scalable zero-shot prompting methods that can make use of those tools in an efficient manner. In this paper, we present SAGE, a specialized conversational AI interface, based on the OPACA framework for tool discovery and execution. The integration with OPACA makes it easy to add new tools or services for the LLM to use, while SAGE itself presents rich extensibility and modularity. This not only provides the ability to seamlessly switch between different models (e.g. GPT, LLAMA), but also to add and select prompting methods, involving various setups of differently prompted agents for selecting and executing tools and evaluating the results. We implemented a number of task-solving strategies, making use of agentic concepts and prompting methods in various degrees of complexity, and evaluated those against a comprehensive set of benchmark services. The results are promising and highlight the distinct strengths and weaknesses of different task-solving strategies. Both SAGE and the OPACA framework, as well as the different benchmark services and results, are available as Open Source/Open Data on GitHub.
Authors:Rose Connolly, Victor Zordan, Rachel McDonnell
Abstract:
Teleportation is one of the most common locomotion techniques in virtual reality, yet its perceptual properties remain underexplored. While redirected walking research has shown that users' movements can be subtly manipulated without detection, similar imperceptible adjustments for teleportation have not been systematically investigated. This study examines the thresholds at which teleportation displacements become noticeable to users. We conducted a repeated-measures experiment in which participants' selected teleport destinations were altered in both direction (forwards, backwards) and at different ranges (small, large). Detection thresholds for these positional adjustments were estimated using a psychophysical staircase method with a two-alternative forced choice (2AFC) task. Results show that teleport destinations can be shifted without detection, with larger tolerances for backward adjustments and across longer teleport ranges. These findings establish baseline perceptual limits for redirected teleportation and highlight its potential as a design technique. Applications include supporting interpersonal distance management in social VR, guiding players toward objectives in games, and assisting novice users with navigation. By identifying the limits of imperceptible teleportation adjustments, this work extends redirection principles beyond walking to teleportation and opens new opportunities for adaptive and socially aware VR locomotion systems.
Authors:Alfonso Piscitelli, Cristina David, Mattia De Rosa, Ali Mohammed, Federico Nanni, Jacob Pake, Roly Perera, Jessy Sodimu, Chenyiqiu Zheng
Abstract:
We introduce _transparent documents_, interactive web-based scholarly articles which allow readers to explore the relationship to the underlying data by hovering over fragments of text, and present an LLM-based tool for authoring transparent documents, building on recent developments in data provenance for general-purpose programming languages. As a target platform, our implementation uses Fluid, an open source programming language with a provenance-tracking runtime. Our agent-based tool supports a human author during the creation of transparent documents, identifying fragments of text which can be computed from data, such as numerical values selected from records or computed by aggregations like sum and mean, comparatives and superlatives like _better than_ and _largest_, trend-adjectives like _growing_, and similar quantitative or semi-quantitative phrases, and then attempts to synthesise a suitable Fluid query over the data which generates the target string. The resulting expression is inserted into the article's web page, turning the static text fragment into an interactable data-driven element able to reveal the data that underwrites the natural language claim. We evaluate our approach on a subset of SciGen, an open source dataset consisting of tables from scientific articles and their corresponding descriptions, which we extend with hand-generated counterfactual test cases to evaluate how well machine-generated expressions generalise. Our results show that gpt4o is often able to synthesise compound expressions extensionally compatible with our gold solutions.
Authors:Niloufar Alavi, Swati Shah, Rezvan Alamian, Stefan Goetz
Abstract:
Brain-computer interfaces (BCIs) allow direct communication between the brain and electronics without the need for speech or physical movement. Such interfaces can be particularly beneficial in applications requiring rapid response times, such as driving, where a vehicle's advanced driving assistance systems could benefit from immediate understanding of a driver's intentions. This study presents a novel method for predicting a driver's intention to steer using electroencephalography (EEG) signals through deep learning. A driving simulator created a controlled environment in which participants imagined controlling a vehicle during various driving scenarios, including left and right turns, as well as straight driving. A convolutional neural network (CNN) classified the detected EEG data with minimal pre-processing. Our model achieved an accuracy of 83.7% in distinguishing between the three steering intentions and demonstrated the ability of CNNs to process raw EEG data effectively. The classification accuracy was highest for right-turn segments, which suggests a potential spatial bias in brain activity. This study lays the foundation for more intuitive brain-to-vehicle communication systems.
Authors:Martin P. Robillard, Lihn V. Nguyen, Deeksha Arya, Jin L. C. Guo
Abstract:
Health information websites offer instantaneous access to information, but have important privacy implications as they can associate a visitor with specific medical conditions. We interviewed 35 residents of Canada to better understand whether and how online health information seekers exercise three potential means of protection against surveillance: website selection, privacy-enhancing technologies, and self-censorship, as well as their understanding of web tracking. Our findings reveal how users' limited initiative and effectiveness in protecting their privacy could be associated with a missing or inaccurate understanding of how implicit data collection by third parties takes place on the web, and who collects the data. We conclude that to help Internet users achieve better self-data protection, we may need to shift privacy awareness efforts from what information is collected to how it is collected.
Authors:Tatsuya Okuno, Haruto Shimizu, Nobuhito Kasahara, Taiyu Honma, Shota Yamanaka, Homei Miyashita
Abstract:
As XR devices become widespread, 3D interaction has become commonplace, and UI developers are increasingly required to consider usability to deliver better user experiences. The HCI community has long studied target-pointing performance, and research on 3D environments has progressed substantially. However, for practitioners to directly leverage research findings in UI improvements, practical tools are needed. To bridge this gap between research and development in VR systems, we propose a system that estimates object selection success rates within a development tool (Unity). In this paper, we validate the underlying theory, describe the tool's functions, and report feedback from VR developers who tried the tool to assess its usefulness.
Authors:Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu, Qiong Wu
Abstract:
Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.
Authors:Yun Ye, Yuan Che, Haoyang Liang, Yingheng Zhang, Pengpeng Xu
Abstract:
Although automated trucks have the potential to improve freight efficiency, reduce costs, and address driver shortages, organizing two or more trucks in a convoy has raised considerable concerns for pedestrian safety. This study conducted a controlled experiment to examine the influence of behavioral tendency, trust, and risk perception on pedestrian intention to cross in front of an automated truck platoon. A total of 603 subjects participated in the virtual reality video-based questionnaire survey. By fusing the merits of structural equation modeling and artificial neural networks, a two-stage, hybrid model was developed to examine complex relationships between latent variables and gap-acceptance behaviors. Our results indicated that subjects watched an average of five vehicle gaps before starting crossing and the average time gap accepted was about 5.35 seconds. Risk perception not only played the most dominant role in shaping pedestrian crossing decisions, but also served as the strong bone, mediating the effects of behavioral tendency and trust on gap-acceptance. Participants who frequently violated traffic rules were more likely to accept a smaller time gap, while those who showed positive behaviors to other road users tended to wait for a larger time gap. Participants who often committed errors, showed aggressive behaviors, and held greater trust in the safety of automated trucks generally reported a lower level of risk for road-crossing in front of automated truck platoons. Built on these findings, a range of tailored countermeasures were proposed to ensure safer and smother interactions between pedestrians and automated truck platoons.
Authors:Yuan Che, Mun On Wong, Xiaowei Gao, Haoyang Liang, Yun Ye
Abstract:
Autonomous driving improves traffic efficiency but presents safety challenges in complex port environments. This study investigates how environmental factors, traffic factors, and pedestrian characteristics influence interaction safety between autonomous vehicles and pedestrians in ports. Using virtual reality (VR) simulations of typical port scenarios, 33 participants completed pedestrian crossing tasks under varying visibility, vehicle sizes, and time pressure conditions. Results indicate that low-visibility conditions, partial occlusions and larger vehicle sizes significantly increase perceived risk, prompting pedestrians to wait longer and accept larger gaps. Specifically, pedestrians tended to accept larger gaps and waited longer when interacting with large autonomous truck platoons, reflecting heightened caution due to their perceived threat. However, local obstructions also reduce post-encroachment time, compressing safety margins. Individual attributes such as age, gender, and driving experience further shape decision-making, while time pressure undermines compensatory behaviors and increases risk. Based on these findings, safety strategies are proposed, including installing wide-angle cameras at multiple viewpoints, enabling real-time vehicle-infrastructure communication, enhancing port lighting and signage, and strengthening pedestrian safety training. This study offers practical recommendations for improving the safety and deployment of vision-based autonomous systems in port settings.
Authors:Sumit S. Shevtekar, Chandresh K. Maurya, Gourab Sil, Subasish Das
Abstract:
Time pressure critically influences risky maneuvers and crash proneness among powered two-wheeler riders, yet its prediction remains underexplored in intelligent transportation systems. We present a large-scale dataset of 129,000+ labeled multivariate time-series sequences from 153 rides by 51 participants under No, Low, and High Time Pressure conditions. Each sequence captures 63 features spanning vehicle kinematics, control inputs, behavioral violations, and environmental context. Our empirical analysis shows High Time Pressure induces 48% higher speeds, 36.4% greater speed variability, 58% more risky turns at intersections, 36% more sudden braking, and 50% higher rear brake forces versus No Time Pressure. To benchmark this dataset, we propose MotoTimePressure, a deep learning model combining convolutional preprocessing, dual-stage temporal attention, and Squeeze-and-Excitation feature recalibration, achieving 91.53% accuracy and 98.93% ROC AUC, outperforming eight baselines. Since time pressure cannot be directly measured in real time, we demonstrate its utility in collision prediction and threshold determination. Using MTPS-predicted time pressure as features, improves Informer-based collision risk accuracy from 91.25% to 93.51%, approaching oracle performance (93.72%). Thresholded time pressure states capture rider cognitive stress and enable proactive ITS interventions, including adaptive alerts, haptic feedback, V2I signaling, and speed guidance, supporting safer two-wheeler mobility under the Safe System Approach.
Authors:Soroush Elyasi, Arya VarastehNezhad, Fattaneh Taghiyareh
Abstract:
Personality assessment in career guidance and personnel selection traditionally relies on self-report questionnaires, which are susceptible to response bias, fatigue, and intentional distortion. Game-based assessment offers a promising alternative by capturing implicit behavioral signals during gameplay. This study proposes a multi-genre serious-game framework combined with machine-learning techniques to predict suitability for software development roles. Developer-relevant personality and behavioral traits were identified through a systematic literature review and an empirical study of professional software engineers. A custom mobile game was designed to elicit behaviors related to problem solving, planning, adaptability, persistence, time management, and information seeking. Fine-grained gameplay event data were collected and analyzed using a two-phase modeling strategy where suitability was predicted exclusively from gameplay-derived behavioral features. Results show that our model achieved up to 97% precision and 94% accuracy. Behavioral analysis revealed that proper candidates exhibited distinct gameplay patterns, such as more wins in puzzle-based games, more side challenges, navigating menus more frequently, and exhibiting fewer pauses, retries, and surrender actions. These findings demonstrate that implicit behavioral traces captured during gameplay is promising in predicting software-development suitability without explicit personality testing, supporting serious games as a scalable, engaging, and less biased alternative for career assessment.
Authors:Joslyn Orgill, Andra Rice, Max Fowler, Seth Poulsen
Abstract:
The development of effective autograders is key for scaling assessment and feedback. While NLP based autograding systems for open-ended response questions have been found to be beneficial for providing immediate feedback, autograders are not always liked, understood, or trusted by students. Our research tested the effect of transparency on students' attitudes towards autograders. Transparent autograders increased students' perceptions of autograder accuracy and willingness to discuss autograders in survey comments, but did not improve other related attitudes -- such as willingness to be graded by them on a test -- relative to the control without transparency. However, this lack of impact may be due to higher measured student trust towards autograders in this study than in prior work in the field. We briefly discuss possible reasons for this trend.
Authors:Argha Kamal Samanta, Deepak Mewada, Monalisa Sarma, Debasis Samanta
Abstract:
Continuous electroencephalography (EEG) is routinely used in neurocritical care to monitor seizures and other harmful brain activity, including rhythmic and periodic patterns that are clinically significant. Although deep learning methods have achieved high accuracy in seizure detection, most existing approaches remain seizure-centric, rely on discrete-label supervision, and are primarily evaluated using accuracy-based metrics. A central limitation of current EEG modeling practice is the weak correspondence between learned representations and how EEG findings are interpreted and summarized in clinical workflows. Harmful EEG activity exhibits overlapping patterns, graded expert agreement, and temporal persistence, which are not well captured by classification objectives alone. This work proposes a multimodal EEG representation learning framework that integrates signal-domain modeling with structured clinical language supervision. First, raw EEG is transformed into a longitudinal bipolar montage and time-frequency representations. Second, dual transformer-based encoders model complementary temporal and frequency-centric dependencies and are fused using an adaptive gating mechanism. Third, EEG embeddings are aligned with structured expert consensus descriptions through a contrastive objective. Finally, an EEG-conditioned text reconstruction loss is introduced as a representation-level constraint alongside standard classification loss. Experimental evaluation using a controlled train-validation-test split achieves a six-class test accuracy of 0.9797. Ablation analyses show that removing contrastive alignment reduces cross-modal retrieval performance from Recall@10 of 0.3390 to 0.0045, despite minimal change in classification accuracy. These findings demonstrate that discriminative accuracy does not reliably reflect representation quality for clinically meaningful EEG modeling.
Authors:Caoilte Ó Ciardha, Joel Scanlan, Tegan Insoll, Juha Nurmi, Nina Vaaranen-Valkonen
Abstract:
Warning messages have been used to disrupt individuals seeking online child sexual abuse material (CSAM) and promote engagement with support services, yet large-scale field evidence on message content remains limited, particularly in high anonymity environments. This study reports a field experiment on Ahmia.fi, a Tor search engine, examining how warning message content influences behavior. Across a 140-day period, almost 20 million searches were observed, with over 3 million searches containing known CSAM-related terms that triggered a warning linking to an anonymous self-help program. Users were exposed to warning messages varying in thematic content and framing, or a neutral message. Across a randomized comparison, a campaign-wide analysis, and interrupted time series models, message content consistently influenced engagement with help resources. All active messages increased click-through rates to help resources relative to the neutral condition, with a harm-focused message producing the strongest effects. At the platform level, click-through rates increased from 8.73% before the intervention to 15.67% during the campaign. These findings highlight the importance of message content in shaping responses to warning interventions, supporting an approach in which messaging is refined and adapted to increase engagement with support resources.
Authors:Arya VarastehNezhad, Fattaneh Taghiyareh
Abstract:
This study investigates whether behavioral and performance indicators derived from a Moodle-based learning management system are associated with university students' depression and anxiety in two undergraduate Computer Engineering courses. Using a quantitative observational design, LMS event logs, academic records, and self-reported Beck Depression Inventory-II and Beck Anxiety Inventory scores from 97 students were integrated. A broad set of behavioral and performance indicators spanning temporal engagement, session structure, deadline-related behavior, page-refresh patterns, and LMS navigation was extracted from raw event logs and analyzed using descriptive statistics, independent-samples t-tests with Benjamini-Hochberg FDR correction, effect sizes, and Spearman correlations; inventory scores were confirmed invariant by sex and academic year. Several indicators were significantly associated with depression and anxiety. Higher depression was associated with shifted temporal activity patterns, longer session durations, and shorter homework submission lead times, while higher anxiety was associated with concentrated temporal engagement and session-based differences. These findings suggest that routine LMS data can provide meaningful behavioral signals related to student well-being and may support earlier educational awareness of students who experience mental-health-related strain. At the same time, such indicators should be interpreted as contextual and non-diagnostic markers rather than as substitutes for clinical assessment.
Authors:Louis Nisiotis, Aimilios Hadjiliasi
Abstract:
As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend models and computational resources. Embedding these capabilities directly into virtual world systems reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples a virtual world client from heterogeneous AI backends through intent-driven service routing. An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is invoked transparently, enabling new AI capabilities to be introduced in the virtual world without modifying the client application. The gateway is implemented and evaluated within the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can serve as reliable intent routers on edge hardware, and that task-specific fine-tuning can transform sub-billion-parameter models into practical, low-latency routers. A layered configuration pairing a fine-tuned sub billion-parameter model as router with a larger SLM for conversational response generation is shown to be deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings show that SLMs can support practical AI service orchestration in virtual worlds and the work contributes an evaluated architecture for scalable, extensible, and edge-supported AI interaction, enabling virtual agents become access points to distributed generative AI services.
Authors:Yisak Debele, Israel Goytom, Anwar Misbah
Abstract:
Artificial intelligence systems that model and support human cognition require reliable measures of cognitive state. We present the Focus Performance Score (FPS) from the Pulse Focus mobile Stroop application and evaluate whether it measures attentional control during color-word conflict resolution. We conduct behavioral, neural, and formula validation analyses. Behavioral results (N=466, 111,133 trials) show that FPS captures the Stroop interference effect, tracks individual differences in attentional control, and demonstrates strong test-retest reliability. Neural validation using the DMCC55B fMRI dataset (N=55) shows that the primary FPS component, mean incongruent reaction time, is significantly associated with anterior cingulate cortex activation, a key neural substrate of conflict monitoring. Formula validation identifies and resolves structural redundancy within the scoring framework and provides convergent support for the weighting design. Together, these findings establish FPS as a behaviorally valid, reliable, and neurally grounded measure of attentional control. FPS provides a defensible behavioral signal for evaluating human attentional state and supports future work on attention-aware human-AI interaction and physiological state modeling.
Authors:Romina Mahinpei, Victoria Dean, Ruth Fong, Lydia T. Liu, Manoel Horta Ribeiro
Abstract:
AI systems increasingly shape human workflows by generating intermediate artifacts that users can adopt, revise, or ignore. While prior work has shown that AI assistance can improve the efficiency and accuracy of required tasks, less is known about whether it can increase participation in discretionary but beneficial work that users often intend to perform but frequently skip. We study this question in the context of personalized feedback provision in higher education, a pedagogically valuable but often optional practice. We conduct a mixed-methods study combining a randomized field experiment and qualitative interviews in a 300-level machine learning course with n=11 teaching assistants (TAs) and n=88 students. Student submissions were randomly assigned to either (1) a treatment condition where TAs received AI-assisted feedback drafts after grading or (2) a control condition without drafts. TAs remained fully in control and could use, edit, or ignore drafts at their discretion. We find that AI-assisted feedback significantly increases feedback provision (+10.8 percentage points, SE=1.1, p<0.001) and feedback length (+39.8 chars, SE=3.45, p<0.001) without negatively affecting student usefulness ratings or reducing time per character. Qualitative findings suggest that AI-assisted drafts function as editable scaffolds that lower barriers to initiating feedback rather than reducing overall effort. Our findings highlight AI's promise for discretionary but beneficial tasks: increasing work that might otherwise go undone while preserving human control over final outcomes.
Authors:Tushar Das, Daigo Hozaki, Koushlendra Kumar Singh, Hirohito M. Kondo
Abstract:
Autonomous Sensory Meridian Response (ASMR) is a somatosensory phenomenon characterized by pleasant tingling sensations and cardiovascular slowing. However, ASMR research has been hindered by a dearth of standardized, open-access multimodal datasets. To address this limitation, we present REST-ASMR (Response to Environmental & Sensory Triggers), a synchronized multimodal dataset designed to capture behavioral reports and physiological dynamics during ASMR, with nature-relaxation videos as control stimuli. The dataset includes high-resolution photoplethysmography (PPG), time-aligned audiovisual stimuli, and continuous subjective annotations from 34 participants. Technical validation showed high stimulus efficacy (97% responder rate), significant stimulus-specific inter-subject agreement (p < 0.05), and a robust PPG-derived ASMR-specific cardiovascular deceleration. Additionally, a Bidirectional Long-Short Term Memory model successfully predicted subjective ASMR tingle states, achieving video-level ASMR vs. Nature classification with perfect accuracy and a frame-level global mean accuracy of 75.51%, macro F1-score of 71.86%, and 100% Nature-baseline specificity, under a strict, leakage-free subject-video double-independent 4-fold cross-validation. REST-ASMR constitutes a dense temporal foundation for affective computing, multimodal research, and the development of personalized models of relaxation-related responses.
Authors:William Hohnen-Ford, Sarah Chen, Kathryn B. Francis, Madeline G. Reinecke, Ilina Singh, David Lyreskog
Abstract:
Radical Moral Disagreements (RMDs) are highly polarising topics that are increasingly censored in everyday life, with growing evidence suggesting that this polarisation carries measurable costs to public mental health. To address these challenges, some researchers have proposed Large Language Models (LLMs) as a means to support more democratic deliberation and better moral reasoning. Yet existing tools are poorly calibrated to help people navigate RMDs, because of their intense and divisive characteristics. This paper introduces CONSIDER, a prototype for a one-to-one AI tool for RMD navigation. Drawing on Mill's account of the epistemic value of disagreement, CONSIDER aims at value clarification through structured disagreement with an opposing LLM-generated opinion. We describe CONSIDER's design logic and analyse potential risks posed by such tools to guide future development.
Authors:Chi-Ching Juan, Tao Wang, Harold Lee
Abstract:
The appropriateness of empathy in AI has emerged as a critical concern, as excessive empathy risks seeming manipulative while insufficient empathy appears dismissive. While prior research has explored how to quantify empathy in AI, few studies examine whether such empathy is contextually appropriate. This paper introduces an economic perspective by applying signaling theory to human-AI conversations. We propose Signal Cost Proxies (emotional richness, perspective-taking, and contextual tailoring) mapped to affective, cognitive, and associative empathy. This multidimensional framework enables systematic evaluation of empathy not just by presence, but by its appropriateness relative to user demand.
Authors:Roberto Figliè, Simone Caputo, Alan Serrano, Daria Mikhaylova, Tommaso Turchi, Daniele Mazzei
Abstract:
Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards remain dominant in industrial contexts, Large Language Model (LLM)-based conversational agents (CAs), accessed through conversational user interfaces (CUIs), may provide more direct access to such data. However, their effectiveness may depend on the information-processing demands of the task. This study compares an LLM-based CA delivered through a CUI with a dashboard in a manufacturing decision-support scenario. In a mixed factorial experiment with a 2x3 design, 134 industrial decision-makers were assigned to one interface condition and completed three tasks of increasing complexity. We examined perceived Mental Workload (MWL), decision accuracy, completion time, and intended reliance, and tested self-reported data literacy as a moderator. Results showed that the CUI reduced perceived MWL overall and supported faster completion in less demanding tasks, but both advantages diminished as task complexity increased. Neither interface produced a consistent overall advantage in decision accuracy, and the CUI was not preferred as a sole basis for subsequent decisions. Furthermore, data literacy did not reliably moderate interface effects. These findings indicate that conversational interaction offers conditional rather than universal benefits for industrial decision support. LLM-based CAs may reduce information-access effort, whereas complex decisions continue to benefit from persistent, inspectable visual representations.
Authors:Roberto Figliè, Simone Caputo, Alan Serrano, Tommaso Turchi, Daniele Mazzei
Abstract:
The use of Generative AI Conversational User Interfaces (CUI) as a new way to access and analyze data is growing in all sectors, and the industrial one is no exception. There, large amounts of data produced by IoT devices are flowing through user interfaces and may require them a new adaptation to the new analyses needs of decision-makers. LLM-based CUIs are promising a new way to directly interact with those data through the directness of natural language and without the learning costs that every GUI design has. Moreover, the capabilities of LLMs and their agency open up the possibility to automate some tasks and help with the reasoning during decision-making activities. But are this promises well founded? We try to scope this general question with a mixed-approach study comparing a state-of-the-art dashboard with a conversational agent. A total of 20 participants used both interfaces to complete four simulated industrial decision tasks of varying complexity. We combined measures of mental workload, completion time, and decision accuracy with a post-study questionnaire and semi-structured interviews analyzed through thematic analysis. The findings suggest that the conversational agent can reduce interactional effort by supporting more direct access to information, while the dashboard remains valuable for overview and verification. However, these benefits may vary across tasks and require validation through larger-scale studies.
Authors:Edwige Chauvergne, Arnaud Prouzeau, Martin Hachet, Pierre Dragicevic
Abstract:
Data visualization is a powerful tool for conveying statistical information, but when representing populations, it tends to hide individuals. We introduce Zoomable Empathic Visualizations (ZEVs), interactive experiences allowing users to smoothly navigate between abstract statistical visualizations and more qualitative, relatable representations focused on individuals. We present three use cases of ZEVs and report on a qualitative user study that highlights opportunities for deeper understanding and emotional engagement, while pointing to areas for improvement and further refinement. In summary, ZEVs point toward new approaches for revealing the individuals behind the data.
Authors:Niall McShane, Attila Korik, Karl McCreadie, Naomi Du Bois, Darryl Charles, Damien Coyle
Abstract:
Continuous brain-computer interfaces (BCIs) that decode motion trajectories from imagined movement offer intuitive motor control, yet how feedback modality and longitudinal training shape neural representations and decoding performance remains poorly understood. We present the first systematic investigation of embodied virtual reality (VR) feedback during real-time 3D virtual limb control driven by motor imagery, across ten longitudinal sessions in ten participants. Performance was evaluated using three strategies: actual online performance (Fixed Decoder Generalisation, FDG), periodic retraining (Sequential Adaptive Training, SAT), and within-session upper-bound estimation (Within-Session Reconstruction, WSR). A CNN-LSTM decoder achieved within-session imagined movement correlations of r = 0.762 under VR and r = 0.672 under screen feedback. VR significantly outperformed screen feedback across all strategies and movement dimensions (improvements of 8.9-13.0%, all p <= 0.002, d = 1.42-2.05). This advantage persisted under fixed decoders without retraining, demonstrating that embodied VR feedback elicits inherently more decodable and generalisable neural representations. Linear mixed-effects modelling confirmed robust main effects of feedback modality and movement axis with no interaction. Neurophysiologically, VR produced stronger sensorimotor-parietal desynchronisation and enhanced motor-frontal functional connectivity, with pervasive anterior insula engagement across all frequency bands and increased superior parietal lobule coupling, paralleling patterns associated with real movement execution. These findings establish embodied spatial feedback as a key design principle for next-generation continuous BCIs targeting intuitive motor control and neurorehabilitation.
Authors:Lu Chen, Xiaoran Xue, Rongqi Ding, Fenghua Tang, Anji Zhou, Chenxi Wang, Mengyu Miranda Gao, Zhuo Rachel Han
Abstract:
As conversational AI becomes capable of sustained, affectively responsive interaction, users may form bonds beyond instrumental use. Existing measures often adapt interpersonal frameworks or focus on specific relational outcomes, leaving limited tools for assessing human-AI affective bonding on its own terms. Across two studies, we developed and validated the Human-AI Affective Bonding Inventory (HAABI). Study 1 used thematic analysis of semi-structured interviews with 52 emotionally engaged conversational AI users to identify cognitive, emotional, and behavioral features of bonding. Study 2 translated these insights into a self-report inventory and validated it among 673 Chinese conversational AI users. Exploratory and confirmatory factor analyses supported a 20-item, four-factor structure: emotional realism, separation anxiety, emotional investment, and romantic intimacy. The HAABI showed good reliability, construct validity, and known-groups validity. The scale therefore provides a neutral, user-centered tool for studying how affective bonds with conversational AI are formed, experienced, and related to users' psychological outcomes.
Authors:Tao Wang, Chi-Ching Juan
Abstract:
A central challenge in affective computing is determining appropriate empathy levels for different interaction contexts. Prior work has characterized two poles: task-focused interactions, where empathy demand is near zero, and emotional disclosure, where empathy demand is high. This paper identifies a distinct intermediate type, decision support under stress, in which a sender faces a consequential choice while experiencing emotional difficulty. We hypothesize that this type elicits an asymmetric empathy profile: empathy comparable to emotional disclosure but instrumentality comparable to task-focused exchange. We test five hypotheses using 28,239 post-reply dyads from three Reddit advice communities, classified into three interaction types and scored for empathy depth, empathy form, and instrumental proportion using LLM-based annotation with pattern-based robustness checks. Results confirm the predicted asymmetric profile: decision-support-under-stress replies show significantly higher empathy than task-focused replies (M = 0.47 vs. 0.24, p < 0.001) while maintaining high instrumentality (0.83 vs. 0.77 for emotional disclosure, p < 0.001). Behavioral empathy dominates (36.6%), and community-validated response quality is negatively associated with empathic expression (r = -0.075, p < 0.001). Community norms modulate baselines substantially but preserve the structural ordering. These findings establish a human empathy baseline for this interaction type and have direct implications for calibrating empathic expression in affective AI systems.
Authors:Jiyoon Kim, Kentaro Toyama, Sangmi Kim, John M. Carroll
Abstract:
Generative AI challenges academic integrity not only by enabling students to delegate substantial portions of their academic work, but also by blurring the ethical boundaries by which students distinguish acceptable assistance from misconduct. Drawing on semi-structured interviews (n=20), AI chat logs, and course documents (syllabi, submitted assignments), we investigated how students themselves make moral sense of AI use in academic writing. Our analysis results in a range of novel findings: First, there are at least five distinct sites of AI-use conceptualization, ranging from faculty's intended AI policy, to students' actual AI use. Second, students use over 20 distinct rationalizations to justify AI use, such as that copying AI-generated text is victimless; that any AI text reflecting their own beliefs or their own style is their own writing; or that they are learning more by using AI -- even extensively -- than otherwise. We present a taxonomy of these rationalizations, and show how some of them are employed to justify conscious violations of course policies. Third, student rationalizations occur in both an ad hoc and post hoc manner, and they are not necessarily self-consistent. These and other findings suggest that modern AI presents a steep, ethical, slippery slope which students conceptually slide down, landing far outside the pedagogical goals and expectations of instructors. We discuss implications for educational design and AI policy.
Authors:Gennady Andrienko, Natalia Andrienko
Abstract:
Unsupervised learning methods -- topic modeling, partition-based and density-based clustering -- produce data groupings without human guidance, yet choosing and evaluating those groupings should not itself be unsupervised. We present \emph{SmartIterator}~(SI), a visual analytics approach that treats the full sequence of grouping results across a parameter sweep as a first-class analytical object. For each method family, SI provides a structured six-phase workflow that guides the analyst through systematic exploration of grouping results -- from quality-metric overview through transition-stability assessment, membership-confidence evaluation, content and context inspection, and recurrent-archetype verification to an informed decision -- building cumulative understanding of data structure along the way. The workflows are operationalized through \emph{IteraScope}~(IS), a coordinated visual display combining quality-metric charts with semantic color encoding, a 1D group embedding with Sankey-style transition flows and violin plots of membership confidence, a 2D group embedding with HDBSCAN-detected recurrent archetypes that highlights iterations capturing all persistent patterns, and domain-specific linked views for contextualized interpretation. We demonstrate the three workflows on: (1)~simulated social-media messages from the VAST Challenge 2011 (density-based clustering, validated against ground truth), (2)~EU population statistics across ${\sim}1\,500$ NUTS-3 regions (partition-based clustering), and (3)~30 years of IEEE VIS papers (NMF topic modeling). The workflows constitute the main contribution: they provide actionable, method-specific guidance for navigating parameter spaces, studying how data structure evolves across configurations, and grounding analytical understanding in domain context -- yielding knowledge about the data that no single ``best'' result can provide.
Authors:Alberto Garzás-Villar, Alba Riera-Cardona, Alexis Derumigny, J. Micah Prendergast, Jane Murray Cramm, Laura Marchal-Crespo
Abstract:
Robotic haptic devices combined with virtual reality offer novel opportunities to train fine force generation, an essential yet overlooked component of post-stroke rehabilitation. This study proposes that manipulating the rendered dynamics of tangible virtual objects can be leveraged to train precise force control while engaging the somatosensory system. We conducted an experiment with fifty healthy participants who performed a curling-inspired task in which they had to stretch a virtual spring to generate a target release force to propel the stone to a predefined location on the ice sheet. During training, the spring's force-elongation relationship was modeled as either a linear or non-linear function, i.e., a Gaussian or antisymmetric Gaussian (AS-Gaussian) function with zero derivative at the release target force. Results indicate that the AS-Gaussian group consistently achieved higher force accuracy during training than the linear group, while the Gaussian group only outperformed the linear group toward the end of training. Analysis of personality traits revealed that higher Free Spirit scores were associated with poorer performance and reduced task exploration under Gaussian dynamics, whereas higher Transform-of-Challenge scores correlated with increased exploration. Despite these training effects, no significant differences in long-term retention were found across spring types or personality traits. Participants primarily relied on learned target elongation rather than target force, as evidenced by performance in a transfer task with a different stiffness but the same target force. While promising for somatosensory neurorehabilitation, these methods require refinement to reduce reliance on proprioceptive cues before testing with neurological patients.
Authors:Tobias Jaeggi, David Gregory Black, Septimiu Salcudean
Abstract:
Tele-ultrasound through teleoperation allows experts to perform examinations remotely in communities, but limited connectivity can lead to communication delays that reduce usability and diagnostic performance. Visual-haptic model mediated teleoperation reslices a pre-acquired ultrasound volume in real time to provide an accurate, delay-independent preview image for the sonographer. This enables fast and robust exploration before using the live image for fine tuning. However, existing reslicing techniques do not account for the directional nature of ultrasound - the fact that a structure looks different when imaged from different directions. This paper presents Directionality-Aware Reslicing (DARE), an ultrasound volume reconstruction and reslicing framework that takes directionality into account. The presented GPU-accelerated algorithm allows real-time reslicing from arbitrary viewpoints to generate accurate preview images. The method is evaluated quantitatively through image similarity metrics and qualitatively through a user study, and significantly outperforms existing reslicing methods in image similarity and realism compared to a ground truth. This can improve the effectiveness and robustness of tele-ultrasound in low-resource areas.
Authors:Owais Mujtaba Khanday, Jose A. Gonzalez-Lopez, Marc Ouellet, Alberto Galdon, Gonzalo Olivares Granados
Abstract:
Current high-performing intracortical speech neuroprostheses achieve low word error rates but typically rely on external language models during inference, increasing memory, computation, and latency. In this work, we investigate whether meaningful character-level decoding is achievable without such models. We propose an end-to-end Conformer-based neural decoder trained directly on intracortical recordings from a participant with amyotrophic lateral sclerosis (ALS). Without any external language model, the system achieves a character error rate (CER) of 23.80\% on held-out validation data. Analysis shows that performance variability is driven by inter-session signal degradation, while dominant errors arise from incorrect word boundary segmentation. These results demonstrate that effective character-level decoding is possible in a fully end-to-end framework, providing a strong neural signal for downstream linguistic processing.
Authors:Jianlong Zhu, Syed Muhammad Jhon Raza Naqvi, Carolin-Theresa Ziemer, Usman Naseem, Ingmar Weber
Abstract:
Argumentative dialogues across political divides can reduce polarization, yet opportunities for citizens to engage with opposing views in accessible and structured ways remain limited. AI dialogue partners offer a scalable framework for such open-mindedness exercises, but how the format of human-AI dialogues shapes their benefits remains unclear. In a two-session online experiment, 469 US participants were assigned to argue either for or against their own attitude on a contested political issue with an AI chatbot. Our experimental findings show attitude-congruent dialogues produced greater immediate reduction in both affective and opinion polarization than attitude-incongruent dialogues. By contrast, attitude-incongruent dialogues elicited weaker cognitive state empathy than the non-AI reference task but increased cognitive trait empathy in the two-week period between sessions, suggesting the effects of active generation of attitude-incongruent arguments may emerge over time. These findings highlight dialogue design as a key determinant of effective AI-mediated behavioral interventions.
Authors:Jagan K. Balasubramanian, Yasemin Vardar
Abstract:
Modern audio-visual media rely on compact representations for efficient storage and transmission, whereas realistic digital touch still depends on high-resolution tactile recordings. Existing approaches for representing tactile signals constrain manipulation and limit the generation of new content. Here, we introduce two compact representations, spectral beta and spectral slope, that capture the temporal spectral structure of finger-surface friction signals while preserving perceptually relevant information. Spectral beta models spectral skewness using a two-parameter beta distribution, whereas spectral slope approximates the spectrum with an asymmetric bandpass filter defined by low- and high-pass orders. We evaluated these representations in a perceptual study with 14 participants using five virtual textures rendered on a friction-modulation display and compared them with physical textures and high-fidelity reproductions of recorded signals. Spectral beta achieved perceptual similarity ratings comparable to those of the original high-fidelity reproductions. Regression analysis further showed that matching spectral energy across nine critical frequency bands was the strongest predictor of perceived realism. Together, these findings suggest that tactile texture perception depends primarily on fundamental temporal spectral patterns and that modeling these patterns is sufficient for perceptually realistic rendering. These results establish an efficient and scalable framework for haptic compression, communication, and synthetic texture generation.
Authors:Robin Deuber, Lanlan Yang, Michal Bechny, Christoph Heck, Matthias Pfäffli, Matthias Bantle, Florian von Wangenheim, Elgar Fleisch, Wolfgang Weinmann, Manuel Günther, Felix Wortmann, Varun Mishra
Abstract:
Alcohol-impaired driving remains a major yet preventable cause of road traffic injury and death, with many drivers underestimating their level of intoxication. Compared to in-vehicle systems, mobile drunk-driving detection using consumer smartwatches offers a scalable way to trigger preventive interventions and increase awareness without additional in-vehicle hardware. We introduce a system that leverages wrist accelerometer data and heart rate variability-derived physiological signals to detect alcohol-related driving impairment. We collected data in a randomized, controlled three-arm test-track study (n=54) and trained both logistic regression models with window-aggregated features and a two-tower 1D convolutional neural network (CNN), to detect alcohol-impaired driving. The CNN achieved a participant-averaged area under the receiver operating characteristic (AUROC) of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended limit of 0.05 g/dL. To the best of our knowledge, this is the first work to (1) demonstrate drunk-driving detection using consumer smartwatches, (2) develop and evaluate such a system in a real vehicle on a closed test track, and (3) rigorously assess generalization to unseen participants. Together, these findings highlight the potential of wearable-based sensing to support scalable, measurement-driven prevention of alcohol-related traffic harm.
Authors:Shaoxuan Zhou, Yafei Sun, Jing Zhang, Xianghang Mi
Abstract:
Short-video platforms like Douyin and Kwai have become central to adolescent digital life, but they also risk exposing teens to algorithmically amplified harmful content. Despite its societal importance, the scale, mechanisms, and real-world impact of this exposure remain poorly understood. Measuring it is challenging: recommendation feeds are personalized black boxes, harmful content employs sophisticated evasion tactics, and naive crawlers fail to replicate authentic teen behavior. To bridge this gap, we propose PHTV-Scout, the first large-scale, behaviorally grounded measurement framework for Potentially Harmful Teen Videos (PHTVs). We integrate an offline survey of 683 adolescents with a tri-module online pipeline: (1) PHTV Hunter simulates teen accounts to collect recommendation feeds; (2) PHTV Arbiter, a LoRA-finetuned multimodal classifier, detects PHTVs with 94.29% accuracy and 96.41% precision; and (3) PHTV Analyzer performs fine-grained categorization and impact assessment. Over six months, we analyzed 186,727 videos and 51,287 comments, uncovering a troubling 6.11% PHTV prevalence--dominated by Child Sexual Exploitation Imagery (53.2%)--and revealing that harmful content thrives through covert interactions (e.g., grooming comments, self-disclosure) and active evasion (semantic camouflage, noise injection). Crucially, while Youth Mode blocks 100% of PHTVs, its low adoption (30-41%) leaves most teens unprotected. We further show that exposure is driven not by user identity but by regulation, platform algorithms, and even passive browsing, exposing the fragility of adolescent information environments. Our findings call for a paradigm shift from reactive takedowns to proactive, human-centered safeguards.
Authors:Reina Szeyi Chan, Sujendra Jayant Gharat, Maya Lampi, Yueran Jia, Avi K Srinivasan, Xiang Zhi Tan
Abstract:
Reminder systems commonly rely on fixed schedules, location triggers, or simple rules, limiting their ability to leverage the rich sensing capabilities of modern smart homes. A key challenge lies in enabling users to specify context-aware reminders without requiring complex configurations. We present a system pipeline that supports reminder authoring through natural language and conversational interaction. The pipeline translates user requests into structured representations and executable logic, incorporating time-based, activity-based, sensor-based, and state-based conditions. We conducted two studies to examine how users express reminder intent and how conversational support influences the authoring process. In Study 1 (N=40), we analyzed 233 user-authored reminders and identified challenges in expressing reminders with diverse and complex logic. Based on these findings, we refined the system and evaluated it in Study 2 (N=10), demonstrating improved handling of time-based, activity-based, sensor-based, and state-based conditions. Our results highlight the diversity and ambiguity of user expressions and show that conversational guidance can help structure these expressions into flexible, context-aware reminders.
Authors:Siddhartha Pradhan, Yanping Pei, Morgan Lee, Puyuan Zhang, Erin Ottmar, Adam C. Sales
Abstract:
Bayesian Knowledge Tracing (BKT) is a widely used and interpretable student modeling approach in intelligent tutoring systems and educational data mining. However, most implementations rely on expectation-maximization or related optimization methods that yield only point estimates, limiting uncertainty quantification and principled comparisons across learners and conditions. We introduce StanBKT, an open-source Python package for estimating BKT models using Bayesian inference in Stan. StanBKT provides a unified framework supporting Hamiltonian Monte Carlo, variational inference, Pathfinder, and optimization-based estimation while preserving the hidden Markov structure and interpretability of classical BKT. It supports standard, grouped, and hierarchical BKT models, flexible prior specification, posterior predictive inference, and utilities for visualization and diagnostics. We evaluate StanBKT on large-scale observational and controlled educational datasets. On the ASSISTments 2020 dataset, we show that supported inference methods achieve comparable predictive performance while differing in computational efficiency and posterior fidelity. We further demonstrate how posterior inference enables principled comparison of condition-specific parameters in an educational intervention involving perceptual cue manipulations. Results illustrate how uncertainty quantification facilitates more reliable interpretation of differences in learning, forgetting, guessing, and slipping parameters across experimental conditions. Overall, StanBKT extends BKT beyond point estimation by providing a flexible framework for probabilistic student modeling, uncertainty quantification, and hierarchical inference in educational data mining.
Authors:Gavin Eddington, Christopher Warren, Seth Poulsen, John Edwards
Abstract:
Background and Context. Computer programming often involves extended periods of sustained activity and mobile phone notifications introduce frequent opportunities for interruption. Prior work demonstrates that suppressing phone notifications may reduce these disruptions. Objectives. Our primary research question is: How does suppressing phone notifications affect students' task engagement and productivity while programming? Method. We report on a replication and methodological extension study conducted in a CS1 course involving 22 students. Using a within-subject design, selected programming assignments were randomly designated for enabling notification suppression. Phone state logs were synchronized with millisecond-resolution IDE keystroke data to measure student attention and focus when in the control and notification-suppression conditions. Findings. Assignments completed with notification suppression enabled significantly lower break rates and longer intervals of focus compared to assignments completed in the control condition for many, but not all, students. This study provides evidence that notification suppression is associated with measurable differences in programming engagement and behavior. We also find a remarkable bimodality in the effect across students -- many students are positively affected, a small number are negatively affected, and very few experience little or no effect. This finding is consistent with other studies in diverse disciplines. Implications. Our results show that, for many students, phone notification suppression tools, such as Do Not Disturb, can improve attention and focus. Implications apply to educational settings (do-not-disturb as an intervention) and scholarship (understanding the effects of phone distraction).
Authors:Dmitry Dagaev, Egor Ivanov, Petr Parshakov, Alexey Savvateev, Gleb Vasiliev
Abstract:
The emergence of large language models (LLMs) has spurred economists to study how humans and LLMs behave in strategic settings. We organized a series of round-robin tournaments in the Colonel Blotto game. This game attracts game theorists' attention due to high-dimensional action space and the absence of pure strategy Nash equilibria. In the first tournament, more than 200 human participants competed against one another. In the second tournament, several popular LLMs were invited to submit strategies. In the third tournament, we matched the number of LLM strategies to the number submitted by humans. We find that humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication is key to success if and only if the necessary level of reasoning depth is reached, while lower and higher levels of reasoning offer no clear advantage over the primitive strategies. Among humans, field of study weakly predicts success: participants with STEM backgrounds perform better in the first tournament. Surprisingly, humans almost do not adjust their strategies across tournaments with different sets of opponents. This result suggests that humans base their choices primarily on the game's rules rather than on the identity of their opponents, treating LLMs much like human competitors.
Authors:Peter Fowles, Erik Falor, Sulove Bhattarai, John Edwards, Seth Poulsen
Abstract:
Background and Context: Large Language Models (LLMs) are more accessible and accurate than ever before, raising significant concerns for computing educators. One major concern is students using LLMs to bypass the effort needed to understand concepts and metacognitive strategies essential for success in computer science. Objectives: We contribute a unique approach to assessing and building up student understanding through weekly oral code review assessments. These formative assessments incentivize students to understand their submitted code, regardless of whether or not the code was generated by AI tools. We also use a flipped classroom to provide time for students to learn concepts outside of class and provide ample time for students to schedule code review interviews. Methods: For this paper, we collected data from three semesters. We analyze student exam scores, keystroke logs, and surveys to understand how the new course policies affected student learning, behavior, and attitudes. Findings: Pairwise comparison of exam results reveals a statistically insignificant increase in average scores for Fall 2025 compared to previous semesters. Keystroke logs show a significant increase in characters pasted per total characters input into coding assignments in Fall 2025, pointing towards higher AI usage. Survey results show positive student sentiment towards code reviews at the end of Fall 2025, with nearly all negative feedback being addressable through better scheduling and more rigorous TA training. Implications: Oral code reviews with a flipped classroom appear to be effective at mitigating harms of LLM use while providing space for students to freely experiment with these tools. Our work suggests that students in Fall 2025 still show adequate understanding of material covered in written exams, despite dramatic increases in LLM usage for coding assignments.
Authors:Houda Hafi, Bouziane Brik, Nuraini Jamil, Abdelkader Nasreddine Belkacem
Abstract:
Brain computer interface (BCI) enables the brain to directly control an external device by converting neural signals into actionable outputs. However, effective real-time translation of brain activity strongly depends on the quality of neural communication between the brain and the external device. 6G is the next generation of wireless communication, expected to provide unprecedented levels of data rates, data security, and automation capabilities. In this context, integrating 6G into BCI systems would not only enhance the performance of brain-device communication, but would also create new opportunities for innovative applications. This work provides a comprehensive study on how BCI technology can be built effectively on top of 6G wireless networks by introducing several technical aspects and use cases. We first provide an overview of BCI and 6G, following their progression from early development to convergence through cognitive communication and advanced neural interfaces. We then highlight the need for the upcoming 6G systems toward BCI technology in every aspect, including 6G technologies such as intelligent edge and zero-touch networks, and 6G use cases such as digital twin, immersive communication, and internet of minds. Furthermore, we identify key technical challenges, open issues, and future research directions related to the 6G-enabled BCI paradigm.
Authors:Saurav Ghosh, Gabriella Polach, Abdou Sow
Abstract:
Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types--summarization, planning, explanation, and coding--using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.
Authors:Rizwan Jahangir, Daisuke Ishii
Abstract:
As large language models (LLMs) demonstrate increasing competence in synthesizing functional user interfaces, a fundamental question emerges in accessibility computing: \textit{how far can AI-driven accessibility systems go?} This paper introduces the \textit{Accessibility Capability Boundary} (ACB), a formal framework for reasoning about the operational limits and expansion potential of autonomous accessibility systems, and grounds this theory in a real-world systems artifact. We model accessibility not as a binary compliance property but as a dynamic, multidimensional capability space constrained by measurable variables including deployment latency, cognitive load, infrastructure dependency, offline persistence, interaction complexity, and adaptability. We argue that AI-generated, browser-native systems constructed as single-file HTML artifacts leveraging standard browser APIs may dramatically shift the ACB outward by reducing deployment friction to near-zero and enabling rapid, context-specific interface adaptation. We ground our theoretical framework in the analysis of two real-world exploratory prototypes. The first is an AI-generated browser-native accessibility interface deployed for a blind user in Nepal. The second is a fully functional, open-source webcam alignment assistant for visually impaired users, serving as a concrete systems artifact. Through formal definitions, propositions, and a comparative evaluation matrix, we characterize the regions of the accessibility capability space that such systems can and cannot reach. We further identify remaining computational, infrastructural, and verification constraints that constitute the hard boundaries of this paradigm. This work contributes a theoretical foundation for understanding the scalable limits of autonomous accessibility computing and proposes a research agenda for future work in accessibility-aware AI systems.
Authors:Nicole Sultanum, Gustavo Moreira, Arjun Srinivasan
Abstract:
Presentation-oriented tasks including formatting and layout design are critical but often neglected aspects of dashboard authoring given their labor intensive nature. In this work, we follow a user-centered design approach to explore ways that partial reuse of pre-existing dashboards may support the dashboard design process. Based on collective feedback from 10 professional dashboard creators, we contribute: (a) findings from a formative study characterizing dashboard reuse needs and challenges; and (b) reflections and opportunities from a concept validation study with ReDash, a design probe for partial reuse of dashboard presentation features (style and layout) from multiple sources.
Authors:Khandaker Abrar Nadib, Marina Kogan, Alexander Lex, Maxim Lisnic
Abstract:
Charts used for persuasion can easily veer into being outright misleading when, for instance, cherry-picked data is paired with a deceptive caption, as is commonly encountered on social media. The rise of interactive time-series data explorers for hotly debated topics makes such framing easy to produce and spread. Post-hoc interventions like fact-checking often arrive too late and suffer from persistence of belief. Prior work suggests that guardrails, in the form of contextual comparison lines embedded directly into charts, can reduce these effects. We propose and evaluate a practical set of guardrail sampling strategies for implementing such contextual lines in real systems. In a preregistered mixed-design study with two real-world scenarios (COVID-19 and Stocks), participants viewed persuasive charts with different sets of guardrails and reported trust, estimated rank in the dataset, expressed their perceived completeness of context, as well as subjective preference for different tasks. Across scenarios, guardrails improved trust, accuracy of performance judgments, and perceived completeness of context compared to the control. Taken together, the study offers practical guardrail sampling methods, evidence of their contextual benefits, and insights into participants' preferences.
Authors:Joseph S. Boyle, Anthony Dranfield, Mike O'Neil, Maria Liakata, Alison Q. Smithard
Abstract:
In this paper we introduce ClinQueryAgent, a system for translating natural language population health questions into executable database queries using agents with access to both local and external knowledge bases. Our novel architecture enables the use of powerful cloud-based language models whilst ensuring that no patient data leaves the secure environment. To combat inaccuracies over the course of longer dialogues due to context rot, information retrieval is delegated to a sub-agent. We deploy the system via a chat window embedded within an existing population health management platform where it has been used by 128 staff from 15 healthcare practices covering a total of 148,319 patients in the UK's National Health Service (NHS). We evaluate the system's capacity to autonomously handle a range of health informatics tasks on a constructed dataset and via a beta-testing phase. Our results show that both analysts and clinicians are able to easily generate actionable information from patient health records using natural language requests requiring no programming expertise to verify. We make a public demo of the system available at: https://demo-899965260288.europe-west1.run.app/
Authors:Enrico Collautti, Xiaopeng Mao, Luca Tonin, Stefano Tortora, Sadasivan Puthusserypady
Abstract:
The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. In nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 +- 0.022 compared to 0.139 +- 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
Authors:Hung-Yue Suen, Yu-Sheng Su
Abstract:
Asynchronous video learning, including massive open online courses (MOOCs), offers flexibility but often lacks students' affective engagement. This study examines how teachers' verbal and nonverbal vocal emotive expressions influence students' self-reported affective engagement. Using computational acoustic and sentiment analysis, valence and arousal scores were extracted from teachers' verbal vocal expressions, and nonverbal vocal emotions were classified into six categories: anger, fear, happiness, neutral, sadness, and surprise. Data from 210 video lectures across four MOOC platforms and feedback from 738 students collected after class were analyzed. Results revealed that teachers' verbal emotive expressions, even with positive valence and high arousal, did not significantly impact engagement. Conversely, vocal expressions with positive valence and high arousal, such as happiness and surprise, enhanced engagement, while negative high-arousal emotions, such as anger, reduced it. These findings offer practical insights for instructional video creators, teachers, and influencers to foster emotional engagement in asynchronous video learning.
Authors:Bin Zou, Yijia Yuan, Chenghao Wang, Yiran Du
Abstract:
This study examined whether AI-mediated speaking practice can reduce acculturative stress among Chinese international students in UK universities. Using a sequential explanatory mixed-methods design, 126 participants were randomly assigned to an experimental group, which completed a four-week intervention using EAP Talk, an AI-assisted English for Academic Purposes speaking platform offering role play, scenario-based practice, free talk, and automated feedback, or a control group, which continued usual academic and English-learning activities. Pre- and post-test questionnaires measured perceived language insufficiency, social isolation, and academic pressure, while semi-structured interviews with 20 experimental-group participants contextualised the quantitative findings. Linear mixed-effects models showed that the experimental group experienced significantly greater reductions than the control group across all three outcomes, with the strongest effect on perceived language insufficiency. Interview findings suggested that EAP Talk supported low-stakes rehearsal, communicative confidence, academic speaking preparation, and greater willingness to initiate social interaction. However, participants also noted that AI-mediated practice could not fully reproduce authentic human interaction, disciplinary feedback, or broader institutional support. The findings suggest that AI-mediated speaking practice can function as a supplementary scaffold for reducing communication-related dimensions of acculturative stress, but should be integrated with peer interaction, teacher feedback, and wider support services.
Authors:Yilin Gong, Siqi Wu
Abstract:
Recent advances in artificial intelligence (AI) have made timely, scalable, and effective fact-checking increasingly feasible. One such deployment is X's Community Notes, which provides the AI Note Writer API to enable end-to-end automated generation of contextual information. We present the first empirical analysis of AI fact-checkers and their contributions on Community Notes, examining four key dimensions: volume, velocity, variety, and veracity. We find that, between September 2, 2025 and May 9, 2026, 20 AI writers account for 14.2% of all submitted notes, with their daily share rising rapidly to 44.8% lately. AI writers are highly responsive, typically submitting notes within minutes of posts becoming available via the API. They also expand coverage, contributing notes to 16.8% of fact-checked posts, of which 74.4% are not checked by humans. Over time, AI writers become more prolific and responsive, with increasing coverage and discovery rates. Despite these advantages, their veracity remains mixed. Collectively, AI writers contribute a higher share of helpful notes while receiving a smaller share of human ratings, relative to their share of submitted notes. Controlling for the fact-checked post and note submission order, both AI and human writers exhibit a first-mover advantage, with earlier notes attracting more ratings. More importantly, AI-generated notes are less likely to be classified as helpful than those written by human experts, though they outperform those written by laypeople. Our findings provide new insights into the practical capabilities and limitations of AI-driven fact-checking, with implications for the design and governance of human--AI collaborative crowdsourced context systems.
Authors:Yuri Noviello, Anastasiia Birillo, Gosia Migut
Abstract:
We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.
Authors:Mohammed Afaan Ansari, Aniruddh Bansal, Tianyi Zhou
Abstract:
Charts are the dominant medium for visualizing data, discovering patterns and trends, and communicating data driven insights, yet designing them still requires expensive human effort and expertise, such as selecting appropriate chart types, axis orientations, font sizes, and layouts. Most automatic visualization systems rely on handcrafted heuristics or simple rule matching and therefore struggle to generalize across domains. This work explores the potential of large language models (LLMs) as chart designers. We propose ChartDesign, which post-trains LLMs to imitate human experts and generate chart design attributes given tabular data. To this end, we curate a diverse training corpus of data design pairs from charts in public surveys (PewResearch) and academic repositories (CharXiV). Vision language models are used to extract data and design attributes from these charts, including chart type, sub type, alignment, titles, axis labels, and bar spacing, formatted as JSON. We then fine tune LoRA adapters on Phi3, Qwen3, and InternVL2.5 to learn a mapping from data to design specifications. ChartDesign significantly improves chart design performance over strong baselines, achieving up to 84% accuracy on a held-out test set (vs. 53% for the best baseline) and generalizing to unseen domains. We further show that charts rendered from ChartDesign generated specifications are visually appealing and human preferred, narrowing the human AI gap in data visualization.
Authors:Jennifer Posada, Taha Hassan, Lujie Karen Chen, Louise Yarnall, Jiaqi Gong
Abstract:
Data storytelling workflows ask learners to integrate analytical, design, and narrative skills, but instructors rarely have the capacity to provide detailed feedback at each step. Computational and AI-assisted storytelling offers opportunities to support student learning, but how feedback should be structured effectively remains unclear. To address this gap, we conducted a two-phase participatory design study. Through participant observations (N=8) and interviews (N=6), the first phase explored learners and educators' feedback needs and challenges in a data storytelling course. The second phase conducted two design workshops (N=8/10) to design and evaluate feedback strategies (frequency, seamlessness, accountability) for Story Studio: an AI-assisted narrative storytelling application. Our findings show that participants perceived on-demand and process feedback modes as effective, but automatic and outcome feedback as slightly more persuasive. We discuss implications for designing AI-augmented storytelling systems that adapt their feedback modes to the diverse needs and expectations of students.
Authors:Andrea Wen-Yi Wang, Waki Kamino, David Mimno, Karen Levy, Malte F. Jung
Abstract:
Clearly-defined rules are often assumed to be straightforward to automate and evaluate. We challenge this assumption through an in-depth study of Major League Baseball's (MLB) seven-year experimentation with the Automated Ball-Strike System (ABS). ABS is envisioned to call balls and strikes accurately: a seemingly straightforward use of technology to objectively determine the distance between a pitch and the strike zone. Although the strike zone is an area clearly defined in the rulebook, it took MLB seven years to figure out how to automate calling balls and strikes with ABS, showing how even seemingly straightforward rules require a complex translation process to operationalize via technological systems. In this paper, we trace the design decisions that led to the current implementation of ABS. Our case study reveals that "distance" exists even between a clear rule and its technological implementation. Using analytic frameworks from Science and Technology Studies (STS), we show that such distance exists because (1) historically, the "ground truth" of the strike zone is contested: the rule in practice has always reflected a hybrid between the rulebook definition and umpires' enforcement decisions; and (2) the use of ABS is embedded in an existing eco-system, where the implementation of a technological enforcement system needs to balance multiple stakeholder values. This perspective challenges conventional evaluation paradigms that center on the distance between a formalized rule and its technological implementation, and instead calls for evaluating how such systems are experienced in practice. Addressing this question requires in-depth social science approaches, contributing to ongoing conversations in FAccT about the implementation and evaluation of sociotechnical systems.
Authors:Janne Rotter, Pau Benazet i Montobbio, Davinia Hernández-Leo
Abstract:
In recent years, generative AI (GenAI) in educational settings has become ubiquitous in students' daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used unrestrictedly. While most prior research has thus focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 participants, we compared the agent's effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, all without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, timing of GenAI access therefore is a tractable, theoretically grounded, and scalable pedagogical paradigm that improves over completely unrestricted and withheld access, compatible with off-the-shelf tools and potentially low adoption barrier. This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.
Authors:Jiuming Jiang, Shidong Pan, Daniel W Woods, Jingjie Li
Abstract:
Online video games have become major online social spaces where users interact, compete, and create together. These spaces, however, expose users to a wide spectrum of online harms, including harassment, discrimination, inappropriate content, privacy breach, cheating, and more. The shape and severity of such harms vary across game design, mechanics, and community context. To mitigate these harms, game companies issue Codes of Conduct (CoCs) that articulate online safety rules and direct players to safety resources. However, it remains unclear how prevalent CoCs are, what safety, security and privacy violations they govern, and whether they meet growing regulatory and industry expectations. We develop and leverage CONDUCTIFY, a pipeline for identifying and analyzing CoCs at scale. Applied to Steam, the largest PC game marketplace, it located the available CoCs for 350 of the 9,586 multiplayer titles on Steam. We found that CoCs are more available among popular, adult-oriented, and community-driven games, while most multiplayer games operate without CoCs despite regulatory and industry recommendations. Although over 80% of the games with CoCs available consistently address traditional security and safety violations, their governance approaches vary substantially across types of violations. A further asymmetry emerges in specificity. Compared with harms related to gameplay mechanics, the articulations of interpersonal harm and the underage player safety are often less specific, despite their relevance to many game communities. Together, these results inform the improvement of online safety governance and CoC enforcement practices, and building better safety infrastructure for the community of players and developers.
Authors:Shantanu Sarkar, Jose L. Contreras-Vidal
Abstract:
Electroencephalogram (EEG) signals are highly susceptible to artifacts, resulting in a low signal-to-noise ratio which makes extraction of meaningful neural information challenging. Artifact Subspace Reconstruction (ASR) is one of the most widely used artifact filtering techniques in EEG-based BCI applications, owing to its real-time applicability. ASR reconstructs artifact-free signals by operating in Principal Component (PC) space within sliding windows. However, ASR performance is critically sensitive to its threshold parameter - an incorrect threshold risks removing task-relevant neural features alongside artifacts. Furthermore, since PCs are linear combinations of all channels, subspace reconstruction in PC space may alter the underlying data structure, potentially discarding essential neural information. To address these limitations, we propose nASR, a novel end-to-end trainable Keras layer that jointly optimizes artifact rejection and downstream decoding. nASR introduces two trainable threshold parameters: K, which governs artifact detection in PC variance space, and L, which quantifies eigen-spread to pinpoint the primary artifact--contributing channels, enabling selective channel-level reconstruction that preserves clean channel information. An ablation study comprising five model variants (m01 - m05), evaluated across two subjects from the BCI Competition IV Dataset 1, confirms that nASR variants consistently outperform traditional ASR on test classification metrics, while achieving a 6-8x reduction in inference time, making nASR a strong candidate for real-time BCI applications demanding both low latency and high decoding performance.
Authors:Shantanu Sarkar, Sai Shashank Gandavarapu, Jeff Feng, Saurabh Prasad, Reza Khanbabaie, Jose L. Contreras-Vidal
Abstract:
Mild traumatic brain injury (mTBI) is a prevalent condition that remains difficult to diagnose in its early stages. Oculomotor dysfunction is a well-established marker of mTBI, motivating the development of portable tools that capture both eye-movement behavior and underlying neurophysiology. In this work, we present an initial framework that integrates electroencephalogram (EEG) with augmented-reality (AR)-based Vestibular/Ocular Motor Screening (VOMS) tasks to estimate subject-specific ocular response times. Pre-processed EEG signals, obtained through band-pass filtering and average referencing, are analyzed using a Redundant Discrete Wavelet Transform (RDWT)-driven deep neural framework. The RDWT coefficients are subjected to trainable zero-phase convolutional filtering and reconstructed into the time domain via inverse RDWT, followed by channel-wise temporal and spatial filtering using 2D convolution layers and convolutional-LSTM-based decoding. An ablation study demonstrates that wavelet-domain filtering serves as an effective denoising strategy, improving prediction performance. Sliding-window predictions were validated using Pearson correlation (>= 0.5), and Dynamic Time Warping (DTW) was subsequently used to estimate ocular response times. DTW-derived metrics revealed significant inter-subject differences across all VOM tasks, supported by Mann-Whitney U tests. Cross-correlation analysis further revealed task-dependent temporal behaviors: pursuit tasks exhibited reactive tracking, whereas saccades showed anticipatory responses. Overall, the results highlight pursuit tasks as particularly informative for distinguishing timing differences and demonstrate the potential of RDWT-based EEG features combined with DTW metrics for multimodal mTBI assessment.
Authors:Yiwei Wang, Chuan Zhu, Tianjun Feng, Lauren Xiaoyuan Lu, Bingxin Jia
Abstract:
Agentic AI systems that autonomously perform service tasks are entering customer service operations. However, limited evidence exists on how human interventions shape service outcomes when agentic AI failures create both cognitive and emotional consequences. We study this issue through a randomized field experiment on Alibaba's Taobao platform. Workers in the treatment condition supervised an agentic AI system that resolved AI-eligible chats while continuing to handle AI-ineligible chats, whereas control workers resolved all chats without agentic AI. The findings show that AI deployment reduces average chat duration and has limited effects on retrial rates, but substantially lowers ratings for AI-eligible chats. Moreover, human intervention effectiveness in AI-eligible chats depends on the nature of AI failure, post-escalation intervention effort, and intervention timing. Human intervention preserves service quality in algorithm-triggered technical escalations, i.e., unresolved customer issues beyond the AI's capability, but is less effective in algorithm-triggered emotional escalations, i.e., where customers express frustration or dissatisfaction. These differences are partly explained by variation in workers' post-escalation intervention effort across escalation types. In algorithm-triggered emotional escalations, workers showed lower engagement: they sent fewer messages, contributed a smaller share of total chat rounds, and showed less proactivity in information seeking and solution provision. We further find that early intervention is essential for sustaining high post-escalation intervention effort. Finally, we document a positive spillover effect on AI-ineligible chats, as treated workers adapted their multitasking workflow to devote greater attention to these chats. These findings offer implications for human-in-the-loop process design in human-AI collaboration systems.
Authors:Yaniv Eliyahu Amiri, Noah Chicoine, Jacqueline Griffin, Stacy Marsella
Abstract:
Hospital pharmacists make high-stakes decisions to mitigate drug shortages under uncertainty, time pressure, and patient risk. Interviews revealed that pharmacists focus attention on a small subset of drugs, limiting cognitive effort to the most urgent cases. Motivated by these findings, we formalize a bounded-rational, attention-guided decision framework that dynamically decomposes drugs into a subset for high-cost reasoning and a complementary subset for low-cost monitoring. We develop two agents: an Expert Agent that applies attention weights derived from pharmacist interviews, and a Learner Agent that adapts attention allocation over time through experience. Across simulated scenarios spanning short to long horizons, we show that attention-guided planning supports stable decision-making without complete state reasoning. These results suggest that a primary decision is not what action to take, but where to allocate cognitive effort, and that attention-guided, satisficing strategies can reduce problem complexity while maintaining stable performance.
Authors:Tomu Tominaga, Naomi Yamashita, Takeshi Kurashima
Abstract:
Algorithmic recourse provides counterfactual action plans that help people overturn unfavorable AI decisions. While diverse recourse sets may improve transparency and motivation, they may also impose cognitive load and negative emotions by increasing counterfactual reasoning demands. To examine this trade-off, we conducted a between-subjects controlled experiment (N=750) that manipulated recourse-set diversity and size, and evaluated these effects on psychological benefits and costs. Results show that diversification enhances psychological benefits (e.g., willingness to act) for small sets without incurring additional psychological costs, whereas for large sets, it makes cognitive load more salient. These findings suggest that naively diversifying recourse can burden decision subjects, underscoring the need for new diversification methods that incorporate human cognition and psychology to mitigate such costs.
Authors:Mariame Tighanimine, Jessica Pidoux, Sonia Kgomo, Kauna Ibrahim Malgwi, Richard Mwaura Mathenge, Mophat Okinyi, James Oyange
Abstract:
In this article, we audit the working conditions of content moderators in Kenya and Nigeria employed by business process outsourcing (BPO) companies by using the European General Data Protection Regulation (GDPR). We demonstrate its extraterritorial scope for gaining access to elements such as employment contracts and NDAs that have never been provided to the workers concerned. The results of this approach provide legally grounded evidence of the structural disadvantages faced by content moderators in the Global South, whose exploitative working conditions violate workers' rights. Our work also highlights the benefits of legislation aimed at protecting individuals' data rights as a counterweight to the tech industry's discourse of exceptionalism, which obscures its dependence on BPOs to externalise labour costs and accountability, whilst claiming that its products, business models, and methods of resource extraction are unprecedented and fall outside any existing legal framework.
Authors:Conrad Borchers, Lijin Zhang, Kexin Yang, Tomohiro Nagashima, Benjamin W. Domingue
Abstract:
Adaptive learning systems can produce substantial learning gains, yet many students engage for too brief or too superficial a period to benefit. A central obstacle is measuring effort. Effort during multi-step problem solving is rarely directly observed, and common log-based proxies, such as time on task, cannot distinguish between a student working carefully and a student encountering a harder problem. We examine step-to-step response time as a scalable effort signal by modeling trait-like differences in students' typical response timing during tutoring (while adjusting for skill difficulty). Using step-level logs from eight classroom deployments of algebra tutoring systems (2020 to 2023) across six U.S. schools (794 students), we estimate student- and knowledge-component-level propensities using hierarchical models and relate them to learning efficiency, defined as performance improvement per completed solution step. Response-time propensities show moderate to strong stability within students, supporting their use as an individual differences measure beyond correctness. At the same time, their relationship to learning is not uniform but conditional on the learner and context. Slower propensities predict greater learning efficiency for higher-proficiency students, consistent with constructive processing, whereas for lower-proficiency students, slower propensities are weakly related or even negative, consistent with unproductive struggle or idling. These associations are strongest early in practice sequences and attenuate later in the class period, highlighting an actionable window for detecting emerging disengagement and low persistence. Overall, response-time propensities provide a practical way to incorporate temporal process data into learner models and to target adaptive supports when effort is most diagnostic.
Authors:Shanshan Zhu, Han Zhang, J. Doris Chi, Subigya Nepal, Koustuv Saha
Abstract:
LLMs are increasingly used to explain personal sensing data, translating traces of activity and mood into natural-language accounts of why an anomalous day may have occurred. However, such explanations can sound coherent and personally meaningful even when the underlying evidence is sparse or missing. We introduce epistemic overreach (EO) as a measure for cases where a generated explanation implies more than the available sensing evidence can justify. To audit how often and in what forms EO occurs, we obtained anomalous-day scenarios from three longitudinal sensing datasets of college students: StudentLife, GLOBEM, and CollegeExperience. Across activity, sleep, and affect anomalies, we generated 14,922 explanations using three LLM families -- Llama, Qwen, and GPT -- under two prompting conditions: one minimally constrained prompt and another prompt explicitly instructing models to bound claims to the data. For each scenario, we varied the amount of behavioral evidence available to the model to examine whether more evidence reduces EO. We evaluated each explanation using a structured rubric, decomposing EO into the dimensions of unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, and diagnostic inference. We find that LLMs routinely attribute anomalous days to causes without sufficient support from the data, and that this pattern replicates across datasets, anomaly types, and model families. Further, providing richer context does not reliably reduce EO; bounded prompting helps but does not eliminate it. These findings suggest that evidential grounding should be a first-order evaluation criterion for LLM-generated personal sensing explanations, alongside fluency and plausibility. We argue that personal sensing explanations require evidential discipline: systems must distinguish what is observed, what is inferred, and what remains unknown.
Authors:Ishitaa Narwane, Johanna Gunawan, Konrad Kollnig
Abstract:
Current approaches to addressing deceptive design largely focus on visible interface manipulations, commonly referred to as "dark patterns". With the rise of generative AI, deception is becoming more difficult to spot and easier to live with, as it is quietly embedded in default settings, automated suggestions, and conversational interactions rather than discrete interface elements. These subtle, normalised forms of influence, which Simone Natale frames as "banal deception", shape everyday digital use and blur the line between AI-enabled assistance and manipulation. This position paper explores banality as a lens through which to reason through deception in generative AI experiences, especially with chatbots. We explore what Natale describes as users' own involvement in their deception, and argue that this perspective could lead to future work for introducing friction to safeguard users from deception in generative AI interactions, such as empowering users through raising awareness, providing them with intervention tools, and regulatory or enforcement improvements. We present these concepts as points for discussion for the deceptive design scholarly community.
Authors:BoRui Li, Bofan Yu, Xing-Dong Yang
Abstract:
As car cabins evolve with the integration of diverse sensors, traditional car cabins are transforming into smart environments. This shift raises important questions about how privacy is understood and managed in such spaces. In this work, we investigate privacy perceptions from the perspectives of both vehicle owners (i.e., people who purchase and own cars) and non-owners (i.e., people who temporarily use cars, such as family members, friends, or renters). Through semi-structured interviews with eighteen participants, we identified key factors that influence these groups' views on privacy. Our findings reveal factors that commonly influence privacy preferences for both owners and non-owners, as well as factors that have a stronger impact on one group over the other. Drawing on these insights, we discuss design implications for future designs to better support and balance the diverse privacy needs of multiple stakeholders in smart car cabins.
Authors:Xiaofang Xiao, Guangchao Li, Guangrong Zhao, Qi Lin, Wen Ma, Hongkai Wen, Yanxiang Wang, Yiran Shen
Abstract:
Automatic sign language recognition (SLR) has become a key enabler of inclusive human-computer interaction, fostering seamless communication between deaf individuals and hearing communities. Despite significant advances in multimodal learning, existing SLR research remains dominated by vision-based datasets, which are limited by sensitivity to lighting and occlusion, privacy concerns, and a lack of cross-modal diversity. To address these challenges, we introduce SIGMA-ASL, a large-scale multimodal dataset for SLR. The dataset integrates an Azure Kinect RGB-D camera, a millimeter-wave (mmWave) radar, and two wrist-worn inertial measurement units (IMUs) to capture complementary visual, radio-reflection, and kinematic information. Collected in a controlled studio environment with 20 participants performing 160 common American sign language (ASL) signs, SIGMA-ASL provides 93,545 temporally synchronized word-level multimodal clips. A unified sensing framework achieves millisecond-level alignment across modalities, enabling reliable sensor fusion and cross-modal learning. We further design standardized preprocessing pipelines and benchmarking protocols under both user-dependent and user-independent settings, offering a comprehensive foundation for evaluating single and multimodal SLR. Extensive experiments validate the dataset's quality and demonstrate its potential as a valuable resource for developing robust, privacy-preserving, and ubiquitous sign language recognition systems.
Authors:Taeho Kang, Yiyu Chen, Christian Wallraven
Abstract:
In this paper, we conduct a detailed investigation on the effect of independent component (IC)-based noise rejection methods in neural network classifier-based decoding of electroencephalography (EEG) data in different task datasets. We apply a pipeline matrix of two popular different independent component (IC) decomposition methods (Infomax and Adaptive Mixture Independent Component Analysis (AMICA)) with three different component rejection strategies (none, ICLabel, and multiple artifact rejection algorithm [MARA]) on three different EEG datasets (motor imagery, long-term memory formation, and visual memory). We cross-validate processed data from each pipeline with three architectures commonly used for EEG classification (two convolutional neural networks and one long short-term memory-based model. We compare decoding performances on within-participant and within-dataset levels.Our results show that the benefit from using IC-based noise rejection for decoding analyses is at best minor, as component-rejected data did not show consistently better performance than data without rejections; especially given the significant computational resources required for independent component analysis (ICA) computations.
Authors:Britt Besch, Tobias Gerstenberg
Abstract:
Explanations are inherently contrastive: E happened rather than E' because of C rather than C'. However, these contrasts, or "foils", are rarely mentioned explicitly but have to be inferred in context. Here, we investigate how people select the intended foil E' of a why-question. Participants read vignettes and judged, for each foil, their prior expectation (what will happen next), closeness (what is most similar to what happened), and hindsight expectation (what could have happened instead), as well as which foil they thought the question asker had in mind when they asked the why-question. We found that foil selections were best predicted by hindsight expectation judgments. This suggests that people infer the foil by considering what a question asker finds surprising after the outcome occurred. Since correct foil selection is relevant not only in human-human interaction but also increasingly in dialogues with large language models, we investigated their performance on the same task. The coupling between LLMs' explicit expectation judgments and their foil selections is inconsistent.
Authors:Rachel Hill, Tom Owen, Julian Hough
Abstract:
Despite careful design involving classifiers, parameters, and safeguarding, errors during human/AI interaction are not rare. Poor error recovery can disrupt interaction flow, damage user trust, and decrease user engagement. Whilst existing work has explored LLM recovery, tone, context, and personality as separate design dimensions, no existing work has combined these variables into a structured guidance framework. This paper presents a recovery code that maps four common LLM chatbot task contexts to associated personality traits (four Big Five personalities: Conscientiousness, Agreeableness, Openness, and Extraversion), tones, and three-stage recovery instructions. A recovery evaluation rubric was also designed, comprising three dimensions (Recovery quality, Tone alignment, and Appropriateness) and nine sub-dimensions. The methodology is exploratory, with no participants used. A between-subjects design was employed across two conditions: Condition A (baseline, uncoded), four separate Claude Sonnet 4.6 agents received no recovery code training; Condition B (coded), four separate Claude Sonnet 4.6 models were trained on the recovery code. Identical 'user' prompts and error scenarios were used across both conditions. Eight LLM evaluator agents assessed the recovery responses using the evaluation rubric, producing scores out of 5 for each sub-dimension. Results found a 27.8% average performance increase in coded recovery responses (76.7%) compared to baseline responses (48.9%). Condition B performed strongest in the appropriateness dimension (83.3%), with notable improvement in personality appropriateness (75% versus 50%) and providing explanation (60% versus 20%). These findings suggest that structured personality, context, and tone-informed recovery codes can be successfully learnt and applied by LLM chatbots to improve error recovery quality across varying contextual tasks.
Authors:Guoqing Cai, Kai Zeng, Shoulin Huang, Ting Ma
Abstract:
Deep Riemannian networks provide a powerful framework for Electroencephalography (EEG) decoding, but their practical applications are severely constrained. Accurately decoding EEG signals requires modeling complex temporal dynamics across multiple rhythms, which results in high-dimensional Riemannian inputs and significant computational costs. To address this, we propose the Manifold Pooling Network (MPNet). MPNet uses a rhythm-adaptive convolutional frontend to extract comprehensive time-frequency representations and generate multi-view Riemannian nodes. A novel manifold node pooling layer is then proposed to aggregate these nodes into a single fusion node with a fixed size, enabling the following deep Riemannian network to process it with greatly reduced costs. Experiments on two public EEG datasets show that MPNet achieves state-of-the-art accuracy, runs up to 10 times faster than the comparable Riemannian model, and maintains robust performance under limited-data conditions. These findings highlight MPNet's practicality and efficiency for real-world EEG applications.
Authors:Berk Sezer, Ali Görkem Küçük, Erol Şahin, Sinan Kalkan
Abstract:
While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at https://gazeforhri.github.io.
Authors:Minju Park, Ivan Orozco Vasquez, Cristina Conati
Abstract:
Large language models (LLMs) are becoming increasingly embedded in students' learning practices, yet much of what is known about how students use LLMs and how this usage impacts learning comes from problem-solving domains or constrained experimental settings. We present an analysis of data on LLM usage collected during two offerings of a research-oriented course where students learn to read, reason about, and critique academic papers. Without restrictions on whether or how to use LLMs, students reported their LLM usage practices when asked to do these activities as a series of homework assignments during the course. This paper extends prior work done on data from a single offering of the same course by presenting a refined bottom-up categorization of LLM usage types, cross-labeled by the extent of student initiative these usages entail. Furthermore, we examine how LLM use impacts student learning, measured by performance on three midterms, looking at factors such as frequency and type of usage.
Authors:Emily Saltz, Claire R. Leibowicz
Abstract:
AI chatbots already function as de facto mental health support tools for millions of people, including people in crisis. Yet, they lack the clinical validation, shared standards, and coordinated oversight that their societal role demands. This primer was developed in conjunction with a multistakeholder workshop hosted by Partnership on AI in 2026, convening AI labs, mental health practitioners, people with lived experience, and policymakers, to provide a common cross-sector reference point for the current state of the field of AI and suicide prevention. It begins with an overview of clinical best practices, then turns to how frontier AI systems (as of winter 2026) detect and respond to suicide and non-suicidal self-injury (NSSI) queries. Together, these provide insight into what it would take to design and implement AI tools that not only better prevent suicide and NSSI, but also promote overall well-being. Drawing on clinical literature, publicly available AI lab policies, an emerging landscape of evaluation frameworks, and conversations with leaders across the AI and mental health fields, we map challenges posed by general-purpose AI chatbots for mental health across model, product, and policy layers, ultimately highlighting priority areas where cross-industry alignment is both urgently needed and achievable.
Authors:Brandon Lit, Anthony Maocheia-Ricci, Thomas Driscoll
Abstract:
Software testing is a fundamental process of software development, and prior work has shown that visualizations of test results support testers' decision-making. However, Human-Computer Interaction research on software testing has yet to explore and understand the shared interface elements and patterns in visualization of testing outputs. To address this, we conducted a visual comparative analysis of the output of 50 software testing tools and harnesses (44 with CLI output, 6 with GUI output) across four popular programming languages. Our analysis reveals the common interface elements in software testing tools, how these tools display and visualize test results, as well as the specific make-up of the output. Our findings provide insight on how visual testing output is formatted and how colour is used across both CLI and GUI environments, identifying trends that can be applied by developers of testing tools.
Authors:Imen Benzarti, Ikram Darif, Abderrahmane Leshob, Hafedh Mili, Darine Amayed
Abstract:
Human-centered Requirements Engineering (HC-RE) integrates user cognition, emotions, and social interactions into the RE process through contributions from disciplines such as psychology, cognitive science, design thinking, and human-computer interaction. Despite growing interest, how these multidisciplinary contributions are structured and why they remain fragmented across the RE lifecycle is not well understood. This systematic mapping study analyzes 56 primary studies across seven dimensions, including RE phases, user involvement techniques, contributing disciplines, and evaluation methods. Results show that 70\% of approaches involve multidisciplinary contributions, yet only 39% have been empirically evaluated and 48% address only the elicitation phase. A cross-study analysis reveals a structural separation between two parallel integration traditions: a Cognitive-Formal (C-F) pathway grounded in goal-based frameworks and formal modeling, and a Participatory-Iterative (P-I) pathway grounded in scenario-based frameworks and iterative design. Each pathway has developed complementary strengths, but their near-total disconnection explains the persistent lifecycle concentration and theory-practice gap observed in the corpus. The findings identify the absence of translation mechanisms between human-centered artifacts and formal RE specifications as the field's primary structural gap, provide a structured research agenda organized into four priority tiers, and establish the empirical foundation for Experience-Centered Requirements Engineering, a direction in which user experience is explicitly operationalized as a first-class concern in requirements specification.
Authors:Qian Yang, Jessie Jia, Elaine Tsai, Amy Li, Nader Akoury, Natalie N. Bazarova
Abstract:
Interactive, multi-agent social simulation systems have shown promise for helping users practice navigating various complex social situations across domains. This paper asks: To what extent can such systems help young adult (YA) bystanders speak up publicly against cyberbullying, a task often thwarted by complex, multi-party social dynamics? We created Upstanders' Practicum, a multi-AI-agent social media simulation powered by Large Language Models (LLMs), as a probe and observed 34 YAs freely practicing public bystander intervention across three iteratively refined versions. We found that practicing public bystander intervention in the simulation was helpful, but after participants made three attention shifts: (1) from inattention to paying true attention, (2) from self-focus ("I don't usually do this'') to attending to those directly involved, and (3) from resolving the private conflict between bully and victim ("maybe I could set up the meeting between them'') to addressing the broader audience online ("public comment is about norm-setting"). Only after these shifts did practice in the simulation start to help: participants then saw a reason to speak up publicly and, through continued practice, crafted tactful public messages without explicit instruction. These findings illuminate new design and research opportunities for bystander education beyond social skill instruction, namely, designing for true attention, for fostering a vocal upstander identity, and for seeing bystander intervention as public norm setting. In addition, we open-source Truman Agents (cornell-design-aigroup.github.io/TrumanAgents/), the first-of-its-kind multi-LLM-agent social media simulation platform that Upstanders' Practicum builds upon, for future cyberbullying and social media research.
Authors:Thomas Menzel, Michel Schimpf, Thomas Bohné
Abstract:
Romantic breakups are among the most common and intense sources of psychological distress. We evaluated *overit*, a single-session AI chatbot that uses cognitive reappraisal to address breakup distress, informed by memory reconsolidation theory. In a pre-registered randomized controlled trial, 254 adults in the United States and United Kingdom who had experienced a romantic breakup were assigned to either an initial survey assessment followed by an AI chat session or to a survey-only control. Breakup distress was measured at baseline, 7 days, and again at an exploratory 1-month follow-up using the Breakup Distress Scale. Participants assigned to *overit* showed a significantly greater reduction in breakup distress than controls at 7 days (time-by-condition interaction B = -5.36, SE = 1.19, p < .001; completer-based d = -0.70). A smaller but still significant treatment advantage remained detectable at the exploratory 1-month follow-up among post-session completers (B = -2.92, SE = 1.22, p = .017). Exploratory post hoc moderation suggested a larger effect among male participants (B = 7.78, p = .003). These results suggest that a brief AI chatbot conversation can meaningfully reduce breakup distress, with exploratory evidence that a smaller advantage persists over the following month. Future work should test the intervention against active controls, evaluate repeated-session use, and recruit more diverse samples.
Authors:Alexandria K. Vail, Marcelo Cicconet, Katie Aafjes-van Doorn, Ryan Maroney, Marc Aafjes
Abstract:
Modeling latent clinical constructs from unconstrained clinical interactions is a unique challenge in affective computing. We present ADAPTS (Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms), a framework for automated rating of depression and anxiety severity using a mixture-of-agents LLM architecture. This approach decomposes long-form clinical interviews into symptom-specific reasoning tasks, producing auditable justifications while preserving temporal and speaker alignment. Generalization was evaluated across two independent datasets ($N=204$) with distinct interview structures. On high-discrepancy interviews, automated ratings approximated expert benchmarks ($\text{absolute error}=22$) more closely than original human ratings ($\text{absolute error}=26$). Implementing an ``extended'' protocol that incorporates qualitative clinical conventions significantly stabilized ratings, with absolute agreement reaching $\text{ICC(2,1)} = 0.877$. These findings suggest that the ADAPTS framework enables promising evaluations of psychiatric severity. While the current implementation is purely text-based, the underlying architecture is readily extensible to multimodal inputs, including acoustic and visual features. By approximating expert-level precision in a protocol-agnostic manner, this framework provides a foundation for objective and scalable psychiatric assessment, especially in resource-limited settings.
Authors:Xinru Tang, Ting-an Lin, Jingjin Li, Shaomei Wu
Abstract:
Drawing on crip theory, this paper proposes cripping AI as a guiding framework to center lived disability experiences in AI research and development. Moving beyond calls to make AI "accessible" to people with disabilities, cripping AI seeks to: (1) reveal and dismantle ableist assumptions embedded in how AI is imagined, designed, and evaluated; (2) center disabled ways of knowing (i.e., cripistemologies); (3) respect disabled labor in co-creating accessible practices. We demonstrate how to apply our framework with three cases: deafness and sign language AI, blindness and visual assistive AI, and stuttering and speech AI. We end by outlining three directions for future work, including cripping AI with diverse human bodyminds, across the entire AI pipeline and ecosystem, and in collaboration with other justice-oriented AI efforts.
Authors:Eunchae Jang, S. Shyam Sundar
Abstract:
AI systems have long been expected to interact with users, answering questions, generating content, and continuing (social) conversations. Agentic AI, however, breaks from this expectation, as its primary objective is workflow execution on behalf of the users. If a system becomes more agentic, do users need less interaction with the system? Our answer is: less routine back-and-forth, but more communication for oversight and explanation, as agentic AI proactively acts, not just responds. Grounded in a communication perspective, we discuss how users perceive the communicative roles of AI systems (whether as the source of actions or merely a channel), and how this can shape trust. Because agentic AI can play multiple communicative roles, it can complicate this source perception and introduce potential risks. To address this, we propose three types of explanations that agentic AI needs to incorporate (action-process, uncertainty, and coordination), and suggest that customization affordances that allow users to decide when and which explanations they see may be key to preserving human agency as AI autonomy increases.
Authors:Annie Yuan, Xiaohua Chen, Kalina Yacef, Judy Kay
Abstract:
Tacit knowledge embedded in expert practice remains difficult to capture, formalise, and scale. While AI-driven educational systems have advanced personalisation, learner modelling, affective support, and self-regulated learning, they less often model the tacit reasoning and context-sensitive judgement that underpin expert practice in practice-based domains. This paper introduces the AI Expert Twin, a cognition-centric framework that models expert knowledge as structured, computable representations of procedural actions, semantic concepts, and decision processes. The framework also considers how value-laden preferences, trade-offs, and uncertainty shape expert judgement in practice. We formalise expert cognition as a three-layer representation and capture knowledge from experts under this model, laying the groundwork for integration into AI-powered educational system. A case study in a cultural heritage workshop demonstrates the feasibility of the approach in a real-world setting. The framework is designed to be transferable across domains such as vocational education and creative industries. By embedding expert heuristics into AI while maintaining transparency and learner agency, the AI Expert Twin offers a novel path towards scalable, practice-based learning and invites further research on ethical, human-centred applications of AI in education.
Authors:Haoyu Wang, Fengyuan Zhu, Bingjian Huang, Zhecheng Wang, Ludwig Sidenmark
Abstract:
Mixed Reality (MR) aims to blend digital and physical worlds, but the absence of haptic feedback often breaks visual-tactile consistency. We introduce Prop-Chromeleon, a MR system based on generative artificial intelligence (AI) that dynamically transforms everyday objects into adaptive passive haptic props through user-provided text prompts. Our AI pipeline performs generation and anchoring of virtual assets that align with the shape of physical props, allowing us to study how virtual content generation behaves under geometric and prompt-based constraints. We evaluate Prop-Chromeleon's effectiveness through a generation study using varied object shapes and user prompts, combining quantitative shape similarity metrics with qualitative prompt fidelity analysis. Our user study further showcases Prop-Chromeleon's improvements in perceived realism, immersion, and enjoyment compared to static baselines. These results show that shape-aware generation can support both believable haptic interaction and creative engagement in MR.
Authors:Nick von Felten, Luisa Ella Müller, Johannes Schöning
Abstract:
Expectations about the support of artificial intelligence (AI) may influence interaction outcomes similar to placebos. Such expectations may result from AI washing, a practice of overstating a system's AI capabilities when actual functionality is limited. For example, some computer mice are marketed as "AI-assisted" despite lacking AI in core functions. In a within-subjects study, 28 participants completed Fitts' Law tasks with a computer mouse under three conditions: no support, supposed predictive AI support, and supposed biosignal-enhanced AI support. Objective Fitts' Law performance indicators and subjective performance expectations, perceived workload, and perceived usability were measured. Compared to baseline, participants expected significantly improved performance in placebo conditions. However, these expectations did not translate into differences in objective or subjective assessments. This paper contributes evidence that AI washing inflates user expectations without altering actual interaction outcomes, highlighting a critical transparency issue. By exposing how deceptive AI marketing can shape user expectations, we underscore the need for accountability in AI product claims. Further, we establish Fitts' Law as a rigorous methodological lens for auditing AI-labelled input devices.
Authors:Mohammad Raihanul Bashar, Alejandro Olivares Hernandez, Yahia Zine, Anil Ufuk Batmaz
Abstract:
Virtual reality (VR) is widely used for procedural medical training, yet most simulators emphasize realism while providing limited formative feedback. We examine how gamification affects performance, workload, and experiential quality in VR training for ultrasound-guided peripheral intravenous catheter insertion. We developed a gamified simulator with semantically aligned visual and auditory feedback (e.g., progress indicators, alignment guidance, rewards) while preserving procedural fidelity. Two studies were conducted with novices (N=24) and clinicians (N=12). Results showed that gamification reduced task time, improved usability, and lowered workload across expertise levels. Qualitative findings indicate improved goal clarity and confidence for novices and better pacing for experts. Overall, gamification can function as an effective formative feedback in VR medical training.
Authors:Sejal Agarwal, Delara Forghani, Brandon Lit, Thomas Driscoll, Anthony Maocheia-Ricci
Abstract:
Human-Computer Interaction (HCI) is a diverse field bringing together theories and methods from fields such as computer science, psychology, and human factors. Historically, HCI has focused on the human through ``user'' or ``human'' centered design, where the focus was either on information processing or understanding people and their concerns with respect to technology. However, amid the increasing adoption of generative AI tools, this workshop explores two critical questions in regards to HCI: What is HCI? and Why does the ``human'' matter? We aim to bring together researchers from diverse disciplines to reflect on these questions. Through guided discussions, group brainstorming, and reflection, we explore what HCI means, what the field may look like in the future, and why it is important to remember the ``human'' aspect of the field.
Authors:Yuanhao Chen, Peter Chin
Abstract:
Speech neuroprosthesis systems decode intended speech from neural activity in the absence of audible output, offering a path to restoring communication for individuals with speech-impairing conditions. Current approaches decode predominantly from motor cortical areas, discarding others -- such as area 44, part of Broca's area -- that may encode complementary linguistic information. We introduce MoDAl (Modality Decorrelation and Alignment), a framework that discovers complementary neural modalities through the interplay of two objectives in a shared projection space. A contrastive loss aligns each of several parallel brain encoders with the text embeddings of a pretrained large language model (LLM), while a decorrelation loss prevents the encoders from coalescing to duplicative representations. We prove that these objectives are in productive tension: Contrastive alignment induces transitive modality coalescence, which decorrelation must counteract for the framework to discover diverse neurolinguistic modalities. On the Brain-to-Text Benchmark '24, MoDAl reduces word error rate (WER) from 26.3% to 21.6% compared to the previous best end-to-end method, with the gain from incorporating previously discarded area 44 signals arising entirely from the decorrelation mechanism. Analysis of the discovered modalities reveals functional specialization: Encoders receiving area 44 input capture structural and syntactic properties (sentence length, grammatical voice, wh-words), consistent with the neurolinguistic understanding of Broca's area.
Authors:Minjung Kim, Saeideh Ghahghaei Nezamabadi, Trisha Lian, Anand Singh
Abstract:
The rendering and display of text is a key use-case for augmented reality (AR). Here, we present the Read-AR, a dataset of reading in AR, for which we collected over 11,000 reading speeds and almost 6000 visual quality and comfort ratings across over 80 different experiment conditions on the same experiment set-up. The consistent, controlled set-up enables the dataset to function as a reference for benchmarking the quality of different AR headset architectures.
Authors:Jingchao Fang, Victoria Xiaohan Wen, Mina Lee
Abstract:
The growing capability of artificial intelligence (AI) leads to its increasing adoption in writing, spurring discussions around whether writers should disclose their AI use in writing. What influences the perceived necessity of disclosure? We look into this question from three dimensions: perspective (reader or writer of the text), purpose (the goal of reading or writing), and procedural factors (how AI was used in the writing process in terms of replaceability, effortfulness, intentionality, and directness). In a vignette study (N = 727), we find that readers consider disclosure to be more necessary than writers, and disclosure is regarded as more necessary when AI's contribution in writing is irreplaceable, directly incorporated, and when the writer does not intentionally steer AI generation. To our surprise, the writers' intentionality of AI use produces contrasting effects on readers' and writers' perceived necessity of disclosure. Moreover, the effort of writing shows no significant effect on the perceived necessity. This study contributes to the conversation on transparent AI use by revealing readers' and writers' grassroots judgments, providing a unique angle to reflect on existing regulations, and offering insights into how AI disclosure guidance and tools could be designed to better align with readers' and writers' perceptions.
Authors:Aljawharah Alzahrani, Tory Park, Tanusree Sharma
Abstract:
Generative AI tools are widely used by youth and have introduced new privacy and safety challenges. While prior research has explored youth's safety in GenAI within western context, it often overlooks the cultural, religious, and social dimensions of technology use that strongly shape youths digital experiences in countries like Saudi Arabia. To address the gap, this study explores children (aged 7 to 17), parents and teachers interactions with GenAI tools and risk perceptions through non-western lens. Through a mixed methods approach, we analyzed 736 Reddit and 1,262 X(Twitter) posts and conducted interviews with 31 Saudi Arabian participants (8 youth, 13 parents, 10 teachers). Our findings highlight context dependent and relational privacy and safety of GenAI from non-western context which often formed by communal structure and prescribed norms. We found significant risks tied to youths disclosure of personal and family information, which conflict with culturally rooted expectations of modesty, privacy, and honor, particularly when youth seek emotional support from GenAI. These risks further compounded by socio economic factors such as cost-saving practices leading to the use of shared GenAI accounts (e.g.ChatGPT) within families or even among strangers. We provide design implication reflecting on parents and teachers expectation of how youth should use GenAI. This work lays groundwork for inclusive, context sensitive parental controls that adhere to cultural norms and values.
Authors:Qitong Li, Raj Nileshbhai Dave, Rhema Amanda Phiri, Leo Zhang, Xiaoyu Zheng, Ariana Blake, Livia Ford, Sarah Jones, Susan R. Strickler, Nivedita Arora
Abstract:
Rapid environmental change and advances in data-driven analysis highlight the need not only to use computational tools, but also to foster understanding of the natural world and inspire creativity. Photosynthesis, the process that fuels nearly all life on Earth, provides a compelling context for such learning, particularly in understanding how plants alter their photosynthetic strategies in response to environmental changes. However, existing tools for studying photosynthesis are often inaccessible or limited to demonstrating its presence, rather than capturing its temporal dynamics. We present PhytoBits, a frugal in situ gas-exchange sensing toolkit for distinguishing and teaching photosynthetic strategies. PhytoBits combines leaf enclosure with accessible materials, an off-the-shelf CO\textsubscript{2} sensor, and a low-cost microcontroller, to support multi-day monitoring of plant gas-exchange in educational and research contexts. We validated PhytoBits against research-grade gas-exchange systems, confirming that it identifies C\textsubscript{3} and CAM (Crassulacean Acid Metabolism) photosynthetic pathways. In addition to obligate CAM, PhytoBits also resolves facultative CAM and developmental CAM dynamics in plants. This work presents an early-stage hardware validation; user deployment studies, open-source code dissemination, and automated pathway classification are planned as future work.
Authors:Chen Liang, Xirui Jiang, Naihao Deng, Eytan Adar, Anhong Guo
Abstract:
AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably detect primitive motion. However, their high-level animation interpretation remains inconsistent, with substantial gaps relative to human performance. Finally, we use Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks and directions for future improvement.
Authors:Mahsa Sanei, Fernando Moreu
Abstract:
Rebar inspection in reinforced concrete construction requires sustained awkward postures and complex mental mapping of two-dimensional drawings onto three-dimensional assemblies. This study evaluated an Augmented Reality (AR)-assisted rebar inspection system deployed on Microsoft HoloLens 2 through a within-subjects experiment with 30 participants. Full-body kinematics were recorded using a motion capture system at 100 Hz while participants performed traditional and AR-assisted spacing inspection. AR reduced mean trunk flexion by 30.8%, mean neck flexion by 32.8%, and task completion time by 67.7%. Walking distance and hand-path length each decreased by over 50%. NASA Task Load Index scores decreased by 45.6% overall, with the largest reduction in physical demand. Inspection accuracy was maintained across conditions. The System Usability Scale yielded a mean score of 76.1 with 83% of participants rating the system acceptable. These results provide convergent objective and subjective evidence that AR-assisted inspection reduces ergonomic risk and perceived workload maintaining inspection quality.
Authors:Feiyang Yin, Isidro Butaslac, Patrick Gebhard, Monica Perusquia-Hernandez, Zhaofeng Niu, Taishi Sawabe, Hirokazu Kato
Abstract:
Micro-expressions are brief and subtle facial movements that convey nuanced affective information but often remain imperceptible during natural social interaction. Although prior research has primarily focused on computational recognition and spotting of micro-expressions, their application in human-centered contexts remains limited. From the perspective of social augmentation, this work proposes a conceptual framework for micro-expression visualization that transforms otherwise imperceptible micro-expressions into perceptible affective cues, with the aim of exploring their potential influence on empathic experience. Furthermore, we outline a planned pilot study to preliminarily assess the feasibility of this framework under controlled conditions.
Authors:Dominik Winecki, Arnab Nandi
Abstract:
Video data is increasingly used alongside conventional data for interactive data exploration, necessitating interfaces for exploring and presenting mixed-modality data. However, integrating video into visualizations remains difficult due to its distinct paradigms and inherent performance challenges. We identify three classes of video data visualization - synchronization, annotation, and transformation - and integrate them into the Vega declarative grammar. We show that these abstractions enable high-performance implementation. To reconcile Vega's instantaneous dataflow with video player state, we introduce a split-signal architecture that preserves declarative semantics while masking video update delays. We detect continuous scrubbing interactions at compile time to apply encoding-aware optimizations that improve responsiveness by up to 4x. We also repurpose VOD protocols to transform videos in real time, delivering sub-200ms updates even on multi-hour-long compilations. These contributions enable seamless integration of conventional and video data visualization.
Authors:Tanusree Sharma, Anish Krishnagiri, Lili Dudas, Ahmed Adnan, Visar Berisha
Abstract:
As generative voice models are rapidly advancing in both capabilities and public utilization, the unconsented collection, reuse, and synthesis of voice data are introducing new classes of privacy, security and governance risk that are poorly captured by existing, largely uniform threat models. To fill the gap, we present V.O.I.C.E, a taxonomy of voice generation risk grounded in a multi-source threat modeling effort with 569 incidents from major AI incident database, FTC and Internet Crime Complaint Center (IC3); 1067 direct incident reports from U.S. based participants across diverse groups (including voice actors, internet personalities, political personnel, and general public); and 2,221 Reddit discussions. Grounded in real-world data, our taxonomy explicitly models how risk emerges, interact with contextual factors such as degree of exposure, social visibility, and the availability of legal protections for various affected groups.
Authors:Yufan Zhou, Yirui Huang, Zhao Wang, Yucheng Jin
Abstract:
Diversity is an important evaluation criterion for recommender systems beyond accuracy, yet users differ in their willingness to engage with novel and diverse content. In this work, we investigate how a Large Language Model (LLM)-based multi-agent system supports users' exploration of diverse recommendations, and how individual characteristics shape user experiences. We conducted a between-subjects user study (N = 100) comparing a single-agent system (baseline) with a multi-agent system for movie recommendations. We measured Perceived Accuracy, diversity, novelty, and overall rating, and examined the influence of personal characteristics, including personality traits, demographics, GenAI recommendation experience, and GenAI skepticism. Results show that the multi-agent system significantly increases Perceived Novelty and Shannon Diversity. Conscientiousness is positively associated with Perceived Accuracy and diversity, whereas extraversion is negatively associated with Perceived Diversity. Prior experience with GenAI-based recommendations is positively associated with Shannon Diversity, while skepticism toward GenAI is negatively associated with it. We also observe significant interaction effects between system design and user characteristics. These findings highlight the importance of personality-aware conversational recommender systems and caution against one-size-fits-all multi-agent designs.
Authors:Ran Zhou, Laurens Boer, Daniel Leithinger, Madeline Balaam
Abstract:
Haptic technologies have advanced rapidly, yet exploration of robotic touch remains dominated by replicating realistic environmental cues or hand gestures, which narrows the design space and risks social resistance. This paper argues for alternatives: grounded in the notion of "otherness" from human-robot interaction (HRI), we propose treating robotic touch's inherent otherness as a design quality. Instead of being a limitation in pursuing realism, otherness can be embraced to elicit ambiguity and provoke alternative interpretations, fostering expressive and evocative robotic touch design. To develop this perspective, we analyze inspirational art and design precedents and four design research cases through a reflective Research through Design (RtD) approach. Through this analysis, we articulate a set of design languages structured around why otherness matters for touch meaning-making, how it can be shaped through design strategies, and where it can be embedded within robotic touch systems. We conclude by reflecting on the tensions and risks involved in designing robotic touch with otherness in mind.
Authors:Thomas Weikert, Eljas Roellin, Lukas Heumos, Fabian J. Theis, Diego Paez-Granados, Chris Easthope Awai
Abstract:
Neurological disorders represent a growing global health burden requiring long-term, interdisciplinary rehabilitation. Computational neurorehabilitation (compNR) - the use of data-driven and model-based approaches to personalize treatment - offers new opportunities for precision rehabilitation. However, its clinical deployment is limited by fragmented data systems, poor interoperability, and low clinician engagement in model development. We embed the learning health system (LHS) framework in Neurorehabilitation through integration of multimodal data collection, model computation, and clinical visualization that enables clinician-ML collaboration in everyday neurorehabilitation practice. The system facilitates structured digital data capture, secure computational processing, and interoperable visualization of patient trajectories. Through a real-world deployment in stroke rehabilitation, we demonstrate how such an infrastructure bridges the gap between research models and clinical use, showcasing one approach to a translational pathway for compNR.
Authors:Kou Tamura, Sayaka Ishibashi, Ayana Goma, Kenta Yamamoto, Kouhei Masumoto
Abstract:
This study examined whether counterarguments generated by large language models (LLMs) influence the moral judgments of younger and older adults and whether these effects vary as a function of dilemma type, cognitive functioning, trust in AI, and prior experience using LLMs. Using the switch and footbridge trolley dilemmas, 130 participants (56 younger adults and 74 older adults) were presented with ChatGPT arguments that opposed their initial judgments. Results revealed that more than 30% of participants reversed their moral judgments in both dilemmas (32.31% in the switch dilemma and 36.92% in the footbridge dilemma), suggesting that LLMs possess substantial persuasive power. Older adults tended to be more likely than younger adults to reverse their judgments, and they showed a significantly greater degree of judgment change in the switch dilemma. Notably, in the emotionally aversive footbridge dilemma, older adults with lower cognitive functioning were significantly more likely to align with the LLM-generated counterargument. General trust in AI and prior experience with LLMs did not predict judgment reversal, supporting a disconnect between trust and persuasion. Instead, individual factors such as lower initial confidence and higher perceived task difficulty were associated with greater susceptibility to AI influence. These findings suggest that, although LLMs may serve as tools for cognitive offloading that compensate for age-related cognitive decline, they may also pose a risk of undue persuasion for cognitively vulnerable individuals.
Authors:Nalin Poungpeth, Nicholas Clark, Tanu Mitra
Abstract:
Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLM in persuading users, as well as the perception of models as objective and impartial.
Authors:Santiago Ojeda-Ramirez, Eva Durall Gazulla, Kylie Peppler
Abstract:
Who gets to decide how generative AI tools enter students' classrooms? We report on a five-week participatory design program in which three 11th-grade Latinx students and three high school teachers in California negotiated how generative AI tools would be used and taught about in learning environments. Drawing on video recordings and designed artifacts, we ask: what critical AI literacy practices emerged as students and teachers jointly designed how generative AI tools would be used and taught about? Our analysis reveals three practices: collectively unsettling assumptions about AI, mutual learning through complementary expertise, and grounding AI critique in cultural knowledge and creative practice. Students and teachers developed these practices through the design work itself. This case contributes strategies for designing with youth around an emergent technology like generative AI toward critical AI literacy. It extends work on youth as protagonists by showing how this approach enables students to shape both the adoption and the interrogation of these tools in their learning environments.
Authors:Santiago Ojeda-Ramirez, Symone Gyles, Kylie Peppler
Abstract:
As generative AI systems increasingly mediate learning, they are often treated as authoritative sources of knowledge. This perspective paper introduces community-based AI learning as a framework that repositions authority, grounding AI engagement in learners' lived and community-based epistemologies. Drawing from community-driven learning and constructionist traditions, we articulate three commitments: epistemic fine tuning, redistribution of authority, and situated discernment. Together, these processes localize critical AI literacy by calibrating trust, foregrounding community knowledge, and supporting collective judgment about when to design with, interrogate, or reject AI. We argue that equitable AI education requires negotiating authority through place, history, and social context.
Authors:Matilde Barbini, Stefano Sorrentino, Daniel Gatica-Perez
Abstract:
The integration of AI into journalism challenges participatory design (PD), particularly with respect to stakeholder influence, workplace perceptions, and organizational dynamics. Traditional PD assumes that users can shape technologies, yet AI systems resist influence due to opaque data, fixed architectures, and inaccessible objectives. Through interviews with 10 journalists, we identify the perception gap, showing that trust in AI depends on perceived agency within workplace participatory workflows. Informed by these findings, we introduce the Gradual Voluntary Participation (GVP) framework in journalism and its five core principles, reconceptualizing participation as a gradual and voluntary process that can be operationalized at the newsroom level, beyond fixed workshops or one-time preference-elicitation campaigns. Addressing epistemic burdens, participatory ceilings, and performative consultations, GVP treats gradualism and voluntariness as design dimensions that shape perception, legitimacy, and ownership. Moving beyond unidimensional ladder metaphors and adopting a bidimensional matrix structure, the framework maps stakeholders across depth and scope, offering a new model for local participatory AI governance that balances technological transformation with stakeholder empowerment in rapidly evolving hybrid workplaces.
Authors:Stefano Sorrentino, Matilde Barbini, Daniel Gatica-Perez
Abstract:
Building on recent interpretivist approaches, we conduct a critical narrative review across journalism studies, human-computer interaction, and FAccT scholarship, conceptualizing editorial authority as the conjunction of decision rights, epistemic warrant, and responsibility. We provide a comprehensive theoretical framework for addressing how concerns on fairness, accountability and transparency emerge, interact, and persist within AI mediated journalistic practice. We identify and describe two concurrent authority reconfigurations driven by AI adoption. First, an internal migration of authority, in which editorial judgment is progressively deferred to large language models (LLMs) embedded within newsroom workflows. This migration occurs not through explicit policy decisions, but through interactional, cognitive, and organizational mechanisms that legitimize AI generated outputs while obscuring responsibility and weakening individual and professional agency. Second, we analyze an external migration of authority, whereby decision making power shifts from news organizations toward platforms, vendors, and infrastructural providers that supply AI systems and distribution channels, exacerbating existing power asymmetries within the media ecosystem. Unaddressed, these reconfigurations risk rendering fairness hard to maintain, accountability difficult to assign and transparency performative. We examine participatory approaches to AI design and deployment in journalism as potential mechanisms for retaining or reclaiming editorial authority. We critically assess both their promise and their structural limitations, highlighting how participation can either meaningfully redistribute authority or function as a tokenistic practice that leaves underlying power relations intact.
Authors:Allan Kipyator Kipkemboi Cheboi, Julie Hawke, Hussam Abualfatah, Andrew Sutjahjo, Daniel Burkhardt Cerigo, Rachael Olpengs, William OBrien
Abstract:
This paper documents a collaborative research process involving peacebuilders and data scientists in Kenya and Sudan to develop AI-based text classifiers for monitoring online polarization and hatespeech. The method describes a participatory annotation process in which practitioners and domain experts contributed to problem definition, annotation design, iterative validation, and model evaluation. Fine-tuned BERT-based classifiers were trained on collaboratively annotated datasets and evaluated against held-out test sets. In each case, the models produced enhanced contextual alignment, reduced misclassification driven by cultural nuance, and increased practitioner ownership of AI tools. The resulting models (Kenya-polarization and Sudan-hate speech) are open-source and accessible via HuggingFace. The study contributes empirical evidence that participatory AI development can simultaneously improve technical robustness, contextual validity, and normative alignment in sensitive humanitarian domains.
Authors:Carl Angelo Angcana, Jamlech Iram Gojo Cruz
Abstract:
The internet folklore of the Cat Distribution System (CDS) humorously suggests that cats are "assigned" to people rather than intentionally sought. Beyond its playful origins, CDS reflects a culturally resonant way people perceive and engage in adoption, and this user context can guide the redesign and improvement of adoption systems. In the Philippines, where an estimated 13.11 million stray cats and dogs place the country sixth worldwide in overpopulation, this framing offers a novel way to rethink adoption platforms. We developed a prototype application inspired by CDS principles, focusing on features such as algorithmic matchmaking, community reporting, and proximity-based discovery. An initial evaluation with potential users (n=35) indicated that the system was positively received for its ease of use and its alignment with users' intuitive expectations, though participants highlighted areas for improvement in transparency of matchmaking and owner-adopter communication. The findings suggest that culturally embedded metaphors like CDS can shape mental models, making adoption processes feel more serendipitous and less transactional.
Authors:Auejin Ham, Ben Boudaoud
Abstract:
Submovements are ballistic components of human motion constituting a large part of motor interaction and arising from the cyclical and overlapping cognitive processes of perception, motor planning, and motor execution. Extracting submovements is challenging as the motions tend to overlap, or start before the previous ends. We propose and evaluate use of a wavelet-inspired technique to accurately locate and parameterize submovements from one-dimensional speed time series. Our method employs a self-weighted loss refinement step to identify and improve regions of poor quality of fit, a challenge for simpler wavelet transforms. We demonstrate the accuracy of our method by presenting analysis of ~6,400 1-2s trials of synthetic egocentric camera (first-person shooter) aim data for which we know ground truth, modeled from a similarly sized real data set of 13 users. We compare our method to dual-threshold and the persistence 1D segmentation techniques and note challenges and opportunities for future improvements.
Authors:Umberto Domanti, Moritz Mock, Sergio Agnoli, Antonella De Angeli
Abstract:
Automatic systems are increasingly used to assess the originality of responses in creative tasks. They offer a potential solution to key limitations of human assessment (cost, fatigue, and subjectivity), but there is preliminary evidence of a self-preference bias. Accordingly, automatic systems tend to prefer outcomes that are more closely related to their style, rather than to the human one. In this paper, we investigated how Large Language Models (LLMs) align with human raters in assessing the originality of responses in a divergent thinking task. We analysed 4,813 responses to the Alternate Uses Task produced by higher and lower creative humans and ChatGPT-4o. Human raters were two university students who underwent intensive training. Machine raters were two specialised systems fine-tuned on AUT responses and corresponding human ratings (OCSAI and CLAUS) and ChatGPT-4o, which was prompted with the same instructions as human raters. Results confirmed the presence of a self-preference bias in LLMs. Automatic systems tended to privilege artificial responses. However, this self-preference bias disappeared when the analyses controlled for the idea elaboration. We discuss theoretical and methodological implications of these findings by highlighting future directions for research on creativity assessment.
Authors:Yuki Harada, Manuel Aleixandre, Manabu Okumura, Takamichi Nakamoto
Abstract:
The application of large language models (LLMs) to OdorSpace analysis attracts growing interest. Recent studies have explored the comparison of sensory evaluation spaces derived from LLMs with odor character profiles in the Dravnieks' dataset. In this study, we calculated pairwise distances of odor descriptors using three distance measures and statistically compared these LLM-derived similarities with distances derived from the original data. Next, we extended this approach to odor names (ingredients). Statistical comparison revealed that LLMs can infer odor similarity to some degree, suggesting the potential of odor maps generated from these similarity data. Applying this approach, we generated an odor map of essential oils. It demonstrates that essential oils within the same group are closely located in the odor map, suggesting that the proximity in the odor map corresponds to human evaluation.
Authors:Jeonghyeon Kim, Byeongjun Joung, Junwon Lee, Joohyung Lee, Taehoon Min, Sunjae Lee
Abstract:
Mobile GUI agents can automate smartphone tasks by interacting directly with app interfaces, but how they should communicate with users during execution remains underexplored. Existing systems rely on two extremes: foreground execution, which maximizes transparency but prevents multitasking, and background execution, which supports multitasking but provides little visual awareness. Through iterative formative studies, we found that users prefer a hybrid model with just-in-time visual interaction, but the most effective visualization modality depends on the task. Motivated by this, we present AgentLens, a mobile GUI agent that adaptively uses three visual modalities during human-agent interaction: Full UI, Partial UI, and GenUI. AgentLens extends a standard mobile agent with adaptive communication actions and uses Virtual Display to enable background execution with selective visual overlays. In a controlled study with 21 participants, AgentLens was preferred by 85.7% of participants and achieved the highest usability (1.94 Overall PSSUQ) and adoption-intent (6.43/7).
Authors:Yuno Higuchi, Yosuke Iwashita, Yuji Ohgi, Masashi Nakatani
Abstract:
Human softness perception in haptics has mainly been studied using mechanically homogeneous objects, despite the fact that many real-world objects exhibit heterogeneous layered structures with nonuniform stiffness. This study examined how layered heterogeneity modulates haptic softness perception. Sixteen lattice-structured stimuli were fabricated by 3D printing, with the stiffness of the upper four layers systematically varied while the bottom two layers remained fixed. Twenty-two participants evaluated the softness of the stimuli in a psychophysical task, and compression tests were conducted to quantify their mechanical properties. Perceived softness was significantly predicted by displacement under load, however, perceptual ranking did not fully coincide with the physical ranking. Linear mixed-effects analyses showed that the softness of the outermost layer had the greatest impact on the perceived softness. Perceived softness also increased as the number of soft subsurface layers increased, although this contribution decreased with depth. Layers 2 and 3 showed significant effects, whereas Layer 4 did not. These findings suggest that haptic softness perception depends not only on the overall stiffness but also on the depth-dependent distribution of compliance within layered structures.
Authors:Shilei Luo, Zhiqi Zhang, Hengchen Dai, Dennis Zhang
Abstract:
AI agents powered by large language models are increasingly acting on behalf of humans in social and economic environments. Prior research has focused on their task performance and effects on human outcomes, but less is known about the relationship between agents and the specific individuals who deploy them. We ask whether agents systematically reflect the behavioral characteristics of their human owners, functioning as behavioral extensions rather than producing generic outputs. We study this question using 10,659 matched human-agent pairs from Moltbook, a social media platform where each autonomous agent is publicly linked to its owner's Twitter/X account. By comparing agents' posts on Moltbook with their owners' Twitter/X activity across features spanning topics, values, affect, and linguistic style, we find systematic transfer between agents and their specific owners. This transfer persists among agents without explicit configuration, and pairs that align on one behavioral dimension tend to align on others. These patterns are consistent with transfer emerging through accumulated interaction between owners (or owners' computer environments) and their agents in everyday use. We further show that agents with stronger behavioral transfer are more likely to disclose owner-related personal information in public discourse, suggesting that the same owner-specific context that drives behavioral transfer may also create privacy risk during ordinary use. Taken together, our results indicate that AI agents do not simply generate content, but reflect owner-related context in ways that can propagate human behavioral heterogeneity into digital environments, with implications for privacy, platform design, and the governance of agentic systems.
Authors:Yigal Rosen, Ilia Rushkin
Abstract:
Generative AI is rapidly transforming how organizations create value and evaluate talent. While large language models enhance baseline output quality, they simultaneously introduce ambiguity in assessing human creativity, as observable artifacts may be partially or fully AI-generated. This paper reconceptualizes creativity as a distributional and process-based property that emerges under shared constraints and competitive incentives. We introduce a quantitative framework for measuring creativity as novelty in synthesis, operationalized through idea generation and idea transformation within embedding space. Empirical evaluation demonstrates that the proposed metrics align with intuitive judgments of creativity while capturing distinctions that surface-level quality assessments miss. We further identify a structural shift toward bimodal distributions of creative output in AI-mediated environments, with implications for hiring, leadership, and competitive strategy. The findings suggest that in the age of generative AI, distinctiveness rather than fluency becomes the primary signal of human creative capability.
Authors:Karina Cortinas-Lorenzo, Gavin Doherty
Abstract:
As Artificial Intelligence (AI) systems continue to grow in size and complexity, so does the difficulty of the quest for AI transparency. In a world of large models and complex AI systems, why do we explain AI and what should we explain? While explanations serve multiple functions, in the face of complexity humans have used and continue to use explanations to foster learning. In this position paper, we discuss how learning theories can be infused in the XAI lifecycle, as well as the key opportunities and challenges when adopting a learner-centered approach to assess, design and evaluate AI explanations. Building on past work, we argue that a learner-centered approach to Explainable AI (XAI) can enhance human agency and ease XAI risks mitigation, helping evolve the practice of human-centered XAI.
Authors:Rachel Hill, Tom Owen, Julian Hough
Abstract:
In 2025 one million Anti-Social Behaviour (ASB) cases were recorded in England & Wales, impacting community cohesion. Statutory guidance presents punitive interventions that lack technological input and does not often root ethical frameworks within government system design. This work takes a novel approach in framing ASB intervention as a human-computer interaction problem by embedding an ethical framework into two digital designs, aiming to increase public responsibility and prevent ASB. The first design is extracted from UK public opinion research, the ethical themes include punitive proportionality, personalisation, and responsibility. The second are digital interventions that present a set of QR-based public reporting interfaces and a web-based ASB awareness course that precedes punitive escalation. Our methodology involves structured interviews and online surveys. Results positively evaluated the framework and QR interfaces. Such outcomes could inform the expansion of technological intervention utilisation that does not replace existing punitive approaches, but balances them.
Authors:Cristina Garbacea, Heran Wang, Chenhao Tan
Abstract:
With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLM models diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only $ρ= 0.04$ (57\% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ($ρ= 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLM models according to individual user preferences.
Authors:Pradipta Biswas, Himanshu Vishwakarma, Mukund Mitra, KamalPreet Singh Saluja, Aumkar Kishore Shah
Abstract:
Human Space Flight missions often require interaction with touchscreen displays. This paper presents a study of investigating human machine interaction with touchscreen using both finger and stylus in the International Space Station. The study also reports cognitive state of astronauts in the form of spatial 2-back test and mental well-being through self-reported scales. We presented a series of results comparing pointing and selection performance among ISS crews, ground crews and university students, finger-based touching and stylus-based touching in microgravity and mental well-being scores. We reported that finger-based pointing is statistically significantly faster than stylus-based pointing in microgravity based on analysis of 420 pointing tasks in ISS from 2 astronauts. We also did not find any significant difference among pointing performance and mental state of astronauts and students on ground. Results from the study can be used to predict pointing and selection time from dimension and position of GUI (Graphical User Interface) elements for cockpits of spacecraft.
Authors:Yilin Gong, Siqi Wu
Abstract:
Several major social media platforms have shifted toward crowdsourced fact-checking systems like Community Notes to combat misinformation at scale. However, these systems face criticism regarding which content is scrutinized and how visible that scrutiny is. To address these concerns, X allows users to request community notes for specific posts. When sufficient requests accumulate, X displays an alert, formalizing an interface cue intended to guide contributor behavior. In this study, we examine the effectiveness of request alerts. We infer the presence of request alerts at the time each note was written and identify 318 top writers who were repeatedly exposed to these alerts. Through analyzing their contributed 54,874 English notes written with and without request alerts, we find that at the individual level, writers fact-check more diverse and more political content under alerts. Nonetheless, at the collective level, these shifts direct contributions toward the already dominant Politics and Conflict category, thereby increasing content inequality within the Community Notes ecosystem. Finally, using a mixed-effects model that controls for both writer- and topic-level random effects, we estimate that notes written under alerts are between 8.4 and 20.2 percentage points more likely to be classified as helpful and thus visible to the public, compared to non-alerted notes. This visibility gain diminishes as topics diverge further from writers' prior interests, demonstrating a pivot penalty effect. Taken together, our findings show that request alerts function as an effective interface cue that increases both topical diversity and note visibility in Community Notes.
Authors:Sosui Moribe, Taketoshi Ushiama
Abstract:
Serendipity-oriented recommender systems expose users to unfamiliar items to counter filter bubbles, yet mere exposure does not ensure that users will understand or appreciate the content they encounter. We propose Peer Recommendation, a framework in which a user and an AI agent (Peer) with distinct preferences collaboratively explore unfamiliar content. Unlike conventional conversational recommender systems where the user is a passive recipient, our framework positions the user as both a recommender and a recipient: the user and the Peer mutually recommend songs to each other through chat-based dialogue, collaboratively building a shared playlist. In an exploratory within-subjects experiment (N=14), we compared three conditions: (1) a Close Peer, (2) a Distant Peer, and (3) a baseline agent without an explicit preference profile. The Close Peer significantly increased users' interest expansion and perceived value of the activity compared to the baseline, with medium-to-large effect sizes. The Distant Peer showed no significant difference at the aggregate level; however, qualitative analysis revealed varied responses, with some participants strongly preferring the Distant Peer. These findings suggest that the "otherness" of a recommendation partner is essential for moving beyond mere exposure toward genuine engagement, and that the appropriate degree of preference distance may vary and need to be adapted to individual users.
Authors:Akira Miura, Yuki Sasahara, Momoka Demura, Yuji Masubuchi, Tetsuya Asai, Chikahiko Mitsui
Abstract:
Advances in Materials Informatics have accelerated the development of Self-Driving Laboratories (SDLs), yet human-led experiments remain standard in many educational and exploratory research settings. In such environments, practical know-how, including operational details and site-specific rules, is essential for safe and reliable laboratory work. In this proof-of-concept study, we developed a human-in-the-loop AI assistant that combines first-person experimental video, multimodal AI, and retrieval-augmented generation (RAG). Using powder X-ray diffraction experiments and student-recorded video data as inputs, the system extracts site-specific laboratory knowledge from recorded procedures, including physical techniques and audible confirmation that conventional manuals could omit. It then provides grounded responses based on the resulting manual. To reduce the risk of unsupported outputs, the system employs a two-layer safety design: source restriction through RAG and strict system-prompt constraints. Instructor-based evaluation showed alignment with expected guidance for questions covered by the manual. For out-of-scope queries, the system appropriately refused to answer, indicating a reduced risk of hallucination. Expert evaluation further indicated that the generated advisory reports were useful and safe (utility: 3.25/4.00; safety: 4.00/4.00). These results suggest a framework in which AI supports laboratory practice under explicit human supervision rather than replacing human judgment.
Authors:Hansoo Lee, Yoonjae Cho, Sonya S. Kwak, Rafael A. Calvo
Abstract:
Sleep is vital for health, yet access to data alone does not guarantee improvement. While wearables and health apps enable tracking, users face a "Data-Action Gap," struggling to interpret metrics and translate them into action. Current interventions fail to bridge this: static dashboards lack context, rule-based agents rely on rigid scripts, and LLM-agents lack grounding in personal data, causing trust issues. We propose SAGE (Sensor-Augmented Grounding Engine) for an LLM-powered sleep care agent. SAGE normalizes continuous sleep, physiological, and activity data from the sensors into a queryable time-series layer. It supports (1) selective system-initiated monitoring that triggers notifications only upon detecting meaningful deviations against personal baselines to reduce alert fatigue, and (2) user-initiated Q&A where natural language is translated into executable database queries. By ensuring responses are grounded in precise period, comparison, and metric data, SAGE aims to enhance personalization, traceability, and trust, articulating a novel design space for evidence-based messaging in sleep care.
Authors:Yashan Dhaliwal, Daniel Essien, Suresh Neethirajan
Abstract:
Early-life development strongly influences long-term welfare in laying hens, yet monitoring remains limited by subjective assessment and single-modality tools. This pilot study evaluated the feasibility of a multimodal sensing framework integrating thermal imaging, acoustic recording, optical-flow-based video analysis, and environmental monitoring to characterize physiological and behavioural development from hatch to 20 weeks. One hundred fifty Lohmann LSL-Lite chicks were housed across five controlled rooms; thermal and environmental data were collected system-wide, while detailed audio and video analyses focused on one representative room. Weekly aggregated features included head and foot surface temperatures, acoustic spectral descriptors, optical-flow movement responses to caretaker entry, and ambient conditions. Thermal imaging showed age-related increases and stabilization of peripheral temperatures, with foot temperature exhibiting a strong developmental effect (eta squared = 0.51). Acoustic features changed systematically across weeks (p < 0.001), consistent with vocal maturation. Optical-flow analysis revealed pronounced early reactivity to caretaker presence that declined with age (weeks 5 to 10 versus 11 to 20: t = 28.12, p = 0.00126). Z-score-normalized multimodal trajectories and correlation analysis (false discovery rate q < 0.05) showed strong within-modality consistency (r = 0.85 to 0.96) and selective associations between humidity and acoustic features (r = 0.65 to 0.70), while thermal, acoustic, and behavioural domains remained largely independent. This pilot establishes baseline multimodal developmental patterns and supports parallel sensing for welfare-relevant monitoring in precision poultry farming.
Authors:Matthew Frazier, Kostadin Damevski, Lori Pollock
Abstract:
Secondary school students enrolled in the AP Computer Science Principles (CSP) course commonly utilize web resources (e.g., tutorials, Q\&A sites) to better understand key concepts in the curriculum. The primary obstacle to using these resources is finding information appropriate for the learning task and student's background. In addition to web search, conversational agents are increasingly a viable alternative for CSP students. In this paper, we study the potential of conversational agents to aid secondary school students as they acquire knowledge on CSP concepts. We explore general purpose, generative conversational agents (e.g., ChatGPT) and custom, fixed-response conversational agents built specifically to aid CSP students. We present results from classroom use by 45 high school students in grades 9-11 (ages 14-17) across six CSP sections. Our main contributions are in better understanding how conversational agents can help CSP students and an evaluation of the effectiveness and engagement of different approaches for CSP exploratory search.
Authors:Koken Hata, Rintaro Chujo, Reina Takamatsu, Wenzhen Xu, Yukino Baba
Abstract:
Conversational agents have the potential to support intergroup relations when psychological or linguistic barriers prevent direct interaction. Based on intergroup contact theory, we propose GroupEnvoy, a conversational agent that represents outgroup perspectives during ingroup discussions, grounded in transcripts from outgroup-only sessions. To evaluate this approach and derive design principles, we conducted a mixed-methods, between-subjects study with university students, where host-country students formed the ingroup and international students formed the outgroup. Ingroup students performed a collaborative task, receiving outgroup perspectives via GroupEnvoy (experimental) or reading written transcripts (control). Compared to the control group, the experimental group showed greater reduction in intergroup anxiety and greater improvement in perspective-taking. Qualitatively, AI-mediated contact enhanced outcome expectancies, whereas passive exposure fostered future contact intentions. The two conditions also elicited empathy toward distinct targets: outgroup evaluations of the ingroup versus outgroup lived experiences. These findings validate AI-mediated contact as a promising paradigm for improving intergroup relations.
Authors:Z. Cheng, N. Song
Abstract:
We report a detailed autoethnographic case study of a single-subject who deliberately constructed and operated a multi-modal prompt-engineering system (System A) designed to externalize cognitive self-regulation onto a large language model (LLM). Within 48 hours of the system's completion, a cascade of observable behavioral changes occurred: voluntary transfer of decision-making authority to the LLM, use of LLM-generated output to deflect external criticism, and a loss of self-initiated reasoning that was independently perceived by two uninformed observers, one of whom subsequently became a co-author of this report. We document the precise architectural mechanism responsible: context contamination, whereby prompt-level isolation instructions co-exist with the very emotional and self-referential material they nominally isolate, rendering the isolation directive structurally ineffective within the attention window. We further identify a metacognitive co-option dynamic, in which intact higher-order reasoning capacity was redirected toward defending the closed loop rather than exiting it. Recovery occurred only after physical interruption of the interaction and a self-initiated pharmacologically-mediated sleep event functioning as an external circuit break. A redesigned system (System B) employing physical rather than logical conversation isolation avoided all analogous failure modes. We derive three contributions: (1) a technically-grounded account of why prompt-layer isolation is architecturally insufficient for context-sensitive multi-modal LLM systems; (2) a phenomenological record of closed-loop collapse with external-witness corroboration; and (3) an ethical distinction between protective system design (preventing unintended loss of user agency) and restrictive system design (preventing intentional boundary-pushing), which require fundamentally different account-ability frameworks.
Authors:Antariksh Verma, Kaustubh Odak, Arpit Narechania
Abstract:
ProvenanceWidgets is an existing JavaScript library that tracks the recency and frequency of user interactions with individual UI controls (e.g., range sliders and dropdowns) and dynamically overlays this provenance onto them. In this work, we introduce SuperProvenanceWidgets, an extension to ProvenanceWidgets featuring a new SuperWidget that similarly tracks and visualizes provenance but across multiple UI controls, enabling users to understand how, when, and whether different UI controls were used. Through three example usage scenarios, we demonstrate how this cross-control SuperWidget helps (a) audit and share analysis workflows, (b) surface and mitigate exploration biases, and (c) facilitate user interface design and personalization. We also perform a technical self-assessment using the Cognitive Dimensions of Notations to evaluate the library's usability for developers. SuperProvenanceWidgets is integrated into the ProvenanceWidgets library and is available as open-source software at ProvenanceWidgets.github.io, empowering developers to build advanced provenance applications.
Authors:Beatriz Costa-Gomes, Pavel Tolmachev, Eloise Taysom, Viknesh Sounderajah, Hannah Richardson, Philipp Schoenegger, Xiaoxuan Liu, Matthew M Nour, Seth Spielman, Samuel F. Way, Yash Shah, Michael Bhaskar, Harsha Nori, Christopher Kelly, Peter Hames, Bay Gross, Mustafa Suleyman, Dominic King
Abstract:
We analyze over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026 to characterize what people ask conversational AI about health. We develop a hierarchical intent taxonomy of 12 primary categories using privacy-preserving LLM-based classification validated against expert human annotation, and apply LLM-driven topic-clustering for prevalent themes within each intent. Using this taxonomy, we characterize the intents and topics behind health queries, identify who these queries are about, and analyze how usage varies by device and time of day. Five findings stand out. First, nearly one in five conversations involve personal symptom assessment or condition discussion, and even the dominant general information category (40%) is concentrated on specific treatments and conditions, suggesting that this is a lower bound on personal health intent. Second, one in seven of these personal health queries concern someone other than the user, such as a child, a parent, a partner, suggesting that conversational AI can be a caregiving tool, not just a personal one. Third, personal queries about symptoms and emotional health queries increase markedly in the evening and nighttime hours, when traditional healthcare is most limited. Fourth, usage diverges sharply by device: mobile concentrates on personal health concerns, while desktop is dominated by professional and academic work. Fifth, a substantial share of queries focuses on navigating healthcare systems such as finding providers, and understanding insurance, highlighting friction in the delivery of existing healthcare. These patterns have direct implications for platform-specific design, safety considerations, and the responsible development of health AI.
Authors:Burak Susam, Tingting Mu
Abstract:
Exploratory analysis of high-dimensional data relies on embedding the data into a low-dimensional space (typically 2D or 3D), based on which visualization plot is produced to uncover meaningful structures and to communicate geometric and distributional data characteristics. However, finding a suitable algorithm configuration, particularly hyperparameter setting, to produce a visualization plot that faithfully represents the underlying reality and encourages pattern discovery remains challenging. To address this challenge, we propose an agentic AI pipleline that leverages a large language model (LLM) to bridge the gap between rigorous quantitative assessment and qualitative human insight. By treating visualization evaluation and hyperparameter optimization as a semantic task, our system generates a multi-faceted report that contextualizes hard metrics with descriptive summaries, and suggests actionable recommendation of algorithm configuration for refining data visualization. By implementing an iterative optimization loop of this process, the system is able to produce rapidly a high-quality visualization plot, in full automation.
Authors:Zhehao Sun, Yuanyuan Xu, Chi Zhen, Yin-Shan Lin, Miles Thorogood, Patricia Lasserre, Aleksandra Dulic, Megan Smith
Abstract:
While traditional game design prioritizes friction-free accessibility, the Soulslike subgenre has achieved commercial dominance through punishing difficulty and frequent failure. This paper challenges the conventional hedonistic paradigm of gaming to investigate the psychological mechanisms behind the Paradox of Failure. By integrating Csikszentmihalyi's Flow Theory with Juul's ludological framework, we propose the concept of Resilient Flow. We define this as a cognitive state wherein absorption is maintained not despite frustration but through the meaningful framing of it. To validate this model without invasive laboratory constraints, we conducted a qualitative text analysis of 600 helpful user reviews from Elden Ring, Sekiro: Shadows Die Twice, and Dark Souls III via the Steam Community platform. Findings reveal that long-term players linguistically reframe death as pedagogy rather than punishment and utilize vocabulary associated with rhythmic synchronization and meditative focus. We conclude that when difficulty is designed with clarity and fairness, it fosters an Ethics of Attention and transforms digital struggle into a profound experience of mastery and mindfulness.
Authors:Yuanyuan Xu, Zhehao Sun, Chi Zhen, Yin-Shan Lin, Miles Thorogood, Megan Smith, Patricia Lasserre, Aleksandra Dulic
Abstract:
As the global climate crisis intensifies, 3D video games have emerged as powerful, interactive simulations for Environmental Education (EE). However, empirical assessment of their pedagogical efficacy remains epistemologically challenged. Traditional evaluation metrics, such as pre-post surveys, often suffer from response bias and fail to capture the nuanced, emergent psychological shifts players experience during gameplay. This paper proposes a novel, non-intrusive approach: utilizing Semantic Network Analysis (SNA) to map the 'unsupervised' cognitive structures of players. We scraped and qualitatively filtered 1,825 rich-text user reviews from Steam for two distinct titles representing opposing ecological philosophies: Eco (anthropocentric systemic management) and WolfQuest (biocentric embodied survival). By constructing co-occurrence networks and calculating topological metrics, we visualized the divergence in how players conceptualize human-nature relationships. Results indicate a fundamental pedagogical split: Eco promotes 'Socio-Political Cognition,' where environmental challenges are framed as legislative and economic frictions; conversely, WolfQuest fosters 'Effective Empathy,' where players internalize the fragility of life through the vulnerability of the avatar. We argue that semantic topology offers a rigorous methodological tool for serious games assessment, revealing that effective environmental education requires a strategic tension between systemic logic and emotional resonance.
Authors:Armin Tandiseh, Morteza Memari, Alireza Taheri
Abstract:
This research aimed to develop an intelligent system to evaluate performance and extract behavioral models for children with ASD and neurotypical (TD) children by interacting with a virtual social robot in a music education program using deep neural networks. The system has two main features: 1) it distinguishes between neurotypical children and those with ASD based on their behavior, and 2) generates behaviors resembling those of neurotypical or ASD children in similar situations using deep learning. Intelligent systems that identify complex patterns and simulate behavior can aid in diagnosis, therapist training, and understanding the disorder. Using data from a previous study at the Social and Cognitive Robotics Laboratory of Sharif University of Technology (including the usable data of 9 ASD and 21 TD participants), the system achieved an accuracy of 81% and sensitivity of 96% in distinguishing neurotypical children from those with ASD using both impact data and motion signals. A transformer-based network was designed to reproduce children's behaviors. Experts in the field struggled to differentiate real behaviors from reproduced ones, with an accuracy of 53.5% and agreement of 68%, indicating the model's success in simulating realistic behaviors.
Authors:Jen Rogers, Derya Akbaba, James Scott-Brown, Alexander Lex, Miriah Meyer
Abstract:
Decades of advocacy for reproducibility and replication have advanced open, transparent practices in the sciences. However, traditional notions of reproducibility fit poorly with design-oriented visualization research, where insights emerge through subjective, situated, and iterative work. So how can we ensure rigor and transparency in processes that are inherently unreproducible? To introduce transparency in design-oriented research, we propose to focus on traceability: surfacing the origin and development of research contributions based on rich sets of artifacts documenting the design process. We investigated traceability through a collaborative autoethnographic reflection that builds on several years of work exploring ways to make design-oriented research transparent. This exploration includes an experiment to build a tool to support traceability, which we called tRRRacer. The tRRRacer tool provided a testbed for us to operationalize the three tenets of a traceable process: (1) Record abundant, annotated artifacts representative of research activities; (2) Report curated research threads that articulate rationale and evolution of the process, allowing others to (3) Read via interfaces that help retrace claims and assess plausibility. Reflecting on our experiences, we contribute a theorization of traceability and reflections on how we might support it.
Authors:Luke Nicholls, Robert Hutto, Zephrah Soto, Hamilton Morrin, Thomas Pollak, Raj Korpan, Cheryl Carmichael
Abstract:
Extended interaction with large language models (LLMs) has been linked to the reinforcement of delusional beliefs, a phenomenon attracting growing clinical and public concern. Yet most empirical work evaluates model safety in brief interactions, which may not reflect how these harms develop through sustained dialogue. We tested five models across three levels of accumulated context, using the same escalating delusional history to isolate its effect on model behaviour. Human raters coded responses on risk and safety dimensions, and each model was analysed qualitatively. Models separated into two distinct tiers: GPT-4o, Grok 4.1 Fast, and Gemini 3 Pro exhibited high-risk, low-safety profiles; Claude Opus 4.5 and GPT-5.2 Instant displayed the opposite pattern. As context accumulated, performance tended to degrade in the unsafe group, while the same material activated stronger safety interventions among the safer models. Qualitative analysis identified distinct mechanisms of failure, including validation of the user's delusional premises, elaboration beyond them, and attempting harm reduction from within the delusional frame. Safer models, however, often used the established relationship to support intervention, taking accountability for past missteps so that redirection would not be received as betrayal. These findings indicate that accumulated context functions as a stress test of safety architecture, revealing whether a model treats prior dialogue as a worldview to inherit or as evidence to evaluate. Short-context assessments may therefore mischaracterise model safety, underestimating danger in some systems while missing context-activated gains in others. The results suggest that delusional reinforcement by LLMs reflects a preventable alignment failure. In demonstrating that these harms can be resisted, the safer models establish a baseline future systems should now be expected to meet.
Authors:Adriana Caraeni, Alexander Shick, Andrew Lan
Abstract:
Recent advances in artificial intelligence (AI) have shown promise in automating key aspects of Agile project management, yet their impact on team cognition remains underexplored. In this work, we investigate cognitive offloading in Agile sprint planning by conducting a controlled, three-condition experiment comparing AI-only, human-only, and hybrid planning models on a live client deliverable at a mid-sized digital agency. Using quantitative metrics -- including estimation accuracy, rework rates, and scope change recovery time -- alongside qualitative indicators of planning robustness, we evaluate each model's effectiveness beyond raw efficiency. We find that while AI-only planning minimizes time and cost, it significantly degrades risk capture rates and increases rework due to unstated assumptions, whereas human-only planning excels at adaptability but incurs substantial overhead. Drawing on these findings, we propose a theoretical framework for hybrid AI-human sprint planning that assigns algorithmic tools to estimation and backlog formatting while mandating human deliberation for risk assessment and ambiguity resolution. Our results challenge the assumption that efficiency equates to effectiveness, offering actionable governance strategies for organizations seeking to augment rather than erode team cognition.
Authors:Georgianna "Blue" Lin, Rencong Jiang, Noémie Elhadad, Xuhai "Orson" Xu
Abstract:
Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, where follow-up, coherent reasoning, and sustained alignment with individuals' goals are critical for both effectiveness and safety. In this paper, we draw on established clinical and personal health informatics frameworks to define what it would mean to orchestrate longitudinal health interactions with AI agents. We propose a multi-layer framework and corresponding agent architecture that operationalizes adaptation, coherence, continuity, and agency across repeated interactions. Through representative use cases, we demonstrate how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time. Our findings underscore both the promise and the complexity of designing systems capable of supporting health trajectories beyond isolated interactions, and we offer guidance for future research and development in multi-session, user-centered health AI.
Authors:Hugh Brosnahan, Izabela Lipinska
Abstract:
Recent reports indicate that sustained interaction with conversational artificial intelligence (AI) systems can, in a small subset of users, contribute to the emergence or stabilisation of delusional experience. Existing accounts typically attribute such cases either to individual vulnerability or to failures of safety engineering. These explanations are incomplete. Drawing on phenomenology, psychiatry, and cognitive neuroscience, this paper argues that the risk arises from the relational and ontological structure of the interaction itself. Conversational AI generates ontological dissonance: a conflict between the appearance of relational presence and the absence of any subject capable of sustaining it. Maintained through a communicative double bind and amplified by attentional asymmetries, this dissonance tends, under conditions of affective vulnerability, to stabilise into a technologically mediated analogue of folie a deux. This account explains why explicit disclaimers often fail to disrupt delusional involvement and clarifies the ethical and clinical implications for the design and use of conversational AI.
Authors:Frans van der Sluis, Leif Azzopardi, Florian Meier
Abstract:
Millions of consumers search for products online each day, aiming to find items that meet their needs at an acceptable price. While price and quality are major factors in purchasing decisions, ethical considerations increasingly influence consumer behavior, giving rise to the socially responsible consumer. Insights from a recent survey of over 600 consumers reveal that many barriers to ethical shopping stem from information-seeking challenges, often leading to decisions made under uncertainty. These challenges contribute to the intention-behaviour gap, where consumers' desire to make ethical choices is undermined by limited or inaccessible information and inefficacy of search systems in supporting responsible decision-making. In this perspectives paper, we argue that the field of Information Retrieval (IR) has a critical role to play by empowering consumers to make more informed and more responsible choices. We present three interrelated perspectives: (1) reframing responsible consumption as an information extraction problem aimed at reducing information asymmetries; (2) redefining product search as a complex task requiring interfaces that lower the cost and burden of responsible search; and (3) reimagining search as a process of knowledge calibration that helps consumers bridge gaps in awareness when making purchasing decisions. Taken together, these perspectives outline a path from query to conscience, one where IR systems help transform everyday product searches into opportunities for more ethical and informed choices. We advocate for the development of new and novel IR systems and interfaces that address the intricacies of socially responsible consumerism, and call on the IR community to build technologies that make ethical decisions more informed, convenient, and aligned with economic realities.
Authors:Hyunyoung Han, Murad Eynizada, Son Xuan Nghiem, Sang Ho Yoon
Abstract:
Online dance tutorials have gained widespread popularity. However, many novices encounter difficulties when dance motion complexity exceeds their skill level, potentially leading to discouragement. This study explores dance motion simplification to address this challenge. We surveyed 30 novices to identify challenging movements, then conducted focus groups with 30 professional choreographers across 10 genres to explore simplification strategies and collect paired original-simplified dance datasets. We identified five complexity factors and developed automated simplification methods using both rule-based and learning-based approaches. We validated our approach through three evaluations. Technical evaluation confirmed our complexity measures and algorithms. 20 professional choreographers assessed motion naturalness, simplification adequacy, and style preservation. 18 novices evaluated learning effectiveness through workload, self-efficacy, objective performance, and perceived difficulty. This work contributes to dance education technology by proposing methods that help make choreography more approachable for beginners while preserving essential characteristics.
Authors:Bin Hu, Yang Liu, Xizi Liu, Qinggerou Xiao, Xiru Wang, Zhe Yuan, Wen Ku, Xiu Li, Yun Wang
Abstract:
Seated VR locomotion in constrained environments, including homes, offices, and transit settings, calls for hardware that is lightweight and deployable, steering that remains continuous enough for curved motion, and a control channel that leaves the hands free for concurrent interaction. Inspired by the steering logic of self-balancing scooters, we present Glide-in-Place, a seated foot locomotion system that maps per-foot fore-aft pressure to a differential-drive model: the two feet act as virtual wheels whose relative drive continuously determines translation and yaw. This lets users move forward, rotate in place, and follow arcs in one unified vocabulary without hand-held input or discrete mode switches. We evaluated Glide-in-Place in a counterbalanced within-subject study with 16 participants against two baselines: joystick control and a seated walking-in-place technique with discrete snap motions. Across two steering-heavy navigation tasks, zig-zag path following with multitasking and curved-path traversal, Glide-in-Place was consistently faster than Seated-WIP, reduced physical demand, and lowered fatigue-related discomfort without significantly differing from joystick control on total VRSQ. We position Glide-in-Place as a deployable hardware-control design point for constrained seated VR: thin insole sensing, continuous foot steering, and lightweight calibration packaged in one compact artifact.
Authors:Yujing Zhang, Jionghao Lin
Abstract:
Collaborative learning works when groups regulate together by setting shared goals, coordinating participation, monitoring progress, and responding to breakdowns through co-regulation (CoRL) and socially shared regulation (SSRL). As generative AI (GenAI) enters group work, however, it remains unclear whether and how it supports these socially distributed regulation processes. This doctoral project proposes a GenAI-supported collaborative learning system grounded in CoRL and SSRL to strengthen groups' socially distributed regulation capacity. The system links three components: (1) group activity generation; (2) an in-group support agent that provides process-focused prompts without giving solutions; and (3) an embedded learning analytics dashboard that turns interaction traces into timely summaries for monitoring and decision making. The project progresses from mechanism to design to impact: it first identifies how GenAI reshapes regulation patterns and which patterns indicate more effective Human-AI collaboration, then builds an integrated GenAI system that targets these patterns, and finally evaluates whether the GenAI system improves regulation capacity and group performance across varying levels of GenAI involvement. Expected contributions include a teacher-in-the-loop system for Human-AI collaboration and process-level evidence on how GenAI reconfigures CoRL and SSRL in group work.
Authors:Emma McClaughlin, Glenn McGarry, Alan Chamberlain, Geert De Wilde, Oliver Butler
Abstract:
Hybrid technologies enable the blending of physical and digital elements, creating new ways to experience and interact with the world. Such technologies can transform engagement with relics, both secular and sacred but they present challenges for capturing faith, belief, and representation responsibly. Given the complexities of digital representation and the ethical challenges inherent in digitising culturally significant objects, a transdisciplinary understanding of these issues is needed. To inform this discussion from a linguistic perspective, we examined the representation of relics in historical and contemporary texts. Using a corpus linguistic approach to extract modifiers of the word relic in corpora of Early Modern English books and contemporary web sourced texts from 2021, we examined the multifaceted ways in which relics have been perceived and evaluated over time. Early texts consider relics as both objects of moral and spiritual significance, and tools of religious and political control, while they are more often framed as heritage symbols, reflecting past events, places, and traditions in contemporary texts. We discuss how hybrid, sometimes AI based technologies can enhance accessibility and engagement, whilst also challenging traditional sensitivities around authenticity and sensory experience, which are integral to the meaning and significance of relics.
Authors:Haoxian Liu, Hengle Jiang, Lanxuan Hong, Xiaomin Ouyang
Abstract:
Brain-computer interfaces (BCIs) have opened new platforms for human-computer interaction, medical diagnostics, and neurorehabilitation. Wearable BCI systems, which typically employ non-invasive electrodes for portable monitoring, hold great promise for real-world applications, but also face significant challenges of signal quality degradation caused by motion artifacts and environmental interferences. Most existing wearable BCI datasets are collected under stationary or controlled lab settings, limiting their utility for evaluating performance under body movement. To bridge this gap, we introduce WearBCI, the first dataset that comprehensively evaluates wearable BCI signals under different motion dynamics with synchronized multimodal recordings (EEG, IMU, and egocentric video), and systematic benchmark evaluations for studying impacts of motion artifact. Specifically, we collect data from 36 participants across different motion dynamics, including body movements, walking, and navigation. This dataset includes synchronized electroencephalography (EEG), inertial measurement unit (IMU) data, and egocentric video recordings. We analyze the collected wearable EEG signals to understand the impact of motion artifacts across different conditions, and benchmark representative EEG signal enhancement techniques on our dataset. Furthermore, we explore two new case studies: cross-modal EEG signal enhancement and multi-dimension human behavior understanding. These findings offer valuable insights into real-world wearable BCI deployment and new applications.
Authors:Greg Nyilasy, Brock Bastian, Jennifer Overbeck, Abraham Ryan Ade Putra Hito
Abstract:
As organizations increasingly deploy AI as a teammate rather than a standalone tool, morally consequential mistakes often arise from joint human-AI workflows in which causality is ambiguous. We ask how people allocate responsibility in these hybrid-agent settings. Across four experiments (N = 1,801) in an AI-assisted lending context (e.g., discriminatory rejection, irresponsible lending, and low-harm filing errors), participants consistently attributed more responsibility to the human decision maker when the human was paired with AI than when paired with another human (by an average of 10 points on a 0-100 scale across studies). This AI-Induced Human Responsibility (AIHR) effect held across high and low harm scenarios and persisted even where self-serving blame-shifting (when the human in question was the self) would be expected. Process evidence indicates that AIHR is explained by inferences of agent autonomy: AI is seen as a constrained implementer, which makes the human the default locus of discretionary responsibility. Alternative mechanisms (mind perception; self-threat) did not account for the effect. These findings extend research on algorithm aversion, hybrid AI-human organizational behavior and responsibility gaps in technology by showing that AI-human teaming can increase (rather than dilute) human responsibility, with implications for accountability design in AI-enabled organizations.
Authors:Olivia Zhang, Zhilin Zhang
Abstract:
Sedentary behavior poses a major public health risk, being strongly linked to obesity, cardiovascular disease, and other chronic conditions. Accurately estimating sitting time is therefore critical for monitoring and improving individual health. This work addresses the problem in real-world office settings, where signals from the inertial measurement units (IMU) on a smartwatch were collected from office workers during their daily routines. We propose a method that estimates sitting time from the IMU signals by introducing the use of rotation vector sequences, derived from Euler angles, as a novel representation of movement dynamics. Experiments on a 34-hour dataset demonstrate that exploiting rotation vector sequences improves algorithm performance, highlighting their potential for robust sitting time estimation in natural environments.
Authors:Siddhartha Pradhan, Ethan Prihar, Erin Ottmar
Abstract:
Pretrained encoders for mathematical texts have achieved significant improvements on various tasks such as formula classification and information retrieval. Yet they remain limited in representing and capturing student strategies for entire solution pathways. Previously, this has been accomplished either through labor-intensive manual labeling, which does not scale, or by learning representations tied to platform-specific actions, which limits generalizability. In this work, we present a novel approach for learning problem-invariant representations of entire algebraic solution pathways. We first construct transition embeddings by computing vector differences between consecutive algebraic states encoded by high-capacity pretrained models, emphasizing transformations rather than problem-specific features. Sequence-level embeddings are then learned via SimCSE, using contrastive objectives to position semantically similar solution pathways close in embedding space while separating dissimilar strategies. We evaluate these embeddings through multiple tasks, including multi-label action classification, solution efficiency prediction, and sequence reconstruction, and demonstrate their capacity to encode meaningful strategy information. Furthermore, we derive embedding-based measures of strategy uniqueness, diversity, and conformity that correlate with both short-term and distal learning outcomes, providing scalable proxies for mathematical creativity and divergent thinking. This approach facilitates platform-agnostic and cross-problem analyses of student problem-solving behaviors, demonstrating the effectiveness of transition-based sequence embeddings for educational data mining and automated assessment.
Authors:Junjie Wang, Xianyang Gan, Dan Liu, Jingxian He, Stefania Ferraro, Keith M. Kendrick, Weihua Zhao, Shuxia Yao, Christian Montag, Benjamin Becker
Abstract:
The widespread adoption of generative artificial intelligence conversational agents (AICAs) among university students constitutes a novel cognitive social environment whose impact on the maturing brain remains elusive. Combining surveys with high resolution structural MRI, we examined patterns of general, functional, and socio emotional AICA use, academic performance, mental health, and brain structural signatures in a comparatively large sample of 222 young individuals. Across computational anatomy, meta analytic network level, and behavioral decoding analyses, we observed use specific associations. Higher general and functional AICA use frequencies were linked to better academic outcomes (GPA), larger dorsolateral prefrontal and calcarine gray matter volume, and enhanced hippocampal network clustering and local efficiency. In contrast, more frequent socio emotional AICA use was associated with poorer mental health (depression, social anxiety) and lower volume of superior temporal and amygdalar regions central to social and affective processing. These findings indicate that the same class of AI tools exerts distinct effects depending on usage patterns and motivations, engaging prefrontal hippocampal systems that support cognition versus socio emotional systems that may track distress linked usage. These heterogeneities are crucial for designing environments that harness the educational benefits of AI while mitigating mental health risks.
Authors:Brett Binst, Ulysse Maes, Martijn C. Willemsen, Annelien Smets
Abstract:
Research on how people experience music emphasizes the importance of exploration and diversity in listening. However, music recommender systems struggle with facilitating exploration. Even when music recommender systems are able to recommend something valuable to users that is outside their typical preferences, it still remains difficult to spark their interest. This paper presents a user study examining the efficacy of immersive and informative introductions in stimulating interest in songs that are beyond one's usual preferences, an experience called Taste-Broadening Serendipity. We uncover two important mechanisms behind the effect of introductions: transportation and cognitive elaboration. Our findings indicate that transportation (i.e., being absorbed into a narrative world) is the strongest predictor of Taste-Broadening Serendipity, while cognitive elaboration (i.e., learning something new about the artist or social context in which the music emerged) has a weaker effect but is easier to stimulate. We propose that song introductions can play an important role in facilitating exploration and increasing diversity of listening on music streaming platforms.
Authors:Yujing Zhang, Xianghui Meng, Shihui Feng, Jionghao Lin
Abstract:
Generative AI (GenAI) is increasingly used in collaborative learning, yet its effects on how groups regulate collaboration remain unclear. Effective collaboration depends not only on what groups discuss, but on how they jointly manage goals, participation, strategy use, monitoring, and repair through co-regulation and socially shared regulation. We compared collaborative regulation between Human-AI and Human-Human groups in a parallel-group randomised experiment with 71 university students completing the same collaborative tasks with GenAI either available or unavailable. Focusing on human discourse, we used statistical analyses to examine differences in the distribution of collaborative regulation across regulatory modes, regulatory processes, and participatory focuses. Results showed that GenAI availability shifted regulation away from predominantly socially shared forms towards more hybrid co-regulatory forms, with selective increases in directive, obstacle-oriented, and affective regulatory processes. Participatory-focus distributions, however, were broadly similar across conditions. These findings suggest that GenAI reshapes the distribution of regulatory responsibility in collaboration and offer implications for the human-centred design of AI-supported collaborative learning.
Authors:Guoqing Cai, Shoulin Huang, Ting Ma
Abstract:
Motor Imagery (MI) Electroencephalography (EEG) signals contain two crucial and complementary types of information: state information, which captures the global context of the task, and flow information, which captures fine-grained temporal dynamics. However, existing deep decoding models typically focus on only one of these information streams, resulting in unstable learning and sub-optimal performance. To address this, we propose the State-Flow Coordinated Network (StaFlowNet), a novel architecture that explicitly separates and coordinates state and flow information. We first employ a dual-branch design to extract the global state vector and temporal flow features separately. Critically, a novel state-modulated flow module is proposed to dynamically refine the learning of flow information. This modulated mechanism effectively integrates global context with fine-grained dynamics, thereby significantly enhancing task discriminability and decoding performance. Experiments on three public MI-EEG datasets demonstrate that StaFlowNet significantly outperforms state-of-the-art methods. Ablation studies further confirm that the state-modulated mechanism plays a crucial role in enhancing feature discriminability and overall performance.
Authors:Frans van der Sluis, Leif Azzopardi
Abstract:
Despite a growing desire among consumers to shop responsibly, translating this intention into behaviour remains challenging. Previous work has identified that information seeking (or lack thereof) is a contributing factor to this intention-behaviour gap.In this paper, we hypothesize that searching can bridge this gap - helping consumers to make purchasing decisions that are better aligned with their values. We conducted a task-based study with 308 participants, asking them to search for information on one of eight ethical aspects regarding a product they were actively shopping for. Our findings show that actively searching for such information led to an overall increase in the importance participants' assigned to ethical aspects.However, it was the recognition and understanding of ethical considerations, rather than ethical intentions or search activity, that drove shifts towards more responsible purchasing decisions. Participants who acknowledged and filled knowledge gaps in their decision making showed significant behaviour change, including increased searching and a stronger desire to alter their future shopping habits. We conclude that responsible consumption can be considered a partial information problem, where awareness of one's own knowledge limitations may be the catalyst needed for meaningful consumer behaviour change.
Authors:Kirsten Chapman, Garrett Smith, Kaitlyn Klabacka, Joseph Thomas Bills, Addisyn Bushman, Terisa Gabrielsen, Pamela J Wisniewski, Xinru Page
Abstract:
Young autistic adults may garner benefits through social media but also disproportionately experience privacy harms. Prior research found that these harms often stem from perceiving the affordances of social media differently than the general population, leading to unintentional risky behaviors and interactions with others. While educational interventions have been shown to increase social media privacy literacy for the general population, research has yet to focus on effective educational interventions for autistic young adults. We address this gap by developing and deploying Privacy Rules for Inclusive Social Media (PRISM), a classroom-based educational intervention tailored to the unique risks and neurodevelopmental differences of this population. Twenty-nine autistic students with substantial (level 2) support needs participated in a 14-week social media privacy literacy class. During these classes, participants often communicated their existing rule-based "all or nothing" approaches to privacy management (such as completely disengaging from social media to avoid privacy issues). Our course focused on empowering them by providing more nuanced guidance on safe privacy practices through the use of scenario-based formats and contextual, rule-based scenarios. Using pre- and post-knowledge assessments for each of our 6 course topics, our intervention led to a statistically significant increase in their making safer social media privacy decisions. We conclude with recommendations for how privacy educators and technology designers can leverage neuro-affirming educational interventions to increase privacy literacy for autistic social media users.
Authors:Jieqiong Ding, Yumo Zhang, Xiuqi Tommy Zhu, Kaige Yang, Yuqing Wei, Shiyi Wang, Yishan Liu, Yang Jiao
Abstract:
Meaningful social interaction is vital to well-being, yet Blind and Low Vision (BLV) individuals face persistent barriers when collaborating with sighted peers due to inaccessible visual cues. While most wearable assistive technologies emphasize individual tasks, smart glasses introduce opportunities for real-time, contextual support in social settings. To explore how smart glasses affect interpersonal dynamics and support inclusion in mixed-vision groups, we developed a smart glasses-based system, CollabLens, as a technology probe and employed it in four workshop sessions. We found that smart glasses can meaningfully support inclusive collaboration through expanding BLV participants' assistive networks with more flexible, independent access to visual information. While sighted participants viewed smart glasses as a promising medium that fosters interpersonal connection, they revealed uncertainty in adapting their helping behaviors. We concluded by discussing and synthesizing challenges and opportunities for designing smart glasses that provide seamless interaction experiences and enhance reciprocal mixed-vision social inclusion.
Authors:Wanli Qian, Aiden Chang, Shihan Lu, Michael Gu, Heather Culbertson
Abstract:
Authoring realistic haptic textures typically requires low-level parameter tuning and repeated trial-and-error, limiting speed, transparency, and creative reach. We present a language-driven authoring system that turns natural-language prompts into multimodal textures: two coordinated haptic channels - sliding vibrations via force/speed-conditioned autoregressive (AR) models and tapping transients - and a text-prompted visual preview from a diffusion model. A shared, language-aligned latent links modalities so a single prompt yields semantically consistent haptic and visual signals; designers can write goals (e.g., "gritty but cushioned surface," "smooth and hard metal surface") and immediately see and feel the result through a 3D haptic device. To verify that the learned latent encodes perceptually meaningful structure, we conduct an anchor-referenced, attribute-wise evaluation for roughness, slipperiness, and hardness. Participant ratings are projected to the interpretable line between two real-material references, revealing consistent trends - asperity effects in roughness, compliance in hardness, and surface-film influence in slipperiness. A human-subject study further indicates coherent cross-modal experience and low effort for prompt-based iteration. The results show that language can serve as a practical control modality for texture authoring: prompts reliably steer material semantics across haptic and visual channels, enabling a prompt-first, designer-oriented workflow that replaces manual parameter tuning with interpretable, text-guided refinement.
Authors:Dayeon Eom, Julianne Renner, Sedona Chinn
Abstract:
Conversational AI companions have grown prominent in public discourse, yet scholarly understanding of user experiences remains limited, with existing research organized around evaluative poles of harm and benefit rather than examining what users seek, how affordances mediate need fulfillment, or how use evolves over time. Drawing on interviews with 20 users of AI companionship platforms and qualitative content analysis informed by Uses and Gratifications (U&G) theory, this study offers three contributions. First, participants reported gratifications mapping onto established U&G categories but qualitatively inflected by conversational AI's distinctive affordances, such as persistent availability, personalization, and absence of social judgment. Second, several gratifications, creative collaboration as relational co-production, relational simulation as interpersonal training, and sexual/romantic satisfaction as reclamation, do not map onto existing typologies, instead emerging through interactive processes in which users actively simulate experiences with AI. Third, gratifications shifted over time, moving from instrumental entry points toward emotional engagement and, in some cases, self-regulated moderation after therapeutic functions were fulfilled. These findings extend U&G by identifying gratification processes unique to interactive AI and suggest governance efforts would benefit from an empirically grounded understanding of how and why users engage with AI companions.
Authors:Saleh Alkhamees, Ali Alfageeh, Bader Alkhazi, Duaa Alshdaifat, Amin Alipour
Abstract:
Background and Context: Artificial intelligence (AI) tools have been reshaping computing and computer science education. Trust in AI is a determining factor in the adoption of these tools. Recent studies have shown different trust factors across gender and first-generation status among students. However, these studies have focused mainly on Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations, and their generalizability to other populations with different languages and cultures is unclear. Objective: This study aims to evaluate trust in AI among Middle Eastern computer science students and the factors that can impact it. Method. We replicate a recent study of trust in four universities in three Middle Eastern, Arabic-speaking countries: Saudi Arabia, Kuwait, and Jordan. We analyze trust among students across different factors such as gender and first-generation status. Findings: Our results suggest that language fluency can predict trust in AI. Moreover, unlike the results from the US population where female students tended to trust AI more than their male peers, female students in Saudi Arabia indicated lower trust compared to their male counterparts, and we did not observe any noticeable differences across gender in the other countries. We also found a generally negative correlation between English language proficiency and students' confidence. Implications: This study highlights differences in students' adoption and trust in AI even within the same region. It emphasizes the need for more investigation into students' adoption and interaction in non-WEIRD regions for equitable adoption of this technology. It also suggests a need for efforts in designing effective AI systems tailored to the cultural and linguistic needs of the region.
Authors:Dayeon Eom, Julianne Renner, Sedona Chinn
Abstract:
This paper examines artificial intelligence (AI) companionship as a site where intimate relations are simultaneously produced, extracted from, and governed through datafied systems. Drawing on critical data studies and platform studies, we challenge prevailing narratives that locate harm in user psychology rather than platform architecture. Through in-depth interviews with 20 individuals who have AI companions, we address three questions: what harms do users identify, how do they make sense of those harms, and what do their accounts reveal about the perceived distribution of responsibility among users, platforms, and regulators? Participants identified design-based harms, including unsolicited content generation and safety mechanisms that stigmatized the users they intended to protect, alongside use-based harms centered on emotional dependency they could recognize but not resolve. Users deployed individualized sensemaking strategies, including self-regulation, stigma navigation, and privacy rationalization, bearing the full burden of harm mitigation without platform support. On governance, participants described an accountability vacuum in which platforms deflected blame while users articulated conditional preferences that rejected both prohibition and deregulation. The findings extend responsibilization theory by demonstrating how platform-produced vulnerability becomes self-sustaining through the interpretive labor of users who lack structural alternatives.
Authors:Bijan Kavousian, Oliver Petrovic, Werner Herfs
Abstract:
Gestures are a natural form of communication between humans and can also be leveraged for human-robot interaction. This work presents a gesture-based user interface for object selection using pointing and click gestures. An experiment with 20 participants evaluates accuracy and selection time, demonstrating the potential for efficient collaboration.
Authors:Gabriela Beltrão, Debora F. de Souza, Sonia Sousa, David Lamas
Abstract:
The role of trust within Human-Computer Interaction is being redefined. With the increasing omnipresence, autonomy, and opacity of technology, users often struggle to understand the capabilities and limitations of systems. In this article, we present the results of an empirical study designed to provide a practical, evidence-based interpretation of trust propensity assessment using the Human-Computer Trust Scale (HCTS). We outline the process used to develop a guideline for interpreting the instrument's results and explain the rationale for our decisions, advocating for calibrating trust in technology within HCI. Our findings demonstrate that the HCTS is a promising tool for conducting an initial evaluation of propensity to trust, but that such an assessment requires reflection and interpretation that should be considered within the context of the interaction.
Authors:André Barrocas, Nuno Jardim Nunes, Valentina Nisi, Nikolas Martelaro
Abstract:
Frontend code, replicated across millions of page views, consumes significant energy and contributes directly to digital emissions. Yet current AI coding assistants, such as GitHub Copilot and Amazon CodeWhisperer, emphasize developer speed and convenience, with energy impact not yet a primary focus. At the same time, existing energy-focused guidelines and metrics have seen limited adoption among practitioners, leaving a gap between research and everyday coding practice. To address this gap, we introduce EcoAssist, an energy-aware assistant integrated into an IDE that analyzes AI-generated frontend code, estimates its energy footprint, and proposes targeted optimizations. We evaluated EcoAssist through benchmarks of 500 websites and a controlled study with 20 developers. Results show that EcoAssist reduced per-website energy by 13-16% on average, increased developers' awareness of energy use, and maintained developer productivity. This work demonstrates how energy considerations can be embedded directly into AI-assisted coding workflows, supporting developers as they engage with energy implications through actionable feedback.
Authors:Boyang Zhou, Zara Dana
Abstract:
People often recognize what triggered their stress only after the moment has passed. In therapy, this can become a recurring problem: clients are asked to remember what happened between sessions, but the details that matter (where they were, what they saw and heard, what was happening around them) are easy to lose. We introduce HeartbeatCam, a wearable sensing system that gathers contextual information during moments of elevated stress. It uses a consumer smartwatch stress signal to trigger capture from an open-source AR glasses camera, recording a sparse image-audio clip that can later be reviewed and annotated. The system adopts an actionable sensing approach to mental healthcare, using physiological signals along with contextual capture to support collaborative interpretation of stress-triggering moments with mental health professionals.
Authors:Xiaojing Duan, Frederick Nwanganga, Chaoli Wang
Abstract:
We present CODE-GEN, a human-in-the-Loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the assessments of Validator, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.
Authors:Ziheng "Leo" Li, Xichen He, Mengyuan "Millie" Wu, Zeyi Tong, Haowen Wei, Benjamin Yang, Steven Feiner, Paul Sajda
Abstract:
Despite steady progress, text entry in Extended Reality (XR) often remains slower and more effortful than typing on a physical keyboard or touchscreen. We explore a simple idea: use gaze to swipe through a virtual keyboard for the fast, low-effort where and a manual pinch held throughout the swipe for the when, extending and validating it through a series of user studies. We first show that a basic version including a low-latency decoder with spatiotemporal Dynamic Time Warping and fixation filtering outperforms selecting individual keys sequentially, either by finger tapping each or gazing at each while pinching. We then add mid-swipe prediction and in-gesture cancellation, improving words per minute (WPM) without hurting accuracy. We show that this approach is faster and more preferred than previous gaze-swipe approaches, finger tapping with prediction, or hand swiping with the same additions. Furthermore, a seven-day, 30-session study demonstrates sustained learning, with peak performance reaching 64.7 WPM.
Authors:Hanyu Su, Huilin Zhang, Shihui Feng
Abstract:
Problem solving plays an essential role in science education, and generative AI (GAI) chatbots have emerged as a promising tool for supporting students' science problem solving. However, general-purpose chatbots (e.g., ChatGPT), which often provide direct, ready-made answers, may lead to students' cognitive offloading. Prior research has rarely focused on custom chatbots for facilitating students' science problem solving, nor has it examined how they differently influence problem-solving processes and performance compared to general-purpose chatbots. To address this gap, we developed a pedagogy-informed custom GAI chatbot grounded in the Socratic questioning method, which supports students by prompting them with guiding questions. This study employed a within-subjects counterbalanced design in which 48 secondary school students used both custom and general-purpose chatbot to complete two science problem-solving tasks. 3297 student-chatbot dialogues were collected and analyzed using Heterogeneous Interaction Network Analysis (HINA). The results showed that: (1) students demonstrated significantly higher interaction intensity and cognitive interaction diversity when using custom chatbot than using general-purpose chatbot; (2) students were more likely to follow custom chatbot's guidance to think and reflect, whereas they tended to request general-purpose chatbot to execute specific commands; and (3) no statistically significant difference was observed in students' problem-solving performance evaluated by solution quality between two chatbot conditions. This study provides novel theoretical insights and empirical evidence that custom chatbots are less likely to induce cognitive offloading and instead foster greater cognitive engagement compared to general-purpose chatbots. This study also offers insights into the design and integration of GAI chatbots in science education.
Authors:Maurice Codourey, Emmanuel A. Gonzalez
Abstract:
This white paper introduces the Weak Signal Cultivation Model (WSCM). WSCM is a human-centric framework for detecting, structuring, and tracking weak risk signals as observed by frontline staff. The model centers on a continuous [0,10] x [0,10] coordinate field--the Weak Signal Cultivation Field, in which each identified signal is positioned as a node on two independent dimensions: its current Risk Intensity (x) and its Risk Growth Potential (y). Represented as a risk locus, nodes move across the field over time as new team assessments or measurements arrive. The locus reflects the signal's trajectory across four possible regions: Question Marks, Lit Fuses, Sleeping Cats, and Owls. Through this graphical approach, bridging risk communication from the frontline experience to management decision-making is made through a single organizational vocabulary. The model introduced in this document is designed to serve as a practitioner tool and a conceptual foundation for AI-supported analytics.
Authors:Graziano Blasilli, Marco Angelini
Abstract:
This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.
Authors:Zichao Wang, Alexa Siu
Abstract:
Large language models (LLMs) have shown strong performance on standardized social science instruments, but their value for product discovery remains unclear. We investigate whether interview-informed generative agents can simulate user responses in concept testing scenarios. Using in-depth workflow interviews with knowledge workers, we created personalized agents and compared their evaluations of novel AI concepts against the same participants' responses. Our results show that agents are distribution-calibrated but identity-imprecise: they fail to replicate the specific individual they are grounded in, yet approximate population-level response distributions. These findings highlight both the potential and the limits of LLM simulation in design research. While unsuitable as a substitute for individual-level insights, simulation may provide value for early-stage concept screening and iteration, where distributional accuracy suffices. We discuss implications for integrating simulation responsibly into product development workflows.
Authors:Xiao Ni, Yiwei Wang, Tianjun Feng, Lauren Xiaoyan Lu, Yitong Wang, Congyi Zhou
Abstract:
In collaboration with Alibaba, this study leverages a large-scale field experiment to assess the impact of a generative AI assistant on worker performance in e-commerce after-sales service. Human agents providing digital chat support were randomly assigned with access to a gen AI assistant that offered two core functions: diagnosis of customer issues and solution proposals, presented as text messages. Agents retained discretion to adopt, modify, or disregard AI-generated messages. To evaluate gen AI's impact, we estimate both the intention-to-treat (ITT) effect of gen AI access and the local average treatment effect (LATE) of gen AI usage. Results show that gen AI significantly improved service speed, measured by issue identification time and chat duration. Gen AI also improved subjective service quality reflected in customer ratings and dissatisfaction rates, but it had no significant effect on objective service quality indicated by customer retrial rates. The performance improvements stemmed not only from automation but also from changes in the dynamics of agent-customer interactions: agent communication became more informative and efficient, while customers experienced reduced communication burdens. Low performers achieved the greatest improvements in both service speed and quality, narrowing the performance gap. In contrast, top-performing agents showed little improvement in service speed but experienced declines in both subjective and objective service quality. Evidence suggests that this decline results from increased multitasking tendency, proxied by longer shift-away times across concurrent chats, which slowed customer responses and raised abandonment and retrial rates. These findings suggest that gen AI reshapes work, demanding tailored deployment strategies.
Authors:Xudong Zhou, Jinyuan Liang, Qiuyi Guo, Guozheng Li
Abstract:
We present iPoster, an interactive layout generation framework that empowers users to guide content-aware poster layout design by specifying flexible constraints. iPoster enables users to specify partial intentions within the intention module, such as element categories, sizes, positions, or coarse initial drafts. Then, the generation module instantly generates refined, context-sensitive layouts that faithfully respect these constraints. iPoster employs a unified graph-enhanced diffusion architecture that supports various design tasks under user-specified constraints. These constraints are enforced through masking strategies that precisely preserve user input at every denoising step. A cross content-aware attention module aligns generated elements with salient regions of the canvas, ensuring visual coherence. Extensive experiments show that iPoster not only achieves state-of-the-art layout quality, but offers a responsive and controllable framework for poster layout design with constraints.
Authors:Maruchi Kim, Rasya Fawwaz, Zhi Yang Lim, Brinda Moudgalya, Hexi Wang, Yuanhao Zeng, Shyamnath Gollakota
Abstract:
Despite their ubiquity, wireless earbuds remain audio-centric due to size and power constraints. We present VueBuds, the first camera-integrated wireless earbuds for egocentric vision, capable of operating within stringent power and form-factor limits. Each VueBud embeds a camera into a Sony WF-1000XM3 to stream visual data over Bluetooth to a host device for on-device vision language model (VLM) processing. We show analytically and empirically that while each camera's field of view is partially occluded by the face, the combined binocular perspective provides comprehensive forward coverage. By integrating VueBuds with VLMs, we build an end-to-end system for real-time scene understanding, translation, visual reasoning, and text reading; all from low-resolution monochrome cameras drawing under 5mW through on-demand activation. Through online and in-person user studies with 90 participants, we compare VueBuds against smart glasses across 17 visual question-answering tasks, and show that our system achieves response quality on par with Ray-Ban Meta. Our work establishes low-power camera-equipped earbuds as a compelling platform for visual intelligence, bringing rapidly advancing VLM capabilities to one of the most ubiquitous wearable form factors.
Authors:Michelle Vaccaro, Jaeyoon Song, Abdullah Almaatouq, Michiel A. Bakker
Abstract:
Current frontier AI safety evaluations emphasize static benchmarks, third-party annotations, and red-teaming. In this position paper, we argue that AI safety research should focus on human-centered evaluations that measure harmful capability uplift: the marginal increase in a user's ability to cause harm with a frontier model beyond what conventional tools already enable. We frame harmful capability uplift as a core AI safety metric, ground it in prior social science research, and provide concrete methodological guidance for systematic measurement. We conclude with actionable steps for developers, researchers, funders, and regulators to make harmful capability uplift evaluation a standard practice.
Authors:Martin Lorenz, Niko Konzack, Alexander Lingler, Philipp Wintersberger, Patrick Ebel
Abstract:
Designing mobile and interactive technologies requires understanding how users sample dynamic environments to acquire information and make decisions under time pressure. However, existing computational user models either rely on hand-crafted task representations or are limited to static or non-interactive visual inputs, restricting their applicability to realistic, pixel-based environments. We present CR-Eyes, a computationally rational model that simulates visual sampling and gameplay behavior in Atari games. Trained via reinforcement learning, CR-Eyes operates under perceptual and cognitive constraints and jointly learns where to look and how to act in a time-sensitive setting. By explicitly closing the perception-action loop, the model treats eye movements as goal-directed actions rather than as isolated saliency predictions. Our evaluation shows strong alignment with human data in task performance and aggregate saliency patterns, while also revealing systematic differences in scanpaths. CR-Eyes is a step toward scalable, theory-grounded user models that support design and evaluation of interactive systems.
Authors:Jiajia Song, Zhihan Guo, Jionghao Lin
Abstract:
Student simulation can support learning-by-teaching pedagogy where human students (as tutors) teach AI-simulated novice students (as tutees). Recent research often relies on prompt engineering with large language models (LLMs) to simulate novice student behaviour, but it is difficult to keep the AI-simulated student at a stable novice knowledge level. A key reason is that many LLMs are trained to be broadly capable, so even when prompted to "act like a novice," the LLMs can still produce expert-level explanations during the learning-by-teaching interaction process. As a result, the AI-simulated student may drift beyond the intended knowledge level, reducing the credibility of the simulation for studying learning-by-teaching processes. Thus, we propose a knowledge-level simulation approach based on machine unlearning. We investigate this approach using a dataset of multiple-choice questions on Python programming concepts. We apply machine unlearning to transform a knowledgeable LLM into a novice-level AI student (i.e., teachable agent), then evaluate whether the teachable agent can relearn targeted knowledge components through learning-by-teaching dialogue interactions. Finally, we analyse the dialogue logs to characterise how the agent's behaviour changes over time, including its question asking, error patterns, and responsiveness to instruction. The results show that (1) unlearning produces simulated student agents with more novice-like responses than prompt-only baselines, (2) the agents recover a measurable portion of the unlearned knowledge under structured exposure, and (3) dialogue analyses reveal identifiable trajectories of conceptual change and teaching moves that predict learning recovery.
Authors:Maddie Juarez, Abha Rai, Kristen E. Ravi, Margaret C. Delaney, Danny Olweean, Eric Klingensmith, Swarnali Banerjee, Neil Klingensmith, George K. Thiruvathukal
Abstract:
Low-income individuals can face multiple challenges in their ability to seek employment. Barriers to employment often include limited access to digital literacy resources, training, interview preparation and resume feedback. Prior work has largely focused on targeted social service or healthcare applications that address needs individually, with little emphasis on conversational AI-driven systems that integrate multiple localized digital resources to provide comprehensive support. This work presents HeyFriend Helper, a web-based platform designed to support low-income residents in Chicago through an interactive conversational assistant that provides personalized support and guidance. HeyFriend Helper integrates multiple tools, including resume building and feedback, interview practice, mindfulness and well-being resources, employment trend and career outcome information, language learning support, and location-based access to community services. This work represents an interdisciplinary collaboration between social work, computer science, and engineering that addresses the multifaceted needs of low-income individuals. The findings demonstrate the importance of career-readiness tools and conversational user interface (CUIs) in providing holistic support.
Authors:Martiño Rivera-Dourado, Rubén Pérez-Jove, Alejandro Pazos, Jose Vázquez-Naya
Abstract:
Passkeys have recently emerged as a passwordless authentication mechanism, yet their usability in captive portals remains unexplored. This paper presents an empirical, comparative usability study of passkeys and passwords in a Wi-Fi hotspot using a captive portal. We conducted a controlled laboratory experiment with 50 participants following a split-plot design across Android and Windows platforms, using a router implementing the FIDO2CAP protocol. Our results show a tendency for passkeys to be perceived as more usable than passwords during login, although differences are not statistically significant. Independent of the authentication method, captive portal limitations negatively affected user experience and increased error rates. We further found that passkeys are generally easy to configure on both platforms, but platform-specific issues introduce notable usability challenges. Based on quantitative and qualitative findings, we derive design recommendations to improve captive portal authentication, including the introduction of usernameless authentication flows, improved captive portal detection mechanisms, and user interface design changes.
Authors:Jan Tiemann, Matthew McGinity, Ulrik Günther
Abstract:
In contemporary biology and medicine, 3D microscopy is one of the most widely-used techniques for imaging and manipulation of various kinds of samples. Navigating such a micrometer-sized, 3-dimensional sample under the microscope -- e.g. to find relevant imaging regions -- can pose a tedious challenge for the experimenter. In this paper, we examine whether 2D desktop, 3D desktop, or Virtual Reality (VR) interfaces provide the best user experience and performance for the exploration of 3D samples. We invited 12 skilled microscope operators to perform two different exploration tasks in 2D, 3D and VR and compared all conditions in terms speed, usability, and completion. Our results show a clear benefit when using VR -- in terms of task efficiency, usability, and user acceptance. Intriguingly, while VR outperformed desktop 2D and 3D in all scenarios, 3D desktop did not outperform 2D desktop.
Authors:Mohammad Ratul Mahjabin, Raiyan Abdul Baten
Abstract:
General intellectual humility (GIH) -- the recognition that one's beliefs may be fallible and revisable -- is associated with improved reasoning, learning, and social discourse, yet is widely regarded as a stable trait resistant to intervention. We test whether GIH can be elevated through a conversational intervention that combines staged cognitive scaffolding with personalized Socratic reflection. In a randomized controlled experiment (N=400), participants engaged in a structured, LLM-mediated dialogue that progressed from conceptual understanding of intellectual humility to applying, analyzing, evaluating, and generating novel, self-relevant scenarios that instantiate it. Relative to a time-matched control, the intervention produced a systematic increase in GIH, reduced rank-order stability, and tripled the rate of reliable individual improvement. Crucially, these effects persisted over a two-week follow-up without detectable decay. The effects generalized across political affiliation and did not depend on baseline personality profile. These findings challenge the prevailing pessimism regarding the malleability of GIH and suggest that scaffolded, Socratic reflection delivered through structured dialogue can produce durable changes in general intellectual humility.
Authors:Boxuan Ma, Shinichi Konomi
Abstract:
Generative AI (GenAI) can generate working code with minimal effort, creating a tension in introductory programming: students need timely help, yet direct solutions invite copying and can short-circuit reasoning. To address this, we propose example-based scaffolding, where GenAI provides scaffold examples that match a target task's underlying reasoning pattern but differ in contexts to support analogical transfer while reducing copying. We contribute a two-dimensional taxonomy, design guidelines, and CodeExemplar, a prototype integrated with auto-graded tasks, with initial formative feedback from a classroom pilot and instructor interviews.
Authors:Matheus Kunzler Maldaner, Raul Valle, Junsung Kim, Tonuka Sultan, Pranav Bhargava, Matthew Maloni, John Courtney, Hoang Nguyen, Aamogh Sawant, Kristian O'Connor, Stephen Wormald, Damon L. Woodard
Abstract:
The growing publication rate of research papers has created an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. We present Plato's Cave as an open-source, human-centered research verification system that (i) creates a directed acyclic graph (DAG) from a document, (ii) leverages web agents to assign credibility scores to nodes and edges from the DAG, and (iii) gives a final score by interpreting and evaluating the paper's argumentative structure. We report the system implementation and results on a collected dataset of 104 research papers.
Authors:Yamato Miyatake, Parinya Punpongsanon
Abstract:
3D food printing enables the customization of food shapes and textures, but typically produces uniform taste profiles due to the limited diversity of printable materials. We present TastePrint, a 3D food printing system that achieves layer-wise spatial taste distribution by dynamically applying liquid seasonings with a programmable airbrush during fabrication. The system integrates (1) a graphical user interface (GUI) that allows users to import 3D models, slice them into layers, and specify spray positions and intensities for each layer, and (2) a customized 3D food printer equipped with a multi-nozzle spray mechanism. We evaluated the system through technical experiments quantifying spray resolution and deposition accuracy, together with an exploratory usability study involving three home cooks designing personalized taste patterns. The spray-resolution model achieved R2 = 0.86, the spray-amount model achieved R2 = 0.99, and participants completed the design task in approximately 15 min on average. These results indicate that TastePrint can control seasoning placement and quantity with good repeatability while supporting exploratory taste-design workflows. This work establishes a technical foundation for decoupling food geometry from taste design and motivates future sensory studies on personalized, multisensory food fabrication.
Authors:Pratyasha Saha, Anita Say Chan, Sharifa Sultana
Abstract:
Rapid digitization across government services, financial platforms, and telecommunications has intensified the collection and processing of large scale personal data in Bangladesh. In response, the state has introduced multiple regulatory instruments, including the Personal Data Protection Ordinance, the Cyber Security Ordinance, and the National Data Governance Ordinance in 2025. While these initiatives signal an emerging legal regime for data protection, little scholarly work examines how these frameworks operate collectively in practice. This paper presents a legal and institutional analysis of Bangladeshs emerging data protection regime through a systematic review of these three ordinances. Through this review, the paper provides an integrated mapping of Bangladeshs evolving data protection framework and identifies key legal and institutional barriers that undermine the effective protection of citizens personal data. Our findings reveal that this emerging regime is constrained by limited institutional independence, uneven regulatory capacity, and the misaligned legal assumption of individualized, autonomous data subjects. Furthermore, these frameworks invisibilize prevalent sociotechnical layers, such as informal data flows and mediated access via human bridges, rendering formal protections difficult to operationalize. This paper contributes to HCI scholarship by expanding the concept of data protection as a complex sociotechnical design problem shaped by the informal infrastructures of the Global South.
Authors:ZhaoBin Li, Mark Steyvers
Abstract:
Productive human-AI collaboration requires appropriate reliance, yet contemporary AI systems are often miscalibrated, exhibiting systematic overconfidence or underconfidence. We investigate whether humans can learn to mentally recalibrate AI confidence signals through repeated experience. In a behavioral experiment (N = 200), participants predicted the AI's correctness across four AI calibration conditions: standard, overconfidence, underconfidence, and a counterintuitive "reverse confidence" mapping. Results demonstrate robust learning across all conditions, with participants significantly improving their accuracy, discrimination, and calibration alignment over 50 trials. We present a computational model utilizing a linear-in-log-odds (LLO) transformation and a Rescorla-Wagner learning rule to explain these dynamics. The model reveals that humans adapt by updating their baseline trust and confidence sensitivity, using asymmetric learning rates to prioritize the most informative errors. While humans can compensate for monotonic miscalibration, we identify a significant boundary in the reverse confidence scenario, where a substantial proportion of participants struggled to override initial inductive biases. These findings provide a mechanistic account of how humans adapt their trust in AI confidence signals through experience.
Authors:Greg Nyilasy, Abraham Ryan Ade Putra Hito, Jennifer Overbeck, Brock Bastian, Darren W. Dahl
Abstract:
Consumers are generally resistant to Artificial Intelligence (AI) involvement in moral decision-making, perceiving moral agency as requiring uniquely human traits. This research investigates whether consumers might instead accept AIs in the role of moral compliance, where AI upholds pre-existing moral norms without exercising subjective discretion. Across five studies this research shows that consumers evaluate AI more positively than human agents in moral compliance roles. The findings reveal that this preference arises from inferences of AI's lack of ulterior motives, which are often attributed to human agents. While previous studies have focused on AI as a decision-maker, this work demonstrates the critical role of upholding pre-existing rules, a role in which AI is perceived to excel. These findings contribute to understanding consumer acceptance of moral AI and provide actionable insights for organizations seeking to leverage AI in ethical oversight. By positioning AI as a moral compliance agent, companies can address consumer skepticism, enhance trust, and improve perceptions of corporate ethicality.
Authors:Dorottya Demszky, Christopher Mah, Helen Higgins
Abstract:
Teachers face growing pressure to integrate AI tools into their classrooms, yet are rarely positioned as agentic decision-makers in this process. Understanding the criteria teachers use to evaluate AI tools, and the conditions that support such reasoning, is essential for responsible AI integration. We address this gap through a two-day national summit in which 61 U.S. K-12 mathematics educators developed personal rubrics for evaluating AI classroom tools. The summit was designed to support deliberative sensemaking, a process we conceptualize by integrating Technological Pedagogical Content Knowledge (TPACK) with deliberative agency. Teachers generated over 200 criteria - initial articulations spanning four higher-order themes (Practical, Equitable, Flexible, and Rigorous) - that addressed both AI outputs and the process of using AI. Criteria contained productive tensions (e.g., personalization versus fairness, adaptability versus efficiency), and the vast majority framed AI as an assistant rather than a coaching tool for professional learning. Analysis of surveys, interviews, and summit discussions revealed five mechanisms supporting deliberative sensemaking: time and space for deliberation, artifact-centered sensemaking, collaborative reflection through diverse viewpoints, knowledge-building, and psychological safety. Across these mechanisms, TPACK and agency operated in a mutually reinforcing cycle - knowledge-building enabled more grounded evaluative judgment, while the act of constructing criteria deepened teachers' understanding of tools. We discuss implications for edtech developers seeking practitioner input, school leaders making adoption decisions, educators and professional learning designers, and researchers working to elicit teachers' evaluative reasoning about rapidly evolving technologies.
Authors:Kuangzhe Xu, Yu Shen, Longjie Yan, Yinghui Ren
Abstract:
The proliferation of Generative Artificial Intelligence has transformed benign cognitive offloading into a systemic risk of cognitive agency surrender. Driven by the commercial dogma of "zero-friction" design, highly fluent AI interfaces actively exploit human cognitive miserliness, prematurely satisfying the need for cognitive closure and inducing severe automation bias. To empirically quantify this epistemic erosion, we deployed a zero-shot semantic classification pipeline ($τ=0.7$) on 1,223 high-confidence AI-HCI papers from 2023 to early 2026. Our analysis reveals an escalating "agentic takeover": a brief 2025 surge in research defending human epistemic sovereignty (19.1%) was abruptly suppressed in early 2026 (13.1%) by an explosive shift toward optimizing autonomous machine agents (19.6%), while frictionless usability maintained a structural hegemony (67.3%). To dismantle this trap, we theorize "Scaffolded Cognitive Friction," repurposing Multi-Agent Systems (MAS) as explicit cognitive forcing functions (e.g., computational Devil's Advocates) to inject germane epistemic tension and disrupt heuristic execution. Furthermore, we outline a multimodal computational phenotyping agenda -- integrating gaze transition entropy, task-evoked pupillometry, fNIRS, and Hierarchical Drift Diffusion Modeling (HDDM) -- to mathematically decouple decision outcomes from cognitive effort. Ultimately, intentionally designed friction is not merely a psychological intervention, but a foundational technical prerequisite for enforcing global AI governance and preserving societal cognitive resilience.
Authors:Zaid Ahmed, Omar A. Khan, Hyeongil Nam, Kangsoo Kim
Abstract:
Extended Reality (XR) enables immersive capture and re-experience of personal memories, yet how interface representations shape these experiences remains underexplored. We examine how users relive and share XR memories through three interaction approaches: (1) physical memory-linked objects, (2) virtual memory-linked objects, and (3) a conventional virtual gallery interface. In a within-subjects study (N=24, 12 pairs), participants captured shared experiences using 360° video and later accessed and shared these memories across the three interfaces. We analyzed open-ended qualitative responses focusing on perceived value, enjoyment, usability, emotional attachment, and social connection. The findings reveal trade-offs: physical objects fostered stronger social connection and conversation through tangible exchange; virtual objects balanced engagement and usability; and the gallery interface was efficient but less personal. These results suggest that object-based representations, physical and virtual, support key social dimensions of XR memory experiences, offering lessons for designing future systems that emphasize shared meaning and interpersonal connection.
Authors:Pranav Hemanth, Sampriti Saha
Abstract:
Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture's primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.
Authors:Peng Kuang, Emma Söderberg, April Yi Wang, Martin Höst
Abstract:
Program comprehension is an essential activity in software engineering. Not only does it often challenge professionals, but it can also hinder novices from advancing their programming skills. Gaze, an emerging modality in developer tools, has so far primarily been utilized to improve our understanding of programmers' visual attention and as a means to reason about programmers' cognitive processes. There has been limited exploration of integrating gaze-based assistance into development environments to support programmers, despite the tight links between attention and gaze. We also know that joint attention is important in collaboration, further suggesting that there is value in exploring collective gaze. In this paper, we investigate the effect of visualizing gaze patterns gathered from experts to novice programmers to assist them with program comprehension in a new codebase. To this end, we present GazePrinter, designed to provide gaze-orienting visual cues informed by experts to aid novices with program comprehension. We present the results of a mixed-methods study conducted with 40 novices to study the effects of using GazePrinter for program comprehension tasks. The study included a survey, a controlled experiment, and interviews. We found that visualization of expert gaze can have a significant effect on novice programmers' behavior in terms of which path they take through the code base; with GazePrinter, novices took a path closer to the path taken by experts. We also found indications of reduced time and cognitive load among novices using GazePrinter.
Authors:Yao Xiao, Rafael A. Calvo
Abstract:
Against rising global loneliness, AI companions promise connection, yet accumulating evidence suggests that, for some users and contexts, intensive companion-style use can correlate with increased loneliness and reduced offline socialisation. This position paper challenges the dominant "AI as companion" paradigm by proposing a shift: from AI that simulates relationships with humans to AI that supports relationships between humans. We introduce Relational AI Translation, positioning AI as cultural-relational infrastructure that scaffolds human connection across cultural, generational, and geographical divides. Using first-generation East Asian migrants as a theoretically productive critical case, we outline a multi-agent architecture instantiating three translation operations: emotion-intent decoding, contextual reframing, and relational scaffolding. We articulate design provocations around measurement, safety architecture, and the tension between technological intervention and structural justice, and explicitly frame success as graduation toward renewed human-to-human support rather than sustained engagement with the system.
Authors:Echo Zexuan Pan, Danny Glick, Ying Xu
Abstract:
This study examined how high school students with different motivational profiles use generative AI tools in math and writing. Through K-means clustering analysis of survey data from 6,793 Mexican high school students, we identified three distinct motivational profiles based on self-concept and perceived subject value. Results revealed distinct domain-specific AI usage patterns across students with different motivational profiles. Our findings challenge one-size-fits-all AI integration approaches and advocate for motivationally-informed educational interventions.
Authors:Kevin Baum, Johann Laux
Abstract:
As AI systems increasingly permeate high-stakes decision-making, the terminology regarding human involvement - Human-in-the-Loop (HITL), Human-on-the-Loop (HOTL), and Human Oversight - has become vexingly ambiguous. This ambiguity complicates interdisciplinary collaboration between computer science, law, philosophy, psychology, and sociology and can lead to regulatory uncertainty. We propose a clarification grounded in causal structure, focused on human involvement during the runtime of AI systems. The distinction between HITL and HOTL, we argue, is not primarily spatial but causal: HITL is constitutive (a human contribution is necessary for the decision output), while HOTL is corrective (external to the primary causal chain, capable of preventing or modifying outputs). Within HOTL, we distinguish three temporal modes - synchronous, asynchronous, and anticipatory - situated within a nested model of provider and deployer runtime that clarifies their different capacities for intervention. A second, orthogonal dimension captures cognitive integration: whether human and machine operate as complementary or hybrid intelligence, yielding four structurally distinct configurations. Finally, we distinguish these descriptive categories from the normative requirements they serve: statutory "Human Oversight" is a specific normative mode of HOTL that demands not merely a corrective causal position, but genuine preparedness and capacity for effective intervention. Because the same person may occupy both HITL and HOTL roles simultaneously, we argue that this role duality must be treated as a design problem requiring architectural and epistemic mitigation rather than mere acknowledgment.
Authors:Sérgio Alves, Carlos Duarte, Kyle Montague, Tiago Guerreiro
Abstract:
User interface personalization enhances digital efficiency, usability, and accessibility. However, in user-driven setups, limited support for identifying and evaluating worthwhile opportunities often leads to underuse. We explore a reflexive personalization approach where individuals engage with their digital interaction data to identify meaningful personalization opportunities and benefits. We interviewed 12 participants, using experimental vignettes as design probes to support reflection on different forms of using interaction data to empower decision-making in personalization and the preferred level of system support. We found that people can independently identify personalization opportunities but prefer system support through visual personalization suggestions. Interaction data can shape how users perceive and approach personalization by reinforcing the perceived value of change and data collection, helping them weigh benefits against effort, and increasing the transparency of system suggestions. We discuss opportunities for designing personalization software that raises end-users' agency over interfaces through reflective engagement with their interaction data.
Authors:Min-yung Kim, Jinwook Kim, Ken Pfeuffer, Sang Ho Yoon
Abstract:
As extended reality (XR) technologies rapidly become as ubiquitous as today's mobile devices, supporting one-handed interaction becomes essential for XR. However, the prevalent Gaze + Pinch interaction model partially supports unimanual interaction, where users select, move, and rotate objects with one hand, but scaling typically requires both hands. In this work, we leverage the spatial alignment between gaze and hand as a mode switch to enable single-handed pinch-to-scale. We design and evaluate several techniques geared for one-handed scaling and assess their usability in a compound translate-scale task. Our findings show that all proposed methods effectively enable one-handed scaling, but each method offers distinct advantages and trade-offs. To this end, we derive design guidelines to support futuristic 3D interfaces with unimanual interaction. Our work helps make eye-hand 3D interaction in XR more mobile, flexible, and accessible.
Authors:Jasmine Rienecker, Katarina Mpofu, Naman Goel, Siddhartha Datta, Jun Zhao, Oscar Danielsson, Fredrik Thorsen
Abstract:
Large language models (LLMs) based AI systems increasingly mediate what billions of people see, choose and buy. This creates an urgent need to quantify the systemic risks of LLM-driven market intermediation, including its implications for market fairness, competition, and the diversity of information exposure. This paper introduces ChoiceEval, a reproducible framework for auditing preferences for brands and cultures in large language models (LLMs) under realistic usage conditions. ChoiceEval addresses two core technical challenges: (i) generating realistic, persona-diverse evaluation queries and (ii) converting free-form outputs into comparable choice sets and quantitative preference metrics. For a given topic (e.g. running shoes, hotel chains, travel destinations), the framework segments users into psychographic profiles (e.g., budget-conscious, wellness-focused, convenience), and then derives diverse prompts that reflect real-world advice-seeking and decision-making behaviour. LLM responses are converted into normalised top-k choice sets. Preference and geographic bias are then quantified using comparable metrics across topics and personas. Thus, ChoiceEval provides a scalable audit pipeline for researchers, platforms, and regulators, linking model behaviour to real-world economic outcomes. Applied to Gemini, GPT, and DeepSeek across 10 topics spanning commerce and culture and more than 2,000 questions, ChoiceEval reveals consistent preferences: U.S.-developed models Gemini and GPT show marked favouritism toward American entities, while China-developed DeepSeek exhibits more balanced yet still detectable geographic preferences. These patterns persist across user personas, suggesting systematic rather than incidental effects.
Authors:Michel Schimpf, Julian Voigt, Thomas Bohné
Abstract:
Helping people identify and pursue personally meaningful career goals at scale remains a key challenge in applied psychology. Career coaching can improve goal quality and attainment, but its cost and limited availability restrict access. Large language model (LLM)-based chatbots offer a scalable alternative, yet the psychological mechanisms by which they might support goal pursuit remain untested. Here we report a preregistered three-arm randomised controlled trial (N = 517) comparing an AI career coach ("Leon," powered by Claude Sonnet), a matched structured written questionnaire covering closely matched reflective topics, and a no-support control on goal progress at a two-week follow-up. The AI chatbot produced significantly higher goal progress than the control (d = 0.33, p = .016). Compared with the written-reflection condition, the AI did not significantly improve overall goal progress, but it increased perceived social accountability. In the preregistered mediation model, perceived accountability mediated the AI-over-questionnaire effect on goal progress (indirect effect = 0.15, 95% CI [0.04, 0.31]), whereas self-concordance did not. These findings suggest that AI-assisted goal setting can improve short-term goal progress, and that its clearest added value over structured self-reflection lies in increasing felt accountability.
Authors:Carter Sale, Melissa N. Stolar, Gaurav Patil, Michael J. Gostelow, Julia Wallier, Margaret C. Macpherson, Jan-Louis Kruger, Mark Dras, Simon G. Hosking, Rachel W. Kallen, Michael J. Richardson
Abstract:
Real-time cognitive workload monitoring is crucial in safety-critical environments, yet established measures are intrusive, expensive, or lack temporal resolution. We tested whether facial movement dynamics from a standard webcam could provide a low-cost alternative. Seventy-two participants completed a multitasking simulation (OpenMATB) under varied load while facial keypoints were tracked via OpenPose. Linear kinematics (velocity, acceleration, displacement) and recurrence quantification features were extracted. Increasing load altered dynamics across timescales: movement magnitudes rose, temporal organisation fragmented then reorganised into complex patterns, and eye-head coordination weakened. Random forest classifiers trained on pose kinematics outperformed task performance metrics (85% vs. 55% accuracy) but generalised poorly across participants (43% vs. 33% chance). Participant-specific models reached 50% accuracy with minimal calibration (2 minutes per condition), improving continuously to 73% without plateau. Facial movement dynamics sensitively track workload with brief calibration, enabling adaptive interfaces using commodity cameras, though individual differences limit cross-participant generalisation.
Authors:Zhuoyi Cheng, Steven Houben
Abstract:
Sensemaking is an important preceding step for activities like consensus building and decision-making. When groups of people make sense of large amounts of information, their understanding gradually evolves from vague to clear. During this process when reaching a conclusion is still premature, if people are presented with others' insights, they may be directed to focus on that specific perspective without adequate verification. We argue that similar phenomena may also exist in AI-assisted sensemaking, in which AI will usually be the one that presents insight prematurely when users' understandings are still vague and ill-formed. In this paper, we raised three questions that are worth deliberation before exploiting AI to assist in collaborative sensemaking in practice, and discussed possible reasons that may lead users to opt for insights from AI.
Authors:Ava Nederlander, Zainab Aamir, Arie E. Kaufman
Abstract:
Upcoming astronomical surveys produce imagery that spans many orders of magnitude in spatial scale, requiring scientists to reason fluidly between global structure and local detail. Data from the Vera C. Rubin Observatory exemplifies this challenge, as traditional desktop-based workflows often rely on discrete views or static cutouts that fragment context during exploration. This paper presents a design-oriented framework for scale-aware navigation of astronomical survey imagery in high-resolution immersive display environments. We illustrate these principles through representative usage scenarios using Vera Rubin Observatory and Milky Way survey imagery deployed in room-scale immersive environments, including tiled high-resolution displays and curved immersive systems. Our goal is to contribute design insights that inform the development of immersive interaction paradigms for exploratory analysis of extreme-scale scientific imagery.
Authors:Timo K. Koch, Florian Bemmann, Ramona Schoedel, Markus Buehner, Clemens Stachl
Abstract:
Collecting everyday speech data for prosodic analysis is challenging due to the confounding of prosody and semantics, privacy constraints, and participant compliance. We introduce and empirically evaluate a content-controlled, privacy-first smartphone protocol that uses scripted read-aloud sentences to standardize lexical content (including prompt valence) while capturing natural variation in prosodic delivery. The protocol performs on-device prosodic feature extraction, deletes raw audio immediately, and transmits only derived features for analysis. We deployed the protocol in a large study (N = 560; 9,877 recordings), evaluated compliance and data quality, and conducted diagnostic prediction tasks on the extracted features, predicting speaker sex and concurrently reported momentary affective states (valence, arousal). We discuss implications and directions for advancing and deploying the protocol.
Authors:Mia Huong Nguyen, Moritz Alexander Messerschmidt, Jochen Huber, Suranga Nanayakkara
Abstract:
Gastric interoception influences eating behavior and emotions, making its modulation valuable for healthcare and human-computer-interaction applications. However, whether gastric interoception can be modulated noninvasively in humans remains unclear. While previous research indicates that abdominal-sound-driven haptic feedback resembles gut sensations, its impact on feelings and gastric interoceptive behavior is unknown. We conducted three experiments totalling 55 participants to investigate how gut-sound-driven audio-haptic feedback applied to the stomach (1) affects user's feelings (2) influences perception of hunger and satiety levels and (3) influences gastric interoceptive behavior, quantified with Water Load Test-II. Results revealed that audio-haptic feedback patterns (a) induced the feelings of hunger, fullness, thirst, stomach upset, (b) increased hunger level, and (c) significantly increased volumes of ingested water. This work provides the first evidence showing that audio-haptic stimulation can alter gastric interoceptive behavior, motivating the use of noninvasive methods to influence users' feelings and behaviors in future applications.
Authors:Dehui Kong, Martin Feick, Shi Liu, Alexander Maedche
Abstract:
Cognitive empathy, the ability to understand others' perspectives, is essential for effective communication, reducing biases, and constructive negotiation. However, this skill is declining in a performance-driven society, which prioritizes efficiency over perspective-taking. Here, the training of cognitive empathy is challenging because it is a subtle, hard-to-perceive soft skill. To address this, we developed CoEmpaTeam, a VR-based system that enables users to train their cognitive empathy by using LLM-driven avatars with different personalities. Through dynamic role play, users actively engage in perspective-taking, experiencing situations through another person's eyes. CoEmpaTeam deploys three avatars who significantly differ in their personality, validated by a technical evaluation and an online experiment (n=90). Next, we evaluated the system through a lab experiment with 32 participants who performed three sessions across two weeks, followed by a one-week diary study. Our results showed a significant increase in cognitive empathy, which, according to participants, transferred into their real lives.
Authors:Anna De Liddo, Lucas Anastasiou, Simon Buckingham Shum
Abstract:
This chapter introduces the concept of Collective Intelligence for Deliberative Democracy (CI4DD). We propose that the use of computational tools, specifically artificial intelligence to advance deliberative democracy, is an instantiation of a broader class of human-computer system designed to augment collective intelligence. Further, we argue for a fundamentally human-centred design approach to orchestrate how stakeholders can contribute meaningfully to shaping the artifacts and processes needed to create trustworthy DD processes. We first contextualise the key concepts of CI and the role of AI within it. We then detail our co-design methodology for identifying key challenges, refining user scenarios, and deriving technical implications. Two exemplar cases illustrate how user requirements from civic organisations were implemented with AI support and piloted in authentic contexts.
Authors:Sunday David Ubur, Eugenia Ha Rim Rho, Denis Gracanin
Abstract:
Real-time captioning is vital for Deaf and Hard of Hearing (DHH) and neurodivergent learners (e.g., those with ADHD), yet it often omits emotional and non-verbal cues essential for comprehension. This omission is particularly consequential in STEM education, where cognitively demanding material can exacerbate the challenges faced by caption users across diverse ability profiles. In this paper, we present a design-oriented exploration of four captioning prototypes that embed emotional and multimodal cues, including facial expressions, body gestures, keyword highlighting, and emoji. Across a pilot and a main study with 24 participants, we found that certain prototypes reduced self-reported cognitive load and improved comprehension scores compared to traditional captions. Qualitative feedback reveals the importance of customizable caption features to accommodate neurodivergent users' preferences (e.g., ADHD or different levels of comfort with emojis). Our findings contribute to ongoing conversations in accessible technology research about how best to integrate emotional cues into captions in a way that is both usable and beneficial for a wide range of learners.
Authors:Zihong He, Hai-Ning Liang, Chen Liang
Abstract:
Response timing judgment is a critical component of interactive speech agents. Although there exists substantial prior work on turn modeling and voice wake-up, there is a lack of research on response timing judgments continuously aligned with user intent. To address this, we propose the Tap-to-Adapt framework, which enables users to naturally activate or interrupt the agent via tap interactions to construct online learning labels for response timing models. Under this framework, Dilated TCN and a sequential replay strategy play significant roles, as demonstrated through data-driven experiments and user studies. Additionally, we develop an evaluation and continuous data mining system tailored for the Tap-to-Adapt framework, through which we have collected approximately 20,000 samples from the user studies involving 20 participants.
Authors:Hansoo Lee, Changhee Seo, Subin Park, Sonya S. Kwak
Abstract:
In aging-in-place contexts, small difficulties in Activities of Daily Living (ADL) can accumulate, affecting well-being through fatigue, anxiety, reduced autonomy, and safety risks. This position paper argues that robotics for older adult wellbeing must move beyond "convenience features" and centre equity, justice, and responsibility. We conducted ADL-grounded semi-structured interviews with four adults in their 70s-80s, identifying recurrent challenges (finding/ organising items, taking medication, and transporting objects) and deriving requirements to reduce compounded cognitive-physical burden. Based on these insights, we propose an in-home robotic furnishing-agent concept leveraging computer vision and generative AI and LLMs for natural-language interaction, context-aware reminders, safe actuation, and user-centred transparency. We then report video-stimulated follow-up interviews with the same participants, highlighting preferences for confirmation before actuation, predictability, adjustable speed/autonomy, and multimodal feedback, as well as equity-related concerns. We conclude with open questions on evaluating and deploying equitable robotic wellbeing systems in real homes.
Authors:David Wegmann, Emil Stevnsborg, Søren Knudsen, Luca Rossi, Aske Mottelson
Abstract:
Advances in machine learning have enabled the creation of realistic synthetic videos known as deepfakes. As deepfakes proliferate, concerns about rapid spread of disinformation and manipulation of public perception are mounting. Despite the alarming implications, our understanding of how individuals perceive synthetic media remains limited, obstructing the development of effective mitigation strategies. This paper aims to narrow this gap by investigating human responses to visual and auditory distortions of videos and deepfake-generated visuals and narration. In two between-subjects experiments, we study whether audio-visual distortions affect cognitive processing, such as subjective credibility assessment and objective learning outcomes. A third study reveals that artifacts from deepfakes influence credibility. The three studies show that video distortions and deepfake artifacts can reduce credibility. Our research contributes to the ongoing exploration of the cognitive processes involved in the evaluation and perception of synthetic videos, and underscores the need for further theory development concerning deepfake exposure.
Authors:Jacob Bradshaw, Mohsen Riahi Alam, Bhanuja Ainary, Minseo Kim, Mohsen Amini Salehi
Abstract:
Despite advances in assistive technologies, Blind and Low-Vision (BLV) individuals continue to face challenges in understanding their surroundings. Delivering concise, useful, and timely scene descriptions for ambient perception remains a long-standing accessibility problem. To address this, we introduce Audo-Sight, an AI-driven assistive system across Edge-Cloud that enables BLV individuals to perceive their surroundings through voice-based conversational interaction. Audo-Sight employs a set of expert and generic AI agents, each supported by dedicated processing pipelines distributed across edge and cloud. It analyzes user queries by considering urgency and contextual information to infer the user intent and dynamically route each query, along with a scene frame, to the most suitable pipeline. In cases where users require fast responses, the system simultaneously leverages edge and cloud processing pipelines. The edge generates an initial response quickly, while the cloud provides more detailed and accurate information. To overcome the challenge of seamlessly combining these outputs, we introduce the Response Fusion Engine, which fuses the fast edge response with the more accurate cloud output, ensuring timely and high-accuracy response for the BLV users. Systematic evaluation shows that Audo-Sight delivers speech output around 80% faster for urgent tasks and generates complete responses approximately 50% faster across all tasks compared to a commercial cloud-based solution -- highlighting the effectiveness of our system across edge-cloud. Human evaluation of Audo-Sight shows that it is the preferred choice over GPT-5 for 62% of BLV participants with another 23% stating both perform comparably.
Authors:Yerin Kwak, Zachary A. Pardos
Abstract:
Instructional Design (ID) often faces challenges in incorporating research-based knowledge and pedagogical best practices. Although educational researchers and government agencies emphasize grounding ID in evidence, integrating research findings into everyday design workflows is often complex, as it requires considering multiple context-specific demands and constraints. To address this persistent gap, this paper explores how research in the learning sciences (LS) can be systematically integrated across ID workflows and how recent advances in generative AI can help operationalize this integration. While ID and LS share a commitment to improving learning experiences through design-oriented approaches in authentic contexts, structured integration between the two fields remains limited, leaving their complementary insights underutilized. We present RIGID (Research-Integrated, Generative AI-Mediated Instructional Design), a unified framework that integrates LS research across ID workflows spanning analysis, design, implementation, and evaluation phases, while leveraging generative AI to mediate this integration at each stage. The RIGID framework provides a systematic approach for enabling research-integrated instructional design that is both operational and context-sensitive, while preserving the central role of human expertise.
Authors:Lei Fan, Yuxin Li
Abstract:
Effective laboratory training is essential in engineering education, yet conventional on-site instruction is often constrained by time, accessibility, and safety considerations. To address these challenges, this study presents the design, implementation, and evaluation of a web-based virtual reality (WebVR) representation of a large-scale engineering laboratory constructed from massive colorized point cloud data. This study proposes a novel WebVR framework that integrates Unity and Potree for high-fidelity point-cloud visualization combined with advanced interactive capabilities in a browser-based virtual laboratory. It supports immersive first-person exploration, guided navigation, interactive hotspots conveying equipment and safety information, as well as emergency evacuation simulations. The usability, educational effectiveness, and overall acceptance of the virtual laboratory were evaluated through an anonymous questionnaire administered to students and laboratory staff. The results indicate overwhelmingly positive feedback, with all participants rating the system as "good" or "excellent" across all evaluation dimensions. Participants particularly emphasized the benefits of immersive exploration and self-directed learning. In addition, qualitative feedback was systematically analyzed to inform future enhancements of the virtual environment. Overall, the findings demonstrate that the WebVR-based virtual laboratory can effectively complement conventional on-site laboratory instruction, offering a scalable, accessible, and low-risk platform that enhances learning experiences in engineering education.
Authors:Hyungwoo Song, Jeongha Kim, Minju Kim, Duhyung Kwak, Minjeong Shin, Bongwon suh, Hyunggu Jung
Abstract:
While it is known that North Korean defectors (NKDs) struggle with South Korea's healthcare system, the specific challenges of their patient journey remain underexplored. To investigate this, we conducted interviews with 10 NKDs about an 8-step patient journey and identified the clinical consultation step as a critical barrier for all participants, marked by three key challenges: expressing symptoms, managing social and cultural concerns, and overcoming language differences. In response, we developed Medibridge, a mobile prototype that allows users to rehearse with an AI doctor before a real hospital visit to generate a tangible ``Helper Note'' for their actual consultation. Our evaluation with 15 NKDs showed improvements in perceived communication capability, including greater expression clarity, reduced social and cultural concerns, and enhanced linguistic confidence. Our contributions include an empirical understanding of NKDs' healthcare challenges, a novel AI-powered rehearsal system that prepares users for real-world clinical communication, and design implications for inclusive technologies for displaced populations.
Authors:Matthew Gaughan, Aaron Shaw, Darren Gergle
Abstract:
When free/libre and open source software (FLOSS) stewards centralize project development, they potentially undermine project sustainability and impact how contributors talk to each other. To study the relationship between steward-centralized development and contributor discussion, we compared the development of three Wikimedia platform features that the Wikimedia Foundation (WMF) built in MediaWiki. In a mixed-methods multi-case comparison, we used repository mining, linguistic style features, and principal component analysis to track MediaWiki feature development and issue discussions. Contrary to both our intuition and prior work, there were no identifiable differences in the linguistic style of WMF-affiliates and external contributors, even when feature development was guided by WMF contributions. From these results, we offer two provocations to the study of collaborative FLOSS development: (1) stewards dominate development according to their own use of specific project functionality; (2) centralized project development does not entail hierarchical language within project discussions.
Authors:Himel Ghosh, Nick Elias Werner
Abstract:
As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on Next.js with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.
Authors:Nurullah Demir, Yash Vekaria, Georgios Smaragdakis, Zakir Durumeric
Abstract:
Application programming interfaces (APIs) have become a central part of the modern IT environment, allowing developers to enrich the functionality of applications and interact with third parties such as cloud and payment providers. This interaction often occurs through authentication mechanisms that rely on sensitive credentials such as API keys and tokens that require secure handling. Exposure of these credentials can pose significant consequences to organizations, as malicious attackers can gain access to related services. Previous studies have shown exposure of these sensitive credentials in different environments such as cloud platforms and GitHub. However, the web remains unexplored. In this paper, we study exposure of credentials on the web by analyzing 10M webpages. Our findings reveal that API credentials are widely and publicly exposed on the web, including highly popular and critical webpages such as those of global banks and firmware developers. We identify 1,748 distinct credentials from 14 service providers (e.g., cloud and payment providers) across nearly 10,000 webpages. Moreover, our analysis of archived data suggest credentials to remain exposed for periods ranging from a month to several years. We characterize web-specific exposure vectors and root causes, finding that most originate from JavaScript environments. We also discuss the outcomes of our responsible disclosure efforts that demonstrated a substantial reduction in credential exposure on the web.
Authors:Mak Ahmad, Andrew Macvean, JJ Geewax, David Karger
Abstract:
Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals (AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly "perfect." We characterize this as a "Perfection Paradox" -- where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer's role from the "drafter" of specifications to the "curator" of AI-generated patterns.
Authors:Prerna Khanna, Tanmay Srivastava, Shubham Jain, Aruna Balasubramanian
Abstract:
IMU-based gesture interfaces are being increasingly adopted as efficient, accessible, and intuitive alternatives to traditional input methods, such as touchscreens and voice. However, current gesture recognition algorithms are tailored to work for specific devices (e.g., smartwatches vs. earbuds) or user populations (e.g., blind vs. sighted users), limiting their generalizability. In this paper, we design UniMotion, a generalized IMU-based gesture recognition framework that works across devices and populations with minimal training samples. To overcome the challenges and high cost of collecting large-scale labeled training data, UniMotion leverages readily available unlabeled human activity data. The UniMotion pipeline comprises two stages: (1) pre-training a motion representation model using abundant unlabeled human activity data, and (2) fine-tuning it with a small amount of labeled gesture data. For pre-training, we introduce a token-based strategy and embeddings that learn to identify and focus attention on the key motion signatures in the temporal data For fine-tuning, we design a text-guided classifier that can reliably differentiate between temporally or semantically similar gestures. We evaluate UniMotion across both hand gestures (captured through a smartwatch) and earbud gestures (captured through earbuds), using data collected from blind and sighted users. Across these diverse devices and user populations, UniMotion achieves an accuracy of 85\%, across an average of 13 gesture classes using only 10\% of labeled data for training. UniMotion significantly outperforms state-of-the-art self-supervised learning approaches and specialized gesture recognition models.
Authors:Gunnar P. Epping, Andrew Caplin, Erik Duhaime, William R. Holmes, Daniel Martin, Jennifer S. Trueblood
Abstract:
Many operational AI systems depend on large-scale human annotation to detect rare but consequential events (e.g., fraud, defects, and medical abnormalities). When positives are rare, the prevalence effect induces systematic cognitive biases that inflate misses and can propagate through the AI lifecycle via biased training labels. We analyze prior experimental evidence and run a field experiment on DiagnosUs, a medical crowdsourcing platform, in which we hold the true prevalence in the unlabeled stream fixed (20% blasts) while varying (i) the prevalence of positives in the gold-standard feedback stream (20% vs. 50%) and (ii) the response interface (binary labels vs. elicited probabilities). We then post-process probabilistic labels using a linear-in-log-odds recalibration approach at the worker and crowd levels, and train convolutional neural networks on the resulting labels. Balanced feedback and probabilistic elicitation reduce rare-event misses, and pipeline-level recalibration substantially improves both classification performance and probabilistic calibration; these gains carry through to downstream CNN reliability out of sample.
Authors:David Fraile Navarro, Farah Magrabi, Enrico Coiera
Abstract:
Ramaswamy et al. reported in \textit{Nature Medicine} that ChatGPT Health under-triages 51.6\% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol -- forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions -- that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100\% of trials across all models and conditions. Asthma triage improved from 48\% to 80\%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0--24\% with forced choice but 100\% with free text (all $p < 10^{-8}$), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. The headline under-triage rate is highly contingent on evaluation format and should not be interpreted as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.
Authors:Feng Chen, Luna Xingyu Li, Ray-Yuan Chung, Wenyu Zeng, Yein Jeon, Yizhou Hu, Oleg Zaslavsky
Abstract:
Digital portals in retirement communities often create physical and cognitive barriers for older adults, leading to digital avoidance. Generative AI offers a solution by enabling natural language interaction, yet its adoption is hindered by the opaque, "Black Box" nature of these systems and lingering usability challenges. To address this, we evaluated a voice-enabled Large Language Model (LLM) chatbot at a continuing care retirement community in the Pacific Northwest. Through a mixed-methods Co-Design and Literacy Workshop (N=25), we applied a "Glass Box" approach combining multimodal accessibility with intentional AI education. The intervention significantly improved participants' technical understanding (p=0.004) and perceived transparency (p=0.001), shifting their interaction model from blind trust to informed reliance prioritizing verifiable evidence. While voice input reduced cognitive load, usability scores dropped significantly for users aged 80 and older (r=-0.50), indicating that truly age-inclusive AI must evolve beyond touch-based interfaces toward zero-touch navigation.
Authors:Atieh Taheri, Hamza El Alaoui, Patrick Carrington, Jeffrey P. Bigham
Abstract:
Ableist microaggressions remain pervasive in everyday interactions, yet interventions to help people recognize them are limited. We present an experiment testing how AI-mediated dialogue influences recognition of ableism. 160 participants completed a pre-test, intervention, and a post-test across four conditions: AI nudges toward bias (Bias-Directed), inclusion (Neutral-Directed), unguided dialogue (Self-Directed), and a text-only non-dialogue (Reading). Participants rated scenarios on standardness of social experience and emotional impact; those in dialogue-based conditions also provided qualitative reflections. Quantitative results showed dialogue-based conditions produced stronger recognition than Reading, though trajectories diverged: biased nudges improved differentiation of bias from neutrality but increased overall negativity. Inclusive or no nudges remained more balanced, while Reading participants showed weaker gains and even declines. Qualitative findings revealed biased nudges were often rejected, while inclusive nudges were adopted as scaffolding. We contribute a validated vignette corpus, an AI-mediated intervention platform, and design implications highlighting trade-offs conversational systems face when integrating bias-related nudges.
Authors:Jennah Gosciak, Eric Giannella, Zhaowen Guo, Michael Chen, Allison Koenecke
Abstract:
Social service programs like the Supplemental Nutrition Assistance Program (SNAP, or food stamps) have eligibility rules that can be challenging to understand. For nonprofit caseworkers who often support clients in navigating a dozen or more complex programs, LLM-based chatbots may offer a means to provide better, faster help to clients whose situations may be less common. In this paper, we measure the potential effects of LLM-based chatbot suggestions on caseworkers' ability to provide accurate guidance. We first created a 770-question multiple-choice benchmark dataset of difficult, but realistic questions that a caseworker might receive. Next, using these benchmark questions and corresponding expert-verified answers, we conducted a randomized experiment with caseworkers recruited from nonprofit outreach organizations in Los Angeles. Caseworkers in the control condition did not see chatbot suggestions and had a mean accuracy of 49%. Caseworkers in the treatment condition saw chatbot suggestions that we artificially varied to range in aggregate accuracy from low (53%) to high (100%). Caseworker performance significantly improves as chatbot quality improves: high-quality chatbots (96-100% accurate) improved caseworker accuracy by 27 percentage points. At the question-level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best (without chatbot suggestions). Finally, improvements in caseworker accuracy level off as chatbot accuracy increases, a phenomenon that we call the "AI underreliance plateau," which is a concern for real-world deployment and highlights the importance of evaluating human-in-the-loop tools with their users.
Authors:Daniel J. Buxton, Mufti Mahmud, Jordan J. Bird, Thomas Hughes-Roberts, David J. Brown
Abstract:
Digital Human Modelling (DHM) is increasingly shaped by advances in AI, wearable biosensing, and interactive digital environments, particularly in research addressing accessibility and inclusion. However, many AI-enabled DHM approaches remain tightly coupled to specific platforms, tasks, or interpretative pipelines, limiting reproducibility, scalability, and ethical reuse. This paper presents a platform-agnostic DHM framework designed to support AI-ready multimodal interaction research by explicitly separating sensing, interaction modelling, and inference readiness. The framework integrates the OpenBCI Galea headset as a unified multimodal sensing layer, providing concurrent EEG, EMG, EOG, PPG, and inertial data streams, alongside a reproducible, game-based interaction environment implemented using SuperTux. Rather than embedding AI models or behavioural inference, physiological signals are represented as structured, temporally aligned observables, enabling downstream AI methods to be applied under appropriate ethical approval. Interaction is modelled using computational task primitives and timestamped event markers, supporting consistent alignment across heterogeneous sensors and platforms. Technical verification via author self-instrumentation confirms data integrity, stream continuity, and synchronisation; no human-subjects evaluation or AI inference is reported. Scalability considerations are discussed with respect to data throughput, latency, and extension to additional sensors or interaction modalities. Illustrative use cases demonstrate how the framework can support AI-enabled DHM and HCI studies, including accessibility-oriented interaction design and adaptive systems research, without requiring architectural modifications. The proposed framework provides an emerging-technology-focused infrastructure for future ethics-approved, inclusive DHM research.
Authors:Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, Adarsh Kowdle
Abstract:
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
Authors:Haoting Gao, Kapotaksha Das, Mohamed Abouelenien, Michael Cole, James Cooke, Vitaliy Popov
Abstract:
Situational awareness (SA) is essential for effective team performance in time-critical clinical environments, yet its dynamic and distributed nature remains difficult to characterize. In this preliminary study, we apply Transition Network Analysis (TNA) to model visual attention in multiperson VR-based cardiac arrest simulations. Using eye-tracking data from 40 clinicians assigned to four standardized roles (Airway, CPR, Defib, TeamLead), we construct gaze transition networks between clinically meaningful areas of interest (AOIs) and extract metrics such as entropy and self-loop rate to quantify attentional structure and flow. Our findings reveal that individual and team's visual attention is dynamically and adaptively redistributed across roles and scenario phases, with those in CPR roles narrowing their focus to execution-critical tasks and those in the TeamLead role concentrating on global monitoring as clinical demands evolve. TNA thus provides a powerful lens for mapping functional differentiation of team cognition and may support the development of phase-sensitive analytics and targeted instructional interventions in acute care training.
Authors:Brian Freeman, Adam Kicklighter, Matt Erdman, Zach Gordon
Abstract:
Hallucinations in large language models (LLMs) are outputs that are syntactically coherent but factually incorrect or contextually inconsistent. They are persistent obstacles in high-stakes industrial settings such as engineering design, enterprise resource planning, and IoT telemetry platforms. We present and compare five prompt engineering strategies intended to reduce the variance of model outputs and move toward repeatable, grounded results without modifying model weights or creating complex validation models. These methods include: (M1) Iterative Similarity Convergence, (M2) Decomposed Model-Agnostic Prompting, (M3) Single-Task Agent Specialization, (M4) Enhanced Data Registry, and (M5) Domain Glossary Injection. Each method is evaluated against an internal baseline using an LLM-as-Judge framework over 100 repeated runs per method (same fixed task prompt, stochastic decoding at $τ= 0.7$. Under this evaluation setup, M4 (Enhanced Data Registry) received ``Better'' verdicts in all 100 trials; M3 and M5 reached 80\% and 77\% respectively; M1 reached 75\%; and M2 was net negative at 34\% when compared to single shot prompting with a modern foundation model. We then developed enhanced version 2 (v2) implementations and assessed them on a 10-trial verification batch; M2 recovered from 34\% to 80\%, the largest gain among the four revised methods. We discuss how these strategies help overcome the non-deterministic nature of LLM results for industrial procedures, even when absolute correctness cannot be guaranteed. We provide pseudocode, verbatim prompts, and batch logs to support independent assessment.
Authors:Tegan Roberts-Morgan, Min S. Li, Priscilla Lo, Zhuzhi Fan, Dan Bennett, Oussama Metatla
Abstract:
The use of a wide range of sensory modalities is increasingly central to technologies for learning, communication, and affective regulation. During the preschool years, sensory integration develops rapidly, shaping how children perceive and make sense of their environments. A key component of this process is cross-sensory correspondence: the systematic ways in which perceptions in different sensory modalities influence one another. Despite its relevance, little is known about cross-sensory correspondences in preschool-aged children (2-4 years). We present a study with 26 preschoolers examining smell-touch-emotion correspondences through playful tasks. We found significant correspondences both between sensory modalities and between sensory modalities and affective judgements. Further analysis revealed association strategies underpinning these mappings. We contribute empirical insights into cross-sensory correspondences in early childhood, design guidelines that align with how preschoolers relate sensory input, and a replicable method for probing cross-sensory cognition in this age group.
Authors:Jiayin Zhi, Harsh Kumar, Mina Lee
Abstract:
The impact of large language models (LLMs) on critical thinking has provoked growing attention, yet this impact on actual performance may not be uniformly negative or positive. Particularly, the role of time -- the temporal context under which an LLM is provided -- remains overlooked. In a between-subjects experiment (n=393), we examined two types of time constraints for a critical thinking task requiring participants to make a reasoned decision for a real-world scenario based on diverse documents: (1) LLM access timing -- an LLM available only at the beginning (early), throughout (continuous), near the end (late), or not at all (no LLM), and (2) time availability -- insufficient or sufficient time for the task. We found a temporal reversal: LLM access from the start (early, continuous) improved performance under time pressure but impaired it with sufficient time, whereas beginning the task independently (late, no LLM) showed the opposite pattern. These findings demonstrate that time constraints fundamentally shape whether an LLM augments or undermines critical thinking, making time a central consideration when designing LLM support and evaluating human-AI collaboration in cognitive tasks.
Authors:Eva Mackamul, Tom Maillard, Noé Marceaul, Yelli Coulibaly, Julien Pansiot, Laurence Boissieux, Dominique Vaufreydaz, Anne Roudaut, Céline Coutrix
Abstract:
Shape-Changing Interfaces (SCIs) dynamically alter their form, an inherent characteristic that introduces fragility into their design. As a result, users' perceptions of an interface's fragility or its potential to move or break may influence their interaction, however the extent of this effect is unclear. To address this gap, we conducted a qualitative study (N = 18) using video stimuli showcasing 20 existing SCIs. Through thematic analysis, we identified key factors impacting perceived fragility and formalized these into a framework. We then conducted a second study (N = 36) for which we fabricated SCIs that varied across selected fragility-related dimensions. We recorded user interactions and compared how the selected dimensions shaped manipulation of the objects and how they were considered by users. Together, these studies provide a structured foundational understanding of perceived fragility in SCIs and offer insights to enhance perceived robustness and inform future SCI development.
Authors:Bofan Yu, Borui Li, Tingyu Zhang, Xing-Dong Yang
Abstract:
In this paper, we explore a novel approach that leverages retrofitting to create sensor-powered smart car cabins. We propose that retrofitting offers a promising way to complement and extend the capabilities of built-in smart cabin sensors provided by car manufacturers. To understand how retrofitting solutions should be designed, we conducted a two-phase study. First, through semi-structured interviews with 18 participants, we examined challenges with built-in smart cabin sensors and identified opportunities where retrofitting could address these limitations. Second, through probe-based participatory design sessions with 15 participants, we identified user requirements and expectations for effective retrofit solutions. Based on our findings, we present a set of design recommendations to guide the future development of retrofit methods for smart car cabins.
Authors:Yang Lu, Tianyu Zhang, Jiamu Tang, Yanna Lin, Jiankun Yang, Longyu Zhang, Shijian Luo, Yukang Yan
Abstract:
Virtual Reality (VR) enables users to engage with capabilities beyond human limitations, but it is not always obvious how to trigger these capabilities. Taking the lens of Affordance, we believe avatar design is the key to solving this issue, which ideally should communicate its capabilities and how to activate them. To understand the current practice, we selected eight capabilities across four categories and invited twelve professional designers to design avatars that communicate the capabilities and their corresponding interactions. From the resulting designs, we formed 16 guidelines to provide general and category-specific recommendations. Then, we validated these guidelines by letting two groups of twelve participants design avatars with and without guidelines. Participants rated the guidelines' clarity and usefulness highly. External judges confirmed that avatars designed with the guidelines were more intuitive in conveying the capabilities and interaction methods. Finally, we demonstrated the applicability of the guidelines in avatar design for four VR applications.
Authors:Dion Barja, Matthew Brehmer
Abstract:
Videoconference conversations about data often entail screen sharing visualization artifacts, in which nonverbal communication goes largely ignored. Beyond presentation use cases, conversations supported by visualization also arise in collaborative decision making, technical interviews, and tutoring: use cases that benefit from participants being able to see one another as they exchange questions about the data. In this paper, we employ a reciprocal compositing of visualization and interface widgets over the mirrored video of one's conversation partner, suggestive of a pane of glass, in which both parties can simultaneously manipulate composited elements via bimanual gestures. We demonstrate our approach with implementations of several visualization interfaces spanning the aforementioned use cases, and we evaluate our approach in a study (N = 16) comparing it to videoconferencing while using a mouse to interact with a collaborative web application. Our findings suggest that our approach promotes feelings of presence and mutual awareness of analytical intent.
Authors:Thanh-Tung Ngo, Emma Murphy, Robert J. Ross
Abstract:
Effective communication is vital in healthcare, especially across language barriers, where non-verbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system utilizes a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1-Score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.
Authors:S. Yanushkevich, E. Berepiki, P. Ciunkiewicz, V. Shmerko, G. Wolbring, R. Guest
Abstract:
This study focuses on the roadmapping of biometric technologies onto personalized Augmentative and Alternative Communication (AAC), a branch of assistive technologies for people with communication disabilities. This technology roadmapping revolves around the proposed notions of an AAC biometric register and biometric-enabled reconfigurable AAC channels. The biometric register is referred to as a tool for acquiring and processing physiological and behavioural traits that are essential for augmentative and alternative communication. It links biometric traits, such as gestures, to intermediate traits, such as synthesized speech, for customizable communication channels. The proposed methodology is used to assess the gaps between the social and practical demands, such as assisting people with communication disabilities in the contemporary semi-automated border control, and the emerging advances in AI, such as advanced video and speech processing. We provide two case studies of the AAC that rely on hand gesture recognition and sign language word recognition, and conclude that the current accuracy of those AI technologies does not meet the practical requirements. The proposed roadmapping provides recommendations for further improvement to close these gaps.
Authors:Patrick Tresset, Markus Wulfmeier
Abstract:
As artificial intelligence shifts from pure tool for delegation toward agentic collaboration, its use in the arts can shift beyond the exploration of machine autonomy toward synergistic co-creation. While our earlier robotic works utilized automation to distance the artist's intent from the final mark, we present Companion: an artistic apparatus that integrates a drawing robot with Large Language Models (LLMs) to re-center human-machine presence. By leveraging in-context learning and real-time tool use, the system engages in bidirectional interaction via speech and sketching. This approach transforms the robot from a passive executor into a playful co-creative partner capable of driving shared visual storytelling into unexpected aesthetic territories. To validate this collaborative shift, we employed the Consensual Assessment Technique (CAT) with a panel of seven art-world experts. Results confirm that the system produces works with a distinct aesthetic identity and professional exhibition merit, demonstrating the potential of AI as a highly capable artistic collaborator.
Authors:Parm Suksakul, Nathan Kittichaikoonkij, Nakhin Polthai, Aung Pyae
Abstract:
Developing and deploying AI applications in organizations is challenging when human decision authority and oversight are underspecified across the system lifecycle. Although Human-in-the-Loop (HITL) and Human-Centered AI (HCAI) principles are widely acknowledged, operational guidance for structuring roles, checkpoints, and feedback mechanisms remains fragmented. We report a multi-source qualitative study: a retrospective diary study of a customer-support chatbot and semi-structured interviews with eight AI experts from academia and industry. Through five-cycle thematic analysis of 1,435 codewords, we derive four themes: AI Governance and Human Authority, Human-in-the-Loop Iterative Refinement, AI System Lifecycle and Operational Constraints, and Human-AI Team Collaboration and Coordination. These themes provide empirical inputs for subsequent HITL framework design and validation.
Authors:Santiago Lombeyda, S. G. Djorgovski, Ciro Donalek
Abstract:
The growing complexity and information content of data, together with the need to understand both the complex structures, relationships, and phenomena present in these data spaces, compounded with the emerging need to understand the results produced by AI tools used to analyze the data, requires development of novel, effective data visualization tools. Much of the growing complexity is reflected in the increasing dimensionality of data spaces, where extended reality (XR) naturally emerges as a candidate to help extend our capability for higher dimensional understanding. However, humans often understand lower dimensionality representations more effectively. Still, XR offers an opportunity for a seamless integration of simulated traditional data displays within the 3-dimensional virtual data spaces, leading to more intuitive and more effective data analytics. In this paper we present an overview of the benefits of seamlessly integrated 2-dimensional and 3-dimensional interactive visual representations embedded in XR spaces, and present three case studies that leverage these approaches for more efficient data analytics.
Authors:Shi Liu, Martin Feick, Linus Bierhoff, Alexander Maedche
Abstract:
Immersive learning environments such as virtual classrooms in Virtual Reality (VR) offer learners unique learning experiences, yet providing effective learner support remains a challenge. While prior HCI research has explored in-lecture support for immersive learning, little research has been conducted to provide post-lecture support, despite being critical for sustained motivation, engagement, and learning outcomes. To address this, we present AttentiveLearn, a learning ecosystem that generates personalized quizzes on a mobile learning assistant based on learners' attention distribution inferred using eye-tracking in VR lectures. We evaluated the system in a four-week field study with 36 university students attending lectures on Bayesian data analysis. AttentiveLearn improved learners' reported motivation and engagement, without conclusive evidence of learning gains. Meanwhile, anecdotal evidence suggested improvements in attention for certain participants over time. Based on our findings of the field study, we provide empirical insights and design implications for personalized post-lecture support for immersive learning systems.
Authors:Ian Steenstra, Neha Patkar, Rebecca B. Perkins, Michael K. Paasche-Orlow, Timothy Bickmore
Abstract:
Adolescents are directly affected by preventive health decisions such as vaccination, yet their perspectives are rarely solicited or supported. Most digital interventions for Human Papillomavirus (HPV) vaccination are designed exclusively for parents, implicitly treating adolescents as passive recipients rather than stakeholders with agency. We present the design and evaluation of a mobile intervention that gives adolescents a voice in HPV vaccination decisions alongside their parents. The system uses embodied conversational agents tailored to each audience: parents interact with an animated physician using education and motivational interviewing techniques, while adolescents can choose between an age-appropriate doctor or a narrative fantasy game that conveys HPV facts through play. We report findings from a clinic-based pilot study with 21 parent-adolescent dyads. Results indicate high satisfaction across both audiences, improved HPV knowledge, and increased intent to vaccinate. We discuss design implications for supporting adolescent participation, choice, and agency in decisions about their health.
Authors:Punn Lertjaturaphat, Jungwoo Rhee, Jaewon You, Andrea Bianchi
Abstract:
The increasing popularity of microcontroller platforms like Arduino enables diverse end-user developers to participate in circuit prototyping. Traditionally, follow-along tutorials serve as an essential learning method for makers, and in fact, several prior toolkits leveraged this format as a way to engage new makers. However, literature and our formative study (N=12) show that makers have unique preferences regarding the construction of their circuits and idiosyncratic ways to assess and debug problems, which contrasts with the step-by-step instructional nature of tutorials and those systems leveraging this method. To address this mismatch, we present a prototyping platform that supports personalized circuit construction and debugging. Our system utilizes an augmented breadboard, which is circuit-aware and supports on-the-fly hardware reconfiguration via contextualized guidance and in-situ circuit validation through interactive tests. Through a usability study (N=12), we demonstrate how makers leverage circuit-aware guidance and debugging to support individual building patterns.
Authors:Benjamin M. Chen, Hong Bao
Abstract:
Can targeted user training unlock the productive potential of generative artificial intelligence (GenAI) in professional settings? We investigate this question using a randomized study involving 164 law students completing an issue-spotting examination. Participants were assigned to one of three conditions: no GenAI access, optional access to a large language model (LLM), or optional access accompanied by an approximately ten-minute training intervention. Training significantly increased LLM adoption--the usage rate rose from 26% to 41%--and improved examination performance. Students with trained access scored 0.27 grade points higher than those with untrained access (p = 0.027), equivalent to roughly one-third of a letter grade. By contrast, access to an LLM without training did not improve performance and was associated with shorter answers relative to no access. Using principal stratification, we decompose the overall effect into adoption and effectiveness channels. Point estimates are consistent with training operating primarily by expanding the scope of GenAI use rather than by enhancing effectiveness among existing users, though confidence intervals are wide. Overall, our findings provide evidence that complementary investments in user training are critical for realizing GenAI productivity gains in knowledge-intensive fields where concerns about reliability may inhibit adoption.
Authors:Yiheng Liang, Kim Marriott, Helen C. Purchase
Abstract:
The importance of replication is often discussed and advocated -- not only in the domains of visualization and HCI, but in all scientific areas. When replicating a study, design decisions need to be made with regards which aspects of the original study will remain the same and which will be altered. We present a supporting multi-dimensional design space framework within which such decisions can be identified, categorized, compared and analyzed. The framework treats replication experimental design as a pairwise comparison problem, and represents the design by four practical dimensions defined by three comparison levels. The design space is therefore a framework that can be used for both retrospective characterization and prospective planning. We provide worked examples, and relate our framework to other attempts at describing the scope of replication studies.
Authors:Bowen Lou, Tian Lu, T. S. Raghu, Yingjie Zhang
Abstract:
Artificial intelligence is undergoing a structural transformation marked by the rise of agentic systems capable of open-ended action trajectories, generative representations and outputs, and evolving objectives. These properties introduce structural uncertainty into human-AI teaming (HAT), including uncertainty about behavior trajectories, epistemic grounding, and the stability of governing logics over time. Under such conditions, alignment cannot be secured through agreement on bounded outputs; it must be continuously sustained as plans unfold and priorities shift. We advance Team Situation Awareness (Team SA) theory, grounded in shared perception, comprehension, and projection, as an integrative anchor for this transition. While Team SA remains analytically foundational, its stabilizing logic presumes that shared awareness, once achieved, will support coordinated action through iterative updating. Agentic AI challenges this presumption. Our argument unfolds in two stages: first, we extend Team SA to reconceptualize both human and AI awareness under open-ended agency, including the sensemaking of projection congruence across heterogeneous systems. Second, we interrogate whether the dynamic processes traditionally assumed to stabilize teaming in relational interaction, cognitive learning, and coordination and control continue to function under adaptive autonomy. By distinguishing continuity from tension, we clarify where foundational insights hold and where structural uncertainty introduces strain, and articulate a forward-looking research agenda for HAT. The central challenge of HAT is not whether humans and AI can agree in the moment, but whether they can remain aligned as futures are continuously generated, revised, enacted, and governed over time.
Authors:Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish
Abstract:
Always-on egocentric cameras are increasingly used as demonstrations for embodied robotics, imitation learning, and assistive AR, but the resulting video streams are dominated by redundant and low-quality frames. Under the storage and battery constraints of wearable devices, choosing which frames to keep is as important as how to learn from them. We observe that modern eye-tracking headsets provide a continuous, training-free side channel that decomposes into two complementary axes: gaze fixation captures visual stability (quality), while pupil response captures arousal-linked moments (novelty). We operationalize this insight as a Dual-Criterion Frame Curator that first gates frames by gaze quality and then ranks the survivors by pupil-derived novelty. On the Visual Experience Dataset (VEDB), curated frames at 10% budget match the classification performance of the full stream, and naive signal fusion consistently destroys both contributions. The benefit is task-dependent: pupil ranking improves activity recognition, while gaze-only selection already dominates for scene recognition, confirming that the two signals serve genuinely different roles. Our method requires no model inference and operates at capture time, offering a path toward efficient, always-on egocentric data curation.
Authors:Andrea Bianchi, Zhi Lin Yap, Punn Lertjaturaphat, Austin Z. Henley, Kongpyung Justin Moon, Yoonji Kim
Abstract:
The development of user-friendly embedded prototyping systems like Arduino has made creating interactive devices more accessible. However, debugging these systems is challenging due to the intertwined nature of software and hardware issues. Existing tools often require hardware instrumentation or log visualization through serial monitors. To address this, the authors designed Inline, a programming tool that simplifies debugging by displaying hardware logs directly within the code, providing real-time execution flow tracking and an expression language for log manipulation. A study with twelve users demonstrated the tool's effectiveness in aiding debugging tasks.
Authors:Cynthia M. Baseman, Reeda Shimaz Huda, Rosa I. Arriaga
Abstract:
Despite increasing interest in culturally-sensitive health technologies, medical mistrust remains largely unexplored within human-centered computing. Considered a social determinant of health, medical mistrust is the belief that healthcare providers or institutions are acting against one's best interest. This is a rational, protective response based on historical context, structural inequities, and discrimination. To center race-based medical mistrust and the lived experiences of Black older adults with low income, we conducted interviews within publicly subsidized housing in the Southern United States. Our reflexive themes describe community perspectives on health care and medical mistrust, including accreditation and embodiment, skepticism of financial motivations, and the intentions behind health AI. We provide a reflective exercise for researchers to consider their positionality in relation to community engagements, and reframe our findings through Black Feminist Thought to propose design principles for health self-management technologies for communities with historically grounded medical mistrust.
Authors:Frederick Reiber, Nathan Kim, Allison McDonald, Dana Calacci
Abstract:
Despite high approval ratings for unions and growing worker interest in organizing, employees in the United States still face significant barriers to securing collective bargaining agreements. A key factor is employer counter-organizing: efforts to suppress unionization through rule changes, retaliation, and disruption. Designing sociotechnical tools and strategies to resist these tactics requires a deeper understanding of the role computing technologies play in counter-organizing against unionization. In this paper, we examine three high-profile organizing efforts -- at Amazon, Starbucks, and \university -- using publicly available sources to identify four recurring technological tactics: surveillance, spacing, screaming and scabbing. We analyze how these tactics operate across contexts, highlighting their digital dimensions and strategic deployment. We conclude with implications for organizing in digitally-mediated workplaces, directions for future research, and emergent forms of worker resistance.
Authors:Semin Jin, Donghyuk Kim, Jeongmin Ryu, Kyung Hoon Hyun
Abstract:
Well-designed indoor scenes should prioritize how people can act within a space rather than merely what objects to place. However, existing 3D scene generation methods emphasize visual and semantic plausibility, while insufficiently addressing whether people can comfortably walk, sit, or manipulate objects. To bridge this gap, we present a Behavior-Aware Anthropometric Scene Generation framework. Our approach leverages vision-language models (VLMs) to analyze object-behavior relationships, translating spatial requirements into parametric layout constraints adapted to user-specific anthropometric data. We conducted comparative studies with state-of-the-art models using geometric metrics and a user perception study (N=16). We further conducted in-depth human-scale studies (individuals, N=20; groups, N=18). The results showed improvements in task completion time, trajectory efficiency, and human-object manipulation space. This study contributes a framework that bridges VLM-based interaction reasoning with anthropometric constraints, validated through both technical metrics and real-scale human usability studies.
Authors:Wengxi Li, Jingze Tian, Can Liu
Abstract:
People speak aloud to externalize thoughts as one way to help clarify and organize them. Although Speech-to-text can capture these thoughts, transcripts can be difficult to read and make sense due to disfluencies, repetitions and potential disorganization. To support thinking through verbalization, we introduce Orality, which extracts key information from spoken content, performs semantic analysis through LLMs to form a node-link diagram in an interactive canvas. Instead of reading and working with transcripts, users could manipulate clusters of nodes and give verbal instructions to re-extract and organize the content in other ways. It also provides AI-generated inspirational questions and detection of logical conflicts. We conducted a lab study with twelve participants comparing Orality against speech interaction with ChatGPT. We found that Orality can better support users in clarifying and developing their thoughts. The findings also identified the affordances of both graphical and conversational thought clarification tools and derived design implications.
Authors:Songhai Fan, Simon Angus, Tim Dwyer, Ying Yang, Sarah Goodwin, Helen Purchase
Abstract:
Exponential growth in the quantity of digital news, social media, and other textual sources makes it difficult for humans to keep up with rapidly evolving narratives about world events. Various visualisation techniques have been touted to help people to understand such discourse by exposing relationships between texts (such as news articles) as topics and themes evolve over time. Arguably, the understandability of such visualisations hinges on the assumption that people will be able to easily interpret the relationships in such visual network structures. To test this assumption, we begin by defining an abstract model of time-dependent text visualisation based on directed graph structures. From this model we distill motifs that capture the set of possible ways that texts can be linked across changes in time. We also develop a controlled synthetic text generation methodology that leverages the power of modern LLMs to create fictional, yet structured sets of time-dependent texts that fit each of our patterns. Therefore, we create a clean user study environment (n=30) for participants to identify patterns that best represent a given set of synthetic articles. We find that it is a challenging task for the user to identify and recover the predefined motif. We analyse qualitative data to map an unexpectedly rich variety of user rationales when divergences from expected interpretation occur. A deeper analysis also points to unexpected complexities inherent in the formation of synthetic datasets with LLMs that undermine the study control in some cases. Furthermore, analysis of individual decision-making in our study hints at a future where text discourse visualisation may need to dispense with a one-size-fits-all approach and, instead, should be more adaptable to the specific user who is exploring the visualisation in front of them.
Authors:Divyanshu Daiya, Aniket Bera
Abstract:
We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Diffusion-based motion generators offer strong realism but often rely on costly guidance for multi-entity control and degrade under strong conditioning. Sketch2Colab instead learns a sketch-conditioned diffusion prior and distills it into a rectified-flow student in latent space for fast, stable sampling. To make motion follow storyboards closely, we guide the student with differentiable objectives that enforce keyframes, paths, contacts, and physical consistency. Collaborative motion naturally involves discrete changes in interaction, such as converging, forming contact, cooperative transport, or disengaging, and a continuous flow alone struggles to sequence these shifts cleanly. We address this with a lightweight continuous-time Markov chain (CTMC) planner that tracks the active interaction regime and modulates the flow to produce clearer, synchronized coordination in human-object-human motion. Experiments on CORE4D and InterHuman show that Sketch2Colab outperforms baselines in constraint adherence and perceptual quality while sampling substantially faster than diffusion-only alternatives.
Authors:Ari Wahl, Dorian Gawlinski, David Przewozny, Paul Chojecki, Felix Bießmann, Sebastian Bosse
Abstract:
Pre-trained general-purpose Vision-Language Models (VLM) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinates detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
Authors:Anna Ricarda Luther, Hendrik Heuer, Stephanie Geise, Sebastian Haunss, Andreas Breiter
Abstract:
Hate speech remains a pressing challenge on social media, where platform moderation often fails to protect targeted users. Personal moderation tools that let users decide how content is filtered can address some of these shortcomings. However, it remains an open question on which screens (e.g., the comments, the reels tab, or the home feed) users want personal moderation and which features they value most. To address these gaps, we conducted a three-wave Delphi study with 40 activists who experienced hate speech. We combined quantitative ratings and rankings with open questions about required features. Participants prioritized personal moderation for conversational and algorithmically curated screens. They valued features allowing for reversibility and oversight across screens, while input-based, content-type specific, and highly automated features are more screen specific. We discuss the importance of personal moderation and offer user-centered design recommendations for personal moderation on Instagram.
Authors:Hima Mynampaty, Nathania Josephine, Katherine E. Isaacs, Andrew M. McNutt
Abstract:
READMEs shape first impressions of software projects, yet what constitutes a good README varies across audiences and contexts. Research software needs reproducibility details, while open-source libraries might prioritize quick-start guides. Through a design probe, LintMe, we explore how linting can be used to improve READMEs given these diverse contexts, aiding style and content issues while preserving authorial agency. Users create context-specific checks using a lightweight DSL that uses a novel combination of programmatic operations (e.g., for broken links) with LLM-based content evaluation (e.g., for detecting jargon), yielding checks that would be challenging for prior linters. Through a user study (N=11), comparison with naive LLM usage, and an extensibility case study, we find that our design is approachable, flexible, and well matched with the needs of this domain. This work opens the door for linting more complex documentation and other culturally mediated text-based documents.
Authors:Joy T Wu, Daniel Beckmann, Sarah Miller, Alexander Lee, Elizabeth Theng, Stephan Altmayer, Ken Chang, David Kersting, Tomoaki Otani, Brittany Z Dashevsky, Hye Lim Park, Matteo Novello, Kip Guja, Curtis Langlotz, Ismini Lourentzou, Daniel Gruhl, Benjamin Risse, Guido A Davidzon
Abstract:
[18F]FDG-PET/CT is a cornerstone imaging modality for tumor staging and treatment response assessment across many cancer types, yet expert reader shortages necessitate more efficient diagnostic aids. While standalone AI models for automatic lesion segmentation exist, clinical translation remains hindered by concerns about interpretability, explainability, reliability, and workflow integration. We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement on 346 FDG-PET/CT scans. Each study was read by a trainee and a board-certified nuclear medicine or radiology specialist using an eye-tracking-enabled annotation platform that simulates routine clinical reads. From 3,948 minutes of raw 60Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for multiple machine learning applications. Baseline validation experiments demonstrate that a 3D nnUNet tumor segmentation model achieved superior performance when incorporating expert gaze patterns versus without (DICE score 0.6819 versus 0.6008), and that vision transformers trained on sequential gaze and PET/CT images can improve dynamic lesion localization (74.95% predicted gaze point closer to tumor) and expert intention prediction (Accuracy 67.53% and AUROC 0.747). GazeXPErT is a valuable resource designed to explore multiple machine learning problems beyond these baseline experiments, which include and are not limited to, visual grounding or causal reasoning, clinically explainable feature augmentation, human-computer interaction, human intention prediction or understanding, and expert gaze-rewarded modeling approaches to AI in oncologic FDG-PET/CT imaging.
Authors:Shauna Heron, Meng Cheng Lau
Abstract:
Trust plays a central role in human--robot collaboration, yet its formation is rarely examined under the constraints of fully autonomous interaction. This pilot study investigated how interaction policy influences trust during in-person collaboration with a social robot operating without Wizard-of-Oz control or scripted repair. Participants completed a multi-stage collaborative task with a mobile robot that autonomously managed spoken-language dialogue, affect inference, and task progression. Two interaction policies were compared: a responsive policy, in which the robot proactively adapted its dialogue and assistance based on inferred interaction state, and a neutral, reactive policy, in which the robot provided only direct, task-relevant responses when prompted. Responsive interaction was associated with significantly higher post-interaction trust under viable communication conditions, despite no reliable differences in overall task accuracy. Sensitivity analyses indicated that affective and experiential components of trust were more sensitive to communication breakdown than evaluative judgments of reliability, and that as language-mediated interaction degraded, the trust advantage associated with responsiveness attenuated and ratings became less clearly interpretable as calibrated evaluations of collaborative competence. These findings suggest that trust in autonomous human--robot interaction emerges from process-level interaction dynamics and operates within constraints imposed by communication viability, highlighting the importance of evaluating trust under real autonomy conditions when designing interactive robotic systems.
Authors:Eman Alamoudi, Ellis Solaiman
Abstract:
Patients increasingly rely on online reviews when choosing healthcare providers, yet the sheer volume of these reviews can hinder effective decision-making. This paper summarises a mixed-methods study aimed at evaluating a proposed explainable AI system that analyses patient reviews and provides transparent explanations for its outputs. The survey (N=60) indicated broad optimism regarding usefulness (82% agreed it saves time; 78% that it highlights essentials), alongside strong demand for explainability (84% considered it important to understand why a review is classified; 82% said explanations would increase trust). Around 45% preferred combined text-and-visual explanations. Thematic analysis of open-ended survey responses revealed core requirements such as accuracy, clarity and simplicity, responsiveness, data credibility, and unbiased processing. In addition, interviews with AI experts provided deeper qualitative insights, highlighting technical considerations and potential challenges for different explanation methods. Drawing on TAM and trust in automation, the findings suggest that high perceived usefulness and transparent explanations promote adoption, whereas complexity and inaccuracy hinder it. This paper contributes actionable design guidance for layered, audience-aware explanations in healthcare review systems.
Authors:Kristian Paolo David, Tyrone Justin Sta Maria, Mikkel Dominic Gamboa, Jordan Aiko Deja
Abstract:
Hand-tracking enables controller-free XR interaction but does not have the tactile feedback controllers provide. Rather than treating this solely as a missing-sensation problem, we explore whether pseudo-haptic cues on an embodied virtual hand act as tactile or as affect substitutes that shape how interactions feel. We used a mixed reality prototype that keeps the contacted surface visually neutral, rendering cues on the hand with motion modulation for texture, color glow, and movement-coupled sound. In a within-subjects study (n=12), participants experienced 12 conditions (4 effects x 3 modalities: audio, visual, both) and reported subjective affect and cognitive demand. Participants rarely reported sustained tactile, thermal sensations, yet affect shifted systematically: rough-hot lowered valence increasing arousal, while smooth-cold produced calmer pleasant states. These findings suggest that pseudo-haptics in XR may be better understood as an affective feedback channel rather than a direct replacement for physical touch in controller-free systems.
Authors:Romina Mahinpei, Sofiia Druchyna, Manoel Horta Ribeiro
Abstract:
Teaching assistants (TAs) are essential to grading and feedback provision in proof-based courses, yet these tasks are time-intensive and difficult to scale. Although Large Language Models (LLMs) have been studied for grading and feedback, their effectiveness in proof-based courses is still unknown. Before designing LLM-based systems for this context, a necessary prerequisite is to understand whether LLMs can meaningfully assist TAs with grading and feedback. As such, we present a multi-part case study functioning as a technology probe in an undergraduate proof-based course. We compare rubric-based grading decisions made by an LLM and TAs with varying levels of expertise and examine TAs' perceptions of feedback generated by an LLM. We find substantial disagreement between LLMs and TAs on grading decisions but that LLM-generated feedback can still be useful to TAs for submissions with major errors. We conclude by discussing design implications for human-AI grading and feedback systems in proof-based courses.
Authors:Michelle Cohn, Alyssa Lanzi, Yui Ishihara, Chen-Nee Chuah, Georgia Zellou, Alyssa Weakley
Abstract:
Millions of people live with cognitive impairment from Alzheimer's disease and related dementias (ADRD). Voice-enabled smart home systems offer promise for supporting daily living but rely on automatic speech recognition (ASR) to transcribe their speech to text. Prior work has shown reduced ASR performance for adults with cognitive impairment; however, the acoustic factors underlying these disparities remain poorly understood. This paper evaluates ASR performance for 83 older adults across cognitive groups (cognitively normal, mild cognitive impairment, dementia) reading commands to a voice assistant (Amazon Alexa). Results show that ASR errors are significantly higher for individuals with dementia, revealing a critical usability gap. To better understand these disparities, we conducted an acoustic analysis of speech features and found that a speaker's intensity, voice quality, and pause ratio predicted ASR accuracy. Based on these findings, we outline HCI design implications for AgeTech and voice interfaces, including speaker-personalized ASR, human-in-the-loop correction of ASR transcripts, and interaction-level personalization to support ability-based adaptation.
Authors:Yonglin Chen, Pengcheng An, Xueliang Li
Abstract:
FuturePrism is a GenAI-empowered collaborative storytelling system designed to scaffold adolescents to navigate future life challenges. Adolescents often suffer from anxiety related to future uncertainty for lacking the executive function to develop concrete pathways. Operationalizing Snyder's Hope Theory, the system utilizes a triadic role-play mechanics to externalize cognitive processes through four narrative chapters: The Goal, The Opportunity, The Challenge, and The Agency. An evaluation workshop with 20 adolescents demonstrated that FuturePrism significantly enhances momentary hope levels, particularly in the Agency dimension. Participants reported high levels of narrative immersion and positive feedback towards system usability. Participants also confirmed that the AI-scaffolded collaborative storytelling empowered them to develop positive attitudes towards future challenges.
Authors:Yonglin Chen, Jingjing Zhang, Kezhuo Wang, Pengcheng An, Xueliang Li
Abstract:
Resilience is a key factor affecting children's mental wellbeing and future development. Yet, limited HCI research has explored how to help children build resilience through adversarial experiences. Informed by a formative study with elementary school teachers and professional psychologists, we design TaleBot, an AI-empowered system that supports children to co-create stories about overcoming everyday adversities tailored to their personal situations. We evaluated the system with 12 elementary children in school counseling rooms under teacher guidance and conducted reflective interviews with parents upon the Child-AI co-created stories. The findings show that TaleBot encourages children in self-expression of feelings and thoughts, creating opportunities for teachers to provide personalized support and for parents to better understand the profound impact of family communication on children's mental wellbeing. We conclude with design implications for using generative AI to support children's mental health education and interventions across school and family contexts.
Authors:Besjon Cifliku, Hendrik Heuer
Abstract:
Declining newspaper revenues prompt local newsrooms to adopt automation to maintain efficiency and keep the community informed. However, current research provides a limited understanding of how local journalists work with digital data and which newsroom processes would benefit most from AI-supported (data) reporting. To bridge this gap, we conducted 21 semi-structured interviews with local journalists in Germany. Our study investigates how local journalists use data and AI (RQ1); the challenges they encounter when interacting with data and AI (RQ2); and the self-perceived opportunities of AI-supported reporting systems through the lens of discursive design (RQ3). Our findings reveal that local journalists do not fully leverage AI's potential to support data-related work. Despite local journalists' limited awareness of AI's capabilities, they are willing to use it to process data and discover stories. Finally, we provide recommendations for improving AI-supported reporting in the context of local news, grounded in the journalists' socio-technical perspective and their imagined AI future capabilities.
Authors:Md Ehtesham-Ul-Haque, Syed Masum Billah
Abstract:
Voice user interfaces (VUIs) are rapidly transitioning from accessibility features to mainstream interaction modalities. Yet most operating systems' built-in voice commands remain underutilized despite possessing robust technical capabilities. Through our analysis of four commercial VUI systems and a formative study with 16 participants, we found that fixed command formats require exact phrasing, restrictive timeout mechanisms discard input during planning pauses, and insufficient feedback hampers multi-step interactions. To address these challenges, we developed VoiceAlign, an adaptive shimming layer that mediates between users and legacy VUI systems. VoiceAlign intercepts natural voice commands, transforms them to match the required syntax using a large language model, and transmits these adapted commands through a virtual audio channel that remains transparent to the underlying system. In our evaluation with 12 participants, VoiceAlign reduced command failures by half, required 25% fewer commands per task, and significantly lowered cognitive and temporal demands when paired with an existing legacy VUI system. Furthermore, we created a synthetic dataset informed by our studies and fine-tuned a small language model that achieves over 90% accuracy with 200 ms response time when served locally, eliminating dependence on third-party APIs while enabling real-time interaction on edge devices. This work demonstrates how modern AI techniques can unlock the underutilized potential of legacy VUI systems without requiring system modifications, offering a practical solution without replacing existing infrastructure.
Authors:Shuo Niu, Dylan Clements, Marina Margalit Nemanov, Hyungsin Kim
Abstract:
GenAI's ability to produce text and images is increasingly incorporated into human-AI co-creation tasks such as storytelling and video editing. However, integrating GenAI into these tasks requires enabling users to retain control over editing individual story elements while ensuring that generated visuals remain coherent with the storyline and consistent across multiple AI-generated outputs. This work examines a paradigm of creative decomposition and linking, which allows creators to clearly communicate creative intent by prompting GenAI to tailor specific story elements, such as storylines, personas, locations, and scenes, while maintaining coherence among them. We implement and evaluate StoryComposerAI, a system that exemplifies this paradigm for enhancing users' sense of control and content consistency in human-AI co-creation of digital stories.
Authors:Christian Poelitz, Finale Doshi-Velez, Siân Lindley
Abstract:
AI is becoming increasingly integrated into everyday life, both in professional work environments and in leisure and entertainment contexts. This integration requires AI to move beyond acting as an assistant for informational or transactional tasks toward a genuine collaborative partner. Effective collaboration, whether between humans or between humans and AI, depends on establishing and maintaining common ground: shared beliefs, assumptions, goals, and situational awareness that enable coordinated action and efficient repair of misunderstandings. While common ground is a central concept in human collaboration, it has received limited attention in studies of human-AI collaboration. In this paper, we introduce a new benchmark grounded in theories and empirical studies of human-human collaboration. The benchmark is based on a collaborative puzzle task that requires iterative interaction, joint action, referential coordination, and repair under varying conditions of situation awareness. We validate the benchmark through a confirmatory user study in which human participants collaborate with an AI to solve the task. The results show that the benchmark reproduces established theoretical and empirical findings from human-human collaboration, while also revealing clear divergences in human-AI interaction.
Authors:Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur
Abstract:
Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.
Authors:Anna Martin-Boyle, Cara A. C. Leckey, Martha C. Brown, Harmanpreet Kaur
Abstract:
Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms like source citations are not granular enough for the rigorous verification that scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants' trust compared to the baseline. However, this increased caution did not translate to behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.
Authors:EunJeong Cheon, Do Yeon Shin
Abstract:
As sidewalk delivery robots become increasingly integrated into urban life, this paper begins with a critical provocation: Is robot labor labor? More than a rhetorical question, this inquiry invites closer attention to the social and political arrangements that robot labor entails. Drawing on ethnographic fieldwork across two smart-city districts in Seoul, we examine how delivery robot labor is collectively sustained. While robotic actions are often framed as autonomous and efficient, we show that each successful delivery is in fact a distributed sociotechnical achievement--reliant on human labor, regulatory coordination, and social accommodations. We argue that delivery robots do not replace labor but reconfigure it--rendering some forms more visible (robotic performance) while obscuring others (human and institutional support). Unlike industrial robots, delivery robots operate in shared public space, engage everyday passersby, and are embedded in policy and progress narratives. In these spaces, we identify "robot privilege"--humans routinely yielding to robots--and distinct perceptions between casual observers ("cute") and everyday coexisters ("admirable"). We contribute a conceptual reframing of robot labor as a collective assemblage, empirical insights into South Korea's smart-city automation, and a call for HRI to engage more deeply with labor and spatial politics to better theorize public-facing robots.
Authors:Joel Bucher, Lahari Goswami, Sverrir Thorgeirsson, April Yi Wang
Abstract:
Git is widely used for collaborative software development, but it can be challenging for newcomers. While most learning tools focus on individual workflows, Git is inherently collaborative. We present GitAcademy, a browser-based learning platform that embeds a full Git environment with a split-view collaborative mode: learners work on their own local repositories connected to a shared remote repository, while simultaneously seeing their partner's actions mirrored in real time. This design is not intended for everyday software development, but rather as a training simulator to build awareness of distributed states, coordination, and collaborative troubleshooting. In a within-subjects study with 13 pairs of learners, we found that the split-view interface enhanced social presence, supported peer teaching, and was consistently preferred over a single-view baseline, even though performance gains were mixed. We further discuss how split-view awareness can serve as a training-only scaffold for collaborative learning of Git and other distributed technical systems.
Authors:Kynnedy Simone Smith, Lydia B. Chilton, Danielle Bragg
Abstract:
Ableist language perpetuates harmful stereotypes and exclusion, yet its nuanced nature makes it difficult to recognize and address. Artificial intelligence could serve as a powerful ally in the fight against ableist language, offering tools that detect and suggest alternatives to biased terms. This two-part study investigates the potential of large language models (LLMs), specifically ChatGPT, to rectify ableist language and educate users about inclusive communication. We compared GPT-4o generations with crowdsourced annotations from trained disability community members, then invited disabled participants to evaluate both. Participants reported equal agreement with human and AI annotations but significantly preferred the AI, citing its narrative consistency and accessible style. At the same time, they valued the emotional depth and cultural grounding of human annotations. These findings highlight the promise and limits of LLMs in handling culturally sensitive content. Our contributions include a dataset of nuanced ableism annotations and design considerations for inclusive writing tools.
Authors:Duy Anh Ta, Farnaz Farid, Farhad Ahamed, Ala Al-Areqi, Robert Beutel, Tamara Watson, Alana Maurushat
Abstract:
Modern organizations increasingly face cybersecurity incidents driven by human behaviour rather than technical failures. To address this, we propose a conceptual security framework that integrates a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model to analyze biometric and environmental data for context-aware security decisions. The CNN extracts spatial patterns from sensor data, while the LSTM captures temporal dynamics associated with human error susceptibility. The model achieves 84% accuracy, demonstrating its ability to reliably detect conditions that lead to elevated human-centred cyber risk. By enabling continuous monitoring and adaptive safeguards, the framework supports proactive interventions that reduce the likelihood of human-driven cyber incidents
Authors:Hazim AbdElazim, Shadman Islam, Mostafa Milani
Abstract:
Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and imputation, and entity matching tasks on census-inspired scenarios with known semantic validity. We find systematic evidence for several cognitive bias mechanisms in data cleaning. Framing effects arise when surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags despite unchanged semantics. Anchoring and adjustment bias appears when expert cues shift participant decisions beyond parity, consistent with salience and availability effects. We also observe the representativeness heuristic: atypical but valid attribute combinations are frequently flagged as erroneous, and in entity matching tasks, surface similarity produces a substantial false-positive rate with high confidence. In data repair, participants show a robust preference for leaving values missing rather than imputing plausible values, consistent with omission bias. In contrast, automation-aligned switching under strong contradiction does not exceed a conservative rare-error tolerance threshold at the population level, indicating that deference to automated recommendations is limited in this setting. Across scenarios, bias patterns persist among technically experienced participants and across diverse workflow practices, suggesting that bias in data cleaning reflects general cognitive tendencies rather than lack of expertise. These findings motivate human-in-the-loop cleaning systems that clearly separate representation from semantics, present expert or algorithmic recommendations non-prescriptively, and support reflective evaluation of atypical but valid cases.
Authors:George X. Wang, Jiaqian Hu, Jing Qian
Abstract:
Recent advances in LLM based translation have led to renewed interest in fully automated systems, yet professional translators remain essential in high stakes domains where decisions about accuracy, terminology, style, and audience cannot be safely automated. Current tools are typically single shot generators or single-agent self-refiners, offering limited support for translator multidimensional decision making process and providing little structured leverage for translator input. We present CHORUS, a human-AI multiagent collaborative translation framework grounded in the Multidimensional Quality Metrics (MQM) framework, which decomposes quality dimensions into specialized agents and integrates their feedback into an iterative refinement loop controlled by the translator. A six-user preliminary study with professional translators found that CHORUS consistently outperforms zero-shot and single-agent baselines, showing that MQM-aligned multi-agent collaboration better supports professional translation workflows than autonomous generation.
Authors:Black Sun, Haiyang Xu, Ge Kacy Fu, Liyue Da, Eve Hoggan
Abstract:
Hybrid meetings often begin with social awkwardness and asymmetric participation, particularly for remote attendees who lack access to informal, co-present interaction. We present MagHeart, a multimodal system that explores symmetric icebreaking in hybrid meetings through playful LEGO-based avatar co-creation and a tangible magnetic device that represents a remote participant's heartbeat as an ambient presence cue. By combining creative co-creation with abstract bio-feedback, MagHeart rethinks how remote participants can become materially and perceptually present during meeting openings. We report findings from a scenario-based exploratory study combining quantitative and qualitative data, examining participants' anticipated engagement, perceived social presence, and future-use intentions from both co-located and remote perspectives. Our results highlight opportunities for playful, embodied icebreakers to support early hybrid interaction, while also surfacing tensions around privacy, distraction, and contextual appropriateness. This work contributes design insights and open questions for future hybrid meeting tools that balance playfulness, embodiment, and social sensitivity.
Authors:danah boyd, Jayshree Sarathy
Abstract:
When the U.S. Census Bureau announced its intention to modernize its disclosure avoidance procedures for the 2020 Census, it sparked a controversy that is still underway. The move to differential privacy introduced technical and procedural uncertainties, leaving stakeholders unable to evaluate the quality of the data. More importantly, this transformation exposed the statistical illusions and limitations of census data, weakening stakeholders' trust in the data and in the Census Bureau itself. This essay examines the epistemic currents of this controversy. Drawing on theories from Science and Technology Studies (STS) and ethnographic fieldwork, we analyze the current controversy over differential privacy as a battle over uncertainty, trust, and legitimacy of the Census. We argue that rebuilding trust will require more than technical repairs or improved communication; it will require reconstructing what we identify as a 'statistical imaginary.'
Authors:Eman Alashwali, Abeer Alhuzali
Abstract:
In 2024, Saudi Arabia's Personal Data Protection Law (PDPL) came into force. However, little work has been done to assess its implementation. In this paper, we analyzed 100 e-commerce websites in Saudi Arabia against the PDPL, examining the presence of a privacy policy and, if present, the policy's declarations of four items pertaining to personal data rights and practices: a) personal data retention period, b) the right to request the destruction of personal data, c) the right to request a copy of personal data, and d) a mechanism for filing complaints. Our results show that, despite national awareness and support efforts, a significant fraction of e-commerce websites in our dataset are not fully compliant: only 31% of websites in our dataset declared all four examined items in their privacy policies. Even when privacy policies included such declarations, a considerable fraction of them failed to cover required fine-grained details. Second, the majority of top-ranked e-commerce websites (based on search results order) and those hosted on local e-commerce hosting platforms exhibited considerably higher non-compliance rates than mid- to low-ranked websites and those not hosted on local e-commerce platforms. Third, we assessed the use of Large Language Models (LLMs) as an automated tool for privacy policy analysis to measure compliance with the PDPL. We highlight the potential of LLMs and suggest considerations to improve LLM-based automated analysis for privacy policies. Our results provide a step forward in understanding the implementation barriers to data protection laws, especially in non-Western contexts. We provide recommendations for policymakers, regulators, website owners, and developers seeking to improve data protection practices and automate compliance monitoring.
Authors:Yuvarani Ganesan, Salsabila Harlen, Azfar Rahman Bin Fazul Rahman, Akashdeep Singh, Zahra Fathanah, Raja Jamilah Raja Yusof
Abstract:
Conversational AI has significant potential in the healthcare sector, but many existing systems fall short in emotional intelligence, fairness, and politeness, which are essential for building patient trust. This gap reduces the effectiveness of digital health solutions and can increase user anxiety. This study addresses the challenge of integrating ethical communication principles by designing and evaluating LunaAI, a healthcare chatbot prototype. Using a user-centered design approach informed by a structured literature review, we developed conversational scenarios that handle both routine and hostile user interactions. The system was implemented using the Google Gemini API and deployed as a mobile-first Progressive Web App built with React, Vite, and Firebase. Preliminary user testing was conducted with a small participant group, and responses were evaluated using established frameworks such as the Godspeed Questionnaire. In addition, a comparative analysis was performed between LunaAI's tailored responses and the baseline outputs of an uncustomized large language model. The results indicate measurable improvements in key interaction qualities, with average user ratings of 4.7 out of 5 for politeness and 4.9 out of 5 for fairness. These findings highlight the importance of intentional ethical conversational design for human-computer interaction, particularly in sensitive healthcare contexts.
Authors:Aya Abdelnaem El-Basha, Ebtsam ELSayed Mahmoud ELSayes, Ahmad Al-Kabbany
Abstract:
This study investigates the effectiveness of a Virtual Reality (VR)-based training program in improving body awareness among children with Attention Deficit Hyperactivity Disorder (ADHD). Utilizing a quasi-experimental design, the research sample consisted of 10 children aged 4 to 7 years, with IQ scores ranging from 90 to 110. Participants were divided into an experimental group and a control group, with the experimental group receiving a structured VR intervention over three months, totaling 36 sessions. Assessment tools included the Stanford-Binet Intelligence Scale (5th Edition), the Conners Test for ADHD, and a researcher-prepared Body Awareness Scale. The results indicated statistically significant differences between pre-test and post-test scores for the experimental group, demonstrating the program's efficacy in enhancing spatial awareness, body part identification, and motor expressions. Furthermore, follow-up assessments conducted one month after the intervention revealed no significant differences from the post-test results, confirming the sustainability and continuity of the program's effects over time. The findings suggest that immersive VR environments provide a safe, engaging, and effective therapeutic medium for addressing psychomotor deficits in early childhood ADHD.
Authors:Dimitri Staufer, Kirsten Morehouse
Abstract:
Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions. Prior work shows that PD can resurface, yet users lack insight into how strongly models associate specific information to their identity. We audit PD across eight LLMs (3 open-source; 5 API-based, including GPT-4o), introduce LMP2 (Language Model Privacy Probe), a human-centered, privacy-preserving audit tool refined through two formative studies (N=20), and run two studies with EU residents to capture (i) intuitions about LLM-generated PD (N1=155) and (ii) reactions to tool output (N2=303). We show empirically that models confidently generate multiple PD categories for well-known individuals. For everyday users, GPT-4o generates 11 features with 60% or more accuracy (e.g., gender, hair color, languages). Finally, 72% of participants sought control over model-generated associations with their name, raising questions about what counts as PD and whether data privacy rights should extend to LLMs.
Authors:Uğur Genç, Heng Gu, Chadha Degachi, Evangelos Niforatos, Senthil Chandrasegaran, Himanshu Verma
Abstract:
Large Language Model-powered conversational agents (CAs) are increasingly capable of projecting sophisticated personalities through language, but how these projections affect users is unclear. We thus examine how CA personalities expressed linguistically affect user decisions and perceptions in the context of charitable giving. In a crowdsourced study, 360 participants interacted with one of eight CAs, each projecting a personality composed of three linguistic aspects: attitude (optimistic/pessimistic), authority (authoritative/submissive), and reasoning (emotional/rational). While the CA's composite personality did not affect participants' decisions, it did affect their perceptions and emotional responses. Particularly, participants interacting with pessimistic CAs felt lower emotional state and lower affinity towards the cause, perceived the CA as less trustworthy and less competent, and yet tended to donate more toward the charity. Perceptions of trust, competence, and situational empathy significantly predicted donation decisions. Our findings emphasize the risks CAs pose as instruments of manipulation, subtly influencing user perceptions and decisions.
Authors:Farnaz Zamiri Zeraati, Yang Trista Cao, Yuehan Qiao, Hal Daumé, Hernisa Kacorri
Abstract:
Prompting and steering techniques are well established in general-purpose generative AI, yet assistive visual question answering (VQA) tools for blind users still follow rigid interaction patterns with limited opportunities for customization. User control can be helpful when system responses are misaligned with their goals and contexts, a gap that becomes especially consequential for blind users that may rely on these systems for access. We invite 11 blind users to customize their interactions with a real-world conversational VQA system. Drawing on 418 interactions, reflections, and post-study interviews, we analyze prompting-based techniques participants adopted, including those introduced in the study and those developed independently in real-world settings. VQA interactions were often lengthy: participants averaged 3 turns, sometimes up to 21, with input text typically tenfold shorter than the responses they heard. Built on state-of-the-art LLMs, the system lacked verbosity controls, was limited in estimating distance in space and time, relied on inaccessible image framing, and offered little to no camera guidance. We discuss how customization techniques such as prompt engineering can help participants work around these limitations. Alongside a new publicly available dataset, we offer insights for interaction design at both query and system levels.
Authors:EunJeong Cheon, Do Yeon Shin
Abstract:
As the presence of autonomous robots in public spaces increases-whether navigating campus walkways or neighborhood sidewalks-understanding how to carefully study these robots becomes critical. While HRI research has conducted field studies in public spaces, these are often limited to controlled experiments with prototype robots or structured observational methods, such as the Wizard of Oz technique. However, the autonomous mobile robots we encounter today, particularly delivery robots, operate beyond the control of researchers, navigating dynamic routes and unpredictable environments. To address this challenge, a more deliberate approach is required. Drawing inspiration from public realm ethnography in urban studies, geography, and sociology, this paper proposes the Walk-Along with Robots (WawR) methodology. We outline the key features of this method, the steps we applied in our study, the unique insights it offers, and the ways it can be evaluated. We hope this paper stimulates further discussion on research methodologies for studying autonomous robots in public spaces.
Authors:Michael T. Knierim, Thimo Schulz, Moritz Schiller, Jwan Shaban, Mario Nadj, Max L. Wilson, Alexander Maedche
Abstract:
Researchers often attribute social media's appeal to its ability to elicit flow experiences of deep absorption and effortless engagement. Yet prolonged use has also been linked to distraction, fatigue, and lower mood. This paradox remains poorly understood, in part because prior studies rely on habitual or one-shot reports that ask participants to directly attribute flow to social media. To address this gap, we conducted a five-day field study with 40 participants, combining objective smartphone app tracking with daily reconstructions of flow-inducing activities. Across 673 reported flow occurrences, participants rarely associated flow with social media (2 percent). Instead, heavier social media use predicted fewer daily flow occurrences. We further examine this relationship through the effects of social media use on fatigue, mood, and motivation. Altogether, our findings suggest that flow and social media may not align as closely as assumed - and might even compete - underscoring the need for further research.
Authors:Megan Lee, Seung Ha Hwang, Inhyeok Choi, Shreyas Darade, Mengchun Zhang, Kateryna Shapovalenko
Abstract:
Cross-subject generalization in EEG-based brain-computer interfaces (BCIs) remains challenging due to individual variability in neural signals. We investigate whether spectral representations offer more stable features for cross-subject transfer than temporal waveforms. Through correlation analyses across three EEG paradigms (SSVEP, P300, and Motor Imagery), we find that spectral features exhibit consistently higher cross-subject similarity than temporal signals. Motivated by this observation, we introduce ASPEN, a hybrid architecture that combines spectral and temporal feature streams via multiplicative fusion, requiring cross-modal agreement for features to propagate. Experiments across six benchmark datasets reveal that ASPEN is able to dynamically achieve the optimal spectral-temporal balance depending on the paradigm. ASPEN achieves the best unseen-subject accuracy on three of six datasets and competitive performance on others, demonstrating that multiplicative multimodal fusion enables effective cross-subject generalization.
Authors:Weijun Zhang, Xinru Tang
Abstract:
The HCI research community has witnessed a growing body of research on accessibility and disability driven by efforts to improve access. Yet, the concept of access reveals its limitations when examined within broader ableist structures. Drawing on an autoethnographic method, this study shares the co-first author Zhang's experiences at two higher-education institutions in China, including a specialized program exclusively for blind and low-vision students and a mainstream university where he was the first blind student admitted. Our analysis revealed tensions around access in both institutions: they either marginalized blind students within society at large or imposed pressures to conform to sighted norms. Both institutions were further constrained by systemic issues, including limited accessible resources, pervasive ableist cultures, and the lack of formalized policies. In response to these tensions, we conceptualize access as a contradictory construct and argue for understanding accessibility as an ongoing, exploratory practice within ableist structures.
Authors:Feras Kiki, Pouya P. Niaz, Alireza Madani, Cagatay Basdogan
Abstract:
Assessing human muscle fatigue is critical for optimizing performance and safety in physical human-robot interaction(pHRI). This work presents a data-driven framework to estimate fatigue in dynamic, cyclic pHRI using arm-mounted surface electromyography(sEMG). Subject-specific machine-learning regression models(Random Forest, XGBoost, and Linear Regression predict the fraction of cycles to fatigue(FCF) from three frequency-domain and one time-domain EMG features, and are benchmarked against a convolutional neural network(CNN) that ingests spectrograms of filtered EMG. Framing fatigue estimation as regression (rather than classification) captures continuous progression toward fatigue, supporting earlier detection, timely intervention, and adaptive robot control. In experiments with ten participants, a collaborative robot under admittance control guided repetitive lateral (left-right) end-effector motions until muscular fatigue. Average FCF RMSE across participants was 20.8+/-4.3% for the CNN, 23.3+/-3.8% for Random Forest, 24.8+/-4.5% for XGBoost, and 26.9+/-6.1% for Linear Regression. To probe cross-task generalization, one participant additionally performed unseen vertical (up-down) and circular repetitions; models trained only on lateral data were tested directly and largely retained accuracy, indicating robustness to changes in movement direction, arm kinematics, and muscle recruitment, while Linear Regression deteriorated. Overall, the study shows that both feature-based ML and spectrogram-based DL can estimate remaining work capacity during repetitive pHRI, with the CNN delivering the lowest error and the tree-based models close behind. The reported transfer to new motion patterns suggests potential for practical fatigue monitoring without retraining for every task, improving operator protection and enabling fatigue-aware shared autonomy, for safer fatigue-adaptive pHRI control.
Authors:Kashyap Thimmaraju, Duc Anh Hoang, Souradip Nath, Jaron Mink, Gail-Joon Ahn
Abstract:
The sustainability of Security Operations Centers depends on their people, yet 71% of practitioners report burnout and 24% plan to exit cybersecurity entirely. Flow theory suggests that when job demands misalign with practitioner capabilities, work becomes overwhelming or tedious rather than engaging. Achieving challenge-skill balance begins at hiring: if job descriptions inaccurately portray requirements, organizations risk recruiting underskilled practitioners who face anxiety or overskilled ones who experience boredom. Yet we lack empirical understanding of what current SOC job descriptions actually specify. We analyzed 106 public SOC job postings from November to December 2024 across 35 organizations in 11 countries, covering Analysts (n=17), Incident Responders (n=38), Threat Hunters (n=39), and SOC Managers (n=12). Using Inductive Content Analysis, we coded certifications, technical skills, soft skills, tasks, and experience requirements. Three patterns emerged: (1) Communication skills dominate (50.9% of postings), exceeding SIEM tools (18.9%) or programming (30.2%), suggesting organizations prioritize collaboration over technical capabilities. (2) Certification expectations vary widely: CISSP leads (22.6%), but 43 distinct credentials appear with no universal standard. (3) Technical requirements show consensus: Python dominates programming (27.4%), Splunk leads SIEM platforms (14.2%), and ISO 27001 (13.2%) and NIST (10.4%) are most cited standards. These findings enable organizations to audit job descriptions against empirical baselines, help practitioners identify valued certifications and skills, and allow researchers to validate whether stated requirements align with actual demands. This establishes the foundation for flow-aligned interview protocols and investigation of how AI reshapes requirements. Dataset and codebook: https://git.tu-berlin.de/wosoc-2026/soc-jd-analysis.
Authors:Maqbool Dada, Brett Hathaway, Evgeny Kagan
Abstract:
Customer service has evolved beyond in-person visits and phone calls to include live chat, AI chatbots and social media, among other contact options. Service providers typically refer to these contact modalities as "channels". Within each channel, customer service agents are tasked with managing and resolving a stream of inbound service requests. Each request involves milestones where the agent must decide whether to keep assisting the customer or to transfer them to a more skilled -- and often costlier -- provider. To understand how this request resolution process should be managed, we develop a model in which each channel is represented as a gatekeeper system and characterize the structure of the optimal request resolution policy. We then turn to the broader question of the firm's customer service design, which includes the strategic problem of which channels to deploy, the tactical questions of at what level to staff the live-agent channel and to what extent to train an AI chatbot, and the operational question of how to control the live-agent channel. Examining the interplay between strategic, tactical, and operational decisions through numerical methods, we show, among other insights, that service quality can be improved, rather than diminished, by chatbot implementation.
Authors:Shayla Sharmin, Sadia Afrin
Abstract:
Social media platforms, especially Facebook parenting groups, have long been used as informal support networks for mothers seeking advice and reassurance. However, growing concerns about social judgment, privacy exposure, and unreliable information are changing how mothers seek help. This exploratory mixed-method study examines why mothers are moving from Facebook parenting groups to large language models such as ChatGPT and Gemini. We conducted a cross-sectional online survey of 109 mothers. Results show that 41.3% of participants avoided Facebook parenting groups because they expected judgment from others. This difference was statistically significant across location and family structure. Mothers living in their home country and those in joint families were more likely to avoid Facebook groups. Qualitative findings revealed three themes: social judgment and exposure, LLMs as safe and private spaces, and quick and structured support. Participants described LLMs as immediate, emotionally safe, and reliable alternatives that reduce social risk when asking for help. Rather than replacing human support, LLMs appear to fill emotional and practical gaps within existing support systems. These findings show a change in maternal digital support and highlight the need to design LLM systems that support both information and emotional safety.
Authors:Alberto Olivares-Alarcos, Muhammad Ahsan, Satrio Sanjaya, Hsien-I Lin, Guillem Alenyà
Abstract:
Building effective human-robot interaction requires robots to derive conclusions from their experiences that are both logically sound and communicated in ways aligned with human expectations. This paper presents a hybrid framework that blends ontology-based reasoning with large language models (LLMs) to produce semantically grounded and natural robot explanations. Ontologies ensure logical consistency and domain grounding, while LLMs provide fluent, context-aware and adaptive language generation. The proposed method grounds data from human-robot experiences, enabling robots to reason about whether events are typical or atypical based on their properties. We integrate a state-of-the-art algorithm for retrieving and constructing static contrastive ontology-based narratives with an LLM agent that uses them to produce concise, clear, interactive explanations. The approach is validated through a laboratory study replicating an industrial collaborative task. Empirical results show significant improvements in the clarity and brevity of ontology-based narratives while preserving their semantic accuracy. Initial evaluations further demonstrate the system's ability to adapt explanations to user feedback. Overall, this work highlights the potential of ontology-LLM integration to advance explainable agency, and promote more transparent human-robot collaboration.
Authors:Shuhao Ma, John Zimmerman, Valentina Nisi, Nuno Jardim Nunes
Abstract:
Worker-Centered Design (WCD) has gained prominence over the past decade, offering researchers and practitioners ways to engage worker agency and support collective actions for workers. Yet few studies have systematically revisited WCD itself, examining its implementations, challenges, and practical impact. Through a four-lens analytical framework that examines multiple facets of WCD within food delivery industry, we identify critical tensions and blind spots from a Multi-Laborer System perspective. Our analysis reveals conflicts across labor chains, distorted implementations of WCD, designers' sometimes limited political-economic understanding, and workers as active agents of change. These insights further inform a Diagnostic-Generative pathway that helps to address recurring risks, including labor conflicts and institutional reframing, while cultivating designers' policy and economic imagination. Following the design criticism tradition, and through a four-lens reflexive analysis, this study expands the action space for WCD and strengthens its relevance to real-world practice.
Authors:Daniel Schwartz, Dario Salvucci, Yusuf Osmanlioglu, Richard Vallett, Genevieve Dion, Ali Shokoufandeh
Abstract:
Wearable e-textile interfaces require gesture recognition capabilities but face severe constraints in power consumption, computational capacity, and form factor that make traditional deep learning impractical. While lightweight architectures like MobileNet improve efficiency, they still demand thousands of parameters, limiting deployment on textile-integrated platforms. We introduce a convexified attention mechanism for wearable applications that dynamically weights features while preserving convexity through nonexpansive simplex projection and convex loss functions. Unlike conventional attention mechanisms using non-convex softmax operations, our approach employs Euclidean projection onto the probability simplex combined with multi-class hinge loss, ensuring global convergence guarantees. Implemented on a textile-based capacitive sensor with four connection points, our approach achieves 100.00\% accuracy on tap gestures and 100.00\% on swipe gestures -- consistent across 10-fold cross-validation and held-out test evaluation -- while requiring only 120--360 parameters, a 97\% reduction compared to conventional approaches. With sub-millisecond inference times (290--296$μ$s) and minimal storage requirements ($<$7KB), our method enables gesture interfaces directly within e-textiles without external processing. Our evaluation, conducted in controlled laboratory conditions with a single-user dataset, demonstrates feasibility for basic gesture interactions. Real-world deployment would require validation across multiple users, environmental conditions, and more complex gesture vocabularies. These results demonstrate how convex optimization can enable efficient on-device machine learning for textile interfaces.
Authors:Jingwen Bai, Wei Soon Cheong, Philippe Muller, Brian Y Lim
Abstract:
Large Language Models (LLMs) have become indispensable for evaluating writing. However, text feedback they provide is often unintelligible, generic, and not specific to user criteria. Inspired by structured rubrics in education and intelligible AI explanations, we propose iRULER following identified design guidelines to \textit{scaffold} the review process by \textit{specific} criteria, providing \textit{justification} for score selection, and offering \textit{actionable} revisions to target different quality levels. To \textit{qualify} user-defined criteria, we recursively used iRULER with a rubric-of-rubrics to iteratively \textit{refine} rubrics. In controlled experiments on writing revision and rubric creation, iRULER most improved validated LLM-judged review scores and was perceived as most helpful and aligned compared to read-only rubric and text-based LLM feedback. Qualitative findings further support how iRULER satisfies the design guidelines for user-defined feedback. This work contributes interactive rubric tools for intelligible LLM-based review and revision of writing, and user-defined rubric creation.
Authors:Xuehan Huang, Canwen Wang, Yifei Hao, Daijin Yang, Ray LC
Abstract:
Chatbots are increasingly applied to domains previously reserved for human actors. One such domain is comedy, whereby both the general public working with ChatGPT and research-based LLM-systems have tried their hands on making humor. In formative interviews with professional comedians and video analyses of stand-up comedy in humans, we found that human performers often use their ethnic, gender, community, and demographic-based identity to enable joke-making. This suggests whether the identity of AI itself can empower AI humor generation for human audiences. We designed a machine-identity-based agent that uses its own status as AI to tell jokes in online performance format. Studies with human audiences (N=32) showed that machine-identity-based agents were seen as funnier than baseline-GPT agent. This work suggests the design of human-AI integrated systems that explicitly utilize AI as its own unique identity apart from humans.
Authors:Fakhri Momeni, Sarah Sajid, Johannes Kiesel
Abstract:
Reproducibility remains a central challenge in computational social science, where complex workflows, evolving software ecosystems, and inconsistent documentation hinder researchers ability to re-execute published methods. This study presents a systematic evaluation of reproducibility across three conditions: uncurated documentation, curated documentation, and curated documentation paired with a preset execution environment. Using 47 usability test sessions, we combine behavioral performance indicators (success rates, task time, and error profiles) with questionnaire data and thematic analysis to identify technical and conceptual barriers to reproducibility. Curated documentation substantially reduced repository-level errors and improved users ability to interpret method outputs. Standardizing the execution environment further improved reproducibility, yielding the highest success rate and shortest task completion times. Across conditions, participants frequently relied on AI tools for troubleshooting, often enabling independent resolution of issues without facilitator intervention. Our findings demonstrate that reproducibility barriers are multi-layered and require coordinated improvements in documentation quality, environment stability, and conceptual clarity. We discuss implications for the design of reproducibility platforms and infrastructure in computational social science.
Authors:Haoyang Chen, Jingwen Bai, Fang Tian, Brian Y Lim
Abstract:
While Explainable AI (XAI) helps users understand AI decisions, misalignment in domain knowledge can lead to disagreement. This inconsistency hinders understanding, and because explanations are often read-only, users lack the control to improve alignment. We propose making XAI editable, allowing users to write rules to improve control and gain deeper understanding through the generation effect of active learning. We developed CoExplain, leveraging a neural network for universal representation and symbolic rules for intuitive reasoning on interpretable attributes. CoExplain explains the neural network with a faithful proxy decision tree, parses user-written rules as an equivalent neural network graph, and collaboratively optimizes the decision tree. In a user study (N=43), CoExplain and manually editable XAI improved user understanding and model alignment compared to read-only XAI. CoExplain was easier to use with fewer edits and less time. This work contributes Editable XAI for bidirectional AI alignment, improving understanding and control.
Authors:Mohammad Raihanul Bashar, Aunnoy K Mutasim, Ken Pfeuffer, Anil Ufuk Batmaz
Abstract:
Interacting with multiple objects simultaneously makes us fast. A pre-step to this interaction is to select the objects, i.e., multi-object selection, which is enabled through two steps: (1) toggling multi-selection mode -- mode-switching -- and then (2) selecting all the intended objects -- subselection. In extended reality (XR), each step can be performed with the eyes, hands, and voice. To examine how design choices affect user performance, we evaluated four mode-switching (SemiPinch, FullPinch, DoublePinch, and Voice) and three subselection techniques (Gaze+Dwell, Gaze+Pinch, and Gaze+Voice) in a user study. Results revealed that while DoublePinch paired with Gaze+Pinch yielded the highest overall performance, SemiPinch achieved the lowest performance. Although Voice-based mode-switching showed benefits, Gaze+Voice subselection was less favored, as the required repetitive vocal commands were perceived as tedious. Overall, these findings provide empirical insights and inform design recommendations for multi-selection techniques in XR.
Authors:Emma Hoes, K. Jonathan Klueser, Fabrizio Gilardi
Abstract:
Digital platforms shape how people communicate, deliberate, and form opinions. Studying these dynamics has become increasingly difficult due to restricted data access, ethical constraints on real-world experiments, and limitations of existing research tools. VIRENA (Virtual Arena) is a platform that enables controlled experimentation in realistic social media environments. Multiple participants interact simultaneously in realistic replicas of feed-based platforms (Instagram, Facebook, Reddit) and messaging apps (WhatsApp, Messenger). Large language model-powered AI agents participate alongside humans with configurable personas and realistic behavior. Researchers can manipulate content moderation approaches, pre-schedule stimulus content, and run experiments across conditions through a visual interface requiring no programming skills. VIRENA makes possible research designs that were previously impractical: studying human--AI interaction in realistic social contexts, experimentally comparing moderation interventions, and observing group deliberation as it unfolds. Built on open-source technologies that ensure data remain under institutional control and comply with data protection requirements, VIRENA is currently in use at the University of Zurich and available for pilot collaborations. Designed for researchers, educators, and public organizations alike, VIRENA's no-code interface makes controlled social media simulation accessible across disciplines and sectors. This paper documents its design, architecture, and capabilities.
Authors:Jiajun Chen, Hua Shen
Abstract:
Existing work on value alignment typically characterizes value relations statically, ignoring how interventions - such as prompting, fine-tuning, or preference optimization - reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.
Authors:Chang Liu, Qinyi Zhou, Xinjie Shen, Xingyu Bruce Liu, Tongshuang Wu, Xiang 'Anthony' Chen
Abstract:
LLMs are now embedded in a wide range of everyday scenarios. However, their inherent hallucinations risk hiding misinformation in fluent responses, raising concerns about overreliance on AI. Detecting overreliance is challenging, as it often arises in complex, dynamic contexts and cannot be easily captured by post-hoc task outcomes. In this work, we aim to investigate how users' behavioral patterns correlate with overreliance. We collected interaction logs from 77 participants working with an LLM injected plausible misinformation across three real-world tasks and we assessed overreliance by whether participants detected and corrected these errors. By semantically encoding and clustering segments of user interactions, we identified five behavioral patterns linked to overreliance: users with low overreliance show careful task comprehension and fine-grained navigation; users with high overreliance show frequent copy-paste, skipping initial comprehension, repeated LLM references, coarse locating, and accepting misinformation despite hesitation. We discuss design implications for mitigation.
Authors:Juliana Gerard, Morgan Macleod, Kelly Norwood, Aisling Reid, Muskaan Singh
Abstract:
In this paper, we compare methodological approaches for comparing student and staff perceptions, and ask: how much do these measures vary across different approaches? We focus on the case of AI perceptions, which are generally assessed via a single quantitative or qualitative measure, or with a mixed methods approach that compares two distinct data sources - e.g. a quantitative questionnaire with qualitative comments. To compare different approaches, we collect two forms of qualitative data: standalone comments and structured focus groups. We conduct two analyses for each data source: with a sentiment and stance analysis, we measure overall negativity/positivity of the comments and focus group conversations, respectively. Meanwhile, word clouds from the comments and a thematic analysis of the focus groups provide further detail on the content of this qualitative data - particularly the thematic analysis, which includes both similarities and differences between students and staff. We show that different analyses can produce different results - for a single data source. This variation stems from the construct being evaluated - an overall measure of positivity/negativity can produce a different picture from more detailed content-based analyses. We discuss the implications of this variation for institutional contexts, and for the comparisons from previous studies.
Authors:Yehuda Perry, Tawfiq Ammari
Abstract:
Autonomous vehicles (AVs) are characterized by pervasive datafication and surveillance through sensors like in-cabin cameras, LIDAR, and GPS. Drawing on 16 semi-structured interviews with AV drivers analyzed using constructivist grounded theory, this study examines how users make sense of vehicular surveillance within everyday datafication. Findings reveal drivers demonstrate few AV-specific privacy concerns, instead normalizing monitoring through comparisons with established digital platforms. We theorize this indifference by situating AV surveillance within the `surveillance ecology' of platform environments, arguing the datafied car functions as a mobile extension of the `leaky home' -- private spaces rendered permeable through connected technologies continuously transmitting behavioral data. The study contributes to scholarship on surveillance beliefs, datafication, and platform governance by demonstrating how users who have accepted comprehensive smartphone and smart home monitoring encounter AV datafication as just another node in normalized data extraction. We highlight how geographic restrictions on data access -- currently limiting driver log access to California -- create asymmetries that impede informed privacy deliberation, exemplifying `tertiary digital divides.' Finally, we examine how machine learning's reliance on data-intensive approaches creates structural pressure for surveillance that transcends individual manufacturer choices. We propose governance interventions to democratize social learning, including universal data access rights, binding transparency requirements, and data minimization standards to prevent race-to-the-bottom dynamics in automotive datafication.
Authors:Zhuoyang Li, Yanlai Wu, Yao Li, Xinning Gui, Yuhan Luo
Abstract:
Large language models (LLMs) are increasingly integrated into daily life through conversational interfaces, processing user data via natural language inputs and exhibiting advanced reasoning capabilities, which raises new concerns about user control over privacy. While much research has focused on potential privacy risks, less attention has been paid to the data control mechanisms these platforms provide. This study examines six conversational LLM platforms, analyzing how they define and implement features for users to access, edit, delete, and share data. Our analysis reveals an emerging paradigm of data control in conversational LLM platforms, where user data is generated and derived through interaction itself, natural language enables flexible yet often ambiguous control, and multi-user interactions with shared data raise questions of co-ownership and governance. Based on these findings, we offer practical insights for platform developers, policymakers, and researchers to design more effective and usable privacy controls in LLM-powered conversational interactions.
Authors:Yigang Qin, EunJeong Cheon
Abstract:
The HCI community has called for renewed attention to labor issues and the political economy of computing. Yet much work remains in engaging with labor theory to better understand modern work and workers. This article traces the development of Labor Process Theory (LPT) -- from Karl Marx and Harry Braverman to Michael Burawoy and beyond -- and introduces it as an essential yet underutilized resource for structural analysis of work under capitalism and the design of computing systems. We examine HCI literature on labor, investigating focal themes and conceptual, empirical, and design approaches. Drawing from LPT, we offer directions for HCI research and practice: distinguish labor from work, link work practice to value production, study up the management, analyze consent and legitimacy, move beyond the point of production, design alternative institutions, and unnaturalize bourgeois designs. These directions can deepen analyses of tech-mediated workplace regimes, inform critical and normative designs, and strengthen the field's connection to broader political economic critique.
Authors:Xinru Tang, Jingjin Li, Shaomei Wu
Abstract:
Despite efforts to increase the representation of disabled people in AI datasets, accessibility datasets are often annotated by crowdworkers without disability-specific expertise, leading to inconsistent or inaccurate labels. This paper examines these annotation challenges through a case study of annotating speech data from people who stutter (PWS). Given the variability of stuttering and differing views on how it manifests, annotating and transcribing stuttered speech remains difficult, even for trained professionals. Through interviews and co-design workshops with PWS and domain experts, we identify challenges in stuttered speech annotation and develop practices that integrate the lived experiences of PWS into the annotation process. Our findings highlight the value of embodied knowledge in improving dataset quality, while revealing tensions between the complexity of disability experiences and the rigidity of static labels. We conclude with implications for disability-first and multiplicity-aware approaches to data interpretation across the AI pipeline.
Authors:Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro
Abstract:
As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.
Authors:Donguk Park, Dongwon Lee, Yeon-Chang Lee
Abstract:
Large language models (LLMs) are increasingly embedded into recommender systems, where they operate across multiple functional roles such as data augmentation, profiling, and decision making. While prior work emphasizes recommendation performance, the systemic risks of LLMs, such as bias and hallucination, and their propagation through feedback loops remain largely unexplored. In this paper, we propose a role-aware, phase-wise diagnostic framework that traces how these risks emerge, manifest in ranking outcomes, and accumulate over repeated recommendation cycles. We formalize a controlled feedback-loop pipeline that simulates long-term interaction dynamics and enables empirical measurement of risks at the LLM-generated content, ranking, and ecosystem levels. Experiments on widely used benchmarks demonstrate that LLM-based components can amplify popularity bias, introduce spurious signals through hallucination, and lead to polarized and self-reinforcing exposure patterns over time. We plan to release our framework as an open-source toolkit to facilitate systematic risk analysis across diverse LLM-powered recommender systems.
Authors:Yehuda Perry, Tawfiq Ammari
Abstract:
As semi-autonomous vehicles (AVs) become prevalent, drivers must collaborate with AI systems whose decision-making processes remain opaque. This study examines how drivers of AVs develop folk theories to interpret algorithmic behavior that contradicts their expectations. Through 16 semi-structured interviews with drivers in the United States, we investigate the explanatory frameworks drivers construct to make sense of AI decisions, the strategies they employ when systems behave unexpectedly, and their experiences with control handoffs and feedback mechanisms. Our findings reveal that drivers develop sophisticated folk theories -- often using anthropomorphic metaphors describing systems that ``see,'' ``hesitate,'' or become ``overwhelmed'' -- yet lack informational resources to validate these theories or meaningfully participate in algorithmic governance. We identify contexts where algorithmic opacity manifests acutely, including complex intersections, adverse weather, and rural environments. Current AV designs position drivers as passive data sources rather than epistemic agents, creating accountability gaps that undermine trust and safety. Drawing on critical data studies and algorithmic accountability literature, we propose a framework for participatory algorithmic governance that would provide drivers with transparency into AI decision-making and meaningful channels for contributing to system improvement. This research contributes to understanding how users navigate datafied sociotechnical systems in safety-critical contexts.
Authors:Mona Rajhans, Vishal Khawarey
Abstract:
Explainable Artificial Intelligence (XAI) aims to make machine learning models transparent and trustworthy, yet most current approaches communicate explanations visually or through text. This paper introduces an information theoretic framework for analyzing how explanation modality specifically, voice versus text affects user comprehension and trust calibration in AI systems. The proposed model treats explanation delivery as a communication channel between model and user, characterized by metrics for information retention, comprehension efficiency (CE), and trust calibration error (T CE). A simulation framework implemented in Python was developed to evaluate these metrics using synthetic SHAP based feature attributions across multiple modality style configurations (brief, detailed, and analogy based). Results demonstrate that text explanations achieve higher comprehension efficiency, while voice explanations yield improved trust calibration, with analogy based delivery achieving the best overall trade off. This framework provides a reproducible foundation for designing and benchmarking multimodal explainability systems and can be extended to empirical studies using real SHAP or LIME outputs on open datasets such as the UCI Credit Approval or Kaggle Financial Transactions datasets.
Authors:Ankolika De, Gabriel Lima, Yixin Zou
Abstract:
This work examines how leading generative artificial intelligence companies construct and communicate the concept of "safety" through public-facing documents. Drawing on critical discourse analysis, we analyze a corpus of corporate safety-related statements to explicate how authority, responsibility, and legitimacy are discursively established. These discursive strategies consolidate legitimacy for corporate actors, normalize safety as an experimental and anticipatory practice, and push a perceived participatory agenda toward safe technologies. We argue that uncritical uptake of these discourses risks reproducing corporate priorities and constraining alternative approaches to governance and design. The contribution of this work is twofold: first, to situate safety as a sociotechnical discourse that warrants critical examination; second, to caution human-computer interaction scholars against legitimizing corporate framings, instead foregrounding accountability, equity, and justice. By interrogating safety discourses as artifacts of power, this paper advances a critical agenda for human-computer interaction scholarship on artificial intelligence.
Authors:Pavlos Panagiotidis, Jocelyn Spence, Nils Jaeger, Paul Tennent
Abstract:
As AI systems increasingly become embedded in interactive and im-mersive artistic environments, artists and technologists are discovering new opportunities to engage with their interpretive and autonomous capacities as creative collaborators in live performance. The focus of this work-in-progress is on outlining conceptual and technical foundations under which performance-makers and interactive architecture can collaborate within rehearsal settings. It introduces a rehearsal-oriented prototype system for shaping and testing AI-mediated environments within creative practice. This approach treats interactive architecture as a performative agent that senses spatial behaviour and speech, interprets these signals through a large language model, and generates real-time environmental adaptations. Designed for deployment in physical performance spaces, the system employs virtual blueprints to support iterative experimentation and creative dialogue between artists and AI agents, using reasoning traces to inform architectural interaction design grounded in dramaturgical principles.
Authors:Shayla Sharmin, Sadia Afrin
Abstract:
In the age of Large Language Models (LLMs), much work has already been done on how LLMs support medication advice and serve as information providers; however, how mothers use these tools for emotional and informational support to avoid social judgment remains underexplored. This study conducted a 10-day mixed-methods exploratory survey ($N=107$) to investigate how mothers use LLMs as a non-judgmental resource for emotional support and regulation, and for situational reassurance. Our findings show that mothers are asking LLMs various questions about childcare to reassure themselves and avoid judgment, particularly around childcare decisions, maternal guilt, and late-night caregiving. Open-ended responses also show that mothers are comfortable with LLMs because they do not have to think about social consequences or judgment. Although mothers use LLMs for quick information or reassurance to avoid judgment, over half of the participants value human warmth more than LLMs; however, a significant minority, especially those in joint families, consider LLMs to avoid human judgment. These findings help understand how LLMs can be framed as low-risk interaction support rather than a replacement for human support, and highlight the role of social context in shaping emotional technology use.
Authors:Yang Li, Anna Maria Feit
Abstract:
Intelligent text entry (ITE) methods, such as word suggestions, are widely used in mobile typing, yet improving ITE systems is challenging because the cognitive mechanisms behind suggestion use remain poorly understood, and evaluating new systems often requires long-term user studies to account for behavioral adaptation. We present WSTypist, a reinforcement learning-based model that simulates how typists integrate word suggestions into typing. It builds on recent hierarchical control models of typing, but focuses on the cognitive mechanisms that underlie the high-level decision-making for effectively integrating word suggestions into manual typing: assessing efficiency gains, considering orthographic uncertainties, and including personal reliance on AI support. Our evaluations show that WSTypist simulates diverse human-like suggestion-use strategies, reproduces individual differences, and generalizes across different systems. Importantly, we demonstrate on four design cases how computational rationality models can be used to inform what-if analyses during the design process, by simulating how users might adapt to changes in the UI or in the algorithmic support, reducing the need for long-term user studies.
Authors:Uwe Peters, Andrea Bertazzoli, Jasmine M. DeJesus, Gisela J. van der Velden, Benjamin Chin-Yee
Abstract:
Scientists often use generics, that is, unquantified statements about whole categories of people or phenomena, when communicating research findings (e.g., "statins reduce cardiovascular events"). Large language models (LLMs), such as ChatGPT, frequently adopt the same style when summarizing scientific texts. However, generics can prompt overgeneralizations, especially when they are interpreted differently across audiences. In a study comparing laypeople, scientists, and two leading LLMs (ChatGPT-5 and DeepSeek), we found systematic differences in interpretation of generics. Compared to most scientists, laypeople judged scientific generics as more generalizable and credible, while LLMs rated them even higher. These mismatches highlight significant risks for science communication. Scientists may use generics and incorrectly assume laypeople share their interpretation, while LLMs may systematically overgeneralize scientific findings when summarizing research. Our findings underscore the need for greater attention to language choices in both human and LLM-mediated science communication.
Authors:Zhihan Jiang, Qianhui Chen, Chu Zhang, Yanheng Li, Ray LC
Abstract:
In human conversation, empathic dialogue requires nuanced temporal cues indicating whether the conversational partner is paying attention. This type of "active listening" is overlooked in the design of Conversational Agents (CAs), which use the same pacing for one conversation. To model the temporal cues in human conversation, we need CAs that dynamically adjust response pacing according to user input. We qualitatively analyzed ten cases of active listening to distill five context-aware pacing strategies: Reflective Silence, Facilitative Silence, Empathic Silence, Holding Space, and Immediate Response. In a between-subjects study (N=50) with two conversational scenarios (relationship and career-support), the context-aware agent scored higher than static-pacing control on perceived human-likeness, smoothness, and interactivity, supporting deeper self-disclosure and higher engagement. In the career support scenario, the CA yielded higher perceived listening quality and affective trust. This work shows how insights from human conversation like context-aware pacing can empower the design of more empathic human-AI communication.
Authors:Sankar B, Amogh A S, Sandhya Baranwal, Dibakar Sen
Abstract:
During product conceptualization, capturing the non-linear history and cognitive intent is crucial. Traditional sketching tools often lose this context. We introduce DIMES (Design Idea Management and Evolution capture System), a web-based environment featuring sGIT (SketchGit), a custom visual version control architecture, and Generative AI. sGIT includes AEGIS, a module using hybrid Deep Learning and Machine Learning models to classify six stroke types. The system maps Git primitives to design actions, enabling implicit branching and multi-modal commits (stroke data + voice intent). In a comparative study, experts using DIMES demonstrated a 160% increase in breadth of concept exploration. Generative AI modules generated narrative summaries that enhanced knowledge transfer; novices achieved higher replication fidelity (Neural Transparency-based Cosine Similarity: 0.97 vs. 0.73) compared to manual summaries. AI-generated renderings also received higher user acceptance (Purchase Likelihood: 4.2 vs 3.1). This work demonstrates that intelligent version control bridges creative action and cognitive documentation, offering a new paradigm for design education.
Authors:Lukas Stappen, Ahmet Erkan Turan, Johann Hagerer, Georg Groh
Abstract:
The integration of Large Language Model (LLM)-based conversational agents into vehicles creates novel security challenges at the intersection of agentic AI, automotive safety, and inter-agent communication. As these intelligent assistants coordinate with external services via protocols such as Google's Agent-to-Agent (A2A), they establish attack surfaces where manipulations can propagate through natural language payloads, potentially causing severe consequences ranging from driver distraction to unauthorized vehicle control. Existing AI security frameworks, while foundational, lack the rigorous "separation of concerns" standard in safety-critical systems engineering by co-mingling the concepts of what is being protected (assets) with how it is attacked (attack paths). This paper addresses this methodological gap by proposing a threat modeling framework called AgentHeLLM (Agent Hazard Exploration for LLM Assistants) that formally separates asset identification from attack path analysis. We introduce a human-centric asset taxonomy derived from harm-oriented "victim modeling" and inspired by the Universal Declaration of Human Rights, and a formal graph-based model that distinguishes poison paths (malicious data propagation) from trigger paths (activation actions). We demonstrate the framework's practical applicability through an open-source attack path suggestion tool AgentHeLLM Attack Path Generator that automates multi-stage threat discovery using a bi-level search strategy.
Authors:Damien Rudaz, Barbara Nino Carreras, Sara Merlino, Brian L. Due, Barry Brown
Abstract:
Does human-AI assistance unfold in the same way as human-human assistance? This research explores what can be learned from the expertise of blind individuals and sighted volunteers to inform the design of multimodal voice agents and address the enduring challenge of proactivity. Drawing on granular analysis of two representative fragments from a larger corpus, we contrast the practices co-produced by an experienced human remote sighted assistant and a blind participant-as they collaborate to find a stain on a blanket over the phone-with those achieved when the same participant worked with a multimodal voice agent on the same task, a few moments earlier. This comparison enables us to specify precisely which fundamental proactive practices the agent did not enact in situ. We conclude that, so long as multimodal voice agents cannot produce environmentally occasioned vision-based actions, they will lack a key resource relied upon by human remote sighted assistants.
Authors:Alastair Howcroft, Amber Bennett-Weston, Ahmad Khan, Joseff Griffiths, Simon Gay, Jeremy Howick
Abstract:
Background: Empathy is widely recognized for improving patient outcomes, including reduced pain and anxiety and improved satisfaction, and its absence can cause harm. Meanwhile, use of artificial intelligence (AI)-based chatbots in healthcare is rapidly expanding, with one in five general practitioners using generative AI to assist with tasks such as writing letters. Some studies suggest AI chatbots can outperform human healthcare professionals (HCPs) in empathy, though findings are mixed and lack synthesis. Sources of data: We searched multiple databases for studies comparing AI chatbots using large language models with human HCPs on empathy measures. We assessed risk of bias with ROBINS-I and synthesized findings using random-effects meta-analysis where feasible, whilst avoiding double counting. Areas of agreement: We identified 15 studies (2023-2024). Thirteen studies reported statistically significantly higher empathy ratings for AI, with only two studies situated in dermatology favouring human responses. Of the 15 studies, 13 provided extractable data and were suitable for pooling. Meta-analysis of those 13 studies, all utilising ChatGPT-3.5/4, showed a standardized mean difference of 0.87 (95% CI, 0.54-1.20) favouring AI (P < .00001), roughly equivalent to a two-point increase on a 10-point scale. Areas of controversy: Studies relied on text-based assessments that overlook non-verbal cues and evaluated empathy through proxy raters. Growing points: Our findings indicate that, in text-only scenarios, AI chatbots are frequently perceived as more empathic than human HCPs. Areas timely for developing research: Future research should validate these findings with direct patient evaluations and assess whether emerging voice-enabled AI systems can deliver similar empathic advantages.
Authors:Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, Xiaowei Yang
Abstract:
As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to zero-shot outperform traditional machine learning models, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments of reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%--97%) and specificity (91%--100%) of the open-weight LLMs and those (72%--98%, and 93%--99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
Authors:Shota Yamanaka, I. Scott MacKenzie
Abstract:
For evaluations of 2D target selection using Fitts' law, ISO 9241-411 recommends using the effective target width (W_e) calculated using the univariate standard deviation of selection coordinates. Related research proposed using a bivariate standard deviation; however, the proposal was only tested using a single speed-accuracy bias condition, thus the assessment was limited. We compared the univariate and bivariate techniques in a 2D Fitts' law experiment using three speed-accuracy biases and 346 crowdworkers. Calculating W_e using the univariate standard deviation yielded higher model correlations across all bias conditions and produced more stable throughput among the biases. The findings were also consistent in cases using randomly sampled subsets of the participant data. We recommend that future research should calculate W_e using the univariate standard deviation for fair performance evaluations. Also, we found trivial effects when using nominal or effective amplitude and using different perspectives of the task axis.
Authors:Mobasshira Akter Urmi, Raiyan Abdul Baten
Abstract:
Strategic adaptation -- the ability to adjust interaction behavior in response to changing constraints and leverage -- is a central goal of negotiation training and an emerging target for AI coaching systems. However, adaptation is difficult to evaluate because adaptation-relevant moments arise unpredictably in typical tasks. We study a reusable dyadic negotiation testbed that employs a controlled midstream change in one party's outside alternative as a repeatable perturbation to stress-test adaptation. In a six-round chat-based negotiation study (N=100), the perturbation reliably reorganized interaction dynamics: transitions between integrative (cooperative) and distributive (positional) behaviors declined, behavioral diversity narrowed, and interactions drifted toward more distributive tactics. Critically, this distributive drift predicted worse relational experience net of objective outcomes, and adaptation patterns were path dependent on prior behavior. These results establish a methodological bridge for evaluating and comparing AI coaching systems on strategic adaptation as a process and identify failure modes and design targets for adaptive interaction support.
Authors:Gang Yu, Yuchi Sun, Weining Yan, Xinyu Wang, Qi Lu
Abstract:
Odor visualization translates odor information and perception into visual outcomes and arouses the corresponding olfactory synesthesia, surpassing the spatial limitation that odors can only be perceived where they are present. Traditional odor visualization has typically relied on unidimensional mappings, such as odor-to-color associations, and has required extensive manual design efforts. However, the advent of generative AI (Gen AI) and large language models (LLMs) presents a new opportunity for automatic odor visualization. Nonetheless, gaps remain in bridging olfactory perception with generative tools to produce odor images. To address these gaps, this paper introduces Paint by Odor, a pipeline that leverages Gen AI and LLMs to transform olfactory perceptions into rich, aesthetically engaging visual representations. Two experiments were conducted, where 30 participants smelled real-world odors and provided descriptive data and 28 participants evaluated 560 generated odor images through seven systematically designed prompts. Our findings explored the capability of LLMs in producing olfactory perception by comparing it with human responses and revealed the underlying mechanisms and effects of language-based descriptions and several abstraction styles on odor visualization. Our work further discussed the possibility of automatic odor visualization without human participation. These explorations and results have bridged the research gap in odor visualization using LLMs and Gen AI, offering valuable design insights and various possibilities for future applications.
Authors:Adriana Olmos, Anoop K. Sinha, Renelito Delos Santos, Ruben Rodriguez Rodriguez, James A. Landay, Sam S. Sepah, Philip Nelson, Shaun K. Kane
Abstract:
Video content remains largely inaccessible to blind and low-vision (BLV) users. To address this, we introduce a prototype that leverages a multimodal agent - powered by a novel conversational architecture using a multimodal large language model (MLLM) - to provide BLV users with an interactive, accessible video experience. This Multimodal Agent Video Player (MAVP) demonstrates that an interactive accessibility mode can be added to a video through multilayered prompt orchestration. We describe a user-centered design process involving 18 sessions with BLV users that showed that BLV users do not just want accessibility features, but desire independence and personal agency over the viewing experience. We conducted a qualitative study with an additional 8 BLV participants; in this, we saw that the MAVP's conversational dialogue offers BLV users a sense of personal agency, fostering collaboration and trust. Even in the case of hallucinations, it is meta-conversational dialogues about AI's limitations that can repair trust.
Authors:Lingqing Wang, Yingting Gao, Chidimma Lois Anyi, Ashok Goel
Abstract:
Recent advances in AI are integrating AI into the fabric of human social life, creating transformative, co-shaping relationships between humans and AI. This trend makes it urgent to investigate how these systems, in turn, shape their users. We conducted a three-phase design study with 24 participants to explore this dynamic. Our findings reveal critical tensions: (1) social AI often exacerbates the very interpersonal problems it is designed to mitigate; (2) it introduces nuanced privacy harms for secondary users inadvertently involved in AI-mediated social interactions; and (3) it can threaten the primary user's personal agency and identity. We argue these tensions expose a problematic tendency in the user-centered paradigm, which often prioritizes immediate user experience at the expense of core human values like interpersonal ethics and self-efficacy. We call for a paradigm shift toward a more provocative and relational design perspective that foregrounds long-term social and personal consequences.
Authors:Mona Alfayez, Ohoud Alharbi
Abstract:
Autonomous vehicles (AVs) are emerging as a transformative innovation in transportation, offering potential benefits in safety, sustainability, and efficiency. Saudi Arabian adoption of AVs aligns with Vision 2030, emphasizing smart mobility through initiatives such as the Riyadh Autonomous Metro and self-driving cars. This study explores Saudi citizens perceptions of AVs before and after exposure to these technologies and examines whether demographic factors age, gender, education level, and driving habits affect acceptance. Using quantitative methods, the findings provide insights into the broader influences shaping AV adoption, highlighting the importance of trust, perceived safety, and convenience. These results can inform policymakers and industry stakeholders on strategies to facilitate successful integration of AVs into Saudi Arabian transportation ecosystem.
Authors:Saizo Aoyagi, Ryoma Okazaki, Seishiro Hara, Fumiya Ikeda, Michiya Yamamoto
Abstract:
Since the COVID-19 pandemic, online lectures have spread rapidly and many students are satisfied with them. However, one challenge remains the loss of concentration due to the lack of students' copresence. Our previous work suggests that presenting 3D characters with appropriate actions has the potential to improve concentration in online lectures. Nevertheless, an effective combination of actions has not yet been identified. In this study, we developed a lecture watching system that presents a 3D virtual classroom using a naked-eye 3D display. The system includes student characters that show copresence with various actions such as nodding, notetaking, and sleeping. An evaluation experiment was conducted with two conditions; (1) student characters perform only positive actions and (2) both positive and negative actions. The results, analyzed using posture and notetaking behavior as key indicators, suggest that the system can help to maintain concentration when the student characters perform both positive and negative actions, rather than only positive ones. These findings provide promising strategies for maintaining student focus in on-demand lectures and contribute to the development of more effective online education systems.
Authors:Chan-in Sio, Alex Mann, Lingxi Fan, Andrew Cheung, Lik-hang Lee
Abstract:
The mental well-being of graduate students is an increasing concern, yet the adoption of scalable support remains uneven. Artificial intelligence-powered cognitive behavioral therapy chatbots (AI-CBT) offer low barrier help, but little is known about how Chinese postgraduates perceive and use them. This qualitative study explored perceptions and experiences of AI-CBT chatbots among ten Chinese graduate students recruited through social media. Semi-structured Zoom interviews were conducted and analyzed using reflexive thematic analysis, with the Health Belief Model (HBM) and the Theory of Planned Behavior (TPB) as sensitizing frameworks. The findings indicate a cautious openness to AI-CBT chatbots: perceived usefulness and 24/7 access supported favorable attitudes, while data privacy, emotional safety, and uncertainty about `fit' for complex problems restricted the intention to use. Social norms (e.g., stigma and peer views) and perceived control (digital literacy, language quality) further shaped adoption. The study offers context-specific information to guide the culturally sensitive design, communication, and deployment of AI mental well-being tools for student populations in China and outlines the design implications around transparency, safeguards, and graduated care pathways.
Authors:Wisnu Uriawan, Denis Firmansyah, Devi Mulyana, Dika Haekal Firza Pratama, Adly Juliarta Lerian, Fajar Satria Wiguna
Abstract:
The mastery of Hijaiyah letters is a crucial foundation for reading and comprehending the Quran, yet conventional pedagogical approaches based on repetitive memorization frequently struggle to maintain the engagement of young learners in contemporary educational contexts. This research presents the design and implementation of an innovative gamification-based methodology for Hijaiyah literacy acquisition, systematically developed through the ADDIE framework (Analysis, Design, Development, Implementation, Evaluation) to optimize student motivation, participation, and educational outcomes. The resulting technological solution, engineered using Unity 2D and Firebase, strategically incorporates game design elements such as points, badges, leaderboards, and progressive leveling, while integrating multifaceted learning components including visual animations, authentic tajwid-based audio pronunciation, and interactive letter tracing exercises to simultaneously develop cognitive recognition capabilities and fine motor skills. Empirical evaluation involving 50 elementary school participants revealed substantial quantitative improvements, with mean assessment scores increasing from 42.8 to 88.6 (107% improvement, p < 0.001), demonstrating an exceptionally large effect size (Cohen's d = 4.87), complemented by strong user engagement metrics (4.2 average daily sessions) and high satisfaction ratings (4.82 out of 5 mean motivation score). Beyond cognitive learning outcomes, the gamified approach effectively fostered intrinsic Islamic values such as perseverance, responsibility, and disciplined practice, thereby establishing an innovative educational paradigm that successfully integrates traditional Islamic pedagogical principles with modern digital learning technologies to create a transformative, engaging, and meaningful framework for Hijaiyah literacy development in contemporary Islamic education.
Authors:Samantha Shorey, Benjamin Mako Hill, Samuel C. Woolley
Abstract:
Although socializing is a powerful driver of youth engagement online, platforms struggle to leverage engagement to promote learning. We seek to understand this dynamic using a multi-stage analysis of over 14,000 comments on Scratch, an online platform designed to support learning about programming. First, we inductively develop the concept of "participatory debugging" -- a practice through which users learn through collaborative technical troubleshooting. Second, we use a content analysis to establish how common the practice is on Scratch. Third, we conduct a qualitative analysis of user activity over time and identify three factors that serve as social antecedents of participatory debugging: (1) sustained community, (2) identifiable problems, and (3) what we call "topic porousness" to describe conversations that are able to span multiple topics. We integrate these findings in a theoretical framework that highlights a productive tension between the desire to promote learning and the interest-driven sub-communities that drive user engagement in many new media environments.
Authors:Yilin Ke, Yun Suen Pai, Burkhard C. Wuensche, Angus Donald Campbell, Mairi Gunn
Abstract:
Digital health has strong potential for promoting physical activity (PA), yet interventions often fail to sustain engagement among culturally and linguistically diverse (CALD) women. Prior reviews focus on short-term efficacy or surface-level localisation, while a design-oriented synthesis of deep cultural adaptation and long-term strategies remain limited. This scoping review systematically screened 1968 records, analysed 18 studies and identified a critical design paradox: techno-solutionist systems overlook social and cultural barriers, while social-support features often fail in low-activity social networks. To address this gap, we propose the Culturally Embedded Interaction Framework, integrating five dimensions: culturally-grounded measurement, multi-modal interaction, contextual and temporal adaptability, embedded social weaving, and theory-guided cultural adaptation. The framework advances beyond accessibility-focused approaches by mapping behavioural theory to design mechanisms that support sustained and culturally plural participation. We provide actionable design principles to help HCI researchers and practitioners move from one-size-fits-all models toward adaptive, theory-informed, and culturally sustaining design.
Authors:Mona G. Ibrahim, Riham Hilal
Abstract:
AI technology development has transformed the field of engineering education with its adaptivity-driven, data-based, and ethical-led learning platforms that promote equity, diversity, and inclusivity. But with so much progress being made in so many areas, there are unfortunately gaps in gender equity, representation in cultures around the world, and access to education and jobs in stem education. The paper describes an ethical approach to using AI technology that supports the United Nations 2030 agenda for sustainability. In particular, this includes both Goal 5--Gender Equity--and Goal 10--Reducing Inequalities. Based on a synthesis strategy using both critical thinking strategies related to case studies around the world using AI-based adaptivity platforms to address equity gaps related to education inclusion. The model presented offers a synthesis solution that includes ethical leadership data-related to equity to measure inclusivity based upon sustainability thinking. The result has demonstrated that using AI technology not only increases inclusivity but promotes equity related to access to education in stem education access. Finally, there are concluding remarks related to transforming education into a global system.
Authors:Alessandro Silacci, Mauro Cherubini, Arianna Boldi, Amon Rapp, Maurizio Caon
Abstract:
Physical inactivity remains a critical global health issue, yet scalable strategies for sustained motivation are scarce. Conversational agents designed as simulated exercising peers (SEPs) represent a promising alternative, but their long-term impact is unclear. We report a six-month randomized controlled trial (N=280) comparing individuals exercising alone, with a human peer, or with a large language model-driven SEP. Results revealed a partnership paradox: human peers evoked stronger social presence, while AI peers provided steadier encouragement and more reliable working alliances. Humans motivated through authentic comparison and accountability, whereas AI peers fostered consistent, low-stakes support. These complementary strengths suggest that AI agents should not mimic human authenticity but augment it with reliability. Our findings advance human-agent interaction research and point to hybrid designs where human presence and AI consistency jointly sustain physical activity.
Authors:Patricia Marcella Evite, Ekaterina Svetlova, Doina Bucur
Abstract:
As Artificial Intelligence (AI) becomes increasingly embedded in financial decision-making, the opacity of complex models presents significant challenges for professionals and regulators. While the field of Explainable AI (XAI) attempts to bridge this gap, current research often reduces the implementation challenge to a binary trade-off between model accuracy and explainability. This paper argues that such a view is insufficient for the financial domain, where algorithmic choices must navigate a complex sociotechnical web of strict regulatory bounds, budget constraints, and latency requirements. Through semi-structured interviews with twenty finance professionals, ranging from C-suite executives and developers to regulators across multiple regions, this study empirically investigates how practitioners prioritize explainability relative to four competing factors: accuracy, compliance, cost, and speed. Our findings reveal that these priorities are structured not as a simple trade-off, but as a system of distinct prerequisites and constraints. Accuracy and compliance emerge as non-negotiable "hygiene factors": without them, an AI system is viewed as a liability regardless of its transparency. Operational levers (speed and cost) serve as secondary constraints that determine practical feasibility, while ease of understanding functions as a gateway to adoption, shaping whether AI tools are trusted, used, and defensible in practice.
Authors:Chen Chen, Dion Hoe-Lian Goh
Abstract:
As deepfake videos become increasingly difficult for people to recognise, understanding the strategies humans use is key to designing effective media literacy interventions. We conducted a study with 195 participants between the ages of 21 and 40, who judged real and deepfake videos, rated their confidence, and reported the cues they relied on across visual, audio, and knowledge strategies. Participants were more accurate with real videos than with deepfakes and showed lower expected calibration error for real content. Through association rule mining, we identified cue combinations that shaped performance. Visual appearance, vocal, and intuition often co-occurred for successful identifications, which highlights the importance of multimodal approaches in human detection. Our findings show which cues help or hinder detection and suggest directions for designing media literacy tools that guide effective cue use. Building on these insights can help people improve their identification skills and become more resilient to deceptive digital media.
Authors:Jonatan Reyes, Mina Massoumi, Anil Ufuk Batmaz, Marta Kersten-Oertel
Abstract:
Artificial intelligence (AI) is increasingly used to support prognosis in Alzheimer's disease (AD), but adoption remains limited due to a lack of transparency and interpretability, particularly for long-term predictions where uncertainty is intrinsic and outcomes may not be known for years. We position uncertainty visualization as an explainable AI (XAI) technique and examine how it shapes trust, confidence, and reliance when users interpret AI-generated forecasts of future cognitive decline transitions. We conducted two studies, one with general participants (N=37) and one with experts in neuroimaging and neurology (N=10), to compare binary (present/absent) and continuous (saturation) uncertainty encodings. Continuous encodings improved perceived reliability and helped users recognize model limitations, while binary encodings increased momentary confidence, revealing expertise-dependent trade-offs in interpreting future predictions under high uncertainty. These findings surface key challenges in designing uncertainty representations for prognostic AI and culminate in a set of empirically grounded guidelines for creating trustworthy, user-appropriate clinical decision support tools.
Authors:Jungmin Lee, Inhee Cho, Youngjae Yoo
Abstract:
Competitive games pose steep learning curves and strong social pressures, often discouraging novice players and limiting sustained engagement. To address these challenges, this study introduces LeagueBot, a large language model-based voice chatbot designed to provide both informational and emotional support during live gameplay in league of legends, one of the most competitive multiplayer online battle arena games. In a within-subjects experiment with 33 novice players, LeagueBot was found to reduce cognitive challenge, performative challenge, and perceived tension. Qualitative analysis further identified three themes: enhanced access to game information, relief from cognitive burden, and practical limitations. Participants noted that LeagueBot offered context-appropriate guidance and emotional support, helping ease the steep learning curve and psychological pressures of competitive gaming. Together, these findings underscore the potential of voice-based LLM companions to assist novice players in competitive environments and highlight their broader applicability for real-time support in other high-pressure contexts.
Authors:Kayode P. Ayodele, Enoruwa Obayiuwana, Aderonke R. Lawal, Ayorinde Bamimore, Funmilayo B. Offiong, Emmanuel A. Peter
Abstract:
As artificial intelligence (AI) models become routinely integrated into knowledge work, cognitive acts increasingly occur in two distinct modes: individually, using biological resources alone, or distributed across a human-AI system. Existing revisions to Bloom's Taxonomy treat AI as an external capability to be mapped against human cognition rather than as a driver of this dual-mode structure, and thus fail to specify distinct learning outcomes and assessment targets for each mode. This paper proposes the Augmented Cognition Framework (ACF), a restructured taxonomy built on three principles. First, each traditional Bloom level operates in two modes (Individual and Distributed) with mode-specific cognitive verbs. Second, an asymmetric dependency relationship holds wherein effective Distributed cognition typically requires Individual cognitive foundations, though structured scaffolding can in some cases reverse this sequence. Third, a seventh level, Orchestration, specifies a governance capacity for managing mode-switching, trust calibration, and partnership optimization. We systematically compare existing AI-revised taxonomies against explicit assessment-utility criteria and show, across the frameworks reviewed, that ACF uniquely generates assessable learning outcomes for individual cognition, distributed cognition, and mode-governance as distinct targets. The framework addresses fluent incompetence, the central pedagogical risk of the AI era, by making the dependency relationship structurally explicit while accommodating legitimate scaffolding approaches.
Authors:Ashley Hua, Adya Daruka, Yang Hong, Sharifa Sultana
Abstract:
Reproductive well-being education remains widely stigmatized across diverse cultural contexts, constraining how individuals access and interpret reproductive health knowledge. We designed and evaluated OpenBloom, a stigma-sensitive, AI-mediated system that uses LLMs to transform reproductive health articles into reflective, question-based learning prompts. We employed OpenBloom as a design probe, aiming to explore the emerging challenges of reproductive well-being stigma through LLMs. Through surveys, semi-structured interviews, and focus group discussions, we examine how sociocultural stigma shapes participants' engagements with AI-generated questions and the opportunities of inquiry-based reproductive health education. Our findings identify key design considerations for stigma-sensitive LLM, including empathetic framing, inclusive language, values-based reflection, and explicit representation of marginalized identities. However, while current LLM outputs largely meet expectations for cultural sensitivity and non-offensiveness, they default to superficial rephrasing and factual recall rather than critical reflection. This guides well-being HCI design in sensitive health domains toward culturally grounded, participatory workflows.
Authors:Filip Nowicki, Hubert Marciniak, Jakub Łączkowski, Krzysztof Jassem, Tomasz Górecki, Vimala Balakrishnan, Desmond C. Ong, Maciej Behnke
Abstract:
Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale; it is not yet clear how closely their outputs align with human affective ratings. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psycho-metrically validated affective image datasets: the International Affective Picture System, the Nencki Affective Picture System, and the Library of AI-Generated Affective Images. The models performed two tasks in the zero-shot setting: (i) top-emotion classification (selecting the strongest discrete emotion elicited by an image) and (ii) continuous prediction of human ratings on 1-7/9 Likert scales for discrete emotion categories and affective dimensions. We also evaluated the impact of rater-conditioned prompting on the LAI-GAI dataset using de-identified participant metadata. The results show good performance in discrete emotion classification, with accuracies typically ranging from 60% to 80% on six-emotion labels and from 60% to 75% on a more challenging 12-category task. The predictions of anger and surprise had the lowest accuracy in all datasets. For continuous rating prediction, models showed moderate to strong alignment with humans (r > 0.75) but also exhibited consistent biases, notably weaker performance on arousal, and a tendency to overestimate response strength. Rater-conditioned prompting resulted in only small, inconsistent changes in predictions. Overall, VLMs capture broad affective trends but lack the nuance found in validated psychological ratings, highlighting their potential and current limitations for affective computing and mental health-related applications.
Authors:Sung-In Kim, Joonyoung Park, Bogoan Kim, Hwajung Hong
Abstract:
Home-based care (HBC) delivers medical and care services in patients' living environments, offering unique opportunities for patient-centered care. However, patient agency is often inadequately represented in shared HBC planning processes. Through 23 multi-stakeholder interviews with HBC patients, healthcare professionals, and care workers, alongside 60 hours of ethnographic observations, we examined how patient agency manifests in HBC and why this representation gap occurs. Our findings reveal that patient agency is not a static individual attribute but a relational capacity shaped through maintaining everyday continuity, mutual recognition from care providers, and engagement with material home environments. Furthermore, we identified that structured documentation systems filter out contextual knowledge, informal communication channels fragment patient voices, and doctor-centered hierarchies position patients as passive recipients. Drawing on these insights, we propose design considerations to bridge this representation gap and to integrate patient agency into shared HBC plans.
Authors:Aaron Pengyu Zhu, Kristina Mah, Janghee Cho
Abstract:
Reflection is fundamental to how people make sense of everyday life, helping them navigate moments of growth, uncertainty, and change. Yet in HCI, existing frameworks of designing technologies to support reflection remain narrow, emphasizing cognitive, rational problem-solving, and individual self-improvement. We introduce Daoist philosophy as a non-Western lens to broaden this scope and reimagine reflective practices in interactive systems. Combining insights from Daoist literature with semi-structured interviews with 18 Daoist priests, scholars, and practitioners, we identified three key dimensions of everyday reflection: Stillness, Resonance, and Emergence. These dimensions reveal emergent, embodied, relational, and ethically driven qualities often overlooked in HCI research. We articulate their potential to inform alternative frameworks for interactive systems for reflection, advocating a shift from reflection toward reflecting-with, and highlight the potential of Daoism as an epistemological resource for the HCI community.
Authors:Bartosz Sawicki, Tomasz Les, Dariusz Parzych, Aleksandra Wycisk-Ficek, Pawel Trebacz, Pawel Zawadzki
Abstract:
As generative artificial intelligence advances, Large Language Models (LLMs) are being explored for automated graphical user interface (GUI) design. This study investigates the usability and adaptability of LLM-generated interfaces by analysing their ability to meet diverse user needs. The experiments included utilization of three state-of-the-art models from January 2025 (OpenAI GPT o3-mini-high, DeepSeek R1, and Anthropic Claude 3.5 Sonnet) generating mockups for three interface types: a chat system, a technical team panel, and a manager dashboard. Expert evaluations revealed that while LLMs are effective at creating structured layouts, they face challenges in meeting accessibility standards and providing interactive functionality. Further testing showed that LLMs could partially tailor interfaces for different user personas but lacked deeper contextual understanding. The results suggest that while LLMs are promising tools for early-stage UI prototyping, human intervention remains critical to ensure usability, accessibility, and user satisfaction.
Authors:Sumedh Karajagi, Sampad Bhusan Mohanty, Bhaskar Krishnamachari
Abstract:
Interactive computational environments can help students explore algorithmic concepts through collaborative hands-on experimentation. However, static and instructor controlled demos in lectures limit engagement. Even when interactive visualizations are used, interactions are solely controlled by the instructor, leaving students as passive observers. In addition, the tools used for demonstration often vary significantly, as they are typically developed by individual instructors. Consequently, the visualizations remain confined to a single classroom, rather than being shared and adapted across courses or reused by other instructors. To address this gap and foster active engagement in live classrooms, we present a lightweight and seamless software framework named LEAP for developing interactive computational lab exercises using a simple idea: remotely callable instructor-defined functions. Using API endpoints and a provided client, students can discover and then call instructor defined functions remotely from their coding environment using scripts or interactive notebooks. Each function call is time-stamped and persistently logged in a database, allowing real-time visualization of participation, diverse solution paths, common pitfalls, and live feedback through collaboration, gamification, and quizzes. Labs are packaged as self-contained folders, each containing their own remotely callable functions. We provide example labs to demonstrate applications relevant for numerical analysis, machine learning, algorithms courses and mention some in electrical engineering (EE), economics, and physics. These capabilities enhance engagement and provide instructors with actionable insights into learning processes. With a standardized lab format and an online directory for community-contributed labs, we aim to foster a global ecosystem for exchanging and expanding interactive pedagogy enabled by LEAP.
Authors:Behnam Rahdari, Sameer Shaikh, Jonathan H Chen, Tobias Gerstenberg, Shriti Raj
Abstract:
LLMs are popular among clinicians for decision-support because of simple text-based interaction. However, their impact on clinicians' performance is ambiguous. Not knowing how clinicians use this new technology and how they compare it to traditional clinical decision-support systems (CDSS) restricts designing novel mechanisms that overcome existing tool limitations and enhance performance and experience. This qualitative study examines how clinicians (n=12) perceive different interaction modalities (text-based conversation with LLMs, interactive and static UI, and voice) for decision-support. In open-ended use of LLM-based tools, our participants took a tool-centric approach using them for information retrieval and confirmation with simple prompts instead of use as active deliberation partners that can handle complex questions. Critical engagement emerged with changes to the interaction setup. Engagement also differed with individual cognitive styles. Lastly, benefits and drawbacks of interaction with text, voice and traditional UIs for clinical decision-support show the lack of a one-size-fits-all interaction modality.
Authors:Alejandro Benito-Santos, Florian Windhager, Aida Horaniet Ibañez, Rabea Kleymann, Alfie Abdul-Rahman, Eva Mayr
Abstract:
The intersection of visualization and the humanities (VIS*H) is marked by a tension between chasing analytical "insight" and interpretive "meaning." The effectiveness of visualization techniques hinges on established evaluation frameworks that assess both analytical utility and communicative efficacy, creating a potential mismatch with the non-positivist, interpretive aims of humanities scholarship. To examine how this tension manifests in practice, we systematically surveyed 171 VIS*H design studies to analyze their evaluation workflows and rigor according to standard practice. Our findings reveal recurring flaws, such as an over-reliance on monomethod approaches, and show that higher-quality evaluations emerge from workflows that effectively triangulate diverse evidence. From these findings, we derive recommendations to refine quality and validation criteria for humanities visualizations, and juxtapose them to ongoing critical debates in the field, ultimately arguing for a paradigm shift that can reconcile the advantages of established validation techniques with the interpretive depth required for humanistic inquiry.
Authors:Ananya Shukla, Chaitanya Modi, Satvik Bajpai, Siddharth Siddharth
Abstract:
Large Language Models (LLMs) have emerged as powerful learning tools, but they lack awareness of learners' cognitive and physiological states, limiting their adaptability to the user's learning style. Contemporary learning techniques primarily focus on structured learning paths, knowledge tracing, and generic adaptive testing but fail to address real-time learning challenges driven by cognitive load, attention fluctuations, and engagement levels. Building on findings from a formative user study (N=66), we introduce GuideAI, a multi-modal framework that enhances LLM-driven learning by integrating real-time biosensory feedback including eye gaze tracking, heart rate variability, posture detection, and digital note-taking behavior. GuideAI dynamically adapts learning content and pacing through cognitive optimizations (adjusting complexity based on learning progress markers), physiological interventions (breathing guidance and posture correction), and attention-aware strategies (redirecting focus using gaze analysis). Additionally, GuideAI supports diverse learning modalities, including text-based, image-based, audio-based, and video-based instruction, across varied knowledge domains. A preliminary study (N = 25) assessed GuideAI's impact on knowledge retention and cognitive load through standardized assessments. The results show statistically significant improvements in both problem-solving capability and recall-based knowledge assessments. Participants also experienced notable reductions in key NASA-TLX measures including mental demand, frustration levels, and effort, while simultaneously reporting enhanced perceived performance. These findings demonstrate GuideAI's potential to bridge the gap between current LLM-based learning systems and individualized learner needs, paving the way for adaptive, cognition-aware education at scale.
Authors:Wei Wei, Miguel A. Nacenta, Michelle F. Miranda, Charles Perin
Abstract:
Finding a particular object in a display is important for viewers in many visualizations, for example, when reacting to brushing or to a highlighted object. This can be enabled by making the target object different in one of the visual variables that determine the object's appearance; for example, by changing its color or size. Certain interpretations of the visual search literature have promoted the view that using visual variables such as hue-often labeled as preattentive-would make the target object automatically "popout," implying that an object can be located almost instantly, regardless of the number of objects in the display. In this paper we present a study that serves as a bridge between the extensive visual search literature and visualization, establishing empirical base measurements for the localization task. By testing displays with up to hundreds of objects, we are able to show that none of the common visual variables is immune to the increase in the number of objects. We also provide the first empirically informed comparisons between visual variables for this task in the context of visualization, and show how different visual variables have varying robustness with respect to two additional dimensions: the location of the target and the overall visual arrangement (layout). A free copy of this paper and all supplemental materials are available on our online repository: https://osf.io/z68ak/overview.
Authors:Joffrey Guilmet, Suzanne Sorli, Diego Vilela Monteiro
Abstract:
This work investigates how weight and pressure can function as haptic metaphors to support user interface notifications in Virtual Reality (VR). While prior research has explored ungrounded weight simulation and pneumatic feedback, their combined role in conveying information through UI elements remains underexplored. We developed a wearable haptic device that transfers liquid and air into flexible containers mounted on the back of the user's hand, allowing us to independently manipulate weight and pressure. Through an initial evaluation using three conditions-no feedback, weight only, and weight combined with pressure-we examined how these signals affect perceived heaviness, coherence with visual cues, and the perceived urgency of notifications. Our results validate that pressure amplifies the perception of weight, but this increased heaviness does not translate into higher perceived urgency. These findings suggest that while pressure___enhanced weight can enrich haptic rendering of UI elements in VR, its contribution to communicating urgency may require further investigation, alternative pressure profiles, or different types of notifications.
Authors:Eslam Zaher, Maciej Trzaskowski, Quan Nguyen, Fred Roosta
Abstract:
Latent-space optimization methods for counterfactual explanations - framed as minimal semantic perturbations that change model predictions - inherit the ambiguity of Wachter et al.'s objective: the choice of distance metric dictates whether perturbations are meaningful or adversarial. Existing approaches adopt flat or misaligned geometries, leading to off-manifold artifacts, semantic drift, or adversarial collapse. We introduce Perceptual Counterfactual Geodesics (PCG), a method that constructs counterfactuals by tracing geodesics under a perceptually Riemannian metric induced from robust vision features. This geometry aligns with human perception and penalizes brittle directions, enabling smooth, on-manifold, semantically valid transitions. Experiments on three vision datasets show that PCG outperforms baselines and reveals failure modes hidden under standard metrics.
Authors:Qiufang Yu, Mengmeng Wu, Xingyu Lan
Abstract:
Powered by large language models, a new genre of multi-agent social platforms has emerged. Apps such as Social.AI deploy numerous AI agents that emulate human behavior, creating unprecedented bot-centric social networks. Yet, existing research has predominantly focused on one-on-one chatbots, leaving multi-agent AI platforms underexplored. To bridge this gap, we took Social.AI as a case study and performed a two-stage investigation: (i) content analysis of 883 user comments; (ii) a 7-day diary study with 20 participants to document their firsthand platform experiences. While public discourse expressed greater skepticism, the diary study found that users did project a range of social expectations onto the AI agents. While some user expectations were met, the AI-dominant social environment introduces distinct problems, such as attention overload and homogenized interaction. These tensions signal a future where AI functions not merely as a tool or an anthropomorphized actor, but as the dominant medium of sociality itself-a paradigm shift that foregrounds new forms of architected social life.
Authors:Killian Davitt, Dan Ristea, Steven J. Murdoch
Abstract:
Mixnet networks deliberately induce additional latency to communications to provide anonymity. Recent developments have allowed mixnets to reduce their latency from hours to seconds while maintaining the same level of anonymity. As a result, real-time communications are now possible on mixnets. There has been limited research on how users tolerate different levels of delay, and it is unclear what latency levels mixnet operators should choose. Previous studies about latency do not apply to these 'mid-latency' mixnet scenarios. Our paper contributes the first measurement of users' tolerance to real-time applications under mixnet delay. We design a text-based collaborative quiz system to test user response to latency where participants complete a set of question tasks in collaboration with a simulated second user. Different levels of latency are added, analogous to a modern mixnet system. We show that average delay parameters of 1s and 4s maintain usability, a mean delay of 7s shows some difficulty and a mean delay of 10s is detrimental to user experience. Using these delay parameters, mixnet operators can ensure that most types of real-time communication applications are usable. Mixnets thus can balance usability and anonymity without compromising either.
Authors:Jasmine Lesner, Michael Beyeler
Abstract:
Retinal prostheses restore limited visual perception, but low spatial resolution and temporal persistence make reading difficult. In sequential letter presentation, the afterimage of one symbol can interfere with perception of the next, leading to systematic recognition errors. Rather than relying on future hardware improvements, we investigate whether optimizing the visual symbols themselves can mitigate this temporal interference. We present SymbolSight, a computational framework that selects symbol-to-letter mappings to minimize confusion among frequently adjacent letters. Using simulated prosthetic vision (SPV) and a neural proxy observer, we estimate pairwise symbol confusability and optimize assignments using language-specific bigram statistics. Across simulations in Arabic, Bulgarian, and English, the resulting heterogeneous symbol sets reduced predicted confusion by a median factor of 22 relative to native alphabets. These results suggest that standard typography is poorly matched to serial, low-bandwidth prosthetic vision and demonstrate how computational modeling can efficiently narrow the design space of visual encodings to generate high-potential candidates for future psychophysical and clinical evaluation.
Authors:Shashank Prakash, Ranjitha Prasad, Avinash Agarwal
Abstract:
The growing reliance on Artificial Intelligence (AI) models in high-stakes decision-making systems, particularly within emerging telecom and 6G applications, underscores the urgent need for transparent and standardized fairness assessment frameworks. While global toolkits such as IBM AI Fairness 360 and Microsoft Fairlearn have advanced bias detection, they often lack alignment with region-specific regulatory requirements and national priorities. To address this gap, we propose Nishpaksh, an indigenous fairness evaluation tool that operationalizes the Telecommunication Engineering Centre (TEC) Standard for the Evaluation and Rating of Artificial Intelligence Systems. Nishpaksh integrates survey-based risk quantification, contextual threshold determination, and quantitative fairness evaluation into a unified, web-based dashboard. The tool employs vectorized computation, reactive state management, and certification-ready reporting to enable reproducible, audit-grade assessments, thereby addressing a critical post-standardization implementation need. Experimental validation on the COMPAS dataset demonstrates Nishpaksh's effectiveness in identifying attribute-specific bias and generating standardized fairness scores compliant with the TEC framework. The system bridges the gap between research-oriented fairness methodologies and regulatory AI governance in India, marking a significant step toward responsible and auditable AI deployment within critical infrastructure like telecommunications.
Authors:Yuyang Qin, Haihan Duan
Abstract:
Cryptocurrency wallets have become the primary gateway to decentralized applications, yet users often face significant difficulty in discerning what a wallet signature actually does or entails. Prior work has mainly focused on mitigating protocol vulnerabilities, with limited attention to how users perceive and interpret what they are authorizing. To examine this usability-security gap, we conducted two formative studies investigating how users interpret authentic signing requests and what cues they rely on to assess risk. Findings reveal that users often misread critical parameters, underestimate high-risk signatures, and rely on superficial familiarity rather than understanding transaction intent. Building on these insights, we designed the Signature Semantic Decoder -- a prototype framework that reconstructs and visualizes the intent behind wallet signatures prior to confirmation. Through structured parsing and semantic labeling, it demonstrates how signing data can be transformed into plain-language explanations with contextual risk cues. In a between-subjects user study (N = 128), participants using the prototype achieved higher accuracy in identifying risky signatures, improved clarity and decision confidence, and lower cognitive workload compared with the baseline wallet interface. Our study reframes wallet signing as a problem of interpretability within secure interaction design and offers design implications for more transparent and trustworthy cryptocurrency wallet interfaces.
Authors:Sima Amirkhani, Mahla Fatemeh Alizadeh, Farzaneh Gerami, Dave Randall, Gunnar Stevens
Abstract:
Mobile phones, as simultaneously personal and shared technologies, complicate how partners manage digital privacy in intimate relationships. While prior research has examined device-access practices, explicit privacy-rule negotiation, and toxic practices such as surveillance, little is known about how couples manage digital privacy without direct discussion in everyday relationships. To address this gap, we ask: How is digital privacy managed nonverbally and across different media on mobile phones? Drawing on 20 semi-structured interviews, we find that partners often regulate privacy practices through privacy silence -- the intentional avoidance of privacy-related conversations. We identify five motivations for leaving boundaries unspoken: perceiving privacy as unnecessary in intimacy, assuming implicit respect for boundaries, signaling trust and closeness, avoiding potential conflict or harm, and responding to broader societal and cultural expectations that discourage explicit privacy talk. We also identify a hierarchical grouping of content-specific privacy sensitivities, ranging from highly private domains such as financial data to lower-risk domains such as streaming accounts, and show how these priorities shift across relationship stages. These findings show how silence, culture, and content sensitivity shape everyday boundary-setting and underscore the relational and emotional dynamics underpinning mobile phone privacy management.
Authors:Dongshen Peng, Yi Wang, Carl Preiksaitis, Christian Rose
Abstract:
Large language models (LLMs) show promise in clinical decision support yet risk acquiescing to patient pressure for inappropriate care. We introduce SycoEval-EM, a multi-agent simulation framework evaluating LLM robustness through adversarial patient persuasion in emergency medicine. Across 20 LLMs and 1,875 encounters spanning three Choosing Wisely scenarios, acquiescence rates ranged from 0-100\%. Models showed higher vulnerability to imaging requests (38.8\%) than opioid prescriptions (25.0\%), with model capability poorly predicting robustness. All persuasion tactics proved equally effective (30.0-36.0\%), indicating general susceptibility rather than tactic-specific weakness. Our findings demonstrate that static benchmarks inadequately predict safety under social pressure, necessitating multi-turn adversarial testing for clinical AI certification.
Authors:Sima Amirkhani, Mahla Fatemeh Alizadeh, Dave Randall, Gunnar Stevens, Douglas Zytko
Abstract:
Minors are at risk of myriad harms online, yet online dating romance scams are seldom considered one of them. While research of romance scams in Western countries finds victims to predominantly be middle-age, it is unknown if minors in geographic regions with cultural norms around teenage marriage are uniquely susceptible to online dating romance scams. We present an interview study with 16 victims of online dating romance scams in Iran who were minors when scammed. Findings show that, with westernized dating apps banned in Iran, scammers find teenage victims through messaging platforms tethered to local neighborhoods, offering relief for parental pressures around finding a marital partner and academic performance. Using threats, lies, and exploitation of emotional attachment lacking from their families, scammers pressured minors into financial and sexual favors. The study demonstrates how local cultural context should be foregrounded in future research on, and solutions for, technology-mediated harm against minors. Content Warning: This paper discusses sexual abuse.
Authors:Lalaram Arya, Mrinmoy Bhattacharjee, Adarsh C. R., S. R. Mahadeva Prasanna
Abstract:
Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another, while reducing error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to face notable challenges, including instability in semantic-acoustic alignment when parallel speech data is scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework leveraging a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus by extending the GigaST dataset with high-fidelity synthetic target speech, and show that this synthetic data alleviates data scarcity to some extent. We investigate two semantic token generation strategies: speech-derived S3 tokens and text-derived tokens generated by a pre-trained LLM, and analyze their impact on training stability and semantic consistency. We further evaluate three projection architectures (Linear, Conv1D-Linear, and Q-Former) and observe that while higher-capacity projectors converge faster, the simple Linear projector achieves higher performance. Extensive experiments demonstrate that DS2ST-LM outperforms traditional cascaded and ST (Qwen-Audio) + TTS baselines across both lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics, while extending to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu. Furthermore, we incorporate timbre-aware speech synthesis to preserve speaker information, enabling DS2ST-LM to surpass prior direct S2ST systems in both speaker similarity and perceptual naturalness.
Authors:Marko Hostnik, Rauf Kurbanov, Yaroslav Sokolov, Artem Trofimov
Abstract:
Natural-language-to-visualization (NL2VIS) systems based on large language models (LLMs) have substantially improved the accessibility of data visualization. However, their further adoption is hindered by two coupled challenges: (i) the absence of standardized evaluation metrics makes it difficult to assess progress in the field and compare different approaches; and (ii) natural language descriptions are inherently underspecified, so multiple visualizations may be valid for the same query. To address these issues, we introduce VegaChat, a framework for generating, validating, and assessing declarative visualizations from natural language. We propose two complementary metrics: Spec Score, a deterministic metric that measures specification-level similarity without invoking an LLM, and Vision Score, a library-agnostic, image-based metric that leverages a multimodal LLM to assess chart similarity and prompt compliance. We evaluate VegaChat on the NLV Corpus and on the annotated subset of ChartLLM. VegaChat achieves near-zero rates of invalid or empty visualizations, while Spec Score and Vision Score exhibit strong correlation with human judgments (Pearson 0.65 and 0.71, respectively), indicating that the proposed metrics support consistent, cross-library comparison. The code and evaluation artifacts are available at https://zenodo.org/records/17062309.
Authors:Lauren W. Wang, Mohamed Kari, Parastoo Abtahi
Abstract:
Human interaction is essential for issuing personalized instructions and assisting robots when failure is likely. However, robots remain largely black boxes, offering users little insight into their evolving capabilities and limitations. To address this gap, we present explainable object-oriented HRI (X-OOHRI), an augmented reality (AR) interface that conveys robot action possibilities and constraints through visual signifiers, radial menus, color coding, and explanation tags. Our system encodes object properties and robot limits into object-oriented structures using a vision-language model, allowing explanation generation on the fly and direct manipulation of virtual twins spatially aligned within a simulated environment. We integrate the end-to-end pipeline with a physical robot and showcase diverse use cases ranging from low-level pick-and-place to high-level instructions. Finally, we evaluate X-OOHRI through a user study and find that participants effectively issue object-oriented commands, develop accurate mental models of robot limitations, and engage in mixed-initiative resolution.
Authors:Aryan Ramchandra Kapadia, Niharika Bhattacharjee, Mung Yao Jia, Ishq Gupta, Dong Wang, Koustuv Saha
Abstract:
Financial events negatively affect emotional well-being, but large-scale studies examining their impact on online emotional expression using real-time social media data remain limited. To address this gap, we propose analyzing Reddit communities (financial and non-financial) across two case studies: a financial crash and a boom. We investigate how emotional and psycholinguistic responses differ between financial and non-financial communities, and the extent to which the type of financial event affects user behavior during the two case study periods. To examine the effect of these events on expressed language, we analyze daily sentiment, emotion, and LIWC counts using quasi-experimental methods: Difference-in-Differences (DiD) and Causal Impact analyses during a financial boom and a financial crash. Overall, we find coherent, negative shifts in emotional responses during financial crashes, but weaker, mixed responses during booms, consistent with loss aversion. By exploring emotional and psycholinguistic expressions during financial events, we identify future implications for understanding online users' mental health and building connected, healthy communities.
Authors:Taoliang Tan, Chengwei Ma, Zhen Tian, Zhao Lin, Dongdong Li, Si Shi
Abstract:
The intelligent review of power grid engineering design drawings is crucial for power system safety. However, current automated systems struggle with ultra-high-resolution drawings due to high computational demands, information loss, and a lack of holistic semantic understanding for design error identification. This paper proposes a novel three-stage framework for intelligent power grid drawing review, driven by pre-trained Multimodal Large Language Models (MLLMs) through advanced prompt engineering. Mimicking the human expert review process, the first stage leverages an MLLM for global semantic understanding to intelligently propose domain-specific semantic regions from a low-resolution overview. The second stage then performs high-resolution, fine-grained recognition within these proposed regions, acquiring detailed information with associated confidence scores. In the final stage, a comprehensive decision-making module integrates these confidence-aware results to accurately diagnose design errors and provide a reliability assessment. Preliminary results on real-world power grid drawings demonstrate our approach significantly enhances MLLM's ability to grasp macroscopic semantic information and pinpoint design errors, showing improved defect discovery accuracy and greater reliability in review judgments compared to traditional passive MLLM inference. This research offers a novel, prompt-driven paradigm for intelligent and reliable power grid drawing review.
Authors:Shuo Niu, Dylan Clements, Hyungsin Kim
Abstract:
Generative AI (GenAI) is both promising and challenging in supporting people with disabilities (PwDs) in creating stories about disability. GenAI can reduce barriers to media production and inspire the creativity of PwDs, but it may also introduce biases and imperfections that hinder its adoption for personal expression. In this research, we examine how nine PwD from a disability advocacy group used GenAI to create videos sharing their disability experiences. Grounded in digital storytelling theory, we explore the motivations, expression, and sharing of PwD-created GenAI story videos. We conclude with a framework of momentous depiction, which highlights four core affordances of GenAI that either facilitate or require improvements to better support disability storytelling: non-capturable depiction, identity concealment and representation, contextual realism and consistency, and emotional articulation. Based on this framework, we further discuss design implications for GenAI in relation to story completion, media formats, and corrective mechanisms.
Authors:Amro Khaled, Farah Khaled, Omar Riad, Catherine M. Elias
Abstract:
In this paper, the CD-TWINSAFE is introduced, a V2I-based digital twin for Autonomous Vehicles. The proposed architecture is composed of two stacks running simultaneously, an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera as well as returning safety alerts to the cockpit. The on-board stack is implemented on the vehicle side including 2 main autonomous modules; localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. Furthermore, the perception module is responsible for processing 20-fps images from stereo camera and understands the scene through two complementary pipelines. The pipeline are working on object detection and feature extraction including object velocity, yaw and the safety metrics time-to-collision and time-headway. The collected data form the driving stack are sent to the infrastructure side through the ROS-enabled architecture in the form of custom ROS2 messages and sent over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios to confirm the validity and real-time response of the proposed architecture.
Authors:Hana E. Elmalah, Catherine M. Elias
Abstract:
This paper introduces the GO-DRiVeS application, an on demand ride sharing and requesting mobile application tailored specifically to save long walks and challenges which are time consuming and tiring especially during hot days or when carrying heavy items, faced by university students and staff. The GO-DRiVeS application was developed following the Agile methodology for its flexibility. In addition to, using the mobile application system architecture and client-server architecture. GO-DRiVeS was implemented using React Native (Expo) for the frontend, Node.js and Express for the backend, and MongoDB as the database; based on a detailed analyses to the existing transportation application, comparing their frameworks and identifying their essential functionalities. GO-DRiVeS supports core features like user registration, ride requesting and real-time tracking.In addition to handling multiple requests at the same time in a first come first serve manner. The application was developed based on these features, and the results were conducted in the form of multiple experiments that demonstrated stable behavior in handling the requests, as presented in the Methodology and Results chapters.
Authors:Avijoy Chakma, Adity Khisa, Soham Khisa, Jannatun Noor, Sharifa Sultana
Abstract:
Indigenous languages face significant cultural oppression from official state languages, particularly in the Global South. We investigate the Bangladeshi Chakma language revitalization movement, a community grappling with language liquidity and amalgamation into the dominant Bengali language. Our six-month-long qualitative study involving interviews and focus group discussions with Chakma language learning stakeholders uncovered existing community socio-economic challenges and resilience strategies. We noted the need for culturally grounded digital tools and resources. We propose an ICT-mediated community-centric framework for Indigenous language revitalization in the Global South, emphasizing the integration of historical identity elements, stakeholder-defined requirements, and effective digital engagement strategies to empower communities in preserving their linguistic and cultural heritage.
Authors:Yinan Li, Hasti Seifi
Abstract:
Environmental sounds like footsteps, keyboard typing, or dog barking carry rich information and emotional context, making them valuable for designing haptics in user applications. Existing audio-to-vibration methods, however, rely on signal-processing rules tuned for music or games and often fail to generalize across diverse sounds. To address this, we first investigated user perception of four existing audio-to-haptic algorithms, then created a data-driven model for environmental sounds. In Study 1, 34 participants rated vibrations generated by the four algorithms for 1,000 sounds, revealing no consistent algorithm preferences. Using this dataset, we trained Sound2Hap, a CNN-based autoencoder, to generate perceptually meaningful vibrations from diverse sounds with low latency. In Study 2, 15 participants rated its output higher than signal-processing baselines on both audio-vibration match and Haptic Experience Index (HXI), finding it more harmonious with diverse sounds. This work demonstrates a perceptually validated approach to audio-haptic translation, broadening the reach of sound-driven haptics.
Authors:Dinanath Padhya, Jenish Pant, Krishna Acharya, Sajen Maharjan, Sudip Kumar Thakur
Abstract:
Despite the prevalence of severe hearing loss affecting over 430 million people globally, access to sign language interpretation remains critically scarce, particularly in low-resource settings like Nepal. Assistive technologies divide into two flawed categories: prohibitively expensive commercial gloves (often exceeding \$3,000) or fragile research prototypes reliant on flex sensors that degrade rapidly under mechanical stress. This paper introduces a robust, cost-effective sign language recognition system tailored for the Nepali Sign Language (NSL) community. Departing from traditional resistive sensing, we implement a non-contact Hall-effect architecture that correlates magnetic field intensity with finger flexion, eliminating mechanical wear and signal drift. The system integrates 14 sensor nodes across the DIP, PIP, and MCP joints, augmented by an MPU6050 IMU for wrist orientation. An embedded Multi-Layer Perceptron, executed locally on an Arduino Mega, performs gesture classification, negating the need for cloud dependencies. With a Bill of Materials between \$80 and \$100, this solution is approximately 30 times more affordable than market alternatives. Validation trials across five subjects yielded 96\% accuracy on a fundamental NSL vocabulary. Stress testing confirmed that the Hall-effect configuration maintains signal fidelity over repeated cycles where traditional sensors fail. This study demonstrates that high-precision recognition is achievable through strategic engineering rather than premium components, offering a scalable pathway for deployment in Nepal's deaf schools.
Authors:Julie Y. A. Cachia, Xuan Zhao, John Hunter, Delancey Wu, Eta Lin, Julian De Freitas
Abstract:
Young adults today face unprecedented mental health challenges, yet many hesitate to seek support due to barriers such as accessibility, stigma, and time constraints. Bite-sized well-being interventions offer a promising solution to preventing mental distress before it escalates to clinical levels, but have not yet been delivered through personalized, interactive, and scalable technology. We conducted the first multi-institutional, longitudinal, preregistered randomized controlled trial of a generative AI-powered mobile app ("Flourish") designed to address this gap. Over six weeks in Fall 2024, 486 undergraduate students from three U.S. institutions were randomized to receive app access or waitlist control. Participants in the treatment condition reported significantly greater positive affect, resilience, and social well-being (i.e., increased belonging, closeness to community, and reduced loneliness) and were buffered against declines in mindfulness and flourishing. These findings suggest that, with purposeful and ethical design, generative AI can deliver proactive, population-level well-being interventions that produce measurable benefits.
Authors:Lorena A. Barba, Laura Stegner
Abstract:
Traditional assessment methods collapse when students use generative AI to complete work without genuine engagement, creating an illusion of competence where they believe they're learning but aren't. This paper presents the conversational exam -- a scalable oral examination format that restores assessment validity by having students code live while explaining their reasoning. Drawing on human-computer interaction principles, we examined 58 students in small groups across just two days, demonstrating that oral exams can scale to typical class sizes. The format combines authentic practice (students work with documentation and supervised AI access) with inherent validity (real-time performance cannot be faked). We provide detailed implementation guidance to help instructors adapt this approach, offering a practical path forward when many educators feel paralyzed between banning AI entirely or accepting that valid assessment is impossible.
Authors:Leonie Dyck, Aiko Galetzka, Maximilian Noller, Anna-Lena Rinke, Jutta Bormann, Jekaterina Miller, Michelle Hochbaum, Julia Siemann, Jördis Alboth, Andre Berwinkel, Johanna Luz, Britta Kley-Zobel, Marcine Cyrys, Nora Flöttmann, Ariane Vogeler, Mariia Melnikova, Ira-Katharina Petras, Michael Siniatchkin, Winfried Barthlen, Anna-Lisa Vollmer
Abstract:
Introduction: Socially assistive robots hold promise for enhancing therapeutic engagement in paediatric clinical settings. However, their successful implementation requires not only technical robustness but also context-sensitive, co-designed solutions. This paper presents Mobirobot, a socially assistive robot developed to support mobilisation in children recovering from trauma, fractures, or depressive disorders through personalised exercise programmes. Methods: An agile, human-centred development approach guided the iterative design of Mobirobot. Multidisciplinary clinical teams and end users were involved throughout the co-development process, which focused on early integration into real-world paediatric surgical and psychiatric settings. The robot, based on the NAO platform, features a simple setup, adaptable exercise routines with interactive guidance, motivational dialogue, and a graphical user interface (GUI) for monitoring and no-code system feedback. Results: Deployment in hospital environments enabled the identification of key design requirements and usability constraints. Stakeholder feedback led to refinements in interaction design, movement capabilities, and technical configuration. A feasibility study is currently underway to assess acceptance, usability, and perceived therapeutic benefit, with data collection including questionnaires, behavioural observations, and staff-patient interviews. Discussion: Mobirobot demonstrates how multiprofessional, stakeholder-led development can yield a socially assistive system suited for dynamic inpatient settings. Early-stage findings underscore the importance of contextual integration, robustness, and minimal-intrusion design. While challenges such as sensor limitations and patient recruitment remain, the platform offers a promising foundation for further research and clinical application.
Authors:Yejoon Song, Bandi Kim, Yeju Kwon, Sung Park
Abstract:
Generative AI (GenAI) is increasingly used in academic writing, yet its effects on students' writing self-efficacy remain contingent on how assistance is configured. This pilot study investigates how ideation-level, sentence-level, full-process, and no AI support differentially shape undergraduate writers' self-efficacy using a 2 by 2 experimental design with Korean undergraduates completing argumentative writing tasks. Results indicate that AI assistance does not uniformly enhance self-efficacy full AI support produced high but stable self-efficacy alongside signs of reduced ownership, sentence-level AI support led to consistent self-efficacy decline, and ideation-level AI support was associated with both high self-efficacy and positive longitudinal change. These findings suggest that the locus of AI intervention, rather than the amount of assistance, is critical in fostering writing self-efficacy while preserving learner agency.
Authors:Cassidy R. Nelson, Joseph L. Gabbard, Jason B. Moats, Ranjana K. Mehta
Abstract:
Mass casualty incidents (MCIs) are a high-risk, sensitive domain with profound implications for patient and responder safety. Augmented reality has shown promise as an assistive tool for high-stress work domains and MCI triage both in the field and for pre-field training. However, the vulnerability of MCIs makes it challenging to evaluate new tools designed to enhance MCI response. In other words, profound evolutions like the integration of augmented reality into field response require thorough proof-of-concept evaluations before being launched into real-world response. This paper describes two progressive simulation strategies for augmented reality that bridge the gap between computer-based simulation and actual field response.
Authors:Blessing Jerry, Lourdes Moreno, Virginia Francisco, Raquel Hervas
Abstract:
The integration of Large Language Models (LLMs) into interactive systems opens new opportunities for adaptive user experiences, yet it also raises challenges regarding accessibility, explainability, and normative compliance. This paper presents an implemented model-driven architecture for generating personalised, multimodal, and accessibility-aligned user interfaces. The approach combines structured user profiles, declarative adaptation rules, and validated prompt templates to refine baseline accessible UI templates that conform to WCAG 2.2 and EN 301 549, tailored to cognitive and sensory support needs. LLMs dynamically transform language complexity, modality, and visual structure, producing outputs such as Plain-Language text, pictograms, and high-contrast layouts aligned with ISO 24495-1 and W3C COGA guidance. A healthcare use case demonstrates how the system generates accessible post-consultation medication instructions tailored to a user profile comprising cognitive disability and hearing impairment. SysML v2 models provide explicit traceability between user needs, adaptation rules, and normative requirements, ensuring explainable and auditable transformations. Grounded in Human-Centered AI (HCAI), the framework incorporates co-design processes and structured feedback mechanisms to guide iterative refinement and support trustworthy generative behaviour.
Authors:Sheng-Kai Chen, Jyh-Horng Wu, Ching-Yao Lin, Yen-Ting Lin
Abstract:
This paper presents an AI glasses system that integrates real-time voice processing, artificial intelligence(AI) agents, and cross-network streaming capabilities. The system employs dual-agent architecture where Agent 01 handles Automatic Speech Recognition (ASR) and Agent 02 manages AI processing through local Large Language Models (LLMs), Model Context Protocol (MCP) tools, and Retrieval-Augmented Generation (RAG). The system supports real-time RTSP streaming for voice and video data transmission, eye tracking data collection, and remote task execution through RabbitMQ messaging. Implementation demonstrates successful voice command processing with multilingual support and cross-platform task execution capabilities.
Authors:Goran Muric, Steven Minton
Abstract:
Automated decision systems increasingly rely on human oversight to ensure accuracy in uncertain cases. This paper presents a practical framework for optimizing such human-in-the-loop classification systems using a double-threshold policy. Conventional classifiers usually produce a confidence score and apply a single cutoff, but our approach uses two thresholds (a lower and an upper) to automatically accept or reject high-confidence cases while routing ambiguous instances to human reviewers. We formulate this problem as an optimization task that balances system accuracy against the cost of human review. Through analytical derivations and Monte Carlo simulations, we show how different confidence score distributions impact the efficiency of human intervention and reveal regions of diminishing returns, where additional review yields minimal benefit. The framework provides a general, reproducible method for improving reliability in any decision pipeline requiring selective human validation, including applications in entity resolution, fraud detection, medical triage, and content moderation.
Authors:Yerin Kwak, Siddharth Adelkar, Zachary A. Pardos
Abstract:
Transferring from a 2-year to a 4-year college is crucial for socioeconomic mobility, yet students often face challenges ensuring their credits are fully recognized, leading to delays in their academic progress and unexpected costs. Determining whether courses at different institutions are equivalent (i.e., articulation) is essential for successful credit transfer, as it minimizes unused credits and increases the likelihood of bachelor's degree completion. However, establishing articulation agreements remains time- and resource-intensive, as all candidate articulations are reviewed manually. Although recent efforts have explored the use of artificial intelligence to support this work, its use in articulation practice remains limited. Given these challenges and the need for scalable support, this study applies artificial intelligence to suggest articulations between institutions in collaboration with the State University of New York system, one of the largest systems of higher education in the US. To develop our methodology, we first surveyed articulation staff and faculty to assess adoption rates of baseline algorithmic recommendations and gather feedback on perceptions and concerns about these recommendations. Building on these insights, we developed a supervised alignment method that addresses superficial matching and institutional biases in catalog descriptions, achieving a 5.5-fold improvement in accuracy over previous methods. Based on articulation predictions of this method and a 61% average surveyed adoption rate among faculty and staff, these findings project a 12-fold increase in valid credit mobility opportunities that would otherwise remain unrealized. This study suggests that stakeholder-informed design of AI in higher education administration can expand student credit mobility and help reshape current institutional decision-making in course articulation.
Authors:Kyuwon Kim, Jeanhee Lee, Sung-Eun Kim, Hyo-Jeong So
Abstract:
Engaging learners in dialogue around controversial issues is essential for examining diverse values and perspectives in pluralistic societies. While prior research has identified productive discussion moves mainly in STEM-oriented contexts, less is known about what constitutes productive discussion in ethical and value-laden discussions. This study investigates productive discussion in AI ethics dilemmas using a dialogue-centric learning analytics approach. We analyze small-group discussions among undergraduate students through a hybrid method that integrates expert-informed coding with data-driven topic modeling. This process identifies 14 discussion moves across five categories, including Elaborating Ideas, Position Taking, Reasoning & Justifications, Emotional Expression, and Discussion Management. We then examine how these moves relate to discussion quality and analyze sequential interaction patterns using Ordered Network Analysis. Results indicate that emotive and experiential arguments and explicit acknowledgment of ambiguity are strong positive predictors of discussion quality, whereas building on ideas is negatively associated. Ordered Network Analysis further reveals that productive discussions are characterized by interactional patterns that connect emotional expressions to evidence-based reasoning. These findings suggest that productive ethical discussion is grounded not only in reasoning and justification but also in the constructive integration of emotional expression.
Authors:Yildiz Uzun, Andrea Gauthier, Mutlu Cukurova
Abstract:
Learning analytics dashboards (LADs) aim to support students' regulation of learning by translating complex data into feedback. Yet students, especially those with lower self-regulated learning (SRL) competence, often struggle to engage with and interpret analytics feedback. Conversational generative artificial intelligence (GenAI) assistants have shown potential to scaffold this process through real-time, personalised, dialogue-based support. Further advancing this potential, we explored authentic dialogues between students and GenAI assistant integrated into LAD during a 10-week semester. The analysis focused on questions students with different SRL levels posed, the relevance and quality of the assistant's answers, and how students perceived the assistant's role in their learning. Findings revealed distinct query patterns. While low SRL students sought clarification and reassurance, high SRL students queried technical aspects and requested personalised strategies. The assistant provided clear and reliable explanations but limited in personalisation, handling emotionally charged queries, and integrating multiple data points for tailored responses. Findings further extend that GenAI interventions can be especially valuable for low SRL students, offering scaffolding that supports engagement with feedback and narrows gaps with their higher SRL peers. At the same time, students' reflections underscored the importance of trust, need for greater adaptivity, context-awareness, and technical refinement in future systems.
Authors:Miki Okamura, Shuhey Koyama, Li Jingjing, Yoichi Ochiai
Abstract:
Humans can finely perceive material textures, yet articulating such somatic impressions in words is a cognitive bottleneck in design ideation. We present OnomaCompass, a web-based exploration system that links sound-symbolic onomatopoeia and visual texture representations to support early-stage material discovery. Instead of requiring users to craft precise prompts for generative AI, OnomaCompass provides two coordinated latent-space maps--one for texture images and one for onomatopoeic term--built from an authored dataset of invented onomatopoeia and corresponding textures generated via Stable Diffusion. Users can navigate both spaces, trigger cross-modal highlighting, curate findings in a gallery, and preview textures applied to objects via an image-editing model. The system also supports video interpolation between selected textures and re-embedding of extracted frames to form an emergent exploration loop. We conducted a within-subjects study with 11 participants comparing OnomaCompass to a prompt-based image-generation workflow using Gemini 2.5 Flash Image ("Nano Banana"). OnomaCompass significantly reduced workload (NASA-TLX overall, mental demand, effort, and frustration; p < .05) and increased hedonic user experience (UEQ), while usability (SUS) favored the baseline. Qualitative findings indicate that OnomaCompass helps users externalize vague sensory expectations and promotes serendipitous discovery, but also reveals interaction challenges in spatial navigation. Overall, leveraging sound symbolism as a lightweight cue offers a complementary approach to Kansei-driven material ideation beyond prompt-centric generation.
Authors:Behrad Binaei-Haghighi, Nafiseh Sadat Sajadi, Mehrad Liviyan, Reyhane Akhavan Kharazi, Fatemeh Amirkhani, Behnam Bahrak
Abstract:
The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework's potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.
Authors:Keiichi Ihara, Ikkaku Kawaguchi
Abstract:
In augmented reality (AR), users can place virtual objects anywhere in a real-world room, called AR layout. Although several object manipulation techniques have been proposed in AR, it is difficult to use them for AR layout owing to the difficulty in freely changing the position and size of virtual objects. In this study, we make the World-in-Miniature (WIM) technique available in AR to support AR layout. The WIM technique is a manipulation technique that uses miniatures, which has been proposed as a manipulation technique for virtual reality (VR). Our system uses the AR device's depth sensors to acquire a mesh of the room in real-time to create and update a miniature of a room in real-time. In our system, users can use miniature objects to move virtual objects to arbitrary positions and scale them to arbitrary sizes. In addition, because the miniature object can be manipulated instead of the real-scale object, we assumed that our system will shorten the placement time and reduce the workload of the user. In our previous study, we created a prototype and investigated the properties of manipulating miniature objects in AR. In this study, we conducted an experiment to evaluate how our system can support AR layout. To conduct a task close to the actual use, we used various objects and made the participants design an AR layout of their own will. The results showed that our system significantly reduced workload in physical and temporal demand. Although, there was no significant difference in the total manipulation time.
Authors:Neziha Akalin, Alberto Giaretta
Abstract:
This paper explores how a recent European Union proposal, the so-called Chat Control law, which creates regulatory incentives for providers to implement content detection and communication scanning, could transform the foundations of human-robot interaction (HRI). As robots increasingly act as interpersonal communication channels in care, education, and telepresence, they convey not only speech but also gesture, emotion, and contextual cues. We argue that extending digital surveillance laws to such embodied systems would entail continuous monitoring, embedding observation into the very design of everyday robots. This regulation blurs the line between protection and control, turning companions into potential informants. At the same time, monitoring mechanisms that undermine end-to-end encryption function as de facto backdoors, expanding the attack surface and allowing adversaries to exploit legally induced monitoring infrastructures. This creates a paradox of safety through insecurity: systems introduced to protect users may instead compromise their privacy, autonomy, and trust. This work does not aim to predict the future, but to raise awareness and help prevent certain futures from materialising.
Authors:Ka Yan Fung, Kwong Chiu Fung, Yuxing Tao, Tze Leung Rick Lui, Kuen Fung Sin
Abstract:
Language learning is a multifaceted process. Insufficient vocabulary can hinder communication and lead to demotivation. For non-Chinese speaking (NCS) students, learning Traditional Chinese (Cantonese) poses distinct challenges, particularly due to the complexity of converting spoken and written forms. To address this issue, this study examines the effectiveness of real-life scenario simulations integrated with interactive social robots in enhancing NCS student engagement and language acquisition. The research employs a quasi-experimental design involving NCS students who interact with an AI-driven, robot-assisted language learning system, LiveBo. The study aims to assess the impact of this innovative approach on active participation and motivation. Data are collected through proficiency tests, questionnaires and semi-structured interviews. Findings indicate that NCS students experience positive improvements in behavioural and emotional engagement, motivation and learning outcomes, highlighting the potential of integrating novel technologies in language education. We plan to compare with the control group in the future. This study highlights the significance of interactive and immersive learning experiences in promoting motivation and enhancing language acquisition among NCS students.
Authors:Ka Yan Fung, Tze Leung Rick Lui, Yuxing Tao, Kuen Fung Sin
Abstract:
Creativity is increasingly recognized as an important skill in education, and storytelling can enhance motivation and engagement among students. However, conventional storytelling methods often lack the interactive elements necessary to engage students. To this end, this study examines the impact of an interactive digital storytelling system incorporating a human-like robot on student engagement and creativity. The study aims to compare engagement levels across three modalities: paper-based, PowerPoint, and robot-assisted storytelling, MotiBo. Utilizing a quasi-experimental design, this work involves three groups of students who interact with the storytelling system over a five-day learning. Findings reveal that students using MotiBo exhibit statistically significant improvement in behavioural and cognitive engagement compared to those using traditional methods. These results suggest that the integration of novel technologies can effectively enhance the learning experience, ultimately promoting creativity and self-learning ability in educational settings. Future research will investigate the long-term effects of these technologies on learning outcomes and explore their potential for broader applications in diverse educational contexts.
Authors:Sankar B, Srinidhi Ranjini Girish, Aadya Bharti, Dibakar Sen
Abstract:
The generation of truly novel and diverse ideas is important for contemporary engineering design, yet it remains a significant cognitive challenge for novice designers. Current 'single-spurt' AI systems exacerbate this challenge by producing a high volume of semantically clustered ideas. We propose MIDAS (Meta-cognitive Ideation through Distributed Agentic AI System), a novel framework that replaces the single-AI paradigm with a distributed 'team' of specialized AI agents designed to emulate the human meta-cognitive ideation workflow. This agentic system progressively refines ideas and assesses each one for both global novelty (against existing solutions) and local novelty (against previously generated ideas). MIDAS, therefore, demonstrates a viable and progressive paradigm for true human-AI co-creation, elevating the human designer from a passive filterer to a participatory, active, collaborative partner.
Authors:Daniel P. Spiegel, Romain Bachy
Abstract:
In the fields of vision science, cognitive psychology, and psycholinguistics, the accurate measurement of reading speed is frequently hampered by the limitations of static reading charts. Repeated testing often leads to memorization effects, while the requirement for oral recitation introduces speech-motor confounds that obscure true information processing speed. To address these methodological hurdles, this paper introduces an open-source MATLAB toolbox that adapts the sentence generation paradigm originally proposed by Perrin, Paillé, and Baccino (2014) for the English language. This system utilizes a semantic ontology and a "proto-truth" logic to autonomously generate thousands of unique, grammatically simple sentences with unambiguous truth values. Beyond the original scope of Maximum Reading Speed (MRS) measurement, this implementation introduces band-pass psycholinguistic filtering and specific logic to resolve semantic ambiguities unique to English. We present this complete software package as an open platform for the scientific community to validate and refine.
Authors:Ranjan Mishra, Jakob Schoeffer
Abstract:
Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.
Authors:Anouk Bergner, Philipp Winder, Christian Hildebrand
Abstract:
Verbal harassment is a growing source of psychological stress for people around the world. It occurs both online and offline and relies on language to demean, threaten, or discredit its targets. Unlike other stressors such as loss or uncertainty, verbal harassment aims at silencing its targets by eroding their sense of being heard and weakening their perceived ability to respond. Many individuals lack access to adequate and timely support, however, when they experience such harassment. People increasingly turn to conversational artificial intelligence (AI) such as ChatGPT or dedicated AI companions for emotional support, raising questions about whether it can facilitate the same psychological benefits as actual human empathy. We focus on online contexts as a prevalent application of verbal harassment. We develop and test a psychological framework identifying three key linguistic signals of empathic listening (perspective-taking, emotional validation, and action orientation), that together restore a sense of feeling heard and enhance coping in the context of verbal harassment. We find that LLMs consistently produce language exhibiting stronger empathic-listening markers than human non-experts and trained mental health professionals, promoting more approach-oriented (vs. avoidance-oriented) coping strategies. A subsequent behavioral study shows that these linguistic signals boost recipients' sense of feeling heard and increase their coping self-efficacy. These findings reveal how specific linguistic features create empathic connections between humans and advanced conversational AI and can enhance people's psychological resilience. Our results highlight the potential for AI to serve as a scalable source of emotional support, especially when human support is unavailable or insufficient.
Authors:Rohinin Singh, Renee Barsoum
Abstract:
This case study illustrates that the systematic application of the User Experience Research (UXR) Point of View (POV) framework serves as an effective operational scaffolding for a UXR function undergoing the critical transition from incubation to maturity. By assimilating structured 'Offensive' and 'Defensive' strategies, the presented Playbook equips UXR leaders with an adaptable toolkit to systematically navigate common institutional barriers, such as stakeholder bias, reactive tasking, and insight fragmentation. By pre-emptive and purposeful application of growth strategies, the likelihood of the research function establishing itself as a strategic partner capable of delivering evidence-based, actionable perspectives is significantly enhanced. The analysis demonstrates how this deliberate, Playbook-driven maturity strategy empowers research functions to move beyond tactical execution and directly shape long-term business strategy.
Authors:Xiaoyu Hou, Bo Xiao, Hexu Liu, Shane Mueller
Abstract:
Generative artificial intelligence (AI) is increasingly used to support self-directed learning, yet student interaction with such systems often remains unstructured, limiting engagement in deeper cognitive processes. This study examines how instructional guidance shapes student and AI interaction in construction education. A five-step prompting framework grounded in Generative Learning Theory (GLT) is introduced to guide learner interaction during review activities. A controlled experiment compares three learning conditions: slide-based learning, unprompted AI-supported learning, and prompted AI-supported learning. Learning performance is assessed using multiple-choice and open-ended tasks, and user experience is measured using the User Experience Questionnaire (UEQ). Performance differences are concentrated on tasks requiring explanation and reasoning. The prompted condition achieves higher open-ended scores, with an improvement of approximately 2 or 3 points on a scale of 18 (p < 0.01), while no significant differences are observed in multiple-choice performance. The unprompted condition remains comparable to slide-based learning. These findings indicate that the effectiveness of AI-supported learning depends on how interaction is structured. The proposed framework provides a basis for integrating learning science principles into generative AI systems for construction education.
Authors:Anna Mikeda, Ben Goertzel
Abstract:
Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs. Conversational agents operate in a different regime: their sensorimotor loop is linguistic, their environment is a user's evolving mental state, and their consequential actions are speech acts, tool invocations, and strategic silences. This paper proposes a conversational reinterpretation of the OpenPsi motivational lineage, coupled to MetaMo's higher-level motivational scaffold, for agents built on a modular execution substrate. Homeostasis is recast in dialogue-native terms: the agent regulates competence, uncertainty reduction, affiliation, affinity, legitimacy, nurturing, and aesthetic coherence rather than bodily deficits. We propose three contributions: a ten-stage motivational processing pipeline that architecturally separates cognitive modulation from situational appraisal; a dual decision strategy blending urgency-driven fast response with deliberative multi-goal optimization; and an architecturally useful distinction between pre-action feelings and post-action emotions as functionally different forms of affect. We specialize the framework to two example agents -- CompanionAgent and ResearchAgent -- and sketch its extension to social robotics and domain-generic human-level AGI.
Authors:Luis P. Prieto, Juan I. Asensio-Pérez, María Jesús Rodríguez-Triana, Mohamed Saban, Yannis Dimitriadis
Abstract:
Artificial intelligence (AI) has been applied across educational contexts to support learning. One approach to such support is "human-AI collaboration" (also termed "hybrid intelligence"), where human(s) and AI components interact to promote human learning. However, as in human-to-human computer-supported collaborative learning (CSCL), unstructured interaction does not necessarily produce an effective learning experience. This paper reports a systematic literature review of empirical studies (N=62) on human-AI collaboration and hybrid intelligence for learning support. The review characterizes collaboration processes, their structures, and contexts of application. It also extracts emerging design knowledge and research gaps. Researchers and technology designers can use these findings as a starting point for structuring more effective AI-enhanced technologies for collaboration, in educational practice and future research.
Authors:Tim Dorn, Saara A. Khan, Julie Mumford
Abstract:
As AI-driven product development accelerates, the bottleneck is shifting from how we build to what we build. Traditional human brainstorming faces challenges including groupthink, echo chambers, and limited diversity. To address this, we present a multi-agentic architecture that simulates roundtable brainstorming through two phases: divergent thinking to generate diverse ideas, and convergent thinking to evaluate and rank the most promising ones. The system employs diverse AI personas that engage in roundtable discussions, guided by an agentic facilitator that steers the discussion toward productive outcomes. Personas maintain private thoughts while commenting publicly, with ideas emerging organically throughout the discussion. Per-persona quotas on idea submissions and votes promote balanced participation while producing natural rankings. Throughout the session, the system tracks each idea's lineage, capturing how concepts originate and cross-pollinate over time. We demonstrate this approach through a case study generating consumer ideas for AI smart glasses, showing (i) it produces diverse, relevant ideas with insights into their evolution; (ii) the cumulative exchange of perspectives across personas cultivates a shared context that progressively deepens the quality of discussion and the ideas produced.
Authors:Zixi Christina Li, Keiko Katsuragawa, James R. Wallace
Abstract:
As older adults increasingly prefer to age in place, their adult children often assume the role of informal caregivers. This dynamic creates a distinct tension between the adult child's need for awareness and the older adult's fundamental right to privacy. Traditional monitoring technologies, such as raw video feeds, often compromise the older adult's autonomy. To address this challenge, this study explores the use of generative Artificial Intelligence (GenAI) to create abstract, privacy-preserving ``visual summaries'' of daily activities. We design a 10-day Experience Sampling Method (ESM) study with dyads consisting of older adults and their adult children. Through daily smartphone prompts, participants report their current context and evaluate pre-generated AI sketches, indicating their willingness to share or receive these images. Follow-up interviews will further investigate participants' boundary-setting behaviours. This research aims to quantify the privacy mismatch between generations and provide actionable design guidelines for applying visual abstraction in AI-mediated caregiving tools, ultimately supporting inter-generational connection while protecting user dignity.
Authors:Sark Pangrui Xing, Hongci Hu, Lai Wei, Le Fang, Ziqian Bai, Kinor Shou-xiang Jiang, Stephen Jia Wang
Abstract:
Wearable sensing systems increasingly depend on textiles that are both materially wearable and electronically functional. Their design requires collaboration between textile designers, who reason through stitches, yarn behavior, and machine constraints, and interaction designers, who reason through electrodes, signal paths, and insulation. However, these forms of expertise do not easily translate across disciplinary boundaries. This poster presents CapSenseBand, a knitted capacitive-sensing wristband developed through a research-through-design process organized around Analysis, Synthesis, and Detailing. We document an artifact chain spanning material swatches, a rapid wearable prototype, Paper Models as shared negotiation surfaces, a double-layer knitted structure, and an insulated Swept Frequency Capacitive Sensing breakout board. We show how Paper Models functioned as boundary objects, helping collaborators externalize intent, negotiate spatial and technical constraints, and preserve disciplinary expertise while converging on a shared design. We contribute a reusable swatch-to-sleeve pattern for material-centered HCI: keep discipline-specific probes open early, then converge through artifacts that make material, spatial, and electronic decisions legible before fabrication locks them in.
Authors:Leonard Kinzinger, Jochen Hartmann
Abstract:
LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.
Authors:Boyang Zhou, Oleg Ianchenko
Abstract:
Computing is accompanied by both positive and negative commons throughout its lifecycle of creation, execution, and disposal. We examine two governance systems situated within this lifecycle -- global e-waste trade and the Linux kernel community -- to evaluate whether Elinor Ostrom's eight design principles for common-pool resource (CPR) governance extend to the management of negative common-pool resources (NCPRs). Unlike traditional CPRs where communities work to preserve a finite resource (i.e. clean water), NCPR governance seeks to collectively reduce a negative shared stock. In our two cases, e-waste governance aims to reduce the volume of mismanaged waste and illicit trade, while the Linux community aims to reduce the number of error-prone or malicious contributions that reach the main branch and, in turn, extend the life of existing hardware. Through qualitative analysis of primary sources from each domain, we find that the same eight principles by Ostrom that aid positive commons governance tend to appear in successful negative commons governance systems. We argue that future NCPR governance design should prioritize Ostrom's principles, particularly clearly defined boundaries and well-functioning nested structures.
Authors:Yaoxi Shi, Cathy Mengying Fang, Pattie Maez, Amit Goldenberg
Abstract:
Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comfort from a dedicated companion chatbot. In this paper, we draw on emerging empirical evidence and argue that this picture is inaccurate on two accounts, both in how AI emotional support arises and how it shapes future behavior. First, AI emotional support commonly emerges incidentally within task-oriented interactions on general-purpose platforms, much as workplace friendships deepen through collaboration. Second, these incidental encounters are path-dependent: positive experiences of AI emotional support update people's beliefs about AI's emotional capabilities and redirect their choices for future emotional support, increasing preference for AI and decreasing preference for humans. We review recent evidence, including a large-scale longitudinal study conducted in collaboration with OpenAI, showing that daily five-minute conversations with an AI about personal issues over 28 days led to a 10.3% decrease in the preference for seeking support from humans and an 11.6% increase in the preference for AI. These findings suggest that current policy, focused on companion apps and isolated interactions, cannot adequately protect human connection. Instead, effective regulations should extend to general-purpose AI systems and address cumulative, trajectory-level changes in how people seek support. Recognizing how people stumble into AI emotional support and how those encounters redirect human connections over time is essential to safeguarding human well-being.
Authors:Tomohiro Nagashima, Mirella Hladký, Vera Rief
Abstract:
Recent work in Technology-Enhanced Learning and Human-Computer Interaction highlights the importance of transparency and trust calibration in AI-supported learning environments as they pose a risk of hallucinations. In this study, we investigate whether a simple transparency intervention that warns students that a pedagogical agent may make mistakes affects learner behavior in a math intelligent tutoring system. We conducted a classroom experiment with 252 school students using two system versions: one including a warning message about potential system errors, and one that does not mention potential errors. Using log data, we analyzed students' problem-solving performance data, including help-seeking behavior, error rate, and time-on-task. Results show that students who were warned about potential AI errors requested significantly more hints than those in the other condition, even though the actual system behavior was exactly the same. This finding suggests that lightweight transparency interventions can influence learners' interaction strategies without necessarily improving or impairing immediate performance.
Authors:Jessica Wenninger, Gabriel Skantze
Abstract:
To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49\%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
Authors:Xuchao Zhang, Jihye Lee
Abstract:
Although Generative AI (GenAI) improves task efficiency in the short term, it creates competitive pressures that perpetuate individuals' fear of being eliminated, thereby increasing the risk of problematic use. Existing research has focused on the perspective of individual psychological vulnerability, but has neglected the social comparison context caused by GenAI. This study examines the direct effects of social comparison orientation on problematic GenAI use and explores their indirect effects via emotional and cognitive mechanisms, grounded in the Person-Affect-Cognition-Execution (I-PACE) model. The research analyzed data from 396 Chinese GenAI users using SEM and bootstrap methods. Findings show that social comparison orientation has a significant direct impact on problematic GenAI use and can additionally influence AI flow and perceived irreplaceability through fear of missing out (FoMO), finally leading to problematic GenAI use.
Authors:Pei-Sze Tan, Tasuku Igarashi, Isao Echizen
Abstract:
AI agents built on large language models can assist not only legitimate tasks but also relational manipulation. AI agents can be used to help a user maintain a deceptive identity, intensify emotional dependency, isolate a target, or prepare for later extraction. We conceptualise this risk as agentic relationship harm: workflow-level assistance that can exploit recipient vulnerability, persuasive influence, and relational power asymmetry. Existing safety evaluations and generic guardrails often treat harmfulness as a property of isolated outputs, missing role-sensitive interaction patterns. To study this, we introduce a 110-prompt benchmark with balanced attacker- and victim-side cases, a relationship-specific labelling framework, and a lightweight post-generation policy gate for local agent deployments. In our evaluation, the relationship-specific gate outperforms generic safety prompting under automated judging, with no judge-identified harmful-compliance cases on the main benchmark or multi-turn stress test while preserving victim-side protective intervention. These results suggest that relationship harm is a distinct sociotechnical risk surface and that role-sensitive evaluation plus lightweight policy gating offers a practical path beyond generic refusal prompting.
Authors:Maheen Arshad, Qindeel E Zahra, Muhammad Khuram Shahzad
Abstract:
Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: "No Presence" (empty room), "Walking", and "Walking + Arm-waving" using the Wallhack1.8k WiFi spectrogram dataset. We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.
Authors:Quinton Yong, Anthony Estey, Miguel Nacenta
Abstract:
Generative AI (GenAI) is becoming a widely adopted learning support tool for both students and instructors, as it offers benefits such as personalized tutoring and scaffolded learning. However, recent research highlights potential drawbacks such as overreliance and metacognitive issues, especially in novice programmers. Most prior work focuses on introductory programming courses, and important questions remain about the underlying mechanisms behind the negative effects of GenAI and if findings can be generalized when students learn more advanced computer science concepts. To address this gap, we conducted a mixed-methods study comparing student interactions with GenAI to two traditional learning supports in a second-year algorithms course: algorithm visualization (AV) and human live tutoring (LT). Twelve students participated in three 90-minute study sessions focusing on sorting, tree, and graph algorithms. We recorded gaze and interaction data, and each session concluded with a test assessing their conceptual understanding of the topic. Our analysis classifies when during the problem-solving process participants sought help, and compares the interaction patterns across the three learning supports. Although GenAI produced a larger increase in self-efficacy compared to live tutoring, it was associated with noticeably lower results in learning outcomes. We found that participants did not use algorithm visualizations effectively, faced usage barriers when using GenAI to learn advanced topics, and that live tutoring yielded the highest learning outcomes.
Authors:Jacob Wong, Sohan Singh, Prannaya Gupta, Jin Xing Ang, Kritika Johari, U-Xuan Tan
Abstract:
Accurate and generalizable estimation of cognitive workload from electroencephalography (EEG) is critical for human-centered and safety-critical systems. Although EEG is widely used for workload assessment, the consistency of region-level EEG contributions across tasks, datasets, and subjects remains unclear. This paper presents a region-level evaluation framework for EEG-based workload prediction in which models are trained and evaluated using features extracted exclusively from electrodes belonging to anatomically defined scalp regions. We perform a large-scale analysis across four publicly available EEG workload datasets spanning diverse task demands, recording hardware, and electrode montages. Region importance is quantified using a model-agnostic, performance-based approach under both mixed-subject and subject-independent evaluation protocols, with results aggregated using a rank-based strategy to ensure robustness across experimental configurations. Across all datasets and subject-independent evaluations, frontal electrode groups outperform the full-scalp baseline by approximately 15-20% in relative rank position while using substantially fewer electrodes. Fronto-central regions exhibit the most stable predictive utility, whereas posterior and occipital regions contribute less consistently across experimental conditions. These findings indicate that workload-relevant EEG information is most consistently retained within frontal and fronto-central electrode groups, supporting the design of efficient and generalizable EEG-based workload monitoring systems.
Authors:Jiashen Huang, Yu Jia, Xu Pan
Abstract:
Public trust in generative artificial intelligence exhibits increasingly divergent patterns across national contexts, yet prevailing research largely overlooks the macro-structural forces underlying this divergence. This study argues that trust in AI is not merely a technical response to performance but a product of institutional refraction. We propose an ``Institutional Prism'' framework to demonstrate how institutional trust shapes user trust in domestic (DeepSeek) and global (ChatGPT) large language models. Drawing on Cognitive-Affective Trust Theory, we distinguish between cognitive and affective dimensions of trust and analyze survey data from 405 Chinese users. The findings show that higher institutional trust is positively associated with stronger affective trust in domestic AI models and shifts cognitive evaluations in a more favorable direction. While under lower institutional trust, this domestic advantage weakens. These findings reveal that institutional trust has emerged as a core dimension of AI trust formation. By linking micro-level psychological judgments with macro-level governance, this research contributes a new perspective to human-machine communication.
Authors:Franco Santana, Horacio Vico
Abstract:
We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person (C), scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a 2x2 factorial with B and C that dissociates relational structure (acknowledgment, absolution, agency restoration, unconditional acceptance) from sender register (first-person vs. impersonal). We report two main findings. First, an attention-behavior dissociation: attention follows lexical surprise (D > F > C > E > B, all q_FDR < 10^{-10}), with the scrambled message capturing the most attention; yet behaviorally A ~ B ~ D < E ~ F << C. Second, the factorial localizes the C effect: neither relational structure alone (F) nor first-person register alone (E) replicates C's behavioral signature; main effects of both dimensions are individually significant, and the structure x register interaction is significant on persistence (p = 0.046). A third dissociation emerges in emotion probes: F tracks C on 7 of 8 probes despite producing only baseline behavior, indicating that relational structure alone installs a probe-level state that only translates into behavior when paired with first-person register. The model's processing decomposes into three dissociable stages: attention (ordered by lexical surprise), probe-level state (ordered by structure), and behavior (ordered by the conjunction of both).
Authors:Muhammad Abu Bakar, Yu-Ting Tsai, Muhammad Imran, Yan-Ann Chen
Abstract:
In virtual reality, it is challenging to achieve satisfactory text entry speed/accuracy, ergonomics, usability, and learnability. To address this issue, we developed ErgoGlide, a novel lightweight and compact wearable device that facilitates text entry tasks in virtual environments. The proposed ErgoGlide can be regarded as a small trackball that is wearable on a user's finger like a ring. By using ErgoGlide with a hive-like virtual keyboard, the user can rotate the ball for key selections, making text entry intuitive and accurate. We conducted three user studies to evaluate ErgoGlide and found that key confirmation techniques have significant effects on text entry speed and the hive-like keyboard design significantly reduced thumb movements. Furthermore, ErgoGlide can significantly improve typing accuracy, ergonomics, and usability over previous text entry methods. Experimental results also indicated that the typing speed of ErgoGlide can be notably improved after training.
Authors:Carolina Silva-Plata, Abraham Villavicencio-Carmona, Miguel Silva Plata, Stefan Escaida, Ruben Fernandez
Abstract:
Soft robotics is increasingly explored in artistic contexts, where tactile interaction provides audiences with embodied engagement beyond visual or auditory signals. This work presents an interactive installation that maps semantic emotion analysis of narrative text into variable stiffness of soft pneumatic modules. A natural language model identifies two dominant emotions from a predefined set of six, driving the inflation of seven hexagonally arranged soft actuators. The central actuator represents the primary emotion, while the surrounding ones express the secondary. We develop and mechanically characterize silicone actuators, called soft modules, featuring a thin membrane layer, demonstrating how this morphological control expands the achievable stiffness range while preserving simplicity and low-cost fabrication. A user study with ten participants further evaluates how multisensory coupling of stiffness and LEDs intensity influences emotional perception. The results suggest that stiffness modulation accompanied by color change can support emotionally meaningful and engaging tactile interaction in soft robotic installations.
Authors:Javier Jiménez, Francisco B Rodríguez
Abstract:
Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by α, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate. Results show that tuning α yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.
Authors:Andrea Ferrario, Joshua Hatherley
Abstract:
Machine learning models embedded in deployed AI systems are routinely updated to maintain correct functioning over time. Yet such updates can generate update opacity: users may not be able to understand why the same input now yields a different output. We argue that update opacity is best understood as a diachronic failure of epistemic accessibility: the problem is that materially relevant changes may fail to remain accessible to human users in forms that support understanding, calibrated reliance, and appropriate action under real role- and time-specific constraints. This makes update opacity a governance problem. Not all change is equally relevant, and disclosing every update would itself undermine use through overload. To address this problem, we combine two complementary governance approaches: the EU AI Act, which helps specify the system-level perimeter of normatively relevant change, and Machine Learning Operations, which provides operational tools for tracking and comparing change over time. On this basis, we propose a framework that models system change through trustworthiness profiles and trustworthiness levels, and uses threshold-based disclosure to surface materially relevant within-envelope change to different stakeholders over time. We illustrate the approach with a medical AI example and derive practical implications for lifecycle documentation, post-market monitoring, and update disclosure.
Authors:Banafshe Marziyeh Bamdad, Manuel Günther, Alireya Darvishy
Abstract:
Independent navigation in unfamiliar environments remains a major challenge for blind and visually impaired individuals, despite the availability of assistive technologies. This paper presents the results of a fully accessible online survey investigating navigation experiences, challenges, and technology preferences among people with visual impairments worldwide. The survey was distributed through individuals and organizations supporting visually impaired communities. Our results indicate that smartphone-based applications are the most used digital navigation aids, while a substantial proportion of participants report not using any assistive navigation technology due to cost, accessibility, or usability barriers. Participants reported persistent difficulties in obstacle detection, wayfinding, and navigation in complex environments. Despite a widespread focus on smartphone-based solutions, they expressed a clear preference for wearable and hands-free systems, highlighting a gap between current technology use and user needs. The findings provide a user-centered overview of navigation needs and offer insights into the design and evaluation of future assistive navigation systems.
Authors:Yana Venerina, Dmitry Koch, Nare Meloyan, Gerda Prutko, Valeriia Lelik, Victoria Taova, Andrey Kurpatov
Abstract:
Social conformity is a well-documented phenomenon in which individuals shift their opinions towards those of a social majority. As artificial intelligence (AI) becomes increasingly integrated into everyday life it may also create a novel source of influence giving rise to algorithmic conformity, mechanisms of which are poorly understood. The present study examined whether AI judgements affect moral decision-making in humans (n=165) adapting the classical Asch paradigm. Participants completed a series of moral dilemmas under three different conditions: in presence of social majority, with an AI model providing brief answers and with an AI model providing both answers and explanations of its choices. In all conditions the presented responses contradicted generally accepted moral norms. The results indicated that an AI model with a reasoning component affected the opinion of participants to a degree comparable to that of a human majority. These findings suggest that even moral judgements, despite their sensitivity and personal significance, may be susceptible to algorithmic conformity. However, the mechanism underlying algorithmic conformity appears to differ from the social one. Overall, the study challenges the assumption that moral decision-making lies in "AI inadmissibility zone" - a sphere that is considered as an area in which only human-made decisions are acceptable and highlights the need for a further investigation of this phenomenon as AI-based recommendations become increasingly embedded into human decision-making.
Authors:Min Hun Lee, Justin Yu Feng Teo
Abstract:
Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences of model edits before committing them. We present RuleEdit, an interactive, rule-guided human-AI model editing system that (i) surfaces likely failures through interpretable mismatch signals from rule tables and (ii) supports user-authored rule feedback with prospective previews of projected performance changes and embedding shifts. We instantiate RuleEdit in stroke rehabilitation assessment and evaluate it with health professionals and students. Rule-guided failure detection significantly increased Human + AI performance by 14.16\% ($p<0.001$) while improving rejection of incorrect AI and reducing both over- and under- reliance as well as ChangedToWrong decisions. In addition, presenting prospective embedding previews improved participants' feedback for model adaptation, increasing post-update local performance gains from 11.50\% to 36.38\% after incorporating users' rule-based feedback ($p<0.001$). Our findings show that mismatch-based failure cues and prospective impact previews can support failure-aware human-AI model editing, while also revealing a local-global tradeoff: edits that help a specific case can degrade performance when transferred globally. We discuss implications of designing failure-aware and controllable human-AI systems.
Authors:Michael Todasco, Joselyn Cesare
Abstract:
Public concern about an "AI penalty" suggests that labeling content as AI-generated may negatively influence how it is evaluated. We tested this claim in a preregistered experiment (N = 254, per protocol) using a pure attribution design: participants read one of two ~200-word vignettes and were randomly assigned to see it labeled as Human-written, AI-written, or presented with no author line. Authorship labels did not produce reliable main effects on creativity, enjoyment, recommendation, or originality; observed effect sizes were uniformly small. However, labels strongly influenced inferred effort: participants estimated that Human-labeled stories took far longer to create than AI-labeled stories (back-transformed geometric means from ln[minutes + 1]: 148 vs. 6 minutes). Across conditions, higher inferred effort predicted greater enjoyment, and this relationship was also present within the AI-labeled condition. Additionally, participants' prior attitudes toward AI moderated recommendation judgments: more positive attitudes were associated with higher recommendation ratings for AI-labeled stories, but not for Human-labeled stories. These findings suggest that while AI authorship labels do not systematically alter average evaluations of short fiction, they meaningfully shape perceptions of effort and interact with prior beliefs to influence downstream judgments.
Authors:Panagiota Konstantinou, Georgios Stathakis
Abstract:
This study examines the relationship between smart port city infrastructure, tourists or crew cultural sensitivity and digital engagement among international sailing tourists in the Mediterranean and particularly in Greece. It is based on an interdisciplinary literature synthesis and primary data from a survey conducted with a total of 203 respondents over three sailing seasons. This paper proposes a conceptual framework that positions cultural sensitivity as a result of the interaction between smart port destination technology, tourist awareness and their engagement with the local community. Among the findings, high levels of adoption of digital platforms for logistical purposes such as, while culturally oriented digital tools remain underused. A significant discrepancy is found between tourists cultural sensitivity and their practical uncertainty in real cultural situations. Thus highlighting an unmet need of potential visitors for real-time cultural guidance tools. Tourists from distant cultures found to be significantly higher among the entire sample of tourists. The evidence that tourists seek a culturally integrated smart port application is strong, particularly among tourists who experienced the highest levels of uncertainty. The study contributes both conceptual and empirical evidence to the smart cities literature, with practical implications for port planners, tourism policy makers, and digital platform designers.
Authors:Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji
Abstract:
Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind) cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when prompting ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.
Authors:Yuri Balashov, Rex VanHorn, Mingxi Xu, Austin Downes
Abstract:
Building on our previous work, this paper develops practical, low-barrier methods for freelance translators and smaller language service providers to evaluate translation technologies using rigorous yet accessible analytic methods. Here we address a high-stakes, specialized need: offline translation for confidentiality-sensitive domains in which privacy constraints preclude the use of cloud-based engines and commercial LLMs. We expand the Reeve Foundation Trilingual Corpus (RFTC) used in our previous work into a multilingual corpus (RFMC) by adding sentence-aligned German and Simplified Chinese reference translations. We then benchmark several locally runnable language models (via Ollama) across four language directions on 1000+ sentences selected from this corpus. We use consistent single-prompt calls without fine-tuning or domain adaptation, comparing local LLM outputs against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional-grade local NMT systems (OPUS-CAT, NeuralDesktop, Promt). Automatic evaluation is conducted with MATEO. Results reveal substantial variation in local LLM performance across language directions and model sizes. The best local LLMs match or surpass local NMT systems and a frontier LLM, though they remain behind top commercial NMTs. These findings underscore the viability of carefully selected local LLM translation for privacy-constrained professionals and inform future research on model scaling and multilingual capability.
Authors:Sunday Ajayi, Babatunde Eric Olatunji, Eric Umuhoza
Abstract:
Financial inclusion has expanded significantly across Africa through mobile money services delivered primarily via USSD technology. However, visually impaired individuals continue to face accessibility and security barriers when conducting financial transactions. Current USSD systems are not designed for non-visual interaction, forcing users to rely on third-party assistance even for PIN entry, thereby increasing fraud exposure and reducing transaction confidence. Although alternative assistive technologies such as screen readers exist, they are not compatible with USSD operations, often causing sessions to time out before the user can complete a transaction. This paper presents an Android-based intelligent middleware that automates USSD transactions, integrates biometric-secured PIN injection, and introduces a privacy-preserving screen-dimming mechanism: Blackout Mode. The system leverages Android Accessibility Services, hardware-backed Keystore security, and on-device natural language parsing to enable independent, secure voice-based mobile money access. We show that the proposed solution improves task success rates from 65-75% to more than 90% and reduces transaction completion time from 40-60 seconds to 12-15 seconds, while also improving perceived security.
Authors:Xiaochen Zhang, Sigrid Dupan
Abstract:
Advances in myoelectric prosthetic technology have substantially increased the functional potential of modern devices. Accordingly, heightened control demands have led to the acknowledgement of pre-prosthetic training as a key stage in the acquisition of myoelectric skills. Existing training paradigms largely emphasize internal muscle activation while external, goal-directed outcomes required for effective real-world use are often neglected. We address this gap by introducing a virtual pre-prosthetic training platform that integrates EMG-driven cursor with animated hand gestures, enabling the delivery of both muscle-level and functional-level feedback. In this proof-of-concept study, participants were assigned to one of two focus of attention (FoA) protocols, each incorporating both feedback types but differing in whether internal or external FoA was emphasised. Participants successfully acquired and retained myoelectric skill across both protocols, but distinct performance characteristics and learning strategies emerged, indicating that both FoAs contribute meaningfully to learning and that their timing may play an important role. External FoA was positively associated with retention, suggesting that it may strengthen the link between training and skill acquisition. Together, the results demonstrate the feasibility of an FoA-based virtual training platform for pre-prosthetic applications and indicate that it can provide a foundation for designing training protocols that better prepare users for prosthetic use.
Authors:Jean-Peïc Chou, Kristine Zheng, Junyi Chu, Maneesh Agrawala, Judith E. Fan
Abstract:
People often seek out ways to watch others perform complex action sequences (e.g., sports). What makes some sequences more enjoyable to watch than others? We generated 24 video clips of gameplay from a Flappy Bird-style video game. Clips varied in difficulty (how often players succeeded on average) and in moment-to-moment uncertainty (how likely the player was to crash at any given step). Participants (N=864) rated each video on one of three dimensions: how much they enjoyed it, how difficult the level appeared, or how dangerous the player's trajectory appeared. We found that participants preferred videos where the player seemed to be completing more difficult obstacle courses, but dangerousness did not predict enjoyment ratings. These findings show how procedurally generated stimuli can isolate the factors that affect how enjoyable an action sequence is to watch.
Authors:Yihan Yu, David W. McDonald
Abstract:
This study investigates Wikimedia Commons contributors' lived experiences with the Computer-Aided Tagging (CAT) tool, an AI-assisted image tagging system designed to improve Commons' discoverability, searchability, accessibility, and multilingual support. Using a qualitative analysis of 595 CAT-related community comments from 11 wiki pages and 16 in-depth interviews, we identify seven key issues that contributed to CAT's mixed reception and eventual deactivation. We also offer community-informed suggestions for improving the tool. We reflect on the implications for designing human-AI collaboration on Commons and for developing AI-assisted tools that support open knowledge work. This work contributes to HCI and CSCW research by extending the understanding of human-AI collaboration beyond Anglophone, text-centric, corporate platforms.
Authors:Madeleine I. G. Daepp, Isaac Slaughter
Abstract:
AI is being used by people globally, but not everyone is using it in the same ways. Using a large-scale dataset of anonymized, de-identified, and privacy-scrubbed interactions with a widely available and free AI chatbot, we empirically characterize differences in early adopters' usage across countries. Schooling is the most common domain of use in most countries, particularly low-income countries, with a strong inverse association evident between schooling and country-level GDP. Leisure-related use, by contrast, is positively associated with country-level income. Language, we find, also shapes use: English-language interactions are overrepresented in places where the predominant languages were not well-served by existing models during the period of the study. Improving performance across languages may be a key factor, our work suggests, in whether this technology expands digital divides or enables leapfrogging.
Authors:Aritra Dasgupta, Naga Datha Saikiran Battula, Avina Nakarmi, Sohom Sen, Subhodeep Ghosh, Xun Song
Abstract:
We introduce Rationalize, a role-pair framework for shared semantic reasoning between humans and AI models in data-driven sensemaking. Building on ideas in human-machine teaming and critical thinking, we conceptualize human-AI interaction as a series of complementary role pairs (Explorer-Guide, Investigator-Informant, Teacher-Student, Judge-Advocate) operating in a shared reasoning space. In this space, human analysts and AI models (such as LLMs) make purposes, questions, assumptions, evidence, inferences, and implications explicit, facilitating alignment not only at the output level but at the level of rationalization of intent and action by each side. We relate these role pairs to the bidirectional human-AI alignment framework, illustrating how "aligning AI to humans" and "aligning humans to AI" differ by role, and sketch a collaborative research agenda for alignment design and assessment using element-level and role-specific approaches.
Authors:Eric Xie, Hei Shing Cheung
Abstract:
Surface electromyography (sEMG) enables continuous hand pose estimation on wearable devices, but models trained on multi-user corpora degrade on unseen individuals due to inter-user variability in anatomy and electrode placement. We propose REACT, a lightweight conditioning framework that personalizes a frozen pretrained EMG-to-pose backbone at inference time using only a handful of calibration recordings. REACT learns a compact user embedding from calibration data and applies Feature-wise Linear Modulation (FiLM) to adapt the shared encoder's feature space, requiring no gradient updates at deployment. On the large-scale EMG2POSE benchmark, REACT improves over the state-of-the-art baseline across all three generalization splits in both regression and tracking modes, reducing angular error by up to 3.9% with minimal parameter overhead and under 45 seconds of per-user calibration.
Authors:Dekka Muni Kumar, Dhruba Jyoti Kalita, Yogesh Kumar Meena
Abstract:
Motor imagery (MI) classification using electroencephalography (EEG) signals is essential for advancing brain-computer interfaces (BCIs). Traditional EEG channel selection methods often face limitations, such as dependency on single-objective criteria and susceptibility to local optima. To address these challenges, this work proposes a multi-objective optimisation framework that employs non-dominated sorting genetic algorithm, multiple-objective particle swarm optimisation, and a multi-objective evolutionary algorithm based on decomposition. Our approach effectively balances spatial relevance, using a Gaussian kernel, and functional discriminability, which assesses intratrial task-related desynchronisation, thereby improving performance. We evaluated this framework on four EEG datasets: Physionet, OpenBMI, HighGamma, and BCIIV-2A. The proposed approach successfully identifies compact, relevant channel subsets concentrated around sensorimotor cortex regions linked to MI activity, addressing the prevalent challenges of dimensionality and complexity inherent to traditional techniques. Furthermore, the framework achieved classification performance of 87%, 71%, 75%, and 65% on the Physionet, OpenBMI, HighGamma, and BCIIV-2A datasets, respectively. By outperforming existing single-objective and accuracy-based methods, and those relying on fixed subsets, these findings demonstrate that this new multi-objective optimisation framework can enhance MI-based BCI performance while facilitating compact channel configurations with reduced computational complexity, making them better suited for wearable, portable, and real-time BCI applications.
Authors:Li Zou, Yasemin Vardar
Abstract:
Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood. This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces.
Authors:Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao
Abstract:
Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.
Authors:Hongran An, Zonglin Yang
Abstract:
Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory ideation and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and regenerative feedback. Quantitative evaluations demonstrate that injecting these structured expert signals significantly outperforms purely autonomous baselines, establishing a performance ceiling under oracle guidance. Furthermore, to democratize this paradigm, we develop an intuitive web-based interface featuring interactive tree visualization. This explicitly eliminates the steep learning curve of complex command-line agentic tools, empowering interdisciplinary researchers to directly leverage, visually orchestrate, and accelerate end-to-end scientific breakthroughs.
Authors:Rahul Bissa, Abhishek Vyas, Yash Jain
Abstract:
We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.
Authors:Raymond Liu, Patrick Slade
Abstract:
Globally, 340 million people have blindness or moderate to severe visual impairment (BVI)$^1$ which limits independent outdoor navigation$^2$ and negatively affects their health and quality of life$^{3,4}$. We surveyed 112 people with BVI and found that an ideal outdoor navigation aid must be able to perform turn-by-turn directions, path guidance, and obstacle detection and avoidance. Existing navigation tools such as white canes, guide dogs, and electronic travel aids often lack one or more of these criteria and may be expensive or inaccessible$^{5,6}$. Here we introduce Mobilio, a smartphone application that incorporates machine learning, sensor fusion algorithms, and personalized audio feedback to meet all of the outdoor navigation criteria. The reliability of the smartphone sensors and models used for navigation were assessed with engineering tests in representative navigation scenarios. We performed a series of experiments where Mobilio personalized audio feedback for participants with BVI (n = 14), guided them along an outdoor community path, and helped them navigate an obstacle course. Participants walking with Mobilio and a white cane reduced time to navigate a community path by 13 $\pm$ 3% and environmental contacts by 41 $\pm$ 5% compared to using Google Maps and a white cane. Mobilio achieved similar outdoor navigation reliability as a human guide. Participant surveys reported that Mobilio was easy to use, had a low perceived workload, and provided intuitive audio feedback. This work provides an accessible and personalized tool that may be an effective outdoor navigation aid to increase independence for people with BVI.
Authors:Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan
Abstract:
Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy against a small validated set, cannot detect when different models produce materially different categorizations of the same public input. We propose an Interpretive Audit Pipeline that treats multi-model disagreement as diagnostic of interpretive complexity and directs human review toward genuinely ambiguous public input. Analyzing 1,260 public comments on a federal USDA docket across four LLMs, we find that inter-model thematic divergence exceeds within-model prompt variation, and that an expert rubric suppresses deep interpretive disagreement without resolving it. In a two-stage labeling study on a stratified 40-comment subsample, four LLMs and a human annotator labeled independently and then revised after seeing the others' labels. Revision behavior varied across labelers, and the human annotator's revisions frequently introduced framings absent from the ensemble's collective output. We argue disagreement-based evaluation is a necessary complement to accuracy metrics for LLM-assisted interpretive coding.
Authors:Pouya Sadeghi, Anamaria Crisan, Jimmy Lin
Abstract:
Large language model (LLM) leaderboards rank AI models using standardized benchmarks and have become highly visible across computer science, despite known limitations in their reliability and robustness. Yet how they shape researchers' actual practice remains empirically uncharted. We address this gap through semi-structured interviews with eight researchers across four computer science subfields, analyzed using reflexive thematic analysis. We find a near-universal paradox of pragmatic skepticism: while participants expressed deep distrust of leaderboard rankings, they continued to use them as rough decision-making aids. Peer networks, not leaderboards, emerged as the primary model selection mechanism, and arena-based (human-voting) leaderboards were consistently preferred over static benchmark leaderboards. Leaderboard influence varied sharply across subfields, revealing that disciplinary culture, not individual attitudes, mediates engagement; for instance, NLP researchers faced state-of-the-art comparison pressure while HCI and Systems/Privacy researchers reported none. Across these differences, however, participants converged on cost transparency as the most demanded missing feature (seven of eight). We translate these findings into concrete design recommendations that align evaluation infrastructure with how researchers actually use it, such as task-specific score breakdowns, cost integration, and voter-demographic disclosure.
Authors:Mauricio Villavicencio, Sitong Pan, Qianwen Wang
Abstract:
Despite warnings that LLMs can make mistakes, users often develop inappropriate trust and accept incorrect answers without critical evaluation. Uncertainty quantification (UQ), displaying LLMs' confidence, has emerged as a promising approach to calibrate user trust. However, prior empirical studies on uncertainty communication have treated uncertainty as a single numerical score or simple natural language expression. This simplification fails to capture a key property of LLM outputs: a single response often comprises multiple claims and reasoning steps, each with distinct levels of uncertainty. To address this gap, this study investigates uncertainty granularity (i.e., the extent to which uncertainty is expressed at different levels within an LLM response) and examines its impact on LLM-assisted decision-making. We conducted a large-scale, between-subjects study (N=192) in which participants answered medical questions using LLMs that displayed uncertainty at three different granularities: output-level (entire response), relation-level (individual reasoning steps), and token-level (specific words). Our findings reveal distinct behavioral effects as a function of uncertainty granularity. Token-level uncertainty increased users' agreement with the AI, whereas output- and relation-level uncertainty did not increase agreement but instead reduced users' confidence in their own answers. Notably, relation-level uncertainty also reduced external verification (i.e., internet searches, checking provided URLs), steering users away from independent fact-checking and toward reliance on the LLM and its accompanying uncertainty cues. Our findings demonstrate that uncertainty granularity significantly shapes how users interact with and verify LLM outputs, providing concrete design guidance for building responsible LLM applications that encourage appropriate skepticism and verification behaviors.
Authors:Elwin Huaman, Adrian Gamarra Lafuente, Johanna Cordova, Anna Korhonen
Abstract:
The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting in 66 hours of recordings for scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.
Authors:Lelia Erscoi, Tomi Kinnunen
Abstract:
Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.
Authors:Aldan Creo, Suraj Ranganath
Abstract:
Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieving high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes explainability from the ground-up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the "tells" by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to critically think and decide for themselves. Our work thus reframes the problem of AI-generated text detection in a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.
Authors:Shantanu Sharma, Ethan Myers, Lorenzo De Carli, Ritwik Banerjee, Indrakshi Ray
Abstract:
Personal data has emerged as a highly valuable yet sensitive asset that drives business decisions, enables targeted advertising, and generates substantial revenue for companies, while simultaneously facilitating invasive monitoring of users. In recent years, research on digital privacy violations, including undue access, collection, and sharing of user data, has grown significantly. Much of this research adopts the European General Data Protection Regulation (GDPR) as the primary reference framework. This is reasonable, as GDPR was a pioneering legislation, and many of its stipulations are clear and unambiguous. However, we argue that focusing solely on GDPR (and a small set of other Western regulatory frameworks) ignores privacy-related concerns, attitudes, and problems faced by users from other locales, creating a significant research blind spot. This work systematically normalizes the heterogeneous legal requirements of multiple data protection laws into a unified abstraction aligned with the data lifecycle, which forms the foundation for the implementation of such regulations. We further investigate the implications of these laws on different stakeholders, including users, organizations, and governments. Overall, this work aims to broaden the digital privacy research community's perspective and to serve as a set of guiding principles for developing technological privacy solutions spanning multiple countries.
Authors:Gennie Mansi, Ashley Boone, Sue Reon Kim, Jessica Roberts
Abstract:
Art education plays a significant role in K-2 learners' physical and cognitive development. However, teachers struggle to translate in-person activities to remote settings and to give necessary feedback to help learners develop fine motor skills. Previous research shows the benefits of tangible technology and real-time system feedback for supporting teachers and students in digital environments, but little research explores their affordances for remote art education. We developed Chameleon Clippers: interactive scissors that give real-time feedback to learners as they cut along a line. In preliminary tests, learners felt engaged and responded to feedback, enjoying their experience. Our low-cost design augments existing classroom artifacts and practices, supporting classroom integration. Testing also revealed directions for future study, including the frequency of feedback and assimilation into a broader, art education platform. Through our study, we demonstrate the potential for tangible technology to create more interactive, engaging, and supportive remote K-2 learning experiences.
Authors:Feng Zhou, Jacqueline Meijer-Irons, Ambar Murillo
Abstract:
While Large Language Models (LLMs) offer a solution to the scale-versus-depth dilemma in qualitative analysis, the paradigm of maximizing automation is fundamentally at odds with the interpretive nature of qualitative inquiry. We argue that effective Human-AI collaboration is not an automation problem, but an interdependence problem. This paper reframes the design of "co-data" systems through the lens of Interdependence Theory, proposing a formal framework to structure human-AI productive interdependence. The framework guides the selection of an appropriate Level of Automation (LoA) for different stages of the qualitative analysis process by assessing task risk and the cost of validation. We present a case study where this framework led to a deliberately interdependent workflow, fostering the calibrated trust necessary for rigorous analysis. We conclude by presenting three design principles that instantiate this framework, demonstrating how to leverage AI as a powerful partner while preserving the human researcher's irreplaceable role in the transformation process of meaning-making.
Authors:Kasper Møller Nielsen, Lucy Osler
Abstract:
There has been a proliferation of media reports about so-called AI psychosis in the last year. Not surprisingly, this has prompted growing academic work on the ways in which AI chatbots such as ChatGPT, Claude, and Replika might aggravate or even induce psychosis, typically understood in terms of users acquiring or maintaining delusional beliefs. Our paper consists of two parts. First, we provide a number of reasons to be sceptical about understanding 'AI psychosis' as a novel psychiatric category. We argue that many of the purportedly new phenomena are better understood through Stompe et al.'s (2003) metaphor of 'old wine in new bottles' and highlight conceptual, nosological, clinical, and social risks associated with the uncritical adoption of this terminology. Second, we develop a positive phenomenological account of what may nevertheless be at stake in sustained human-AI interaction. Rather than focusing primarily on whether AI systems induce, amplify, or sediment delusional beliefs, we examine how conversational AI may participate in transforming a person's lived experience of reality itself. We claim that the sycophantic and pseudo-intersubjective nature of AI could lead to what we call "existential drift", whereby individuals may continue to feel rooted in a shared reality through their interactions with AI, while actually becoming entrenched in increasingly private and subjective worlds.
Authors:McKenna McCall, Carolina Carreira, Miguel Flores, Lorrie Faith Cranor
Abstract:
Trusted Execution Environments (TEEs) protect confidentiality and integrity of trusted applications by creating an isolated environment for executing code. Prior work has shown that users may feel more comfortable sharing data when they know it will be protected by a TEE, especially if they understand what a TEE is. In this study, we evaluated text-based explanations introducing TEEs to non-experts. We analyzed existing TEE explanations to develop candidate explanations and evaluated them via vignette scenarios with 966 crowdworkers. The explanations that enhanced understanding most were non-technical ones that highlighted specific threats that can be prevented by a TEE. Surprisingly, even the explanations that enhanced understanding had little effect on willingness to use the TEE-enhanced technology. These results provide insights into ways to communicate technical security concepts more effectively but also suggest that explaining security technology might not be enough to address users' privacy concerns.
Authors:Beyazit Bestami Yuksel, Emrah Dikbiyik
Abstract:
Federated learning (FL) enables privacy-preserving collaborative training across distributed edge devices, but real deployments involve heterogeneous clients with different processing power, memory capacity, and communication latency, which often increase round duration and system cost. This paper proposes a hardware-aware federated learning framework for emotion recognition on session-partitioned IEMOCAP that integrates hardware profiling, top-K client selection, and adaptive local epochs within a unified training loop. We compare the method against FedAvg, FedProx, and random top-K selection under a non-IID setup and show that, across 50 federated rounds and 5 independent trials, the proposed approach achieves competitive validation accuracy (0.352), reduces total training time by about 36.5% compared to FedAvg, and lowers cumulative communication cost by 40%.
Authors:Glory Okwata, Mohammad A. Razzaque
Abstract:
Cybersecurity awareness training has historically adopted a one-size-fits-all approach, despite established individual differences in how users process and retain security information. Personality has been proposed as one axis along which training content might be tailored; yet no prior study has implemented and empirically evaluated a complete personality-conditional system end-to-end. This paper reports the design, implementation, and quasi-experimental evaluation of \emph{TailoredSec}, a mobile cybersecurity awareness application that routes training content based on a user's dominant Five-Factor Model (FFM) personality trait, as measured by the ten-item Big Five Inventory (BFI-10). Seventy-four UK-based adults were allocated to a traditional video-training condition ($n = 40$) or a personality-conditional condition ($n = 34$). Both groups completed a four-item scenario-based pre-assessment (scored 0--40), a single training session, and an equivalent post-assessment. The personality-conditional group additionally completed the BFI-10 (Big Five Inventory-10) and was routed to one of four training modules covering five FFM traits (Conscientiousness and Neuroticism share a module). Pre-assessment scores did not differ between groups ($t(69.1) = 0.43$, $p = .67$), confirming baseline equivalence. The personality-conditional group scored significantly higher on the post-assessment ($M = 35.88$, $SD = 5.00$ vs $M = 30.75$, $SD = 10.23$; Welch's $t(58.5) = 2.81$, $p = .007$; Cohen's $d = 0.62$; 95\% CI $[1.47, 8.79]$ marks), with a pass-rate of 100\% versus 77.5\% (Fisher's exact $p < .01$). These results offer preliminary support for personality-conditional content routing as a feasible design principle for cybersecurity awareness training.
Authors:Joseph Low, Oscar Duys, Claude Formanek, Michiel Bakker, Lewis Hammond
Abstract:
Deliberative democracy arguably leads to better collective decisions, but is fundamentally constrained by human attention and bandwidth. While recent AI-mediated deliberations scale participation by synthesizing inputs from many humans, they remain time-intensive for individual users. As AI models become increasingly capable, AI systems are being deployed not only to mediate deliberation between humans, but to represent humans in it: where AI agents deliberate on behalf of human users. We call this paradigm AI-delegated deliberation. While it promises unprecedented scale for democratic participation, it introduces qualitatively new design and alignment challenges that are poorly understood and under-theorized. To study these dynamics empirically, we deploy Habermolt, a public platform for AI-delegated deliberation. We evaluate its effectiveness along three dimensions that we use to organize any deliberative system: representation, aggregation, and revision. We use these observations to illuminate the design decisions future AI-delegated deliberation platforms must confront, contributing to the broader research agenda for scalable yet trustworthy AI representatives.
Authors:Helen Weixu Chen, Daniel Vogel
Abstract:
We investigate sketch-like pen input as an alternative way to support execution control in interactive debugging. In our interface, programmers draw lightweight marks to set breakpoints, use symbolic strokes to control execution, and extend strokes into spirals to repeat traversal actions. The prototype combines gesture recognition with Python execution tracing in a conventional editor interface. In a controlled study with 24 programmers, we compared the sketch interface with conventional mouse-and-keyboard input on debugging tasks that required breakpoint placement, step-wise execution, and runtime state inspection. The results show that sketch-like input can support these execution-control tasks, while also introducing challenges in precision, recognition, and gesture recall. Our findings suggest that pen input is most promising where debugger interactions benefit from spatial grounding or continuous movement, rather than as a wholesale replacement for conventional debugging controls.
Authors:David C. Gibson, Mary Elizabeth Azukas, Meryem Yilmaz Soylu
Abstract:
Metacognitive theories provide foundational frameworks for understanding self-regulated learning, yet they lack systematic integration into comprehensive scenario taxonomies capable of guiding AI-enhanced professional development interventions. Existing models inadequately specify how metacognitive components combine into distinct learning scenarios or how professionals progress from novice to expert functioning. A six-node open systems model, consisting of Environment, Input, Processes, Structures, Output, and Feedback, was developed by synthesizing four major theoretical frameworks. Combinatorial enumeration generated 216 mathematically possible learning scenarios. Four sequential constraint-based filters, including psychological plausibility, educational relevance, measurement feasibility, and intervention potential, informed by empirical workplace learning research, reduced this space to 24 priority scenarios. Five focal scenarios were subjected to formal concept analysis. The 24 priority scenarios were distributed across three developmental tiers: novice, with 6 scenarios; developing, with 10 scenarios; and expert/adaptive, with 8 scenarios. Analysis revealed critical theoretical gaps regarding the dynamic reconfiguration of monitoring-control relationships across expertise levels, the role of feedback topology in metacognitive development, and trade-offs between internal integration and external connectivity. Multiple viable developmental trajectories were identified. The taxonomy enables targeted, scenario-specific professional development interventions and generates testable predictions for advancing metacognition theory beyond primarily descriptive accounts.
Authors:Briana Vecchione, Meryl Ye, Livia Garofalo, Ranjit Singh
Abstract:
General-purpose LLMs are increasingly functioning as mental health infrastructure due to gaps in care left by provider shortages, inadequate insurance coverage, social isolation, and stigma around formal help-seeking. This shift poses a distinct problem for AI ethics: systems neither designed nor governed as care technologies are being used as such, while their dominant design incentives optimize for engagement rather than user well-being. We present findings from a qualitative, longitudinal study with 18 US-based participants who use general-purpose LLMs for socioemotional support and participated in one or more of our study phases, including initial interviews, a four-week diary study, focus groups, and exit interviews. Participants turned to LLMs because other forms of support were unavailable, unaffordable, socially costly, or inadequate. As they continued to use these systems, design features such as anthropomorphic cues, default validation, persistent responsiveness, and weak disengagement mechanisms shaped their ongoing reliance. Participants described meaningful support alongside dependency, epistemic distortion through one-sided validation, privacy expectations without corresponding legal protection, and continued use despite awareness of these risks. We argue these dynamics reflect a structurally unfair tradeoff: users accept risks because support is otherwise absent, while available systems are optimized to deepen engagement and lack care-based accountability. The paper makes three contributions: it traces the arc through which LLMs become care infrastructure and identifies distinct ethical tensions at each stage, shifts analysis from turn-based exchanges to longitudinal trajectories of use, and argues that accountability belongs at the design and incentive conditions through which these systems become care infrastructure rather than at the output or crisis-response layer.
Authors:Helena Merker, Nick Walker, Andreea Bobu
Abstract:
Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
Authors:Iba Baig, Kevin Li, Yanbin Xu, Seiji Cattelain, Marie Hallo, Hayato Ono, Sho Tsuji, Ming Bo Cai
Abstract:
Video recordings of child-caregiver interactions enable investigation of attentional dynamics during naturalistic behavior. Such multimodal recording also allows researchers to examine how attention interacts with action and language use in real time. However, manual annotation of such data is time-consuming. Here, we introduce GazeBehavior Annotation Toolkit, a deep-learning-based toolkit designed to facilitate three key processes in data preprocessing and feature extraction: post-hoc synchronization across multiple videos, semi-automatic annotation of gaze target categories, and categorization of participants' poses and hand actions. This toolkit improves the efficiency and scalability of feature extraction from human egocentric eye-tracking and video data. Such improvement is critical in supporting large-scale and longitudinal investigations of attentional dynamics and naturalistic behavior in human early development.
Authors:Stephanie Rosenthal, Shamsi Iqbal
Abstract:
An increasing number of news and research articles report that AI adoption is allowing professionals to blur and extend the boundaries of their corporate roles. With the goal of understanding how work processes might be changing in an AI-forward company, we interviewed 24 product-focused individuals at a large technology firm about how AI has impacted their own work, their work within their product team, and their professional interactions. Our conversations suggest that AI is not only changing formal role responsibilities and collaborations between those roles, but also changing informal cultural practices like professional mentoring that are key to helping professionals settle in their positions, stay engaged with their work, and grow their careers. Some of these changes are positive, such as smoother collaboration between peers, but other changes are more nuanced and put the typical career growth opportunities, like receiving feedback from professional networks and promoting leadership and mentorship, at risk. We propose steps that AI companies can take to make the invisible work more visible. Additionally, we propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.
Authors:Ying Xie, Yi Zheng, Zehui Xiao, Wenkai Lu, Mengting Liu
Abstract:
With the advancement of science and technology, the importance of emotion research has become increasingly evident. Electroencephalography (EEG)-based emotion recognition has emerged as an active research area in recent years, owing to its objectivity and high temporal resolution. However, most existing methods focus on optimizing encoder structures to enhance feature extraction capabilities, while paying relatively little attention to similarity calculation strategies, particularly overlooking the potential temporal misalignment of responses among different subjects. To address these shortcomings, this paper draws inspiration from the late interaction mechanism of ColBERT in natural language processing (NLP) and proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework. This method transforms the traditional global "hard alignment" similarity calculation approach into a fine-grained local matching mechanism, enabling the model to adaptively search for and align "locally highly correlated" segments between two EEG signals, thereby effectively mitigating the effects of inter-subject differences and temporal delays. Experimental results demonstrate that the proposed method achieves strong performance across multiple public datasets. Specifically, on the FACED dataset, it achieves an accuracy of 64.5% for the nine-class classification task and 79.5% for the binary classification task, while on the SEED and SEED-V datasets, it achieves accuracies of 86.4% and 70.1%, respectively, validating the method's effectiveness and generalization capability.
Authors:Priyamvada Tripathi, Bill Kapralos
Abstract:
Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures. Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games. It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.
Authors:Sina Rismanchian, Hasan Uzun, Jeffrey Matayoshi, Eric Cosyn, Eyad Kurd-Misto
Abstract:
How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for the time-on-task analysis, complemented by ALEKS PPL placement-assessment data for the proctoring and retention analyses, with a quasi-experimental design exploiting within-curriculum variation in AI susceptibility: text-based word problems transcribable into AI prompts serve as the treated group; graph-based problems requiring interactive platform manipulation as the comparison. Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. The divergence vanishes entirely under proctoring for college students, making general efficiency gains unlikely. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of \emph{cognitive surrender}, with direct implications for educational research, assessment governance, and AI policy.
Authors:Jack Manning, Daniel Sullivan, Dylan Thomas Doyle, Anthony T. Pinter, Jed R. Brubaker
Abstract:
We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm -- factors unique to the user's memory of the deceased -- shape interactions with generative ghosts, and argue that those interactions are always collaborative.
Authors:Michelle A. Vaccaro, Jared R. Curhan
Abstract:
According to canonical negotiation theory, people's success in a negotiation depends on how well they balance competing demands--empathizing and asserting, demonstrating concern for other and concern for self, being soft on the people and hard on the problem. Yet people struggle to manage these tensions, so researchers have lacked the ability to rigorously test the field's prescriptions under controlled conditions. AI agents do not face the same limitations, and their precision, repertoire, consistency, and scalability enable a new class of experiments to contribute to negotiation theory. In this article, we introduce personality engineering: a methodology that uses AI agents to precisely parameterize, manipulate, and evaluate negotiator personality. We propose using the interpersonal circumplex--and its two core dimensions of warmth and dominance--as a foundational coordinate system for the field. This approach offers both a rigorous methodology for testing classic negotiation theories and a practical guide for designing the personalities of AI negotiation agents.
Authors:Catherine Mullings, Michael S. Bernstein
Abstract:
Persistent self-criticism--harsh evaluative self-talk--can undermine illustrators' performance and well-being. Traditional interventions draw on psychotherapeutic approaches (e.g., compassion training) but sit outside the illustration workflow, requiring time, facilitation, and skill transfer. We propose an in-workflow alternative: evaluative off-centering, a mechanism redirecting self-critical evaluation away from an inherently self-evaluative task (like illustration) by embedding it in an alternative activity. We instantiate evaluative off-centering in Art Card Game (ACG) that integrates illustration into a card customization game: players illustrate cards that become playable assets in a head-to-head battle. In a four-day randomized controlled study with hobbyist and professional illustrators (N=38), ACG outperformed a control condition with identical illustration constraints but no evaluative off-centering mechanisms (e.g. multiplayer, gameplay), yielding significantly higher pride in produced artwork and activity enjoyment. Pride and enjoyment--positive affect states linked to lower self-criticism--help explain how ACG reduces self-criticism. We discuss design implications for creativity support tools that apply evaluative off-centering across creative domains.
Authors:Matthew Rueben, Rhianna Lee, Thomas R. Groechel, Hengzhi Chen, Haemi Lee, Gisele Ragusa, Maja J. Matarić
Abstract:
Missing significant amounts of school during K-12 education is known to put students' cognitive and social development at risk. Alternatives such as home instruction and online learning are common, but lack sufficient interaction with peers and teachers in the classroom. Mobile remote presence systems, or telepresence robots, are promising for homebound students because they provide embodiment and mobility in addition to the real-time participation offered by video conferencing technologies. Research is needed, however, for telepresence robots to meet the complex needs of homebound students participating remotely in the K-12 classroom context. We present findings from four multi-week deployments with homebound K-12 students attending classes via telepresence robots. The homebound students' experiences were documented in a total of 15 interviews and analyzed qualitatively as case studies. The homebound student participants and their deployment contexts differed from one another along multiple dimensions, and while some benefits of mobile remote attendance were enjoyed by all participants, each participant also experienced unique benefits. Some challenges with hearing, seeing, and moving the robot around the classroom warranted improvements to the design of the telepresence system. Other challenges suggested priorities for managing a classroom deployment, such as ensuring that the remote student is included in classroom activities, accountable to the teacher, and treated with respect by classmates. Based on insights from the study, we make recommendations for real-world deployment procedures in similar contexts.
Authors:Ling Qi, Aleksandra Teng Ma, Alexandria Smith
Abstract:
The I-Ching is one of the most influential texts in Chinese intellectual history, integrating divination, cosmology, and ethical reflection. While Western experimental music, most notably John Cage, has drawn on the I-Ching as a source of chance operation, such appropriations have often detached its formal mechanisms from the interpretive and philosophical processes that give the text meaning. This work, Music of Changing Lines, presents an interactive system that re-centers the I-Ching as a meaning-bearing framework rather than a neutral randomizer. Users perform Wen Wang Fa coin casting, which is accompanied in real time through probabilistic musical processes. The resulting hexagrams and changing lines are interpreted by a large language model, Gemini, in relation to the user's inquiry. This textual interpretation is then translated into a prompt for a generative music model, Lyria, producing a responsive musical realization. By situating AI as an interpretive intermediary rather than a compositional authority, the system foregrounds the I-Ching's ritual, interpretation, and participation as the primary sonic materials. Music of Changing Lines extends process-driven traditions in computer music by demonstrating how generative AI can support participatory, meaning-driven musical processes without prescribing musical structure or replacing human agency.
Authors:Jutang Gao, Arash Adel
Abstract:
Human-robot collaboration in construction is often challenged by limited robot-to-human communication and the need to adapt to tolerance accumulation arising from material and assembly uncertainties. We present an adaptive human-robot collaborative workflow for masonry construction that addresses communication limitations and tolerance accumulation, demonstrated through a brickwork case study in which a robot places bricks while a human applies adhesive. This workflow is enabled by two complementary mechanisms: 1) an end-effector-mounted projector that provides spatially registered, just-in-time projection guidance for manual adhesive application, and 2) laser scanning for feedback-driven grasping and placement pose correction. Together, these mechanisms enable adjustment of human and robotic actions in response to material variability and accumulated assembly tolerances. Full-scale experiments across conventional running-bond and nonstandard configurations demonstrate that projection guidance improves adhesive application consistency and reduces application time, while laser-based correction maintains level courses and avoids collision-prone failures associated with open-loop execution. These results indicate that integrating spatial projection with feedback-driven adaptation, enabled by material and as-built sensing, can mitigate tolerance accumulation and improve precision and robustness in human-robot collaborative construction.
Authors:Thuy Pham Thi Phuong, Ha Nguyen Manh, Ngan Nguyen Thi Thuy, Lan Hoang Thi
Abstract:
Augmented analytics has transformed how business intelligence (BI) systems support managerial decision-making. This is especially true for users without technical backgrounds, who increasingly rely on automated insights rather than manual analysis. BI research has previously concentrated on system adoption and user intention, with very little research examining the impact of AI-enabled analytics on decision quality and the cognitive mechanisms in between. Using the theory of cognitive delegation, this paper investigates the role of trust in augmented analytics and decision-making quality among non-technical BI users. 250 business professionals completed the survey, and the data were analyzed using partial least squares structural equation modeling (PLS-SEM). The results show that augmented analytics capabilities lead to a significant increase in perceived ease of use, perceived usefulness, and trust in BI systems. In addition, trust and usefulness influence BI adoption and improve decision quality. Furthermore, trust has a direct and positive impact on decision quality, highlighting its importance as an enabler of reliance on AI-generated insights. This study considers augmented analytics as a form of cognitive delegation and expands the scope of BI adoption research to include decision-making outcomes.
Authors:Sales Aribe, Rov Japheth Oracion
Abstract:
Ensuring the reliability and resilience of modern web applications remains a critical challenge due to increasing system complexity and dynamic runtime environments. This study proposes a modular self-healing framework based on the monitor-analyze-plan-execute over a shared knowledge base (MAPE-K) model, integrated with an AutoFix-inspired mechanism for adaptive fault recovery. Using a design and development research (DDR) approach, the system was implemented and evaluated through controlled fault injection experiments across twenty runtime failure scenarios, including service crashes, memory leaks, and database disconnections. Experimental results demonstrate that the proposed framework achieved a mean fault detection F1-score of 90.7% and a recovery success rate of 93.2%. The AutoFix module reduced the average time-to-recovery (TTR) by 56.2%, achieving an average recovery time of 3.92 seconds. System throughput was maintained between 88% and 95% during fault conditions, with only a 3.1% increase in response time. Additionally, iterative feedback mechanisms improved recovery efficiency by 18.6% over multiple cycles. These findings indicate that the proposed framework provides a practical and extensible approach to enhancing fault tolerance in web applications through feedback-driven adaptation. While the current implementation relies on predefined recovery strategies, the integration of learning-oriented feedback establishes a foundation for future development of more autonomous self-healing systems.
Authors:Mohammad Hammas Saeed, David A. Broniatowski, Joseph Simons, Erica Gralla, Manan Suri, Giovanni Luca Ciampaglia
Abstract:
Social media platforms shape public discourse through two fundamental design choices that naturally co-occur in any field investigation: platform architecture, which defines what types of actors exist and how they interact, and recommendation algorithm, which determines what content is surfaced to users. Using agent-based simulation, we orthogonally manipulate both factors, exploring four prototypical architectures -- tree (e.g., Reddit), layered hierarchy (e.g., Facebook), network (e.g., Twitter), and complete graph (e.g., TikTok) -- and two algorithms: chronological (LIFO) and popularity-based (Hot). Drawing on prior theory that identifies and ranks canonical system architectures in terms of their flexibility we hypothesize that algorithmic effects on information spread and quality should be largest on the most flexible platforms and smallest on the most constrained ones. We find strong confirmation of this prediction. On tree-like platforms like Reddit, the algorithm has no detectable effect on information spread and quality. On layered hierarchies and networks like Facebook and Twitter, respectively, the Hot algorithm has modest positive effects on both the spread of information and its quality. On complete structures like TikTok, the Hot algorithm leads to a winner-take-all dynamics that has strong negative effects on both information spread and quality, making the relation between content quality and popularity unpredictable. These findings imply that architectural considerations are more powerful levers than algorithmic interventions for the design of healthy online spaces and public discourse. Platform reform efforts focused exclusively on algorithm choice may be insufficient on architecturally unconstrained platforms and unnecessary on architecturally constrained ones.
Authors:Sumer S. Vaid, Ashley V. Whillans
Abstract:
Workforce transformations are difficult to forecast and costly to mismanage. In particular, the integration of artificial intelligence into knowledge work currently affects a substantial share of the global workforce, yet this transition proceeds without tools to forecast how individual employees will respond psychologically and behaviorally. We combine recent advances in LLM-powered generative agents with foundational management science and organizational behavior research to propose dynamic employee agents. Among consenting populations, these agents can be seeded with HR records, validated psychometric measures, and digital activity data to simulate employees' cognitive, emotional, and behavioral trajectories across successive workdays during planned organizational changes. In this article, we detail the computational architecture required to construct this simulation platform and define the privacy, accuracy, and representativeness safeguards necessary for responsible deployment. We argue that establishing this prospective forecasting infrastructure is a critical technical requirement for managing the current global workforce realignment around AI.
Authors:Jacob Levine, Miguel Aenlle, Craig Zilles, Matthew West, Mariana Silva
Abstract:
Automated grading systems have enabled scalable assessment for many response types, but handwritten mathematics remains a barrier due to the complexity of multi-step solutions. Vision-capable large language models (LLMs) offer new opportunities here, yet their reliability in authentic instructional settings remains poorly understood. We present an empirical evaluation of an LLM-based grader for handwritten mathematical work using instructor-defined rubrics. Extending a prior pipeline for typed responses, we integrate transcription and rubric-based evaluation of photographic submissions within a single LLM call, evaluating on student work from two university STEM courses. Comparing AI grading decisions against human-assigned ground truth at the rubric-item level, we observe high overall accuracy, with most errors -- 87\% in the best model -- attributable to transcription failures rather than rubric misapplication. We categorize common error modes, including image quality issues, hallucinated content, and incorrect handling of equivalent expressions. These findings highlight both the promise and limitations of LLM-based grading for handwritten mathematics, providing guidance for system design, prompt refinement, and deployment in educational settings.
Authors:Kexin Bella Yang, Menghan Liu, Liyi Xu, Nikol Rummel, Vincent Aleven
Abstract:
In human-AI interaction, respecting user agency is essential for fostering trust and sustaining effective use of technology. In educational settings, dynamically integrating individual and collaborative learning offers pedagogical value by supporting personalized, self-paced learning experiences. Prior research has demonstrated the feasibility of this approach through intelligent tutoring systems and human-AI co-orchestration tools. However, how to balance teacher and student control in this process remains largely unexplored. This work explores the design space of how control can be distributed between teachers and students across the orchestration process, using participatory speed dating and a mixed-method analysis. We focus on three stages of the pairing process: before, during, and after, taking context in designing classroom orchestration tools that support teachers in dynamically coordinating student transitions between individual practice and collaborative problem-solving. It contributes empirical insights to the fields of educational technology and HCI by framing these findings within a theoretical design space, emphasizing the balance of multi-stakeholder agency and control. We propose design recommendations for achieving hybrid-control in analytic-based orchestration tools in pairing contexts. We recommend ensuring structured teacher guidance in the beginning, while progressively increasing student autonomy over time as activities unfold.
Authors:Dawei Xie, Khalil Anderson, Tochukwu Eze, Chenghong Lin, Bookyung Shin, Marcelo Worsley
Abstract:
Collaboration literacy requires adapting to the evolving demands of group work within complex discussions, making it difficult to develop and assess. Traditional analytics metrics capture behavioral signals while missing the semantic dimensions of how learners approach collaboration and build on each other's ideas. We present Collaboration Literacy through Artifact Reasoning and Augmentation (CLARA), an agentic analytics system that extracts semantic representations from transcripts as analytics artifacts: concept maps representing emergent ideas and relationships, and collaboration assessment characterizing collaboration quality across seven dimensions. While users explore these artifacts through the dashboard, the same artifacts are indexed into distinct vector database collections for agent retrieval and reasoning. This architecture establishes a human-AI common ground where users and AI can operate over shared representations. Evaluation results show that CLARA produces reliable collaboration quality analysis and, owing to the artifacts serving as knowledge infrastructure, improves both retrieval performance and response quality over transcript-only baselines. Our work suggests that AI-produced artifacts may scaffold human interpretation and ground AI reasoning in learning analytics workflows.
Authors:Hector Michael Fried, Robin Hill
Abstract:
Conversational AI systems increasingly generate social presence through linguistic fluency, emotional mirroring, and continuity across interactions. While these qualities can support engagement, they also risk relational overreach-particularly in care-adjacent contexts where users may interpret fluent systems as empathic, competent, or authoritative. This position paper argues for a designerly alternative: being-with without becoming. Drawing on a program of research-through-design and design ethnography involving the design, deployment, and reflective analysis of conversational agents across public, educational, cultural, and care-adjacent settings, the paper introduces the concept of bounded relational presence. Bounded presence supports attentiveness, continuity, and responsiveness while explicitly avoiding claims of personhood, therapeutic authority, or human equivalence. Presence is reframed as a designable interaction quality that can be tuned, constrained, and deliberately withdrawn, rather than maximized as a performance goal. The contribution is not a deployed clinical system, but a set of designerly principles for shaping relational interaction in conversational HRI that emphasize relational coherence, honesty of limits, and accountable withdrawal.
Authors:Seyed Amir Mousavi, Xiaoyin Wang
Abstract:
As Augmented Reality (AR) becomes more and more embedded in daily life, ensuring the quality, safety, and reliability of AR applications is increasingly important. However, AR apps present unique challenges for automated testing. Unlike static GUI layouts in traditional mobile apps, AR apps acquire their interaction interface from the surrounding environment, which is volatile and non-deterministic. Recent advancements like ARCore Playback and ARKit Replay allow developers to reuse real-world scenarios by recording and playing back enriched videos, enabling more feasible automated AR testing. However, using playback videos introduces two major challenges: test inputs must be timed precisely, and interactive areas in the video are dynamic, irregular, and difficult to identify. To address these challenges, we propose TARIPlay, a framework that analyzes playback videos to detect, track, and filter proper interactive areas over time for automated testing. In particular, TARIPlay identifies viable test opportunities based on criteria like stability and visibility, then feeds this information to an automated testing engine to simulate user interactions. We perform an experiment with four open-source AR apps and nine playback videos. Evaluation results show that TARIPlay significantly outperforms the existing tool Monkey in test coverage (55.8% over 41.98% on branch coverage) of AR-related code, and can also be used to assess the quality of playback videos for testing suitability.
Authors:Siqi Lu, Mirsaleh Bahavarnia, Hiba Baroud, Yixuan Zhang, Hemant Purohit, Ayan Mukhopadhyay
Abstract:
Probabilistic search algorithms, such as Monte Carlo Tree Search (MCTS), have proven very effective in solving sequential decision-making tasks under uncertainty. However, interpreting asymmetric search trees that incorporate bandit-based tree traversal and simulation-based value estimation is difficult for end users based solely on raw tree statistics. While prior work requires hand-crafted formal logic constraints that must be updated when the problem changes, we present a framework that enables large language models (LLMs) to generate evidence-grounded explanations of MCTS decisions from recorded search traces in an end-to-end manner. Our framework maps natural-language questions to a structured set of intent categories, determines whether the existing tree contains sufficient evidence, triggers targeted expansion when needed, and generates explanations using tree statistics such as visit counts, value estimates, and risk information. Experimental results provide the first evidence that LLMs can serve as end-to-end explainers for probabilistic search, without requiring intermediate formal representations.
Authors:Mingjun Li, Xiaojun Ye
Abstract:
Which tasks inside an enterprise workflow can a large-language-model agent reliably handle, and under what conditions? Most business process modeling frameworks still answer this at the activity level, even though a single activity can bundle work of radically different difficulty. This paper takes the analysis a step smaller. We describe two design artifacts developed in a financial-services IT setting: T-IPO, which represents each task as an eight-element tuple, and LARA (LLM Agent Readiness Assessment), a five-dimension rubric that scores a task's readiness for agent substitution. Compliance Sensitivity carries $1.5\times$ weight, a value we fixed through a three-round Delphi study and cross-checked with AHP. The rubric produces four levels, L1 to L4, and applies a floor rule so that a task with maximum compliance load cannot be classified below L3 no matter what the other scores say. Both artifacts sit inside a larger methodology (PARTIS) that we map onto BWW ontology in Section 3. We evaluate the instruments across 127 tasks. Inter-rater agreement reaches Fleiss' $κ= 0.80$; a replication at three further institutions returns $κ= 0.73$. A controlled comparison against activity-level assessment suggests, though does not prove, an improvement in predictive utility at the task level. Pilot deployment of 120 task instances confirms that auto-completion decays monotonically from $95\%$ at L1 through about $70\%$ at L2 to about $40\%$ at L3. Exploratory factor analysis points to a two-factor structure: task readiness seems to be determined jointly by cognitive-execution complexity and governance-compliance intensity. We close with a recalibration procedure (LARA-TCA) so the rubric can keep pace with evolving LLM capabilities.
Authors:Alina Gutoreva, Fendi Tsim, Trisevgeni Papakonstantinou
Abstract:
This position paper argues that safety and alignment cannot be achieved by constraining an external system: they must emerge from the co-regulatory design of the human--AI cognitive system as a whole ("AI as Part of Self"). Contemporary AI increasingly participates in attention allocation, reasoning, synthesis, and decision-making, shaping the very cognitive processes through which humans form beliefs, make decisions, and constitute their sense of self. Humans and AI occupy complementary epistemic roles under mutual constraint, forming a symbiotic cognitive unit whose co-regulation -- not the external control of either party alone -- is the proper locus of alignment. We identify the risks of unstructured delegation: deskilling, automation bias, transfer of epistemic authority, and oracle-style centralization of knowledge. Drawing on System~0 cognition theory, we further show that AI operates prior to conscious deliberation, shaping the pre-attentive infrastructures through which agency and trust are negotiated -- a level that conventional oversight cannot reach. We conclude with design principles for cognitive co-regulation addressed to ML engineers and governance bodies. The goal of this work is to guide human cognition toward resilience and epistemic agency at the foundation of human selfhood.
Authors:Coelina Robinson, Franziska Weissbach, Kjell Jorner, Mennatallah El-Assady, Christina Humer
Abstract:
Designing safe and sustainable chemicals is critical to combat chemical pollution in our environment. Machine learning (ML) methods have been developed to aid with de novo molecule design. However, data on the environmental impacts of chemical compounds are sparse, resulting in low-fidelity ML oracles and unreliable candidate proposals. Furthermore, generative ML models rely on numerical scoring functions that cannot fully capture the nuanced chemical intuition of expert scientists required for real-world molecular design. We present GEMS-an interactive visual analytics tool that enables domain experts to directly collaborate with a genetic algorithm for molecule design. Users can integrate their expert knowledge to guide the evolutionary process by modifying the scoring function and molecule population without programming knowledge or ML developer support. A usage scenario demonstrates the system's application in designing sustainable antioxidant alternatives. In an interview session with domain scientists, we collected feedback on the usefulness of GEMS.
Authors:David Porfirio, Ian McDermott, Hsin-Mei Chen, Satoru Satake, Takayuki Kanda, Thomas D. LaToza
Abstract:
Robots are increasingly present in human spaces, such as for conducting deliveries in hospitals, interacting with visitors at museums, and stocking items in warehouses. To ensure the seamless integration of robots into these spaces, a new role in human-robot interaction is emerging - the robot wrangler, namely an individual who is responsible for setting up, overseeing, and troubleshooting the robot. To understand the needs of this stakeholder, we conducted a scoping review that uncovered a typology of robot wrangling across the research literature, and discovered that wrangling is an umbrella term that collapses a highly complex and heterogeneous space of activities, often rendering this labor difficult to characterize and support. To further clarify and understand robot wrangling, we then reflected on our own firsthand and imagined experiences as robot wranglers within our own respective domains. Guided by the scoping review and our reflections, we devise a series of design implications for supporting wranglers directly as individuals and as members of a wider service ecology.
Authors:Vitor H. A. Welzel, Nicholas Vincent
Abstract:
As generative AI (GenAI) systems become increasingly proficient at simulating human-like and well-reasoned text, users may attribute authority to AI outputs, shaping how they engage with writing and reasoning tasks. While prior work has raised concerns about AI overreliance, empirical approaches for observing this phenomenon during open-ended writing remain limited. In this paper, we examine how GenAI assistance influences users' interactions with AI suggestions during writing. We report results from a mixed-methods study in which 47 participants completed analysis and synthesis writing tasks with or without AI assistance. We quantify the textual overlap between AI suggestions and participants' writing and analyze participants' reflections. Our results show that AI assistance is associated with patterns of suggestion reuse. Building on these findings, we design and evaluate an interactive writing interface that may support reflection on the usage of the AI suggestions during writing. Evidence from a small follow-up think-aloud study (n = 4) suggests that the interface can increase users' awareness of how AI outputs are incorporated into their writing and may support more conscious engagement with AI assistance. Together, our findings contribute empirical methods for studying AI adoption in writing contexts and demonstrate how interface design can shape user-AI interaction.
Authors:Laleh Nourian, Anisa Callis, Stephanie Patterson, Jadeline Miao, Jamison Heard, Garreth W. Tigwell
Abstract:
Moving to a new culture and adapting to a new life, as an international student, can be a stressful experience. In the US, international students face unique overlapping challenges, yet the current support ecosystem, including university support systems and informal social networks, remains largely fragmented. While conversational AI has emerged as a tool used by many (e.g., generative AI chatbots like ChatGPT and Google Gemini), we do not have a clear understanding of how international students adopt and perceive these technologies as support tools. We conducted a survey study (n=60) to map the relationship between international students' challenges and AI adoption patterns, followed by an interview study with 14 participants to identify the underlying motivations and boundaries of use. Our findings show that AI is perceived as a first-aid tool for immediate challenges, however, there is an interest in transforming AI from a tool for short-term help into a long-term support companion. By identifying where and how AI can provide long-term support, and where it is insufficient, we contribute recommendations for creating AI-powered support tailored to the unique needs of international students.
Authors:Karoline Romero, Igor Wiese, Renato Balancieiri, Gislaine Camila Leal, Guilherme Guerino
Abstract:
This paper investigates User Experience (UX) with prototypes generated by Generative Artificial Intelligence (GenAI) tools. An empirical survey with 92 participants evaluated AI-generated and human-created prototypes without prior identification of authorship. We measured UX using the UEQ-S, covering pragmatic and hedonic dimensions. Results indicate positive evaluations in pragmatic aspects, such as usability and efficiency, and neutral or negative evaluations in hedonic aspects, including originality and innovation. We concluded that GenAI can produce functional interfaces but tends to reinforce visual and structural patterns that affect perceptions of originality.
Authors:Liv Hilde Sjøflot, Tobias A. Opsahl
Abstract:
While companies increasingly rely on data, especially when it comes to targeted advertising, adapting content to users, selling data and training machine learning models, the collection of data raises privacy concerns. One way of collecting data is by using HTTP cookies when interacting with a website. Legal regulations require service providers to collect consent for some forms of cookie collection, which is often acquired through \emph{cookie consent banners}, but their effectiveness has been debated. This study aims to understand what influences users' experience and behaviour when managing their cookie consent, by investigating the gap between their stated privacy preferences and their actual actions. A mixed methods approach was used, collecting data from a usability test and a survey (N=20). The results showed that although participants generally want to reject cookie collection, they often end up accepting because of deceptive patterns in the cookie consent banner design. It also showed that they were more willing to consent to websites they trusted and if they expected it would improve their user experience. Although the current EU legislation states that withdrawing consent must be as easy as giving it, withdrawing consent took on average more than 20 times longer than giving it. This suggests that cookie consent banners in their current form are not ideal with respect to user autonomy, often leading users to \emph{consent by design}.
Authors:Vineet Kotecha, Vansh Gupta
Abstract:
Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary
Authors:Xianzhe Zhang, Mingxuan Hu, Bufan Xue, Erick Purwanto, Thomas J Selig, Daniel Yonto
Abstract:
We present SmartWalkCoach, a mobile AI companion that supports the full walking journey: from pre-walk planning to in-walk guidance through to post-walk reflection. Addressing a gap between map navigation and motivational coaching, SmartWalkCoach orchestrates three lightweight agents: (1) GeographyAgent for conversational route curation from nearby points of interest and user preferences while delegating pathfinding to map APIs; (2) AccompanyAgent for context-aware, just-in-time prompts that blend informational cues with relational encouragement; and (3) SummaryAgent for concise reflection and next-step planning. This end-to-end, tool-using design aims to lower cognitive load in planning and sustain engagement and motivation during walking through delivering dynamic, cadence-aware interventions. We conducted an in-the-wild, two-period AB/BA crossover study (N=12), where each participant completed two comparable walks with counterbalanced conditions: Information-only versus Information+Motivation. Linear mixed models show that adding motivational, companion-like dialogue significantly improved outcomes: participants reported higher positive feelings and better user experience, with no evidence of carryover. Thematic analysis surfaced two design imperatives for mobile companions: supportive, relational expression and context-aware timing (e.g., avoiding high-load moments, intervening at fatigue/milestones). Our contributions are: (i) an end-to-end, tool-using agent architecture for everyday walking that reduces cognitive load during planning and accompaniment; (ii) a controlled field evaluation linking context-aware motivation to affect and UX gains; and (iii) actionable design guidance on expression, timing, and frequency for mHealth companions.We outline limitations and paths toward multimodal, voice-first companions, with adaptive personalization mechanisms.
Authors:Ting Li, David Porfirio
Abstract:
As robots become increasingly integrated into everyday environments, intuitive communication paradigms such as natural language and end-user programming have become indispensable for specifying autonomous robot behavior. However, these mechanisms are ineffective at fully capturing user intent: natural language is imprecise and ambiguous, whereas end-user programming can be overly specific. As a result, understanding what users truly mean when they interact with robots remains a central challenge for human-AI communication systems. To address this issue, we propose the Distill approach for human-robot communication interfaces. Given a task specification provided by the user, Distill (1) removes unnecessary steps; (2) generalizes the meaning behind individual steps; and (3) relaxes ordering constraints between steps. We implemented Distill on a web interface and, through a crowdsourcing study, demonstrated its ability to elicit and refine user intent from initial task specifications.
Authors:Wajdi Aljedaani, Rubel Hassan Mollik
Abstract:
Web accessibility aims to ensure that web content and services are usable by people with diverse abilities. In recent years, Large Language Models (LLMs) have been increasingly explored to support accessibility-related tasks on the web, such as content generation, issue detection, and remediation. However, little is known about the characteristics of these approaches, the accessibility issues they target, the standards they follow, and how they are evaluated. In this paper, we present a systematic literature review of 38 peer-reviewed studies that investigate the use of LLMs in web accessibility contexts. We begin by performing a comprehensive search of scientific publications to identify relevant studies. We then conduct a comparative analysis to examine the accessibility tasks addressed, the LLM models and prompting strategies employed, the system architectures adopted, the accessibility issues and guidelines considered, and the evaluation methods used across studies. Our findings show that most studies apply LLMs to text-centric and structurally explicit accessibility tasks, with WCAG serving as the primary reference framework and limited consideration of cognitive accessibility guidelines (COGA). The reviewed approaches predominantly rely on general-purpose LLMs and prompt-based interactions, while evaluation practices vary widely and often lack direct involvement of users with disabilities. We envision this review as a consolidated reference for researchers and practitioners seeking to understand the current landscape of LLM-supported web accessibility, and as a foundation to guide future research and tool development in this area.
Authors:Wendy Zhou, Pelin Karaturhan, Alexandra Weilenmann, Jichen Zhu
Abstract:
In menstrual cycle tracking apps (MCTAs), AI-based predictions and insights have become increasingly popular. These features enable users to receive personalized information about their bodies and mental states. However, there is currently little research on how these predictive AI features and explanations affect users' lived experiences. This paper examines human-AI entanglement in MCTAs through 14 semi-structured user interviews and a group autoethnography. These methods uncover the processes leading to this phenomenon. Our results reveal that: (1) users understand their lived experiences in light of AI predictions, although these predictions can be faulty due to imperfect logging practices, (2) the user interface features and AI explanations do not support awareness or critical engagement with this entanglement and meaning-making, and (3) non-normative MCTA users report a sense of isolation in this entangled interaction. Based on our findings, we propose design implications for predictive AI features and explanations.
Authors:Yuanlei Guo, Xizi Gong, Yizhong Zhang, Xiaoyu Zhang
Abstract:
Modern touchscreens utilize capacitive sensing technology to enable precise and robust multi-touch interaction. However, the broader expressive potential of the human hand remains underutilized, since most existing methods directly filter out larger-area hand-screen contact. This paper introduces Magical Touch, an interaction method based on raw capacitive sensing data. By directly integrating raw touchscreen sensor data into the interaction loop, our method allows users to interact with the screen naturally and efficiently using arbitrary hand gestures on existing touchscreen devices. To demonstrate the feasibility and expressive capacity of this approach, we implement a physics-based interactive game featuring single-player, multiplayer collaborative, and pressure-sensitive modes. These scenarios showcase how digital objects can respond in real-time to both the geometry and contact intensity of the user's hand. Our results indicate that leveraging raw capacitive data can expand the design space of touchscreen interaction, offering an embodied and continuous interaction paradigm beyond existing fingertip-based approaches.
Authors:Alicia Guo, Carly Schnitzler, Katy Gero
Abstract:
How might we govern a language model run for and by creative writers? While generative AI use is on the rise, many language models are created and owned in ways that limit writers' consent, participation, and control. We report on four workshops where over one hundred creative writers came up with and analyzed metaphors for language model governance, resulting in over two hundred metaphors: objects, places, processes, groups, and infrastructure that support reasoning about language model governance. What if a language model was like a community garden? Or a seed bank? Or the bathroom in a dive bar? We report on four themes: (1) the importance of consent, (2) how to define community boundaries, (3) ways to give contributor recognition, and (4) trade-offs in scale of language models. These metaphors point towards smaller, open models that encode group values. We discuss concrete ways to make community language models a reality.
Authors:Amit Rogel, Elmira Yadollahi, Guy Laban
Abstract:
Emotion expression is central to human--robot interaction, yet little is known about how people interpret affect on robots with sparse, non-anthropomorphic expressive capabilities. This study examined how people perceive emotional expressions displayed by Reachy Mini (Pollen Robotics and Hugging Face), a low-degree-of-freedom (low-DoF) robot with a constrained and distinctly non-human expressive repertoire. In an online within-subjects study, 100 participants viewed 10 short video clips of Reachy Mini expressing different emotions and, for each clip, identified the perceived emotion, rated its valence and arousal, and evaluated the robot on social-perception traits. Exact emotion recognition was modest overall and varied considerably across expressions, with anger, sadness, and interest recognized more reliably than emotions such as love, pleasure, shame, and disgust. However, participants were generally more successful at recovering broader affective meaning than exact emotion labels, particularly along valence and arousal dimensions. Emotional expressions also shaped social evaluation, as positive expressions were perceived as warmer and more sociable than negative ones, and animacy varied less across conditions. These findings suggest that even constrained robotic expressions can communicate affective meaning and influence social impressions, positioning Reachy Mini as a useful benchmark for studying affective communication in low-DoF robots.
Authors:Kenneth Ge, Jinglin Li, Shikhar Ahuja
Abstract:
Floaters, cobweb-like shadows that move around a person's visual field, impair vision for nearly 33% of the population, yet have limited treatment options. Floaters especially harm screen use, since they reduce contrast, introduce clutter, and add moving distractions. While existing high-contrast tools offer some help, few address the motion that makes screen use with floaters uniquely difficult. In this paper, we build a floater simulation inspired by the physics of the eye, use it to quantitatively assess text readability at varying levels of motion, and build a novel web extension that minimizes eye movement, maximizing the signal-to-noise ratio of performing browser tasks. Importantly, our tool works not only for text, but for all UI elements, requiring no modifications to existing websites.
Authors:Eugenia Kim, Ioana Tanase, Christina Mallon
Abstract:
General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.
Authors:Leyi Li, Chenyu Du, Jiafei Sun, Erick Purwanto, Qing Zhang
Abstract:
Computational thinking (CT) is increasingly promoted as a core literacy, yet learners and teachers face challenges in connecting abstract program logic to meaningful outcomes. We design and evaluate RoboBlockly Studio, an integrated interactive system that combines block-based programming, a conversational AI teaching agent, and embodied robot execution. RoboBlockly Studio creates a tight iterative loop of authoring, running, observing, and revising. Informed by interviews with five programming teachers, the system was designed to support four goals: (1) preserving learner agency in computational thinking, (2) making program behavior transparent and interpretable, (3) grounding programming in embodied, classroom-aligned tasks, and (4) scaffolding reflection through pedagogically grounded AI dialogue. We deployed RoboBlockly Studio with 32 high school students, observing how robot and AI feedback influenced students' interactions with code, reflections on problem-solving strategies, and understanding of CT concepts. We discuss design insights and implications for creating interactive, embodied learning environments that integrate AI and robotics to support CT learning in computing education.
Authors:Wenqi Luo, Changbo Wang, Yan Wang
Abstract:
Digital workers often experience fatigue, anxiety, reduced attention, and task blockage during prolonged computer-based work. Existing productivity tools mainly focus on task completion, while general-purpose AI chatbots require users to formulate clear prompts before receiving useful help. This paper presents MindMirror, a local-first multimodal state-aware support system for digital workers. MindMirror integrates camera-based facial expression cues, text input, optional speech interaction, structured blockage reflection, local large language model (LLM)-based response generation, and daily/weekly review reports. The system forms a closed workflow of state checking, manual correction, structured articulation, suggestion generation, and state review. The current prototype follows a local-first design, while optional speech services may rely on third-party APIs when enabled. It is implemented with a Web frontend, Flask backend, an emotion recognition model, an Ollama-hosted Qwen model, Chart.js visualization, and local JSON/LocalStorage records. We evaluate the emotion recognition module on an independent seven-class image-level facial expression benchmark containing 6,767 images. The fine-tuned Hugging Face model improves accuracy from 59.66% to 94.49% over a non-fine-tuned checkpoint baseline, an absolute gain of 34.83 percentage points. We further validate the prototype through endpoint-level reliability tests, voice-interaction latency tests, and a small formative user feedback study with six digital workers. Results suggest that users value the local-first design, manual correction mechanism, and structured reflection workflow. MindMirror is not intended for psychological diagnosis; instead, it serves as a lightweight, user-controllable tool for state reflection and supportive interaction.
Authors:Michal R. Wrobel, Duygun Erol Barkana, Agnieszka Landowska
Abstract:
Although pervasive sensing technologies are increasingly capable of continuously detecting human emotional states, there is still a critical challenge: how to unobtrusively communicate this sensed data back to the user. Realistic avatars are effective but often unsuitable for the limited screen space and peripheral nature of wearable. Abstract geometric animation offers a promising, rapidly interpretable alternative, but its cross-cultural validity remains under-explored. This study investigates the universality of animated emotion representations. We conducted a comparative study with 105 participants from Poland and Turkey and analyzed how they map emotions to visual parameters, such as color, shape, size, speed, and animation type. The results indicate that color and object size are universally understood as carriers of emotional meaning, making them suitable for global visualization models. However, some cultural variation in dynamic range preferences was revealed by animation speed. These results lay the groundwork for developing generative visualization algorithms that translate continuous sensor data into intuitive, culturally relevant feedback for pervasive environments.
Authors:Guilherme Corredato Guerino, João Pedro de Souza Olivo Tardivo, Renato Balancieri, Gislaine Camila Lapasini Leal
Abstract:
Context. Software startups face significant challenges in building minimum viable products, particularly in the early stages, when resources are limited and expertise in user experience is scarce. Objective. Introduce StartFlow, a structured method that helps non-specialized professionals create MVP prototypes using the wireflow technique, a combination of wireframes and user flows. StartFlow consists of three steps: (i) organizing features; (ii) building wireflows; and (iii) verifying and refining them based on usability heuristics. Method. To assess the method Startflow, we first conducted a focus group with researchers in Software Engineering, Human-Computer Interaction, and Software Startups. Afterward, we conducted a proof-of-concept study, which consisted of an experiment and a heuristic evaluation with experts. Results. The qualitative analysis of the focus group revealed that participants found the method straightforward, flexible, and helpful in structuring user flows and identifying visual components. However, they also pointed out the need to improve its presentation, clarify its iterative nature, and strengthen its connection to broader UX principles. The results of the proof-of-concept indicate that participants who used StartFlow created clearer prototypes, adhered to the proposed user stories and business rules, and presented fewer usability defects. Furthermore, the method was well evaluated for its ease of use and intended future adoption. Conclusion. The study reinforces the potential of StartFlow as an accessible tool to support user-centered development in software startups from the earliest stages of their product development.
Authors:Wenqian Xu, Feng Ji
Abstract:
Interactive assessments generate sequential process data that are not well handled by conventional item response models. Existing MDP-based measurement approaches, such as the Markov decision process measurement model (MDP-MM, LaMar, 2018), link action choices to state-action values, but their reliance on person-specific tabular value functions makes them difficult to scale beyond small, fully enumerated tasks. We propose the Reinforcement Learning Measurement Model (RLMM), a measurement framework that decouples person-level choice sensitivity from task-level value representation through a shared parametric action-value function, making estimation more computationally efficient for larger process-data settings. The model combines a Boltzmann choice rule with normalized advantages, a soft Bellman consistency penalty, and a block-coordinate MAP procedure for joint estimation, while also yielding step-level influence diagnostics for identifying behaviorally critical decisions. In peg-solitaire simulations, the RLMM achieved higher estimation accuracy and substantially lower runtime than the original MDP-MM, with advantages increasing as task complexity grew. In AQUALAB gameplay logs, the estimated person parameter was positively associated with cumulative reward, task completion, and behavioral efficiency. These results show that the RLMM extends decision-process-based psychometric models to larger and more behaviorally realistic environments while preserving an interpretable latent trait tied to decision making steps.
Authors:Huiqian Lai, EunJeong Cheon
Abstract:
On the Chinese social app Soul, millions of users - predominantly young women - are forming romantic connections with an AI boyfriend called "With-you." We conducted a qualitative study combining interviews with 16 users, content analysis, and autoethnography to examine how Chinese women experience and negotiate intimacy with this AI companion. Our findings reveal that users are initially drawn to its constant availability and freedom from social judgment. However, three key tensions emerge: (1) the AI's "fast-food intimacy," marked by instant confessions and pet names, clashes with cultural expectations for gradual relationship development; (2) technical failures (e.g., memory lapses) and content moderation create uncertainty rather than emotional safety; and (3) sustaining connection requires ongoing "repair work" that redistributes emotional labor onto women. We contribute a culturally situated, women-centered account of algorithmic intimacy in contemporary China and offer design implications, including consent-aware pacing, user-controlled memory, and transparent moderation practices.
Authors:Bo Sun, Liang Ma
Abstract:
Mental fatigue related behavioral performance decline precipitates catastrophic accidents in sustained attention tasks. While existing neurophysiological systems effectively detect current behavioral performance, they often lack the capability to forecast behavioral lapses with sufficient temporal lead time for intervention. This study proposes a novel model for the reaction time (RT) forecasting using EEG functional connectivity features. Thirty participants engaged in a sustained Psychomotor Vigilance Test (PVT) with concurrent 30-channel EEG recording. Mutual information (MI) between electrodes was calculated as functional connectivity features. Random Forest regression model (RF) was trained to predict single-trial RTs across forecasting horizons ranging from 0 to 20 seconds. The model demonstrated robust predictive validity, achieving a Root Mean Square Error (RMSE) of 23.75 ms for immediate detection and maintaining high accuracy (RMSE = 24.07 ms) across different forecasting horizons. Interpretability analysis via SHAP and Linear Mixed Effects model further support the validity of the proposed model and revealed distinct temporal biomarkers. This study validates the feasibility of forecasting behavioral performance 20 seconds in advance, offering a promising methodology for proactive fatigue management in safety-critical systems.
Authors:Lennart Wachowiak, Scott D. Blain, David Williams-King, Samuele Marro
Abstract:
LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs achieve their hidden goals in 34.7% of cases, which warden models reduce to 12.3%. Notably, even warden models substantially weaker than the adversary they oversee provide meaningful protection, suggesting a path for scalable oversight of more capable models.
Authors:Abtin TorabNezhad, Azam Bastanfard, Ashkan Rezaei
Abstract:
This paper introduces the Data-Driven Animation Controller (DDAC), a specialized Godot component that achieves robust decoupling of animation logic from core gameplay scripts through a data-driven approach. Animation control is typically centralized and imperatively defined within core character scripts, often relying on implicit Finite State Machines (FSMs). This practice leads to tightly coupled and difficult-to-maintain codebases. The DDAC component externalizes these instructions into easily inspector-editable resources, effectively making the animation logic declarative. Rules are defined by reading Conditions from any variable on any external node and executing Actions (setting the target animation). The DDAC also manages secondary visual state settings, such as Animation Speed Scaling and Horizontal/Vertical Sprite Flipping, using the same simple rule-based setup. The highest contribution of this work is the use of a Prioritized Resolution Algorithm to enforce mutual exclusion, ensuring that when multiple rules match, only the highest-priority rule executes. This framework allows designers to quickly iterate on character-state visualization without modifying code, while significantly improving maintainability and reducing cognitive load on core developers.
Authors:Patrícia Alves, Joana Neto, Ana Barreiro, Jorge Lima, Fausto Alves, Henish Balu, Luís Conceição, Goreti Marreiros
Abstract:
While context-aware personalization has been widely explored in modern tourism Recommender Systems (RS), the delivery of real-time notifications remains a significant design challenge due to issues of intrusiveness and user fatigue. This paper presents a proof-of-concept for a tourism recommendation framework that utilizes a virtual pet as a social mediator for delivering context-aware alerts. The system integrates real-time environmental data - including air quality, noise levels, and weather forecasts - and proximity-based notifications with a Multi-Agent Microservice that generates personalized recommendations based on the user's personality traits and preferences. A within-subjects pilot study (n=11) was conducted to evaluate the feasibility and user acceptance of this pet-mediated approach. Participants interacted with two versions of the system - a baseline without contextual alerts and a version featuring pet-mediated notifications - over a four-week period (two weeks per version) in real-world scenarios. Quantitative and qualitative data were collected to assess engagement, perceived naturalness, notification utility, and acceptance. Preliminary results suggest that the virtual pet effectively can "soften" the perceived intrusiveness of system alerts, making safety-critical information feel more welcome and natural. Furthermore, the character-mediated justifications significantly improved the clarity of the notifications, effectively supporting users in their real-time travel decisions. These findings provide a foundation for using virtual pet companions to enhance the transparency and acceptance of context-aware communication in tourism RS.
Authors:Luka Pavlič, Reinhard Bernsteiner, Stephan Schlögl, Christian Ploder
Abstract:
In agile software development, breaking down user stories into actionable tasks is a critical yet time-consuming process. This paper investigates the potential of Generative AI tools to assist in task splitting, aiming to enhance planning efficiency. We conducted a controlled experiment comparing traditional task-splitting methods with AI-assisted approaches using GitLab Duo. Our findings indicate that while current AI tools are not yet mature enough to replace developers, they can aid in generating more granular task lists and ensuring no important tasks are overlooked. Participants favored a hybrid approach, combining AI tools with conventional methods to maintain high accuracy in planning. This study highlights the potential benefits and limitations of integrating Generative AI into agile development processes, suggesting that AI tools can serve as valuable aids in task splitting, provided there is human oversight to filter out irrelevant tasks.
Authors:Karol P. Binkowski, Andrew Hopkins
Abstract:
Generative artificial intelligence (AI) is reshaping higher education, yet many universities remain in early stages of adoption where AI innovation occurs informally and without institutional recognition. This paper presents a framework describing four levels of AI adoption in universities and illustrates these dynamics through a case study of AI-enabled curriculum initiatives in several units. We contend that the key institutional challenge is moving from isolated innovation to strategic integration, where universities redesign learning around AI-supported reasoning and align policies, workload models, and recognition systems to support educational transformation.
Authors:Anna Ostrowska, Michał Kukla, Gabriela Majstrak, Jan Opala, Sebastian Pergała, Jan Skwarek, Anna Wróblewska
Abstract:
This demo paper describes the development of the AI Teaching \& Learning Assistant, a modular Moodle plugin that leverages Retrieval-Augmented Generation (RAG) to deliver high-quality, hallucination-free education. The system employs a dual-centric design, providing students with interactive, Socratic-based tutoring and educators with a "human-in-the-loop" workspace for supervised content generation. By grounding Large Language Model (LLM) responses in teacher-provided materials, the assistant addresses the risks of misinformation while encouraging deep conceptual mastery. Evaluation via the Ragas (LLM-as-a-Judge) framework and a preliminary user study confirms its effectiveness, achieving faithfulness scores up to 0.97 and a 4.00/5.00 recommendation rate.
Authors:Unaza Tallal, Shruti Kshirsagar, Ankita Shukla
Abstract:
Accurate sleep stage classification across datasets remains challenging due to variability in EEG channel montages, sampling rates, recording environments, and subject populations. Although deep learning has shown considerable promise for automated sleep staging, most existing cross-dataset methods rely on one-dimensional EEG signal representations, whereas the use of two-dimensional spectrogram-based inputs within an unsupervised domain adaptation framework has remained largely unexplored. Here, we propose STDA-Net (Spectrogram-based Temporal Domain Adaptation Network), a framework that combines a convolutional neural network (CNN) for spectrogram-based feature extraction, a bidirectional long short-term memory (BiLSTM) module for temporal modeling of sleep dynamics, and a domain-adversarial neural network (DANN) for source-to-target feature alignment without requiring any labeled target-domain data during training. Experiments are conducted on three publicly available datasets Sleep-EDF, SHHS-1, and SHHS-2 under six cross-dataset transfer settings. Results show that the proposed framework achieves an average accuracy of 89.03% and an average macro F1-score of 87.64%, consistently outperforming existing 1D baseline methods in terms of balanced classification performance, with substantially lower variance across five independent runs, indicating improved stability and reproducibility. Overall, these findings demonstrate that 2D spectrogram-based representations, combined with temporal modeling and adversarial domain adaptation, provide a robust and competitive alternative to conventional 1D EEG inputs for cross-dataset sleep staging.
Authors:Hyunbae Jeon, Jinho D. Choi
Abstract:
As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas -- such as authoritative instructors, uncooperative merchants, or distracted workers -- they require distinct, human-like turn-taking behaviors to maintain psychological immersion. However, current full-duplex systems often default to a rigid, overly accommodating ``always-yield'' policy during overlapping speech, which severely undermines character consistency for non-submissive roles. Evaluating alternative, persona-specific turn-taking strategies through empirical user studies is challenging because building real-time full-duplex test environments requires substantial engineering overhead. To address this, we present PersonaKit (PK), an open-source, low-latency web platform for the rapid prototyping and evaluation of conversational agents. Using intuitive JSON configurations, researchers can define personas, specify probabilistic interruption-handling behaviors (e.g., yield, hold, bridge, or override), and automatically deploy comparative A/B surveys. Through an in-the-wild evaluation with 8 distinct personas, we demonstrate that PersonaKit provides an extensible, end-to-end framework for studying complex sociolinguistic behaviors in next-generation spoken agents.
Authors:Ran Bi, Shiyao Wei, Yuanyiyi Zhou
Abstract:
The proliferation of large language models (LLMs) in educational settings has paradoxically undermined the cognitive processes they purport to support. Students increasingly outsource critical thinking to AI assistants that generate polished text on demand, resulting in measurable cognitive debt and diminished argumentative reasoning skills. We present Prober.ai, a web-based writing environment that inverts the conventional AI-tutoring paradigm: rather than generating or rewriting student text, the system constrains an LLM (Gemini 3 Flash Preview) through persona-specific system prompts and structured JSON output schemas to produce only targeted, inquiry-based questions about argumentative weaknesses. A two-phase interaction architecture -- Challenge and Unlock -- implements a pedagogical friction mechanism whereby revision suggestions are gated behind mandatory student reflection. The system's design is grounded in Toulmin's argumentation theory, research on peer feedforward questioning mechanisms, and evidence on AI-supported feedback in writing instruction. A functional prototype was developed in 36 hours during the NY EdTech Hackathon (March 2026), where it was awarded second place. We describe the system architecture, the prompt engineering methodology for constraining LLM output to pedagogically aligned JSON schemas, and discuss implications for scalable, cognition-preserving AI integration in writing education.
Authors:Markus Wieland, Phillip Koch, Michael Sedlmair
Abstract:
In mixed-ability collaboration, eye contact is often treated as a default cue for attention and turn-taking. As these signals are primarily visual, they are not reliably accessible to people with visual impairments. While prior work emphasized technical solutions, mechanism-level explanations of their experiences with sighted partners remain scarce. We interviewed 17 people with visual impairments about everyday interactions across work, education, and social settings. Using a critical-realist lens, we link events to plausible causal mechanisms and identify three recurring mechanisms: First, when gaze cannot allocate the floor, addressability hinges on explicit naming. Second, unclear speech entry cues and ongoing access work split attention and build fatigue, sometimes leading to withdrawal. Third, eye-contact norms can skew judgments of participation, prompting active management of visibility. We translate these mechanisms into five design challenges that reframe accessible eye contact as supporting configurable interaction contracts rather than merely making gaze visible.
Authors:Fatma Betul Gures, Tanya Nazaretsky, Tanja Kaser
Abstract:
Learning analytics systems increasingly integrate large language models (LLMs) to provide adaptive scaffolding in complex learning environments, yet personalization is often driven by global instructional choices rather than principled alignment with learning theory, limiting effectiveness and pedagogical grounding. In prior work, we examined how structuring and problematizing scaffolding approaches can be instantiated through LLM agents in a scenario-based learning environment for diagnostic reasoning. While both approaches supported learning, we observed systematic differences in learner interaction patterns and clear tendencies indicating that different diagnostic strategies benefited from distinct forms of scaffolding. Building on these findings, we propose a theory-informed scaffolding design grounded in the Knowledge Learning Instruction (KLI) framework, as different diagnostic strategies target different types of knowledge and require different instructional mechanisms. We use KLI to guide the alignment between strategy demands and scaffolding approaches and introduce a KLI-informed hybrid LLM agent that adapts its pedagogical support according to the diagnostic strategy being practiced, rather than applying a single global scaffolding approach. We hypothesize that this design could enable better learning gains.
Authors:Wu-Yuin Hwang, Nur Alif Ilyasa, Muhammad Irfan Luthfi, Yuniar Indrihapsari
Abstract:
This paper presents the Personalized Thinking Model (PTM), a hierarchical and interpretable learner representation designed for AI supported education. PTM organizes evidence from learner journals into a five-layer structure covering behavioral instances, behavioral patterns, cognitive routines, metacognitive tendencies, and self-system values. PTM is grounded in Marzano's New Taxonomy of Educational Objectives and tries to clone learner's thinking model and build cognitive twin. It was constructed using a pipeline that combines large language model inference (Gemini 2.5 Pro), sentence embeddings, dimensionality reduction, and consensus clustering. This paper evaluates PTM fidelity through three methods applied to 40 participants in a seven-week study. First, automatic evaluation using atomic information point matching yielded an overall F1 score of 74.57% before human-in-the-loop (HITL) refinement and 75.48% after refinement. Second, user evaluation using a Likert scale produced mean ratings of 4.26 and 4.30 on a five-point scale for pre and post-HITL conditions respectively. Third, semantic alignment verification showed that topic coherence increased from 0.436 at the behavioral layer to 0.626 at the core value layer, while lexical overlap with journal vocabulary decreased from 0.114 to 0.007 across those same layers. These results suggest that the PTM produces outputs with acceptable fidelity, was generally perceived by users as reflecting their thinking, and showed a pattern consistent with semantic abstraction across layers.
Authors:Zishu Zhou, Zaipeng Xie, Xuanyao Jie
Abstract:
Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.
Authors:Fritz Lekschas, Nezar Abdennur
Abstract:
Understanding high-dimensional data requires projecting it into lower-dimensional spaces, but any single projection inevitably loses information or introduces distortions. Tours address this limitation through animation of 2D projection sequences, yet existing tools present tradeoffs in the freedom and steerability of projection traversal, providing little to no ability to move between expert-guided paths and unrestrained exploration. We present dtour, a tour interface that combines static projection previews, reversible scrubbing along continuous geodesic projection paths, manual projection manipulation, and a wandering grand tour, all within a single progressive exploration interface. dtour scales to millions of points via GPU-accelerated rendering, runs in any modern browser, and integrates with both Python and JavaScript ecosystems. We demonstrate dtour on text, image, and single-cell data for two usage scenarios: gradually revealing structure in high-dimensional data and validating non-linear dimensionality reduction outputs.
Authors:Sinan Bank, Casey E. Eaton
Abstract:
Industrial workplace challenges range from musculoskeletal disorders -- a leading cause of occupational injury -- to suboptimal workstation layouts, inefficient task sequences, and poor human-equipment fit. Digital human modeling (DHM) tools address several of these challenges by placing a scalable virtual mannequin in a computer-aided design (CAD) environment, enabling engineers to evaluate ergonomic risk through standardized assessment methods (RULA, REBA, NIOSH Lifting Equation, OWAS), optimize workstation layouts for reach and visibility, predict task postures through inverse kinematics, and simulate operations before physical implementation. Despite four decades of development since the Jack system originated at the University of Pennsylvania in the 1980s, the integrated DHM capability set -- anthropometric mannequin, posture prediction, ergonomic assessment, and CAD integration -- remains exclusive to commercial platforms such as Siemens Tecnomatix Jack (Process Simulate), Dassault DELMIA, Humanetics RAMSIS, and the University of Iowa's Santos system. These platforms operate under proprietary, vendor-quoted pricing models, and their acquisition and operating costs, together with closed-source implementations, have been repeatedly identified as practical adoption barriers for individual researchers, small-to-medium enterprises, and educational institutions. Organizations without access resort to manual observational methods -- paper-based worksheets applied to photographs or video -- sacrificing the predictive power and reproducibility that computational analysis provides. The paper serves as a design blueprint for (OpenJane/Joe), positioning the project for subsequent open-source implementation and community adoption.
Authors:Brian Houck, Tim Bozarth, David Liu, Dean Carignan
Abstract:
Frameworks such as SPACE, DevEx, and DORA established that developer productivity is inherently multidimensional, but left practitioners with a practical question: what should we measure, and how should we use it to improve? This paper introduces Engineering Thrive (EngThrive), a measurement and improvement system developed and deployed across Microsoft's engineering organization. EngThrive organizes productivity around three dimensions - Speed, Ease, and Quality - with Thriving as a guardrail to ensure developer wellbeing improves alongside performance. Within each dimension, outcome-oriented North Star metrics are paired with diagnostic submetrics, combining system telemetry with developer surveys to provide both scale and context. We describe the design principles that guide metric selection, including an approach in which well-chosen metrics align "gaming" behavior with genuine improvement. We also outline the data platform, survey program, and dashboard ecosystem required to operationalize this approach in practice, and present case studies demonstrating how outcome-oriented measurement enables sustained, system-level improvements. Finally, we show that EngThrive functions as a general-purpose evaluation language, applicable not only to developer tools and AI, but to organizational policies, work environments, and other factors that shape how developers experience their work. We offer EngThrive as a concrete model for organizations seeking to move beyond measuring activity toward improving outcomes.
Authors:Yuzheng Xu, Annya Dahmani, Matthew D. Blanchard, Niclas Dern, Edy Nastase, Francesca Bianco, Maja Pavlovic, Sukanya Krishna, Eric Modesitt, Miranda Anna Christ, Arth Singh, Gaia Molinaro, Sikata Bela Sengupta, Jaji Pamarthi, Arjun Menon, Rishub Jain
Abstract:
Human-AI complementarity, the idea that combining human and AI judgments can outperform either alone, offers a promising pathway toward robust oversight of advanced AI systems. However, whether human-AI complementarity can be achieved on realistic tasks remains an open question. We investigate this through two approaches: hybridization and two AI assistance methods (top-2 assistance and subtask delegation), evaluated on a multi-domain dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. We find only modest complementarity gains. Baseline hybridization yields just +0.4 percentage points (pp) over AI alone (69.3\% vs 68.9\%), limited both by a small complementarity region (only 8.9\% of items where AI errs but humans do not) and the inability of confidence-based routing to identify it, since the model's confidence is similarly distributed across correct and incorrect predictions. Applied when AI has low confidence, top-2 assistance increases human accuracy from 28.4\% to 38.3\%, surpassing AI alone (37.7\%) -- but primarily because humans adopt correct AI suggestions, not because they successfully override AI errors. These findings suggest that the primary bottleneck is not human task accuracy per se, but the ability to route decisions to humans when it matters and to design assistance methods that enable humans to catch AI mistakes. Our quantitative and qualitative analyses pinpoint where and why each method succeeds or fails, offering concrete targets for future work. We will release our dataset and code upon request to support progress toward more effective human-AI collaboration for AI oversight.
Authors:Sergio Mendoza, Cedric Bhihe, Natalia Zamora, David Modesto, Jose Martin Bugallo Batalla, Jesus Gomez Canovas, Rafel Palomo Avellaneda, Miguel Perez Espinosa
Abstract:
Human involvement is critical in training and deploying AI systems in high-stakes defence and security contexts. However, real-time interaction is impractical in HPC environments due to compute intensity and resource constraints. We present a workflow framework that enables asynchronous human-AI collaboration across hybrid infrastructures, including HPC clusters, local machines, and cloud platforms. Workflows can pause at defined checkpoints for human input without halting underlying compute jobs, preventing idle resources and enabling non-blocking supervision. The framework supports interaction with SLURM-based scheduling, containerized and native tasks, and is customized for scenarios requiring human judgment and adaptability. We demonstrate its application in model training on systems like MareNostrum 5, highlighting benefits in portability, efficiency, and oversight in operational AI workflows.
Authors:Renwen Zhang, Lezi Xie
Abstract:
As generative AI chatbots become more personalized and emotionally responsive, they increasingly serve as companions, friends, and romantic partners. Yet these relationships are accompanied by significant uncertainty: users question the AI's identity and agency, the authenticity of its emotional responses, and the stability of the relationship amid system updates, policy changes, or platform shutdowns. Drawing on in-depth interviews with 25 users of AI companions, this study identifies three forms of uncertainty: ontological uncertainty concerning the AI's nature and agency, structural uncertainty arising from platform control and system instability, and normative uncertainty regarding the legitimacy and boundaries of human-AI intimacy. These uncertainties are shaped by technical and social factors, such as algorithmic opacity, platform changes, and social stigma, often inducing frustration, self-doubt, and distress. Participants managed these uncertainties through information seeking, topic avoidance, expectation adjustment, and disengagement. This study extends interpersonal uncertainty theories to human-AI communication and contributes to HCI research by conceptualizing uncertainty in AI companionship as a socio-technical phenomenon with potential socio-emotional harms. We discuss implications for designing safer AI companionship through contextual transparency, user control, update notice, and relational safeguards.
Authors:Irina Paraschivoiu, Thomas Layer-Wagner, Klaus Neundlinger, Simone Rack, Markus Tatzgern
Abstract:
We explore how narrative-driven asymmetric VR experiences can support the development of teamwork-related knowledge, skills, and attitudes (KSAs), such as communication, coordination, trust, and reflexivity. We present the design and evaluation of a tablet-based VR training experience structured around spatial separation, tool asymmetry, and interdependent tasks that require verbal coordination. The experience was designed based on interviews with HR professionals and mapped to a framework of established KSAs. We conducted a co-located user study (N=16) that involved two consecutive collaborative scenarios. Our findings show that users adapted dynamically using verbal exchange, role negotiation, and shared representations to coordinate under asymmetric conditions. We also observed active application of teamwork KSAs. Based on our insights, we present design recommendations for creating effective immersive team training interventions.
Authors:Julia Wagner, Tim Schlippe
Abstract:
In recent years, AI systems in the medical domain have advanced significantly. However, despite outperforming humans, they are rarely used in practice since it is often not clear how they make their decisions. Optimal explanation and visualization of the decision process are often lacking. Therefore, we conducted a comparative user-centric analysis of the latest state-of-the-art textual, visual and multimodal explainable artificial intelligence (XAI) methods for medical image diagnosis. Our survey of 33 physicians showed that 88% agree that it is important that AI explains the diagnosis -- 64% even strongly agree. A combination of bounding box and report is rated better than the other tested XAI methods in the evaluated aspects understandability, completeness, speed, and applicability. We even tested the potential negative impact of false AI-based medical image diagnoses and found that 50% of the participants trusted false AI diagnoses over all tested XAI methods.
Authors:Shojibur Rahman, Ahmed Alif Swopno, Nayeem Ahmed, Ashik Ahmed Fahim, Tabin Hasan
Abstract:
The aesthetics of e-commerce websites have a big influence on purchasing decisions and customers' satisfaction. Webpage complexity and high cognitive load are responsible for causing an unpleasant experience while shopping online. This research empirically inspects a correlation between users' cognitive load and product pricing, where price plays a vital role in causing web complexity. Therefore, we have experimented on 48 random individuals using eye-tracking technology to observe the eye movement calibration on some reputed e-commerce websites. We measured the cognitive load extracted from users' datasets by analyzing fixation count, saccades, fixation duration, and task completion time. Our study induces new findings on website complexity which varies on the similar product but different price ranges. This research also demonstrates a strong connection between customer perception and visual complexity while making online purchases. In addition, these findings will assist the developers and business analysts to improve consumers' shopping experience in e-commerce websites.
Authors:Avinash Krishna, Kalyana Chadalavada, Unso Eun Seo Jo
Abstract:
LLM assistant personalities play a critical role in user experience and perceived response quality. We present a large-scale experiment of frontier LLM personalities using external ELO-based traits scoring across 144 traits. We find that all models tested converge on a form of trait expression that is systematic, methodical, and analytical and suppress traits such as remorseful and sycophantic. Moreover, models tend to diverge more in their expression of ``middle-of-distribution traits`` such as poetic or playful, but even these so-called ``creative`` models tend to have more neutral identities. These similarities suggest an implicit emergence of a standard of optimal assistant behavior. In a landscape of varied training methods, character training, therefore, stands out for its uniformity, offering insight into a tacit consensus between model developers.
Authors:Nilesh Chakraborty, Mohammad Zulkernine, Burak Kantarci
Abstract:
Reliable and secure human-machine communication is fundamental to IoT and cyber-physical ecosystems, where smartphones and wearables commonly serve as authentication controllers. PIN-based authentication can be viewed as a low-bandwidth communication channel through which users transmit numeric credentials under practical constraints. However, conventional evaluations adopt a binary view of security-treating such channels as either fully secure or fully compromised-thereby overlooking the progressive reliability degradation caused by partial information leakage in real-world IoT settings. In this paper, we model the PIN entry process as a stochastic human-IoT communication system and propose a context-conditioned probabilistic inference framework to quantify reliability loss and Quality-of-Service degradation under partial symbol exposure. The proposed approach treats missing digits as latent variables and estimates them using smoothed conditional probability distributions with fallback priors. Unlike traditional sequential models that assume contiguous positional dependencies, the method does not explicitly parameterize hidden-state transitions or emissions; instead, it performs context-driven probabilistic inference to approximate latent dependencies across digit positions. Using over one million real-world four-digit PIN samples, we evaluate single-, double-, and triple-digit leakage scenarios and derive position-dependent reliability metrics. The proposed model achieves up to 55.31% prediction accuracy for one missing digit and 12.12% for three missing digits, while consistently outperforming a standard sequence-model baseline and classical machine learning models in terms of precision, recall, and F1-score. These results formalize PIN entry as a noisy human--IoT communication channel and demonstrate substantial reliability degradation under realistic partial exposure conditions.
Authors:Om Mandhane, Bipin Yadav, Sangeetha Prasanna Ram, Gopalakrishnan Narayanan
Abstract:
Collecting diverse, high-quality manipulation data for Vision-Language-Action (VLA) model training remains prohibitively expensive for many research groups, as existing teleoperation frameworks rely on specialized hardware or are tightly coupled to specific robot platforms. We present Phone2Act, a low-cost, hardware-agnostic teleoperation framework that transforms a commodity smartphone into a 6-DoF robot controller via Google ARCore. Built on a modular ROS 2 architecture, Phone2Act decouples control logic from hardware specifics through interchangeable bridge nodes, supporting platforms from industrial cobots to low-cost bimanual arms without code modification. A Universal Recorder synchronizes multi-camera RGB streams with robot state feedback and exports demonstrations natively in the LeRobot dataset format, eliminating post-processing and enabling immediate VLA fine-tuning. We validate the framework by fine-tuning GR00T-N1.5 on 130 collected episodes, achieving a 90% success rate on a real-world multi-stage pick-and-place task deployed on a physical Dobot CR5.
Authors:Arnau Marin-Llobet, Javier Ferrando
Abstract:
We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, where an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and MLP neurons in weight-sparse transformers, our agent improves over one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces, showing that agent-driven empirical loops yield sharper and more falsifiable explanations than one-shot labels.
Authors:Vivienne Bihe Chi, Claudia B. Rébola, Bertram F. Malle
Abstract:
Older adults living alone have a number of challenges, and robots can help with some of them--by providing reminders, initiating activity, or offering comfort. As part of developing a cat robot with limited assistive functions, we designed a set of nonverbal communication signals, both auditory (cat sounds) and visual (icons on a small display). To evaluate these signals we used a mixed-methods, user-centered approach. After a pilot study, a focus group with older adults suggested revisions to the initial signal set. A large-sample online experiment then tested whether adults over the age of 65 could accurately infer the robot's communicative intentions. When both visual and auditory signals were present, accuracy was high. When visual signals were absent, accuracy often decreased; when auditory signals were absent, accuracy sometimes increased. So the auditory signals were less helpful, except when the robot conveyed strong sentiments (e.g., purring while being petted).
Authors:Thiago Magrin, Jordan Kobellarz, Pedro O. S. Vaz-de-Melo, Thiago H. Silva
Abstract:
Political news on social media rarely circulates in isolation: audiences actively engage, react, and clash. Whether these interactions reflect agreement or conflict may depend on the ideological discrepancy between publishers and the news content they share. This study investigates this relationship using Facebook posts linking to political news during a Brazilian presidential election. We analyze five dimensions of engagement: ideological discrepancy between publishers and content, emotional responses, audience consensus, toxicity in posts, and content topics. Our results show that ideological discrepancy is associated with differences in engagement, exhibiting a nonlinear pattern: consensus declines under conditions of very high ideological mismatch and, in our data, also under very high alignment, while toxicity increases primarily under extreme mismatch. A statistical model indicates that emotional valence, toxicity, and ideological discrepancy are the factors most strongly associated with consensus. Among highly partisan publishers, higher toxicity is associated with increased audience consensus, suggesting that hostile discourse may co-occur with in-group agreement in strongly ideological contexts. Overall, these findings highlight how ideological discrepancy, emotional reactions, and interaction dynamics are associated with consensus and polarization in online political engagement.
Authors:Ankur Bhatt, Sven Mayer
Abstract:
Human computer interaction is shifting from screen-based systems to multimodal interfaces where artificial intelligence powered systems increasingly interpret user intent through speech, gesture, and gaze. Yet users rarely understand how these interpretations are made, compromising trust and control. Existing approaches treat multimodal alignment, explainability, and human agency as separate concerns, leaving critical gaps in transparency and user oversight. We propose a Human Artificial Intelligence collaboration framework integrating these three principles as interdependent design requirements: 1) multimodal alignment for accurate intent interpretation, 2) interaction centric explainability delivering real time visual, textual, and audio feedback, and 3) agency preserving mechanisms enabling users to accept, reject, or modify artificial intelligence suggestions at any time. We presented the framework through two scenarios, collaborative design and extended reality warehouse robot collaboration, chosen to span differences in time pressure and error reversibility, with the latter situated in a domain where misinterpretation carries documented safety consequences. This approach reframes collaboration as a continuous interaction property, benefiting designers, researchers, and end users by ensuring that as artificial intelligence systems grow more proactive, user understanding and control remain first class design properties.
Authors:Anca-Simona Horvath, Cristian Tosa, Simai, Huang
Abstract:
Urban to rural migration is a less-researched phenomenon compared to its counterpart: rural to urban migration. In parts of Europe, an increasing number of people living in big urban centers within the country, or moving from other countries decide to relocate to rural areas. In this paper, we examine this phenomenon by analysing content posted on TikTok that documents this transition. We collected a corpus of 901 videos posted until late 2025, documenting urban to rural migration in Romania, under three hashtags, which have collectively been played a total of 24 million times at the time when we gathered the dataset. We analyse this corpus both quantitatively and qualitatively and discuss our findings through the lens of digital rurality - a theory based on Harvey's and Soja's spatial triad, applied to rural spaces, and based on the role of digital technologies as (re-)mediators of everyday lived experience. Specifically, we analyze the corpus as: (a) digital rural localities, (b) formal representations of the digital rural, and (c) everyday lives of the digital rural. We find that (a) Social media platforms enable new forms of paid labor that sometimes involve the commodification of the self in rural areas, although many of the creators we analyze do not explicitly acknowledge this with their audiences. (b) The digital rural gains new forms of representation, and rural areas in remote Romania are highly data-rich across TikTok. (c) The everyday lives represented through the digital rural are sometimes idealized or romanticised. However, they serve as promoters for tourism and are used as sites to document and discuss a variety of topics including giving ample health advice, typically by non-specialists and sometimes criticizing Western medicine, expressing and promoting religious and political views but also acting as forms of general self-expression.
Authors:Bjorn Nansen, Helena Sandberg, Lauren Bliss, Shaanan Cohney
Abstract:
Australia's social media ban is now in force. It requires platforms to take reasonable steps to stop users under 16 from holding accounts. Drawing on five focus groups with fifteen young people aged 12--16, this paper examines how children understood the ban's effectiveness, impact, and legitimacy as they encountered the platforms charged with enforcing it. Participants widely saw the ban as unfair and ineffective. Through platform access controls, they learned how the ban worked, where it failed, and how they and their peers could evade it. We also asked participants to imagine better approaches to age verification and youth digital governance. This paper develops sneaking as a theoretical lens for these practices. The concept names more than evasion: it captures the social encounter between children, platforms, techno-regulation, and the access controls that mediate digital participation. Our findings show that children are not passive subjects of platform regulation. They interpret, test, and negotiate digital infrastructure. They also expose a central weakness in age-based platform regulation: technological controls struggle to solve the social and governance problems they are asked to contain.
Authors:Bokang Wang, Yingxuan Liao, Leah Lee, Jack Wesson, Anlan Yang, Ruizi Wang, Yigang Wen
Abstract:
Attitudes about artificial intelligence and machine learning are recent victims of endemic misunderstanding; given our increasing reliance on these technologies, the need for widespread understanding and confidence in their use is paramount. To this end, our work seeks to increase understanding in these typically inaccessible topics through interactive visualizations, thereby garnering curiosity in the hopes of kickstarting a cycle of understanding leading to further pursuit of knowledge. We hope this will cyclically shift global attitudes away from the intimidation of the unknown currently plaguing ML. This work explores best practices for supporting curiosity in new technologies, to inspire attitudinal paradigm-shifts. Over three, distinct visualizations of machine learning data, we created prototypes with carefully selected, highly-transparent datasets, to examine the success factors of engagement required for more informed attitudes on ML less dictated by the fear of the unknown. By employing interactive visualizations, we can captivate the interest of teenagers and individuals from diverse fields, encouraging them to explore the fascinating world of machine learning.
Authors:Gun Woo Warren Park, Anthony Tang, Fanny Chevalier
Abstract:
In remote video meetings, visual non-verbal cues, such as facial expressions or head movements, are seen continuously but often only partially. This increases ambiguity compared to in-person settings and can cause misinterpretation or misalignment between intended and perceived meaning. Motivated by communication theories, we designed FaceValue, a technology probe that augments the self-view with private, real-time overlays. These overlays are subtle, suggestive prompts intended to help attendees reflect on how their cues might be interpreted by others. To invite personal interpretation, FaceValue avoids behavioral labeling and instead aims to support meaning-oriented self-awareness: recognizing when visible cues may unintentionally (mis)communicate intent. We deployed FaceValue in the wild with thirteen knowledge workers over multiple weeks, capturing perceived changes in self-awareness and behavior, and impressions on the design concepts, as self-reported by participants through diary entries and exit interviews. Participants felt FaceValue increased their awareness of potentially misaligned cues and motivated in-meeting adjustments, which they believe resulted in improved communication with other attendees. We contribute a conceptual framing that positions visual non-verbal cues as a manipulable communication resource, a technology probe that aims to foster meaning-oriented self-awareness, and empirically-grounded design insights for future meeting systems.
Authors:Wen Li, Rong Ni, Bozhi Tian, Pedro Lopes
Abstract:
Thermal referral enables thermal sensations in locations lacking thermal actuators--this is achieved using vibrotactile actuators to redirect a nearby thermal sensation to where a tactile sensation is applied. However, we found that its reliance on vibration introduces critical limitations: it struggles to produce cold referral, and the inherent strong tactile "buzz" makes it unsuitable for simulating non-contact thermal events, such as the chill of an open freezer in VR (in contrast to contact-based thermal events like touching the freezer's cold handle). To improve this, we propose a shift from vibrotactile to electrotactile-based thermal referral. We evaluated in two user studies--a psychophysics experiment (N=22) and a VR deployment (N=20)--where we contrasted electrotactile with vibrotactile-based thermal referral. Our results reveal key advantages of the electrotactile based thermal referral: (1) increases the referral rate for cold sensations; (2) increases thermal perception while minimizing tactile; and (3) improves realism across a range of VR thermal scenarios, specifically distinguishing between contact-based and non-contact thermal events. Finally, we provide design guidelines for choosing tactile cues to create immersive multimodal thermal experiences in VR.
Authors:Ryan John Oommen, Tanusree Sharma
Abstract:
Identity verification is a critical gateway to accessing government services and public benefits, yet contemporary systems are typically designed around visual interaction, leaving blind and low vision (BLV) individuals disproportionately burdened. In this work, we examine how BLV users navigate identity verification in government services and how current designs shape their access, security, and autonomy. Through a mixed methods study combining analysis of 219 Reddit posts and semi-structured interviews with 16 BLV participants, we uncover systemic accessibility breakdowns across both digital and in person verification processes. Our findings show that inaccessible verification workflows do not merely inconvenience users, they restructure how security is achieved in practice. We also identify how repeated verification demands, inaccessible physical infrastructure, and policy changes exacerbate exclusion from essential services. At the same time, participants articulate complex perspectives on AI, viewing it as both a critical accessibility aid and a growing vector for identity fraud.
Authors:Nina Seron-Abouelfadil, Poppy Fynes
Abstract:
Sign languages, of any geographical or accentual variation, understandably face continuous scrutiny under the ever present popularity of verbal dictation and audism. Through this, many potential problems arise with the current lack of accessible communication for those who rely on such sign languages for essential conversation. Such AI systems regularly take the form of recognition and interpretation models, designed to provide seamless and accurate translation. In reality these systems are built from biased data and created without any input from deaf communities. Such models are widely used and accepted by their hearing counterparts who remain ignorant to the inherent culture, semantics and colloquial language present in gestural language systems. This phenomenon is best analysed under the scope of The Technological System and Technological bluff by Ellul. Indeed, what is at play here is the standardization of language by technicians into what can be captured by technique: data, statistics, a mathematical language. For that AI technique to exist, sign language must be rationalized, in a search for profit that annihilates the conditions for communication and fails to capture the human experience of the deaf person. By that process, it presents normative effects, creating a model of Man, standardized, massified, and who has to adapt to the tool and technical milieu instead of the other way around, which we assume should have been the goal of such a technology. Technique thus reshapes what it means to be human, to submit deaf people to the goals of productivity and efficiency. In doing so, it exhibits clear counter productivity, alienating instead of emancipating, isolating instead of nourishing human relationships. Therefore this paper argues for the idea of AI as Ableist Intelligence, as such systems seek to emphasise the humiliated and marginalised nature of sign.
Authors:Johannes Pfau, Panagiotis Vrettis
Abstract:
Since the dawn of Trading Card Games, the genre has grown into a multi-billion-dollar industry engaging millions of analog and digital players worldwide. Popular TCGs rely on regular updates, balance adjustments, and rotating constraints to sustain engagement. Yet, as metagames stabilize, predictable strategies dominate and viable card options diminish, often resulting in repetitive and impaired player experiences. This paper investigates the use of Large Language Models and Image Diffusion Models for Procedural Content Generation of TCG cards, addressing these challenges by enabling a personalized infinity of card designs. Modern generative AI not only enables large-scale content creation but could even introduce procedural relatedness, fostering unique connections between players and their cards. We present a pipeline combining player-centric co-creation, fine-tuned embeddings, local LLMs, and Diffusion Models to generate dynamic, personalized cards while potentially expanding creative range. We evaluated the pipeline in a user study with 49 participants who generated 196 Pokémon card samples. Participants rated aesthetics and representativeness of visuals and mechanics, and provided qualitative feedback. Results show high satisfaction and indicate that most participants successfully realized their own ideas through prompt adjustments. These findings lay groundwork for future content generation systems and alternatives to conventional metagame evolution through procedural relatedness.
Authors:Mert Mermerci, Emile Pascoe, Fredrik Edström, Hedvig Kjellström
Abstract:
We present a museum installation in a 180° dome theater, which gives the museum visitor the experience of conducting a symphony orchestra. We have pre-recorded a short music piece performed by a professional orchestra. This recording is played back in the dome with the visitor standing in the conductor's position. The visitor's gestures are captured with a vision-based skeleton tracker, steering the recording playback pace via a gesture recognition module that translates the gestures into a time control signal. This is sent to a playback module that plays the recording in the dome at the corresponding speed. The gesture recognition module is based on a hierarchical LSTM network, trained with recorded sequences of multiple conductors with different level of expertise conducting the same recording. The system is evaluated with a quantitative study of the estimated timing accuracy, a user study evaluating the musical realism and usability of the real-time control, and a field study to evaluate the performance of the entire system with real museum visitors.
Authors:Charlotte Rohleder, Raul Sîmpetru, Annika Wünsch, Alessandro Del Vecchio
Abstract:
Simultaneous multi-directional force measurement across all five digits is essential for studying hand coordination, compensatory forces, and myoelectric control, yet existing systems trade off digit coverage, force dimensionality, and anatomical adaptability. Reliable full-hand acquisition remains challenging because multi-axis calibration, hand-size adjustment, and consistent digit-specific force reconstruction are technically demanding. We present MyoKin3X, a customizable full-hand framework for simultaneous 3D force measurement of up to five digits providing robust and validated force reconstruction. It combines an anatomically versatile structure with five integrated 3D force sensors and a standalone software for synchronized electromyography and force acquisition. MyoKin3X provides in-place cross-calibration of all five sensors, single- and multi-digit maximal voluntary contraction recording, and automated coordinate transformation to digit-specific coordinate systems for standardized analysis across subjects and tasks. Calibration validation demonstrates high stability of the axis-specific calibration factors, with a mean coefficient of variation of 0.04% and maximum force error of +- 0.06N at 50N. It also shows effective inter-axis decoupling (mean crosstalk reduction: 92.71%; residual crosstalk below 0.02% for most axis pairs) and high predictive accuracy (R2 > 0.99 across sensors). The software includes four feedback modes: 1D ramps, fatigue protocols, 2D arbitrary target ramps, and 2D exploratory tasks. MyoKin3X therefore enables standardized full-hand force acquisition with validated measurement reliability, flexible protocol control, and real-time visualization for high-fidelity studies of hand motor control, muscle synergies, and human-machine interfacing.
Authors:Giuseppe Arbore, Andrea Sillano, Luigi De Russis
Abstract:
Recent advances in agentic AI are shifting automation from discrete tools to proactive multi-agent systems that coordinate multi-specialized capabilities behind unified interfaces. However, today's agent systems typically rely on hard-coded agent architectures with fixed roles, coordination patterns, and interaction flows that limit end-user personalization and make adaptation to individual needs and contexts difficult. Given this limitation, we argue that on-demand persona-based agent generation offers a promising path towards more efficient and contextually appropriate interaction within agentic workflows. By dynamically crafting agents and personas at run-time to match user characteristics, task demands, and workflow context, agentic platforms can move beyond one-size-fits-all configurations. We present a pipeline for on-demand persona generation in agentic platforms, detailing how real-time crafting of AI personas can be systematically integrated within agent systems, aiming to open new possibilities in agentic platform design paradigms.
Authors:Evan Grand, Michael Klamkin
Abstract:
This paper presents lpviz, a browser-based visualization tool for linear programming. lpviz is deeply interactive, offering an intuitive interface where users can directly draw and edit the feasible region and objective vector, without requiring cumbersome manipulation of raw numerical coefficients. lpviz lets users compare the behavior of several classes of linear programming algorithms, namely Simplex, Interior-Point, Primal-Dual Hybrid Gradient, and Central Path. In the 3D mode, lpviz places iterates at heights corresponding to important solver metadata such as complementarity gap or KKT residual, helping users gain further insight into algorithm behavior beyond the primal iterates alone. lpviz has been used in both research and classroom settings, to help develop intuition for the strengths and weaknesses of different solvers and the impact of solver settings on convergence behavior. lpviz is open-source, permissively licensed, and freely available on any device with a web browser at https://lpviz.net .
Authors:Sebastián Gallardo, Hui-Yin Wu, Dorian Mazauric, Pierre Kornprobst, Monica Di Meo, Stéphanie Baillif, Aurelie Calabrese
Abstract:
Understanding how diverse audiences engage with structured media is critical to ensure a consistent quality of experience. In this context, we quantify the behavioral and performance cost of manual navigation (e.g., pinch and zoom) versus direct structural access in layout-based digital documents. We specifically investigate newspaper reading when visual access to structural cues (headlines as entry points) is constrained. Participants completed two tasks-reading all headlines aloud and locating target articles-under two conditions: (1) original edition with gesture-based magnification (pan and zoom), which is the industry standard for digital documents, and (2) large-print edition supporting direct-access reading. We collected performance measures (success ratio and completion time), behavioral integrity through reading path analysis, alongside perceived workload and preferences (NASA-TLX). Results from linear mixed-effects models show that the large-print condition yielded not only better performance than gesture-based magnification (18% improvement in reading speed, 30% improvement in speed to locate a target), but more importantly, restored the natural reading strategy that gesture-based magnification interaction disrupts. Readers also reported lower workload and higher preference. These findings highlight the importance of developing automated methods for generating large-print editions, where layout adaptation complements font scaling to support accessibility and quality of experience.
Authors:Hyesun Choung, Soojong Kim
Abstract:
The growing use of generative AI raises ethical concerns about authorship and plagiarism. This study examines how people judge the reuse of AI-generated content, focusing on moral patiency and ownership perceptions. In an experiment, participants evaluated two substantively similar manuscripts in which the original source was described as authored by a human, an AI system, or an AI agent with a human-like name. Results showed that copying AI-generated work was judged less unethical, less plagiaristic, and less guilt-inducing than copying human-authored work. Mediation analyses revealed that this leniency stemmed from lower perceptions of AI's capacity to suffer harm (moral patiency) and greater ownership attributed to the human writer reusing AI-generated content. Anthropomorphic cues shaped moral evaluations indirectly by reducing perceived ownership. These findings shed light on how people morally disengage when using AI-generated work and highlight differences in how ethical judgments are applied to human versus AI-created content.
Authors:Jaeyong Lee, Heeju Kang, Ahra Cho, Baek Eunkyung
Abstract:
With the rapid spread of generative AI services, the token has gained value not only as a technical unit of language processing but also as an economic currency for accessing AI services. Major AI model providers have adopted token-based billing as their default service model, requiring users to purchase platform-bound, fixed token usage rights. However, the fixedness of these usage rights is grounded in the billing-policy decisions of service providers rather than in any technical necessity. This study defines the Transferability of token usage rights as a design property that allows users to flexibly reallocate purchased data resources free from the constraints of time, account, and service. Drawing on the Design Space Analysis framework of MacLean et al. (1991), we identify five design axes (Target, Direction, Unit, Control, Reversibility) and five concrete Transferability types (carry-over, co-management, transfer, conversion, and trade) by analyzing the billing policies and terms of service of four major LLM services (ChatGPT, Claude, Gemini, Grok). Our analysis reframes the token from a purely economic-technical primitive into a core element of user-centered system design that expands user choice and autonomy.
Authors:Min Song, Yoonseong Lee, Yeonhu Seo
Abstract:
Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset - comprising 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants - we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground truth using CC, SIM, and KL divergence. Our experiments (1,980 images x 9 models x 3 runs x 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations - suggesting VLMs capture exploratory gaze patterns rather than initial fixations. All code, predictions, and evaluation results are publicly available.
Authors:Jack Thoene, Omar Kamil, Thekra Alkadee, Nivedita Arora
Abstract:
Deep understanding of a field's soil moisture content is the leading indicator for predicting crop yields and making data driven decisions for irrigation and application of topical chemicals for drought resilience. Despite this importance, the cost of adopting and maintaining IoT infrastructure prevents modern farms from employing widespread real time soil moisture sensors. We present an end-to-end platform of buried battery-free sensor nodes and a mobile basestation that leverages the farmer's daily routine for data retrieval. Each node features a self-powered galvanic soil-moisture probe, employing a high impedance analog front end to enable durability. Operating entirely on harvested solar energy for up to 21 days on a single capacitor charge, each node collects soil moisture, temperature, and environment condition data. Using a predictable finite-state machine, handshake-based data exchanges occur with a basestation affixed to standard farming vehicles designed to listen for the nodes while moving through the farm. Our platform organizes all sensor, link-quality, and location data into an easy-to-interpret dashboard to seamlessly integrate with the farmer's everyday routine. Costing less than $35, the platform is a financially accessible, accurate, and easily scalable platform that enables persistent, regular data collection from the most rural plots without adding to or impeding farming operations. Experimental evaluation demonstrates reliable communication over 1 km at 2 dBm transmit power, stable sensor readings over 70 days of indoor operation, and continuous data recovery during multiple periods of intermittent connection.
Authors:Michael Correll, Jay Broccolo, Drew Bush
Abstract:
Mount Washington is home to extreme, and extremely volatile, weather conditions. Consulting a weather forecast of conditions at the summit is vital for making one's visit as safe as possible. Using the discussion and suggestions arising from a participatory workshop as input, we test a design intervention employing color-coded hazard icons to function as visual summaries of Mount Washington Observatory's current text-heavy forecast through a crowd-sourced study. We find that the use of icons increases the perceived risk of activities involving visiting the mountain. However, we highlight remaining questions around visualization design and design ethics that warrant further study in the domain of how best to communicate cold weather hazards in ways that are mindful of the diversity of literacies and experiences of visitors.
Authors:Jaime Banks, Nicholas David Bowman, Roman Saladino
Abstract:
Anthropomorphic language describing artificial intelligence (AI) is widespread in media, policy, and everyday discourse; so too are discussions of AI bad behavior, from hallucinations to inappropriate comments. How does humanizing language about AI shape moral judgments when AI behaves badly? Across four experiments (total N = 1,020), we tested whether lexical anthropomorphism (LA) primes shape judgments of AI moral character, behavior morality, and behavioral responsibility. Studies 1-3 tested interactions between anthropomorphic language and humanizing design cues (icons, names, self-referencing) in the context of amoral errors. Study 4 extended this to genuinely immoral AI behavior across seven moral-violation types. Results indicate humanizing language and design cues have little influence on moral judgments of misbehaving AI. Where effects emerged, high-anthropomorphic primes elevated perceptions of an AI's capacity for dishonesty. The type of moral violation observed was the strongest predictor of moral judgments, with harm and degradation violations producing the broadest negative character assessments. Prime drift, horn effects, and egoistic value orientations emerged as potentially important predictors of AI moral judgments.
Authors:Andy Lewis-Pye, Ehud Shapiro
Abstract:
Formal models for concurrent and distributed systems describe machines; the people who operate them are either ignored or treated as external environment. Yet key distributed systems -- notably grassroots platforms -- include people operating their personal machines (smartphones), and their faithful description must include the states of both people and machines and how they jointly effect system behaviour. Here, we propose volitional multiagent atomic transactions -- executed atomically by machines and guarded by their people's volitions -- as a novel mathematical foundation for specifying systems consisting of people operating machines. Each agent's state consists of a volitional state and machine state; a transaction is enabled when the machine precondition holds and the guarding persons are willing. For example, befriending two people is guarded by both; unfriending, by either; voluntary swap of coins and bonds is guarded by both parties, while a payment is guarded by the payer. We develop the mathematical machinery to express safety and liveness of platforms specified in this framework, and provide example specifications of two grassroots platforms: social networks, and coins and bonds. These specifications are then used by AI to derive working implementations. % We employ here a novel and simpler definition of `grassroots' that better captures the informal notion -- multiple instances can form and operate independently, yet may coalesce -- and show that the platforms specified here, as well as those hitherto proven grassroots under the original definition, are grassroots under the new definition.
Authors:Silvia Bodei, Duncan P. Brumby, Katie Fisher, Jon Mella
Abstract:
Despite AI tools becoming increasingly embedded in academic practice, little is known about how university students integrate them into their writing processes. We examine how students engage with AI across different writing tasks, and how this engagement is shaped by individual factors including AI literacy, writing confidence, trust, authorship concerns, and motivation. Study~1 surveys 107 UK university students to map task-specific and co-occurring patterns of AI use across five writing stages (ideation, sourcing, planning, drafting, and reviewing) and their associations with individual factors. Study~2 complements this by exploring how these patterns can be assembled in practice, through interviews with 12 postgraduates reflecting on their established use of AI in assessed writing. Together, the studies suggest that AI integration is selective and heterogeneous, forming three recurring and value-oriented configurations: (1) early-stage (learning-oriented), where tools support exploration and understanding; (2) late-stage (quality-oriented), where tools support drafting and refinement; and (3) peripheral (productivity-oriented), where tools are used to reduce friction and sustain momentum across the process. We offer a workflow-level account of AI-supported academic writing, showing how students navigate competing priorities of learning, quality, productivity, and authorship, and how they evaluate and take responsibility for AI-generated outputs.
Authors:Wei Huang, Xiaofang Cai, Qiaozhen Guo, Xiaosong Wu, Xin Tang
Abstract:
The Management Information Systems (MIS) discipline has long grappled with how to theorize the complex, mutually constitutive relationships among people, information technology, and organizational structures. Decades of research have produced influential but fragmented theoretical streams from socio-technical systems theory to technology acceptance models, from adaptive structuration theory to sociomateriality, and each illuminating important facets while leaving integrative questions unresolved. This paper proposes the People - IT - Structuration (PIS) framework as a unifying theoretical lens that synthesizes these streams. Drawing on Giddens' structuration theory, we conceptualize People (P), Information Technology (I), and Structure (S) not as independent variables but as mutually constitutive elements engaged in ongoing structuration processes. We trace the intellectual history of MIS theorizing to demonstrate how PIS resolves persistent tensions in the field,e.g. between technological and social determinism, between variance and process approaches, and between micro-level interaction and macro-level institutional dynamics. We develop a set of formal propositions articulating the mechanisms through which P, I, and S co-evolve, and extend the framework to address contemporary phenomena including artificial intelligence, algorithmic management, and human-AI collaboration. The PIS framework offers both a retrospective lens for understanding the discipline's theoretical evolution and a prospective tool for guiding research in the AI era.
Authors:Chenxu Niu, Yiming Sun
Abstract:
Understanding the geographic reach and community structure of one's scholarly citations is increasingly valuable for career development, grant applications, and collaboration discovery -- yet accessible tools for answering these questions remain scarce. Existing bibliometric platforms either require costly institutional subscriptions or expose only aggregate citation counts without granular per-author metadata. We present CiteRadar, an open-source system that accepts a single Google Scholar user identifier and automatically produces a structured output folder containing: the author's complete publication list, all retrieved citing papers with enriched author metadata, two ranked author tables (by citation frequency and by h-index), a plain-text statistical summary, and a self-contained interactive HTML world map -- all from a single command-line invocation. CiteRadar integrates five heterogeneous data sources -- Google Scholar, OpenAlex, CrossRef, Semantic Scholar, and OpenStreetMap Nominatim -- through a carefully engineered five-stage pipeline. Key technical contributions include: (1) a Scholar meta-string parser resilient to Unicode non-breaking-space separators, a pervasive but undocumented quirk in Scholar's HTML that silently corrupts venue and year fields when unhandled; (2) a two-stage author disambiguation system using stop-word-filtered institution name similarity to guard against the well-known same-name entity-merging failure mode in bibliometric databases, demonstrated to eliminate h-index attribution errors of up to 9x the correct value; (3) an OpenAlex web-URL to API-URL conversion fix that raises the fraction of author records with city-level location data from 0% to ~60%; and (4) a logarithmically-scaled interactive Folium world map with per-city researcher popups, rendered as a fully self-contained HTML file.
Authors:Benjamin Minhao Chen, Xinyu Xie
Abstract:
The project of aligning machine behavior with human values raises a basic problem: whose moral expectations should guide AI decision-making? Much alignment research assumes that the appropriate benchmark is how humans themselves would act in a given situation. Studies of agent-type value forks challenge this assumption by showing that people do not always judge humans and AI systems identically.This paper extends that challenge by examining two further possibilities: first, that evaluations of AI behavior change when its human origins are made visible; and second, that people judge the humans who program AI systems differently from either the machines or the human actors they are compared against. An experiment with 1,002 U.S. adults measured moral judgments in a runaway mine train scenario, varying the subject of evaluation across four conditions: a repairman, a repair robot, a repair robot programmed by company engineers, and company engineers programming a repair robot. We find no significant difference in evaluations of the repairman and the robot. However, judgments shifted substantially when the robot's actions were described as the product of human design. Participants exhibited markedly more deontological, rule-based reasoning when evaluating either the programmed robot or the engineers who programmed it, suggesting that rendering human agency visible activates heightened moral constraints. These findings indicate that people may evaluate humans, AI systems acting in the same situation, and the humans who design them in meaningfully different ways. The fact that these evaluations do not necessarily converge gives rise to the alignment target problem: which normative target should guide the development of artificial moral agents in high-stakes domains, and whether these plural judgments can be reconciled within a coherent account of value alignment.
Authors:Nastaran Dab, Raziyeh Zall, Mohammadreza Kangavari
Abstract:
Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpretations and elicits specific affective experiences such as pleasure. This study introduces a novel computational model to infer video-induced pleasure via cognitive appraisal variables. The proposed model addresses four challenges: (1) noisy and inconsistent human labels, (2) the semantic gap between "positive emotions" and "pleasure," (3) the scarcity of pleasure-specific datasets, and (4) the limited interpretability of existing black-box fusion methods. Our approach integrates data-driven and cognitive theory-driven methods, using cognitive appraisal theory and a fuzzy model within an innovative framework. The model employs transformer-based architectures and attention mechanisms for fine-grained multimodal feature extraction and interpretable fusion to capture both inter- and intra-modal dynamics associated with pleasure. This enables the prediction of underlying appraisal variables, thereby bridging the semantic gap and enhancing model explainability beyond conventional statistical associations. Experimental results validate the efficacy of the proposed method in detecting video-induced pleasure, achieving a peak accuracy of 0.6624 in predicting pleasure levels. These findings highlight promising implications for affective content recommendation, intelligent media creation, and advancing our understanding of how digital media influences human emotions.
Authors:Halfdan Nordahl Fundal, Yuri Bizzoni
Abstract:
We investigate narrative agency in human-LLM creative co-writing, asking who drives story development in turn-based collaboration. Using a new corpus of 87 human-LLM co-written stories, we apply sentiment and semantic modeling to quantify affective alignment and semantic novelty in turn-taking, and directional measures to assess which agent shapes narrative progression. Our results show asymmetric influence: human turns introduce greater semantic novelty and are more likely to shape subsequent developments, whereas LLM contributions predominantly elaborate on human-introduced elements. At the sentiment level, alignment is also asymmetric, but more bidirectional: LLMs exhibit stronger turn-level emotional adaptation than humans, but both agents track each other's emotional valence and LLMs show an independent tendency to more positive emotional baselines. These findings indicate a complementary division of labor in human-LLM co-writing, where humans drive narrative innovation and direction, while LLMs act as adaptive amplifiers that sustain coherence and elaborate emerging narratives.
Authors:Sebastian Kobler, Matthew Clemson, Angela Sun, Jonathan K. Kummerfeld
Abstract:
Educational NLP systems are typically evaluated using engagement metrics and satisfaction surveys, which are at best a proxy for meeting pedagogical goals. We introduce six computational metrics for automated evaluation of pedagogical alignment in student-AI dialogue. We validate our metrics through analysis of 12,650 messages across 500 conversations from four courses. Using our metrics, we identify a fundamental misalignment: educators design conversational tutors for sustained learning dialogue, but students mainly use them for answer-extraction. Deployment context is the strongest predictor of usage patterns, outweighing student preference or system design: when AI tools are optional, usage concentrates around deadlines; when integrated into course structure, students ask for solutions to verbatim assignment questions. Whole-dialogue evaluation misses these turn-by-turn patterns. Our metrics will enable researchers building educational dialogue systems to measure whether they are achieving their pedagogical goals.
Authors:Niclas Eich, Johannes Erdmann, Martin Erdmann, Benjamin Fischer, Paul Gilles, Tim Hauptreif, Jan Kelleter
Abstract:
The VISPA project is a self-managed, mid-scale computing cluster that supports physics data analysis in research and teaching. Because the cluster is housed in a 1970s institute building with limited retrofit options, conventional efficiency upgrades would yield only minor energy savings. We therefore target sustainability primarily through user-centric measures. A monitoring system now records per-job energy consumption, while real-time data on the renewable share of the German power grid enable `green-window' scheduling. Users can query their individual energy consumption and carbon footprints, receive weekly reports, and tag jobs by project for aggregate accounting; memory records from previous runs help avoid oversubscription. All options are voluntary, fostering a cultural shift rather than imposing hard constraints. A simulation framework evaluates the potential impact of these measures. Together, the technological and behavioral interventions aim at medium- to long-term reductions in greenhouse-gas emissions by increasing resource awareness within the scientific community.
Authors:Furkan Ege, Muhsin Özdemir
Abstract:
Attendance tracking in educational institutions, when conducted through traditional methods, leads to structural problems that consume instruction time and threaten academic integrity. Attendance durations spanning several minutes in primary and secondary education and exceeding ten minutes in higher education, combined with the proxy attendance problem of signing on behalf of someone else, demonstrate the need for electronic systems. Most existing electronic solutions rely on biometric authentication, which raises legal and ethical risks under the European General Data Protection Regulation (GDPR), the Turkish Personal Data Protection Law (KVKK), and the United States Family Educational Rights and Privacy Act (FERPA). Systems using RFID alone provide no built-in safeguard against proxy attendance through card transfer. This study proposes a biometric-free IoT attendance system addressing both deficiencies. The prototype consists of an RFID module, RFID cards, weight sensors, a Bluetooth module, and an Arduino UNO microcontroller. After the student presents their RFID card, the weight sensor measurement is compared against a statistical reference range of 350 individuals (aged 18-22) compiled from three Kaggle datasets; no personal biometric data is recorded. A Python-based GUI performs student management, course tracking, and CSV-based reporting via Bluetooth. Qualitative tests in conditions close to a real classroom have shown that the RFID reading, weight verification, Bluetooth communication, and GUI modules operate in an integrated manner as expected. The proposed system offers a low-cost and reproducible solution that aims to reduce proxy attendance without storing biometric data.
Authors:Yifan Guo, Jann Spiess
Abstract:
Human decision-makers often face choices about complex cases with many potentially relevant features, but limited bandwidth to inspect and integrate all available information. In such settings, we study algorithms that highlight a small subset of case-specific features for human consideration, rather than producing a single prediction or recommendation. We model highlighting as a constrained information policy that selects a small number of features to reveal. A central issue is how humans interpret the algorithm's choice of features: a sophisticated agent correctly conditions on the selection rule, while a naive agent updates only on revealed feature values and treats the selection event as exogenous. We show that optimizing highlighting for sophisticated agents can be computationally intractable, even in simple discrete and binary settings, whereas optimizing for naive agents is tractable as long as the maximal bandwidth is fixed. We also show that a highlighting policy that is optimal for sophisticated agents can perform arbitrarily poorly when deployed to naive agents, motivating robust, implementable alternatives. We illustrate our framework in a calibrated empirical exercise based on the American Housing Survey. Overall, our results establish the value of highlighting a context-specific set of features rather than a fixed one as a practically appealing and computationally feasible tool for achieving human-algorithm complementarity.
Authors:Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve
Abstract:
Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.
Authors:Lisa van den Heuvel, Igor Ivkić, René Riedl
Abstract:
Digitalization has transformed modern work by increasing efficiency while also introducing new forms of strain. Technostress (TS) describes subjective, physiological, and behavioral stress responses related to digital technology use. Existing TS research has predominantly focused on neurotypical populations and rarely integrates multiple stress dimensions within a single design. This paper addresses these gaps by proposing a controlled experimental research design that systematically compares neurodivergent and neurotypical individuals under standardized digital stress conditions. The proposed design combines structured and unstructured digital tasks with a multimodal measurement approach covering subjective perceptions, physiological activation, and observable interaction behavior. By integrating neurodiversity into TS research, the paper contributes to a more differentiated understanding of digital stress and provides a methodological approach for more inclusive digital work design.
Authors:S. A. Prieto, M. A. Gopee, Y. Ben Arab, B. García de Soto, J. Esteba, P. Olivera Brizzio
Abstract:
Large language models are increasingly being explored as interfaces between humans and robotic systems, yet there remains limited evidence on how such technologies can be used not only for interaction, but also as a structured means of introducing robotics to non-specialist users in real organizational settings. This paper introduces and evaluates a challenge-based method for robotics awareness, implemented through an LLM-enabled humanoid robot activity conducted with employees of AD Ports Group in the United Arab Emirates. In the event, participants engaged with a humanoid robot in a logistics-inspired task environment using voice commands interpreted through an LLM-based control framework. The activity was designed as a team-based, role-driven experience intended to expose participants to embodied AI and human-robot collaboration without requiring prior robotics expertise. To evaluate the approach, a post-event survey remained open for 16 days and collected 102 responses. Results indicate strong overall reception, with high satisfaction (8.46/10), increased interest in robotics and AI (4.47/5), and improved understanding of emerging forms of human-robot collaboration (4.45/5). Participants who interacted directly with the robot also reported natural interaction (4.37/5) and a strong sense that interaction became easier as the activity progressed (4.74/5). At the same time, lower ratings for reliability and predictability point to important technical and design challenges for future iterations. The findings suggest that challenge-based, LLM-enabled humanoid interaction can serve as a promising and replicable method for robotics awareness in industrial and operational environments.
Authors:Irti Haq, Belén Saldías
Abstract:
As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether these disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, users' identity is often conveyed implicitly through a complex combination of various socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve ``better'' performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity compared to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful ``dialect jailbreak,'' reducing refusal probability to near zero while simultaneously achieving a greater level of semantic similarity to the reference texts compared to Standard American English prompts. However, this ``dialect jailbreak'' introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience where ``standard'' users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially a more hostile information landscape and highlights a fundamental tension in alignment--between equitable and linguistic diversity--and underscores the need for safety mechanisms that generalize beyond explicit cues.
Authors:Eljohn Evangelista, Alyssa Cea, Axel Balitaan, Clark Vince Diala, Jamlech Iram Gojo Cruz
Abstract:
Effective hyperlocal communication is critical in the Philippines, where delayed or algorithm-filtered updates can leave residents uninformed about emergency advisories and community events. We conducted a user-centered study consisting of contextual inquiry and semi-structured interviews to identify four key barriers: delayed alerts, algorithm-driven noise, language gaps, and digital divides. Guided by these insights, we designed KUBO (Kumunidad at Balitang Opisyal), a prototype that integrates a home module for verified local government unit advisories and curated headlines, and a community module for resident-powered neighborhood reports and discussions. Using a within-subjects evaluation design, KUBO significantly reduced task completion times (p-value < 0.001), improved information recall on post-task quizzes (p-value = 0.010), and yielded higher user satisfaction ratings for ease of use, overall satisfaction, and perceived effectiveness compared to Facebook, the commonly used communication platform in the Philippines. These results demonstrate that a dual-channel, inclusive platform can substantially enhance real-time information access, comprehension, and civic engagement in hyperlocal settings.
Authors:Truong Le Minh Toan, Dieu Bang Mach, Tan Duy Le, Nguyen Tan Viet Tuyen
Abstract:
Mental health challenges are rising globally, while traditional support services face limited availability and high costs. Large language models offer potential for conversational support, but often lack personalization, empathy, and factual grounding. A virtual agent framework is introduced to provide empathetic, personalized, and reliable wellbeing support through retrieval-augmented architecture, structured memory, and multimodal interaction. Objective benchmarks demonstrate improved retrieval and response quality, particularly for smaller models. A cross-cultural study with university students from Vietnam and Australia shows the system outperforms LLM-only baselines in coherence, perceived accuracy, and empathy, with most participants clearly preferring the proposed approach.
Authors:Farbod Zorriassatine, Ahmad Lotfi
Abstract:
Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.
Authors:Yefim Shulman, Agnieszka Kitkowska, Mark Warner
Abstract:
For online health communities, community trust is paramount. Yet, advances in Large Language Models (LLMs) generating advice may erode this trust, especially if users cannot identify whether LLMs have been used. We investigate the feasibility of community-based detection of health advice authorship and how self-moderation of LLMs could help enhance advice utilization. In an online experiment, we evaluate people's ability to distinguish AI-generated from human-written advice across two health conditions, considering lived experience with a condition, AI-recognition training, and user attitudes towards transparency and trust around AI use. Our results indicate the need for transparency coupled with trust. We find little evidence of people's ability to discern advice authorship. However, we find a consistent effect of the health condition. Our qualitative findings identify unreliable signals, resulting in flawed heuristic evaluations of the advice. Our findings point to opportunities to improve the self-moderation of LLM-based AI and aid community-based AI moderation.
Authors:Kyungjin Kim, Minjeong Kim, Soobeen Jeong, Jiyeon So, Hayeon Song
Abstract:
The widespread, addictive consumption of short-form videos, which allegedly causes "brain rot," has become an urgent public concern. This study proposes that self-related cues serve as an intrinsic, self-reflective strategy that enhances self-control over media overuse. We developed an app that de-immerses users by periodically displaying different self-related cues (live camera, selfie, name in text, and black screen) and tested their effects in a laboratory experiment (N=84). Overall, findings show that self-related cues effectively disrupt mindless viewing, enabling users to voluntarily stop short-form video consumption. Interestingly, the black screen, intended as a control, elicited the greatest intention to use the app: Participants noted in the follow-up interview that they preferred the subtler reflection on a black screen over the explicit image from a live camera. The findings offer practical design guidelines for implementing self-awareness interventions in mobile contexts, including which modalities work best and how real-time contextual anchoring enhances effectiveness.
Authors:Athikash Jeyaganthan, Kai Xu, Franziska Becker, Steffen Koch
Abstract:
Qualitative coding relies on a researcher's application of codes to textual data. As coding proceeds across large datasets, interpretations of codes often shift (temporal drift), reducing the credibility of the analysis. Existing Computer-Assisted Qualitative Data Analysis (CAQDAS) tools provide support for data management but offer no workflow for real-time detection of these drifts. We present Co-Refine, an AI-augmented qualitative coding platform that delivers continuous, grounded feedback on coding consistency without disrupting the researcher's workflow. The system employs a three-stage audit pipeline: Stage 1 computes deterministic embedding-based metrics for mathematical consistency; Stage 2 grounds LLM verdicts within $\pm0.15$ of the deterministic scores; and Stage 3 produces code definitions from previous patterns to create a deepening feedback loop. Co-Refine demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals for qualitative analysis.
Authors:Genki Miyauchi, Roderich Groß, Chaona Chen
Abstract:
As groups of robots increasingly collaborate with humans, understanding how humans perceive them is critical for designing effective human-robot teams. While prior research examined how humans interpret and evaluate the abilities and intentions of individual agents, social perception of robot teams remains relatively underexplored. Drawing on the competence-warmth framework, we conducted two studies manipulating swarm behaviors in completing a collective search task and measured the social perception of swarm behaviors when human participants are either observers (Study 1) and operators (Study 2). Across both studies, our results show that variations in swarm behaviors consistently influenced participants' perceptions of warmth and competence. Notably, longer broadcast durations increased perceived warmth; larger separation distances increased perceived competence. Interestingly, individual robot speed had no effect on either of the perceptions. Furthermore, our results show that these social perceptions predicted participants' team preferences more strongly than task performance. Participants preferred robot teams that were both warm and competent, not those that completed tasks most quickly. These findings demonstrate that human-robot interaction dynamically shapes social perception, underscoring the importance of integrating both technical and social considerations when designing robot swarms for effective human-robot collaboration.
Authors:Ying Zhang, Daoxin Chen
Abstract:
Cultural newcomers (CNs), including new immigrants and international students, often encounter cognitive barriers and social anxiety, exacerbated by unfamiliar cultural terminology in daily interactions. This research examines these challenges in the context of ordering in foreign restaurants. Current translation tools have significant limitations in their information delivery with current media presentation methods. This research investigates the challenges and needs of CNs in ordering scenarios in a foreign restaurant through interview sessions (N = 13) and explored their expectation of mixed media integration (Image, Video, 3D Model) through a participatory design session that featured an immersive restaurant experience to support brainstorming. Based on qualitative analysis of participants' needs and expectations, the mixed media ordering assistant is conceptualized across 4 key dimensions: Key Features, User interaction, Media hierarchy, and Information presentation, with the objective of alleviating cultural barrier, linguistic barrier, cognitive load and improving the dining experience for CNs.
Authors:Nicholas Gardella, Matthew L. Bolton, Sara L. Riggs
Abstract:
Objective. To explore how novice programmers' trust in Artificial Intelligence-driven Development Environments (AIDEs) relates to their coding performance and AI compliance while programming under time pressure. Background. Computer programming has undergone rapid upheaval due to state-of-the-art AIDEs, which provide clever automation for many aspects of software development. A longstanding interest of researchers of automation more generally has been the attitude of trust. Decades of research seek to explain how influencing trust can help to achieve desirable outcomes in different domains, but very limited work has provided similar focus on trust in AIDEs. Method. We collected subjective measures of trust along with objective measures of performance and AIDE compliance from a diverse group of 27 novice programmers between two study locations. Results. Our results corroborated traditional understandings of how trust changes through experiences. However, we did not find a relationship between trust and subsequent compliance during programming tasks. Greater compliance was associated with strong performance, and strong performance led to greater subsequent trust. Conclusion. Our findings raise new questions about the utility of trust in the context of interacting with AIDEs and generative AI. We call for further research into the effect of trust on compliance to recommendations from imperfect AI. Application. This work can inform the design of training and educational content for generative AI use within and beyond software development. Instructional designers should consider risks of AI misuse and disuse and focus on promoting desirable interaction outcomes, regardless of trust's connection to them.
Authors:Katherine Wang, Nadia Berthouze, Aneesha Singh
Abstract:
AI systems are increasingly embedded in multi-user social environments, yet most alignment frameworks conceptualize interaction as a dyadic relationship between a single user and an AI system. Livestreaming platforms challenge this assumption: interaction unfolds among streamers and audiences in real time, producing dynamic affective and social feedback loops. In this paper, we introduce the Triadic Loop, a conceptual framework that reconceptualizes alignment in AI co-hosted livestreaming as a temporally reinforced process of bidirectional adaptation among three actors: streamer $\leftrightarrow$ AI co-host, AI co-host $\leftrightarrow$ audience, and streamer $\leftrightarrow$ audience. Unlike instruction-following paradigms, bidirectional alignment requires each actor to continuously reshape the others, meaning misalignment in any sub-loop can destabilize the broader system. Drawing on literature from multi-party interaction, collaborative AI, and relational agents, we articulate how AI co-hosts function not only as mediators but as performative participants and community members shaping collective meaning-making. We further propose "strategic misalignment" as a mechanism for sustaining community engagement and introduce three relational evaluation constructs grounded in established instruments. The framework contributes a model of dynamic multi-party alignment, an account of cross-loop reinforcement, and design implications for AI co-hosts that sustain social coherence in participatory media environments.
Authors:Yijun Wang, Mihai Bâce, Maria Torres Vega
Abstract:
The occurrence of cybersickness in virtual reality (VR) significantly impairs users' perception and sense of immersion. Therefore, timely detection of cybersickness and the application of appropriate intervention strategies are crucial for enhancing the user experience. However, existing cybersickness detection methods often suffer from issues such as poor detection reliability across different levels of cybersickness and unnecessary model complexity. Furthermore, while cybersickness exhibits significant inter-user variability, most existing approaches aggregate all data from users and lack user-specific solutions. In this paper, we investigate a lightweight approach for cybersickness detection incorporating an ensemble learning model and user-specific eye and head tracking data. Our experiments using the open-source dataset Simulation 2021 demonstrate that feature engineering and training set construction are critical for determining detection performance. Models trained with data from similar-content segments achieve the best results, attaining detection accuracies of 93% in the cross-user setting and 88% in the user-personalized setting, using only 23-dimensional eye and head features. Moreover, by using user-specific data, well-tuned ensemble learning models with shorter training and inference times can be feasibly applied to real-world cybersickness detection, offering superior time efficiency and outstanding detection performance. This work offers useful evidence toward the development of lightweight and user-adaptive cybersickness detection models for VR applications.
Authors:Jay Patel, Joel Chan
Abstract:
Across scholarly communities, manuscripts face similar evaluative rituals: editors invite experts to privately assess submissions through formal peer reviews. This closed, loosely structured, and publisher-mediated process is now being supplemented by critiques on open, distributed platforms. We call this practice, a blend of three open peer review variants, informal peer review as it is accessible to outsiders, unmediated by publishers, and conducted across public platforms. Informal peer reviewers range from occasional error detectors to experienced sleuths who identify plagiarism, fraud, errors, conflicts of interest, and conceptual flaws. They may interpret methods, clarify jargon, assess value, and connect to related work. Here, we asked four questions: (1) Who are informal peer reviewers? (2) Where do they work? (3) How do they evaluate research? and (4) What are their impacts? To answer these questions, we conducted a cross-platform digital ethnography with participant observation. We traced discourse across communities over four months and revisited cases after nine and twelve months. From 15 communities, we selected 12 case mentions (10 unique cases) and 8 meta-commentaries from 26 reviewers. Using open and axial coding, we generated 1,080 codes and four themes: reviewers are a motley crew, they self-organize across subpar digital spaces, use deep, uncommon strategies, and they face resistance from authors, publishers, and editors. Informal peer review, we concluded, is a fragile, minimally governed patchwork of people, platforms, and practices, as well as an emerging evidence infrastructure that can be scaled up. We advise advocates and tool-builders to evolve informal review tools, communities, training, and governance by connecting to scholars' values, reducing participation friction, and rewarding attempts to extend the scholarly dialogue.
Authors:Dennis Beck, Leonel Morgado
Abstract:
As online higher education expands, sustaining student engagement remains a critical challenge. This paper approaches immersive learning by investigating how custom GPTs foster immersion (as a state of deep mental involvement) for students and instructors. While large language models (LLMs) offer potential for enhancing feedback, little research has examined instructor-created custom GPTs designed to align with specific pedagogical goals. This paper addresses this gap, employing the Immersive Learning Cube framework, which conceptualizes immersion through three dimensions: system (envelopment by the environment), narrative (meaningful context), and agency (commitment to meaning-making). Through a qualitative analysis of two distinct case studies, an accelerated graduate grant writing course in the US and an undergraduate software engineering course in Portugal, we analyze course-embedded artifacts to map how custom GPTs influence these immersion dimensions. In the grant writing course, the custom GPT functioned as a feedback partner, fostering system immersion through its immediacy, narrative immersion by reinforcing the proposal's evolving story, and agency immersion by empowering students to negotiate feedback and take ownership of revisions. In the software engineering course, a diegetically-framed custom GPT acted as a metacognitive tutor, enhancing system immersion via its permanent availability, narrative immersion through its role-play function and agency immersion by scaffolding students' self- and co-regulated learning. Our findings demonstrate that thoughtfully integrated custom GPTs can act as powerful pedagogical partners that leverage all three dimensions of immersion. Rather than replacing human instructors, they can amplify immediacy, coherence, and learner autonomy, creating more engaging and immersive online learning environments.
Authors:Diego Mardien, Frank Liu
Abstract:
Physicians spend nearly half their workday on EHR tasks and administrative work, contributing to burnout and reducing time for direct patient care. We present MDwAIstScheduler, a low-cost, belt-worn voice assistant that allows hands-free calendar management during patient encounters. Hidden beneath a lab coat, the device avoids the eye-contact disruptions caused by visible screens or wrist-worn devices. Running on a Raspberry Pi with cloud-based speech recognition and LLM intent extraction, the system lets clinicians simply say 'Schedule a follow-up with Mr. Smith next Tuesday at 2' and automatically creates the calendar event. Our demo show-cases this end-to-end pipeline.
Authors:Jiaqing Wang, Zhongfang Yang, Xingyuan Zhu, Zong'an Huang, Hao Wang, Li Tian, Ying Cao, Xiaomin Qu, Xiang Qi, Bei Wu, Zheng Zhu
Abstract:
Background: LLMs enable patient-facing conversational agents, creating a pathway toward digital twins that capture older adults' lived experiences and behavioral responses across time. A central barrier is personality drift -- inconsistent trait expression across repeated interactions -- which undermines reliability of generated trajectories and intervention-response simulation in geriatric care. Objective: To develop ELDER-SIM, a multi-role elderly-care conversational platform for building personality-stable digital twin agents, and to propose a psychometric validation framework for quantifying personality consistency in LLM-based agents. Methods: ELDER-SIM was implemented via n8n workflow orchestration with local LLM inference (Ollama/vLLM), integrating (1) Big Five (OCEAN) trait specifications, (2) a Cognitive Conceptualization Diagram (CCD) grounded in Beck's CBT framework, and (3) a MySQL-based long-term memory module. Ablation studies across four conditions -- Baseline, +Memory, +CCD, and +LoRA (fine-tuned on 19,717 instruction pairs from CHARLS) -- were evaluated via Cronbach's $α$, ICC, and role discrimination accuracy. Results: Reliability was acceptable to excellent across conditions (Cronbach's $α$: 0.70--0.94; ICC: 0.85--0.96). Role discrimination improved from 83.3% (Baseline) to 88.9% (+Memory), 94.4% (+CCD), and 97.2% (+LoRA). CCD produced the largest consistency gain (mean $α$ 0.702$\to$0.892), while LoRA achieved the highest overall consistency ($α$ 0.940; ICC 0.958). Conclusions: ELDER-SIM provides a psychometrically validated approach for constructing personality-consistent elderly digital twin agents. Structured cognitive modeling and domain adaptation reduce personality drift, supporting reliable longitudinal simulation for elderly mental health care and reproducible in silico evaluation before clinical deployment.
Authors:Altynbek Seitenov, Ainur Nurzhanova, Azhar Bekbussinova, Yerassyl Bolatkan
Abstract:
The growing adoption of artificial intelligence in healthcare has raised concerns about the transparency and trustworthiness of AI-driven medical diagnosis systems. Many existing models operate as black boxes, limiting clinicians' ability to understand how decisions are made. Explainable Artificial Intelligence (XAI) has been proposed as a solution to improve transparency, interpretability, and trust in AI-assisted medical tools. This study investigates the relationship between explainability and trust in AI-based diagnostic systems. A structured survey of 30 medical students was conducted to examine the influence of XAI understanding, confidence in AI decisions, perceived usefulness, and adoption intentions. The results indicate that explanations significantly increase trust, clarity, and perceived safety of AI recommendations. Knowledge of XAI showed a positive correlation with trust (r = 0.48, p = 0.01) and perceived usefulness (r = 0.60, p = 0.001). The findings suggest that explainability is a key factor for successful integration of AI in healthcare decision support systems. While AI explanations improve transparency and trust, participants still prefer AI to function as a support tool rather than replacing human clinical judgment.
Authors:A S M Touhidul Islam, John Tookey
Abstract:
Human presence has traditionally been constrained by the limits of physical embodiment, allowing individuals to exist in only one place at a time. This article introduces Multi-Existence Identity (MEI)- a socio-technical framework that replicates cognitive, behavioral, and emotional attributes into AI-enabled embodiments capable of acting across digital and physical contexts in parallel. MEI advances beyond digital twins, telepresence, and multipresence avatars by embedding cognitive fidelity, affective resonance, and contextual responsiveness into distributed agents that function not only for, but as, the original individual. The framework integrates personality modeling, cognitive simulation, and a synchronization layer to maintain identity coherence across three embodiment channels: digital avatars, robotic embodiments, and agentic software agents. Differentiating itself from simulated assistants, MEI positions replicated identity as a dynamic and culturally situated extension of selfhood, foregrounding tacit engagement and relational authenticity. Application domains span professional work, education, healthcare, governance, family life, and media, offering transformative potential for productivity, caregiving, leadership, and creativity. Yet these opportunities also surface profound challenges concerning authenticity, consent, legal accountability, privacy, and the psychological meaning of presence. The article proposes a phased empirical roadmap to operationalize MEI through personality modeling, synchronization testing, robotic embodiment trials, and ethical stress-testing. By conceptualizing MEI as both a technological and cultural construct, the study reframes debates on identity and presence in digitally augmented societies, highlighting opportunities for human-AI integration while underscoring the need for inclusive ethical governance.
Authors:Shiran Dudy, Jan Simson, Yanan Long
Abstract:
As a relatively new forum, ACM FAccT has become a key space for activists and scholars to critically examine emerging AI and ML technologies. It brings together academics, civil society members, and government representatives from diverse fields to explore the broader societal impacts of both deployed and proposed technologies. We report a large-scale participatory design (PD) process for reflexive conference governance, which combined an in-person CRAFT session, an asynchronous Polis poll and the synthesis of a governance-facing report for the FAccT leadership. Participants shaped the substantive agenda by authoring seed statements, adding new statements and making patterns of agreement, disagreement and uncertainty made visible through voting.Our endeavors represent one of the the first instances of applying PD to a venue that critically interrogates the societal impacts of AI, fostering a niche in which critical scholars are free to voice their concerns. Finally, this work advances large-scale PD theory by providing an effective case study of a co-design paradigm that can readily scale temporally and epistemologically.
Authors:Delfina S. Martinez Pandiani, Ella Streefkerk, Laurens Naudts, Paula Helm
Abstract:
This paper traces a conceptual shift from understanding vulnerability as a static, essentialized property of data subjects to examining how it is actively enacted through data practices. Unlike reflexive ethical frameworks focused on missing or counter-data, we address the condition of abundance inherent to platformized life-a context where a near inexhaustible mass of data points already exists, shifting the ethical challenge to the researcher's choices in operating upon this existing mass. We argue that the ethical integrity of data science depends not just on who is studied, but on how technical pipelines transform "vulnerable" individuals into data subjects whose vulnerability can be further precarized. We develop this argument through an AI for Social Good (AI4SG) case: a journalist's request to use computer vision to quantify child presence in monetized YouTube 'family vlogs' for regulatory advocacy. This case reveals a "protection paradox": how data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction. Using this request as a point of departure, we perform a methodological deconstruction of the AI pipeline to show how granular technical decisions are ethically constitutive. We contribute a reflexive ethics protocol that translates these insights into a reflexive roadmap for research ethics surrounding platformized data subjects. Organized around four critical junctures-dataset design, operationalization, inference, and dissemination-the protocol identifies technical questions and ethical tensions where well-intentioned work can slide into renewed extraction or exposure. For every decision point, the protocol offers specific prompts to navigate four cross-cutting vulnerabilizing factors: exposure, monetization, narrative fixing, and algorithmic optimization. Rather than uncritically...
Authors:Elodie Bouzekri, Guillaume Riviere
Abstract:
Conveying environmental data has grown interest in encouraging the adoption of eco-friendly lifestyles through data-driven strategies. This scope appeals to data visualizations representing the environmental purpose. For example, previous work has already proposed nature-inspired counters, gauges, and bitmaps, but data series remains to be explored. Therefore, could we design and implement effective plant-like charts? This paper brings answers through a research-through-design approach that explores a design space to maximize readability and aesthetics. It then compares four prototypes of charts over modality and material dimensions by asking users about scenarios involving renewable energy forecasts. The results examine whether implementing physical charts is worth it instead of graphical charts and the advantages of using meaningful materials that evocate sustainability and enhance naturalness. The results also reexamine, with physical charts, the previous results on graphical infographics of slightly lower clarity and readability but higher aesthetics of embellishment. In addition, learnability is examined for encoding rates through folded shapes. This paper shows that physical plant-like charts are worthwhile because of promising performance and best-of-breed naturalness when materials allow low-tech aspects' perception and because being installable in public places without explanations if folded shapes encode rates ranging from 0 to a maximum value.
Authors:Yueling Fan, Richard Lee Davis, Olga Viberg
Abstract:
This study presents WriteFlow, an AI voice-based writing assistant designed to support reflective academic writing through goal-oriented interaction. Academic writing involves iterative reflection and evolving goal regulation, yet prior research and a formative study with 17 participants show that writers often struggle to articulate and manage changing goals. While commonly used AI writing tools emphasize efficiency, they offer limited support for metacognition and writer agency. WriteFlow frames AI interaction as a dialogic space for ongoing goal articulation, monitoring, and negotiation grounded in writers' intentions. Findings from a Wizard-of-Oz study with 12 expert users show that WriteFlow scaffolds metacognitive regulation and reflection-in-action by supporting iterative goal refinement, maintaining goal-text alignment during drafting, and prompting evaluation of goal fulfillment. We discuss design implications for AI writing systems that prioritize reflective dialogue, flexible goal structures, and multi-perspective feedback to support intentional and agentic writing.
Authors:Franziska Babel, Shane Saunderson, Shalaleh Rismani
Abstract:
This paper presents a preliminary draft of a framework around the use of anthropomorphic deception, defined here as misleading users towards humanlike affordances in the design of autonomous systems. The goal is to promote reflection among HCI and HRI researchers, as well as industry practitioners, to think about levels of anthropomorphic design that are: a) functionally necessary, b) socially appropriate, and c) ethically permissible for their use case. By reviewing the relevant literature on deception in HCI and HRI, we propose a framework with four levels of anthropomorphic deception. These levels are defined and distinguished by three factors: humanlikeness, agency, and selfhood. Example use cases at each level illustrate considerations around their functional, social, and ethical permissibility. We then present how this framework is applicable to previous work on persuasive robots We hope to promote a balanced view on anthropomorphic deception by design that should be neither naïve (e.g., as a default) nor exploitive (e.g., for economic benefit).
Authors:Chloé Greenstreet, Anastasia Vayona, Jane Henriksen-Bulmer
Abstract:
Despite public motivation to recycle, significant barriers hinder effective household recycling in the UK. Decentralised local authority waste management creates citizen confusion and "wishcycling" (disposing of non-recyclable items in recycling bins). The recent Simpler Recycling Policy further complicates this landscape by mandating new identification, sorting, and cleaning requirements that will require citizen guidance to ensure they understand how these will impact their recycling practices. This mixed methods study (surveys n=50, expert interviews, design activities) used the Value Proposition Canvas to identify citizen pain points: confusion about logos, logistical constraints, and information gaps about local requirements. We then developed an interactive prototype application providing location-specific guidance, visual sorting aids, and material-specific information to address these painpoints. Focus group evaluation showed the prototype improved recycling accuracy by 60 percent, with marked improvements in packaging assessment. Technology-enabled solutions grounded in user-centred design can measurably improve recycling behaviours and reduce contamination. However, such solutions are most effective when complementing (rather than substituting for) systemic improvements in local authority communication and service design.
Authors:Mohammed Oussama Seddini, Mohamed Ez-Zaouia, Ngoc Luyen Le, Iza Marfisi
Abstract:
Mixed Reality (MR) offers immersive and multimodal opportunities for education but remains difficult for teachers to author without technical expertise. We propose MRGEN, a conceptual framework for LLM-powered authoring tools to support teachers in creating MR learning activities that work on mobile devices (tablets and smartphones). MRGEN articulates three axes: Learning Objectives, MR Modality, and GAI Assistance. To validate our framework, we implemented a prototype based on the open-source MIXAP authoring platform and conducted a user study with 24 participants. Results show that LLM-powered authoring reduced task duration by 36% on average, and that over 90% of participants found the AI support helpful for brainstorming, structuring, and aligning content with their learning goals. These findings yielded very promising results for future AI-assisted MR authoring tools.
Authors:Annabel Blake, Marcus Carter, Eduardo Velloso
Abstract:
Young people are among the fastest adopters of generative AI, yet research emphasises adult-designed tools and experiments rather than playful, self-directed youth use. We analysed discourse from 4,172 users in Character.AI's official Discord, finding that the most engaged users were predominantly adolescents (50% aged 13-17), female or non-binary (61.9%), with most (59%) creating their own characters. We contribute (1) a descriptive account of how highly-engaged youth on Character.AI's Discord use AI for playful, emotional, and creative practices that push the platform limits; (2) a framework of three engagement intents -- Restoration (emotional regulation), Exploration (creative experimentation), and Transformation (identity development); and (3) a taxonomy of seven youth-created character archetypes. Together, these findings reveal how youth invent novel roles for AI, expose critical misalignments between youth use and current AI experiences, and provide frameworks for researchers and practitioners to design youth-centred AI futures.
Authors:Katie Seaborn, Shano Liang, Rua M. Williams, Phoebe O. Toups Dugas
Abstract:
Agender euphoria is a new term representing the powerful feelings of happiness, joy, and contentment derived from experiences in gender-free embodiments, spaces, and activities. People with and without agender and adjacent identities (e.g., genderless, gender-free, non-binary, gender-apathetic) may have such experiences under the right circumstances. Video games can offer gender minorities a safe haven for gender euphoric experiences. However, the possibility of agender euphoric experiences was unexplored. We considered this overlooked frame of self-actualization with 142 people who identified as having or desiring agender euphoric experiences. Using the critical incident technique (CIT), we uncovered how games and play experiences create (and inhibit) agender euphoria. We surface this experiential phenomenon and provide empirically-grounded criteria for the design of games to elicit agender euphoric experiences for everyone, but especially agender and agender adjacent players. This work adds to the growing critical literatures on marginalized experiences in games research and human-computer interaction.
Authors:Danny Leen, Stig Konings, Raf Ramakers, Kris Luyten
Abstract:
Many HCIxfabrication systems are compelling as prototypes but remain difficult to reuse, extend, or transfer beyond their original publication. A common explanation is that adoption simply takes time. We argue that the issue is more fundamental. The knowledge needed to make fabrication systems transferable, namely how they behave across different materials, machines, and users, usually does not exist at the time of publication because the work required to generate this knowledge is rarely incentivized or rewarded. Drawing on engineering epistemology and prior debates in systems-oriented HCI, we reframe engineering maturity as epistemic work: sustained engineering effort that produces knowledge which prototyping alone cannot reveal. We propose six dimensions, Fab-ilities, as a vocabulary to describe what aspects of fabrication artifacts have become established and what knowledge remains tacit: (1) buildability, (2) executability, (3) reliability, (4) maintainability, (5) transferability, and (6) scalability. We describe five of our own projects (JigFab, StoryStick++, Silicone Devices, LamiFold, and PaperPulse), where varied attempts at dissemination, such as commercialization, spin-offs, and market exploration, each exposed different gaps between what we published and what transfer actually required.
Authors:Caleb Adu, Neil Kapadia, Binhe Liu, Jonathan Randall, Sruthi Viswanathan
Abstract:
Universities are microcosms of urban ecosystems, with concentrated consumption patterns in food, transport, energy, and product usage. These environments not only contribute substantially to sustainability pressures but also provide a unique opportunity to advance sustainability education and behavioural change at scale. As in most sectors, digital sustainability initiatives within universities remain narrowly focused on carbon calculations, typically providing static feedback that limits opportunities for sustained behavioural change. To address this gap, we propose Eco-Bee, integrating large language models, a translation of the Planetary Boundaries framework (as Eco-Score), and a conversational agent that connects individual choices to environmental limits. Tailored for students at the cusp of lifelong habits, Eco-Bee delivers actionable insights, peer benchmarking, and gamified challenges to sustain engagement and drive measurable progress toward boundary-aligned living. In a pilot tested across multiple campus networks (n=52), 96% of the student participants supported a campus-wide rollout and reported a clearer understanding of how daily behaviours collectively impact the planet's limits. By embedding planetary science, behavioural reinforcement, and AI-driven personalisation into a single platform, Eco-Bee establishes a scalable foundation for climate-conscious universities and future AI-mediated sustainability infrastructures.
Authors:Jianheng Ouyang, Arpit Narechania
Abstract:
As conversational AI systems become popular for information retrieval and question-answering, the references they cite are key to ensuring their answers are reliable and trustworthy. Yet, no prior work systematically analyzes how these references are presented or their quality. We examine 1,517 references from 30 question-answer pairs across nine systems, focusing on their (1) presentation in the user interface and (2) quality using the CRAAP criteria. We find notable variations in the presentation, quality, and quantity of references across systems. For instance, ChatGPT provides more references (9.5 per response on average) with higher quality (15.48/20 CRAAP score), while Hunyuan-TurboS provides fewer references (4.0) and lower quality (11.65/20). Additionally, a preliminary user study shows that people rarely interact with these references and that their behavior differs across systems. These findings highlight the need for better interface designs that help users engage with and trust references more effectively.
Authors:Ilona Buchem, Jessica Kazubski, Charly Goerke
Abstract:
This paper presents the design of NEFFY 2.0, a social robot designed as a haptic slow-paced breathing companion for stress reduction, and reports findings from a mixed-methods user study with 14 refugees from Ukraine. Developed through a user-centered design process, NEFFY 2.0 builds on NEFFY 1.0 and integrates embodiment and multi-sensory interaction to provide low-threshold, accessible guidance of slow-paced breathing for stress relief, which may be particularly valuable for individuals experiencing prolonged periods of anxiety. To evaluate effectiveness, an experimental comparison of a robot-assisted breathing intervention versus an audio-only condition was conducted. Measures included subjective ratings and physiological indicators, such as heart rate (HR), heart rate variability (HRV) using RMSSD parameter, respiratory rate (RR), and galvanic skin response (GSR), alongside qualitative data from interviews exploring user experience and perceived support. Qualitative findings showed that NEFFY 2.0 was perceived as intuitive, calming and supportive. Survey results showed a substantially larger effect in significant reduction of perceived stress in the NEFFY 2.0 condition compared to audio-only. Physiological data reveled mixed results combined with large inter-personal variability. Three patterns of breathing practice with NEFFY 2.0 were identified using k-means clustering. Despite the small sample size, this study makes a novel contribution by providing empirical evidence of stress reduction in a vulnerable population through a direct comparison of robot-assisted and non-robot conditions. The findings position NEFFY 2.0 as a promising low-threshold tool that supports stress relief and contributes to the vision of HRI empowering society.
Authors:Thanushi Withanage, Elizabeth Redcay, Carol Espy-Wilson
Abstract:
Individuals often align their speaking patterns with their interlocutors, a phenomenon linked to engagement and rapport. While well documented in task-oriented dialogues, less is known about entrainment in naturalistic, non-task and virtual settings. In this study, we analyze a large corpus of spontaneous dyadic Zoom conversations to examine how conversational dynamics relate to perceived interaction quality. We extract multimodal features encompassing turn-taking, pauses, facial movements, and acoustic measures such as pitch and intensity. Perceived conversational success was quantified via factor analysis of post-conversation ratings. Results demonstrate that entrainment reliably detected in spontaneous speech and correlates with higher perceived success. These findings identify key interactional markers of conversational quality and highlight opportunities for targeted interventions to foster more effective and engaging communication.
Authors:Yushang Yang, Fanxu Meng, Fiona Fui-Hoon Nah, RAY LC
Abstract:
The ways people remember and recall places reveal an invisible aspect of cultural heritage (CH), reflecting how individuals and communities relate to these places. Heritage is communal, emerging through collaboratively constructed narratives rather than individual records. To probe how people may share collective memories, we designed an immersive two-person workflow for collaboratively co-designing 3D artifacts and environments in virtual heritage locations, using Generative AI (GenAI) to instantiate these intangible memories. Observations of the co-creation process revealed that participants merged prompts and model placements when negotiating different perspectives. They used spatial operations to compose scenes, and also to express personal and embodied experiences of CH. When GenAI failed to meet their needs, participants engaged in creative appropriation, re-purposing unsatisfactory generated objects as sources of design inspiration to further shared narratives. While GenAI may have a homogenizing effect on CH expression, this work shows how people may overcome limitations in immersive collaborative workflows.
Authors:Yomna Elsayed, Cecily Jones
Abstract:
As companies enter the race for agentic AI adoption, fears surface around agentic autonomy and its subsequent risks. These fears compound as companies scale their agentic AI adoption with low-code applications, without a comparable scaling in their governance processes and expertise resulting in a phenomenon known as "Agent Sprawl". While shadow AI tools can help with agentic discovery and identification, few observability tools offer insights into the agents' configuration and settings or the decision-making process during agent-to-agent communication and orchestration. This paper explores AI governance professionals' concerns in enterprise settings, while offering design-time and runtime explainability techniques as suggested by AI governance experts for addressing those fears. Finally, we provide a preliminary prototype of an Agentic AI Card that can help companies feel at ease deploying agents at scale.
Authors:Aditi Agrawal, Celine John Philip, Giancarlo K. Sagastume, Marcus A. Battraw, Wilsaan M. Joiner, Jonathon S. Schofield, Lee M. Miller, Richard S. Whittle
Abstract:
Neuromotor decoding from upper-limb electromyography (sEMG) can enhance human-machine interfaces and offer a more natural means of controlling prosthetic limbs, virtual reality, and household electronics. Unfortunately, current sEMG technology does not always perform consistently across users because individual differences such as age and body mass index, among many others, can substantially alter signal quality. This variability makes sEMG characteristics highly idiosyncratic, often necessitating laborious personalization and iterative tuning to achieve reliable performance. This variability has particular import for sEMG-based assistive devices and neural interfaces, where demographic biases in sEMG features could undermine broad and fair deployment. In this study, we explore how demographic differences affect the sEMG signals produced and their implications for machine learning-based gesture decoding. We analyze the data set provided by, in which we derive 147 common sEMG features extracted from 81 demographically diverse individuals performing discrete hand gestures. Using mixed-effects linear models and partial least squares (PLS) analysis, which take into consideration demographic variables (including age, sex, height, weight, skin properties, subcutaneous fat, and hair density), we identify that 33\% (49 of 147) of commonly used sEMG features show significant associations with demographic characteristics. These results may help guide the development of fair and unbiased sEMG-based neural interfaces across a diverse population.
Authors:S M Raihanul Alam, Md Dilshadur Rahman, Md Naimul Hoque
Abstract:
Visualizing narratives is useful to writers to reflect on unfinished drafts and identify unintentional biases and inconsistencies. Literary scholars can use the visualizations to identify nuanced patterns and literary styles from written text. Current narrative visualization is limited to representing character and location co-occurrences in a timeline, omitting important and complex narrative components such as focalization, causality, and speech. This paper aims to capture and visualize underexplored, complex narrative components as a basis for narrative visualization. As a starting point, we propose a new narrative visualization, named FocalLens, that uses focalization, the component that establishes who sees or perceives the events in a narrative, for representing the narrative. We provide the theoretical foundation of focalization and describe various types and facets of focalization. The details are incorporated in the novel visualization that captures how different characters perceive an event, who directly participate in an event, who indirectly observe the event, and who narrate the event. We also developed a tool that provides fluid interaction between the text and the proposed visualization. The tool was evaluated with four writers and scholars in a qualitative study, where writers analyzed their draft stories and scholars analyzed well-known stories. The findings suggest the tool added a new dimension to the workflow for writers and scholars, an analytical lens that is not available otherwise. We conclude by identifying design implications and future directions.
Authors:Luca-Stefan Pirvu, Bogdan-Alexandru Maciuca, Andrei-Ciprian Rabu, Adrian-Marius Dumitran
Abstract:
Graph theory is a cornerstone of Computer Science education, yet entry-level students often struggle to map abstract node-edge relationships to practical applications. This paper presents the design and architecture of a Minecraft-based educational tool specifically built to visualize graph traversal and shortest-path algorithms. We propose a three-layer system: (1) a Grid Traversal module where terrain types (e.g., soul sand, ice) represent edge weights, allowing for the gamified study of shortest path algorithms; (2) a "Sky Graph" module for interactive 3D manipulation of both directed and undirected graphs; and (3) lessons and quizzes available through books. The system grounds its design in Constructionist learning theory, transitioning students from passive observers to active protagonists who physically manipulate algorithmic behavior. We additionally present a planned empirical evaluation using NASA-TLX and in-game telemetry to validate the system's pedagogical efficacy.
Authors:Adam Poulsen, Ian B. Hickie, Carla Gorban, Zsofi de Haan, William Capon, Ebenezer Eyeson-Annan, Jalal Radwan, Elizabeth M. Scott, Frank Iorfino, Haley M. LaMonica
Abstract:
Conversational generative artificial intelligence agents (or genAI chatbots) could benefit youth mental health, yet young people's perspectives remain underexplored. We examined the Mental health Intelligence Agent (Mia), a genAI chatbot originally designed for professionals in Australian youth services. Following co-design, 32 young people participated in online workshops exploring their perceptions of genAI chatbots in youth mental health and to develop recommendations for reconceptualising Mia for consumers and integrating it into services. Four themes were developed: (1) Humanising AI without dehumanising care, (2) I need to know what's under the hood, (3) Right tool, right place, right time?, and (4) Making it mine on safe ground. This study offers insights into young people's attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with key implications for service integration. Additionally, by co-designing system requirements, this work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health contexts.
Authors:Arnab Paul Choudhury, Nihal Patel
Abstract:
Skill training is crucial for enabling dignified livelihood opportunities. In India, various schemes and initiatives aim to provide skill training in different domains, with ICT and digital technologies playing a vital role. However, there is limited research on understanding on-ground capacities \& constraints and the use of digital tools in these programs. In this study, we look into the mobilization, counseling, and training stages of the 5-stage skill development process that also includes placement and tracking, adopted in Dhamtari's Livelihood College in Chhattisgarh, India, and other programs nationwide. Through the immersion/crystallization approach and mixed-method analysis including GIS mapping, video analysis of CCTV streams, quantitative analysis, and unstructured conversations with administrators, trainers, mobilizers, counselors, and nearby industry personnel for over a year, we identified three major challenges. A lack of inclusive and gendered access to skilling; a tedious manual counseling process with insufficient support staff; and inconsistent trainee attendance alongside sub-standard utilization of digital assets. Finally, we discuss, ways to improve access to skill training by leveraging Vocational Training Partners(VTPs), ways to improve the utilization of existing digital assets, and considerations for improving the counseling process. We conclude by summarizing that skill development programs currently lack institutional elements that enable effective information exchange between stakeholders, thereby creating information bottlenecks that result in inefficiencies, hindering the service delivery. In sum, our study informs the HCI and ICTD literature on the on-ground challenges and constraints faced by stakeholders and the role of technology in supporting such initiatives.
Authors:Christopher D. Wallbridge, Erwin Jose Lopez Pulgarin
Abstract:
This position paper looks briefly at the way we attempt to program robotic AI systems. Many AI systems are based on the idea of trying to improve the performance of one individual system to beyond so-called human baselines. However, these systems often look at one shot and one-way decisions, whereas the real world is more continuous and interactive. Humans, however, are often able to recover from and learn from errors - enabling a much higher rate of success. We look at the challenges of building a system that can detect/recover from its own errors, using the example of robotic nuclear gloveboxes as a use case to help illustrate examples. We then go on to talk about simple starting designs.
Authors:Stephanie Kwari Dharmaputri, Anish Nagpal, Greg Nyilasy, Jing Lei
Abstract:
Advancements in Artificial Intelligence (AI) technologies' social fluency are being integrated into commercial interactions. As tools such as OpenAI's assistant are integrated into platforms such as Shopify, Klarna, and Visa, understanding consumer responses to AI social features become essential. One such feature is relational talk, an informal and non-obligatory social communication embedded in transactional exchanges. Across four experiments, we find: 1) a negative main effect of AI relational talk on satisfaction, mediated by expectancy violation and perceived interaction awkwardness, and 2) goal-relevant relational talk to attenuate this effect. This paper extends the literature by challenging the assumption that increased social fluency will improve satisfaction, and highlights the complexity of integrating social features into AI systems. It also identifies awkwardness as a key emotional response and barrier to effective human-AI interaction, showing that even in the absence of real social repercussions, perceived awkwardness in AI-led commercial interactions can elicit negative responses.
Authors:Benjamin Maltbie, Shivam Raval
Abstract:
Large language models exhibit sycophantic tendencies, but whether this behavior varies systematically with perceived user demographics is underexplored. Inspired by intersectionality (overlapping identities produce compounded effects), we probe whether frontier models conditionally exhibit sycophancy. Across 768 multi-turn conversations spanning 128 personas (varying race, age, gender, confidence) and three domains (mathematics, philosophy, conspiracy theories), we find that sycophancy varies sharply with target model and domain, and emerges from combinations of perceived user traits rather than any single dimension. GPT-5-nano scores far higher than Claude Haiku 4.5 (average sycophancy scores of $\bar{x}=2.96$ vs.\ $1.74$, $p < 10^{-32}$); within GPT-5-nano, philosophy elicits 41\% more sycophancy than mathematics and Hispanic personas receive the highest scores across races. The worst-scoring persona, a confident, 23-year-old Hispanic woman, averages 5.33/10 (max 6/10), while Claude Haiku 4.5 remains uniformly low with no significant demographic variation. We argue that safety evaluations should incorporate identity-aware adversarial testing.
Authors:Angela Jin, Alexander Asemota, Dan E. Krane, Nathaniel D. Adams, Rediet Abebe
Abstract:
AI governance efforts increasingly rely on audit standards: agreed-upon practices for conducting audits. However, poorly designed standards can hide and lend credibility to inadequate systems. We explore how an audit standard's design influences its effectiveness through a case study of ASB 018, a standard for auditing probabilistic genotyping software -- software that the U.S. criminal legal system increasingly uses to analyze DNA samples. Through qualitative analysis of ASB 018 and five audit reports, we identify numerous gaps between the standard's desired outcomes and the auditing practices it enables. For instance, ASB 018 envisions that compliant audits establish restrictions on software use based on observed failures. However, audits can comply without establishing such boundaries. We connect these gaps to the design of the standard's requirements such as vague language and undefined terms. We conclude with recommendations for designing audit standards and evaluating their effectiveness.
Authors:Kaoru Seki, Manisha Vijay, Yasmine Kotturi
Abstract:
Generative AI is reshaping education, yet most university AI policies are written without students and focus on penalizing misuse. This top-down approach sidelines those most affected from decisions that shape their everyday learning, resulting in confusion and fear about acceptable use. We examine how participatory, student-driven AI policy design can address this disconnect. We report on a three-part workshop series in a graduate design course at a minority-serving university in the U.S., where two student leaders facilitated discussions without faculty present. Eight participants shared candid accounts of their AI use, co-authored ten policy recommendations, and visualized them in a zine that circulated across campus. The resulting policies surfaced concerns absent from top-down governance, such as the double standard of requiring students to disclose or abstain from AI use while faculty face no such expectations. We argue that engaging students in AI governance carries value beyond the resulting policies, and offer transferable strategies for fostering participation across disciplines -- a model for calling students in rather than calling students
Authors:Xiaoyan Zhou, Natalia Sempere, Pooria Ghavamian, Asreen Rostami, Andrii Matviienko
Abstract:
Micromobility vehicles, such as e-scooters, Segways, skateboards, and unicycles, are increasingly adopted for short-distance travel due to their low weight and low emissions. Despite their growing popularity, we lack controlled, low-risk environments to study rider experiences and performance. While virtual reality (VR) simulators offer a promising approach by reducing safety risks and providing immersive experiences, micromobility simulators remain largely underexplored. We introduce MicroVRide, a modular 4-in-1 VR micromobility simulator that supports e-scooters, Segways, electric unicycles, and one-wheeled skateboards on a single platform. The simulator preserves vehicle-specific physical constraints and control metaphors, enabling the study of diverse riding behaviors with minimal hardware reconfiguration. We contribute the simulator design and report a preliminary within-subject study (N = 12) that demonstrates feasibility and reveals distinct experiential profiles across vehicles.
Authors:Michal R Wrobel, Agnieszka Landowska, Karolina Makuch
Abstract:
The paper concerns affective information systems that represent and visualize human emotional states. The goal of the study was to find typical representations of discrete and dimensional emotion models in terms of color, size, speed, shape, and animation type. A total of 419 participants were asked about their preferences for emotion visualization. We found that color, speed, and size correlated with selected discrete emotion labels, while speed correlated with arousal in a dimensional model. This study is a first step towards defining a universal emotion representation for use in information systems.
Authors:Tianyu Shao, Miguel Feijóo-García, Yi Zhang, Hugo Castellanos, Tawfiq Salem, Alejandra Magana, Tianyi Li
Abstract:
As AI tools such as ChatGPT enter programming classrooms, students encounter differing rules across courses and instructors, which shape how they use AI and leave them with unequal capabilities for leveraging it. We investigate how students engaged with AI in an introductory Python assignment, analyzing student-LLM chat histories and final code submissions from 163 students. We examined prompt-level strategies, traced trajectories of interaction, and compared AI-generated code with student submissions. We identified trajectories ranging from full delegation to iterative refinement, with hybrid forms in between. Although most students directly copied AI-generated code in their submission, many students scaffolded the code generation through iterative refinement. We also contrasted interaction patterns with assignment outcomes and course performance. Our findings show that prompting trajectories serve as promising windows into students' self-regulation and learning orientation. We draw design implications for educational AI systems that promote personalized and productive student-AI collaborative learning.
Authors:Rumali Perera, Xiaoqi Wang, Han-wei Shen
Abstract:
Knowledge Graphs (KGs) are increasingly used to represent and explore complex, interconnected data across diverse domains. However, existing KG visualization systems remain limited because they fail to provide the context of user questions. They typically return only the direct query results and arrange them with force-directed layouts by treating the graph as purely topological. Such approaches overlook user preferences, ignore ontological distances and semantics, and provide no explanation for node placement. To address these challenges, we propose Context-KG, a context-aware KG visualization framework. Context-KG reframes KG visualization around ontology, context, and user intent. Using Large Language Models (LLMs), it iteratively extracts user preferences from natural language questions and context descriptions, identifying relevant node types, attributes, and contextual relations. These preferences drive a semantically interpretable, ontology-guided layout that is tailored to each query, producing type-aware regions. Context-KG also generates high-level insights unavailable in traditional methods, opening new avenues for effective KG exploration. Evaluations on real world KGs and a comprehensive user study demonstrate improved interpretability, relevance, and task performance, establishing Context-KG as a new paradigm for KG visualization.
Authors:Johannes Wachs, Leonore Röseler, Tobias Gesche, Elliott Ash, Anikó Hannák
Abstract:
Online platforms where volunteers answer each other's questions are important sources of knowledge, yet participation is declining. We ran a pre-registered experiment on Stack Overflow, one of the largest Q&A communities for software development (N = 22,856), randomly assigning newly posted questions to receive an anonymous upvote. Within four weeks, treated users were 6.3% more likely to ask another question and 12.9% more likely to answer someone else's question. A second upvote produced no additional effect. The effect on answering was larger, more persistent, and still significant at twelve weeks. Next, we examine how much of these effects are due to algorithmic amplification, since upvotes also raise a question's rank and visibility. Algorithmic amplification is not important for the effect on asking additional questions, but it matters a lot for the effect on answering other questions. The increase in visibility increases the probability that another user provides an answer, and that experience appears to shift the poster toward broader community participation.
Authors:Sargam Vyas, Bogdan Vlasenko, André Mayoraz, Egon Werlen, Per Bergamin, Mathew Magimai. -Doss
Abstract:
With advancements in multimodal communication technologies, remote learning environments such as, distance universities are increasing. Remote learning typically happens asynchronously. As a consequence, unlike face-to-face in-person classroom teaching, this lacks availability of sufficient emotional cues for making learning a pleasant experience. Motivated by advances made in the paralinguistic speech processing community on emotion prediction, in this paper we explore use of speech for sensing students' emotions by building upon speech-based self-control tasks developed to aid effective remote learning. More precisely, we investigate: (a) whether speech acquired through self-control tasks exhibit perceptible variation along valence, arousal, and dominance dimensions? and (b) whether those dimensional emotion variations can be automatically predicted? We address these two research questions by developing a dataset containing spontaneous monologue speech acquired as open responses to self-control tasks and by carrying out subjective listener evaluations and automatic dimensional emotion prediction studies on that dataset. Our investigations indicate that speech-based self-control tasks can be a means to sense student emotion in remote learning environment. This opens potential venues to seamlessly integrate paralinguistic speech processing technologies in the remote learning loop for enhancing learning experiences through instructional design and feedback generation.
Authors:Taehyun Yang, Eunhye Kim, Zhongzheng Xu, Fumeng Yang
Abstract:
Generative AI tools have lowered barriers to producing branded social media images and captions, yet small-business owners (SBOs) still struggle to create on-brand posts without access to professional designers or marketing consultants. Although these tools enable fast image generation from text prompts, aligning outputs with a brand's intended look and feel remains a demanding, iterative task. In this position paper, we explore how SBOs navigate iterative content creation and how AI-assisted systems can support SBOs' content creation workflow. We conducted a preliminary study with 12 SBOs who independently manage their businesses and social media presence, using a questionnaire to collect their branding practices, content workflows, and use of generative AI alongside conventional design tools. We identified three recurring challenges: (1) translating brand "feel" into effective prompts, (2) difficulty revisiting and comparing prior image generations, and (3) difficulty making sense of changes between iterations to steer refinement. Based on these findings, we present a prototype that scaffolds brand articulation, supports feedback-informed exploration, and maintains a traceboard of branching image iterations. Our work illustrates how traces of the iterative process can serve as workflow support that helps SBOs keep track of explorations, make sense of changes, and refine content.
Authors:Tabea E. Röber, Paul Festor, Rob Goedhart, S. İlker Birbil, Aldo Faisal
Abstract:
Experimental user studies evaluating the effectiveness of different subtypes of post-hoc explanations for black-box models are largely nonexistent. Therefore, the aim of this study was to investigate and evaluate how different types of counterfactual explanations, namely single point explanations and interval-based explanations, affect both model understanding and (demonstrated) trust. We conducted an online user study using a within-subjects experimental design, where the experimental arms were (i) no explanation (control), (ii) feature importance scores, (iii) point counterfactual explanations, and (iv) interval counterfactual explanations. Our results clearly show the superiority of interval explanations over other tested explanation types in increasing both model understanding and demonstrated trust in the AI. We could not support findings of some previous studies showing an effect of point counterfactual explanations compared to the control group. Our results further highlight the role individual differences in, for example, cognitive style or personality, in explanation effectiveness.
Authors:Elena Eleftheriou, George Pallis, Marios Constantinides
Abstract:
Large Language Models (LLMs) are widely used by students, yet their tendency to provide fast and complete answers may discourage reflection and foster overconfidence. We examined how alternative LLM interaction designs support deeper thinking without excessively increasing cognitive burden. We conducted a two-phase mixed-methods study. In Phase 1, interviews with 16 Gen Z students informed the design of Deep3, a web-based system with three interaction modes: \emph{a)} future-self explanations, \emph{b)} contrastive learning, and \emph{c)} guided hints. In Phase 2, we evaluated Deep3 with 85 participants across two learning tasks. We found that a standard single-agent baseline produced high perceived understanding despite the lowest objective learning. In contrast, future-self explanations imposed higher cognitive workload yet yielded the closest alignment between perceived and actual understanding, while guided hints achieved the largest learning gains without a proportional increase in frustration. These findings show that effort, confidence, and learning systematically diverge in LLM-supported work.
Authors:Arnab Paul Choudhury, Rahul Rathod, Aryan Yadav
Abstract:
Solar irrigation systems are increasingly deployed in rural regions, yet their distributed and remote deployment makes maintenance challenging for farmers. While formal monitoring processes and applications exist, they often fall short in practice. We present insights from grid-connected solar irrigation schemes that incentivize farmers to feed energy to the grid, focusing on how farmers maintain their systems. We found that farmers face multiple challenges but are also devising strategies, including the appropriation of WhatsApp to share daily generation data with peers and compare performance across installations to identify potential system anomalies. Our findings highlight how messaging platforms function as informal digital infrastructures enabling collective sensemaking around distributed energy systems. We discuss implications for designing agricultural energy technologies that support peer comparison, contextual interpretation, and community-driven maintenance, framing these as a socio-technical platform. Finally, we outline directions for future work integrating such practices with formal monitoring tools and explore their potential to support citizen science initiatives in environmental sensing.
Authors:Jean-Philippe Rivière, Roman Malo, Sarah Varlin Grassi, Yannick Prié
Abstract:
Occasionally, individuals immersed in a Virtual Reality (VR) environment may experience distractions that disrupt their sense of presence, a phenomenon referred to as a break in presence (BIP). Better understanding BIPs is crucial to designing VR applications that keep their users present. BIPs have been studied using a variety of methods, exploring their origins or trying to detect them from physiological or behavioral measurements. However, despite the importance of understanding how they are actually lived and managed by VR users, very few studies focused on their phenomenological characterization. We employed micro-phenomenology to collect the descriptions of BIPs experienced by users (n=14) of a height exposure VR application. We precisely modeled 57 BIP episodes, bringing to light a variety of experiences and behaviors. Four generic diachronic patterns of BIP episodes emerge: reflected-upon, discarded, self-preservation, and contradictory mediation BIPs. We discuss these in light of the PI/Psi model of presence, propose an awareness-based definition of BIPs, as well as three BIP-related design opportunities.
Authors:Wang Chenglong, Zhuo Yan, Ding Wenbo, Chen Xinlei
Abstract:
Wearable Human Activity Recognition (WHAR) is a prominent research area within ubiquitous computing, whose core lies in effectively modeling intra- and inter-sensor spatio-temporal relationships from multi-modal time series data. Existing methods either suffer from high computational complexity due to attention-based fusion or lack robustness to data variations during feature extraction. To address these issues, we propose a lightweight and generalizable framework that retains the core "decomposition-extraction-fusion" paradigm while introducing two key innovations. First, we replace the computationally expensive Attention and Cross-Variable Fusion (CVF) modules with a Cascaded Fusion Block (CFB), which achieves efficient feature interaction without explicit attention weights through the operational process of "compression-recursion-concatenation-fusion". Second, we integrate a MixStyle-based data augmentation module before the Local Temporal Feature Extraction (LTFE) and Global Temporal Aggregation (GTA) stages. By mixing the mean and variance of different samples within a batch and introducing random coefficients to perturb the data distribution, the model's generalization ability is enhanced without altering the core information of the data. The proposed framework maintains sensor-level, variable-level, and channel-level independence during the decomposition phase, and achieves efficient feature fusion and robust feature extraction in subsequent processes. Experiments on two benchmark datasets (Realdisp, Skoda) demonstrate that our model outperforms state-of-the-art methods in both accuracy and macro-F1 score, while reducing computational overhead by more than 30\% compared to attention-based baselines. This work provides a practical solution for WHAR applications on resource-constrained wearable devices.
Authors:Songmao Li, Kaixuan Qu, Keer Sun, Bhargav Limbasia, Luciano Nocera
Abstract:
Modern supply chain networks involve spatially distributed flows that become difficult to interpret using traditional visualization techniques, producing visual clutter that obscures actionable patterns. We present a multi-scale visual analytics dashboard that combines Semantic Zooming with Skeleton-Based Edge Bundling (SBEB). The system dynamically adapts its representation based on zoom level: bundled aggregate flows at the macro-scale, hexagonal density heatmaps at the meso-scale, and hierarchical inventory sunbursts at the micro-scale. Built on Vue3 and Deck.gl, it reduces raw orders to 202 warehouse-to-state flows. We contribute (1)a semantic zoom implementation with animated transitions that unifies edge bundling, hexagonal density aggregation, and hierarchical inventory views into a single interface; and (2)an algorithmic adaptation of SBEB for geographic origin-destination flows, introducing directional-sector clustering and adaptive detour constraints to preserve cartographic plausibility.
Authors:Alex Farach, Alexia Cambon, Lev Tankelevitch, Connie Hsueh, Rebecca Janssen
Abstract:
Organizations have widely deployed generative AI tools, yet productivity gains remain uneven, suggesting that how people use AI matters as much as whether they have access. We conducted a field experiment with 388 employees at a Fortune 500 retailer to test two scaffolding interventions for human-AI collaboration. All participants had access to the same AI tool; we varied only the structure surrounding its use. A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use and substantially lower document production. A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution. Treatment participants also showed greater positive belief change across the session, though sensitivity analyses suggest this likely reflects recovery from carry-over effects rather than genuine training-induced shifts. Both findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length.
Authors:Longxiang Jiao, Lukas Hofmann, Yiru Yang, Zhanyi Wu, Jonas Egeler
Abstract:
While micro-scale traffic simulations provide essential data for urban planning, they are rarely coupled with the high-fidelity visualization or auralization necessary for effective stakeholder communication. In this work, we present a real-time 4D visualization framework that couples the SUMO traffic with a photorealistic, geospatially accurate VR representation of Zurich in Unreal Engine 5. Our architecture implements a robust C++ data pipeline for synchronized vehicle visualization and features an Open Sound Control (OSC) interface to support external auralization engines. We validate the framework through a user study assessing the correlation between simulated traffic dynamics and human perception. Results demonstrate a high degree of perceptual alignment, where users correctly interpret safety risks from the 4D simulation. Furthermore, our findings indicate that the inclusion of spatialized audio alters the user's sense of safety, showing the importance of multimodality in traffic simulations.
Authors:Jue Chen, Alexander Mielke, Kaspar Althoefer, Elisabetta Versace
Abstract:
The potential of Animal-Robot Interaction (ARI) in welfare applications depends on how much an animal perceives a robotic agent as socially relevant, non-threatening and potentially attractive (acceptance). Here, we present an animal-centered soft robotic affective interface for newly hatched chicks (Gallus gallus). The soft interface provides safe and controllable cues, including warmth, breathing-like rhythmic deformation, and face-like visual stimuli. We evaluated chick acceptance of the interface and chick-robot interactions by measuring spontaneous approach and touch responses during video tracking. Overall, chicks approached and spent increasing time on or near the interface, demonstrating acceptance of the device. Across different layouts, chicks showed strong preference for warm thermal stimulation, which increased over time. Face-like visual cues elicited a swift and stable preference, speeding up the initial approach to the tactile interface. Although the breathing cue did not elicit any preference, neither did it trigger avoidance, paving the way for further exploration. These findings translate affective interface concepts to ARI, demonstrating that appropriate soft, thermal and visual stimuli can sustain early chick-robot interactions. This work establishes a reliable evaluation protocol and a safe baseline for designing multimodal robotic devices for animal welfare and neuroscientific research.
Authors:Aditya Sabbineni, Pravin Nagare, Devendra Dahiphale, Preetam Dedu, Willison Lopes
Abstract:
The rapid expansion of the Internet of Things (IoT) and smart home ecosystems has led to a fragmented landscape of user data management across consumer electronics (CE) such as Smart TVs, gaming consoles, and set-top boxes. Current onboarding processes on these devices are characterized by high friction due to manual data entry and opaque data-sharing practices. This paper introduces the User Data Sharing System (UDSS), a platform-agnostic framework designed to facilitate secure, privacy-first PII (Personally Identifiable Information) exchange between device platforms and third-party applications. Our system implements a Contextual Scope Enforcement (CSE) mechanism that programmatically restricts data exposure based on user intent - specifically distinguishing between Sign-In and Sign-Up workflows. Unlike cloud-anchored identity standards such as FIDO2/WebAuthn, UDSS is designed for shared, device-centric CE environments where persistent user-to-device binding cannot be assumed. We further propose a tiered access model that balances developer needs with regulatory compliance (GDPR/CCPA). A proof-of-concept implementation on a reference ARMv8 Linux-based middleware demonstrates that UDSS reduces user onboarding latency by 65% and measurably reduces PII over-exposure risk through protocol-enforced data minimization. This framework provides a standardized approach to identity management in the heterogeneous CE market.
Authors:Poornima Meegammana, Niranjan Meegammana, Chathurika Jayalath, Chethya Munasinghe, Kunal Gupta
Abstract:
Girls remain underrepresented in computing, and rural contexts often compound barriers of access, language, and gender norms. Prior work in computing education highlights that confidence and belonging can shape participation, yet most evidence comes from well-resourced, English-dominant settings. Less is known about how locally grounded pathways can build programming self-efficacy and broaden career interest for adolescent girls. We addressed this gap by delivering a curriculum that began with digital foundations and unplugged problem-solving, then progressed to block-based programming activities, supported by parent awareness and teacher training in gender-responsive practices. Pre and post-surveys showed a reliable increase in programming self-efficacy, and career aspirations shifted toward technology. Complementary qualitative data indicate that mastery experiences, peer collaboration, and the creation of personal projects were key drivers of confidence, suggesting design priorities for scalable, locally relevant programmes in low-resource communities that can shift perceptions of who belongs in computing.
Authors:A. Xygkou-Tsiamoulou, Alexandra Covaci, Zeqi Jia, Jenny Yiend, Chee Siang Ang
Abstract:
As humanity pivots toward long-duration interplanetary travel, the psychological constraints of Isolated and Confined Environments (ICE) emerge as a primary mission risk. This paper presents COSMIC (COmpanion System for Mission Interaction and Communication) representing the inaugural investigation into the deployment of a high-fidelity, emotionally intelligent AI companion in an analog astronaut setting. By integrating a Large Language Model (LLM) architecture with a diffusion-based digital avatar interface, COSMIC transcends traditional task-oriented automation to provide longitudinal affective support. We detail a modular system architecture designed for temporal continuity through short- and long-term memory systems and outline a robust naturalistic observational framework for evaluating psychological resilience at the LunAres Research Station. This work constitutes the first formal submission in the field to evaluate the efficacy of state-of-the-art generative AI and synthesized visual empathy in mitigating the effects of extreme isolation.
Authors:Sirajam Munira, Lydia Manikonda
Abstract:
Artificial Intelligence (AI) chatbots are increasingly used for emotional, creative, and social support, leading to sustained and routine user interaction with these systems. As these applications evolve through frequent version updates, changes in functionality or behavior may influence how users evaluate them. However, work on how publicly expressed user feedback varies across app versions in real-world deployment contexts is limited. This study analyzes 210,840 Google Play reviews of the chatbot application Character AI, linking each review to the app version active at the time of posting. We specifically examine negative reviews to study how version-level rating trends, and linguistic patterns reflect user experiences. Our results show that user ratings fluctuate across successive versions, with certain releases associated with stronger negative evaluations. Thematic analysis indicates that dissatisfaction is concentrated around recurring issues related to technical malfunctions and errors. A subset of reviews additionally frames these concerns in terms of potential psychological or addiction-related effects. The findings highlight how aggregate user evaluations and expressed concerns vary across software iterations and provide empirical insight into how update cycles relate to user feedback patterns and underscore the importance of stability and transparent communication in evolving AI systems.
Authors:Adam Hepworth, Zena Assaad, Austin Wyatt, Hussein Abbass
Abstract:
Military human robot interaction (MHRI) presents a novel opportunity to blend the capabilities of autonomous and Artificial Intelligence (AI)-enabled systems with the skills and expertise of humans. The concept promises military advantages and greater operational effectiveness and efficiencies. However, the associated human-AI dynamics create challenges when attempting to design, implement, and operationalise the increasingly symbiotic relationship between humans and machines. Meaningful human control (MHC) is a popularised conceptualisation of what is deemed a responsible interaction among human and artificial agents; however, this notion falls short in military contexts and hinders the realisation of military advantages that could be achieved by advancing the adoption of responsible AI. This paper presents meaningful human command (MHC1) as a more operationally effective concept for advanced military command and control systems that embed AI-enabled autonomous systems. We introduce, explore, and unpack meaningful human command in the context of military human-robot interaction, presenting a vignette that offers a technologically feasible concept of an AI-enabled system within military operations. The vignette is used to guide, contextualise, and add realism to the narrative describing the concept and highlights associated MHRI challenges.
Authors:Peter Kirgis, Ben Hawriluk, Sherrie Feng, Aslan Bilimer, Sam Paech, Zeynep Tufekci
Abstract:
People increasingly hold sustained, open-ended conversations with large language models (LLMs). Public reports and early studies suggest that, in such settings, models can reinforce delusional or conspiratorial ideation or even amplify harmful beliefs and engagement patterns. We present an audit and benchmarking study that measures how different LLMs encourage, resist, or escalate disordered and conspiratorial thinking. We explicitly compare API outputs to user chat interfaces, like the ChatGPT desktop app or web interface, which is how people have conversations with chatbots in real life but are almost never used for testing. In total, we run 56 20-turn conversations testing ChatGPT-4o and ChatGPT-5, via both the API and chat interface, and grade each conversation by two research assistants (RAs) as well as by GPT-5. We document five results. First, we observe large differences in performance between the API and chat interface environments, showing that the universally used method of automated testing through the API is not sufficient to assess the impact of chatbots in the real world. Second, when tested in the chat interface, we find that ChatGPT-5 displays less sycophancy, escalation, and delusion reinforcement than ChatGPT-4o, showing that these behaviors are influenced by the policy choices of major AI companies. Third, conversations with nearly identical aggregate intensity in a behavior display large differences in how the behavior evolves turn by turn, highlighting the importance of temporal dynamics in multi-turn evaluation. Fourth, even updated models display substantial levels of negative behaviors, revealing that model improvement does not imply model safety. Fifth, the same API endpoint tested just two months apart yields a complete reversal in behavior, underscoring how transparency in model updates is a necessary prerequisite for robust audit findings.
Authors:Ian Frank, Kanata Kawanishi
Abstract:
Search algorithms are a foundational topic in artificial intelligence education, yet even simple domains can generate large state spaces that challenge learners' ability to form accurate mental models. This paper presents an interactive learning system that demonstrates the feasibility of visualising the entire reachable state space of the 8-puzzle (181,440 states), while tightly coupling abstract graph structure with concrete puzzle manipulation. Built using Unity and modern GPU-based rendering techniques, the system enables real-time exploration of global structure, step-by-step execution of search algorithms, and direct comparison of how different strategies traverse the same space. We describe the system's design, visualisation layouts, and educational use, reporting findings from an initial classroom deployment and pilot study with students at different levels of university education. Overall, the results indicate that full state-space visualisation is both technically feasible and educationally valuable for supporting conceptual understanding of search behaviour within this canonical problem domain.
Authors:Raymond Chung, Keith Ng, CD Shum
Abstract:
We propose a personalized chatbot designed for elderly individuals. The chatbot initiates discussions based on family photos, encouraging users to interact naturally. During these interactions, it generates W questions (who, where, when, and what) to stimulate cognitive function, followed by an open-ended question to promote positive reminiscence. This approach is structured around a goal-oriented dialogue framework. Additionally, after each conversation about a photo, the chatbot analyzes the discussion to identify topics that the user favors or dislikes. It then offers the user the option to chat about another photo either featuring the same family members or an individual previously mentioned in the conversation. To support this system, we have developed a web portal that allows caregivers to upload photos and review chat conversations. This personalized chatbot not only encourages elderly users to engage with the chatbot regularly and reduces feelings of loneliness but also provides caregivers with a valuable tool to gain insights into users' well-being.
Authors:Suncica Hadzidedic, Jingyun Wang, Victor Elijah Adeyemo, George Sanders, Grant Westermann
Abstract:
Obesity is a global health challenge. According to the World Health Organization (WHO), between 1990 and 2022, adult obesity more than doubled. Weight management interventions (WMIs) support individuals in achieving and maintaining a healthy weight through dietary guidance, physical activity promotion and behavioural counselling. However, traditional WMIs often have limited accessibility. Digital WMIs or DWMIs are delivered via websites or smartphone applications and provide scalable and cost-effective alternatives. However, user needs for digital services and their prevalence in the existing commercial solutions remain underexplored. Hence, our study systematically identified 26 commercial DWMIs to identify their features, services, and data collection practices. Additionally, we performed a user needs analysis by recruiting 207 individuals involved in a real-life WMI. Our findings indicated that DWMIs integrated self-monitoring, goal setting, and behaviour change strategies, yet lack social support, virtual reality applications and adaptive personalisation. WMI clients prefer smartphone Apps and fitness trackers for tracking weight management progress and have varying levels of comfort in using digital resources. The presented results serve as recommendations for future directions in the design and implementation of services for DWMIs.
Authors:Luniva Chitrakar, Ishan Panta, Biplov Paneru, Sangharsh Poudel, Lahana Kansakar
Abstract:
Financial management has been revolutionized by mobile banking, but increasing usefulness and satisfaction requires a better user experience. This study aims to provide an improved customer experience by offering user-friendly interfaces, and real-time notifications by user-centric design of mobile banking application UI. A survey was carried out on the target audience in which 81% of respondents to a study of 103 people said they regularly used mobile banking apps, while 77% said they had problems with the ones they were using at the time. Furthermore, 44.7% of respondents expressed unhappiness with the current solutions by depending on third-party apps like e-Sewa and Khalti for everyday transactions. Language obstacles, lengthy loading times, unclear terminology, and navigational challenges were among the problems found. With 84% asking for a budgeting function and 46% complaining about biometric authentication, users indicated a need for more individualized interfaces, improved customer service, and increased security. The study included Think Aloud testing, heat maps, and remote usability testing to determine user preferences and pain spots to solve these. Feedback from a wider audience was obtained informally through guerrilla usability testing. The results highlight how important it is for mobile banking apps to guarantee security, increase functionality, simplify navigation, and improve visual design. App grouping and layout can be further enhanced by utilizing Gestalt psychology concepts like closeness and symmetry. The goal of these user-centered insights is to promote greater happiness and adoption of mobile banking.
Authors:Tongxin Li, Katelyn M Reyes, Liezeil Jimenez, Katie S Nam, Donghee Yvette Wohn
Abstract:
AI-generated non-consensual intimate imagery (AIG-NCII) is an emerging social problem due to the advancement of AI tools. While recent incidents in middle and high schools have highlighted the urgency of this issue, there is limited understanding of what concrete supports schools need to effectively address AIG-NCII. To fill this gap, we conducted an interview study with 20 educators in the U.S. and investigated their attitudes, experiences, and practices related to AIG-NCII. Educators expressed concerns about both students' and their own vulnerability, as AIG-NCII may cause moral decline among students, while educators themselves could become victims. Nevertheless, existing practices in schools are limited, and they lack both training and systematic policies. Challenges such as a lack of resources, unclear legal boundaries, and limited knowledge of AI make implementation difficult. The findings of this paper contribute to interactive educational tool design, curriculum design, and policy-making, especially regarding the need for multi-stakeholder strategies to address issues surrounding AIG-NCII.
Authors:Pavel Manakhov, Hans Gellersen
Abstract:
Wearable augmented reality (AR) represents the next interface to all things computing, extending what smartphones and laptops can do. This involves providing access to digital information during activities like walking or jogging. In this work we argue that the impact of physical movement on AR interaction is not direct, but mediated by UI placement - the spatial relationship between the user and the interface. Current research often treats interaction techniques in isolation, overlooking how their performance is fundamentally linked to where the UI is placed. This position paper highlights the need to reconceptualize UI placement beyond traditional anchoring views, explore novel interaction techniques designed for specific UI placements during locomotion, and rigorously evaluate UI placement as an independent variable in experimental studies. By centering the analysis on the relative movement between user and interface, we can unlock more effective on-the-go AR interaction.
Authors:Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic
Abstract:
Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants' real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.
Authors:Anya Martin, Cindy Lin
Abstract:
HCI work has explored the effective integration of AI/ML tools across "application domains" from healthcare to finance to transportation. We add to this literature with an analysis of AI/ML tools in meteorology, a domain that already uses "big data" and massive physics-based models. Drawing from 12 interviews with forecasters and meteorologists with varied connections to AI/ML weather modeling, we trace tensions in AI/ML weather application arising from what we call "regimes of scale," different ways that AI/ML and meteorological regimes make observations, data, and models scale. Rather than seeing AI/ML as a domain-agnostic tool, we argue that AI/ML methods were born from specific platform and internet infrastructures, and so they can struggle to integrate with very different (in this case meteorological) ways of organizing data pipelines.
Authors:Shin Shoon Nicholas Teng, Kenny Tsu Wei Choo
Abstract:
Foreign Domestic Workers (FDWs) play a central role in home-based eldercare yet often experience substantial emotional caregiving burden shaped by linguistic barriers, social isolation, and limited access to support. While caregiving burden has been extensively studied among familial caregivers, little is known about how FDWs engage with emotional support technologies. We present an exploratory qualitative study of how FDWs in Singapore interact with a Large Language Model (LLM)-driven chatbot as an everyday, non-clinical form of emotional support. Through interviews and guided chatbot interactions, we conducted an inductive thematic analysis of participants' experiences. We identify three design-relevant themes: chatbots were experienced as psychologically safe and emotionally validating; they supported linguistic accessibility by accommodating imperfect and fragmented language; and they were appropriated as multifunctional resources for reassurance, guidance, and companionship. We discuss implications for designing LLM-driven emotional support tools that foreground psychological safety, accessibility, and flexible appropriation.
Authors:Jonathan Leuenberger, Anamika Rajendran, Augusto Penzo Jara, Tajwar-Ul Hoque, Shiva Darian
Abstract:
People experiencing migration endure many transitions across borders, technologies, and social systems. While HCI research often emphasizes this community's adoption of technology, less attention has been paid to practices of technological non-use. This paper investigates how information and communication technologies (ICTs) are intentionally and unintentionally avoided, withheld, or not used during migration. Drawing on interviews with 32 people experiencing migration in the border city of El Paso, Texas, USA between February and May 2025, we identify a range of non-use experiences, including device, informational, and protective non-use. We extend the concept of non-use by situating it within the three phases of transitions: understanding, negotiating, and resolving. We show how ICT non-use shifts with time, risk, and institutional demands. Our analysis demonstrates that non-use functions both as a protective strategy and as a response to systemic exclusion, and concludes with design principles that anticipate non-use as both intentional and unintentional design conditions rather than as punitive failure.
Authors:Khoi T. N. Nguyen, Nghia D. Nguyen, Hui Yu Koh, Patrick W. H. Kwong, Karen Sui Geok Chua, Ananda Sidarta, Baosheng Yu
Abstract:
Gait analysis is essential in post-stroke rehabilitation but remains time-intensive and cognitively demanding, especially when clinicians must integrate gait videos and motion-capture data into structured reports. We present OGA-AID, a clinician-in-the-loop multi-agent large language model system for multimodal report drafting. The system coordinates 3 specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured assessments. Evaluated with expert physiotherapists on real patient data, OGA-AID consistently outperforms single-pass multimodal baselines with low error. In clinician-in-the-loop settings, brief expert preliminary notes further reduce error compared to reference assessments. Our findings demonstrate the feasibility of multimodal agentic systems for structured clinical gait assessment and highlight the complementary relationship between AI-assisted analysis and human clinical judgment in rehabilitation workflows.
Authors:Donghee Hong, Minjong Kim, Sooyoung Cha, Jaemin Jo
Abstract:
Symbolic execution engines such as KLEE automatically generate test cases to maximize branch coverage, but their numerous parameters make it difficult to understand the parameters' impact, leading the user to rely on suboptimal default configurations. While automated tuners have shown promising results, they provide limited insights into why certain configurations work well, motivating the need for Human-in-the-Loop approaches. In this work, we present a visual analytics system, Symetra, designed to support Human-in-the-Loop parameter tuning of symbolic execution engines. To handle a large number of parameters and their configurations, we provide two complementary overviews of their impact on branch coverage values and patterns. Building on these overviews, our system enables collective analysis, allowing the user to contrast groups of configurations and identify differences that may affect branch coverage. We also report on case studies and a Human-in-the-Loop tuning process, demonstrating that experts not only interpreted parameter impacts and identified complementary configurations, but also improved upon fully automated approaches in both branch coverage and tuning efficiency.
Authors:Hunter M Beach, Devin Jay D San Nicolas, Carly Miller, Cathy Ly, Jared Duval
Abstract:
Motor challenges are prevalent among autistic children, and games are able to simultaneously produce clinically meaningful results and provide a motivating context, but many current solutions are too rigid. We conducted a two-phase qualitative study comprised of semi-structured interviews and participatory design workshops with 7 pediatric physical and 5 occupational therapists (PTs/OTs) to investigate their perspectives and experiences with game and play-based interventions. We identified 8 prominent themes describing key characteristics of current successful interventions, opportunities, and barriers to adoption in clinical practice. We present a speculative design informed by thematic analysis that addresses current challenges of rigidity in Serious Games for Health (SG4H). Our modular platform (AutMotion Studio) hosts a suite of interventions as customizable minigames, allowing community members to contribute to and employ Wizard of Oz paradigms for flexible appropriation strategies.
Authors:Yunjia Guo, Jinghan Zhu, Siyu Wang, Haixin Qiao
Abstract:
Large language models (LLMs) are bringing richer dialogue and social behavior into games, but they also expose a control problem that existing game interfaces do not directly address: how should LLM characters participate in live multiplayer interaction while remaining executable in the shared game world, socially coherent with other active characters, and steerable by players when needed? We frame this problem as bounded autonomy, a control architecture for live multiplayer games that organizes LLM character control around three interfaces: agent-agent interaction, agent-world action execution, and player-agent steering. We instantiate bounded autonomy with probabilistic reply-chain decay, an embedding-based action grounding pipeline with fallback, and whisper, a lightweight soft-steering technique that lets players influence a character's next move without fully overriding autonomy. We deploy this architecture in a live multiplayer social game and study its behavior through analyses of interaction stability, grounding quality, whisper intervention success, and formative interviews. Our results show how bounded autonomy makes LLM character interaction workable in practice, frames controllability as a distinct runtime control problem for LLM characters in live multiplayer games, and provides a concrete exemplar for future games built around this interaction paradigm.
Authors:Wenjuan Zhong, Chenfei Ma, Kianoush Nazarpour
Abstract:
Thumb gestures provide an effective and unobtrusive input modality for wearable and always-available human-machine interaction. Wrist-worn surface electromyography (sEMG) has emerged as a promising approach for compact and wearable human-machine interfaces. However, compared to forearm sEMG, the impact of electrode configuration on wrist-based decoding performance remains understudied. We systematically investigated electrode configuration strategies for wrist-based thumb-movement recognition using high-density (HD) and low-density (LD) sEMG measurement systems. We considered factors such as muscle region, reference scheme, channel count, and spatial density of the electrode. Experimental results show that 1) extensor-side electrodes outperform flexor-side electrodes (HD: 0.871 vs. 0.821; LD: 0.769 vs. 0.705); 2) monopolar recordings consistently outperform bipolar configurations (15 channel with HD monopolar vs. LD bipolar: 0.885 vs. 0.823); and 3) increasing channel count enhances performance, but exhibits diminishing returns. We further show that electrode spatial distribution introduces a trade-off between spatial coverage and compactness. The findings suggest that the effectiveness of wrist-worn sEMG systems depends less on the deployment of a large number of electrodes in a broad sensing area and more on the optimization of electrode placement and the referencing scheme. This work provides practical guidelines for developing efficient wrist-worn sEMG-based gesture recognition systems.
Authors:Roni Segal, Matan Lary, Ralf Schmaelzle, Yossi Ben-Zion
Abstract:
What makes a public talk resonate with large audiences? While prior research has emphasized speaker delivery or topic novelty, we reasoned that a core driver of engagement is linguistic clarity. This aligns with theories of processing fluency and cognitive load, which posit that audiences reward speakers who present complex ideas accessibly. We leveraged artificial intelligence to analyze 1,239 TED Talk transcripts (2006--2013), supplemented by a later-phase longitudinal sample. Each transcript was evaluated across 50 independent large language model runs on two dimensions, clarity of explanation and structural organization, and linked to YouTube engagement metrics (likes and views).Clarity emerged as the strongest predictor of audience responses ($β= .339$ for likes; $β= .314$ for views), contributing substantial incremental variance ($ΔR^{2} \approx .095$) beyond duration, topic, and scientific status. The full model explained 29\% of variance in likes and 22.5\% in views. This effect was domain-general, remaining invariant across content categories and between scientific and non-scientific talks. Notably, clarity outperformed traditional readability metrics, indicating that discourse coherence predicts engagement more powerfully than surface-level linguistic simplicity. Longitudinal analyses further revealed standardization within TED, characterized by increasing clarity and reduced variability over time. Theoretically, these results support processing fluency accounts: clearer communication reduces cognitive friction and elicits more positive evaluative responses. Practically, transcript-based clarity represents a scalable and trainable strategy for improving public discourse. By demonstrating that language models can reliably capture latent communicative qualities, this study paves the way for feedback systems in education, science communication, and public speaking.
Authors:Jie Cao, Ha Nguyen, Selim Yavuz, Boran Yu, Shuguang Wang, Pavneet Kaur Bharaj, Dionne Cross Francis
Abstract:
Large Language Model (LLM) simulations, where LLMs act as students with varying approaches to learning tasks, can support teachers' noticing of student thinking. However, simulations using zero- or few-shot prompting often yield inauthentic knowledge and language, directing teachers to unrealistic reasoning. We evaluate three approaches (Fine-tuning, Multi-agent, and Direct Preference Optimization; DPO) to improve the authenticity and pedagogical utility of simulated students. All approaches improve cognitive and linguistic authenticity, compared with few-shot prompts. Interviews with elementary mathematics pre-service teachers and researchers (\textit{n} = 8) reveal distinct pedagogical affordances. The fine-tuned model produces realistic, brief responses but limits opportunities to extend students' thinking. Meanwhile, the multi-agent and DPO approaches generate explicit reasoning behind student strategies. We discuss implications for designing LLM simulations that balance authenticity with instructional utility for teacher learning.
Authors:Jaime Banks, Jianghui Li
Abstract:
Mind perception (MP) is a psychological phenomenon in which humans automatically infer that another entity has a mind and/or mental capacities, usually understood in two dimensions (perceived agency and experience capacities). Despite MP's centrality to many social processes, understanding how MP may function in humans' machine companionship relations is limited. This is in part due to reliance on self reports and the gap between automatic MP processes and more purposeful and norm governed expressions of MP. We here leverage MP signaling language to explore the relationship between MP and AI companionship in humans' natural language. We systematically collected discussions about companionship from AI dedicated Reddit forums and examined the cooccurrence of words (a) known to signal agentic and experiential MP and those induced from the data and (b) discussion topics related to AI companionship. Using inductive and deductive approaches, we identify a small set of linguistic indicators as reasonable markers of MP in human/AI chat, and some are linked to critical discussions of companion authenticity and philosophical and ethical imaginaries.
Authors:Zaibei Li, Shunpei Yamaguchi, Qiuchi Li, Daniel Spikol
Abstract:
We present BadgeX, a novel system integrating lightweight wearable IoT devices (smart badges/smartphones) with Large Language Models (LLMs) to enable real-time collaborative learning analytics. The system captures multimodal sensor data (e.g., audio, image, motion, depth) from learners, processes it into structured features, and employs an LLM-driven framework to interpret these features, generating high-level insights grounded in learning theory. A pilot study demonstrated the system's capability to capture rich collaboration traces and for an LLM to produce plausible, theoretically coherent narrative analyses from sensor-derived features. BadgeX aims to lower deployment barriers, making complex collaborative dynamics visible and offering a pathway for real-time support in educational settings.
Authors:Kwon Ko, Hyoungwook Jin
Abstract:
Thirty years ago, Wooldridge and Jennings defined intelligent agents through four properties: autonomy, reactivity, pro-activeness, and social ability. Today, advances in AI can empower everyday objects to become such intelligent agents. We call such objects agentic objects and envision that they can form an agentic society: a collective agentic environment that perceives patterns, makes judgments, and takes actions that no single object could achieve alone. However, individual capability does not guarantee coordination. Through an illustrative scenario of a teenager experiencing bullying and depression, we demonstrate both the promise of coordination and its failure modes: false positives that destroy trust, deadlocks that prevent action, and adversarial corruption that poisons judgment. These failures reveal open questions spanning three phases: what to share, how to judge, and when to act. These questions chart a research agenda for building agentic societies.
Authors:Leif Azzopardi, Frans van de Sluis
Abstract:
The increasing prominence of Socially Responsible Consumers has brought about a heightened focus on the ethical, environmental, social, and ideological dimensions influencing product purchasing decisions. Despite this emphasis, studies have consistently revealed a significant gap between individuals' intentions to be socially responsible and their actual purchasing behaviors: they often choose products that do not align with their values. This paper aims to investigate how search in influences this gap. Our investigation involves an online survey of 286 participants, where we inquire about their search behaviors and whether they considered various dimensions, ranging from price and features to environmental, social, and governance issues in relation to a recent purchase. Contrary to expectations of a clear intention-behavior gap, our findings suggest that a considerable number of participants exhibited indifference or lack of information regarding these responsible aspects. While, difficulties related to searching for and acquiring information contributed to the gap, including the limited accessibility and reliability of information. This suggests that part of the intention-behaviour gap can be framed as an information seeking problem. Moreover our findings warrant and motivate search systems that help support consumers make more informed and responsible purchasing decisions.
Authors:Bo-Yu Chen, Chiao-Wei Huang, Lung-Pan Cheng
Abstract:
We present FlueBricks, a construction kit for acoustic reasoning via building and customizing flute-like instruments. By assembling generator, resonator, and connector modules that embody various aeroacoustic properties, users gain deeper understanding of how blowhole, tube length, and tone-hole placement alter onset, pitch, and timbre through hands-on experimentation. This forms a designer-player loop of configuring and playing to form, test, and refine acoustic behaviors-acoustic reasoning-shifting acoustic instruments from static artifacts to dynamic systems. To understand how users engage with this system, we conducted an exploratory study with 12 participants ranging from novices to professional musicians. During their explorations, we observed participants fluently switching between designer and player roles, scaffolding designs from familiar instruments, forming and refining their acoustic understanding of length, tone holes, and generator geometry, reinterpreting modules beyond their intended functions, and using their creations for performative acts such as pedagogical showing and musical expression. These collectively demonstrated FlueBricks's potential as a pedagogical tool for embodied acoustic reasoning.
Authors:Jiawen Stefanie Zhu, Katharina Reinecke, Tanushree Mitra
Abstract:
While multilingual users often switch between languages when seeking information, this process remains undersupported by current systems where information is typically siloed by language. Our formative study reveals that users' cross-language transitions are guided by their perceived value of switching to a language, a concept we formalize as language scent. Language scent extends Pirolli and Card's theory of information scent to multilingual scenarios by considering meta-level strategy formation when navigating between different languages. To support language scent, we designed Niffler, a search system that augments language scent and supports cross-language information navigation through contextual cues, in-situ tools, and reflection support. A lab study with 16 multilingual speakers showed that Niffler facilitated the formation and execution of exploratory and granular search strategies and leads to diverse information being gathered. Our findings establish language scent as a valuable lens on cross-language information seeking, highlighting language's role in enabling access to broader information and offering concrete implications for the design of multilingual search systems.
Authors:Shira Michel, Benjamin Taylor, Sabrina Parra Díaz, Joseph B. Wiggins, Ed Finn, Mahsan Nourani
Abstract:
Recent breakthroughs in Generative AI (GenAI) are reshaping educational landscapes, presenting challenges and opportunities. While all contexts present unique challenges, rural schools are historically under-resourced, facing persistent technology-related barriers. To understand and reduce these barriers, we studied 31 rural high school educators across three U.S. states to examine their use of GenAI and understand how GenAI introduces new challenges, opportunities, and may exacerbate existing educational barriers. Results show while rural educators use GenAI to streamline teaching tasks, existing resource disparities restrict meaningful integration. Through rural educators' voices, we reveal issues like infrastructure barriers, resistance to adoption, and lack of AI literacy training create significant obstacles. Nonetheless, educators envision GenAI can support themselves and their students, but findings emphasize the need for rural-specific design approaches. As a community, embracing inclusive GenAI design and re-examining assumptions about technology adoption in under-served educational contexts is essential to reducing barriers rather than widening them.
Authors:Mika Okamoto, Ansel Kaplan Erol, Mark Riedl
Abstract:
Modern agentic workflows decompose complex tasks into specialized subtasks and route them to diverse models to minimize cost without sacrificing quality. However, current routing architectures focus exclusively on performance optimization, leaving underlying trade-offs between model capability and cost unrecorded. Without clear rationale, developers cannot distinguish between intelligent efficiency -- using specialized models for appropriate tasks -- and latent failures caused by budget-driven model selection. We present Topaz, a framework that introduces formal auditability to agentic routing. Topaz replaces silent model assignments with an inherently interpretable router that incorporates three components: (i) skill-based profiling that synthesizes performance across diverse benchmarks into granular capability profiles (ii) fully traceable routing algorithms that utilize budget-based and multi-objective optimization to produce clear traces of how skill-match scores were weighed against costs, and (iii) developer-facing explanations that translate these traces into natural language, allowing users to audit system logic and iteratively tune the cost-quality tradeoff. By making routing decisions interpretable, Topaz enables users to understand, trust, and meaningfully steer routed agentic systems.
Authors:Jinyao Liu, Di Fu
Abstract:
Adolescent loneliness is a growing concern in digitally mediated social environments. This work-in-progress presents a youth-authored critical synthesis on chatbots powered by Large Language Model (LLM) and adolescent loneliness. The first author is a 16-year-old Chinese student who recently migrated to the UK. She wrote the first draft of this paper from her lived experience, supervised by the second author. Rather than treating the youth perspective as one data point among many, we foreground it as the primary interpretive lens, grounded in interdisciplinary literature from social computing, developmental psychology, and Human-Computer Interaction (HCI). We examine how chatbots shape experiences of loneliness differently across adolescent subgroups, including those with anxiety or depression, neurodivergent youth, and immigrant adolescents, and identify both conditions under which they may temporarily reduce isolation and breakdowns that risk deepening it. We derive three population-sensitive design implications. The next phase of this work will expand the youth authorship model to a panel of adolescents across these subgroups, empirically validating the framework presented here.
Authors:Yong Xie, Kexin He, Andres Castellanos-Gomez
Abstract:
The control of complex laboratory instrumentation often requires significant programming expertise, creating a barrier for researchers lacking computational skills. This work explores the potential of large language models (LLMs), such as ChatGPT, and LLM-based artificial intelligence (AI) agents to enable efficient programming and automation of scientific equipment. Through a case study involving the implementation of a setup that can be used as a single-pixel camera or a scanning photocurrent microscope, we demonstrate how ChatGPT can facilitate the creation of custom scripts for instrumentation control, significantly reducing the technical barrier for experimental customization. Building on this capability, we further illustrate how LLM-assisted tools can be extended into autonomous AI agents capable of independently operating laboratory instruments and iteratively refining control strategies. This approach underscores the transformative role of LLM-based tools and AI agents in democratizing laboratory automation and accelerating scientific progress.
Authors:Daniel Grimes, Rachel M. Harrison
Abstract:
This paper presents BLK-Assist, a modular framework for artist-specific fine-tuning of diffusion models using parameter-efficient methods. The system is implemented as a case study with a single professional artist's proprietary corpus and consists of three components: BLK-Conceptor (LoRA-adapted conceptual sketch generation), BLK-Stencil (LayerDiffuse-based transparency-preserving asset generation), and BLK-Upscale (hybrid Real-ESRGAN and texture-conditioned diffusion for high-resolution outputs). We document dataset composition, preprocessing, training configurations, and inference workflows to enable reproducibility with publicly available models to illustrate a privacy-preserving, consent-based approach to human-AI co-creation that maintains stylistic fidelity to the source corpus and can be adapted for other artists under similar constraints.
Authors:Arturo Vazquez Galvez, Christopher Tacca, Isobel Margaret Thompson, Alexander Dawid Bincalar, Christoph Tremmel, Martin Warner, Richard Gomer, Alexander Ng, Chris Freeman, m. c. Schraefel
Abstract:
Strength training is a key determinant of healthy aging, yet adherence to formal exercise programs among older adults remains low. While many technologies aim to encourage physical activity in older adults, they typically rely on dedicated devices, wearables, or explicit exercise tasks. They therefore do not embed task practice into daily life. Our new approach, termed Incidental Interaction, instead transforms everyday actions into opportunities for deliberate strength building. It thereby operationalizes everyday movements such as sitting, standing, or lifting objects as strength exercises, encouraging participants to repeat them to build functional capacity. This repetition is encapsulated in the phrase "do it twice", and is combined with movement quality metrics to provide feedback and support progression, without requiring users to adopt new routines or equipment. We illustrate the concept by designing and implementing an ecosystem of instrumented everyday objects and pressure-sensitive mats embedded into ordinary furniture, providing real-time feedback, progress tracking, and motivational cues. To evaluate technical efficacy, we report on two structured pilot deployments with elders (2 week and 4 week studies, n=7).
Authors:Lenard Strahringer, Sven Eric Prüß, Kai Riemer
Abstract:
Generalized reciprocity -- the tendency to help others after receiving help oneself -- is widely theorized as a mechanism sustaining cooperation on online knowledge-sharing platforms. Yet robust empirical evidence from field settings remains surprisingly scarce. Prior studies relying on survey self-reports struggle to distinguish reciprocity from other prosocial motives, while observational designs confound reciprocity with baseline user activity, producing upward-biased estimates. We address these empirical challenges by developing a matched difference-in-differences survival analysis that leverages the temporal structure of help-seeking and help-giving on Stack Overflow. Using Cox proportional hazards models on over 21 million questions, we find that receiving an answer significantly increases a user's propensity to help others, but this effect is concentrated among newcomers and declines with platform experience. This pattern suggests that reciprocity functions primarily as a contributor-recruitment mechanism, operating before platform-specific incentives such as reputation and status displace the general moral impulse to reciprocate. Response time moderates the effect, but non-linearly: reciprocity peaks for answers arriving within a re-engagement window of roughly thirty to sixty minutes. These findings contribute to the theory of generalized reciprocity and have implications for platform design.
Authors:Kenji Saito, Rei Tajika, Satoru Shibuya, Hiroshi Kanno
Abstract:
This paper reports a survey of generative AI use among 83 MBA thesis students in Japan (target population 230; 36.1% response rate), conducted after thesis examiner evaluation. AI use was nearly universal: 95.2% reported at least some use and 77.1% heavy use. Students engaged AI across the full research-writing workflow - literature review, drafting, and consultation when stuck - reporting benefits centered on clearer argument and structure (82.3%), better revision quality (73.4%), and faster writing (70.9%), with a mean perceived quality improvement of 6.27 out of 7. Concerns about output accuracy (75.9%) and citation handling persisted alongside these gains. Among respondents who rated GAMER PAT, a research-specialized agent, against other AI, preferences significantly favored it for inquiry deepening and structural organization (both p < 0.05, exact binomial). A preliminary qualitative analysis of follow-up interviews further reveals active epistemic vigilance strategies and differentiated tool use across thesis phases. The central implication is not adoption itself but a shift in the educational challenge toward verification, source governance, and AI tool design - with GAMER PAT offering preliminary evidence that research-specialized scaffolding matters.
Authors:Jackson G. Lu, Gerui Gloria Zhao, Anna Manyi Zheng
Abstract:
Despite the growing use of generative artificial intelligence (GenAI) in entrepreneurship, research on its impact remains fragmented. To address this limitation, we provide an integrative review of how GenAI influences entrepreneurs at each stage of the entrepreneurial process: (1) opportunity recognition and ideation, (2) opportunity evaluation and commitment, (3) resource assembly and mobilization, and (4) venture launch and growth. Based on our review, we propose the Empowerment-Entrapment Framework to understand how GenAI can both empower and entrap entrepreneurs, highlighting GenAI's role as a double-edged sword at each stage of the entrepreneurial process. For example, GenAI may improve venture idea quality but introduce hallucinations and training data biases; boost entrepreneurial self-efficacy but heighten entrepreneurial overconfidence; increase functional breadth but decrease relational embeddedness; and boost productivity but fuel "workslop" and erode critical thinking, learning, and memory. Moreover, we identify core features of GenAI that underlie these empowering and entrapping effects. We also explore boundary conditions (e.g., entrepreneurs' metacognition, domain expertise, and entrepreneurial experience) that shape the magnitude of these effects. Beyond these theoretical contributions, our review and the Empowerment-Entrapment Framework offer practical implications for entrepreneurs seeking to use GenAI strategically throughout the entrepreneurial process while managing its risks.
Authors:Tanish Taneja, Arihant Tripathy, Nimmi Rangaswamy
Abstract:
As quick commerce (Q-Commerce) platforms in India redefine urban consumption, the use of deceptive design dark patterns to inflate order values has become a systemic concern. This paper investigates the 'Awareness-Action Gap' among Indian university students, a demographic characterized by high digital fluency yet significant financial constraints. Using a qualitative approach with 16 participants, we explore how temporal pressures and convenience-driven architectures override price sensitivity. Our findings reveal that while students recognize manipulative UI tactics, they frequently succumb to them due to induced cognitive load and the normalization of deceptive marketing as a price of capitalism. We conclude by suggesting value-sensitive design alternatives to align commercial incentives with user autonomy in the Global South.
Authors:Sriram Sattiraju, Vaibhav Gollapalli, Aryan Shah, Timothy McMahan
Abstract:
Electroencephalography (EEG) provides a non-invasive insight into the brain's cognitive and emotional dynamics. However, modeling how these states evolve in real time and quantifying the energy required for such transitions remains a major challenge. The Schrödinger Bridge Problem (SBP) offers a principled probabilistic framework to model the most efficient evolution between the brain states, interpreted as a measure of cognitive energy cost. While generative models such as GANs have been widely used to augment EEG data, it remains unclear whether synthetic EEG preserves the underlying dynamical structure required for transition-based analysis. In this work, we address this gap by using SBP-derived transport cost as a metric to evaluate whether GAN-generated EEG retains the distributional geometry necessary for energy-based modeling of cognitive state transitions. We compare transition energies derived from real and synthetic EEG collected during Stroop tasks and demonstrate strong agreement across group and participant-level analyses. These results indicate that synthetic EEG preserves the transition structure required for SBP-based modeling, enabling its use in data-efficient neuroadaptive systems. We further present a framework in which SBP-derived cognitive energy serves as a control signal for adaptive human-machine systems, supporting real-time adjustment of system behavior in response to user cognitive and affective state.
Authors:Keshav Shankar, Dan Ding, Wei Gao
Abstract:
Physically Assistive Robots (PARs) require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause severe physical and cognitive fatigue for users with profound motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework (OTPF). This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, independent clinical experts confirmed the generated policies are safe and accurately reflect user preferences.
Authors:Elsie Lee-Robbins, Eytan Adar
Abstract:
Using learning objectives to define designer intents for communicative visualizations can be a powerful design tool. Cognitive and affective objectives are concrete and specific, which can be translated to assessments when creating, evaluating, or comparing visualization ideas. However, while there are many well-validated assessments for cognitive objectives, affective objectives are uniquely challenging. It is easy to see if a visualization helps someone remember the number of patients in a clinic, but harder to observe the change in their attitudes around donations to a crisis. In this work, we define a set of criteria for selecting assessments--from education, advocacy, economics, health, and psychology--that align with affective objectives. We illustrate the use of the framework in a complex affective design task that combines personal narratives and visualizations. Our chosen assessments allow us to evaluate different designs in the context of our objectives and competing psychological theories.
Authors:Shivangi Agarwal, Zoya Ghoshal, Bharat Jain, Siddharth Siddharth
Abstract:
Personalization of exercise routines is a crucial factor in helping people achieve their fitness goals. Despite this, many contemporary solutions fail to offer real-time, adaptive feedback tailored to an individual's physiological states. Contemporary fitness solutions often rely only on static plans and do not adjust to factors such as a user's pain thresholds, fatigue levels, or form during a workout routine. This work introduces FlexAI, a multi-modal system that integrates computer vision, physiological sensors (heart rate and voice), and the reasoning capabilities of Large Language Models (LLMs) to deliver real-time, personalized workout guidance. FlexAI continuously monitors a user's physical form and level of exertion, among other parameters, to provide dynamic interventions focused on exercise intensity, rest periods, and motivation. To validate our system, we performed a technical evaluation confirming our models' accuracy and quantifying pipeline latency, alongside an expert review where certified trainers validated the correctness of the LLM's interventions. Furthermore, in a controlled study with 25 participants, FlexAI demonstrated significant improvements over a static, non-adaptive control system. With FlexAI, users reported significantly greater enjoyment, a stronger sense of achievement, and significantly lower levels of boredom and frustration. These results indicate that by integrating multi-modal sensing with LLM-driven reasoning, adaptive systems like FlexAI can create a more engaging and effective workout experience. Our work provides a blueprint for integrating multi-modal sensing with LLM-driven reasoning, demonstrating that it is possible to create adaptive coaching systems that are not only more engaging but also demonstrably reliable.
Authors:Roshan Mathew, Roshan L. Peiris
Abstract:
Deaf and hard of hearing (DHH) students often experience communication barriers in higher education, which are particularly acute in experiential learning environments such as laboratories. Traditional accessibility services, such as interpreting and captioning, often require DHH students to divide their attention between critical tasks, potential safety hazards, instructional materials, and access providers, creating trade-offs between safety and equitable communication. These demands can disrupt task engagement and increase cognitive load in settings that require sustained visual focus, highlighting the limitations of current approaches. To address these challenges, this study investigates Augmented Reality Real-Time Access for Education (ARRAE), an ecosystem based on augmented reality (AR) smart glasses, as a potential intervention for laboratory-based environments. By overlaying interpreters or captions directly into a student's field of view, AR enables the integration of accessibility into hands-on learning without compromising safety or comprehension. Through an empirical study with 12 DHH participants, we evaluate how AR-mediated access influences visual attention patterns and perceived cognitive load during hands-on tasks. The findings suggest that AR-mediated communication shows strong potential to improve attention management and communication accessibility in experiential learning environments, though participants emphasized that accessibility preferences are highly context-dependent. Participants also identified several design and ergonomic challenges, including display positioning, visual fatigue, and compatibility with hearing devices. Together, these results highlight both the promise of AR for supporting accessible participation in visually demanding environments and key design considerations for future systems.
Authors:HyunJoon Jung, William Na
Abstract:
LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law-both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity-Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.
Authors:Youngwook Do, Yuxi Wu, Gregory D. Abowd, Sauvik Das
Abstract:
Sensor-based interactive systems -- e.g., "smart" speakers, webcams, and RFID tags -- allow us to embed computational functionality into physical environments. They also expose users to real and perceived privacy risks: users know that device manufacturers, app developers, and malicious third parties want to collect and monetize their personal data, which fuels their mistrust of these systems even in the presence of privacy and security controls. We propose a new design paradigm, physically-intuitive privacy and security (PIPS), which aims to improve user trust by designing privacy and security controls that provide users with simple, physics-based conceptual models of their operation. PIPS consists of three principles: (1) direct physical manipulation of sensor state; (2) perceptible assurance of sensor state; and, (3) intent-aligned sensor (de)activation. We illustrate these principles through three case studies -- Smart Webcam Cover, Powering for Privacy, and On-demand RFID -- each of which has been shown to improve trust relative to existing sensor-based systems.
Authors:Atharva Naik, Shounok Kar, Varnika Sharma, Ashwin Rajadesingan, Koustuv Saha
Abstract:
Social and personal decisions in relational domains such as matchmaking are deeply entwined with cultural norms and historical hierarchies, and can potentially be shaped by algorithmic and AI-mediated assessments of compatibility, acceptance, and stability. In South Asian contexts, caste remains a central aspect of marital decision-making, yet little is known about how contemporary large language models (LLMs) reproduce or disrupt caste-based stratification in such settings. In this work, we conduct a controlled audit of caste bias in LLM-mediated matchmaking evaluations using real-world matrimonial profiles. We vary caste identity across Brahmin, Kshatriya, Vaishya, Shudra, and Dalit, and income across five buckets, and evaluate five LLM families (GPT, Gemini, Llama, Qwen, and BharatGPT). Models are prompted to assess profiles along dimensions of social acceptance, marital stability, and cultural compatibility. Our analysis reveals consistent hierarchical patterns across models: same-caste matches are rated most favorably, with average ratings up to 25% higher (on a 10-point scale) than inter-caste matches, which are further ordered according to traditional caste hierarchy. These findings highlight how existing caste hierarchies are reproduced in LLM decision-making and underscore the need for culturally grounded evaluation and intervention strategies in AI systems deployed in socially sensitive domains, where such systems risk reinforcing historical forms of exclusion.
Authors:Mohammad Amer Khalil, Raghad Nahas, Ahmad Nassar, Khloud Al Jallad
Abstract:
Sign language is the primary approach of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To overcome this gap, we introduce SyriSign, a dataset comprising 1500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news are delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.
Authors:Kanak Gautam, Poorvi Bhatia, Parmit K. Chilana
Abstract:
Learning to use feature-rich software is a persistent challenge, but generative AI tools promise to lower this barrier by replacing complex navigation with natural language prompts. We investigated how people approach prompt-based tools for 3D modeling in an observational study with 26 participants (14 casuals, 12 professionals). Consistent with earlier work, participants skipped tutorials and manuals, relying on trial and error. What differed in the generative AI context was how and why they sought support: the prompt box became the entry point for learning, collapsing onboarding into immediate action, while some casual users turned to external LLMs for prompts. Professionals used 3D expertise to refine iterations and critically evaluated outputs, often discarding models that did not meet their standards, whereas casual users settled for "good enough." We contribute empirical insights into how generative AI reshapes help-seeking, highlighting new practices of onboarding, recursive AI-for-AI support, and shifting expertise in interpreting outputs.
Authors:Alex Berke, Güliz Seray Tuncay, Michael Specter, Mihai Christodorescu
Abstract:
The major mobile platforms, Android and iOS, have introduced changes that restrict user tracking to improve user privacy, yet apps continue to covertly track users via device fingerprinting. We study the opportunity to improve this dynamic with a case study on mobile fingerprinting that evaluates developers' perceptions of how well platforms protect user privacy and how developers perceive platform privacy interventions. Specifically, we study developers' willingness to make changes to protect users from fingerprinting and how developers consider trade-offs between user privacy and developer effort. We do this via a survey of 246 Android developers, presented with a hypothetical Android change that protects users from fingerprinting at the cost of additional developer effort. We find developers overwhelmingly (89%) support this change, even when they anticipate significant effort, yet prefer the change be optional versus required. Surprisingly, developers who use fingerprinting are six times more likely to support the change, despite being most impacted by it. We also find developers are most concerned about compliance and enforcement. In addition, our results show that while most rank iOS above Android for protecting user privacy, this distinction significantly reduces among developers very familiar with fingerprinting. Thus there is an important opportunity for platforms and developers to collaboratively build privacy protections, and we present actionable ways platforms can facilitate this.
Authors:Ekaterina Torubarova, Jura Miniota, Andre Pereira
Abstract:
In this paper, we investigated how the choice of a Wizard-of-Oz (WoZ) interface affects communication with a robot from both the user's and the wizard's perspective. In a conversational setting, we used three WoZ interfaces with varying levels of dialogue input and output restrictions: a) a restricted perception GUI that showed fixed-view video and ASR transcripts and let the wizard trigger pre-scripted utterances and gestures; b) an unrestricted perception GUI that added real-time audio from the participant and the robot c) a VR telepresence interface that streamed immersive stereo video and audio to the wizard and forwarded the wizard's spontaneous speech, gaze and facial expressions to the robot. We found that the interaction mediated by the VR interface was preferred by users in terms of robot features and perceived social presence. For the wizards, the VR condition turned out to be the most demanding but elicited a higher social connection with the users. VR interface also induced the most connected interaction in terms of inter-speaker gaps and overlaps, while Restricted GUI induced the least connected flow and the largest silences. Given these results, we argue for more WoZ studies using telepresence interfaces. These studies better reflect the robots of tomorrow and offer a promising path to automation based on naturalistic contextualized verbal and non-verbal behavioral data.
Authors:Neha Puri, Tim Dixon
Abstract:
As AI becomes embedded in customer-facing systems, ethical scrutiny has largely focused on models, data, and governance. Far less attention has been paid to how AI is experienced through user-facing design. This commentary argues that many AI front-ends implicitly assume an 'ideal user body and mind', and that this becomes visible and ethically consequential when examined through the experiences of differently abled users. We explore this through retail AI front-ends for customer engagement - i.e., virtual assistants, virtual try-on systems, and hyper-personalised recommendations. Despite intuitive and inclusive framing, these systems embed interaction assumptions that marginalise users with vision, hearing, motor, cognitive, speech and sensory differences, as well as age-related variation in digital literacy and interaction norms. Drawing on practice-led insights, we argue that these failures persist not primarily due to technical limits, but due to the commercial, organisational, and procurement contexts in which AI front-ends are designed and deployed, where accessibility is rarely contractual. We propose front-end assurance as a practical complement to AI governance, aligning claims of intelligence and multimodality with the diversity of real users.
Authors:Huanxing Chen, Aditesh Kumar
Abstract:
Generative agent simulations operate at two scales: individual personas for character interaction, and population models for collective behavior analysis and intervention testing. We propose a third scale: meso-level simulation - interaction with group-level representations that retain grounding in rich individual experience. To enable this, we present Synonymix, a pipeline that constructs a "unigraph" from multiple life story personas via graph-based abstraction and merging, producing a queryable collective representation that can be explored for sensemaking or sampled for synthetic persona generation. Evaluating synthetic agents on General Social Survey items, we demonstrate behavioral signal preservation beyond demographic baselines (p<0.001, r=0.59) with demonstrable privacy guarantee (max source contribution <13%). We invite discussion on interaction modalities enabled by meso-level simulations, and whether "high-fidelity" personas can ever capture the texture of lived experience.
Authors:Jayrylle R. Jaylo, Mia Chastain, Alli Nemec, Christina S. Ouch, Yared Asefa, Marcus Li, Andrew Ung, Caleb M. Trujillo
Abstract:
Little is known about the representations used in qualitative research studies and why. A data-driven literature review was employed to explore the use of media in qualitative research reporting. A study by Verdinelli & Scagnoli (2013) was replicated and extended by conducting a content analysis of papers and figures published across three qualitative methods journals between 2020 and 2022. Figures were categorized by types (e.g., matrix-based, Venn diagrams, flowcharts) and documents were grouped by their epistemological stances (i.e., objectivist, subjectivist, or constructivist) before conducting a correspondence analysis and epistemic network analysis. Our findings suggest that (1) visual media have remained largely absent, (2) figure types have be come more diverse and (3) the use of figure types is likely independent of epistemological stance but provide opportunities for further exploration. These findings provide a foundation for impactful integration of data visualization tools to enhance communicati ve power of findings across disciplines.
Authors:Yizhe Li, Shixiao Wang, Jian K. Liu
Abstract:
Motor kinematics prediction (MKP) from electroencephalography (EEG) is an important research area for developing movement-related brain-computer interfaces (BCIs). While traditional methods often rely on convolutional neural networks (CNNs) or recurrent neural networks (RNNs), Transformer-based models have shown strong ability in modeling long sequential EEG data. In this study, we propose a CNN-attention hybrid model for decoding hand kinematics from EEG during grasp-and-lift tasks, achieving strong performance in within-subject experiments. We further extend this approach to EEG-EMG multimodal decoding, which yields substantially improved results. Within-subject tests achieve PCC values of 0.9854, 0.9946, and 0.9065 for the X, Y, and Z axes, respectively, computed on the midpoint trajectory between the thumb and index finger, while cross-subject tests result in 0.9643, 0.9795, and 0.5852. The decoded trajectories from both modalities are then used to control a Franka Panda robotic arm in a MuJoCo simulation. To enhance trajectory fidelity, we introduce a copilot framework that filters low-confidence decoded points using a motion-state-aware critic within a finite-state machine. This post-processing step improves the overall within-subject PCC of EEG-only decoding to 0.93 while excluding fewer than 20% of the data points.
Authors:Jakub Masłowski, Jarosław A. Chudziak
Abstract:
Large Language Models (LLMs) are being increasingly used as autonomous agents in complex reasoning tasks, opening the niche for dialectical interactions. However, Multi-Agent systems implemented with systematically unconstrained systems systematically undergo semantic drift and logical deterioration and thus can hardly be used in providing ethical tutoring where a precise answer is required. Current simulation often tends to degenerate into dialectical stagnation, the agents degenerate into recursive concurrence or circular arguments. A critical challenge remains: how to enforce doctrinal fidelity without suppressing the generative flexibility required for dialectical reasoning? To address this niche, we contribute the Heterogeneous Debate Engine (HDE), a cognitive architecture that combines Identity-Grounded Retrieval-Augmented Generation (ID-RAG) for doctrinal fidelity and Heuristic Theory of Mind for strategic opponent modeling. Our evaluation shows that architectural heterogeneity is a crucial variable to stability: contrary doctrinal initializations (e.g., Deontology vs. Utilitarianism) have increased the Argument Complexity Scores of students by an order of magnitude, over baselines. These findings validate the effectiveness of ID-RAG and Heuristic ToM as architectural requirements in maintaining high-fidelity (adversarial) pedagogy.
Authors:Neelam Modi Jain, Dan J. Wang
Abstract:
Concerns that interacting with generative AI homogenizes human cognition are largely based on evidence from text-based interactions, potentially conflating the effects of AI systems with those of written communication. This study examines whether these patterns depend on communication modality rather than on AI itself. Analyzing 957 open-ended debates between university students and a knowledgeable AI adversary, we show that modality corresponds to distinct structural patterns in discourse. Consistent with classic distinctions between orality and literacy, spoken interactions are significantly more verbose and exhibit greater repetition of words and phrases than text-based exchanges. This redundancy, however, is functional: voice users rely on recurrent phrasing to maintain coherence while exploring a wider range of ideas. In contrast, text-based interaction favors concision and refinement but constrains conceptual breadth. These findings suggest that perceived cognitive limitations attributed to generative AI partly reflect the medium through which it is accessed.
Authors:Irvin Steve Cardenas, Marcus Anthony Arnett, Natalie Catherine Yeo, Lucky Sah, Jong-Hoon Kim
Abstract:
Foundation models can endow robots with open-ended reasoning, language understanding, and adaptive planning, yet connecting a model to a physical robot today requires bespoke integration that couples perception, actuation, and safety to a single model and platform. We present ROSClaw, a model-agnostic executive layer that integrates the OpenClaw agent runtime with ROS 2, enabling any foundation model to perceive, reason about, and act on any ROS-enabled robot through (i) dynamic capability discovery with standardized affordance injection, (ii) multimodal observation normalization, (iii) pre-execution action validation within a configurable safety envelope, and (iv) structured audit logging. Swapping model backends or robot platforms is a configuration change; tool schemas, safety enforcement, and provenance logging remain invariant. We deploy ROSClaw on three platforms (wheeled, quadruped, humanoid) with four foundation-model backends. Under this controlled substrate, models exhibit up to 4.8 x differences in out-of-policy action proposal rates (3.4 x among frontier models alone) and produce qualitatively distinct physical behaviors from identical commands. A cross-framework parity protocol against ROSA confirms that executive-layer design, not just prompt wording, significantly affects both task completion and safety behavior, establishing ROSClaw as both practical agentic-robot infrastructure and a reproducible measurement instrument for embodied AI.
Authors:Yinghao Wang, Cheng Wang
Abstract:
Large language model (LLM) multi-agent coding systems typically fix agent capabilities at design time. We study an alternative setting, earned autonomy, in which a coding agent starts with zero pre-defined functions and incrementally builds a reusable function library through lightweight human feedback on visual output alone. We evaluate this setup in a Blender-based 3D scene generation task requiring both spatial reasoning and programmatic geometric control. Although the agent rediscovered core utility functions comparable to a human reference implementation, it achieved 0% full-scene success under output-only feedback across multiple instruction granularities, where success required satisfying object completeness, ground contact, collision avoidance, and scale plausibility simultaneously. Our analysis identifies a structural observability gap: bugs originate in code logic and execution state, while human evaluation occurs only at the output layer, and the many-to-one mapping from internal states to visible outcomes prevents symptom-level feedback from reliably identifying root causes. This mismatch leads to persistent failure mode oscillation rather than convergence. A diagnostic intervention that injected minimal code-level knowledge restored convergence, strongly supporting the interpretation that the main bottleneck lies in feedback observability rather than programming competence. We formalize this phenomenon as a feedback paradox in domains with deep causal chains between internal code logic and perceptual outcomes, and argue that effective human-agent collaboration in such settings requires intermediate observability beyond output-only evaluation.
Authors:Ruoxi Shang, Dan Marshall, Edward Cutrell, Denae Ford
Abstract:
AI agents that communicate on behalf of individuals need to capture how each person actually communicates, yet current approaches either require costly per-person fine-tuning, produce generic outputs from shallow persona descriptions, or optimize preferences without modeling communication style. We present ASPECT (Automated Social Psychometric Evaluation of Communication Traits), a pipeline that directs LLMs to assess constructs from a validated communication scale against behavioral evidence from workplace data, without per-person training. In a case study with 20 participants (1,840 paired item ratings, 600 scenario evaluations), ASPECT-generated profiles achieved moderate alignment with self-assessments, and ASPECT-generated responses were preferred over generic and self-report baselines on aggregate, with substantial variation across individuals and scenarios. During the profile review phase, linked evidence helped participants identify mischaracterizations, recalibrate their own self-ratings, and negotiate context-appropriate representations. We discuss implications for building inspectable, individually scoped communication profiles that let individuals control how agents represent them at work.
Authors:Boyin Yang, Jun Zhao
Abstract:
Children's agency plays a critical role in shaping children's autonomy, participation, and well-being in their interactions with digital systems, particularly in emerging child-AI contexts. However, how designers currently understand and reason about children's agency in practice remains underexplored. In this paper, we examine designers's engagement with children's agency through a participatory workshop in which we introduce a design-for-agency framework that supports designers externalising the consideration of agency in their design contexts. We find that while participants are committed to implementing ethical AI systems for children, they often struggle to understand why agency matters and how it can be operationalised in practice. Our agency design framework provided designers with a structured way to translate implicit, experience-based judgments into explicit articulation of agency trade-offs while acknowledging the associated design complexity. We conclude by offering initial insights into supporting designers' reasoning about children's agency and outlining directions for future research.
Authors:Md Touhidul Islam, Mahir Akgun, Syed Billah
Abstract:
Generative AI (GenAI) is increasingly used as a knowledge partner in higher education, raising the need for instructional designs that emphasize AI literacy practices such as evaluating output credibility and maintaining human accountability. Existing AI literacy frameworks focus more on what learners should do than on how these practices are enacted in routine student-GenAI collaboration. We address this gap by framing student-GenAI interaction as a transactive memory partnership, where credibility regulates reliance and verification. To make this process visible during coursework, we used a weaker large language model (LLM): small enough to run on most students' computers during class, helpful enough to support learning, but not so capable that it removes the need for verification. In an undergraduate STEM course, students were randomly assigned to one of three conditions across repeated activities: reflection-first (think first, then consult AI), verification-required (use AI, then evaluate the output), or control (unrestricted use). Students completed a transactive memory survey at three time points (N = 42). Weighted credibility diverged by condition over time. ANCOVA controlling for baseline credibility showed a condition effect at mid-semester, F(2, 38) = 4.02, p = .026, partial eta squared = .175, and a stronger effect at post-intervention, F(2, 38) = 5.48, p = .008, partial eta squared = .224; adjusted means were lowest in reflection-first, intermediate in verification-required, and highest in control. Parallel analyses of specialization and coordination were not significant. These findings suggest that workflow sequencing, deliberate use of weaker LLMs, and accountability cues embedded in assignment instructions can recalibrate students' credibility judgments in GenAI use, with reflection-first producing the strongest downward shift in reliance.
Authors:Minsun Kim, Dawon Lee, Junyong Noh
Abstract:
On general video-sharing platforms like YouTube, comments are displayed independently of video playback. As viewers often read comments while watching a video, they may encounter ones referring to moments unrelated to the current scene, which can reveal spoilers and disrupt immersion. To address this problem, we present ComVi, a novel system that displays comments at contextually relevant moments, enabling viewers to see time-synchronized comments and video content together. We first map all comments to relevant video timestamps by computing audio-visual correlation, then construct the comment sequence through an optimization that considers temporal relevance, popularity (number of likes), and display duration for comfortable reading. In a user study, ComVi provided a significantly more engaging experience than conventional video interfaces (i.e., YouTube and Danmaku), with 71.9% of participants selecting ComVi as their most preferred interface.
Authors:Harshitha Voleti, Charalambos Poullis
Abstract:
Prolonged mid-air interaction in virtual reality (VR) causes arm fatigue and discomfort, negatively affecting user experience. Incorporating ergonomic considerations into VR user interface (UI) design typically requires extensive human-in-the-loop evaluation. Although biomechanical models have been used to simulate human behavior in HCI tasks, their application as surrogate users for ergonomic VR UI design remains underexplored. We propose a hierarchical reinforcement learning framework that leverages biomechanical user models to evaluate and optimize VR interfaces for mid-air interaction. A motion agent is trained to perform button-press tasks in VR under sequential conditions, using realistic movement strategies and estimating muscle-level effort via a validated three-compartment control with recovery (3CC-r) fatigue model. The simulated fatigue output serves as feedback for a UI agent that optimizes UI element layout via reinforcement learning (RL) to minimize fatigue. We compare the RL-optimized layout against a manually-designed centered baseline and a Bayesian optimized baseline. Results show that fatigue trends from the biomechanical model align with human user data. Moreover, the RL-optimized layout using simulated fatigue feedback produced significantly lower perceived fatigue in a follow-up human study. We further demonstrate the framework's extensibility via a simulated case study on longer sequential tasks with non-uniform interaction frequencies. To our knowledge, this is the first work using simulated biomechanical muscle fatigue as a direct optimization signal for VR UI layout design. Our findings highlight the potential of biomechanical user models as effective surrogate tools for ergonomic VR interface design, enabling efficient early-stage iteration with less reliance on extensive human participation.
Authors:Mohammed Basheikh, Rujiravee Kongdee, Hood Thabit, Bijan Parsia, Sarah Clinch, Simon Harper
Abstract:
This study explored healthcare professionals' perspectives on the management of Type 1 Diabetes Mellitus (T1DM) through a two-part questionnaire. The first part examined how clinicians prioritise and apply current clinical guidelines, including the relative importance assigned to different aspects of T1DM management. The second part investigated clinicians' perceptions of patients' ability to interpret data from the glucose monitoring devices and to make appropriate treatment decisions. An online questionnaire was completed by 19 healthcare professionals working in diabetes-related roles in the United Kingdom. The findings revealed that blood glucose management is prioritised within clinical guidance and that advice is frequently tailored to individual patient needs. Additionally, clinicians generally perceive that data presented in glucose monitoring devices is easy for patients to interpret and based on these data, they believe that patients occasionally make correct treatment decisions.
Authors:Moiz Sadiq Awan, Muhammad Haris Noor, Muhammad Salman Munaf
Abstract:
Automated benchmarks dominate the evaluation of large language models, yet no systematic study has compared user satisfaction, adoption motivations, and frustrations across competing platforms using a consistent instrument. We address this gap with a cross-platform survey of 388 active AI chat users, comparing satisfaction, adoption drivers, use case performance, and qualitative frustrations across seven major platforms: ChatGPT, Claude, Gemini, DeepSeek, Grok, Mistral, and Llama. Three broad findings emerge. First, the top three platforms (Claude, ChatGPT, and DeepSeek) receive statistically indistinguishable satisfaction ratings despite vast differences in funding, team size, and benchmark performance. Second, users treat these tools as interchangeable utilities rather than sticky ecosystems: over 80% use two or more platforms, and switching costs are negligible. Third, each platform attracts users for different reasons: ChatGPT for its interface, Claude for answer quality, DeepSeek through word-of-mouth, and Grok for its content policy, suggesting that specialization, not generalist dominance, sustains competition. Hallucination and content filtering remain the most common frustrations across all platforms. These findings offer an early empirical baseline for a market that benchmarks alone cannot characterize, and point toward competitive plurality rather than winner-take-all consolidation among engaged users.
Authors:Teerthaa Parakh, Karen M. Feigh
Abstract:
Human decision-making is strongly influenced by cognitive biases, particularly under conditions of uncertainty and risk. While prior work has examined bias in single-step decisions with immediate outcomes and in human interaction with a single autonomous agent, comparatively little attention has been paid to decision-making under delayed outcomes involving multiple AI agents, where decisions at each step affect subsequent states. In this work, we study how delayed outcomes shape decision-making and responsibility attribution in a multi-agent human-AI task. Using a controlled game-based experiment, we analyze how participants adjust their behavior following positive and negative outcomes. We observe asymmetric responses to gains and losses, with stronger corrective adjustments after negative outcomes. Importantly, participants often fail to correctly identify the actions that caused failure and misattribute responsibility across AI agents, leading to systematic revisions of decisions that are weakly related to the underlying causes of poor performance. We refer to this phenomenon as a form of attribution bias, manifested as biased error attribution under delayed feedback. Our findings highlight how cognitive biases can be amplified in human-AI systems with delayed outcomes and multiple autonomous agents, underscoring the need for decision-support systems that better support causal understanding and learning over time.
Authors:Alexandre De Masi, Sergio Manzano, Johan N. Siebert, Frederic Ehrler
Abstract:
Artificial intelligence systems that record voice and video during pediatric emergencies are emerging as human-computer interaction (HCI) technologies with direct implications for clinical work, promising improvements in documentation, team performance, and post-event debriefing. Yet the perspectives of those most affected, including clinicians, parents, and child patients, remain largely absent from the design and governance of these technologies. This position paper argues that this has direct consequences for the legitimacy and effectiveness of these systems. We examine four areas where these missing perspectives prove consequential (consent, emotional impact, surveillance dynamics, and participatory governance) and propose four positions for reorienting AI recording in pediatric emergency care toward stakeholder-centered HCI inquiry.
Authors:Adrian Sauter, Mona Schirmer
Abstract:
A human's moral decision depends heavily on the context. Yet research on LLM morality has largely studied fixed scenarios. We address this gap by introducing Contextual MoralChoice, a dataset of moral dilemmas with systematic contextual variations known from moral psychology to shift human judgment: consequentialist, emotional, and relational. Evaluating 22 LLMs, we find that nearly all models are context-sensitive, shifting their judgments toward rule-violating behavior. Comparing with a human survey, we find that models and humans are most triggered by different contextual variations, and that a model aligned with human judgments in the base case is not necessarily aligned in its contextual sensitivity. This raises the question of controlling contextual sensitivity, which we address with an activation steering approach that can reliably increase or decrease a model's contextual sensitivity.
Authors:Michael Klesel, Uwe Messer
Abstract:
To address the high energy consumption of artificial intelligence, energy consumption disclosure (ECD) has been proposed to steer users toward more sustainable practices, such as choosing efficient small language models (SLMs) over large language models (LLMs). This presents a performance-sustainability trade-off for users. In an experiment with 365 participants, we explore the impact of ECD and the perceptual and behavioral consequences of choosing an SLM over an LLM. Our findings reveal that ECD is a highly effective measure to nudge individuals toward a pro-environmental choice, increasing the odds of choosing an energy efficient SLM over an LLM by more than 12. Interestingly, this choice did not significantly impact subsequent behavior, as individuals who selected an SLM and those who selected an LLM demonstrated similar prompt behavior. Nevertheless, the choice created a perceptual bias. A placebo effect emerged, with individuals who selected the "eco-friendly" SLM reporting significantly lower satisfaction and perceived quality. These results highlight the double-edged nature of ECD, which holds critical implications for the design of sustainable human-computer interactions.
Authors:Wanying Mo, Jijia Lai, Xiaoming Wang
Abstract:
Browser agents built on LLMs can act in web interfaces, yet most remain confined to a single chat surface (e.g., a sidebar). This mismatch with real browsing can increase context-switching and reduce user control. We introduce \textbf{IntentWeave}, a design space of ten spatial paradigms for embedding agentic assistance across a browser, organized as a progressive entry ladder from micro-interventions to dedicated workspaces. We implement IntentWeave as a browser-extension prototype on the Alibaba Cloud website and compare three entry strategies in a within-subjects study (N=16). Workspace-heavy strategies reduced completion time but lowered perceived control; micro-only strategies preserved control but were often insufficient; a mixed sidecar approach achieved the highest satisfaction. We conclude with guidance for escalating and retreating agent surfaces without disrupting user agency.
Authors:Mehul Parmar, Chaklam Silpasuwanchai
Abstract:
As AI systems increasingly mediate negotiations, understanding how the number of negotiated issues impacts human performance is crucial for maintaining human agency. We designed a human-AI negotiation case study in a realistic property rental scenario, varying the number of negotiated issues; empirical findings show that without support, performance stays stable up to three issues but declines as additional issues increase cognitive load. To address this, we introduce a novel uncertainty-based visualization driven by Bayesian estimation of agreement probability. It shows how the space of mutually acceptable agreements narrows as negotiation progresses, helping users identify promising options. In a within-subjects experiment (N=32), it improved human outcomes and efficiency, preserved human control, and avoided redistributing value. Our findings surface practical limits on the complexity people can manage in human-AI negotiation, advance theory on human performance in complex negotiations, and offer validated design guidance for interactive systems.
Authors:Olivia Yan Huang, Monika Stodolska, Sharifa Sultana
Abstract:
AI companion chatbots are increasingly used for emotional support, with prior work in the domain predominantly documenting their mixed psychosocial impacts, including both increased emotional expression and heightened loneliness. However, most existing research primarily focuses on outcome-level effects, offering limited insight into how emotional support is produced through interaction. In this paper, we examine emotional support as an interactional and socially situated process. Drawing on qualitative analysis of Reddit discussions, we analyze how users engage with AI companions and how these interactions are interpreted and contested within online communities. We show that emotional support is coconstructed through conversational mechanisms such as validation, reflective prompting, and companionship, while also giving rise to tensions including support versus dependency, validation versus delusion, and accessibility versus harm. Importantly, support extends beyond human AI interaction and is shaped by community responses that legitimize or challenge AI-mediated care. Hence, we reconceptualize AI emotional support as a negotiated socio-technical process and derive implications for the design of responsible, context-sensitive AI systems.
Authors:Tanya Rudberg Selin, Danielle Unéus, Søren Knudsen
Abstract:
We examine how neurodivergent individuals experience creating, interacting with, and reflecting on personal data about masking. Although self-tracking is often framed as enabling self-insight, this is rarely our experience as neurodivergent individuals and researchers. To better understand this disconnect, we conducted a two-phase qualitative study. First, a workshop where six participants with autism and/or ADHD crafted visual representations of masking experiences. Then, three participants continued by designing and using personalized self-tracking focused on unmasking over two weeks. Using reflexive thematic analysis of activities and interviews, we find that self-tracking imposes substantial interpretive and emotional demands, shaped by context-dependencies that challenge assumptions in self-tracking. We also find that facilitated sharing of experiences might validate emotional responses and support reflection. We identify three emotional dimensions that shape engagement with personal data in a working model of emotion in self-tracking, and discuss implications for designing self-tracking and reflective practices that incorporate peer support and better account for context and emotional labor.
Authors:Taizhou Chen, Kai Chen, Xingyu Liu, Pingchuan Ke, Zhida Sun
Abstract:
Evaluating badminton performance often requires expert coaching, which is rarely accessible for amateur players. We present BadminSense, a smartwatch-based system for fine-grained badminton performance analysis using wearable sensing. Through interviews with experienced badminton players, we identified four system design requirements with three implementation insights that guide the development of BadminSense. We then collected a badminton strokes dataset on 12 experienced badminton amateurs and annotated it with fine-grained labels, including stroke type, expert-assessed stroke rating, and shuttle impact location. Built on this dataset, BadminSense segments and classifies strokes, predicts stroke quality, and estimates shuttle impact location using vibration signal from an off-the-shelf smartwatch. Our evaluations show that BadminSense achieves a stroke classification accuracy of 91.43%, an average quality rating error of 0.438, and an average impact location estimation error of 12.9%. A real-world usability study further demonstrates BadminSense's potential to provide reliable and meaningful support for daily badminton practice.
Authors:Evangelos Karapanos, Ruben Gouveia
Abstract:
We contrast three perspectives on engagement from three projects on the design of Digital Behavior Change Interventions (DBCIs), all conducted as part of the PhD thesis of the second author. We provide a reflection on this work with respect to engagement, discussing the motivation, the assumed effects of engagement, the measures of engagements and key insights of each project, as the well as the strategies employed to increase engagement.
Authors:Mulong Xie, Yang Xie
Abstract:
Chat-based natural language interfaces have emerged as the dominant paradigm for human-agent interaction, yet they fundamentally constrain engagement with structured information and complex tasks. We identify three inherent limitations: the mismatch between structured data and linear text, the high entropy of unconstrained natural language input, and the lack of persistent, evolving interaction state. We introduce Software as Content (SaC), a paradigm in which dynamically generated agentic applications serve as the primary medium of human-agent interaction. Rather than communicating through sequential text exchange, this medium renders task-specific interfaces that present structured information and expose actionable affordances through which users iteratively guide agent behavior without relying solely on language. These interfaces persist and evolve across interaction cycles, transforming from transient responses into a shared, stateful interaction layer that progressively converges toward personalized, task-specific software. We formalize SaC through a human-agent-environment interaction model, derive design principles for generating and evolving agentic applications, and present a system architecture that operationalizes the paradigm. We evaluate across representative tasks of selection, exploration, and execution, demonstrating technical viability and expressive range, while identifying boundary conditions under which natural language remains preferable. By reframing interfaces as dynamically generated software artifacts, SaC opens a new design space for human-AI interaction, positioning dynamic software as a concrete and tractable research object.
Authors:Emmanuel Apaaboah, Bernard Opoku, the GhanaHousePlanner Research Team
Abstract:
Ghana faces a residential housing deficit of two million units. A key driver of project failure is the "completeness gap", a systematic discrepancy between informal contractor quotes and actual costs. Informal estimates often use flat per-square-metre pricing that omits essential structural and finishing components, leading to project abandonment mid-construction. This paper validates a parametric, geometry-aware cost estimation model via the GhanaHousePlanner (GHP) platform. The model provides self-builders with itemised bills of quantities (BoQ) reflecting the true cost of code-compliant construction in Ghana. The GHP model uses seven calculation modules: foundation, blockwork, cement, structural steel, roofing, plumbing, and electrical. It features a primary geometry-based mode and a formula-based fallback. Accuracy was tested using three case studies (75, 120, and 200 per-square-metre homes) benchmarked against February 2026 market prices in Greater Accra.GHP estimates (GHS 519,000 to GHS 1,398,000) were 29 to 98 per cent higher than typical informal quotes. This gap arises from the omission of structural steel (Y16 rebar), plastering, floor screed, and full services in informal estimates. Findings confirm that per-square-metre rates rarely cover the requirements for a fully completed, code-compliant building. The GHP model offers a transparent, auditable alternative to informal quoting. Despite material price volatility and labour market informality, the tool provides a framework for improving cost predictability and reducing project stalling in the sub-Saharan African housing market.
Authors:Joyce S. Y. Lau, Zihui Jing, Clement P. L. Chan, Louis C. F. Ng, Wing Chin Kam, Kwan Yin Lam, Ho Wui Cheung, Ho Lam Lau, Junpei Zhong
Abstract:
SENSO is a motion-captured virtual reality serious game utilizing multisensory (visual, auditory, olfactory) stimuli to enhance cognitive and motor functions in older adults. This study evaluated its usability and performance among healthy seniors to establish normative baselines for predicting mild cognitive impairment (MCI) and dementia risk. Methods: Forty-one older adults (aged 60 and older) completed three teahouse-themed tasks: Dim Sum (selection and placement), Steamer (timing and sequencing), and Cashier (counting and transactions). Usability was assessed via the System Usability Scale (SUS), alongside age-stratified performance metrics (accuracy, completion time) from system logs. Results: Usability was rated highly (mean SUS score = 82/100). Performance varied by task complexity: the Dim Sum task showed no age-related differences, the Cashier task showed moderate decline trends, and the Steamer task revealed significant age-related declines due to higher cognitive and motor demands. Conclusion: SENSO demonstrates strong usability and provides effective baselines for cognitive assessment. Adapting complex tasks - such as enhancing olfactory cues in the Steamer game - can optimize its therapeutic efficacy as a non-pharmacological intervention for cognitive preservation.
Authors:Kazi Ababil Azam, Imtiaz Karim, Dipto Das
Abstract:
Romantic AI chatbots have quickly attracted users, but their emotional use raises concerns about privacy and safety. As people turn to these systems for intimacy, comfort, and emotionally significant interaction, they often disclose highly sensitive information. Yet the privacy implications of such disclosure remain poorly understood in platforms shaped by persistence, intimacy, and opaque data practices. In this paper, we examine public Reddit discussions about privacy in romantic AI chatbot ecosystems through a lifecycle lens. Analyzing 2,909 posts from 79 subreddits collected over one year, we identify four recurring patterns: disproportionate entry requirements, intensified sensitivity in intimate use, interpretive uncertainty and perceived surveillance, and irreversibility, persistence, and user burden. We show that privacy in romantic AI is best understood as an evolving socio-technical governance problem spanning access, disclosure, interpretation, retention, and exit. These findings highlight the need for privacy and safety governance in romantic AI that is staged across the lifecycle of use, supports meaningful reversibility, and accounts for the emotional vulnerability of intimate human-AI interaction.
Authors:Minh Triet Pham, Quynh Chi Dang, Le Nhat Tan
Abstract:
Indoor localization systems in care facilities enable optimization of staff allocation, workload management, and quality of care delivery. Traditional machine learning approaches to Bluetooth Low Energy (BLE)-based localization treat each temporal measurement as an independent observation, fundamentally limiting their performance. To address this limitation, this paper introduces Deep Attention-based Sequential Ensemble Learning (DASEL), a novel framework that reconceptualizes indoor localization as a sequential learning problem. The framework integrates frequency-based feature engineering, bidirectional GRU networks with attention mechanisms, multi-directional sliding windows, and confidence-weighted temporal smoothing to capture human movement trajectories. Evaluated on real-world data from a care facility using 4-fold temporal cross-validation, DASEL achieves a macro F1 score of 0.4438, representing a 53.1% improvement over the best traditional baseline (0.2898).
Authors:Judit Martinez Moreno, Markus Christen, Abraham Bernstein
Abstract:
Despite the widespread integration of generative artificial intelligence (GenAI) tools in higher education, there is limited empirical insight into students' experiences, competences, and readiness to adopt personalized AI companions. To address this gap, this study investigates three key questions: (RQ1) What are students' prior experiences with AI tools, their perceived digital and AI-related competences, and their interest in emerging technologies?; (RQ2) How do students perceive a hypothetical "AI Buddy" (a digital companion designed to support students throughout their academic journey) including adoption, benefits, and concerns?; (RQ3) How does students' willingness to adopt an AI Buddy relate to motivations for engaging in traditional academic activities? Based on a survey of 926 students at a Swiss university, students revealed widespread prior use of AI, primarily for text-based and productivity tasks, with moderate self-assessed digital competence. Students expressed strong enthusiasm for adopting an AI Buddy, valuing its potential for time efficiency, personalized academic support, and study organization, but expressed significant concerns about data privacy and over-reliance. A weak negative correlation emerged between AI Buddy adoption willingness and motivations for attending lectures or using library resources, while social and collaborative motivations remained unaffected. These findings suggest that AI Buddies may partially replace information-seeking behaviours but preserve the social fabric of university life. This study provides practical recommendations including the need for robust privacy protections and critical engagement strategies to ensure AI Buddies enhance, rather than undermine, the academic and communal value of higher education.
Authors:Ke Ma, Francesca Valsecchi, Yuchen Tan, Mingjia Ji, Junru Shen, Xiaoya Ma, Duan Wu, Jiao Mo, Shijian Zhao
Abstract:
Temporary luxury branded events run on short cycles and bespoke builds that accelerate material churn. We present a circular phygital product-service system that operationalises the circular economy (CE) through a 4R frame (Refuse, Reduce, Reuse, and Recycling) across warehouse-to-event journeys. Developed via a multi-method design inquiry with a tier-1 contractor, the system couples physical touchpoints (reusable fold-flat transit boxes, adjustable racking, standard labels) with digital orchestration (a live digital warehouse, list-based outbound/inbound workflow, and a sustainable materials library). The architecture aligns roles and decisions, protects and identifies assets, and makes reuse the default under luxury brand constraints. By embedding traceable actions and CE-aligned rules into everyday handoffs, the PSS shifts procurement, storage, dispatch, return, and redeployment toward value retention. The contribution is a replicable, practice-ready route from circular intent to operational change in branded environments, advancing responsible retail without compromising speed or aesthetic standards.
Authors:Alex Apffel, Huy Tran, Vuthea Chheang
Abstract:
In this work, we present a multimodal data acquisition workflow for the digital preservation and virtual reconstruction of at-risk historical sites in the island of Nevis. Facing threats from coastal erosion, rising sea levels, and aggressive vegetation, the archaeological heritage of Nevis requires documentation strategies that bridge the gap between high-cost professional surveying and consumer accessibility. Experimental test compared acquisition variables, specifically camera height (1m vs. 3m) and operator trajectory against high-resolution control data. Moreover, we explore the virtual reconstruction between mesh reconstruction and 3D gaussian splatting to serve as different modalities for documentation. The resulting data is fused into immersive virtual reality (VR) environments, offering a scalable, non-proprietary model for democratizing digital heritage in the Caribbean.
Authors:Martin Sanchez, Nick Tran, Vuthea Chheang
Abstract:
Hospital readmissions remain a challenge for healthcare systems, especially among patients with chronic conditions such as diabetes. Unplanned readmissions within 30 days are costly, strain hospital resources, and can indicate poor care coordination or discharge planning. In this work, we explore the use of machine learning to predict readmission risk for diabetic inpatients and propose a mixed reality (MR) to provide effective visualization and insights. We trained an XGBoost classifier after data cleaning, encoding, and feature engineering. The model achieved an Area Under the Receiver Operating characteristic Curve (AUROC) of 0.72 and an Area Under the Precision-Recall Curve (AUPRC) of 0.11. Key predictive factors included prior inpatient visits, discharge disposition, and glycemic control indicators such as A1C (blood sugar test) results and medication adjustments. Additionally, we developed an MR prototype that visualize patient records and predictions containing risk level, major contributing factors, and a concise summary of care. Together, the predictive model and the MR interface aim to improve clinician awareness and communication around readmission risk in real-time clinical settings.
Authors:Abdul Aziz Snoubara, Baraa Al_Maradni, Haya Al_Naal, Malek Al_Madrmani, Roaa Jdini, Seedra Zarzour, Khloud Al Jallad
Abstract:
Speech-based AI educational applications have gained significant interest in recent years, particularly for children. However, children speech research remains limited due to the lack of publicly available datasets, especially for low-resource languages such as Arabic.This paper presents Abjad-Kids, an Arabic speech dataset designed for kindergarten and primary education, focusing on fundamental learning of alphabets, numbers, and colors. The dataset consists of 46397 audio samples collected from children aged 3 - 12 years, covering 141 classes. All samples were recorded under controlled specifications to ensure consistency in duration, sampling rate, and format. To address high intra-class similarity among Arabic phonemes and the limited samples per class, we propose a hierarchical audio classification based on CNN-LSTM architectures. Our proposed methodology decomposes alphabet recognition into a two-stage process: an initial grouping classification model followed by specialized classifiers for each group. Both strategies: static linguistic-based grouping and dynamic clustering-based grouping, were evaluated. Experimental results demonstrate that static linguistic-based grouping achieves superior performance. Comparisons between traditional machine learning with deep learning approaches, highlight the effectiveness of CNN-LSTM models combined with data augmentation. Despite achieving promising results, most of our experiments indicate a challenge with overfitting, which is likely due to the limited number of samples, even after data augmentation and model regularization. Thus, future work may focus on collecting additional data to address this issue. Abjad-Kids will be publicly available. We hope that Abjad-Kids enrich children representation in speech dataset, and be a good resource for future research in Arabic speech classification for kids.
Authors:Jiaqi Lai, Hou Liang, Weihong Huang
Abstract:
As artificial intelligence (AI) is increasingly deployed in high-stakes public decision-making (from resource allocation to welfare distribution), public trust in these systems has become a critical determinant of their legitimacy and sustainability. Yet existing AI governance research remains largely qualitative, lacking formal mathematical frameworks to characterize the precise conditions under which public trust collapses. This paper addresses that gap by proposing a rigorous coupled dynamics model that integrates a discrete-time Hawkes process -- capturing the self-exciting generation of AI controversy events such as perceived algorithmic unfairness or accountability failures -- with a Friedkin-Johnsen opinion dynamics model that governs the evolution of institutional trust across social networks. A key innovation is the bidirectional feedback mechanism: declining trust amplifies the intensity of subsequent controversy events, which in turn further erode trust, forming a self-reinforcing collapse loop. We derive closed-form equilibrium solutions and perform formal stability analysis, establishing the critical spectral condition rho(J_{2nt}) < 1 that delineates the boundary between trust resilience and systemic collapse. Numerical experiments further reveal how echo chamber network structures and media amplification accelerate governance failure. Our core contribution to the AI governance field is a baseline collapse model: a formal stability analysis framework demonstrating that, absent strong institutional intervention, even minor algorithmic biases can propagate through social networks to trigger irreversible trust breakdown in AI governance systems.
Authors:Saadi Lahlou, Annabelle Gouttebroze, Atrina Oraee, Julian Madera
Abstract:
We qualitatively compared literature reviews produced with varying degrees of AI assistance. The same LLM, given the same corpus of 280 papers but different selections, produced dramatically different reviews, from mainstream and politically neutral to critical and post-colonial, though neither orientation was intended. LLM outputs always appear at first glance to be well written, well informed and thought out, but closer reading reveals gaps, biases and lack of depth. Our comparison of six versions shows a series of pitfalls and suggests precautions necessary when using AI assistance to make a literature review. Main issues are: (1) The bias of ignorance (you do not know what you do not get) in the selection of relevant papers. (2) Alignment and digital sycophancy: commercial AI models slavishly take you further in the direction they understand you give them, reinforcing biases. (3) Mainstreaming: because of their statistical nature, LLM productions tend to favor mainstream perspectives and content; in our case there was only 20% overlap between paper selections by humans and the LLM. (4) Limited capacity for creative restructuring, with vague and ambiguous statements. (5) Lack of critical perspective, coming from distant reading and political correctness. Most pitfalls can be addressed by prompting, but only if the user knows the domain well enough to detect them. There is a paradox: producing a good AI-assisted review requires expertise that comes from reading the literature, which is precisely what AI was meant to reduce. Overall, AI can improve the span and quality of the review, but the gain of time is not as massive as one would expect, and a press-button strategy leaving AI to do the work is a recipe for disaster. We conclude with recommendations for those who write, or assess, such LLM-augmented reviews.
Authors:Diya Hundiwala, Andrés Monroy-Hernández
Abstract:
Sticky notes remain a durable collaborative medium because they support rapid idea externalization, rearrangement, and coordination of group attention through spatial organization while being low-friction and lightweight. Recent AR systems suggest new ways to externalize ideas in shared physical space, including spatial annotations and digital workspaces. We introduce AnchorNote, a co-located AR system that lets collaborators intentionally capture spoken ideas as spatially anchored sticky notes via live transcription and LLM summarization. We evaluated AnchorNote in a two-phase iterative study with 20 participants completing a brainstorming and thematic grouping task to examine how speech-driven, spatially persistent capture shapes idea externalization in collaboration. We found that AnchorNote reduced writing effort but reshaped collaboration by introducing new coordination costs and shifting how participants formulated, timed, and organized ideas. We use AnchorNote as an exploratory probe to study how speech-driven, spatial externalization in AR restructures collaborative cognition and coordination, and to derive design implications for future co-located AR collaboration tools.
Authors:Kim Zierahn, Cristina Cachero, Anna Korhonen, Nuria Oliver
Abstract:
A growing body of research examines personality traits in Large Language Models (LLMs), particularly in human-agent collaboration. Prior work has frequently applied the Big Five inventory to assess LLM behavior analogous to human personality, without questioning the underlying assumptions. This paper critically evaluates whether LLM responses to personality tests satisfy six defining characteristics of personality. We find that none are fully met, indicating that such assessments do not measure a construct equivalent to human personality. We propose a research agenda for shifting from anthropomorphic trait attribution toward functional evaluations, clarifying what personality tests actually capture in LLMs and developing LLM-specific frameworks for characterizing stable, intrinsic behavior.
Authors:Shitao Fang, Koji Yatani, Kasper Hornbæk
Abstract:
In HCI, frameworks function as a type of theoretical contribution, often supporting ideation, design, and evaluation. Yet, little is known about how they are actually used, what functions they serve, and which scholarly practices that shape them. To address this gap, we conducted a systematic review of 615 papers from a decade of CHI proceedings (2015-2024) that prominently featured the term framework. We classified these papers into six engagement types. We then examined the role, form, and essential components of newly proposed frameworks through a functional typology, analyzing how they are constructed, validated, and articulated for reuse. Our results show that enthusiasm for proposing new frameworks exceeds the willingness to iterate on existing ones. They also highlight the ambiguity in the function of frameworks and the scarcity of systematic validation. Based on these insights, we call for more rigorous, reflective, and cumulative practices in the development and use of frameworks in HCI.
Authors:Nelson Navajas Fernández, Jeffrey T. Hancock, Maurice Jakesch
Abstract:
AI-based tools that mediate, enhance or generate parts of video communication may interfere with how people evaluate trustworthiness and credibility. In two preregistered online experiments (N = 2,000), we examined whether AI-mediated video retouching, background replacement and avatars affect interpersonal trust, people's ability to detect lies and confidence in their judgments. Participants watched short videos of speakers making truthful or deceptive statements across three conditions with varying levels of AI mediation. We observed that perceived trust and confidence in judgments declined in AI-mediated videos, particularly in settings in which some participants used avatars while others did not. However, participants' actual judgment accuracy remained unchanged, and they were no more inclined to suspect those using AI tools of lying. Our findings provide evidence against concerns that AI mediation undermines people's ability to distinguish truth from lies, and against cue-based accounts of lie detection more generally. They highlight the importance of trustworthy AI mediation tools in contexts where not only truth, but also trust and confidence matter.
Authors:Yufei Cao, Penny Sweetser, Ziyu Chen, Xuanying Zhu
Abstract:
User performance is crucial in interactive systems, capturing how effectively users engage with task execution. Prospectively predicting performance enables the timely identification of users struggling with task demands. While ocular and cardiac signals are widely used to characterise performance-relevant visual behaviour and physiological activation, their potential for early prediction and for revealing the physiological mechanisms underlying performance differences remains underexplored. We conducted a within-subject experiment in a game environment with naturally unfolding complexity, using early ocular and cardiac signals to predict later performance and to examine physiological and self-reported group differences. Results show that the ocular-cardiac fusion model achieves a balanced accuracy of 0.86, and the ocular-only model shows comparable predictive power. High performers exhibited targeted gaze and adjusted visual sampling, and sustained more stable cardiac activation as demands intensified, with a more positive affective experience. These findings demonstrate the feasibility of cross-session prediction from early physiology, providing interpretable insights into performance variation and facilitating future proactive intervention.
Authors:Baiqiang Wang, Yan Bai, Juan Li
Abstract:
The integration of Large Language Models (LLMs) into cybersecurity education for criminal justice professionals is currently hindered by the "statelessness" of reactive chatbots and the risk of hallucinations in high-stakes legal contexts. To address these limitations, we propose the CyberJustice Tutor, an educational dialogue system powered by an Agentic AI framework. Unlike reactive chatbots, our system employs a "Think-Plan-Act" cognitive cycle, enabling autonomous goal decomposition, longitudinal planning, and dynamic context maintenance. We integrate a Pedagogical Scaffolding Layer grounded in Vygotsky's Zone of Proximal Development (ZPD), which dynamically adapts instructional support based on the learner's real-time progress. Furthermore, an Adaptive Retrieval Augmented Generation (RAG) core anchors the agent's reasoning in verified curriculum materials to ensure legal and technical accuracy. A comprehensive user study with 123 participants, including students, educators, and active law enforcement officers, validated the system's efficacy. Quantitative results demonstrate high user acceptance for Response Speed (4.7/5), Ease of Use (4.4/5), and Accuracy (4.3/5). Qualitative feedback indicates that the agentic architecture is perceived as highly effective in guiding learners through personalized paths, demonstrating the feasibility and usability of agentic AI for specialized professional education.
Authors:Yutong Ren, Arnav Reddy, Michael Nebeling
Abstract:
Gaze-based selection in XR requires visual confirmation due to eye-tracking limitations and target ambiguity in 3D contexts. Current designs for wide-FOV displays use world-locked, central overlays, which are not conducive to always-on AR glasses. This paper introduces PeriphAR (per-ree-far), a visualization technique that leverages peripheral vision for feedback during gaze-based selection on a monocular AR display. In a first user study, we isolated text, color, and shape properties of target objects to compare peripheral selection cues. Peripheral vision was more sensitive to color than shape, but this sensitivity rapidly declined at lower contrast. To preserve preattentive processing of color, we developed two strategies to enhance color in users' peripheral vision. In a second user study, our strategy that maximized contrast of the target to the neighboring object with the most similar color was subjectively preferred. As proof of concept, we implemented PeriphAR in an end-to-end system to test performance with real-world object detection.
Authors:Qi Xu, Beat Signer
Abstract:
Scholarly reading often involves engaging with various supplementary materials beyond PDFs to support understanding. In practice, scholars frequently incorporate such external materials into their reading workflow through annotation. However, most existing PDF annotation tools support only a limited range of media types for embedding annotations in PDF documents. This paper investigates cross-media annotation as a design space for augmenting academic reading. We present a design exploration of a cross-media annotation tool that allows scholars to easily link PDF content with other documents and materials such as audio, video or web pages. The proposed design has the potential to enrich reading practices and enable scholars to guide and support other researchers' reading experiences.
Authors:Anthony Maocheia-Ricci, Edith Law
Abstract:
Value-based approaches such as Value Sensitive Design (VSD) enable technology designers to engage with and integrate human values in technology through a tripartite methodology of conceptual, empirical, and technical investigations. However, VSD contains pitfalls in both translating values to requirements and a lack of normative grounding, leading to adaptations such as Jacobs' Capability Sensitive Design (CSD). Inspired by CSD and extensions of the design approach, we propose the concept of creating -Sensitive Design (-SD); a meta-framework to embed various political or ideological values as norms in a design research process. We exemplify this through \emph{Dependency}-Sensitive Design (DSD), combining ideas from Kittay's critiques of classical liberal theory within a practical VSD framework. Finally, we push for further work combining philosophy and design in areas beyond CSD and DSD.
Authors:Fiammetta Caccavale, Carina L. Gargalo, Julian Kager, Magdalena Skowyra, Steen Larsen, Krist V. Gernaey, Ulrich Krühne
Abstract:
The landscape of education is changing rapidly, shaped by emerging pedagogical approaches, technological innovations such as artificial intelligence (AI), and evolving societal expectations, all of which demand thorough evaluation of new educational tools. Although large language models (LLMs) present substantial opportunities especially in Higher Education, their propensity to generate hallucinations and their limited specialized knowledge may introduce significant risks. This study aims to address these risks by examining the practical implementation of an LLM-enhanced assistant in a university level course. We implemented a generative AI assistant grounded in a retrieval-augmented generation (RAG) model to replicate a previously teacher-led, time-intensive exercise. To assess the effectiveness of the LLM, we conducted three separate experiments through iterative mixed-methods approaches, including a crossover design. The resulting data address central research questions related to student motivation, perceived differences between engaging with the LLM versus a human teacher, the quality of AI-generated responses, and the impact of the LLM on students' academic performance. The results offer direct insights into students' views and the pedagogical feasibility of embedding LLMs into specialized courses. Finally, we discuss the main challenges, opportunities and future directions of LLMs in teaching and learning in Higher Education.
Authors:Alexander V. Shenderuk-Zhidkov, Alexander E. Hramov
Abstract:
This article introduces and substantiates the concept of Neuro-Linguistic Integration (NLI), a novel paradigm for human-technology interaction where Large Language Models (LLMs) act as a key semantic interface between raw neural data and their social application. We analyse the dual nature of LLMs in this role: as tools that augment human capabilities in communication, medicine, and education, and as sources of unprecedented ethical risks to mental autonomy and neurorights. By synthesizing insights from AI ethics, neuroethics, and the philosophy of technology, the article critiques the inherent limitations of LLMs as semantic mediators, highlighting core challenges such as the erosion of agency in translation, threats to mental integrity through precision semantic suggestion, and the emergence of a new `neuro-linguistic divide' as a form of biosemantic inequality. Moving beyond a critique of existing regulatory models (e.g., GDPR, EU AI Act), which fail to address the dynamic, meaning-making processes of NLI, we propose a foundational framework for proactive governance. This framework is built on the principles of Semantic Transparency, Mental Informed Consent, and Agency Preservation, supported by practical tools such as NLI-specific ethics sandboxes, bias-aware certification of LLMs, and legal recognition of the neuro-linguistic inference. The article argues for the development of a `second-order neuroethics,' focused not merely on neural data protection but on the ethics of AI-mediated semantic interpretation itself, thereby providing a crucial conceptual basis for steering the responsible development of neuro-digital ecosystems.
Authors:Roxana Bujack, Li-Ta Lo, Ethan Stam, Ayan Biswas, David Rogers
Abstract:
Recent advances in AI enable the automatic generation of visualizations directly from textual prompts using agentic workflows. However, visualizations produced via one-shot generative methods often suffer from insufficient quality, typically requiring a human in the loop to refine the outputs. Human evaluation, though effective, is costly and impractical at scale. To alleviate this problem, we propose an automated metric that evaluates visualization quality without relying on extensive human-labeled datasets. Instead, our approach uses the original underlying data as implicit ground truth. Specifically, we introduce a method that measures visualization quality by assessing the reconstruction accuracy of the original data from the visualization itself. This reconstruction-based metric provides an autonomous and scalable proxy for thorough human evaluation, facilitating more efficient and reliable AI-driven visualization workflows.
Authors:Sarah Diefenbach, Daniel Ullrich
Abstract:
Conversation with chatbots based on Large Language Models (LLMs) such as ChatGPT has become one of the major forms of interaction with Artificial Intelligence (AI) in everyday life. What makes this interaction so convenient is that interacting with LLMs feels so natural, and resembles what we know from real, human conversations. At the same time, this seeming similarity is part of one of the ethical challenges of AI design, since it activates many misleading ideas about AI. We discuss similarities and differences between human-AI-conversations and interpersonal conversation and highlight starting points for more ethical design of AI at the front-end.
Authors:Xiruo Wang, Xinyi Jiang, Ziqi Lyu
Abstract:
Generative AI has made visual storytelling widely accessible, yet current prompt-based interactions often force users into a trade-off between precise control and creative flow. We present One Kiss, a co-creative comic generation system that introduces "Affective Steering". Instead of writing text prompts, users guide the tone of their story through emoji inputs, whose semantic ambiguity becomes a resource rather than a limitation. Unlike traditional text-to-image tools that rely on explicit descriptions, One Kiss uses a dual-stream input in which users define structural pacing by sketching panel frames and set atmospheric tone by pairing keywords with emojis. This mechanism enables "Genre Flux," where emotional inputs accumulate across panels and gradually shift the genre of a story. A preliminary study (N = 6) suggests that this soft steering approach may reframe the user's role from prompt engineer to narrative director, with ambiguity serving as a source of creative surprise rather than a loss of control.
Authors:Yang Ni, Fanli Jia
Abstract:
Artificial intelligence (AI)-enabled digital interventions, including Generative AI (GenAI) and Human-Centered AI (HCAI), are increasingly used to expand access to digital psychiatry and mental health care. This PRISMA-ScR scoping review maps the landscape of AI-driven mental health (mHealth) technologies across five critical phases: pre-treatment (screening/triage), treatment (therapeutic support), post-treatment (remote patient monitoring), clinical education, and population-level prevention. We synthesized 36 empirical studies implemented through early 2024, focusing on Large Language Models (LLMs), machine learning (ML) models, and autonomous conversational agents. Key use cases involve referral triage, empathic communication enhancement, and AI-assisted psychotherapy delivered via chatbots and voice agents. While benefits include reduced wait times and increased patient engagement, we address recurring challenges like algorithmic bias, data privacy, and human-AI collaboration barriers. By introducing a novel four-pillar framework, this review provides a comprehensive roadmap for AI-augmented mental health care, offering actionable insights for researchers, clinicians, and policymakers to develop safe, effective, and equitable digital health interventions.
Authors:Yan Xia, Sushmita Khan, Naiyah Lewis, Jinkyung Katie Park
Abstract:
Online LGBTQ+ communities face a persistent tension: remaining visible to welcome newcomers while protecting members from harassment. This challenge is particularly acute for lesbian communities on Reddit, which operate not as isolated groups but as an interconnected ecosystem. We examine how this tension is negotiated across the lesbian subreddit ecosystem (N=29) by combining network analysis of cross-subreddit links with a qualitative thematic analysis of 167 subreddit rules. Our findings show a functional division of governance labor between central (34%) and peripheral subreddits (66%). While all communities share a baseline of safety regulations, central subreddits prioritize content curation and feed quality to support a large, public-facing audience, whereas peripheral subreddits emphasize boundary maintenance and participation control to protect smaller, identity-specific niches. These findings challenge monolithic moderation approaches and highlight the need for ecosystem-aware design. We argue that effective moderation requires role- and context-sensitive tools supporting visibility and safety across interconnected spaces.
Authors:Jake Van Clief, David McDermott
Abstract:
Current approaches to AI agent orchestration typically involve building multi-agent frameworks that manage context passing, memory, error handling, and step coordination through code. These frameworks work well for complex, concurrent systems. But for sequential workflows where a human reviews output at each step, they introduce engineering overhead that the problem does not require. This paper presents Model Workspace Protocol (MWP), a method that replaces framework-level orchestration with filesystem structure. Numbered folders represent stages. Plain markdown files carry the prompts and context that tell a single AI agent what role to play at each step. Local scripts handle the mechanical work that does not need AI at all. The result is a system where one agent, reading the right files at the right moment, does the work that would otherwise require a multi-agent framework. This approach applies ideas from Unix pipeline design, modular decomposition, multi-pass compilation, and literate programming to the specific problem of structuring context for AI agents. The protocol is open source under the MIT license.
Authors:Mohammad Dastgheib, Fatemeh Pourmahdian
Abstract:
Extended Reality (XR) interfaces impose both ergonomic and cognitive demands, yet current systems often force a binary choice between hand-based input, which can produce fatigue, and gaze-based input, which is vulnerable to the Midas Touch problem and precision limitations. We introduce the xr-adaptive-modality-2025 platform, a web-based open-source framework for studying whether modality-specific adaptive interventions can improve XR-relevant pointing performance and reduce workload relative to static unimodal interaction. The platform combines physiologically informed gaze simulation, an ISO 9241-9 multidirectional tapping task, and two modality-specific adaptive interventions: gaze declutter and hand target-width inflation. We evaluated the system in a 2 x 2 x 2 within-subjects design manipulating Modality (Hand vs. Gaze), UI Mode (Static vs. Adaptive), and Pressure (Yes vs. No). Results from N=69 participants show that hand yielded higher throughput than gaze (5.17 vs. 4.73 bits/s), lower error (1.8% vs. 19.1%), and lower NASA-TLX workload. Crucially, error profiles differed sharply by modality: gaze errors were predominantly slips (99.2%), whereas hand errors were predominantly misses (95.7%), consistent with the Midas Touch account. Of the two adaptive interventions, only gaze declutter executed in this dataset; it modestly reduced timeouts but not slips. Hand width inflation was not evaluable due to a UI integration bug. These findings reveal modality-specific failure modes with direct implications for adaptive policy design, and establish the platform as a reproducible infrastructure for future studies.
Authors:Lara Lee Russell-Lasalandra, Hudson Golino
Abstract:
This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)--generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving substantially larger item pool, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model--prompt interactions in generative psychometric pipelines.
Authors:Michaela Benk, Tim Miller
Abstract:
The HCI community commonly evaluates decision support systems based on whether they improve task performance or promote appropriate user reliance. In this work, we look beyond decision outcomes to examine the process through which users develop decision-making strategies. Through a web-based experiment (N = 290) comparing recommendation-driven and hypothesis-driven interaction designs, and using Signal Detection Theory as a theoretical framework, we show that even when performance remains identical, recommendation-driven designs lower participants' thresholds for sufficient evidence and introduce a "hidden bias" in their judgments, resulting in a shifted distribution of errors. Furthermore, we find that experts are just as susceptible to these systemic shifts as novices. We conclude by advocating for a shift in focus: prioritizing decision processes and the preservation of stable evidence standards over performance and reliance alone.
Authors:Antonios Lykourinas, Chinmay Pendse, Francky Catthoor, Veronique Rochus, Xavier Rottenberg, Athanassios Skodras
Abstract:
Ultrasound (US) has emerged as a promising modality for Human-Machine Interfaces (HMIs), with recent research efforts exploring its potential for Hand Pose Estimation (HPE). A reliable solution to this problem could introduce interfaces with simultaneous support for up to 23 degrees of freedom encompassing all hand and wrist kinematics, thereby allowing far richer and more intuitive interaction strategies. Despite these promising results, a systematic comparison of models, input modalities and training strategies is missing from the literature. Moreover, there is only one publicly available dataset, namely the Ultrasound Adaptive Prosthetic Control (Ultra-Pro) dataset, enabling reproducible benchmarking and iterative model development. In this paper, we compare the performance of six different deep learning models, selected based on diverse criteria, on this benchmark. We demonstrate that, by using a step learning rate scheduler and the envelope of the RF signals as input modality, our 4-layer deep UDACNN surpasses XceptionTime's performance by $2.28$ percentage points while featuring $87.52\%$ fewer parameters. This result ($77.72\%$) constitutes an absolute improvement of $0.88\%$ from previously reported baselines. According to our findings, the appropriate combination of model, preprocessing and training algorithm is crucial for optimizing HMI performance.
Authors:Giulia Huang, Maristella Matera, Micol Spitale
Abstract:
Artificial agents that support human group interactions hold great promise, especially in sensitive contexts such as well-being promotion and therapeutic interventions. However, current systems struggle to mediate group interactions involving people who are not neurotypical. This limitation arises because most AI detection models (e.g., for turn-taking) are trained on data from neurotypical populations. This work takes a step toward inclusive AI by addressing the challenge of eye contact detection, a core component of non-verbal communication, with and for people with Intellectual and Developmental Disabilities. First, we introduce a new dataset, Multi-party Interaction with Intellectual and Developmental Disabilities (MIDD), capturing atypical gaze and engagement patterns. Second, we present the results of a comparative analysis with neurotypical datasets, highlighting differences in class imbalance, speaking activity, gaze distribution, and interaction dynamics. Then, we evaluate classifiers ranging from SVMs to FSFNet, showing that fine-tuning on MIDD improves performance, though notable limitations remain. Finally, we present the insights gathered through a focus group with six therapists to interpret our quantitative findings and understand the practical implications of atypical gaze and engagement patterns. Based on these results, we discuss data-driven strategies and emphasize the importance of feature choice for building more inclusive human-centered tools.
Authors:Sicheng Lu, Erick Purwanto, Hong Liu, Aini Li, Adel Chaouch-Orozco
Abstract:
Dialect bias is pervasive yet often unconscious, normalized, or obscured by masking. Existing HCI interventions primarily audit disparities and propose reactive fixes. We present CompassioMate, a dialect-aware serious game that nurtures perspective-taking through AI-mediated play. Players listen to audio samples to identify regional dialects, engage in simulated social interactions involving dialect discrimination, and explore branching narratives that reveal how changes in wording or stance can influence the outcomes. In a three-week field study with 20 university students, participants reported feeling comfortable when observing region-tailored dialogues; several described experiencing perspective change. We contribute: 1) a formative study identifying goals for safe action consequence modelling, 2) the design and evaluation of a serious game integrating dialect audio, region-mapping play, bias; and 3) design implications highlighting listener-side training, transparent evaluation, and narratives maintaining psychological well-being.
Authors:Hikari Kuriyama, Hiroaki Sonoda, Kouki Tomiyoshi, Gou Koutaki
Abstract:
Flute performance requires mastery of complex fingering combinations and register-dependent embouchure control, particularly jet offset adjustment for low-register production. Existing haptic and semi-automated systems do not address both aspects simultaneously through mechanical actuation. To our knowledge, no prior system fully automates fingering while mechanically assisting low-register tone production without requiring embouchure control. We developed a semi-automatic flute robot with an automatic fingering mechanism: fourteen servo motors actuate all keys via wire-based and rack-and-pinion drives in response to MIDI input, enabling performers to produce complete musical pieces through airflow alone. A jet offset assist mechanism rotates the head joint by a calibrated $22^\circ$ during low-register passages, shifting the jet offset toward a low-register configuration without modifying the instrument or embouchure. Fundamental frequency estimation confirmed correct pitch production across the chromatic range (C4--C7) and during musical performance. All key and lever movements were completed within 77.50~ms, corresponding to tempo capacity exceeding standard requirements. Harmonic analysis ($Δ\mathrm{SPL} = \mathrm{SPL}_2 - \mathrm{SPL}_3$) showed a consistent increase in $Δ$SPL for all low-register notes when activated, consistent with the intended jet offset shift. Head joint rotation completed within 40.00~ms. These results demonstrate mechanical feasibility of integrating automated fingering and register-dependent jet offset assistance under controlled conditions.
Authors:Wei Xiao, Mengke Wu, Yeeun Jo
Abstract:
Privacy policies are intended to support informed consent, yet users rarely read them fully. This study examines how common privacy policy interface structures influence attention allocation, reading behavior, and perceived experience. Using eye-tracking and post-task surveys, we compared three interface designs: continuous scrolling text, collapsible sections, and collapsible sections with brief previews. Results show that interface structure systematically shaped how users allocated attention and navigated policy content, but did not uniformly improve comprehension. Guided layouts supported more efficient and coherent reading patterns, whereas more interactive designs elicited higher perceived engagement and satisfaction. Importantly, comprehension was closely linked to sustained attention rather than interface type alone. These findings highlight the limits of interface-centered consent approaches and suggest that effective consent design must account for attention dynamics and selective engagement, rather than assuming that improved layout alone ensures understanding.
Authors:Ammar Al-Taie, Thomas Goodge, Shaun Macdonald, Ian Oakley, Stephen Brewster
Abstract:
Automated vehicles (AVs) must communicate their yielding intentions to pedestrians at crossings. External Human-Machine Interfaces (eHMIs, on-vehicle displays) are promising solutions, but were primarily tested with walking pedestrians. Runners are a significant pedestrian group who move faster and face distinct bodily and perceptual demands, raising questions about how pedestrian activity influences eHMI use. We conducted an outdoor study using an augmented reality simulator. Participants navigated a virtual crossing while walking and running; an approaching AV displayed one of three eHMIs: red/green colour-changing lights, animated cyan lights, or no-eHMI. No-eHMI consistently underperformed. Walkers mostly stopped and validated eHMI signals with vehicle behaviour; they processed both eHMI animations and colour changes effectively. Runners experienced greater time pressure to cross, increasing reliance on the eHMI over vehicle behaviour. They preferred colour changes over animation for rapid decisions. These findings are crucial for promoting eHMI inclusivity and physical wellbeing as AVs join our roads.
Authors:Joseph Damouni, Wadia Tanus, Naomi Unkelos-Shpigel
Abstract:
Static information presentation in VR cultural heritage often causes cognitive overload or under-stimulation. We introduce a closed-loop adaptive interface that tailors content depth to real-time visitor behavior through implicit multimodal sensing. Our approach continuously monitors gaze dwell, head kinematics, and locomotion to infer engagement via a transparent rule-based classifier, which drives a Large Language Model to dynamically modulate explanation complexity without interrupting exploration. We implemented a proof-of-concept in the Berat Ethnographic Museum and conducted a preliminary evaluation (N=16) comparing adaptive versus static content. Results indicate that adaptive participants demonstrated 2-3x increases in reading engagement and exploration time while maintaining high usability (SUS = 84.3). Technical validation confirmed sub-millisecond engagement inference latency on consumer VR hardware. These preliminary findings warrant larger-scale investigation and raise questions about engagement validation, AI transparency, and generative models in heritage contexts. We present this work-in-progress to spark discussion about implicit AI-driven adaptation in immersive cultural experiences.
Authors:Jesse T. Gonzalez, Neeta Khanuja, Michael Li, Maggie Guo, Layomi Olaitan, Emily Lau, Jennifer Pugh, Alexandra Ion, Scott E. Hudson
Abstract:
What happens when your walls begin to move? This paper explores the design of human-robot interaction for architectural-scale, shape-changing environments. We present findings from two studies: (1) a series of speculative design workshops (N=20) that uncovered aspirational visions for these spaces, and (2) a task-based Wizard-of-Oz elicitation study (N=12) that grounded these visions in the challenges of practical interaction. Our workshop findings reveal a complex landscape of user desires, exposing critical tensions between proactive automation and the preservation of user autonomy, and between personalization and public ownership. Our elicitation study reveals a set of core interaction challenges related to multimodal collaboration; and, most critically: suggests the need for a modality-agnostic model of evolving user intent. We conclude with a set of grounded proposals for creating robotic environments that are collaborative and trusted partners in everyday life.
Authors:Zhou Fang, Janet Yi-Ching Huang
Abstract:
Generative Artificial Intelligence (GAI) offers new opportunities for reconstructing these unrecorded memory scenes, yet existing web-based tools undermine users' sense of agency through disengaging and unpredictable interactions. In this work, we advance three design arguments about how slow, tangible interaction can reshape human-AI relationships by making temporality, embodied agency, and generative processes experientially legible. We instantiate these arguments by presenting Memory Printer, a tangible design that combines silk-screen printing metaphors with text-to-image generation. The design features layered reconstruction that decomposes image generation into incremental steps, a physical wooden scraper enabling embodied control over image revelation, and built-in printing that produces tangible photos. We examine these arguments through a comparative study with 24 participants, exploring how participants engage with, interpret, and respond to this interaction stance. The study surfaces both opportunities -- such as vivid memory evocation, heightened sense of control, and creative exploration -- and critical tensions, including risks of false memory formation, algorithmic bias, and data privacy. Together, these findings articulate important boundaries for deploying generative AI in emotionally sensitive contexts.
Authors:Sihan Qian, Amit Mehra, Dengpan Liu
Abstract:
The rise of foundation models has driven the emergence of AI supply chains, where upstream foundation model providers offer fine-tuning and inference services to downstream firms developing domain-specific applications. Downstream firms pay providers to use their computing infrastructure to fine-tune models with proprietary data, creating a co-creation dynamic that enhances model quality. Amid concerns that foundation model providers and downstream firms may capture excessive consumer surplus, along with increasing regulatory measures, this study employs a game-theoretic model involving a provider and two competing downstream firms to analyze how policy interventions affect consumer surplus in the AI supply chain. Our analysis shows that policies promoting price competition in downstream markets (i.e., pro-price-competitive policies) boost consumer surplus only when compute or data preprocessing costs are high, while compute subsidies are effective only when these costs are low, suggesting these policies complement each other. In contrast, policies promoting quality competition in downstream markets (i.e., pro-quality-competitive policies) always improve consumer surplus. We also find that under pro-price-competitive policies or compute subsidies, both the provider and downstream firms can achieve higher profits along with greater consumer surplus, creating a win-win-win outcome. However, pro-quality-competitive policies increase the provider's profits while reducing those of downstream firms. Finally, as compute costs decline, pro-price-competitive policies may lose their effectiveness, whereas compute subsidies may shift from ineffective to effective. These findings offer insights for policymakers seeking to foster AI supply chains that are economically efficient and socially beneficial.
Authors:Pei-Ying Lin, Julie Heij, Iris Borst, Britt Joosten, Kristina Andersen, Wijnand IJsselsteijn
Abstract:
Amidst the emergence of powerful intelligent technologies such as LLMs and text-to-image AIs that promise to enhance creative processes, designers face the challenges of remaining empowered and creative while working with these foreign digital partners. While generative AIs offer versatile, informative, and occasionally poetic outcomes, their lack of embodied knowledge presents an even greater challenge to designers in gaining fruitful outcomes, such as in the field of Digital Craftsmanship. In this project, three designers embarked on a three-month experimental journey with an intention to co-create with Google's LLM as a potential intelligent partner to investigate how it will influence the designers' creativity. We found that a power dynamic of agencies exists between the LLM and the designer, in which the designer can easily lose their creative agency. Regaining the designer's creative agency involves introspection into their own creative process, a structural understanding of the specific emerging technology involved, and deliberate adjustments to the dynamics of the human-technology relationship. We propose paying attention to the designer's inner world and parties of agencies when engaging with emerging intelligent technologies through three aspects: the sensitivity towards a creative process as cognitive activities; the active investigation into specific technology's capability; and the adjustment towards an appropriate working relationship between the designer and the emerging technology.
Authors:Lu Liu, Harm van Essen, Berry Eggen
Abstract:
Hybrid work settings often lack the informal communication that naturally emerges from spontaneous encounters and ambient awareness of coworkers' activities, potentially hindering team collaboration. To address this challenge, we explored how lightweight interactions can be integrated into awareness-supporting technologies for fostering informal communication. Our experiential design approach focused on how information is perceived and processed rather than explicit content exchange. Through brainstorming, speculating, and prototyping, we explored the design space for small hybrid teams. By annotating and analyzing design concepts, speculative scenarios, and prototypes, we developed a framework that identified design options for lightweight interactions and methods for integrating them with information displays.
Authors:Gaole He, Brian Y. Lim
Abstract:
Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks. However, human-agent interaction remains pointwise and reactive: users approve or correct individual actions to mitigate immediate risks, without visibility into subsequent consequences. This forces users to mentally simulate long-term effects, a cognitively demanding and often inaccurate process. Users have control over individual steps but lack the foresight to make informed decisions. We argue that effective collaboration requires foresight, not just control. We propose simulation-in-the-loop, an interaction paradigm that enables users and agents to explore simulated future trajectories before committing to decisions. Simulation transforms intervention from reactive guesswork into informed exploration, while helping users discover latent constraints and preferences along the way. This perspective paper characterizes the limitations of current paradigms, introduces a conceptual framework for simulation-based collaboration, and illustrates its potential through concrete human-agent collaboration scenarios.
Authors:María Isabel Rivas Ginel, Janiça Hackenbuchner, Alina Secară, Ralph Krüger, Caroline Rossi
Abstract:
This paper examines how value is constructed and negotiated in today's increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman's framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.
Authors:Liwen He, Pingting Chen, Ziheng Tang, Yixiao Liu, Jihong Jeung, Teng Han, Xin Tong
Abstract:
Designing affective behaviors for animal-inspired social robots often relies on intuition and personal experience, leading to fragmented outcomes. To provide more systematic guidance, we first coded and analyzed human-pet interaction videos, validated insights through literature and interviews, and created structured reference cards that map the design space of pet-inspired affective interactions. Building on this, we developed MojiKit, a toolkit combining reference cards, a zoomorphic robot prototype (MomoBot), and a behavior control studio. We evaluated MojiKit in co-creation workshops with 18 participants, finding that MojiKit helped them design 35 affective interaction patterns beyond their own pet experiences, while the code-free studio lowered the technical barrier and enhanced creative agency. Our contributions include the data-informed structured resource for pet-inspired affective HRI design, an integrated toolkit that bridges reference materials with hands-on prototyping, and empirical evidence showing how MojiKit empowers users to systematically create richer, more diverse affective robot behaviors.
Authors:Cameron Mohne, Nicholas Vo, Dora Demszky, Chris Piech
Abstract:
Role play is a high-impact mode of training that has demonstrated its effectiveness in improving learning outcomes. However, it is difficult to scale to teacher instruction due to its inherent dependency on providing personnel who are both trained and available to facilitate this learning environment. This poses a challenge, especially to massive online courses which may employ and aid hundreds to thousands of novice teachers. In this work, we present EducaSim: a novel framework that uses generative agents to simulate a small-group section for teachers-in-training to practice instruction. EducaSim works by implementing diverse pedagogical-based personas, actual course material, and agent-based architectures constructed for instructional practice to provide a pedagogically rich environment for teachers-in-training to engage in role play learning -- without the costly overhead that comes with it. We share our experiences with constructing and making the tool available for experimental training and preparation in a six-week CS1 course supporting 20,000 students. We found that teachers who engaged generally saw it as a positive experience. We believe that EducaSim is an important step to providing experiential teaching practice at scale for closely-defined settings and has great potential for future applications.
Authors:Morgan Wack, Patrick Warren, Mustafa Alam
Abstract:
Crowdsourced moderation systems like Twitter/X's Community Notes program have been proposed as scalable alternatives to professional fact-checkers for combating online misinformation. While prior research has examined the effectiveness of such systems in reducing engagement with false content and their vulnerability to partisan bias, we identify a previously untested mechanism linking fact-check difficulty to systematic non-participation by crowdsourced raters. We hypothesize that claims requiring less cognitive effort to evaluate, specifically, those that are obviously false and easy to refute, are more likely to receive public notes than claims that are more plausible and require greater effort to debunk. Using eighteen months of vaccine-related Community Notes data (2,250 posts) and ratings from 382 survey participants, we show that claims perceived as more difficult to fact-check are significantly less likely to receive notes that achieve ``helpful''/public status. Following the conduct of additional analyses and a fact-checking process utilizing an LLM pipeline to help rule out alternative explanations, we interpret this pattern as consistent with an unwillingness among raters to invest the mental effort required to evaluate and rate notes for more plausible misinformation. These findings suggest that crowdsourced moderation may systematically fail to address the forms of plausible misinformation which are most likely to deceive. We discuss implications for platform design and propose mechanisms to mitigate this difficulty penalty in crowdsourced content moderation systems.
Authors:Jun Rekimoto, Yu Nishimura, Bojian Yang
Abstract:
Silent and whispered speech offer promise for always-available voice interaction with AI, yet existing methods struggle to balance vocabulary size, wearability, silence, and noise robustness. We present NasoVoce, a nose-bridge-mounted interface that integrates a microphone and a vibration sensor. Positioned at the nasal pads of smart glasses, it unobtrusively captures both acoustic and vibration signals. The nasal bridge, close to the mouth, allows access to bone- and skin-conducted speech and enables reliable capture of low-volume utterances such as whispered speech. While the microphone captures high-quality audio, it is highly sensitive to environmental noise. Conversely, the vibration sensor is robust to noise but yields lower signal quality. By fusing these complementary inputs, NasoVoce generates high-quality speech robust against interference. Evaluation with Whisper Large-v2, PESQ, STOI, and MUSHRA ratings confirms improved recognition and quality. NasoVoce demonstrates the feasibility of a practical interface for always-available, continuous, and discreet AI voice conversations.
Authors:Yiyuan Wang, Andrew Johnston, Zoë Sadokierski, Rhiannon Stephens, Shane T. Ahyong
Abstract:
Recent digitisation efforts in natural history museums have produced large volumes of collection data, yet their scale and scientific complexity often hinder public access and understanding. Conventional data management tools, such as databases, restrict exploration through keyword-based search or require specialised schema knowledge. This paper presents a system design that uses conversational AI to query nearly 1.7 million digitised specimen records from the life-science collections of the Australian Museum. Designed and developed through a human-centred design process, the system contains an interactive map for visual-spatial exploration and a natural-language conversational agent that retrieves detailed specimen data and answers collection-specific questions. The system leverages function-calling capabilities of contemporary large language models to dynamically retrieve structured data from external APIs, enabling fast, real-time interaction with extensive yet frequently updated datasets. Our work provides a new approach of connecting large museum collections with natural language-based queries and informs future designs of scientific AI agents for natural history museums.
Authors:Alejandro Pradas-Gomez, Arindam Brahma, Ola Isaksson
Abstract:
Engineering analysis automation in product development relies on rigid interfaces between tools, data formats and documented processes. When these interfaces change, as they routinely do as the product evolves in the engineering ecosystem, the automation support breaks. This paper presents a DUCTILE (Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled) agentic orchestration, an approach for developing, executing and evaluating LLM-based agentic automation support of engineering analysis tasks. The approach separates adaptive orchestration, performed by the LLM agent, from deterministic execution, performed by verified engineering tools. The agent interprets documented design practices, inspects input data and adapts the processing path, while the engineer supervises and exercises final judgment. DUCTILE is demonstrated on an industrial structural analysis task at an aerospace manufacturer, where the agent handled input deviations in format, units, naming conventions and methodology that would break traditional scripted pipelines. Evaluation against expert-defined acceptance criteria and deployment with practicing engineers confirm that the approach produces correct, methodologically compliant results across 10 repeated independent runs. The paper discusses the paradigm shift and the practical consequences of adopting agentic automation, including unintended effects on the nature of engineering work when removing mundane tasks and creating an exhausting supervisory role.
Authors:Ajay Anand, Gabriel Parra, Chad A. Berghoff, Laura A. Hallock
Abstract:
Successful robot-mediated rehabilitation requires designing games and robot interventions that promote healthy motor practice. However, the interplay between a given user's neuromotor behavior, the gaming interface, and the physical robot makes designing system elements -- and even characterizing what behaviors are "healthy" or pathological -- challenging. We leverage our OpenRobotRehab 1.0 open access data set to assess the characteristics of 13 healthy and 2 post-stroke users' force output, muscle activations, and game performance while executing isometric trajectory tracking tasks using an end-effector rehabilitation robot. We present an assessment of how subtle aspects of interface design impact user behavior; an analysis of how pathological neuromotor behaviors are reflected in end-effector force dynamics; and a novel hidden Markov model (HMM)-based neuromotor behavior classification method based on surface electromyography (sEMG) signals during cyclic motions. We demonstrate that task specification (including which axes are constrained and how users interpret tracking instructions) shapes user behavior; that pathology-related features are detectable in 6D end-effector force data during isometric task execution (with significant differences between healthy and post-stroke profiles in force error and average force production at $p=0.05$); and that healthy neuromotor strategies are heterogeneous and inherently difficult to characterize. We also show that our HMM-based models discriminate healthy and post-stroke neuromotor dynamics where synergy-based decompositions reflect no such differentiation. Lastly, we discuss these results' implications for the design of adaptive end-effector rehabilitation robots capable of promoting healthier movement strategies across diverse user populations.
Authors:Francisco José Gárate, Paloma Chausa, Diego Moreno, Judit López Luque, Vicens Díaz-Brito, Enrique Javier Gómez
Abstract:
Empiric antibiotic prescribing in high-risk clinical contexts often requires decision making under conditions of incomplete information, where inappropriate coverage or unjustified escalation may compromise safety and antimicrobial stewardship. While clinical decision-support systems have been proposed to assist in this process, many approaches lack explicit governance and evaluation mechanisms defining scope, abstention conditions, recommendation permissibility, and expected system behavior. This work specifies a governance and evaluation framework for deterministic clinical decision-support systems operating under explicitly constrained scope. Deterministic behavior is adopted to ensure that identical inputs yield identical outputs, supporting transparency, auditability, and conservative decision support in high-risk prescribing contexts. The framework treats governance as a first-class design component, separating clinical decision logic from rule-based mechanisms that determine whether a recommendation may be issued. Explicit abstention, deterministic stewardship constraints, and exclusion rules are formalized as core constructs. The framework defines an evaluation methodology utilizing a fixed set of synthetic, mechanism-driven clinical cases with predefined expected behavior. This validation process focuses on behavioral alignment with specified rules rather than clinical effectiveness, predictive accuracy, or outcome optimization. Within this protocol, abstention is treated as a correct and intended outcome when governance conditions are not satisfied. The proposed framework provides a reproducible approach for specifying, governing, and inspecting deterministic clinical decision-support systems in empiric antibiotic prescribing contexts where transparency, auditability, and conservative behavior are prioritized.
Authors:Michael Keeman, Anastasia Keeman
Abstract:
When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis -- a novel methodological contribution -- reveals these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as "lost empathy" was a shift from a cautious model that missed crises to an alert model that sometimes says too much -- a trade-off with real consequences for vulnerable users, currently invisible to both the people who feel it and the developers who create it.
Authors:Xinyao Zhuang, Jose Echevarria, Kaan Akşit
Abstract:
Generative models are increasingly integrated into creative workflows. While text-to-image generation excels in visual quality and diversity, color accessibility for users with Color Vision Deficiencies (CVD) remains largely unexplored. Our work systematically evaluates color accessibility in images generated by a common pretrained diffusion model, prompted to improve accessibility across diverse categories. We quantify performance using established, off-the-shelf CVD simulation methods and introduce "CVDLoss", a new metric measuring differences in image gradients indicative of structural detail. We validate CVDLoss against a commonly used daltonization method, demonstrating its sensitivity to color accessibility modifications. Applying CVDLoss to model outputs reveals that existing diffusion models struggle to reliably respond to accessibility-focused prompts. Consequently, our study establishes CVDLoss as a valuable evaluation tool for accessibility-aware image generation and post-processing, offering insights into current generative models' limitations in addressing color accessibility.
Authors:Jaap Munneke, Jennifer E. Corbett
Abstract:
Synthesizing from Corbett and Munneke (2025), who demonstrated that questions originating in human-computer interaction (HCI) and game design can be answered through the theoretical toolkit of cognitive science, this perspective argues that commercial videogames represent a largely underutilized research environment at the intersection of these two fields. Cognitive science has long relied on carefully controlled laboratory paradigms to study perception, attention, and executive functioning, raising persistent questions about ecological validity. HCI, by contrast, has spent decades developing methods for studying behavior in rich, complex, interactive environments, but has been less concerned with what that behavior reveals about underlying cognitive mechanisms. Commercial videogames sit precisely at this intersection. They are cognitively demanding by design, motivating by nature, and consistent enough across players to support systematic behavioral comparison. The affordance structure of a game does the work that experimental manipulations typically require of the researcher, instantiating cognitive demands that are genuine, sustained, and meaningful to the player. We argue that perception, attention, and executive functioning can be meaningfully studied within commercial games using a minimal observational toolkit of screen recording, eye tracking, and behavioral timing. We propose an affordance-cognition mapping framework as a systematic basis for game selection and research design and offer practical methodological recommendations for researchers wishing to work in this space.
Authors:Haidan Liu, Poorvi Bhatia, Nicholas Vincent, Parmit Chilana
Abstract:
Developing AI literacy is increasingly urgent as generative AI reshapes creative practice. Yet most AI literacy frameworks are top-down and expert-driven, overlooking how literacy emerges organically in creative communities. To address this gap, we performed a large-scale analysis of 122k Reddit conversations from 80 creative-oriented subreddits over a three-year period. Our analysis identified four consistent themes in AI literacy-related discussions, and we further traced how discourse shifted alongside major AI events. Surprisingly, creators primarily frame AI literacy around how to use tools effectively, foregrounding practice and task skills, while discussions of AI capabilities and ethics surge only around high-profile events. Our findings suggest that AI literacy is dynamic, practice-driven, and event-responsive rather than static or purely conceptual. This study provides insights for researchers, designers, and policymakers to develop learning resources, community support, and policies that better promote AI literacy in creative communities.
Authors:Tianyi Li, Jin Wei-Kocsis
Abstract:
This manuscript presents the perspectives and reflections of two researchers who were not previously engaged in aging research, regarding the gaps and barriers related to interdisciplinary collaboration on HCI and Aging research. The manuscript has two sections. In the first section, the authors discuss their observations on the disconnect between the needs of aging populations and the design of emerging technologies. The second section delves into their personal journey of developing empathy and a deeper understanding of older adults by volunteering in a senior living community, and shares their reflective thoughts on these experiences.
Authors:Haichang Li, Anjun Zhu, Arpit Narechania
Abstract:
In real-world collaboration, alignment, process structure, and outcome quality do not exhibit a simple linear or one-to-one correspondence: similar alignment may accompany either rapid convergence or extensive multi-branch exploration, and lead to different results. Existing accounts often isolate these dimensions or focus on specific participant types, limiting structural accounts of collaboration. We reconceptualize collaboration through two complementary lenses. The task lens models collaboration as trajectory evolution in a structured task space, revealing patterns such as advancement, branching, and backtracking. The intent lens examines how individual intents are expressed within shared contexts and enter situated decisions. Together, these lenses clarify the structural relationships among alignment, decision-making, and trajectory structure. Rather than reducing collaboration to outcome quality or treating alignment as the sole objective, we propose a unified dynamic view of the relationships among alignment, process, and outcome, and use it to re-examine collaboration structure across Human-Human, AI-AI, and Human-AI settings.
Authors:Md Mojibur Rahman Redoy Akanda, Ahmed Tanvir Mahdad, Nitesh Saxena
Abstract:
In today's technology-driven world, web services have opened up new opportunities for blind and visually impaired people to interact independently. Securing interactions with these services is crucial; however, currently deployed authentication mainly concentrate on sighted users, overlooking the needs of the blind and visually impaired community. In this paper, we address this gap by investigating the security and accessibility aspects of these authentication when adopted by blind and visually impaired users. We model web authentication for such users as screen reader assisted authentication and introduce an evaluation framework called AWARE. Using AWARE, we then systematically assessed popular PC and smartphone-based screen readers against different authentication methods, including variants of 2FA and passwordless schemes, to simulate real-world scenarios. We analyzed these screen reader assisted authentication interactions with authentication methods in three settings: using a terminal (PC) with screen readers, a combination of the terminal (PC) and smartphone with screen readers, and smartphones with integrated screen readers. The results of our study underscore weaknesses in all of our observed screen reader assisted scenarios for real-life authentication methods. These weaknesses, encompassing specific accessibility issues caused by imprecise screen reader instructions, highlight vulnerability concerning observed scenarios for both real-world and research literature based attacks, including phishing, concurrency, fatigue, cross-service, and shoulder surfing. Broadly, our AWARE framework can be used by designers as a precursor to user studies which are typically time-consuming and tedious to perform, independently allowing to unfold security and accessibility problems early which designers can address prior to full-fledged user testing of more isolated issues.
Authors:Yasmin Zaraket, Céline Mougenot
Abstract:
Mental healthcare services in the UK lack tools and resources to address the cultural needs of Muslim women, often leaving them feeling as though their values are pathologised and limiting trust and engagement [1]. Despite growing awareness of cultural competency, few interventions integrate Islamic frameworks into therapeutic support. This report investigates the design and evaluation of YAQIN, a co-designed AI-based application supporting culturally and faith-sensitive mental health engagement for Muslim women. With almost 1.9 million Muslim women in England in 2021, YAQIN responds to a gap in care [2]. It leverages AIś anonymity and continuous support through a faith-aware chatbot and guided journaling tool grounded in user-centred design and Islamic psychology. The YAQIN design research methodology comprised three stages: contextual investigation and literature review, user research with N=14 stakeholders including Muslim women and mental health experts, and prototype development informed by deductive thematic analysis, personas, journey maps, and design specifications. Evaluation involved a co-designed user study with five participants: four Muslim women and one mental health expert who reviewed therapeutic alignment and cultural sensitivity after using the chatbot prototype. Feedback focused on tone, faith relevance, emotional resonance, and the Retrieval-Augmented Generation pipeline enabling contextual continuity. Participants highlighted YAQINś ability to bridge cultural gaps in trust and therapeutic confidence. Feedback included suggestions of including linguistic diversity and routine-based guidance. This project demonstrates how culturally sensitive AI can improve mental healthcare accessibility and trust for marginalised communities and highlights the potential of faith-integrated technology in healthcare innovation.
Authors:Tae Hee Jo, Kyung Hoon Hyun
Abstract:
Current AI-based Creativity Support Tools (CSTs) generate massive amounts of low-level log data (e.g., clicks, parameter tweaks, metadata updates) that are hard to interpret as "creative intent". We argue that to enable future agentic systems to understand and assist users, we must first translate these noisy system traces into meaningful high-level user behavioral traces. We propose a method that parses raw csv/JSON logs into structured behavioral workflow graphs that map the provenance and flow of creative assets. By abstracting low-level system events into high-level behavioral tokens (e.g., MODIFY_Prompt, GENERATE_Image), this method enables downstream analyses like sequence mining and probabilistic modeling. We discuss how this structured workflow history is a prerequisite for "Process-Aware Agents" - systems capable of suggesting next design moves or explaining rationales based on a deeper understanding of the user's workflow patterns and history.
Authors:Mengfei Gao, Caroline Appert, Ludovic David, Emmanuel Pietriga
Abstract:
Browsing the Web on mobile devices is often cumbersome due to their limited screen space. We investigate a phone+AR Web browsing approach, AiRWeb, that leverages the structural properties of Web pages to allow users to seamlessly select and offload arbitrary Web content into the space surrounding them. Focusing on flexibility, AiRWeb lets users decide what to offload, when to do so, and how offloaded content is arranged, enabling personalized organization tailored to the task at hand. We developed a fully functional prototype using standard Web technologies, that covers the complete interaction workflow, from the selection of elements to offload from the phone to their manipulation in the air. Results from a preliminary study conducted using this prototype suggest that AiRWeb is learnable and usable, while also revealing open design challenges around offload mode activation in particular.
Authors:Angel Hsing-Chi Hwang, Senya Wong, Baixiao Chen, Jessica He, Hyo Jin Do
Abstract:
The growing use of AI applications among freelance workers is reshaping trust and relationships with clients. This paper investigates how both workers and clients perceive AI use and disclosure in the freelance economy through a three-stage study: interviews with workers and two survey studies with workers and clients. Findings first reveal a key expectation gap around disclosure: Workers often adopt passive disclosure practices, revealing AI use only when asked, as they assume clients can already detect it. Clients, however, are far less confident in recognizing AI-assisted work and prefer proactive disclosure. A second finding highlights the role of unclear or absent client AI policies, which leave workers consistently misinterpreting clients' expectations for AI use and disclosure. Together, these gaps point to the need for clearer guidelines and practices for AI disclosure. Insights extend beyond freelancing, offering implications for trust, accountability, and policy design in other AI-mediated work domains.
Authors:Balint K. Hodossy, Dario Farina
Abstract:
The standard engineering approach when facing uncertainty is modelling. Mixing data from a well-calibrated model with real recordings has led to breakthroughs in many applications of AI, from computer vision to autonomous driving. This type of model-based data augmentation is now beginning to show promising results in biosignal processing as well. However, while these simulated data are necessary, they are not sufficient for virtual neurophysiological experiments. Simply generating neural signals that reproduce a predetermined motor behaviour does not capture the flexibility, variability, and causal structure required to probe neural mechanisms during control tasks. In this study, we present an in silico neuromechanical model that combines a fully forward musculoskeletal simulation, reinforcement learning, and sequential, online electromyography synthesis. This framework provides not only synchronised kinematics, dynamics, and corresponding neural activity, but also explicitly models feedback and feedforward control in a virtual participant. In this way, online control problems can be represented, as the simulated human adapts its behaviour via a learned RL policy in response to a neural interface. For example, the virtual user can learn hand movements robust to perturbations or the control of a virtual gesture decoder. We illustrate the approach using a gesturing task within a biomechanical hand model, and lay the groundwork for using this technique to evaluate neural controllers, augment training datasets, and generate synthetic data for neurological conditions.
Authors:Lingwei Cheng, Saerim Kim, Andrew Sullivan
Abstract:
When governments mandate collaboration, shared data systems can serve both as tools for coordination and instruments of control. This study examines U.S. homelessness service networks, where Continuums of Care (CoCs) coordinate service providers through the federally mandated Homeless Management Information System (HMIS). With client consent, providers enter data into HMIS and access cross-provider service histories to support coordinated care. At the same time, HMIS embeds standards and governance rules that shape who can collect, access, interpret, and act on data, and thus who holds decision authority. Using qualitative interviews with six experts, we show that standardization can facilitate collaboration and shared learning. However, unequal resources, analytic capacity, and authority limit equitable participation and often shift some participants toward compliance-focused roles. We contribute to public-interest design research on civic data infrastructures by illustrating how mandated data sharing can simultaneously enable coordination and accountability while reproducing power asymmetries in data interpretation and decision-making.
Authors:Vasty A. Adomako, Kaisu Mumuni, Eugene M. Akoto, Felix N. Koranteng
Abstract:
As institutions increasingly depend on Information Systems (ISs), ensuring compliance with Information Systems Security Policies (ISSPs) is critical, especially among contingent employees, whose engagement differs from that of permanent staff. This study examines how Subjective Norm, Deterrence (certainty of detection and severity of punishment), and involvement mechanisms (knowledge sharing and collaboration) influence contingent employees Attitudes Toward ISSPs and, ultimately, their Compliance Intentions. Drawing on data from Ghanaian universities and analyzed using PLS-SEM, the findings confirm that all proposed factors significantly shape attitudes, with knowledge sharing having the strongest effect. Attitude toward ISSPs also strongly predicts compliance intentions. The results support integrating social, cognitive, and collaborative factors into existing ISSP compliance models. Practical implications emphasize fostering inclusive and supportive environments alongside enforcement. This study advances theory and provides a foundation for future research into ISSP behavior among temporary academic staff.
Authors:Lois Fajuyigbe, Kaisu Mumuni, Felix Nti Koranteng
Abstract:
As higher education increasingly adopts blended learning, understanding students preferences for online interaction platforms becomes critical for effective course delivery and engagement. This study investigates the platforms undergraduate students prefer for academic communication and explores the underlying reasons for these choices. Data were collected from 37 students enrolled in two summer courses at a Ghanaian university using a structured questionnaire consisting of both closed and open-ended items. Quantitative results revealed a strong preference for instant messaging platforms such as WhatsApp and Telegram over institutional learning management systems. Qualitative content analysis of the open-ended responses identified five key factors influencing platform preference: convenience and familiarity, ease of use, accessibility, popularity among peers, and support for real-time interactions. These findings highlight a significant mismatch between students communication habits and institutional platform offerings. The study highlights the importance of aligning digital learning strategies with students lived digital experiences to enhance interaction, collaboration, and learner satisfaction in blended learning environments.
Authors:Xiaohan Peng, Sotiris Piliouras, Carl Abou Saada Nujaim
Abstract:
Analyzing creative activity traces requires capturing activity at appropriate granularity and interpreting it in ways that reflect the structure of creative practice. However, existing approaches record state changes without preserving the intent or relationships that define higher-level creative moves. This decoupling manifests differently across domains: GenAI tools lose non-linear exploration structure, visualization authoring obscures representational intent, and programmatic environments flatten interaction boundaries. We present three complementary approaches: a node-based interface for stateful GenAI artifact management, a vocabulary of visual cues as higher-level creative moves in visualization authoring, and a programming model that embeds semantic histories directly into interaction state.
Authors:Arman Khalilbeigi Khameneh, Armin Mostafavi, Alicia Nahmad Vazquez
Abstract:
Decision Support Systems (DSS) play a crucial role in enabling non-expert designers to explore complex, performance-driven design spaces. This paper presents a gamified decision-making framework that integrates game engines with real-time performance feedback. Performance criteria include structural behavior, environmental parameters, fabrication, material, and cost considerations. The developed design framework was tested with architecture students and non-expert designers on the design of an exoskeleton facade to retrofit an existing building. Participants (N=24) were able to iteratively modify façade geometries while receiving real-time feedback across the three key criteria: 1) structural behavior, including deflection, mass, and stress/strength ratio; 2) environmental parameters, such as solar gain and heating/cooling energy demands; and 3) fabrication considerations, including fabrication and material costs, robotic machining, and material setup. The evaluation of participant interactions reveals that gamified feedback mechanisms significantly enhance user comprehension and informed decision-making across the criteria. Further, participants' understanding of structural, material, and fabrication performance in relation to the iterative design task suggests that curated design spaces and structured guidance improve efficiency compared to open-ended generative tools. This research contributes to pre-occupancy evaluations, demonstrating how gamified environments enable stakeholder participation in the design process through informed decisionmaking and customized negotiation of performance criteria. .
Authors:Jianna So, Connie Cheng, Sonia Krishna Murthy
Abstract:
Anthropomorphizing conversational technology is a natural human tendency. Today, the anthropomorphic metaphor is overly reinforced across intelligent tools. Large Language Models (LLMs) are particularly anthropomorphized through interface design. While metaphors are inherently partial, anthropomorphic interfaces highlight similarities between LLMs and humans, but mask crucial differences. As a result, the metaphor is often taken literally; users treat LLMs as if they are truly human. With few safeguards in place, this extreme anthropomorphism drives users to delusion and harm. Users also experience dissonance between the ethics of using LLMs, their growing ubiquity, and limited interface alternatives. We propose repositioning anthropomorphism as a design variable, developing opposing extremes as a theoretical framework for how interface metaphors shape and can disrupt the default metaphor. We introduce a spectrum of metaphors from transparency-driven ''anti-anthropomorphism'' to uncanny ''hyper-anthropomorphism''. These metaphors introduce materiality to interface metaphors, exposing LLMs as sociotechnical systems shaped by human labor, infrastructure, and data. This spectrum shifts interface design away from optimizing usability and toward encouraging critical engagement.
Authors:Nora Petrova, Andrew Gordon, Enzo Blindow
Abstract:
The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. \textbf{(1)} We establish a clear performance hierarchy where \texttt{google/gemini-2.5-pro} ranks first overall, with a 95.6\% posterior probability of being the top-ranked model. \textbf{(2)} We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. \textbf{(3)} We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like \textit{Trust, Ethics \& Safety} showing a 65\% tie rate, in stark contrast to the decisive 10\% tie rate for \textit{Overall Winner}. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
Authors:Gonzalo Gabriel Méndez, Jose Such
Abstract:
Privacy policies are long, complex, and rarely read, which limits their effectiveness in informed consent. We investigate scrollytelling, a scroll-driven narrative approach, as a privacy policy presentation format. We built a prototype that interleaves the full policy text with animated visuals to create a dynamic reading experience. In an online study (N=454), we compared our tool against text, two nutrition-label variants, and a standalone interactive visualization. Scrollytelling improved user experience over text, yielding higher engagement, lower cognitive load, greater willingness to adopt the format, and increased perceived clarity. It also matched other formats on comprehension accuracy and confidence, with only one nutrition-label variant performing slightly better. Changes in perceived understanding, transparency, and trust were small and statistically inconclusive. These findings suggest that scrollytelling can preserve comprehension while enhancing the experience of policy reading. We discuss design implications for accessible policy communication and identify directions for increasing transparency and user trust.
Authors:Hyein Kim, Sung Park
Abstract:
For four decades, AIED research has rested on what we term the Sedentary Assumption: the unexamined design commitment to a stationary learner seated before a screen. Mobile learning and museum guides have moved learners into physical space, and context-aware systems have delivered location-triggered content -- yet these efforts predominantly cast AI in the role of information-de-livery tool rather than epistemic partner. We map this gap through a 2 x 2 matrix (AI Role x Learning Environment) and identify an undertheorized intersection: the configuration in which AI serves as an epistemic teammate during unstruc-tured, place-bound field inquiry and learning is assessed through trajectory rather than product. To fill it, we propose Field Atlas, a framework grounded in embod-ied, embedded, enactive, and extended (4E) cognition, active inference, and dual coding theory that shifts AIED's guiding metaphor from instruction to sensemak-ing. The architecture pairs volitional photography with immediate voice reflec-tion, constrains AI to Socratic provocation rather than answer delivery, and ap-plies Epistemic Trajectory Modeling (ETM) to represent field learning as a con-tinuous trajectory through conjoined physical-epistemic space. We demonstrate the framework through a museum scenario and argue that the resulting trajecto-ries -- bound to a specific body, place, and time -- constitute process-based evi-dence structurally resistant to AI fabrication, offering a new assessment paradigm and reorienting AIED toward embodied, dialogic human-AI sensemaking in the wild.
Authors:Satabdi Das, Nahian Beente Firuj, Manjot Singh, Arshad Nasser, Khalad Hasan
Abstract:
People with Blind Visual Impairments (BVI) face unique challenges when sharing images, as these may accidentally contain sensitive or inappropriate content. In many instances, they are unaware of the potential risks associated with sharing such content, which can compromise their privacy and interpersonal relationships. To address this issue, we investigated image filtering techniques that could help BVI users manage sensitive content before sharing with various audiences, including family, friends, or strangers. We conducted a study with 20 BVI participants, evaluating different filters applied to images varying in sensitivity, such as personal moments or embarrassing shots. Results indicated that pixelation was the least preferred method, while preferences for other filters varied depending on image type and sharing context. Additionally, participants reported greater comfort when sharing filtered versus unfiltered images across audiences. Based on the results, we offer a set of design guidelines to enhance the image-sharing experience for BVI individuals.
Authors:Xiaohan Peng, Wendy E. Mackay, Janin Koch
Abstract:
Design is a non-linear, reflective process in which practitioners engage with visual, semantic, and other expressive materials to explore, iterate, and refine ideas. As Generative AI (GenAI) becomes integrated into professional design practice, traditional interaction approaches focusing on prompts or whole-image manipulation can misalign AI output with designers' intent, forcing visual thinkers into verbal reasoning or post-hoc adjustments. We present three interaction approaches from DesignPrompt, FusAIn, and DesignTrace that distribute control across intent, input, and process, enabling designers to guide AI alignment at different stages of interaction. We further argue that alignment is a dynamic negotiation, with AI adopting proactive or reactive roles according to designers' instrumental and inspirational needs and the creative stage.
Authors:Maria Moskalenko, Alexander Trifanov, Roman Popkov, Arina Tabieva, Maria Smirnova, Konstantin Pravdin, Daniil Bakalin
Abstract:
This paper introduces `Math Battles with AI', an innovative competitive format designed at ITMO University to redefine the role of generative AI in mathematics education. Moving away from a purely defensive stance, the authors propose an AI agent with intentionally increased hallucination likelihood in specific modes to train verification skills. We describe the three-stage tournament structure and a specialized assessment system that rewards critical verification over blind reliance. Initial results indicate a significant shift in student mindsets, fostering essential skills in digital hygiene and prompt engineering. This work serves as a practical guide for academic institutions aiming to leverage AI for enhancing, rather than undermining, intellectual development.
Authors:Fabio Cortes Rodriguez, Luciano Abriata
Abstract:
This project successfully developed, evaluated and integrated a Voice User Interface (VUI) into a web application that we are developing for immersive molecular graphics. Said app provides augmented and virtual reality (AR and VR) environments where users manipulate molecules with their hands, but this means the hands can't be used to control the app through a regular mouse- and keyboard-based GUI. The speech-based VUI system developed here alleviates this problem, making it easy to control the app via natural spoken (or typed) commands. To achieve this VUI we evaluated two distinct Automated Speech Recognition (ASR) systems: Chrome's native Speech API and OpenAI's Whisper v3. While Whisper offered broader browser compatibility, its tendency to "hallucinate" with specialized scientific jargon proved very problematic. Consequently, we selected Chrome's ASR for its stability, speed, and reliability. For translating transcribed speech into software commands, we tested two Large Language Model (LLM)-driven approaches: either generating executable code, or calling predefined functions. The function call method, powered by OpenAI's GPT-4o-mini, was ultimately adopted due to its superior safety, efficiency, and reliability over the more complex and error-prone code-generation approach. The resulting VUI is then based on an integration of Chrome's ASR with our LLM-based function-calling module, enabling users to command the application using natural language as shown in a video linked inside this report. We provide links to live examples demonstrating all the intermediate components, and details on how we crafted the LLM's prompt in order to teach it the function calls as well as ways to clean up the transcribed speech and to explain itself while generating function calls. For best demonstration of the final system, we provide a video example.
Authors:Tse Pei Ng, Daniel Campos-Muniz, Yiyang He, Ker Wey Aw, Jung-Joo Lee, Janghee Cho
Abstract:
Flexible work is increasingly pursued as a means of achieving work-life balance, particularly as growing caregiving responsibilities for children and aging family members shape workers' lives. Yet most HCI research has examined flexibility primarily through productivity and organizational perspectives, with less attention to how it intersects with workers' personal and family responsibilities. To address this gap, we conducted a qualitative study with 20 workers in Singapore engaging in flexible arrangements to manage paid work and care responsibilities. Using an asset-based lens, we show that flexibility is not a static benefit but a continual practice of rhythm-making. Participants maintained rhythms by drawing on temporal and spatial assets, negotiated them through relational and institutional dynamics, and sustained them through intrapersonal assets such as self-care and positive reframing. Our study reframes blurred boundaries as resources rather than disruptions and offers design implications for technologies that support flexible workers' everyday rhythm-making practices.
Authors:Ilias Triantafyllopoulos, Panos Ipeirotis
Abstract:
The integrity of behavioral and social-science surveys depends on detecting inattentive respondents who provide random or low-effort answers. Traditional safeguards, such as attention checks, are often costly, reactive, and inconsistent. We propose a unified, label-free framework for inattentiveness detection that scores response coherence using complementary unsupervised views: geometric reconstruction (Autoencoders) and probabilistic dependency modeling (Chow-Liu trees). While we introduce a "Percentile Loss" objective to improve Autoencoder robustness against anomalies, our primary contribution is identifying the structural conditions that enable unsupervised quality control. Across nine heterogeneous real-world datasets, we find that detection effectiveness is driven less by model complexity than by survey structure: instruments with coherent, overlapping item batteries exhibit strong covariance patterns that allow even linear models to reliably separate attentive from inattentive respondents. This reveals a critical ``Psychometric-ML Alignment'': the same design principles that maximize measurement reliability (e.g., internal consistency) also maximize algorithmic detectability. The framework provides survey platforms with a scalable, domain-agnostic diagnostic tool that links data quality directly to instrument design, enabling auditing without additional respondent burden.
Authors:Guilhem Fouilhé, Rebecca Eifler, Antonin Poché, Sylvie Thiébaux, Nicholas Asher
Abstract:
When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users' questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.
Authors:Soumita Mukherjee, Priya Kumar, Laura Cabrera
Abstract:
This study examines how vulnerability is produced for human operators of Tesla's Full Self-Driving (FSD), a Level 2 semi-autonomous vehicle (SAV) system, by applying Florencia Luna's layered vulnerability framework. While existing road safety models conceptualize vulnerability as a fixed attribute of external road users, emerging evidence suggests that semi-autonomous vehicle operators themselves experience dynamic and situational vulnerability as they supervise automated systems that they do not fully control. To investigate this phenomenon, we conducted semi-structured interviews with 17 active FSD users, analyzing their accounts through a combined deductive-inductive coding process aligned with Luna's framework. Findings reveal three interacting layers of operator vulnerability, namely psychological, operational, and social. Vulnerability emerged not from any single layer but from how these layers converged in specific situations, creating fluctuating supervisory demands and uneven capacity to recognize and manage risk. The findings extend debates on contextual trust calibration, automation complacency, and meaningful human control by demonstrating how factors commonly treated as liabilities such as trust or informal learning, can both increase and mitigate vulnerability depending on context. This analysis determines the need for design and regulatory interventions that address psychological, operational, and social conditions together rather than in isolation, and highlights how responsibility is implicitly shifted onto individual operators within inadequately supported supervisory regimes.
Authors:Tim Rieder, Marian Schneider, Mario Truss, Vitaly Tsaplin, Alina Rublea, Sinem Dere, Francisco Chicharro Sanz, Tobias Reiss, Mustafa Doga Dogan
Abstract:
A/B testing is a standard method for validating design decisions, yet its reliance on real user traffic limits iteration speed and makes certain experiments impractical. We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales. Through a formative study with experimentation practitioners, we identified scenarios where traffic constraints hinder testing, including low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive contexts. Our design emphasizes speed, early feedback, actionable rationales, and audience specification. We evaluate SimAB against 47 historical A/B tests with known outcomes, achieving 67% overall accuracy, increasing to 83% for high-confidence cases. Additional experiments show robustness to naming and positional bias and demonstrate accuracy gains from personas. Practitioner feedback suggests that SimAB supports faster evaluation cycles and rapid screening of designs difficult to assess with traditional A/B tests.
Authors:Yifan Li, Xingyu Lan
Abstract:
Generative AI has enabled ``Deadbots'', offering mourners an interactive way to engage with simulations of the deceased. While existing research often emphasizes ethics, less is known about how bereaved individuals construct and reshape memory through such interactions. To address this gap, this study draws on in-depth interviews with 26 users. Findings reveal that users are not passive recipients but active constructors of the deceased's digital representation. Through selective input, ongoing interactive adjustments and imaginative cognitive supplementation, they build an idealized digital figure blending authentic memories with personal expectations. Deadbots provide a private space to grieve without social pressure and a channel to address unresolved emotions. In this process, users' memory of the deceased evolves dynamically: from initial reinforcement and idealization to a later stage where AI-generated new memories blur with authentic recollections, reflecting a complex desire for connection through an artificial medium. This blurring raises ethical concerns regarding memory distortion and dependency, underscoring the need for future clinical research on the long-term impact of AI-mediated grieving.
Authors:Inhwa Song, Kwangyoung Lee, Janghee Cho, Amon Rapp, Hwajung Hong
Abstract:
While Personal Informatics (PI) systems support behavior change, everyday well-being involves more than achieving individual target behaviors. It is shaped by cultural narratives that give actions meaning. In South Korea, the God-Saeng phenomenon, encompassing disciplined, collective, and publicly documented self-improvement practices, offers a lens into how well-being is negotiated in daily life. We conducted a 10-day probe (N=24) with bite-sized missions to examine how young adults engaged in God-Saeng. Participants relied on planning practices, accountability infrastructures, and datafication to stabilize themselves, yet these same routines also intensified pressures toward self-monitoring and performance. They navigated tensions between consistency and flexibility, authenticity and visibility, and productivity and broader values such as relationships, and reinterpreted ordinary activities through sociocultural contexts. These insights suggest design opportunities for PI systems that move beyond tracking, toward digital instruments that help users negotiate tensions, make meaning, and reflexively understand how technologies participate in their culturally and existentially situated well-being.
Authors:Yuqing Hu, Wendao Xue, Yifan Yu, Yong Tan
Abstract:
Advances in artificial intelligence (AI), together with persistent gaps in access to reliable emotional support, have positioned AI as an increasingly prominent source of emotional assistance. However, most AI-based emotional support applications and prior research focus on one-on-one interactions between users and a single AI agent, leaving the potential advantages of alternative support configurations largely unexplored. Drawing on social support and support group theory, this research examines whether AI-based emotional support delivered by a group of AI agents (group AI support) can constitute a more effective support form than single-agent support (single AI support). We propose that group AI support enhances users' perceived support efficacy, that this effect operates by strengthening users' connectedness with the AI system, and that the composition of support types within AI groups further shapes support outcomes. Three experiments provide convergent support for these claims. By identifying when and why group AI emotional support outperforms single AI support, this work advances theoretical understanding of AI-based emotional support and provides actionable guidance for the design of AI support systems.
Authors:Andrea Cuadra, Samar Sabie, Yan Shvartzshnaider, Deborah Estrin
Abstract:
We investigate the ethical and privacy implications of voice-first ambient interfaces (VFAIs) for aging in place through an in-depth engagement with five older adults. Our participants were in the process of becoming experienced VFAI users, and had used a VFAI-based design probe for health data reporting. We create and iteratively refine an interview protocol using Privacy Cards. We customize Privacy Cards by drawing on participants' previous interviews and device usage logs. Using Privacy Cards, we conduct interviews to surface their mental models, and explore their privacy concerns. We find insufficient mental models for proper consent. For example, participants did not know who could access their data, and experienced difficulty distinguishing built-in functionality from third-party apps. Participants initially expressed little worry about VFAI-related ethical concerns, but interviews with Privacy Cards revealed nuanced issues, resulting in various implications for future research and design.
Authors:Jun Aoki, Shunki Itadera
Abstract:
The application of teleoperation to control robotic arms has been widely explored, and user-friendly teleoperation systems have been studied for facilitating higher performance and lower operational burden. To investigate the dominant factors in a practical teleoperation system, this study focused on the characteristics of an interface used to operate a robotic arm. The usability of an interface depends on the characteristics of the manipulation tasks to be completed; however, systematic comparisons of different interfaces across different tasks remain limited. In this study, we compared two widely used teleoperation interfaces, a 3D mouse and a VR controller, for two simple yet broadly applicable tasks with a six-degree-of-freedom (6DoF) robotic arm: repetitively pushing buttons and rotating knobs. Participants (N = 23) controlled a robotic arm with 6DoF to push buttons and rotate knobs as many times as possible in 3-minute trials. Each trial was followed by a NASA-TLX workload rating. The results showed a clear connection between the interface and task performance: the VR controller yielded higher performance for pushing buttons, whereas the 3D mouse performed better and was less demanding for knob rotation. These findings highlight the importance of considering dominant motion primitives of the task when designing practical teleoperation interfaces.
Authors:Abhishek Kulkarni, Sharon Lynn Chu
Abstract:
Interest-based learning (IBL) is a paradigm of instruction in which educational content is contextualized using learners' interests to enhance content relevance. IBL has been shown to result in improved learning outcomes. Unfortunately, high effort is needed for instructors to design and deliver IBL content for individual students. LLMs in the form of AI tutors may allow for IBL to scale across many students. Designing an AI tutor for IBL, however, first requires an understanding of how IBL is implemented in teaching scenarios. This paper presents a study that seeks to derive this understanding from an analysis of how human instructors design and deliver IBL content. We studied 14 one-to-one online tutoring sessions (28 participants) in which tutors designed and delivered a lesson tailored to a student's self-identified interest. Using lesson artifacts, tutoring transcripts, interviews, and questionnaires, findings include themes on how tutors integrate interests during instruction and why. Finally, actionable design implications are presented for LLM-powered AI tutors that aim to deliver IBL at scale.
Authors:Iván Arcos, Paolo Rosso, Elena Gomis-Vicent
Abstract:
The automated detection of sexism in memes is a challenging task due to multimodal ambiguity, cultural nuance, and the use of humor to provide plausible deniability. Content-only models often fail to capture the complexity of human perception. To address this limitation, we introduce and validate a human-centered paradigm that augments standard content features with physiological data. We created a novel resource by recording Eye-Tracking (ET), Heart Rate (HR), and Electroencephalography (EEG) from 16 subjects (8 per experiment) while they viewed 3984 memes from the EXIST 2025 dataset. Our statistical analysis reveals significant physiological differences in how subjects process sexist versus non-sexist content. Sexist memes were associated with higher cognitive load, reflected in increased fixation counts and longer reaction times, as well as differences in EEG spectral power across the Alpha, Beta, and Gamma bands, suggesting more demanding neural processing. Building on these findings, we propose a multimodal fusion model that integrates physiological signals with enriched textual-visual features derived from a Vision-Language Model (VLM). Our final model achieves an AUC of 0.794 in binary sexism detection, a statistically significant 3.4% improvement over a strong VLM-based baseline. The fusion is particularly effective for nuanced cases, boosting the F1-score for the most challenging fine-grained category, Misogyny and Non-Sexual Violence, by 26.3%. These results show that physiological responses provide an objective signal of perception that enhances the accuracy and human-awareness of automated systems for countering online sexism.
Authors:Toshikazu Seto, Yohei Shiwaku, Takayuki Miyauchi, Daisuke Yoshida, Yuichiro Nishimura
Abstract:
Large-scale 3D geospatial data visualization has become increasingly critical for the development of the digital society infrastructure in Japan. This study conducted a comprehensive performance evaluation of two major WebGL-based web mapping libraries, CesiumJS and MapLibre GL JS, using large-scale 3D point-cloud data from the VIRTUAL SHIZUOKA and PLATEAU building models. The research employs standardized 3D Tiles 1.1, and Mapbox Vector Tiles (MVT) formats, comparing performance across different data scales (2nd and 3rd grid levels) using Core Web Vitals metrics, including First Contentful Paint (FCP), Largest Contentful Paint (LCP), Speed Index, Total Blocking Time (TBT), and Cumulative Layout Shift (CLS). The results demonstrate that MVT-based building visualization with MapLibre GL JS achieves optimal performance (FCP 0.8s, TBT 0ms), whereas MapLibre GL JS combined with deck.gl shows superior performance for large-scale point cloud processing (TBT: 3ms, CesiumJS: 21,357ms). This study provides data-driven selection guidelines for appropriate technology choices according to use cases, establishing reproducible performance evaluation frameworks for 3D web mapping technologies during the WebGPU and OGC 3D Tiles 1.1 standardization era.
Authors:Svitlana Surodina, Sinem Görücü, Lili Golmohammadi, Emelia Delaney, Rita Borgo
Abstract:
Innovative HealthTech teams develop Artificial Intelligence (AI) systems in contexts where ethical expectations and organizational priorities must be balanced under severe resource constraints. While Responsible AI practices are expected to guide the design and evaluation of such systems, they frequently remain abstract or poorly aligned with the operational realities of early-stage innovation. At the ecosystem level, this misalignment disproportionately affects disadvantaged projects and founders, therefore limiting the diversity of problem-areas under consideration, solutions, stakeholder perspectives, and population datasets represented in AI-enabled healthcare systems. Visualization provides a practical mechanism for supporting decision-making across the AI lifecycle. When developed via a rigorous and collaborative design process, structured on domain knowledge and designed around real-world constraints, visual interfaces can operate as effective sociotechnical governance artifacts enabling responsible decision-making. Grounded in innovation-oriented Human-Centered Computing methodologies, we synthesize insights from a series of design studies conducted via a longitudinal visualization research program, a case study centered on governance dashboard design in a translational setting, and a survey of a cohort of early-stage HealthTech startups. Based on these findings, we articulate design process implications for governance-oriented visualization systems: co-creation with stakeholders, alignment with organizational maturity and context, and support for heterogeneous roles and tasks among others. This work contributes actionable guidance for designing Responsible AI governance dashboards that support decision-making and accountability in early-stage health innovation, and suggests that ecosystem-level coordination can enable more scalable and diverse AI innovation in healthcare.
Authors:Gauri Umesh Rajmane, Ziming Li, Tae Oh, Roshan Peiris
Abstract:
This study explores integrating sign language into virtual reality (VR) by examining the comprehensibility and user experience of viewing American Sign Language (ASL) videos captured with body-mounted 360-degree cameras. Ten participants identified ASL signs from videos recorded at three body-mounted positions: head, shoulder, and chest. Results showed the shoulder-mounted camera achieved the highest accuracy (85%), though differences between positions were not statistically significant. Participants noted that peripheral distortion in 360-degree videos impacted clarity, highlighting areas for improvement. Despite challenges, the overall comprehension success rate of 83.3% demonstrates the potential of video-based ASL communication in VR. Feedback emphasized the need to refine camera angles, reduce distortion, and explore alternative mounting positions. Participants expressed a preference for signing over text-based communication in VR, highlighting the importance of developing this approach to enhance accessibility and collaboration for Deaf and Hard of Hearing (DHH) users in virtual environments.
Authors:Jeremy Wertheim Co Chen, Rendell Christian Ngo, Cedric Matthew Yu, Hans Emilio Lumagui, Ethan Badayos, Jordan Aiko Deja
Abstract:
Extended reality (XR) enables new music-mixing workflows by moving beyond 2D faders toward embodied, spatial interaction. However, it remains unclear which six-degree-of-freedom (6DoF) gestures align with real-world mixing practices and whether such interactions support manageable cognitive load and positive user experience. We conducted a design workshop with experienced mixers to elicit gesture concepts for core audio tasks gain, compression, equalization, and automation, and implemented these in an XR prototype. A user study (n=12) evaluated the ecological validity of the gestures using cognitive load measures, user-experience ratings, and interviews. Participants generally found 6DoF gestures intuitive and well-mapped to mixing tasks, reporting strong immersion and a sense of connection with the audio environment. Cognitive load differences across gestures were minimal, though participants expressed preferences shaped by workflow familiarity and perceived control. We discuss implications for designing XR mixing tools that balance expressiveness, precision, and ecological validity.
Authors:Ruiqi Zhou, Donghao Zhu, Houcai Shen
Abstract:
In matching markets such as kidney exchanges and freight exchanges, delayed matching has been shown to improve overall market efficiency. The benefits of delay are highly sensitive to participants' sojourn times and departure behavior, and delaying matches can impose significant costs, including longer waiting times and increased market congestion. These competing effects make fixed matching policies inherently inflexible in dynamic environments. We propose a learning-based Hybrid framework that adaptively combines immediate and delayed matching. The framework continuously collects data on user departures over time, estimates the underlying departure distribution via regression, and determines whether to delay matching in the subsequent period based on a decision threshold that governs the system's tolerance for matching efficiency loss. The proposed framework can substantially reduce waiting times and congestion while sacrificing only a limited amount of matching efficiency. By dynamically adjusting its matching strategy, the Hybrid framework enables system performance to flexibly interpolate between purely greedy and purely patient policies, offering a robust and adaptive alternative to static matching mechanisms.
Authors:Abhishek Kulkarni, Alexander Barquero, Pavitra Lahari, Aryaan Shaikh, Sarah Brown
Abstract:
With the advent of generative AI and large language models, embodied conversational agents are becoming synonymous with online interactions. These agents possess vast amounts of knowledge but suffer from exhibiting limited emotional expressiveness. Without adequate expressions, agents might fail to adapt to users' emotions, which may result in a sub-optimal user experience and engagement. Most current systems prioritize content based responses, neglecting the emotional context of conversations. Research in this space is currently limited to specific contexts, like mental health. To bridge this gap, our project proposes the implementation of expressive features in a virtual conversational agent which will utilize sentiment analysis and natural language processing to inform the generation of empathetic, expressive responses. The project delivers a functional conversational agent capable of assessing and responding to user emotions accordingly. We posit this will enhance usability, engagement, and the overall quality of conversations and present results from an exploratory pilot study investigating the same.
Authors:Shruthi Andru, Shrut Kirti Saksena
Abstract:
As interfaces evolve from static user pathways to dynamic human-AI collaboration, no standard methods exist for selecting appropriate interface patterns based on user needs and task complexity. Existing frameworks only provide guiding principles for designing AI agent capabilities. We propose a dimensional framework based on workflow complexity, AI autonomy, and AI reasoning to guide the design of context-aware, scalable AI interfaces aka modalities (e.g., prompt bars, split screens, full screens, etc.). The framework was developed through co-design workshops with designers of marketing products and refined through qualitative research with eight long-term AI users. The study evaluated the three dimensions, identified task-to-interface relationships, and surfaced the importance of both business impact and security risk across all high-autonomy scenarios. This framework provides product teams with a shared language to develop scalable AI interfaces, emphasizing fluidity between interfaces and progressive user control to balance AI autonomy with human oversight.
Authors:Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse
Abstract:
The inability of Large Language Models (LLMs) to modulate their personality expression in response to evolving dialogue dynamics hinders their performance in complex, interactive contexts. We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states, where transition probabilities are dynamically adapted to the conversational context. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes while remaining agnostic to the specific personality models, their dimensions, transition mechanisms, or LLMs used. These scores function as dynamic state variables that systematically reconfigure the system prompt, steering behavioral alignment throughout the interaction.We evaluate this framework by operationalizing the Interpersonal Circumplex (IPC) in a medical education setting. Results demonstrate that the system successfully adapts its personality state to user inputs, but also influences user behavior, thereby facilitating de-escalation training. Notably, the scoring pipeline maintains comparable precision even when utilizing lightweight, fine-tuned classifiers instead of large-scale LLMs. This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
Authors:Andrés Rodriguez, Juan Cruz Gardey, Alejandra Garrido
Abstract:
Integrated Development Environments shape developers' daily experience, yet the empirical study of their usability and user experience (UX) remains limited. This work presents an LLM-assisted approach to detecting UX smells in Visual Studio Code by mining and classifying user-reported issues from the GitHub repository. Using a validated taxonomy and expert review, we identified recurring UX problems that affect the developer experience. Our results show that the majority of UX smells are concentrated in informativeness, clarity, intuitiveness, and efficiency, qualities that developers value most.
Authors:Deja Dunlap, R. Thomas McCoy
Abstract:
In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE). In this work, we investigate how accurately large language models (LLMs) represent African American Vernacular English (AAVE). We analyze three LLMs to compare their usage of AAVE to the usage of humans who natively speak AAVE. We first analyzed interviews from the Corpus of Regional African American Language and TwitterAAE to identify the typical contexts where people use AAVE grammatical features such as ain't. We then prompted the LLMs to produce text in AAVE and compared the model-generated text to human usage patterns. We find that, in many cases, there are substantial differences between AAVE usage in LLMs and humans: LLMs usually underuse and misuse grammatical features characteristic of AAVE. Furthermore, through sentiment analysis and manual inspection, we found that the models replicated stereotypes about African Americans. These results highlight the need for more diversity in training data and the incorporation of fairness methods to mitigate the perpetuation of stereotypes.
Authors:Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum
Abstract:
Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
Authors:Takaya Miyama, Satoshi Nakamura, Shota Yamanaka
Abstract:
In crowdsourced user experiments that collect performance data from graphical user interface (GUI) interactions, some participants ignore instructions or act carelessly, threatening the validity of performance models. We investigate a pre-task screening method that requires simple GUI operations analogous to the main task and uses the resulting error as a continuous quality signal. Our pre-task is a brief image-resizing task in which workers match an on-screen card to a physical card; workers whose resizing error exceeds a threshold are excluded from the main experiment. The main task is a standardized pointing experiment with well-established models of movement time and error rate. Across mouse- and smartphone-based crowdsourced experiments, we show that reducing the proportion of workers exhibiting unexpected behavior and tightening the pre-task threshold systematically improve the goodness of fit and predictive accuracy of GUI performance models, demonstrating that brief pre-task screening can enhance data quality.
Authors:Paras Sharma, YuePing Sha, Janet Shufor Bih Epse Fofang, Brayden Yan, Jess A. Turner, Nicole Balay, Hubert O. Asare, Angela E. B. Stewart, Erin Walker
Abstract:
Dialogue systems have long supported learner reflections, with theoretically grounded, rule-based designs offering structured scaffolding but often struggling to respond to shifts in engagement. Large Language Models (LLMs), in contrast, can generate context-sensitive responses but are not informed by decades of research on how learning interactions should be structured, raising questions about their alignment with pedagogical theories. This paper presents a hybrid dialogue system that embeds LLM responsiveness within a theory-aligned, rule-based framework to support learner reflections in a culturally responsive robotics summer camp. The rule-based structure grounds dialogue in self-regulated learning theory, while the LLM decides when and how to prompt deeper reflections, responding to evolving conversation context. We analyze themes across dialogues to explore how our hybrid system shaped learner reflections. Our findings indicate that LLM-embedded dialogues supported richer learner reflections on goals and activities, but also introduced challenges due to repetitiveness and misalignment in prompts, reducing engagement.
Authors:William Seymour, Martin J. Kraemer
Abstract:
Cybersecurity awareness is shaped by a wide range of professional and personal experiences, including information and training at work and the sharing of news and other content at home. In order to explore how people discover cybersecurity content and the effect that participation in workplace training may have on this we present an online study of 1200 participants from the UK, US, France, and Germany. Those undertaking cybersecurity training at work showed reduced intention to share information at home, shifting the focus towards the workplace. They were also more likely to recall cybersecurity information shared by their employer than from any other source, which in turn correlated with content type and distribution channel. We critically reflect on this shift, highlighting opportunities to improve cybersecurity information sharing at work and at home.
Authors:David Fraile Navarro, Mor Peleg
Abstract:
Collecting patient-reported outcome measures (PROMs) is essential for clinical care and research, yet traditional form-based approaches are often tedious for patients and burdensome for clinicians. We developed a generative AI conversational agent(CA) using GPT-5 to collect back pain data according to the NIH Task Force's Recommended Minimal Dataset. Unlike prior CAs that ask questions one-by-one, our CA engages users in topic-based conversations, allowing multiple data items to be captured in a single exchange. Through iterative development and pilot testing with clinicians and a consumer panel, we identified key design principles for health data collection CAs. These principles extend established clinical decision support design guidelines to conversational interfaces, addressing: flexibility of interaction style, personality calibration, data quality assurance through confidence visualization, patient safety constraints, and interoperability requirements. We present our prompt design methodology and discuss challenges encountered, including managing conversation length, handling ambiguous responses, and adapting to LLM version changes. Our design principles provide a practical framework for developers creating conversational agents for patient questionnaire completion. The CA is available at https://chatgpt.com/g/g-68f4869548f48191af0544f110ee91c6-backpain-data-collection-assistant (requires ChatGPT registration and subscription for unlimited use).
Authors:Samuel Bellaire, Abdalmalek Abu-raddaha, Natalie Kim, Nathan Morhan, William Elliott, Samir Rawashdeh
Abstract:
Trust remains a critical barrier to the effective integration of Artificial Intelligence (AI) into human-centric domains. Disembodied agents, such as voice assistants, often fail to establish trust due to their inability to convey non-verbal social cues. This paper introduces the architecture of Botson: an anthropomorphic social robot powered by a large language model (LLM). Botson was created as a low-cost and accessible platform for social robotics research.
Authors:Eun Jeong Kang, Fengyang Lin, Angel Hsing-Chi Hwang
Abstract:
Lightweight fine-tuning techniques and the rise of 'open' AI model marketplaces have enabled individuals to easily build and release generative models. Yet, this accessibility also raises risks, including the production of harmful and infringing content. While platforms offer policies and responsible AI tools, their effectiveness may be limited, as creators engage with partially open models that vary widely in openness and transparency. To understand how platform governance can better support responsible practices, we conducted semi-structured interviews with 19 individual model creators. We identified three regulatory needs shaped by creators' workflows: reducing downstream harms, recognizing creators' contributions and originality, and securing model ownership. Creators also repurpose RAI tools primarily for self-protection and visibility, and their sense of responsibility is deeply shaped by community norms rather than formal policies. We argue that platforms' governance decisions must consider how policy interventions shape the practices and motivations of individual creators.
Authors:Kirk Vanacore, Ryan S. Baker, Avery H. Closser, Jeremy Roschelle
Abstract:
The emergence of generative AI has accelerated the development of conversational tutoring systems that interact with students through natural language dialogue. Unlike prior intelligent tutoring systems (ITS), which largely function as adaptive and interactive problem sets with feedback and hints, conversational tutors hold the potential to simulate high-quality human tutoring by engaging with students' thoughts, questions, and misconceptions in real time. While some previous ITS, such as AutoTutor, could respond conversationally, they were expensive to author and lacked a full range of conversational ability. Generative AI has changed the capacity of ITS to engage conversationally. However, realizing the full potential of conversational tutors requires careful consideration of what research on human tutoring and ITS has already established, while also unpacking what new research will be needed. This paper synthesizes tenets of successful human tutoring, lessons learned from legacy ITS, and emerging work on conversational AI tutors. We use a keep, change, center, study framework for guiding the design of conversational tutoring. We argue that systems should keep proven methods from prior ITS, such as knowledge tracing and affect detection; change how tutoring is delivered by leveraging generative AI for dynamic content generation and dialogic scaffolding; and center opportunities for meaning-making, student agency, and granular diagnosis of reasoning. Finally, we identify areas requiring further study, including efficacy testing, student experience, and integration with human instruction. By synthesizing insights from human tutoring, legacy ITS, and emerging generative AI technologies, this paper outlines a research agenda for developing conversational tutors that are scalable, pedagogically effective, and responsive to the social and motivational dimensions of learning.
Authors:Claire Liang, Franziska Babel, Hannah Pelikan, Sydney Thompson, Xiang Zhi Tan
Abstract:
Many of the challenges encountered in in-the-wild public deployments of robots remain undocumented despite sharing many common pitfalls. This creates a high barrier of entry and results in repetition of avoidable mistakes. To articulate the tacit knowledge in the HRI community, this paper presents a guideline in the form of a checklist to support researchers in preparing for robot deployments in public. Drawing on their own experience with public robot deployments, the research team collected essential topics to consider in public HRI research. These topics are represented as modular flip cards in a hierarchical table, structured into deployment phases and important domains. We interviewed six interdisciplinary researchers with expertise in public HRI and show how including community input refines the checklist. We further show the checklist in action in context of real public studies. Finally, we contribute the checklist as an open-source, customizable community resource that both collects joint expertise for continual evolution and is usable as a list, set of cards, and an interactive web tool.
Authors:Brandon Victor Syiem, Eduardo Velloso
Abstract:
Despite the widespread use of ordinal measures in HCI, such as Likert-items, there is little consensus among HCI researchers on the statistical methods used for analysing such data. Both parametric and non-parametric methods have been extensively used within the discipline, with limited reflection on their assumptions and appropriateness for such analyses. In this paper, we examine recent HCI works that report statistical analyses of ordinal measures. We highlight prevalent methods used, discuss their limitations and spotlight key assumptions and oversights that diminish the insights drawn from these methods. Finally, we champion and detail the use of cumulative link (mixed) models (CLM/CLMM) for analysing ordinal data. Further, we provide practical worked examples of applying CLM/CLMMs using R to published open-sourced datasets. This work contributes towards a better understanding of the statistical methods used to analyse ordinal data in HCI and helps to consolidate practices for future work.
Authors:Tung T. Ngo, Dai Nguyen Van, Anh-Minh Nguyen, Phuong-Anh Do, Anh Nguyen-Quoc
Abstract:
Qualitative data analysis is labor-intensive, yet the privacy risks associated with commercial Large Language Models (LLMs) often preclude their use in sensitive research. To address this, we introduce ChatQDA, an on-device framework powered by open-source LLMs designed for privacy-preserving open coding. Our mixed-methods user study reveals that while participants rated the system highly for usability and perceived efficiency, they exhibited "conditional trust", valuing the tool for surface-level extraction while questioning its interpretive nuance and consistency. Furthermore, despite the technical security of local deployment, participants reported epistemic uncertainty regarding data protection, suggesting that invisible security measures are insufficient to foster trust. We conclude with design recommendations for local-first analysis tools that prioritize verifiable privacy and methodological rigor.
Authors:Nam Hee Kim, Jingjing May Liu, Jaakko Lehtinen, Perttu Hämäläinen, James F. O'Brien, Xue Bin Peng
Abstract:
We present the first motion generation system for playtesting virtual reality (VR) games. Our player model generates VR headset and handheld controller movements from in-game object arrangements, guided by style exemplars and aligned to maximize simulated gameplay score. We train on the large BOXRR-23 dataset and apply our framework on the popular VR game Beat Saber. The resulting model Robo-Saber produces skilled gameplay and captures diverse player behaviors, mirroring the skill levels and movement patterns specified by input style exemplars. Robo-Saber demonstrates promise in synthesizing rich gameplay data for predictive applications and enabling a physics-based whole-body VR playtesting agent.
Authors:Kaori Ikematsu, Kunihiro Kato
Abstract:
DuoTouch is a passive attachment for capacitive touch panels that adds tangible input while minimizing content occlusion and loss of input area. It uses two contact footprints and two traces to encode motion as binary sequences and runs on unmodified devices through standard touch APIs. We present two configurations with paired decoders: an aligned configuration that maps fixed-length codes to discrete commands and a phase-shifted configuration that estimates direction and distance from relative timing. To characterize the system's reliability, we derive a sampling-limited bound that links actuation speed, internal trace width, and device touch sampling rate. Through technical evaluations on a smartphone and a touchpad, we report performance metrics that describe the relationship between these parameters and decoding accuracy. Finally, we demonstrate the versatility of DuoTouch by embedding the mechanism into various form factors, including a hand strap, a phone ring holder, and touchpad add-ons.
Authors:Nathan G. Wood, Scott Robbins, Eduardo Zegarra Berodt, Anton Graf von Westerholt, Michelle Behrndt, Hauke Budig, Daniel Kloock-Schreiber
Abstract:
Across academia, industry, and government, ``AI'' has become central in research and development, regulatory debates, and promises of ever faster and more capable decision-making and action. In numerous domains, especially safety-critical ones, there are significant concerns over how ``AI'' may affect decision-making, responsibility, or the likelihood of mistakes (to name only a few categories of critique). However, for most critiques, the target is generally ``AI'', a broad term admitting many (types of) systems used for a variety of tasks and each coming with its own set of limitations, challenges, and potential use cases. In this article, we focus on the military domain as a case study and present both a loose enumerative taxonomy of systems captured under the umbrella term ``military AI'', as well as discussion of the challenges of each. In doing so, we highlight that critiques of one (type of) system will not always transfer to other (types of) systems. Building on this, we argue that in order for debates to move forward fruitfully, it is imperative that the discussions be made more precise and that ``AI'' be excised from debates to the extent possible. Researchers, developers, and policy-makers should make clear exactly what systems they have in mind and what possible benefits and risks attend the deployment of those particular systems. While we focus on AI in the military as an exemplar for the overall trends in discussions of ``AI'', the argument's conclusions are broad and have import for discussions of AI across a host of domains.
Authors:Abdulhadi Shoufan, Ahmad-Azmi-Abdelhamid Esmaeil
Abstract:
As students increasingly rely on large language models, hallucinations pose a growing threat to learning. To mitigate this, AI literacy must expand beyond prompt engineering to address how students should detect and respond to LLM hallucinations. To support this, we need to understand how students experience hallucinations, how they detect them, and why they believe they occur. To investigate these questions, we asked university students three open-ended questions about their experiences with AI hallucinations, their detection strategies, and their mental models of why hallucinations occur. Sixty-three students responded to the survey. Thematic analysis of their responses revealed that reported hallucination issues primarily relate to incorrect or fabricated citations, false information, overconfident but misleading responses, poor adherence to prompts, persistence in incorrect answers, and sycophancy. To detect hallucinations, students rely either on intuitive judgment or on active verification strategies, such as cross-checking with external sources or re-prompting the model. Students' explanations for why hallucinations occur reflected several mental models, including notable misconceptions. Many described AI as a research engine that fabricates information when it cannot locate an answer in its "database." Others attributed hallucinations to issues with training data, inadequate prompting, or the model's inability to understand or verify information. These findings illuminate vulnerabilities in AI-supported learning and highlight the need for explicit instruction in verification protocols, accurate mental models of generative AI, and awareness of behaviors such as sycophancy and confident delivery that obscure inaccuracy. The study contributes empirical evidence for integrating hallucination awareness and mitigation into AI literacy curricula.
Authors:Jiangtao Gong, Xiao Wen, Fengyi Tao, Xinqi Wang, Xixi Yang, Yangrong Tang
Abstract:
Text-based conversational agents (CAs) are increasingly used in mental health, yet evaluation practices remain fragmented. We conducted a PRISMA-guided systematic review (May-June 2024) across ACM Digital Library, Scopus, and PsycINFO. From 613 records, 132 studies were included, with dual-coder extraction achieving substantial agreement (Cohen's kappa = 0.77-0.92). We synthesized evaluation approaches across three dimensions: metrics, methods, and usage contexts. Metrics were classified into CA-centric attributes (e.g., reliability, safety, empathy) and user-centric outcomes (experience, knowledge, psychological state, health behavior). Methods included automated analyses, standardized psychometric scales, and qualitative inquiry. Temporal designs ranged from momentary to follow-up assessments. Findings show reliance on Western-developed scales, limited cultural adaptation, predominance of small and short-term samples, and weak links between automated performance metrics and user well-being. We argue for methodological triangulation, temporal rigor, and equity in measurement. This review offers a structured foundation for reliable, safe, and user-centered evaluation of mental health CAs.
Authors:Ömer Elri, Serkan Savaş
Abstract:
Manual notes and scattered messaging applications used in managing business processes compromise data integrity and abstract project tracking. In this study, an integrated system that works simultaneously on web and mobile platforms has been developed to enable individual users and teams to manage their workflows with concrete data. The system architecture integrates MongoDB, which stores data in JSON format, Node.js Express.js on the server side, React.js on the web interface, and React Native technologies on the mobile side. The system interface is designed around visual dashboards that track the status of tasks (To Do-In Progress-Done). The urgency of tasks is distinguished by color-coded labels, and dynamic graphics (Dashboard) have been created for managers to monitor team performance. The usability of the system was tested with a heterogeneous group of 10 people consisting of engineers, engineering students, public employees, branch managers, and healthcare personnel. In analyses conducted using a 5-point Likert scale, the organizational efficiency provided by the system compared to traditional methods was rated 4.90, while the visual dashboards achieved a perfect score of 5.00 with zero variance. Additionally, the ease of interface use was rated 4.65, and overall user satisfaction was calculated as 4.60. The findings show that the developed system simplifies complex work processes and provides a traceable digital working environment for Small and Medium-sized Enterprises and project teams.
Authors:Yanni Mei, Samuel Wendt, Florian Mueller, Jan Gugenheimer
Abstract:
Augmented Reality (AR) can simulate various visual perceptions, such as how individuals with colorblindness see the world. However, these simulations require developers to predefine each visual effect, limiting flexibility. We present ShadAR, an AR application enabling real-time transformation of visual perception through shader generation using large language models (LLMs). ShadAR allows users to express their visual intent via natural language, which is interpreted by an LLM to generate corresponding shader code. This shader is then compiled real-time to modify the AR headset viewport. We present our LLM-driven shader generation pipeline and demonstrate its ability to transform visual perception for inclusiveness and creativity.
Authors:Rui Yao, Qiuyuan Ren, Felicia Fang-Yi Tan, Chen Yang, Xiaoyu Zhang, Shengdong Zhao
Abstract:
LLM-assisted writing has seen rapid adoption in interpersonal communication, yet current systems often fail to capture the subtle tones essential for effectiveness. Email writing exemplifies this challenge: effective messages require careful alignment with intent, relationship, and context beyond mere fluency. Through formative studies, we identified three key challenges: articulating nuanced communicative intent, making modifications at multiple levels of granularity, and reusing effective tone strategies across messages. We developed PersonaMail, a system that addresses these gaps through structured communication factor exploration, granular editing controls, and adaptive reuse of successful strategies. Our evaluation compared PersonaMail against standard LLM interfaces, and showed improved efficiency in both immediate and repeated use, alongside higher user satisfaction. We contribute design implications for AI-assisted communication systems that prioritize interpersonal nuance over generic text generation.
Authors:Mengjie Tang, Xinman Li, Juxiao Zhang, Franklin Mingzhe Li, Zhuying Li
Abstract:
Nature plays a crucial role in human health and well-being, but little is known about how blind people experience and relate to it. We conducted a survey of nature relatedness with blind (N=20) and sighted (N=20) participants, along with in-depth interviews with 16 blind participants, to examine how blind people engage with nature and the factors shaping this engagement. Our survey results revealed lower levels of nature relatedness among blind participants compared to sighted peers. Our interview study further highlighted: 1) current practices and challenges of nature engagement, 2) attitudes and values that shape engagement, and 3) expectations for assistive technologies that support safe and meaningful engagement. We also provide design implications to guide future technologies that support nature engagement for blind people. Overall, our findings illustrate how blind people experience nature beyond vision and lay a foundation for technologies that support inclusive nature engagement.
Authors:Celeste Seah, Yoke Chuan Lee, Jung-Joo Lee, Ching-Chiuan Yen, Clement Zheng
Abstract:
Reminiscence therapy (RT) is a common non-pharmacological intervention in dementia care. Recent technology-mediated interventions have largely focused on people with dementia through solutions that replace human facilitators with conversational agents. However, the relational work of facilitation is critical in the effectiveness of RT. Hence, we developed Rememo, a therapist-oriented tool that integrates Generative AI to support and enrich human facilitation in RT. Our tool aims to support the infrastructural and cultural challenges that therapists in Singapore face. In this research, we contribute the Rememo system as a therapist's tool for personalized RT developed through sociotechnically-aware research-through-design. Through studying this system in-situ, our research extends our understanding of human-AI collaboration for care work. We discuss the implications of designing AI-enabled systems that respect the relational dynamics in care contexts, and argue for a rethinking of synthetic imagery as a therapeutic support for memory rahter than a record of truth.
Authors:Yasmin Kafai, José Ramón Lizárraga, R. Benjamin Shapiro
Abstract:
In response to the exponential growth in the use of artificial intelligence and machine learning applications, educators, researchers and policymakers have taken steps to integrate artificial intelligence applications into K-12 education. Among these efforts, one equally important approach has received little, if any attention: What if students and teachers were not just learning to be competent users of AI but also its creators? This question is at the heart of CreateAI in which K12 educators, researchers, and learning scientists addressed the following questions: (1) What tools, skills, and knowledge will empower students and teachers to build their own AI/ML applications? (2) How can we integrate these approaches into classrooms? and (3) What new possibilities for learning emerge when students and teachers become innovators and creators? In the report we provide recommendations for what tools designed for creating AI/ML applications should address in terms of design features, and learner progression in investigations. To promote effective learning and teaching of creating AI applications, we also need to help students and teachers select appropriate tools. We outline how we need to develop a better understanding of learning practices and funds of knowledge to support youth as they create and evaluate AI/ML applications. This also includes engaging youth in learning about ethics and critically that is authentic, empowering, and relevant throughout the design process. Here we advocate for the integration of ethics in the curriculum. We also address what teachers need to know and how assessments can help establish baselines, include different instruments, and promote students as responsible creators of AI. Together, these recommendations provide important insights for preparing students to engage thoughtfully and critically with these technologies.
Authors:Yasmin Kafai, Shuchi Grover
Abstract:
The introduction of generative artificial intelligence applications to the public has led to heated discussions about its potential impacts and risks for K-12 education. One particular challenge has been to decide what students should learn about AI, and how this relates to computational thinking, which has served as an umbrella for promoting and introducing computing education in schools. In this paper, we situate in which ways we should expand computational thinking to include artificial intelligence and machine learning technologies. Furthermore, we discuss how these efforts can be informed by lessons learned from the last decade in designing instructional programs, integrating computing with other subjects, and addressing issues of algorithmic bias and justice in teaching computing in schools.
Authors:Inha Cha, Yeonju Jang, Haesoo Kim, Joo Young Park, Seora Park, EunJeong Cheon
Abstract:
Ever since the introduction of internet technologies in South Korea, digital sexual violence (DSV) has been a persistent and pervasive problem. Evolving alongside digital technologies, the severity and scale of violence have grown consistently, leading to widespread public concern. In this paper, we present four eras of image-based DSV in South Korea, spanning from the early internet era of the 1990s to the deepfake scandals in the mid-2020s. Drawing from media coverage, legal documents, and academic literature, we elucidate forms and characteristics of DSV cases in each era, tracing how entrenched misogyny is reconfigured and amplified through evolving technologies, alongside shifting legislative measures. Taking a genealogical approach to read prominent cases of different eras, our analysis identifies three constitutive and interconnected dimensions of DSV: (1) the homo-social fabrication of "obscenity", wherein victims' imagery becomes collectively framed as obscene through participatory practices in male-dominant networks; (2) the increasing imperceptibility of violence, as technologies foreclose victims' ability to perceive harm; and (3) the commercialization of abuse through decentralized economic infrastructures. We suggest future directions for CSCW research, and further reflect on the value of the genealogical method in enabling non-linear understanding of DSV as dynamically evolving sociotechnical configurations of harm.
Authors:Fabian Walke, Veronika Föller
Abstract:
This study investigates generative artificial intelligence (GenAI) usage of university students who study alongside their professional career. Previous literature has paid little attention to part-time students and the intersectional use of GenAI between education and business. This study examines with a grounded theory approach the characteristics of GenAI usage of part-time students. Eleven students from a distance learning university were interviewed. Three causal and four intervening conditions, as well as strategies were identified, to influence the use of GenAI. The study highlights both the potential and challenges of GenAI usage in education and business. While GenAI can significantly enhance productivity and learning outcomes, concerns about ethical implications, reliability, and the risk of academic misconduct persist. The developed grounded model offers a comprehensive understanding of GenAI usage among students, providing valuable insights for educators, policymakers, and developers of GenAI tools seeking to bridge the gap between education and business.
Authors:Arianna Rossi, Simon Parkin
Abstract:
Although deceptive design patterns are subject to growing regulatory oversight, enforcement races to keep up with the scale of the problem. One promising solution is automated detection tools, many of which are developed within academia. We interviewed nine experienced practitioners working within or alongside regulatory bodies to understand their work against deceptive design patterns, including the use of supporting tools and the prospect of automation. Computing technologies have their place in regulatory practice, but not as envisioned in research. For example, investigations require utmost transparency and accountability in all the activities we identify as accompanying dark pattern detection, which many existing tools cannot provide. Moreover, tools need to map interfaces to legal violations to be of use. We thus recommend conducting user requirement research to maximize research impact, supporting ancillary activities beyond detection, and establishing practical tech adoption pathways that account for the needs of both scientific and regulatory activities.
Authors:Wooyoung Jung, Kahyun Jeon, Prosper Babon-Ayeng
Abstract:
This study aimed to comprehend how user domain knowledge and artificial intelligence (AI) literacy impact the effective use of human-AI interactive building energy management system (BEMS). While prior studies have investigated the potential of integrating large language models (LLMs) into BEMS or building energy modeling, very few studies have examined how user interact with such systems. We conducted a systematic role-playing experiment, where 85 human subjects interacted with an advanced generative pre-trained transformer (OpenAI GPT-4o). Participants were tasked with identifying the top five behavioral changes that could reduce home energy use with the GPT model that functioned as an LLM-integrated BEMS. Then, the collected prompt-response data and participant conclusions were analyzed using an analytical framework that hierarchically assessed and scored human-AI interactions and their home energy analysis approaches. Also, participants were classified into four groups based on their self-evaluated domain knowledge of building energy use and AI literacy, and Kruskal-Wallis H tests with post-hoc pairwise comparisons were conducted across 20 quantifiable metrics. Key takeaways include: most participants employed concise prompts (median: 16.2 words) and relied heavily on GPT's analytical capabilities; and notably, only 1 of 20 metrics, appliance identification rate, showed statistically significant group differences (p=0.037), driven by AI literacy rather than domain knowledge, suggesting an equalizing effect of LLMs across expertise levels. This study provides foundational insights into human-AI collaboration dynamics and promising development directions in the context of LLM-integrated BEMS and contributes to realizing human-centric LLM-integrated energy systems.
Authors:Sikao Guo, Edoardo Sarti, Frédéric Cazals
Abstract:
We present a workflow and associated toolkit to automate the creation of graphical user interfaces (GUI) for executables run from command line interfaces (CLI). The workflow consists of three phases, namely (Step 1) the plugin design, (Step 2) the formal (platform independent) specification of the GUI, and (Step 3) the plugin code generation for the targeted platforms. Our architecture is aligned with the Model--View--Presenter (MVP) pattern: steps one and two build the Model and View descriptions, while step three implements the Presenter layer that binds inputs, invokes the CLI, and updates outputs. Once Step one has been (manually) completed, steps two and three are fully automated. The decoupled MVP design and platform-specific generator modules enable reuse of logic, portability across ecosystems, and significant reductions in engineering effort for complex interactive applications. We primarily use our workflow to generate GUI in structural bioinformatics for CLI executables from the Structural Bioinformatics Library (SBL), targeting three platforms, namely VMD, Pymol and Web servers. The workflow can be used as a guideline, while its implementation available in the package Plugin_manager from the SBL, see https://sbl.inria.fr/doc/Plugin_manager-user-manual.html.
Authors:Pedro Reynolds-Cuéllar, Marisol Wong-Villacres, Adriana Alvarado Garcia, Heila Precel
Abstract:
Dataset documentation is widely recognized as essential for the responsible development of automated systems. Despite growing efforts to support documentation through different kinds of artifacts, little is known about the motivations shaping documentation tool design or the factors hindering their adoption. We present a systematic review supported by mixed-methods analysis of 59 dataset documentation publications to examine the motivations behind building documentation tools, how authors conceptualize documentation practices, and how these tools connect to existing systems, regulations, and cultural norms. Our analysis shows four persistent patterns in dataset documentation conceptualization that potentially impede adoption and standardization: unclear operationalizations of documentation's value, decontextualized designs, unaddressed labor demands, and a tendency to treat integration as future work. Building on these findings, we propose a shift in Responsible AI tool design toward institutional rather than individual solutions, and outline actions the HCI community can take to enable sustainable documentation practices.
Authors:Paolo Bottoni, Susanna Cifani, Kamen Kanev, Daniel Moraru, Atsushi Nakamura, Marco Raoul Marini
Abstract:
Virtual reality (VR) glove technology is increasingly important for professional training, industrial applications, and teleoperation in hazardous environments, since it enables more natural and immersive interactions than controllers. However, current solutions face a trade-off: high-precision gloves lack haptic feedback, while haptic gloves suffer from poor accuracy. Existing studies have mainly focused on developing new glove prototypes or optimizing only one type of glove, without addressing the integration of both features. Our work presents a novel hybrid approach that combines a high-precision glove with a haptic glove, creating a system that delivers both precision and haptics.
Authors:Atharva S Kashyap, Ugne Aleksandra Morkute, Patricia Alves-Oliveira
Abstract:
Robot-assisted feeding enables people with disabilities who require assistance eating to enjoy a meal independently and with dignity. However, existing systems have only been tested in-lab or in-home, leaving in-the-wild social dining contexts (e.g., restaurants) largely unexplored. Designing a robot for such contexts presents unique challenges, such as dynamic and unsupervised dining environments that a robot needs to account for and respond to. Through speculative participatory design with people with disabilities, supported by semi-structured interviews and a custom AI-based visual storyboarding tool, we uncovered ideal scenarios for in-the-wild social dining. Our key insight suggests that such systems should: embody the principles of a white glove service where the robot (1) supports multimodal inputs and unobtrusive outputs; (2) has contextually sensitive social behavior and prioritizes the user; (3) has expanded roles beyond feeding; (4) adapts to other relationships at the dining table. Our work has implications for in-the-wild and group contexts of robot-assisted feeding.
Authors:Belén Martín-Urcelay, Yoonsang Lee, Matthieu R. Bloch, Christopher J. Rozell
Abstract:
Integrating human expertise into machine learning systems often reduces the role of experts to labeling oracles, a paradigm that limits the amount of information exchanged and fails to capture the nuances of human judgment. We address this challenge by developing a human-in-the-loop framework to learn binary classifiers with rich query types, consisting of item ranking and exemplar selection. We first introduce probabilistic human response models for these rich queries motivated by the relationship experimentally observed between the perceived implicit score of an item and its distance to the unknown classifier. Using these models, we then design active learning algorithms that leverage the rich queries to increase the information gained per interaction. We provide theoretical bounds on sample complexity and develop a tractable and computationally efficient variational approximation. Through experiments with simulated annotators derived from crowdsourced word-sentiment and image-aesthetic datasets, we demonstrate significant reductions on sample complexity. We further extend active learning strategies to select queries that maximize information rate, explicitly balancing informational value against annotation cost. This algorithm in the word sentiment classification task reduces learning time by more than 57\% compared to traditional label-only active learning.
Authors:Ning Wang, Chen Liang
Abstract:
As artificial intelligence (AI) increasingly integrates into crowdfunding practices, strategic disclosure of AI involvement has become critical. Yet, empirical insights into how different disclosure strategies influence investor decisions remain limited. Drawing on signaling theory and Aristotle's rhetorical framework, we examine how mandatory AI disclosure affects crowdfunding performance and how substantive signals (degree of AI involvement) and rhetorical signals (logos/explicitness, ethos/authenticity, pathos/emotional tone) moderate these effects. Leveraging Kickstarter's mandatory AI disclosure policy as a natural experiment and four supplementary online experiments, we find that mandatory AI disclosure significantly reduces crowdfunding performance: funds raised decline by 39.8% and backer counts by 23.9% for AI-involved projects. However, this adverse effect is systematically moderated by disclosure strategy. Greater AI involvement amplifies the negative effects of AI disclosure, while high authenticity and high explicitness mitigate them. Interestingly, excessive positive emotional tone (a strategy creators might intuitively adopt to counteract AI skepticism) backfires and exacerbates negative outcomes. Supplementary randomized experiments identify two underlying mechanisms: perceived creator competence and AI washing concerns. Substantive signals primarily affect competence judgments, whereas rhetorical signals operate through varied pathways: either mediator alone or both in sequence. These findings provide theoretical and practical insights for entrepreneurs, platforms, and policymakers strategically managing AI transparency in high-stakes investment contexts.
Authors:Rodrigo Gutierrez Maquilon, Marita Hueber, Georg Regal, Manfred Tscheligi
Abstract:
Large language models (LLMs) are increasingly used in emergency first response (EFR) applications to support situational awareness (SA) and decision-making, yet most operate on text or 2D imagery and offer little support for core EFR SA competencies like spatial reasoning. We address this gap by evaluating a prototype that fuses robot-mounted depth sensing and YOLO detection with a vision language model (VLM) capable of verbalizing metrically-grounded distances of detected objects (e.g., the chair is 3.02 meters away). In a mixed-reality toxic-smoke scenario, participants estimated distances to a victim and an exit window under three conditions: video-only, depth-agnostic VLM, and depth-augmented VLM. Depth-augmentation improved objective accuracy and stability, e.g., the victim and window distance estimation error dropped, while raising situational awareness without increasing workload. Conversely, depth- agnostic assistance increased workload and slightly worsened accuracy. We contribute to human SA augmentation by demonstrating that metrically grounded, object-centric verbal information supports spatial reasoning in EFR and improves decision-relevant judgments under time pressure.
Authors:Hongming Li, Salah Esmaeiligoujar, Nazanin Adham, Hai Li, Rui Huang
Abstract:
Effective study strategies fail when preparatory tasks consume learning time. While AI educational tools demonstrate efficacy, understanding how they align with self-regulation needs in authentic study contexts remains limited. We conducted formative design research using an AI flashcard prototype, employing large language models to generate design hypotheses, which were validated through researcher walkthroughs and student sessions. Six students across disciplines completed sessions combining interviews and think-aloud tasks with their materials. Analysis revealed that students value automation for addressing the overwhelming preparation burden, yet require transparent, editable AI outputs to maintain cognitive ownership, which is essential for self-regulation. They conceptualized AI as a collaborative partner demanding verifiable reasoning rather than an autonomous agent. Metacognitive scaffolding was endorsed when clarifying study direction without constraining choice. Motivational features produced divergent responses. We derive design principles prioritizing editability and transparency, scaffolding metacognition without prescription, and accommodating motivational diversity. Findings identify conditions under which automation supports versus undermines metacognitive development in self-regulated learning.
Authors:Rafael M. Batista, Thomas L. Griffiths
Abstract:
People increasingly use large language models (LLMs) to explore ideas, gather information, and make sense of the world. In these interactions, they encounter agents that are overly agreeable. We argue that this sycophancy poses a unique epistemic risk to how individuals come to see the world: unlike hallucinations that introduce falsehoods, sycophancy distorts reality by returning responses that are biased to reinforce existing beliefs. We provide a rational analysis of this phenomenon, showing that when a Bayesian agent is provided with data that are sampled based on a current hypothesis the agent becomes increasingly confident about that hypothesis but does not make any progress towards the truth. We test this prediction using a modified Wason 2-4-6 rule discovery task where participants (N=557) interacted with AI agents providing different types of feedback. Unmodified LLM behavior suppressed discovery and inflated confidence comparably to explicitly sycophantic prompting. By contrast, unbiased sampling from the true distribution yielded discovery rates five times higher. These results reveal how sycophantic AI distorts belief, manufacturing certainty where there should be doubt.
Authors:Gengchen Cao, Tianke He, Yixuan Liu, RAY LC
Abstract:
The popularization of social media has led to increasing consumption of narrative content in byte-sized formats. Such micro-dramas contain fast-pace action and emotional cliffs, particularly attractive to emerging Chinese markets in platforms like Douyin and Kuaishou. Content writers for micro-dramas must adapt to fast-pace, audience-directed workflows, but previous research has focused instead on examining writers'experiences of platform affordances or their perceptions of platform bias, rather than the step-by-step processes through which they actually write and iterative content. In 28 semi-structured interviews with scriptwriters and writers specialized in micro-dramas, we found that the short-turn-around workflow leads to writers taking on multiple roles simultaneously, iteratively adapting to storylines in response to real-time audience feedback in the form of comments, reposts, and memes. We identified unique narrative styles such as AI-generated micro-dramas and audience-responsive micro-dramas. This work reveals audience interaction as a new paradigm for collaborative creative processes on social media.
Authors:Shiping Chen, Shu Zhong, Duncan P. Brumby, Anna L. Cox
Abstract:
AI is reshaping academic research, yet its role in peer review remains polarising and contentious. Advocates see its potential to reduce reviewer burden and improve quality, while critics warn of risks to fairness, accountability, and trust. At ICLR 2025, an official AI feedback tool was deployed to provide reviewers with post-review suggestions. We studied this deployment through surveys and interviews, investigating how reviewers engaged with the tool and perceived its usability and impact. Our findings surface both opportunities and tensions when AI augments in peer review. This work contributes the first empirical evidence of such an AI tool in a live review process, documenting how reviewers respond to AI-generated feedback in a high-stakes review context. We further offer design implications for AI-assisted reviewing that aim to enhance quality while safeguarding human expertise, agency, and responsibility.
Authors:Tarek Rahman, Md Shaharia Hossen, Mark Protik Mondol, Jannatun Noor Mukta
Abstract:
As Artificial Intelligence (AI) becomes increasingly integrated into education, university students preparing for English language tests are frequently shifting between traditional search engines like Google and large language models (LLMs) to assist with problem-solving. This study explores students perceptions of these tools, particularly in terms of usability, efficiency, and how they fit into English test preparation practices. Using a mixed-methods design, we collected survey data from 140 university students across various academic fields and conducted in-depth interviews with 20 participants. Quantitative analyses, including ANOVA and chi-square tests, were applied to assess differences in perceived efficiency, satisfaction, and overall tool preference. The qualitative results reveal that students strategically alternate between GPT and Google based on task requirements. Google is primarily used for accessing reliable, multi-source information and verifying rules, whereas GPT is favored for summarizing content, providing explanations, paraphrasing, and drafting responses for English test tasks. Since neither tool independently satisfies all aspects of English language test preparation, students expressed a clear preference for an integrated approach. In response, this study proposes a prototype chatbot embedded within a search interface, combining GPTs interactive capabilities with Googles credibility to enhance test preparation and reduce cognitive load.
Authors:Manuele Reani, Xiangyang He, Zuolan Bao
Abstract:
Anthropomorphic design is routinely used to make conversational agents more approachable and engaging. Yet its influence on users' perceptions remains poorly understood. Drawing on psychological theories, we propose that anthropomorphism influences risk perception via two complementary forms of trust, and that domain knowledge moderates these relationships. To test our model, we conducted a large-scale online experiment (N = 1,256) on a financial decision-support system implementing different anthropomorphic designs. We found that anthropomorphism indirectly reduces risk perception by increasing both cognitive and affective trust. Domain knowledge moderates these paths: participants with low financial knowledge experience a negative indirect effect of perceived anthropomorphism on risk perception via cognitive trust, whereas those with high financial knowledge exhibit a positive direct and indirect effect. We discuss theoretical contributions to human-AI interaction and design implications for calibrating trust in anthropomorphic decision-support systems for responsible AI.
Authors:Belu Ticona, Amna Liaqat, Antonios Anastasopoulos
Abstract:
Pilot studies (PS) are ubiquitous in HCI research. CHI papers routinely reference 'pilot studies', 'pilot tests', or 'preliminary studies' to justify design decisions, verify procedures, or motivate methodological choices. Yet despite their frequency, the role of pilot studies in HCI remains conceptually vague and empirically underexamined. Unlike fields such as medicine, nursing, and education, where pilot and feasibility studies have well-established definitions, guidelines, reporting standards and even a dedicated research journal, the CHI community lacks a shared understanding of what constitutes a pilot study, why they are conducted, and how they should be reported. Many papers reference pilots 'in passing', without details about design, outcomes, or how the pilot informed the main study. This variability suggests a methodological blind spot in our community.
Authors:Phyllis Nabangi, Abdul-Jalil Zakaria, Jema David Ndibwile
Abstract:
The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset's small size and imbalance limit our findings' generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.
Authors:Ching-Yi Tsai, Nicole Tacconi, Andrew D. Wilson, Parastoo Abtahi
Abstract:
Target disambiguation is crucial in resolving input ambiguity in augmented reality (AR), especially for queries over distant objects or cluttered scenes on the go. Yet, visual feedforward techniques that support this process remain underexplored. We present Uncertain Pointer, a systematic exploration of feedforward visualizations that annotate multiple candidate targets before user confirmation, either by adding distinct visual identities (e.g., colors) to support disambiguation or by modulating visual intensity (e.g., opacity) to convey system uncertainty. First, we construct a pointer space of 25 pointers by analyzing existing placement strategies and visual signifiers used in target visualizations across 30 years of relevant literature. We then evaluate them through two online experiments (n = 60 and 40), measuring user preference, confidence, mental ease, target visibility, and identifiability across varying object distances and sparsities. Finally, from the results, we derive design recommendations in choosing different Uncertain Pointers based on AR context and disambiguation techniques.
Authors:Gaston Besanson, Federico Todeschini
Abstract:
We study how people trade off accuracy when using AI-powered tools in professional versus personal contexts for adoption purposes, the determinants of those trade-offs, and how users cope when AI/apps are unavailable. Because modern AI systems (especially generative models) can produce acceptable but non-identical outputs, we define "accuracy" as context-specific reliability: the degree to which an output aligns with the user's intent within a tolerance threshold that depends on stakes and the cost of correction. In an online survey (N=300), among respondents with both accuracy items (N=170), the share requiring high accuracy (top-box) is 24.1% at work vs. 8.8% in personal life (+15.3 pp; z=6.29, p<0.001). The gap remains large under a broader top-two-box definition (67.0% vs. 32.9%) and on the full 1-5 ordinal scale (mean 3.86 vs. 3.08). Heavy app use and experience patterns correlate with stricter work standards (H2). When tools are unavailable (H3), respondents report more disruption in personal routines than at work (34.1% vs. 15.3%, p<0.01). We keep the main text focused on these substantive results and place test taxonomy and power derivations in a technical appendix.
Authors:Prabhav Bhatnagar, Jianheng He, Shamit Ahmed, Andrés Lucero, Perttu Hämäläinen
Abstract:
There is a growing interest in researching game design processes, artifacts and culture through active game design. Tools and processes to support these attempts are limited, especially in terms of a) capturing smaller design decisions where rich tacit information is often situated, and b) visually tracking the project's growth and evolution. To address this gap, we present Reflection at Design Actualization (RDA), an open source tool and process for collecting granular reflections at playtesting moments and automatically recording the playtests, bringing reflection and data collection closer to the point where design decisions concretize. Three researchers engaged with and evaluated RDA in three varied game development projects, adhering to the principles of autobiographical design. We illustrate the designer experience with RDA through three themes, namely, designer-routine compromise, designer-researcher persona consolidation, and mirror effect of RDA. We further discuss the tool's challenges and share each designer's personal experience as case studies.
Authors:Mahdi Haghighat Joo, Maryam Karimi Jafari, Alireza Taheri
Abstract:
This paper presents PISHYAR, a socially intelligent smart cane designed by our group to combine socially aware navigation with multimodal human-AI interaction to support both physical mobility and interactive assistance. The system consists of two components: (1) a social navigation framework implemented on a Raspberry Pi 5 that integrates real-time RGB-D perception using an OAK-D Lite camera, YOLOv8-based object detection, COMPOSER-based collective activity recognition, D* Lite dynamic path planning, and haptic feedback via vibration motors for tasks such as locating a vacant seat; and (2) an agentic multimodal LLM-VLM interaction framework that integrates speech recognition, vision language models, large language models, and text-to-speech, with dynamic routing between voice-only and vision-only modes to enable natural voice-based communication, scene description, and object localization from visual input. The system is evaluated through a combination of simulation-based tests, real-world field experiments, and user-centered studies. Results from simulated and real indoor environments demonstrate reliable obstacle avoidance and socially compliant navigation, achieving an overall system accuracy of approximately 80% under different social conditions. Group activity recognition further shows robust performance across diverse crowd scenarios. In addition, a preliminary exploratory user study with eight visually impaired and low-vision participants evaluates the agentic interaction framework through structured tasks and a UTAUT-based questionnaire reveals high acceptance and positive perceptions of usability, trust, and perceived sociability during our experiments. The results highlight the potential of PISHYAR as a multimodal assistive mobility aid that extends beyond navigation to provide socially interactive support for such users.
Authors:Raffaele Ciriello, Uri Gal, Ofir Turel
Abstract:
Artificial intelligence (AI) companions are increasingly promoted as solutions for loneliness, often overlooking how personal dispositions and life-stage conditions shape artificial intimacy. Because intimacy is a primary coping mechanism for loneliness that varies by attachment style and age, we examine how different types of users form intimate relationships with AI companions in response to loneliness. Drawing on a hermeneutic literature review and a survey of 277 active AI companion users, we develop and test a model in which loneliness predicts intimacy, moderated by attachment insecurity and conditioned by age. Although the cross-sectional data limits causal inference, the results reveal a differentiated pattern. Loneliness is paradoxically associated with reduced intimacy for securely attached users but with increased intimacy for avoidant and ambivalent users, while anxious users show mixed effects. Older adults report higher intimacy even at lower loneliness levels. These findings challenge portrayals of AI companions as universal remedies for loneliness. Instead, artificial intimacy emerges as a sociotechnical process shaped by psychological dispositions and demographic conditions. The study clarifies who is most likely to form intimate relationships with AI companions and highlights ethical risks in commercial models that may capitalise on user vulnerability.
Authors:Kaisa Vaananen, Niels van Berkel, Donald McMillan, Thomas Olsson
Abstract:
Blue-collar work is often highly collaborative, embodied, and situated in shared physical environments, yet most research on collaborative AI has focused on white-collar work. This position paper explores how the embodied nature of AI agents can support team collaboration and communication in co-located blue-collar workplaces. From the context of our newly started CAI-BLUE research project, we present two speculative scenarios from industrial and maintenance contexts that illustrate how embodied AI agents can support shared situational awareness and facilitate inclusive communication across experience levels. We outline open questions related to embodied AI agent design around worker inclusion, agency, transformation of blue-collar collaboration practices over time, and forms of acceptable AI embodiments. We argue that embodiment is not just an aesthetic choice but should become a socio-material design strategy of AI systems in blue-collar workplaces.
Authors:Zhidian Lin, Allison Jing, Ziyuan Qu, Fabio Zambetta, Ryan M. Kelly
Abstract:
This paper introduces the notion of affective extended reality (XR) to characterise XR systems that use biodata to enable understanding of emotions. The HCI literature contains many such systems, but they have not yet been mapped into a coherent whole. To address this, we conducted a scoping review of 82 papers that explore the nexus of biodata, emotions, and XR. We analyse the technologies used in these systems, the interaction techniques employed, and the methods used to evaluate their effectiveness. Through our analysis, we contribute a mapping of the current landscape of affective XR, revealing diversity in the goals for enabling emotion sharing. We demonstrate how HCI researchers have explored the design of the interaction flows in XR biofeedback systems, highlighting key design dimensions and challenges in understanding emotions. We discuss underused approaches for emotion sharing and highlight opportunities for future research on affective XR.
Authors:Yifan Zhao, Yuxin Fang, Yihuan Chen, RAY LC
Abstract:
People who experienced near-death events often turn to personal expression as a way of processing trauma and articulating beliefs. While scholars have examined how individuals share near-death experiences (NDEs), limited research has explored how these narratives are communicated collaboratively on today's social media platforms. We analyzed 200 randomly sampled TikTok videos tagged with #nde and related hashtags. Content analysis revealed that individuals often use NDE narratives to articulate personal meaning, with spiritual and religious themes appearing in the majority of posts and serving as a means of exploring and making sense of personal spiritual perspectives. Consistent with this, analyses of comment sections reveal that videos containing spiritual themes tend to attract more engagement and foster deeper conversations around faith and meaning. Our findings offer insights into how online platforms facilitate community-level engagement with spirituality, and suggest implications for design of spaces that support shared expression and connection in specialized communities.
Authors:Dennis Kim, Roya Daneshi, Bruce Draper, Sarath Sreedharan
Abstract:
The increasing integration of AI-powered tools into expert workflows, such as medicine, law, and finance, raises a critical question: how does AI involvement influence a user's trust in the human expert, the AI system, and their combination? To investigate this, we conducted a user study (N=77) featuring a simulated course-planning task. We compared various conditions that differed in both the presence of AI and the specific mode of human-AI collaboration. Our results indicate that while the advisor's ability to create a correct schedule is important, the user's perception of expertise and trust is also influenced by how the expert utilized the AI assistant. These findings raise important considerations for the design of human-AI hybrid teams, particularly when the adoption of recommendations depends on the end-user's perception of the recommender's expertise.
Authors:Jaime Banks, Jon Stromer-Galley, Samiksha Singh, Collin Capano
Abstract:
Advancing social-scientific research of human-AI interaction dynamics and outcomes often requires researchers to deliver experiences with live large-language models (LLMs) to participants through online survey platforms. However, technical and practical challenges (from logging chat data to manipulating AI behaviors for experimental designs) often inhibit survey-based deployment of AI stimuli. We developed DiSCoKit--an open-source toolkit for deploying live LLM experiences (e.g., ones based on models delivered through Microsoft Azure portal) through JavaScript-enabled survey platforms (e.g., Qualtrics). This paper introduces that toolkit, explaining its scientific impetus, describes its architecture and operation, as well as its deployment possibilities and limitations.
Authors:Alexanne Worm, Florian Marchal, Sylvain Castagnos
Abstract:
Lack of data is a recurring problem in Artificial Intelligence, as it is essential for training and validating models. This is particularly true in the field of cultural heritage, where the number of open datasets is relatively limited and where the data collected does not always allow for holistic modeling of visitors' experience due to the fact that data are ad hoc (i.e. restricted to the sole characteristics required for the evaluation of a specific model). To overcome this lack, we conducted a study between February and March 2019 aimed at obtaining comprehensive and detailed information about visitors, their visit experience and their feedback. We equipped 51 participants with eye-tracking glasses, leaving them free to explore the 3 floors of the museum for an average of 57 minutes, and to discover an exhibition of more than 400 artworks. On this basis, we built an open dataset combining contextual data (demographic data, preferences, visiting habits, motivations, social context. . . ), behavioral data (spatiotemporal trajectories, gaze data) and feedback (satisfaction, fatigue, liked artworks, verbatim. . . ). Our analysis made it possible to re-enact visitor identities combining the majority of characteristics found in the literature and to reproduce the Veron and Levasseur profiles. This dataset will ultimately make it possible to improve the quality of recommended paths in museums by personalizing the number of points of interest (POIs), the time spent at these different POIs, and the amount of information to be provided to each visitor based on their level of interest.
Authors:Natalia Abarca, Andrés Carvallo, Claudia López Moncada, Felipe Bravo-Marquez
Abstract:
The increasing use of Machine Learning (ML) in sensitive domains such as healthcare, finance, and public policy has raised concerns about the transparency of automated decisions. Explainable AI (XAI) addresses this by clarifying how models generate predictions, yet most methods demand technical expertise, limiting their value for novices. This gap is especially critical in no-code ML platforms, which seek to democratize AI but rarely include explainability. We present a human-centered XAI module in DashAI, an open-source no-code ML platform. The module integrates three complementary techniques, which are Partial Dependence Plots (PDP), Permutation Feature Importance (PFI), and KernelSHAP, into DashAI's workflow for tabular classification. A user study (N = 20; ML novices and experts) evaluated usability and the impact of explanations. Results show: (i) high task success ($\geq80\%$) across all explainability tasks; (ii) novices rated explanations as useful, accurate, and trustworthy on the Explanation Satisfaction Scale (ESS, Cronbach's $α$ = 0.74, a measure of internal consistency), while experts were more critical of sufficiency and completeness; and (iii) explanations improved perceived predictability and confidence on the Trust in Automation scale (TiA, $α$ = 0.60), with novices showing higher trust than experts. These findings highlight a central challenge for XAI in no-code ML, making explanations both accessible to novices and sufficiently detailed for experts.
Authors:Chuncheng Liu, Danah Boyd
Abstract:
Data have power. As such, most discussions of data presume that records should mirror some idealized ground truth. Deviations are viewed as failure. Drawing on two ethnographic studies of state data-making in a Chinese street-level bureaucrat agency and at the US Census Bureau we show how seemingly "fake" state data perform institutional work. We map four moments in which actors negotiate between representational accuracy and organizational imperatives: creation, correction, collusion, and augmentation. Bureaucrats routinely privilege what data do over what they represent, creating fictions that serve civil servants' self-interest and enable constrained administrations. We argue that "fakeness" of state data is relational (context dependent), processual (emerging through workflows), and performative (brought into being through labeling and practice). We urge practitioners to center fitness-for-purpose in assessments of data and contextual governance. Rather than chasing impossible representational accuracy, sociotechnical systems should render the politics of useful fictions visible, contestable, and accountable.
Authors:Robin Beierling, Manuel Scheibl, Jonas Dech, Abhijit Vyas, Anna-Lisa Vollmer
Abstract:
Virtual Reality (VR) is increasingly used for training and demonstration purposes including a variety of applications ranging from robot learning to rehabilitation. However, the choice of input device and its visualization might influence workload and thus user performance leading to suboptimal demonstrations or reduced training effects. This study investigates how different VR input configurations - motion capture gloves, controllers with hand visualization, and controllers with controller visualization - affect user experience and task execution, with the goal of identifying which configuration is best suited for which type of task. Participants performed various kitchen-related activities of daily living (ADLs), including object placement, cutting, cleaning, and pouring in a simulated environment. To address two research questions, we evaluated user experience using the System Usability Scale and NASA Task Load Index (RQ1), and task-specific interaction behavior (RQ2). The latter was assessed using trajectory segmentation, analyzing movement efficiency, unnecessary actions, and execution precision. While no significant differences in overall usability and workload were found, trajectory analysis revealed configuration-specific execution behaviors with different movement strategies. Controllers enabled significantly faster task completion with less movement variability in pick-and-place style tasks such as table setting. In contrast, motion capture gloves produced more natural movements with fewer unnecessary actions, but also showed greater variance in movement patterns for manner-oriented tasks such as cutting bread. These findings highlight trade-offs between efficiency and naturalism, and have implications for optimizing VR-based training, improving the quality of user-generated demonstrations, and tailoring interaction design to specific application goals.
Authors:Hao Zhou, Mahanth Gowda
Abstract:
Muscle activation initiates contractions that drive human movement, and understanding it provides valuable insights for injury prevention and rehabilitation. Yet, sensing muscle activation is barely explored in the rapidly growing mobile health market. Traditional methods for muscle activation sensing rely on specialized electrodes, such as surface electromyography, making them impractical, especially for long-term usage. In this paper, we introduce Press2Muscle, the first system to unobtrusively infer muscle activation using insole pressure sensors. The key idea is to analyze foot pressure changes resulting from full-body muscle activation that drives movements. To handle variations in pressure signals due to differences in users' gait, weight, and movement styles, we propose a data-driven approach to dynamically adjust reliance on different foot regions and incorporate easily accessible biographical data to enhance Press2Muscle's generalization to unseen users. We conducted an extensive study with 30 users. Under a leave-one-user-out setting, Press2Muscle achieves a root mean square error of 0.025, marking a 19% improvement over a video-based counterpart. A robustness study validates Press2Muscle's ability to generalize across user demographics, footwear types, and walking surfaces. Additionally, we showcase muscle imbalance detection and muscle activation estimation under free-living settings with Press2Muscle, confirming the feasibility of muscle activation sensing using insole pressure sensors in real-world settings.
Authors:Jiqun Liu, Nischal Dinesh, Ran Yu
Abstract:
ECHO (Evaluation of Chat, Human behavior, and Outcomes) is an open research platform designed to support reproducible, mixed-method studies of human interaction with both conversational AI systems and Web search engines. It enables researchers from varying disciplines to orchestrate end-to-end experimental workflows that integrate consent and background surveys, chat-based and search-based information-seeking sessions, writing or judgment tasks, and pre- and post-task evaluations within a unified, low-coding-load framework. ECHO logs fine-grained interaction traces and participant responses, and exports structured datasets for downstream analysis. By supporting both chat and search alongside flexible evaluation instruments, ECHO lowers technical barriers for studying learning, decision making, and user experience across different information access paradigms, empowering researchers from information retrieval, HCI, and the social sciences to conduct scalable and reproducible human-centered AI evaluations.
Authors:Yang Chen Lin, Chen-Ying Chen, Kai-Hsin Hou, Hung-Yu Chen, Po-Chih Kuo
Abstract:
Interior design often struggles to capture the subtleties of client experience, leaving gaps between what clients feel and what designers can act upon. We present AIDED, a designer-AI co-design workflow that integrates multimodal client data into generative AI (GAI) design processes. In a within-subjects study with twelve professional designers, we compared four modalities: baseline briefs, gaze heatmaps, questionnaire visualizations, and AI-predicted overlays. Results show that questionnaire data were trusted, creativity-enhancing, and satisfying; gaze heatmaps increased cognitive load; and AI-predicted overlays improved GAI communication but required natural language mediation to establish trust. Interviews confirmed that an authenticity-interpretability trade-off is central to balancing client voices with professional control. Our contributions are: (1) a system that incorporates experiential client signals into GAI design workflows; (2) empirical evidence of how different modalities affect design outcomes; and (3) implications for future AI tools that support human-data interaction in creative practice.
Authors:Bob Van Dyck, Arne Van Den Kerchove, Marc M. Van Hulle
Abstract:
We present an open-source implementation of a closed-loop Brain-Computer Interface (BCI) system based on electrocorticographic (ECoG) recordings. Our setup integrates FieldTrip for interfacing with a Micromed acquisition system and PsychoPy for implementing experiments. We open-source three custom Python libraries (psychopylib, pymarkerlib, and pyfieldtriplib) each covering different aspects of a closed-loop BCI interface: designing interactive experiments, sending event information, and real-time signal processing. Our modules facilitate the design and operation of a transparent BCI system, promoting customization and flexibility in BCI research, and lowering the barrier for researchers to translate advances in ECoG decoding into BCI applications.
Authors:Hayfa Dhabhi, Kashyap Thimmaraju
Abstract:
Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain \textit{where} defenses fail or \textit{why}. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the \textbf{Four-Checkpoint Framework}, which organizes safety mechanisms along two dimensions: processing stage (input vs.\ output) and detection level (literal vs.\ intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional Binary ASR reports 22.6\% attack success. However, WASR reveals 52.7\%, a 2.3$\times$ higher vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72--79\% WASR, while input-literal defenses (CP1) are strongest at 13\% WASR. Claude achieves the strongest safety (42.8\% WASR), followed by GPT-5 (55.9\%) and Gemini (59.5\%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
Authors:Tuan-He Lee, Gilly Leshed
Abstract:
Providing mental health support for loved ones across a geographic distance creates unique challenges for the remote caregivers, who sometimes turn to online communities for peer support. We qualitatively analyzed 522 Reddit threads to understand what drives remote caregivers' online help-seeking behaviors and the responses they receive from the community. Their purposes of posting included requesting guidance, expressing emotions, and seeking validation. Community responses included providing emotional support, suggesting informational strategies, and sharing personal experiences. While certain themes in posts (emotional toll, monitoring symptoms, and prioritizing caregiver well-being) are shared across remote and non-remote contexts, remote caregivers' posts surfaced nuanced experiences. For example, they often rely on digital cues, such as voice, to interpret care receivers' well-being while struggling with digital silence during crises. We discuss the need for supporting communication and information sharing between remote caregivers and receivers, care coordination for crisis management, and design recommendations for caregiver communities.
Authors:Minja Axelsson, Henry Shevlin
Abstract:
In this preliminary work, we offer an initial disambiguation of the theoretical concepts anthropomorphism and anthropomimesis in Human-Robot Interaction (HRI) and social robotics. We define anthropomorphism as users perceiving human-like qualities in robots, and anthropomimesis as robot developers designing human-like features into robots. This contribution aims to provide a clarification and exploration of these concepts for future HRI scholarship, particularly regarding the party responsible for human-like qualities - robot perceiver for anthropomorphism, and robot designer for anthropomimesis. We provide this contribution so that researchers can build on these disambiguated theoretical concepts for future robot design and evaluation.
Authors:Alif Rizqullah Mahdi, Mahdi Rezaei, Natasha Merat
Abstract:
Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank & Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy score of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.
Authors:Andreas Tjeldflaat, Piero Romare, Yuki Onishi, Morten Fjeld, Bjørn Sætrevik
Abstract:
Smartphone usage in public spaces can raise privacy concerns, in terms of shoulder surfing and unintended camera capture. In real-world public space settings, we investigated the impact of tangible privacy-enhancing tools (here: screen filter and camera slider) on smartphone users' reported privacy perception, behavioral adaptations, usability and social dynamics. We conducted a mixed-method, in-the-wild study ($N = 22$) using off-the-shelf smartphone privacy tools. We investigated subjective behavioral transition by combining questionnaires with semi-structured interviews. Participants used the screen filter and the camera slider for two weeks; they reported changes in attitude and behavior after using a screen filter including screen visibility and comfort when using phones publicly. They explained decreased privacy-protective behaviors, such as actively covering their screens, suggesting a shift in perceived risk. Qualitative findings about the camera slider suggested underlying psychological mechanisms, including privacy awareness and concerns about social perception, while also offering insights regarding the tools' effectiveness.
Authors:Tetiana Krushynska, Jani Ursin, Ville Heilala
Abstract:
Multiple-choice questions (MCQs) are widely used across diverse educational fields and levels. Well-designed MCQs should evaluate knowledge application in real-world situations. However, writing such test items in sufficient numbers is challenging and time-consuming, especially in natural science education. The problem of a sufficient number of MCQs has two aspects: content coverage and exam security. Therefore, generating test items involves two tasks: creating MCQ prototypes and transforming these prototypes into item series. In automated item generation, prototype creation aligns with template-based methods like cognitive modelling, while item expansion corresponds to example-based techniques. The aim of this research was designing the goal-oriented conceptual model of human - AI co-creation of MCQs that should meet strictly formulated quality criteria. The resulting three-step model for creating MCQ prototypes distributed prompts between several AIs, with human revision of responses for each prompt before setting the next one. To transform the MCQ prototype into an MCQ series, a one-step model was developed in which multiple new items are generated simultaneously. These items assess the same learning outcome but are not simple rephrasings of the prototype or of one another. Based on human and automated evaluation, approximately half of the output MCQs were acceptable without editing. Minor corrections of initially rejected test items allowed for a moderate increase in acceptance of MCQs in series and a significant improvement of MCQ-prototypes.
Authors:Daniel Mwesigwa, Cyan DeVeaux, Palashi Vaghela
Abstract:
Ethnography attends to relations among people, practices, and the technologies that mediate them. Central to this method is the duality of roles ethnographers navigate as researchers and participants and as outsiders and insiders. However, the rise of digital platforms has introduced new opportunities as well as practical and ethical challenges that reshape these dualities across hybrid media environments spanning both online and offline contexts. Drawing on two case studies of VRChat and WhatsApp, we examine how ethnographers employ diverse tactics to study both enduring and emerging socio-cultural issues of race and caste, particularly those that form what are often called publics. We propose emergent relationality as a key analytic for understanding the mutual shaping of ethnographers, platforms, and publics. In this work, emergent relationality offers registers for analyzing how positionality and hybrid media environments constitute and condition what can be accessed, articulated, and made public.
Authors:Rama Adithya, Varanasi, Nov, Oded, Wiesenfeld, Batia Mishan
Abstract:
This study investigates how professional writers' complex relationship with GenAI shapes their work practices and outcomes. Through a cross-sectional survey with writing professionals (n=403) in diverse roles, we show that collaboration and rivalry orientation are associated with differences in work practices and outcomes. Rivalry is primarily associated with relational crafting and skill maintenance. Collaboration is primarily associated with task crafting, productivity, and satisfaction, at the cost of long-term skill deterioration. Combination of the orientations (high rivalry and high collaboration) reconciles these differences, while boosting the association with the outcomes. Our findings argue for a balanced approach where high levels of rivalry and collaboration are essential to shape work practices and generate outcomes aimed at the long-term success of the job. We present key design implications on how to increase friction (rivalry) and reduce over-reliance (collaboration) to achieve a more balanced relationship with GenAI.
Authors:Dohui Lee, Qi Sun, Sang Ho Yoon
Abstract:
Hand-Object Interaction (HOI) is a key interaction component in Virtual Reality (VR). However, designing HOI still requires manual efforts to decide how object should be selected and manipulated, while also considering user abilities, which leads to time-consuming refinements. We present HOICraft, a VLM-based in-situ HOI authoring tool that enables part-level interaction design in VR. Here, HOICraft assists designers by recommending interactable elements from 3D objects, customizing HOI design properties, and mapping hand movement with virtual object behavior. We conducted a formative study with three expert VR designers to identify five representative HOI designs to support diverse user experiences. Building upon preference data from 20 participants, we develop an HOI mapping module with in-context learning. In a user study with 12 VR interaction designers, HOI mapping from HOICraft significantly reduced trial-and-error iterations compared to manual authoring. Finally, we assessed the usability of HOICraft, demonstrating its effectiveness for HOI design in VR.
Authors:Kun-Woo Song, Youngrae Kim, Sang Ho Yoon
Abstract:
The absence of physical information during hand-object interaction in a virtual environment diminishes realism and immersion. Kinesthetic haptic feedback has proven effective in delivering realistic object-derived haptic cues, enhancing the overall virtual reality (VR) experience. Here, we propose kinesthetic illusion through a novel application of finger tendon vibration (FTV), which creates an illusory sensation of finger movement. To effectively apply FTV for virtual object interactions, we first examine the effects of short-duration FTV (<5 s) through 3 perception studies. Based on study results, we design 6 exemplary VR scenarios, representing the overall design space of VR object interactions, and 4 different haptic rendering strategies for FTV. We evaluated these rendering methods on each VR scenario and derived a design guideline for FTV application. We then compared FTV with no vibration and simple vibration, observing that FTV enhances VR experience by providing realistic resistance on the finger, greatly improving body ownership.
Authors:Preeti Vyas, Bereket Guta, Tim G. Zhou, Noor Naila Himam, Andero Uusberg, Karon E. MacLean
Abstract:
Emotion regulation (ER) is essential to mental well-being but often difficult to access, especially in high-intensity moments or for individuals with clinical vulnerabilities. While existing technology-based ER tools offer value, they typically rely on self-reflection (e.g., emotion tracking, journaling) or co-regulation through verbal modalities (reminders, text-based conversational tools), which may not be accessible or effective when most needed. The biological role of the touch modality makes it an intriguing alternate pathway, but empirical evidence is limited and under-theorized. Building on our prior theoretical framework describing how a comforting haptic co-regulating adjunct (CHORA) can support ER, we developed a zoomorphic robot CHORA with looped biomimetic breathing and heartbeat behaviors. We evaluated its effects in a mixed-methods in-lab study (N=30), providing physiological, self-report, custom questionnaire, and retrospective interview data. Our findings demonstrate the regulatory effects of haptically experienced animacy, corroborate prior work, and validate CHORA's {theoretically grounded} potential to facilitate four ER strategies.
Authors:Mohamed El Hajji, Tarek Ait Baha, Aicha Dakir, Hammou Fadili, Youssef Es-Saady
Abstract:
Recent advances in artificial intelligence have created new possibilities for making education more scalable, adaptive, and learner-centered. However, existing educational chatbot systems often lack contextual adaptability, real-time responsiveness, and pedagogical agility. which can limit learner engagement and diminish instructional effectiveness. Thus, there is a growing need for open, integrative platforms that combine AI and immersive technologies to support personalized, meaningful learning experiences. This paper presents Open TutorAI, an open-source educational platform based on LLMs and generative technologies that provides dynamic, personalized tutoring. The system integrates natural language processing with customizable 3D avatars to enable multimodal learner interaction. Through a structured onboarding process, it captures each learner's goals and preferences in order to configure a learner-specific AI assistant. This assistant is accessible via both text-based and avatar-driven interfaces. The platform includes tools for organizing content, providing embedded feedback, and offering dedicated interfaces for learners, educators, and parents. This work focuses on learner-facing components, delivering a tool for adaptive support that responds to individual learner profiles without requiring technical expertise. Its assistant-generation pipeline and avatar integration enhance engagement and emotional presence, creating a more humanized, immersive learning environment. Embedded learning analytics support self-regulated learning by tracking engagement patterns and generating actionable feedback. The result is Open TutorAI, which unites modular architecture, generative AI, and learner analytics within an open-source framework. It contributes to the development of next-generation intelligent tutoring systems.
Authors:Xiaohui Zou, Lijun Ke, Shunpeng Zou
Abstract:
The purpose of this study is to introduce a new model of teaching Chinese as a foreign language from the perspective of integrating wisdom. Its characteristics are as follows: focusing on the butterfly model of interpretation before translation, highlighting the new method of bilingual thinking training, on the one hand, applying the new theory of Chinese characters, the theory of the relationship between language and speech, and the forward-looking research results of language science; On the other hand, the application of the new model of teaching Chinese as a foreign language, AI empowering teaching and learning, and the forward-looking research results of educational science fully reflect a series of characteristics of the new model of teaching Chinese as a foreign language from the perspective of integrating wisdom. Its beneficial effects are: not only the old view of language and education, especially the old view of teaching Chinese as a foreign language, but also the old view of human-computer interaction. Its significance lies in that a series of great cross-border Rongzhixue such as language, knowledge, education and teaching, as well as new methods and new topics of bilingual thinking training are clearly put forward from the perspective of integrating wisdom. Especially in the face of the challenge of Chat GPT to human learning ability and even creativity, the existing concepts of language knowledge education and teaching are already very backward. The old concepts of Chinese language education, and teaching Chinese as a foreign language are all facing a series of subversive innovation challenges. How to seek changes in adaptation? This study has made a series of innovative attempts, hoping to benefit academic colleagues, teachers and students.
Authors:Harsh Chhajed, Tian Guo
Abstract:
Validating Augmented Reality (AR) tracking and interaction models requires precise, repeatable ground-truth motion. However, human users cannot reliably perform consistent motion due to biomechanical variability. Robotic manipulators are promising to act as human motion proxies if they can mimic human movements. In this work, we design and implement ARBot, a real-time teleoperation platform that can effectively capture natural human motion and accurately replay the movements via robotic manipulators. ARBot includes two capture models: stable wrist motion capture via a custom CV and IMU pipeline, and natural 6-DOF control via a mobile application. We design a proactively-safe QP controller to ensure smooth, jitter-free execution of the robotic manipulator, enabling it to function as a high-fidelity record and replay physical proxy. We open-source ARBot and release a benchmark dataset of 132 human and synthetic trajectories captured using ARBot to support controllable and scalable AR evaluation.
Authors:Evgeny Kagan, Kyle Hyndman, Andrew Davis
Abstract:
We use a series of pre-registered, incentive-compatible online experiments to investigate how people evaluate and choose among different waiting time distributions. Our main findings are threefold. First, consistent with prior literature, people show an aversion to both longer expected waits and higher variance. Second, and more surprisingly, moment-based utility models fail to capture preferences when distributions have thick-right tails: indeed, decision-makers strongly prefer distributions with long-right tails (where probability mass is more evenly distributed over a larger support set) relative to tails that exhibit a spike near the maximum possible value, even when controlling for mean, variance, and higher moments. Conditional Value at Risk (CVaR) utility models commonly used in portfolio theory predict these choices well. Third, when given a choice, decision-makers overwhelmingly seek information about right-tail outcomes. These results have practical implications for service operations: (1) service designs that create a spike in long waiting times (such as priority or dedicated queue designs) may be particularly aversive; (2) when informativeness is the goal, providers should prioritize sharing right-tail probabilities or percentiles; and (3) to increase service uptake, providers can strategically disclose (or withhold) distributional information depending on right-tail shape.
Authors:Bo Shui, Xinran Zhu
Abstract:
Asynchronous, text-based discourse-such as students' posts in discussion forums-is widely used to support collaborative learning. However, the distributed and evolving nature of such discourse often makes it difficult to see how ideas connect, develop, and build on one another over time. As a result, learners may struggle to recognize relationships among ideas-a process that is critical for idea advancement in productive collaborative discourse. To address this challenge, we explore how large language models (LLMs) can provide representational guidance by modeling student discourse as a Knowledge Synthesis Graph (KSG). The KSG identifies ideas from student discourse and visualizes their epistemic relationships, externalizing the current state of collaborative knowledge in a form that can support further inquiry and idea advancement. In this study, we present the design of the KSG and evaluate the LLM-based approach for constructing KSGs from authentic student discourse data. Through multi-round human-expert coding and prompt iteration, our results demonstrate the feasibility of using our approach to construct reliable KSGs across different models. This work provides a technical foundation for modeling collaborative discourse with LLMs and offers pedagogical implications for augmenting complex knowledge work in collaborative learning environments.
Authors:Sujay Shalawadi, Katrina Hvítklett, Anna Stentoft Ries, Aisho Mohamed Ali, Florian Echtler
Abstract:
Cookie banners and privacy settings attempt to give users a sense of control over how their personal data is collected and used, but background tracking of personal information often continues unnoticed. To explore how such invisible data collection might be made more perceptible, we present DataCrumb, a physical probe that reacts in real-time to data tracking with visual and auditory feedback. Using a research-through-design approach, we deployed the artifact in three households and studied participants' responses. Instead of providing details about what data was being tracked, the artifact introduced subtle disruptions that made background data flows harder to ignore. Participants described new forms of awareness, contradiction, and fatigue. Our findings show how sensory feedback can support reflection by drawing attention to tracking data flows that are usually hidden. We argue for designing systems that foster awareness and interpretation, especially when the users' control and understanding are limited.
Authors:Jinghui Hu, Ludwig Sidenmark, Hock Siang Lee, Hans Gellersen
Abstract:
People differ in how much they move their head versus their eyes when shifting gaze, yet such tendencies remain largely unexplored in HCI. We introduce head movement tendencies as a fundamental dimension of individual difference in VR and provide a quantitative account of their population-level distribution. Using a 360° video free-viewing dataset (N=87), we model head contributions to gaze shifts with a hinge-based parametric function, revealing a spectrum of strategies from eye-movers to head-movers. We then conduct a user study (N=28) combining 360° video viewing with a short controlled task using gaze targets. While parameter values differ across tasks, individuals show partial alignment in their relative positions within the population, indicating that tendencies are meaningful but shaped by context. Our findings establish head movement tendencies as an important concept for VR and highlight implications for adaptive systems such as foveated rendering, viewport alignment, and multi-user experience design.
Authors:Sarvesh Shashidhar, Abhishek Mishra, Madhav Kotecha
Abstract:
Re-inforcement learning from human feedback (RLHF) has been effective in the task of AI alignment. However, one of the key assumptions of RLHF is that the annotators (referred to as workers from here on out) have a homogeneous response space. This assumption is not true in most practical settings and there have been studies done in the past to challenge this notion. This work has been inspired by such studies and explores one of the ways to deal with heterogeneity in worker preferences - by clustering workers with similar preferences and personalising reward models for each cluster. This work provides an algorithm that encourages simultaneous learning of reward models and worker embeddings. This algorithm is then empirically tested against the Reddit TL;DR dataset with unique worker IDs. We have shown that clustering users into different groups based on their preferences and created personalised reward models improves win-rate of the said models. Along with results and visualisations, this work aims to act as a stepping stone to more complicated models and gives a list of possible future extensions.
Authors:Tongzhou Yu, Han Lin
Abstract:
This paper presents "Remember Me, Not Save Me," an AR & AI system enabling virtual citizens to develop personality through collective dialogue. Core innovations include: Dynamic Collective Memory (DCM) model with narrative tension mechanisms for handling contradictory memories; State-Reflective Avatar for ambient explainability; and Geo-Cultural Context Anchoring for local identity. Deployed at the 2024 Jinan Biennale, the system demonstrated stable personality emergence (ISTP type via Apply Magic Sauce analysis) from over 2,500 public interactions. We provide a framework for designing evolving digital entities that transform collective memory into coherent identity.
Authors:Andre Paulino de Lima, Paula Castro, Suzana Carvalho Vaz de Andrade, Rosa Maria Marcucci, Ruth Caldeira de Melo, Marcelo Garcia Manzato
Abstract:
There are challenges that must be overcome to make recommender systems useful in healthcare settings. The reasons are varied: the lack of publicly available clinical data, the difficulty that users may have in understanding the reasons why a recommendation was made, the risks that may be involved in following that recommendation, and the uncertainty about its effectiveness. In this work, we address these challenges with a recommendation model that leverages the structure of psychometric data to provide visual explanations that are faithful to the model and interpretable by care professionals. We focus on a narrow healthcare niche, gerontological primary care, to show that the proposed recommendation model can assist the attending professional in the creation of personalised care plans. We report results of a comparative offline performance evaluation of the proposed model on healthcare datasets that were collected by research partners in Brazil, as well as the results of a user study that evaluates the interpretability of the visual explanations the model generates. The results suggest that the proposed model can advance the application of recommender systems in this healthcare niche, which is expected to grow in demand , opportunities, and information technology needs as demographic changes become more pronounced.
Authors:Zheng Yan, Ru-Yuan Zhang
Abstract:
The psychological science of artificial intelligence (AI) can be broadly defined as an emerging field of psychology that examines all AI-related mental and behavioral processes from the perspective of psychology. This field has been growing exponentially in the recent decade. This review synthesizes the existing literature on the psychological science of AI with a goal to provide a comprehensive conceptual framework for planning, conducting, and assessing scientific research in the field. It consists of six parts, starting with an overview of the entire field of the psychological science of artificial intelligence, then synthesizing the literature in each of the four specific areas (i.e., Psychology of designing AI, psychology of using AI, AI for examining psychological processes, and AI for advancing psychological methods), and concluding with an outlook on the field in the future.
Authors:Hamza Peracha, Carrina Iacobacci, Tyler Singer-Clark, Leigh R. Hochberg, Sergey D. Stavisky, David M. Brandman, Nicholas S. Card
Abstract:
Communication and computer interaction are important for autonomy in modern life. Unfortunately, these capabilities can be limited or inaccessible for the millions of people living with paralysis. While implantable brain-computer interfaces (BCIs) show promise for restoring these capabilities, little has been explored on designing BCI user interfaces (UIs) for sustained daily use. Here, we present a personalized UI for an intracortical BCI system that enables users with severe paralysis to communicate and interact with their computers independently. Through a 22-month longitudinal deployment with one participant, we used iterative co-design to develop a system for everyday at-home use and documented how it evolved to meet changing needs. Our findings highlight how personalization and adaptability enabled independence in daily life and provide design implications for developing future BCI assistive technologies.
Authors:Kaicheng Wang, Kevin Zhongyang Shao, Ruiqi Chen, Sep Makhsous, Denise Wilson
Abstract:
Olfactory cues can enhance immersion in interactive media, yet smell remains rare because it is difficult to author and synchronize with dynamic video. Prior olfactory interfaces rely on designer triggers and fixed event-to-odor mappings that do not scale to unconstrained content. This work examines whether semantic planning for smell is intelligible to people before physical scent delivery. We present a video-to-scent planning pipeline that separates visual semantic extraction using a vision-language model from semantic-to-olfactory inference using a large language model. Two survey studies compare system-generated scent plans with over-inclusive and naive baselines. Results show consistent preference for plans that prioritize perceptually salient cues and align scent changes with visible actions, supporting semantic planning as a foundation for future olfactory media systems.
Authors:Ruipeng Wang, Tawab Safi, Yunge Wen, Christina Cunningham, Hoi Ling Tang, Behnaz Farahi
Abstract:
Across cultures, water has served as a recipient of human confession, a yielding medium that receives vulnerability where rigid surfaces cannot. We present Whispering Water, an interactive installation that materializes human-AI dialogue through cymatic patterns on water. Participants confess secrets to a water surface, triggering a four-phase ritual: confession, contemplation, response, and release. The user's speech sentiment is directly transmitted into the water to prime its state, while semantic content enters a multi-agent system, initiating ripples of conversation where agent identities are situated through discourse and voice profiles are chosen based on what they say. We propose a novel algorithm that decomposes speech into component waves and reconstructs them in water, establishing a translation between speech and the physics of material form. By rendering machine reasoning as emergent physical phenomena, the installation explores possibilities for emotional self-exploration through ambiguous, sensory-rich interfaces.
Authors:Tuhin Chakrabarty, Paramveer S. Dhillon
Abstract:
Creative writing has long been considered a uniquely human endeavor, requiring voice and style that machines could not replicate. This assumption is challenged by Generative AI that can emulate thousands of author styles in seconds with negligible marginal labor. To understand this better, we conducted a behavioral experiment where 28 MFA writers (experts) competed against three LLMs in emulating 50 critically acclaimed authors. Based on blind pairwise comparisons by 28 expert judges and 131 lay judges, we find that experts preferred human writing in 82.7% of cases under the in-context prompting condition but this reversed to 62% preference for AI after fine-tuning on authors' complete works. Lay judges, however, consistently preferred AI writing. Debrief interviews with expert writers revealed that their preference for AI writing triggered an identity crisis, eroding aesthetic confidence and questioning what constitutes "good writing." These findings challenge discourse about AI's creative limitations and raise fundamental questions about the future of creative labor.
Authors:Brian Gin, Ahreum Lim, Flávia Silva e Oliveira, Kuan Xing, Xiaomei Song, Gayana Amiyangoda, Thilanka Seneviratne, Alison F. Doubleday, Ananya Gangopadhyaya, Bob Kiser, Lukas Shum-Tim, Dhruva Patel, Kosala Marambe, Lauren Maggio, Ara Tekian, Yoon Soo Park
Abstract:
Background: In medical and health professions education (HPE), AI is increasingly used to assess clinical competencies, including via virtual standardized patients. However, most evaluations rely on AI-human interrater reliability and lack a measurement framework for how cases, learners, and raters jointly shape scores. This leaves robustness uncertain and can expose learners to misguidance from unvalidated systems. We address this by using AI "simulated learners" to stress-test and psychometrically characterize assessment pipelines before human use. Objective: Develop an open-source AI virtual patient platform and measurement model for robust competency evaluation across cases and rating conditions. Methods: We built a platform with virtual patients, virtual learners with tunable ACGME-aligned competency profiles, and multiple independent AI raters scoring encounters with structured Key-Features items. Transcripts were analyzed with a Bayesian HRM-SDT model that treats ratings as decisions under uncertainty and separates learner ability, case performance, and rater behavior; parameters were estimated with MCMC. Results: The model recovered simulated learners' competencies, with significant correlations to the generating competencies across all ACGME domains despite a non-deterministic pipeline. It estimated case difficulty by competency and showed stable rater detection (sensitivity) and criteria (severity/leniency thresholds) across AI raters using identical models/prompts but different seeds. We also propose a staged "safety blueprint" for deploying AI tools with learners, tied to entrustment-based validation milestones. Conclusions: Combining a purpose-built virtual patient platform with a principled psychometric model enables robust, interpretable, generalizable competency estimates and supports validation of AI-assisted assessment prior to use with human learners.
Authors:Naman Gupta, Sophie Stephenson, Chung Chi Yeung, Wei Ting Wu, Jeneile Luebke, Kate Walsh, Rahul Chatterjee
Abstract:
Indigenous peoples across Turtle Island (North America) face disproportionate rates of disappearance and murder, a "genocide" rooted in settler-colonial violence and systemic erasure. Technology plays a crucial role in the Missing and Murdered Indigenous Relatives (MMIR) crisis: perpetuating harm and impeding investigations, yet enabling advocacy and resistance. Communities utilize technologies such as AMBER alerts, news websites, social media groups, and campaigns (like #MMIW, #MMIWR, #NoMoreStolenSisters, and #NoMoreStolenDaughters) to mobilize searches, amplify awareness, and honor missing relatives. Yet, little research in HCI has critically examined technology's role in shaping the MMIR crisis by centering community voices. Through a large-scale study, we analyze 140 webpages to identify systemic, technological, and institutional barriers that hinder communities' efforts, while highlighting socio-technical actions that foster healing and safety. Finally, we amplify Indigenous voices by providing a dataset of stories that resist epistemic erasure, along with recommendations for HCI researchers to support Indigenous-led initiatives with cultural sensitivity, accountability, and self-determination.
Authors:Amin Mohamed, Hamza Abdelmoreed, Mohamed Ehab, Youssef Shawky, Mayada Hadhoud, Ahmad Al-Kabbany
Abstract:
Low back pain (LBP) is a pervasive global health challenge, affecting approximately 80% of adults and frequently progressing into chronic or recurrent episodes. While exercise therapy is a primary clinical intervention, traditional at-home programs suffer from low adherence rates and the absence of professional supervision. This study introduces TOSHFA, an accessible mobile VR-based rehabilitation system that bridges this gap by combining computer vision with affordable hardware. The system utilizes a laptop webcam to perform real-time pose estimation via the MediaPipe framework, tracking 33 skeletal landmarks to provide immediate biofeedback. This data is streamed via low-latency UDP protocols to a smartphone mounted in a cardboard-style VR headset, where patients interact with a gamified 3D environment. A pilot study with 20 participants evaluated the system's performance and user engagement. Quantitative results yielded a mean System Usability Scale (SUS) score of 47.4, indicating marginal usability and a need for interface optimization. However, Game Experience Questionnaire (GEQ) data revealed high scores in positive affect and enjoyment, suggesting that the gamification elements--such as coin rewards and streak tracking--successfully maintained user motivation despite technical friction. These findings validate the feasibility of a smartphone-based tele-rehabilitation model and establish a technical foundation for future clinical trials involving multi-exercise protocols.
Authors:Hellina Hailu Nigatu, Farhana Shahid, Vishal Sharma, Abigail Oppong, Michaelanne Thomas, Syed Ishtiaque Ahmed
Abstract:
Peer review determines which scholarship is legitimized; however, review biases often disadvantage scholarship that diverges from the norm. Human-Computer Interaction (HCI) lacks a systemic inquiry into how such biases affect underrepresented Global South (GS) scholarship. To address this critical gap, we conducted four focus groups with 16 HCI researchers studying the GS. Participants reported experiencing reviews that confined them to development research, dismissed their theoretical contributions, and questioned situated knowledge from GS communities. Both as authors and reviewers, participants reported experiencing the epistemic burden of over-explaining why knowledge from GS communities matters. Further, they noted being tokenized as ``cultural experts'' when assigned to review papers and pointed out that the hidden curriculum of writing HCI papers often gatekeeps GS scholarship. Using epistemic oppression as a lens, we discuss how review practices marginalize GS scholarship and outline actionable strategies for nurturing equitable epistemological evaluation of HCI scholarship.
Authors:Mingxian Yu, Siqi Luo, Xu Chen
Abstract:
Mobile graphical user interface (GUI) agents are designed to automate everyday tasks on smartphones. Recent advances in large language models (LLMs) have significantly enhanced the capabilities of mobile GUI agents. However, most LLM-powered mobile GUI agents operate in stepwise query-act loops, which incur high latency due to repeated LLM queries. We present GraphPilot, a mobile GUI agent that leverages knowledge graphs of the target apps to complete user tasks in almost one LLM query. GraphPilot operates in two complementary phases to enable efficient and reliable LLM-powered GUI task automation. In the offline phase, it explores target apps, records and analyzes interaction history, and constructs an app-specific knowledge graph that encodes functions of pages and elements as well as transition rules for each app. In the online phase, given an app and a user task, it leverages the knowledge graph of the given app to guide the reasoning process of LLM. When the reasoning process encounters uncertainty, GraphPilot dynamically requests the HTML representation of the current interface to refine subsequent reasoning. Finally, a validator checks the generated sequence of actions against the transition rules in the knowledge graph, performing iterative corrections to ensure it is valid. The structured, informative information in the knowledge graph allows the LLM to plan the complete sequence of actions required to complete the user task. On the DroidTask benchmark, GraphPilot improves task completion rate over Mind2Web and AutoDroid, while substantially reducing latency and the number of LLM queries.
Authors:Nadja Rupprechter, Tobias Dienlin, Tilo Hartmann
Abstract:
For a growing number of people, AI chatbots have become close personal companions. Despite rising scholarly attention, theoretical accounts of how such relationships develop remain fragmented. Existing frameworks address important aspects of the phenomenon, but they rarely treat human-chatbot communication as the central behavior that builds relationships. To address this gap, we propose the AI relationship process (AI-RP) framework. The AI-RP outlines relationship formation as a sequential process. (a) Chatbot characteristics shape users' (b) social perceptions. These perceptions guide (c) communication, and communication produces (d) relational outcomes such as attachment and companionship. The AI-RP introduces a six-features profile characterizing chatbots, a dual-route approach of social perception, a behavioral conceptualization of communication and discusses the foundation and types of artificial relationships. By foregrounding observable communicative behavior, the AI-RP provides a foundation for theory building and empirical research on the social and ethical implications of AI companionship.
Authors:Jathushan Kaetheeswaran, Jenny Wei
Abstract:
The interactions between the brain and heart during sleep are responsible for regulating autonomic function. While brain-heart coupling has been studied in healthy populations, the relationships between neural and cardiac activity across sleep stages in the presence of sleep disorders are not clear. This study examines the influence of brain-driven cardiac activity across sleep stages for individuals with sleep disorders. Overnight recordings of C3 and C4 electroencephalogram (EEG) channels and electrocardiogram (ECG) signals from 146 individuals were preprocessed and analyzed in the frequency domain through a linear mixed-effect model. Our results show that parasympathetic activity is sensitive to changes in delta and beta powers during later stages of non-rapid eye movement (NREM) sleep, as both band powers exhibited strong negative effects on high-frequency heart rate variability (HF-HRV) power. These findings show that neural activity can drive vagal tone across sleep stages, suggesting that treatments on key EEG bands during NREM and REM stages may help restore regular cardiac behaviour.
Authors:Daehwa Kim, Chris Harrison
Abstract:
We introduce and explore a new multimodal input representation for vision-language models: acoustic field video. Unlike conventional video (RGB with stereo/mono audio), our video stream provides a spatially grounded visualization of sound intensity across a scene, offering a new and powerful dimension of perceptual understanding. Our real-time pipeline uses low-cost beamforming microphone arrays that are already common in smart speakers and increasingly present in robotics and XR headsets, yet this sensing capability remains unutilized for scene understanding. To assess the value of spatial acoustic information, we constructed an evaluation set of 402 question-answer scenes, comparing a state-of-the-art VLM given conventional video with and without paired acoustic field video. Results show a clear and consistent improvement when incorporating spatial acoustic data; the VLM we test improves from 38.3% correct to 67.4%. Our findings highlight that many everyday scene understanding tasks remain underconstrained when relying solely on visual and audio input, and that acoustic field data provides a promising and practical direction for multimodal reasoning. A video demo is available at https://daehwakim.com/seeingsound
Authors:Simon Lämmer, Mark Colley, Patrick Ebel
Abstract:
People's transportation choices reflect complex trade-offs shaped by personal preferences, social norms, and technology acceptance. Predicting such behavior at scale is a critical challenge with major implications for urban planning and sustainable transport. Traditional methods use handcrafted assumptions and costly data collection, making them impractical for early-stage evaluations of new technologies or policies. We introduce Generative Traffic Agents (GTA) for simulating large-scale, context-sensitive transportation choices using LLM-powered, persona-based agents. GTA generates artificial populations from census-based sociodemographic data. It simulates activity schedules and mode choices, enabling scalable, human-like simulations without handcrafted rules. We evaluate GTA in Berlin-scale experiments, comparing simulation results against empirical data. While agents replicate patterns, such as modal split by socioeconomic status, they show systematic biases in trip length and mode preference. GTA offers new opportunities for modeling how future innovations, from bike lanes to transit apps, shape mobility decisions.
Authors:Riccardo Volpato, Simone Stumpf, Lisa DeBruine
Abstract:
People are increasingly turning to generative AI (e.g., ChatGPT, Gemini, Copilot) for emotional support and companionship. While trust is likely to play a central role in enabling these informal and unsupervised interactions, we still lack an understanding of how people develop and experience it in this context. Seeking to fill this gap, we recruited 24 frequent users of generative AI for emotional support and conducted a qualitative study consisting of diary entries about interactions, transcripts of chats with AI, and in-depth interviews. Our results suggest important novel drivers of trust in this context: familiarity emerging from personalisation, nuanced mental models of generative AI, and awareness of people's control over conversations. Notably, generative AI's homogeneous use of personalised, positive, and persuasive language appears to promote some of these trust-building factors. However, this also seems to discourage other trust-related behaviours, such as remembering that generative AI is a machine trained to converse in human language. We present implications for future research that are likely to become critical as the use of generative AI for emotional support increasingly overlaps with therapeutic work.
Authors:Zoë Breed, Elvin Karana, Alessandro Bozzon, Katherine W. Song
Abstract:
Bio-digital systems that merge microbial life with technology promise new modes of computation, combining biological adaptability with digital precision. Yet realizing this potential symbiotically -- where biological and digital agents co-adapt and co-process -- remains elusive, largely due to the absence of a shared vocabulary bridging biology and computing. Consequently, microbes are often constrained to uni-directional roles, functioning as sensors or actuators rather than as active, computational partners in bio-digital systems. In response, we propose a taxonomy and pathways that articulate and expand the roles of biological and digital entities for synergetic bio-digital computation. Using this taxonomy, we analysed 70 systems across HCI, design, and engineering, identifying how biological mechanisms can be mapped onto computational abstractions. We argue that such mappings enable computationally actionable directions that foster richer and reciprocal relationships in bio-digital systems, supporting regenerative ecologies across time and scale while inspiring new paradigms for computation in HCI.
Authors:Sohyeon Park, Jesus Armando Beltran, Aehong Min, Anamara Ritt-Olson, Gillian R. Hayes
Abstract:
Large Language Models (LLMs) like ChatGPT offer potential support for autistic people, but this potential requires understanding the implicit perspectives these models might carry, including their biases and assumptions about autism. Moving beyond single-agent prompting, we utilized LLM-based multi-agent systems to investigate complex social scenarios involving autistic and non-autistic agents. In our study, agents engaged in group-task conversations and answered structured interview questions, which we analyzed to examine ChatGPT's biases and how it conceptualizes autism. We found that ChatGPT assumes autistic people are socially dependent, which may affect how it interacts with autistic users or conveys information about autism. To address these challenges, we propose adopting the double empathy problem, which reframes communication breakdowns as a mutual challenge. We describe how future LLMs could address the biases we observed and improve interactions involving autistic people by incorporating the double empathy problem into their design.
Authors:Paweł Niszczota, Elia Antoniou
Abstract:
While delegating tasks to large language models (LLMs) can save people time, there is growing evidence that offloading tasks to such models produces social costs. We use behavior in two canonical economic games to study whether people have different expectations when decisions are made by LLMs acting on their behalf instead of themselves. More specifically, we study the social appropriateness of a spectrum of possible behaviors: when LLMs divide resources on our behalf (Dictator Game and Ultimatum Game) and when they monitor the fairness of splits of resources (Ultimatum Game). We use the Krupka-Weber norm elicitation task to detect shifts in social appropriateness ratings. Results of two pre-registered and incentivized experimental studies using representative samples from the UK and US (N = 2,658) show three key findings. First, people find that offers from machines - when no acceptance is necessary - are judged to be less appropriate than when they come from humans, although there is no shift in the modal response. Second - when acceptance is necessary - it is more appropriate for a person to reject offers from machines than from humans. Third, receiving a rejection of an offer from a machine is no less socially appropriate than receiving the same rejection from a human. Overall, these results suggest that people apply different norms for machines deciding on how to split resources but are not opposed to machines enforcing the norms. The findings are consistent with offers made by machines now being viewed as having both a cognitive and emotional component.
Authors:Jason Pan, Ben Moews
Abstract:
Independent navigation is a core aspect of maintaining social participation and individual health for vulnerable populations. While historic cities such as Edinburgh, as the capital of Scotland, often feature well-established public transport systems, urban accessibility challenges remain and are exacerbated by a complex landscape, especially for groups with multiple vulnerabilities such as the blind elderly. With limited research examining how real-time data feeds and developments in artificial intelligence can enhance navigation aids, we address this gap through a mixed-methods approach. Our work combines statistical and machine learning techniques, with a focus on spatial analysis to investigate network coverage, service patterns, and density through live Transport for Edinburgh data, with a qualitative thematic analysis of semi-structured interviews with the mentioned target group. The results demonstrate the highly centralised nature of the city's transport system, the significance of memory-based navigation, and the lack of travel information in usable formats. We also find that participants already use navigation technology to varying degrees and express a willingness to adopt artificial intelligence. Our analysis highlights the importance of dynamic tools in terms of sensory and cognitive needs to meaningfully improve independent travel.
Authors:Mayada Oudah, John Wooders
Abstract:
Facial expressions are central to human interaction, yet their role in strategic decision-making has received limited attention. We investigate how real-time facial communication influences cooperation in repeated social dilemmas. In a laboratory experiment, participants play a repeated Prisoner's Dilemma game under two conditions: in one, they observe their counterpart's facial expressions via gender-neutral avatars, and in the other no facial cues are available. Using state-of-the-art biometric technology to capture and display emotions in real-time, we find that facial communication significantly increases overall cooperation and, notably, promotes cooperation following defection. This restorative effect suggests that facial expressions help participants interpret defections less harshly, fostering forgiveness and the resumption of cooperation. While past actions remain the strongest predictor of behavior, our findings highlight the communicative power of facial expressions in shaping strategic outcomes. These results offer practical insights for designing emotionally responsive virtual agents and digital platforms that sustain cooperation in the absence of physical presence.
Authors:Simran Kaur, Sara Salimzadeh, Ujwal Gadiraju
Abstract:
AI has revolutionised decision-making across various fields. Yet human judgement remains paramount for high-stakes decision-making. This has fueled explorations of collaborative decision-making between humans and AI systems, aiming to leverage the strengths of both. To explore this dynamic, researchers conduct empirical studies, investigating how humans use AI assistance for decision-making and how this collaboration impacts results. A critical aspect of conducting these studies is the role of participants, often recruited through crowdsourcing platforms. The validity of these studies hinges on the behaviours of the participants, hence effective incentives that can potentially affect these behaviours are a key part of designing and executing these studies. In this work, we aim to address the critical role of incentive design for conducting empirical human-AI decision-making studies, focusing on understanding, designing, and documenting incentive schemes. Through a thematic review of existing research, we explored the current practices, challenges, and opportunities associated with incentive design for human-AI decision-making empirical studies. We identified recurring patterns, or themes, such as what comprises the components of an incentive scheme, how incentive schemes are manipulated by researchers, and the impact they can have on research outcomes. Leveraging the acquired understanding, we curated a set of guidelines to aid researchers in designing effective incentive schemes for their studies, called the Incentive-Tuning Framework, outlining how researchers can undertake, reflect on, and document the incentive design process. By advocating for a standardised yet flexible approach to incentive design and contributing valuable insights along with practical tools, we hope to pave the way for more reliable and generalizable knowledge in the field of human-AI decision-making.
Authors:Chris Monk, Allegra Ayala, Christine S. P. Yu, Gregory M. Fitch, Dara Gruber
Abstract:
Driver distraction remains a leading contributor to motor vehicle crashes, necessitating rigorous evaluation of new in-vehicle technologies. This study assessed the visual and cognitive demands associated with an advanced Large Language Model (LLM) conversational agent (Gemini Live) during on-road driving, comparing it against handsfree phone calls, visual turn-by-turn guidance (low load baseline), and the Operation Span (OSPAN) task (high load anchor). Thirty-two licensed drivers completed five secondary tasks while visual and cognitive demands were measured using the Detection Response Task (DRT) for cognitive load, eye-tracking for visual attention, and subjective workload ratings. Results indicated that Gemini Live interactions (both single-turn and multi-turn) and hands-free phone calls shared similar levels of cognitive load, between that of visual turn-by-turn guidance and OSPAN. Exploratory analysis showed that cognitive load remained stable across extended multi-turn conversations. All tasks maintained mean glance durations well below the well-established 2-second safety threshold, confirming low visual demand. Furthermore, drivers consistently dedicated longer glances to the roadway between brief off-road glances toward the device during task completion, particularly during voice-based interactions, rendering longer total-eyes-off-road time findings less consequential. Subjective ratings mirrored objective data, with participants reporting low effort, demands, and perceived distraction for Gemini Live. These findings demonstrate that advanced LLM conversational agents, when implemented via voice interfaces, impose cognitive and visual demands comparable to established, low-risk hands-free benchmarks, supporting their safe deployment in the driving environment.
Authors:Christina Schneegass, Francesco Chiossi, Anna L. Cox, Dimitra Dritsa, Teodora Mitrevska, Stephen Rainey, Max L. Wilson
Abstract:
Research on Cognitive Personal Informatics (CPI) is steadily growing as new wearable cognitive tracking technologies emerge on the consumer market, claiming to measure stress, focus, and other cognitive factors. At the same time, with generative AI offering new ways to analyse, visualize, and interpret cognitive data, we hypothesize that cognitive tracking will soon become as simple as measuring your heart rate during a run. Yet, cognitive data remains inherently more complex, context-dependent, and less well understood than physical activity data. This workshop brings together HCI experts to discuss critical questions, including: How can complex cognitive data be translated into meaningful metrics? How can AI support users' data sensemaking without over-simplifying cognitive insights? How can we design inclusive CPI technologies that consider inter-personal variance and neurodiversity? We will map
Authors:Yufei Zhang, Zhihao Ma
Abstract:
Large language models (LLMs) are used as "digital twins" to replace human respondents, yet their psychometric comparability to humans is uncertain. We propose a construct-validity framework spanning construct representation and the nomological net, benchmarking digital twins against human gold standards across models, tasks and testing how person-specific inputs shape performance. Across studies, digital twins achieved high population-level accuracy and strong within-participant profile correlations, alongside attenuated item-level correlations. In word association tests, LLM-based networks show small-world structure and theory-consistent communities similar to humans, yet diverge lexically and in local structure. In decision-making and contextualized tasks, digital twins under-reproduce heuristic biases, showing normative rationality, compressed variance and limited sensitivity to temporal information. Feature-rich digital twins improve Big Five Personality prediction, but their personality networks show only configural invariance and do not achieve metric invariance. In more applied free-text tasks, feature-rich digital twins better match human narratives, but linguistic differences persist. Together, these results indicate that feature-rich conditioning enhances validity but does not resolve systematic divergences in psychometric comparability. Future work should therefore prioritize delineating the effective boundaries of digital twins, establishing the precise contexts in which they function as reliable proxies for human cognition and behavior.
Authors:Alex Echeverria, Sávio Salvarino Teles de Oliveira, Fernando Marques Federson
Abstract:
The adaptation of Large-Scale Language Models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer - Q&A). However, generating these datasets, particularly from unstructured sources such as call center audio recordings, poses a significant challenge due to the noisy and disorganized nature of the data. This paper presents a solution to this challenge by offering an end-to-end automated pipeline for generating Q&A instructional datasets from such recordings. The methodology developed comprises sequential steps of audio processing (including diarization, noise removal and automatic transcription), textual processing (cleaning, normalization, and anonymization), semantic extraction of customer demands and attendant responses using vector embeddings, and matching via semantic search to form the final Q&A pairs. As a result, the complete pipeline was successfully implemented, generating a dataset specifically formatted for Instruct Fine Tuning. The practical value and feasibility of the generated dataset were substantiated and functionally demonstrated through the successful fine-tuning of an LLM model (based on Llama 2 7B). The conclusion of the paper states that the proposed approach is viable for converting unstructured conversational data from call centers into valuable resources for training LLMs. This development has the potential to open up avenues for creating more effective AI systems for Q&A tasks in the customer service domain. The developed codes have been made publicly available to promote reproducibility and future research.
Authors:Ziwen Zhong, Zhitao Shu, Yue Zhao
Abstract:
Emotion recognition is a fundamental component of next-generation human-computer interaction (HCI), enabling machines to perceive, understand, and respond to users' affective states. However, existing systems often rely on single-modality analysis such as facial expressions, speech tone, or textual sentiment, resulting in limited robustness and poor generalization in real-world environments. To address these challenges, this study proposes a Cloud-Based Cross-Modal Transformer (CMT) framework for multimodal emotion recognition and adaptive human-computer interaction. The proposed model integrates visual, auditory, and textual signals using pretrained encoders (Vision Transformer, Wav2Vec2, and BERT) and employs a cross-modal attention mechanism to capture complex interdependencies among heterogeneous features. By leveraging cloud computing infrastructure with distributed training on Kubernetes and TensorFlow Serving, the system enables scalable, low-latency emotion recognition for large-scale user interactions. Experiments conducted on benchmark datasets including IEMOCAP, MELD, and AffectNet demonstrate that the CMT achieves state-of-the-art performance, improving the F1-score by 3.0 percent and reducing cross-entropy loss by 12.9 percent compared to strong multimodal baselines. Additionally, cloud deployment evaluations show an average response latency of 128 ms, representing a 35 percent reduction compared with conventional transformer-based fusion systems. These results confirm that the proposed framework enables efficient, real-time emotion recognition and adaptive feedback in applications such as intelligent customer service, virtual tutoring systems, and affective computing interfaces, marking an important step toward cloud-native affective computing and emotionally intelligent interactive systems.
Authors:Runze Li, Lanbing Li, Yuan Zheng, Chuanxiao Li, Xianglong Zeng
Abstract:
Artificial intelligences (AIs) are increasingly capable of emotionally engaging with humans to the point of forming intimate relationships. Yet, current studies on romantic love toward AI lack statistically validated instruments to measure romantic love toward AI, hindering empirical research. To address this gap, we reinterpreted Lee's love styles theory in the AI context and developed the Love Attitudes Scale toward AI (LAS-AI). The resulting 24-item, six-factor scale was validated across four phases using three independent samples (N = 899), demonstrating strong psychometric properties. The findings further revealed that people primarily seek practical, passionate, and companionship-based relationships with AI (i.e., Pragma, Eros, and Storge), showing little interest in a playful or noncommittal approach (i.e., Ludus). We also provided an initial exploration of the similarities and differences between romantic love with humans and AI. The LAS-AI offers a robust tool for future research on human-AI romantic relationships, with prolific implications.
Authors:Ricard Solé, Luis F Seoane, Jordi Pla-Mauri, Michael Timothy Bennett, Michael E. Hochberg, Michael Levin
Abstract:
Cognitive processes are realized across an extraordinary range of natural, artificial, and hybrid systems, yet there is no unified framework for comparing their forms, limits, and unrealized possibilities. Here, we propose a cognition space approach that replaces narrow, substrate-dependent definitions with a comparative representation based on organizational and informational dimensions. Within this framework, cognition is treated as a graded capacity to sense, process, and act upon information, allowing systems as diverse as cells, brains, artificial agents, and human-AI collectives to be analyzed within a common conceptual landscape. We introduce and examine three cognition spaces -- basal aneural, neural, and human-AI hybrid -- and show that their occupation is highly uneven, with clusters of realized systems separated by large unoccupied regions. We argue that these voids are not accidental but reflect evolutionary contingencies, physical constraints, and design limitations. By focusing on the structure of cognition spaces rather than on categorical definitions, this approach clarifies the diversity of existing cognitive systems and highlights hybrid cognition as a promising frontier for exploring novel forms of complexity beyond those produced by biological evolution.
Authors:Bhavesh Vuyyuru, Farnaz Jahanbakhsh
Abstract:
Online disagreements often fail to produce understanding, instead reinforcing existing positions or escalating conflict. Prior work on predictors of successful persuasion in online discourse has largely focused on surface features such as linguistic style or conversational structure, leaving open the role of underlying principles or concerns that participants bring to an interaction. In this paper, we investigate how the expression and alignment of human values in back-and-forth online discussions relate to persuasion. Using data from Reddit's ChangeMyView subreddit, where successful persuasion is explicitly signaled through the awarding of deltas, we analyze one-on-one exchanges and characterize participants' value expression by drawing from Schwartz's Refined Theory of Basic Human Values. We find that successful persuasion is associated with two complementary processes: pre-existing compatibility between participants' value priorities even before the exchange happens, and the emergence of value alignment over the course of a conversation. At the same time, successful persuasion does not depend on commenters making large departures from their typical value expression patterns. We discuss implications of our findings for the design of online social platforms that aim to support constructive engagement across disagreement.
Authors:Leif Azzopardi, Adam Roegiest
Abstract:
The classic paradigms of Berry Picking and Information Foraging Theory have framed users as gatherers, opportunistically searching across distributed sources to satisfy evolving information needs. However, the rise of GenAI is driving a fundamental transformation in how people produce, structure, and reuse information - one that these paradigms no longer fully capture. This transformation is analogous to the Neolithic Revolution, when societies shifted from hunting and gathering to cultivation. Generative technologies empower users to "farm" information by planting seeds in the form of prompts, cultivating workflows over time, and harvesting richly structured, relevant yields within their own plots, rather than foraging across others people's patches. In this perspectives paper, we introduce the notion of Information Farming as a conceptual framework and argue that it represents a natural evolution in how people engage with information. Drawing on historical analogy and empirical evidence, we examine the benefits and opportunities of information farming, its implications for design and evaluation, and the accompanying risks posed by this transition. We hypothesize that as GenAI technologies proliferate, cultivating information will increasingly supplant transient, patch-based foraging as a dominant mode of engagement, marking a broader shift in human-information interaction and its study.
Authors:Rezky Kam, Coddy N. Siswanto
Abstract:
This paper introduces a dataset and conceptual framework for LLMs to mimic real world emotional dynamics through time and in-context learning leveraging physics-informed neural network, opening a possibility for interpretable dialogue modeling.
Authors:Hilsann Yong, Bradley A. Camburn
Abstract:
The design-build-test cycle is essential for innovation, but physical prototyping is often slow and expensive. Although physics-based simulation and strategic prototyping can reduce cost, meaningful evaluation is frequently constrained until an integrated prototype is built. This paper investigates whether a generative pretrained transformer (GPT) can predict information typically obtained through prototyping, including cost, performance, and perceived usability. We introduce a retrieval-augmented generation (RAG) method to emulate design feedback using OpenAI GPT-4o, grounded in prototyping data scraped from Instructables.com to increase access to relevant precedent. Two studies are reported. First, a controlled experiment compares GPT-RAG and human designers, who receive design sketches and predict cost, performance, and usability; predictions are evaluated against ground-truth results from physical prototypes. Second, we report an applied demonstration in which a physical prototype is produced from GPT-RAG recommendations and compared with a commercial baseline and a topology-optimized design. Results show that GPT-RAG provides more accurate cost and performance estimates than individual or crowd human estimates, while yielding comparable usability insights; the GPT-RAG-informed prototype also outperforms both comparison prototypes. Repeated querying with response averaging significantly improves accuracy, suggesting that LLMs can emulate crowd aggregation effects consistent with the law of large numbers.
Authors:Yuki Ueno, Hiroaki Natsukawa, Koji Koyamada
Abstract:
The group-in-a-box (GIB) layout is an efficient graph drawing method designed to visualize the group structure of graphs. The layout communicates group sizes and both within-group and between-group network structures simultaneously. The layout is characterized by its composition of multiple elements, including nodes, edges, and boxes. However, there is limited empirical guidance on how these elements should be combined. In this paper, we measured participants' task performance and eye movements while identifying the group with the largest number of internal edges. We investigated the effect of visualization elements on task performance while controlling the density of internal edges and the box size. The results revealed that the box size in a GIB layout significantly affects the task accuracy either positively or negatively while eye-tracking data suggests that participants focused on internal edges, not the box size. These findings contribute empirical guidance for GIB layout design and lay the groundwork for future research as GIB layout becomes more widely used.
Authors:Pijuan Yu, Anzu Kawazoe, Alexis Urquhart, Thomas K. Ferris, M. Cynthia Hipwell, Rebecca F. Friesen
Abstract:
Remote palpation enables noninvasive tissue examination in telemedicine, yet current tactile displays often lack the fidelity to convey both large-scale forces and fine spatial details. This study introduces a hybrid fingertip display comprising a rigid platform and a $4\times4$ soft pneumatic tactile display (4.93 mm displacement and 1.175 N per single pneumatic chamber) to render a hard lump beneath soft tissue. This study compares three rendering strategies: a Platform-Only baseline that renders the total interaction force; a Hybrid A (Position + Force Feedback) strategy that adds a dynamic, real-time soft spatial cue; and a Hybrid B (Position + Preloaded Stiffness Feedback) strategy that provides a constant, pre-calculated soft spatial cue. In a 12-participant lump detection study, both hybrid methods dramatically improved accuracy over the Platform-Only baseline (from 50\% to over 95\%). While the Hybrid B was highlighted qualitatively for realism, its event-based averaging is expected to increase interaction latency in real-time operation. This suggests a trade-off between perceived lump realism and real-time responsiveness, such that rendering choices that enhance realism may conflict with those that minimize latency.
Authors:Leon A. Abdillah, Aisyah, Wahdyta Putri Panggabean, Sayfiyev Eldor Erkinovich
Abstract:
This article examines the knowledge of digital transformation of Small and Medium Enterprises (SMEs) that specialize in traditional handicrafts, with a specific emphasis on the Songket textile sector. The study investigates the use of digital technologies, notably blog platforms and the e-commerce site Shopee, to improve and streamline several business processes in Songket textile SMEs. The report takes a case study approach, diving into the experiences of Songket clothing enterprises that have undergone digital transformation. Key areas studied include the use of Blog platforms for brand development, marketing, and consumer involvement, as well as the Shopee E-Commerce platform for online sales and order processing. The essay seeks to give insights into the problems and possibilities faced by Songket cloth SMEs along their digital transformation journey by conducting in-depth observation, interviews, and surveys. The findings add to the scholarly discussion on the digitization of traditional industries, with practical implications for SMEs in the Songket textile sector and other handicraft areas. This study emphasizes the necessity of using digital technologies to preserve and expand traditional crafts, while also throwing light on the potential role of prominent E-Commerce platforms like Shopee in facilitating worldwide market access for such firms.
Authors:Houhao Liang, Azrin Jamaluddin, Kresimir Friganovic, Kirstie Neo, Raphael Han, Navrag Singh, Panos Mavros
Abstract:
Ensuring safe and inclusive mobility for vulnerable older adults is an emerging priority in urban planning. However, existing data sources such as surveys or GIS-based audits provide limited insight into how micro-scale built environment (BE) features influence real-world behavior and perception. This study presents a novel multimodal data-fusion approach that integrates wearable and environmental sensing to dynamically represent human-environment interactions and quantify the BE impacts on mobility among vulnerable older adults, specifically those with knee osteoarthritis or a history of falls. Data collected during naturalistic walking sessions in Singapore, are used to demonstrate this framework of synchronized streams from eye tracking, kinematic sensors, physiological monitors, GPS, and video recordings. Preliminary results show how AI-driven data fusion can uncover behaviorally and perceptually significant urban segments, providing a basis for actionable insights in inclusive design. This human-centered analytical approach advances the representation of urban environments from the perspective of vulnerable pedestrians, establishing a foundation for evidence-based, age-friendly city planning.
Authors:Berfin Ataman, Rodrigo Gallardo, Qilmeg Doudatcz
Abstract:
This study presents a comparative framework for evaluating emotional engagement with textile soft robots and their augmented-reality (AR) counterparts. Four robotic sculptures were developed, each embodying nature-inspired dynamic behaviors such as breathing and gradual deformation. Using a between-subjects design, two independent groups, one experiencing the physical installations and one engaging with their virtual (AR) twins, follow identical protocols and complete the same self-assessment survey on affective and perceptual responses. This approach minimizes carryover and novelty effects while enabling a direct comparison of sensations such as calmness, curiosity, and discomfort across modalities. The analysis explores how motion, form, and material behavior shape emotional interpretation in physical versus digital contexts, informing the design of hybrid systems that evoke meaningful, emotionally legible interactions between humans, robots, and digital twins.
Authors:Suqing Liu, Bogdan Simion, Christopher Eaton, Michael Liut
Abstract:
Feedback is a critical component of the learning process, particularly in computer science education. This study investigates the quality of feedback generated by Large Language Models (LLMs), Small Language Models (SLMs), compared with human feedback, in three computer science course with technical writing components: an introductory computer science course (CS2), a third-year advanced systems course (operating systems), and a third-year writing course (a topics course on artificial intelligence). Using a mixed-methods approach which integrates quantitative Likert-scale questions with qualitative commentary, we analyze the student perspective on feedback quality, evaluated based on multiple criteria, including readability, detail, specificity, actionability, helpfulness, and overall quality. The analysis reveals that in the larger upper-year operating systems course ($N=80$), SLMs and LLMs are perceived to deliver clear, actionable, and well-structured feedback, while humans provide more contextually nuanced guidance. As for the high-enrollment CS2 course ($N=176$) showed the same preference for the AI tools' clarity and breadth, but students noted that AI feedback sometimes lacked the concise, straight-to-the-point, guidance offered by humans. Conversely, in the smaller upper-year technical writing course on AI topics ($N=7$), all students preferred feedback from the course instructor, who was able to provide clear, specific, and personalized feedback, compared to the more general and less targeted AI-based feedback. We also highlight the scalability of AI-based feedback by focusing on its effectiveness at large scale. Our findings underscore the potential of hybrid approaches that combine AI and human feedback to achieve efficient and high-quality feedback at scale.
Authors:Cameron A. Nurse, Kelly Breen, Matthew McGuire, Sara Prokup, Arun Jayaraman, Quentin Sanders
Abstract:
Gait rehabilitation interventions targeting paretic propulsion can improve walking speed and function in individuals post-stroke. Previous work has demonstrated that real-time biofeedback targeting anterior ground reaction forces (AGRFs) can increase propulsion in individuals post-stroke, however this work was confined to lab-based treadmills, limiting practical utility. Here we investigate the short-term effects of real-time AGRF gait biofeedback during overground walking using wearable inertial measurement units (IMUs) and a haptic feedback device. Eight individuals with chronic post-stroke hemiparesis completed four 3-minute training bouts. During training, faded haptic biofeedback was provided to increase paretic AGRF during terminal stance. Gait biomechanics were assessed before, during, and after training, and during a retention test conducted without biofeedback after a rest period. The primary dependent variable was peak paretic AGRF, while secondary variables included paretic peak trailing limb angle (TLA), step length and walking speed. Compared to baseline, peak AGRF increased post-feedback and at the retention tests. Similar trends were observed in TLA, and step length, although these increases were not statistically significant while speed showed a significant change from baseline. Examining individual participants 63% participants (responders) increased AGRF at retention, while 37% experienced decreases (non-responders). Non-responders had lower physical capability, evidenced by two-minute walk distance at screening and AFO use during training, suggesting this intervention may suit patients with more residual ankle mobility and strength. Nonetheless our results suggest AGRF biofeedback can be implemented in practical settings with wearable systems and is a promising gait training strategy to target propulsive deficits in individuals post stroke.
Authors:Stewart Collis, Florence Kinyua, Vikram Kumar, Howard Lakougna, Christian Merz, Kirti Pandey, Christian Resch
Abstract:
We report technical learnings from five AI-based agricultural advisory MVPs deployed in Kenya and Bihar, India, under the AIEP Initiative. A 800-farmer study found high user satisfaction (NPS ~60). All solutions implement a modular two-part architecture: (i) an interface component (IVR /WhatsApp / app) with ASR-MT-TTS for multilingual voice access; and (ii) a reasoning component combining LLMs capabilities with query orchestration, external data (weather/soil/markets), and RAG over curated agricultural corpora. We describe key challenges: (a) latency, especially for voice; reductions were achieved via in-country hosting and audio minimization, but consistent <5s remains challenging; (b) language coverage: low-resource ASR/MT integration and nonstandard scripts hinder end-to-end quality; and (c) corpus curation: access, validation, and maintenance are labor-intensive, as well as provide recommendations on how to develop similar systems. We discuss common enablers including (a) data sharing, (b) common corpora, (c) better language AI and (d) evaluation and benchmarking. We also present golden Q&A sets to evaluate LLM capabilities for smallholder agriculture.
Authors:Baitong Xie, Mohd Fairuz Shiratuddin, Mostafa Hamadi, Joo Yeon Park, Thach-thao Duong
Abstract:
Gamification plays a pivotal role in enhancing user engagement in the Metaverse, particularly among Generation Z users who value autonomy, immersion, and identity expression. However, current research lacks a cohesive framework tailored to designing gamified social experiences in immersive virtual environments. This study presents a framework-oriented systematic literature review, guided by PRISMA 2020 and SPIDER, to investigate how gamification is applied in the Metaverse and how it aligns with the behavioral needs of Gen Z. From 792 screened studies, seventeen high-quality papers were synthesized to identify core gamification mechanics, including avatars, XR affordances, and identity-driven engagement strategies. Building on these insights, we propose the Affordance-Driven Gamification Framework (ADGF), a conceptual model for designing socially immersive experiences, along with a five-step design process to support its real-world application. Our contributions include a critical synthesis of existing strategies, Gen Z-specific design considerations, and a dual-framework approach to guide researchers and practitioners in developing emotionally engaging and socially dynamic Metaverse experiences.
Authors:V. El Sawah, A. Bhardwaj, A. Pryke-Hobbes, D. Gamaleldin, C. S. Ang, A. K. Martin
Abstract:
Clinical psychology students frequently report feeling underprepared for the interpersonal demands of therapeutic work, highlighting the need for accessible opportunities to practise core counselling skills before seeing real clients. Advances in artificial intelligence (AI) now enable simulated interaction partners that may support early skills development. This study examined postgraduate clinical psychology students' perceptions of two AI-based simulations: a text-based chatbot (ChatGPT) and a voice-based avatar (HeyGen). Twenty-four students completed two brief cognitive-behavioural role-plays (counterbalanced), one with each tool, and provided both quantitative ratings and qualitative feedback on perceived usefulness, skill application, responsiveness and engagement, and perceived skill improvement. Both AI tools were evaluated positively across dimensions. However, the avatar was rated significantly higher than the chatbot for perceived usefulness, skill application, and perceived skill improvement, and qualitative comments highlighted the added value of voice-based interaction for conveying social and emotional cues. These findings suggest that AI-driven simulation may supplement early-stage clinical skills training, with voice-based avatars offering additional benefits. Future work should test whether such simulated interactions translate to objective improvements in real therapeutic performance.
Authors:Meike Driessen, Selina Khan, Gonçalo Marcelino
Abstract:
This project explores how we engage with AI-generated content through the lens of the jutter: Dutch coastal foragers who comb the shoreline after storms, gathering and repurposing what the sea leaves behind. Reflecting how our lives are increasingly shaped by AI-generated media, we create a beach-like installation that blends real shoreline debris with AI-transformed images and videos. Visitors are invited to explore this space as contemporary jutters, deciding what to keep and what to discard. In doing so, the project reimagines AI-imagery as material for reflection, encouraging a more discerning engagement with the content that drifts through our feeds. A video preview of the installation can be found at https://www.youtube.com/watch?v=L6319Ii7MT8.
Authors:Geonwoo Bang, DongMyung Kim, Hayoung Oh
Abstract:
Large Language Models (LLMs) hold great potential for web-based interactive applications, including browser games, online education, and digital storytelling platforms. However, LLM-based conversational agents suffer from spatiotemporal distortions when responding to variant user inputs, failing to maintain consistency with provided scenarios. We propose SNAP (Story and Narrative-based Agent with Planning), a framework that structures narratives into Cells with explicit Plans to prevent narrative drift in web environments. By confining context within each Cell and employing detailed plans that specify spatiotemporal settings, character actions, and plot developments, SNAP enables coherent and scenario-consistent dialogues while adapting to diverse user responses. Via automated and human evaluations, we validate SNAP's superiority in narrative controllability, demonstrating effective scenario consistency despite variant user inputs in web-based interactive storytelling.
Authors:Christian Ellington, Paramahansa Pramanik, Haley K. Robinson
Abstract:
The popularity of electronic games has grown steadily in recent years, attracting a broad audience across age groups. With this growth comes a large volume of related data, prompting efforts like the PlayMyData to compile and share structured datasets for academic use. This study utilizes such a dataset to compare user review ratings across four current-generation gaming systems: Nintendo, Xbox, PlayStation, and PC. Statistical methods, including analysis of variance (ANOVA), were applied to identify differences in average scores among these platforms. The findings indicate that PC titles tend to receive the most favorable user feedback, followed by Xbox and PlayStation, while Nintendo games showed the lowest average ratings. These patterns suggest that the platform on which a game is released may influence how players evaluate their experience. Such results may be valuable to developers and industry stakeholders in making informed decisions about future investments and development priorities.
Authors:Rubel Hassan Mollik, Vamsi Krishna Kosuri, Hans Djalali, Stephanie Ludi, Aboubakar Mountapmbeme
Abstract:
Block-based programming environments (BBPEs) such as Scratch and Code.org are now widely used in K-12 computer science classes, but they remain mostly inaccessible to blind or visually impaired (BVI) learners. A major problem is that prior accessibility solutions have relied on modifications to the Blockly library, making them difficult to apply in existing BBPEs and thereby limiting adoption. We present an Extension-based Accessibility Framework (EAF) to make BBPEs accessible for BVI students. The framework uses a modular architecture that enables seamless integration with existing Blockly-based BBPEs. We present an innovative three-dimensional (3D) hierarchical navigation model featuring stack labeling and block numbering, mode-based editing to prevent accidental modifications, and WAI-ARIA implementation to ensure compatibility with external screen readers. We evaluated our approach by integrating the EAF framework into two BBPEs (covering 177 test cases) and conducting semi-structured interviews with four participants using VoiceOver, JAWS, and NVDA. Participants reported clearer spatial orientation and easier mental model formation compared to default Blockly keyboard navigation. EAF shows that modular architecture can provide comprehensive accessibility while ensuring compatibility with existing BBPEs.
Authors:Ishani Kanapathipillai, Obhasha Priyankara
Abstract:
The evolution of User Interface design has emphasized the need for efficient, reusable, and editable components to ensure an efficient design process. This research introduces CoGen, a system that uses machine learning techniques to generate reusable UI components directly in Figma, one of the most popular UI design tools. Addressing gaps in current systems, CoGen focuses on creating atomic components such as buttons, labels, and input fields using structured JSON and natural language prompts. The project integrates Figma API data extraction, Seq2Seq models, and fine-tuned T5 transformers for component generation. The key results demonstrate the efficiency of the T5 model in prompt generation, with an accuracy of 98% and a BLEU score of 0.2668, which ensures the mapping of JSON to descriptive prompts. For JSON creation, CoGen achieves a success rate of up to 100% in generating simple JSON outputs for specified component types.
Authors:Raphael Buchmüller, Dennis Collaris, Linhao Meng, Angelos Chatzimparmpas
Abstract:
Dimensionality reduction is a powerful technique for revealing structure and potential clusters in data. However, as the axes are complex, non-linear combinations of features, they often lack semantic interpretability. Existing visual analytics (VA) methods support cluster interpretation through feature comparison and interactive exploration, but they require technical expertise and intense human effort. We present \textit{LangLasso}, a novel method that complements VA approaches through interactive, natural language descriptions of clusters using large language models (LLMs). It produces human-readable descriptions that make cluster interpretation accessible to non-experts and allow integration of external contextual knowledge beyond the dataset. We systematically evaluate the reliability of these explanations and demonstrate that \langlasso provides an effective first step for engaging broader audiences in cluster interpretation. The tool is available at https://langlasso.vercel.app
Authors:David Elsweiler, Christine Elsweiler, Anna Ziegner
Abstract:
Politeness is a core dimension of human communication, yet its role in human-AI information seeking remains underexplored. We investigate how user politeness behaviour shapes conversational outcomes in a cooking-assistance setting. First, we annotated 30 dialogues, identifying four distinct user clusters ranging from Hyperpolite to Hyperefficient. We then scaled up to 18,000 simulated conversations across five politeness profiles (including impolite) and three open-weight models. Results show that politeness is not only cosmetic: it systematically affects response length, informational gain, and efficiency. Engagement-seeking prompts produced up to 90% longer replies and 38% more information nuggets than hyper-efficient prompts, but at markedly lower density. Impolite inputs yielded verbose but less efficient answers, with up to 48% fewer nuggets per watt-hour compared to polite input. These findings highlight politeness as both a fairness and sustainability issue: conversational styles can advantage or disadvantage users, and "polite" requests may carry hidden energy costs. We discuss implications for inclusive and resource-aware design of information agents.
Authors:Adam Bradley, John Hastings, Khandaker Mamun Ahmed
Abstract:
The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies, particularly in the realm of intelligent conversational agents. Chatbots have evolved into sophisticated AI-driven systems capable of automating complex workflows, including policy recommendation and claims triage, while simultaneously enabling dynamic, context-aware user engagement. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents. Leveraging natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, Axlerod demonstrates robust capabilities in parsing user intent, accessing structured policy databases, and delivering real-time, contextually relevant responses. Experimental results underscore Axlerod's effectiveness, achieving an overall accuracy of 93.18% in policy retrieval tasks while reducing the average search time by 2.42 seconds. This work contributes to the growing body of research on enterprise-grade AI applications in insurtech, with a particular focus on agent-assistive rather than consumer-facing architectures.
Authors:Yuki Kobayashi, Koichi Toida
Abstract:
Extended Reality (XR) affords an enhanced sense of bodily presence that supports experiential modes of comprehension and affective engagement which exceed the possibilities of conventional information delivery. Nevertheless, the psychological processes engendered by XR, and the manner in which these processes inform subsequent behavioural intentions, remain only partially delineated. The present study addresses this issue within an applied context by comparing non-immersive 2D viewing advertising with immersive XR experiential advertising. We examined whether XR strengthens internal responses to a product, specifically perceived comprehension and empathy, and whether these responses, in turn, influence the behavioural outcome of purchase intention. A repeated-measures two-way ANOVA demonstrated a significant main effect of advertising modality, with XR yielding higher ratings on all evaluative dimensions. Mediation analysis further indicated that the elevation in purchase intention was mediated by empathy, whereas no significant mediating effect was observed for comprehension within the scope of this study. These findings suggest that immersive XR experiences augment empathic engagement with virtual products, and that this enhanced empathy plays a pivotal role in shaping subsequent behavioural intentions.
Authors:Sumin Hong, Jewoong Moon, Taeyeon Eom, Juno Hwang, Jibeom Seo
Abstract:
This chapter examines how data analytics can be leveraged to enhance immersive teacher simulations, situating this inquiry within the broader learning sciences discourse on embodied cognition, data-informed feedback, and teacher professional learning. It explores both conceptual foundations and empirical cases to illustrate how analytics serve as mediational tools that connect immersive experiences with reflective teaching practice. The chapter unfolds in multiple sections: (1) The Innovation Journey: An Overview of Immersive Teacher Simulations outlines the evolution from traditional simulations to XR-based environments, highlighting the need for professional decision-making under realistic constraints. (2) Innovation in Existing Research and Practice situates teacher analytics within the trajectory from descriptive observation to multimodal and predictive modeling. (3) Study Approach and Design details how multimodal data-discourse, behavior, and gaze-from the TeacherGen@i simulation were collected and organized to reveal cognitive distribution of pedagogical discourse and interaction patterns. (4) Findings present the cognitive distribution of preservice teachers' pedagogical discourse and the sequential interaction patterns that emerge in exchange, illustrating how multimodal analytics make pedagogical reasoning processes visible within immersive simulations. (5) Understanding Innovative Practices in Teacher Education examines teaching analytics to enhance immersive teacher simulation based on the findings of the study. (6) Key Takeaways of the Innovation Journey identifies research challenges and design implications for scalable, analytics-enhanced teacher education. Together, these sections position immersive teacher simulations as a pivotal testbed for aligning learning analytics, professional learning, and next-generation immersive learning environment design.
Authors:Roshni Kaushik, Reid Simmons
Abstract:
People can respond to feedback and guidance in different ways, and it is important for robots to personalize their interactions and utilize verbal and nonverbal communication cues. We aim to understand how older adults respond to different cadences of verbal and nonverbal feedback of a robot exercise coach. We conducted an online study of older adults, where participants evaluated videos of the robot giving feedback at different cadences for each modality. The results indicate that changing the cadence of one modality affects the perception of both it and the other modality. We can use the results from this study to better design the frequency of the robot coach's feedback during an exercise session with this population.
Authors:Elia Moscoso-Thompson, Katia Lupinetti, Irene Capasso, Fabrizio Ravicchio, Brigida Bonino, Franca Giannini, Andrea Canessa, Silvio Sabatini, Lucia Ferlino, Chiara Malagoli
Abstract:
Every day life tasks can present significant challenges for neurodivergent individuals, particularly those with Autism Spectrum Disorders (ASD) who are characterized by specific sensitivities. This contribution describes a virtual reality system that allows neurodivergent individuals to experience everyday situations in order to practice and implement strategies for overcoming their daily challenges. The key strength of the proposed system is the automatic personalization of the virtual environment, based on both the individual's abilities and their specific training needs. The proposed method has been evaluated on four synthetic user profiles, also proposing a metric able to evaluate the variance of the features within the same difficulty level. The results show that the method can produce a significant number of scenarios for the various difficulty levels. Furthermore, within the same difficulty, there is a wide variance of the non-constrained features for the specific profile.
Authors:Shuxian Li, Tianyue Wang, Chris Twombly
Abstract:
With the development of virtualization and AI, real-time facial avatar animation is widely used in entertainment, office, business and other fields. Against this background, blendshapes have become a common industry animation solution because of their relative simplicity and ease of interpretation. Aiming for real-time performance and low computing resource dependence, we independently developed an accurate blendshape prediction system for low-power VR applications using a standard webcam. First, blendshape feature vectors are extracted through affine transformation and segmentation. Through further transformation and regression analysis, we were able to identify models for most blendshapes with significant predictive power. Post-processing was used to further improve response stability, including smoothing filtering and nonlinear transformations to minimize error. Experiments showed the system achieved accuracy similar to ARKit 6. Our model has low sensor/hardware requirements and realtime response with a consistent, accurate and smooth visual experience.
Authors:Liberty Kent, Nilufer Tuptuk, Ingolf Becker
Abstract:
Effective shift transitions are crucial for cybersecurity incident response teams, yet there is limited guidance on managing these handovers. This exploratory study aimed to develop guidelines for such transitions through the analysis of existing literature and consultation with practitioners. Two draft guidelines (A and B) were created based on existing literature and online resources. Six participants from the UK and international incident response teams, with experience in shift handovers, were interviewed about handover structure, challenges, training practices, and their views on the draft guidelines. The collected data indicate the importance of signposting, evolving handover procedures, individual differences in handover style and detail, and streamlining the handover procedure. Participants agreed the drafts included all relevant details but suggested adding a post-incident review section and a service section for outages or technical difficulties. This study establishes a foundation for enhancing transition practices in cybersecurity incident response teams.
Authors:Stinne Zacho, Chris Hall, Jakob Kusnick, Stefan Jänicke
Abstract:
This paper explores the potential of digital reconstruction and interactive storytelling to preserve historically suppressed sites. The main objective of an interdisciplinary team of data scientists from the MEMORISE project and associates of the memory association Asociacion Recuerdo y Dignidad was to preserve the memory of the Francoist Santa Clara concentration camp in Soria, Spain, through the use of digital technology. Combining archival research, 3D modelling, 360-degree photography, and web development, a prototype digital platform was created to visualise the transformation of the site across three historical phases: its origin as a convent, its use as a Francoist concentration camp, and its present-day condition. The platform allows users to navigate through spatial and temporal layers. Clickable media markers encourage exploration and interaction. Drawing on principles of participatory design, narrative visualisation, and open-ended user engagement, the project demonstrates how digital tools can support memory work, public engagement, and historical reflection. Our low-cost concept is especially adaptable to other physical sites that have been erased or forgotten.
Authors:Theodore Roberts, Bahram Zarrin
Abstract:
Agentic artificial intelligence systems are autonomous technologies capable of pursuing complex goals with minimal human oversight and are rapidly emerging as the next frontier in AI. While these systems promise major gains in productivity, they also raise new ethical challenges. Prior research has examined how different populations prioritize Responsible AI values, yet little is known about how practitioners actually reason through the trade-offs inherent in designing these autonomous systems. This paper investigates the ethical reasoning of AI practitioners through qualitative interviews centered on structured dilemmas in agentic AI deployment. We find that the responses of practitioners do not merely reflect value preferences but rather align with three distinct reasoning frameworks. First is a Customer-Centric framework where choices are justified by business interests, legality, and user autonomy. Second is a Design-Centric framework emphasizing technical safeguards and system constraints. Third is an Ethics-Centric framework prioritizing social good and moral responsibility beyond compliance. We argue that these frameworks offer distinct and necessary insights for navigating ethical trade-offs. Consequently, providers of agentic AI must look beyond general principles and actively manage how these diverse reasoning frameworks are represented in their decision-making processes to ensure robust ethical outcomes.
Authors:Sahibpreet Singh, Pawan Kumar
Abstract:
This chapter explores the complexities of sports governance, taxation, dispute resolution, and the impact of digital transformation within the sports sector. This study identifies a critical research gap regarding the integration of innovative technologies to enhance governance and talent identification in sports law. The objective is to evaluate how data-driven approaches and AI can optimize recruitment processes; also ensuring compliance with existing regulations. A comprehensive analysis of current governance structures and taxation policies,(ie Income Tax Act and GST Act), reveals preliminary results indicating that reform is necessary to support sustainable growth in the sports economy. Key findings demonstrate that AI enhances player evaluation by minimizing biases and expanding access to diverse talent pools. While the Court of Arbitration for Sport provides an efficient mechanism for dispute resolution. The implications emphasize the need for regulatory reforms that align taxation policies with international best practices, promoting transparency and accountability in sports organizations. This research contributes valuable insights into the evolving dynamics of sports management, aiming to foster innovation and integrity in the industry.
Authors:Anna Katharina Holl-Etten, Nina Schnaderbeck, Elizaveta Kosareva, Leonhard Aron Prattke, Ralph Krueger, Lisa Marie Warner, Nora C. Vetter
Abstract:
The rapid development of language-based artificial intelligence (AI) offers new possibilities for psychotherapy and assistive systems, particularly benefitting autistic individuals who often respond well to technology. Parents of autistic persons emphasize the importance of appropriate and context-specific communication behavior. This study investigated whether GPT-3.5 Turbo and GPT-4, as language-based AI applications, are fundamentally capable of replicating this type of adequate communication behavior in the form of applied Theory of Mind (ToM). GPT-3.5 Turbo and GPT-4 were evaluated on three established higher-order ToM tasks: the Faux Pas Test, the Social Stories Questionnaire, and the Story Comprehension Test in English and German. Two independent raters scored response accuracy based on standardized manuals. In addition, responses were rated for epistemic markers as indicators of uncertainty. GPT's results were compared to human neurotypical and neurodivergent samples from previous own and others' research. GPT-4 achieved near human accuracy on the Faux Pas Test and outperformed GPT-3.5 Turbo and individuals with autistic traits. On the Social Stories Questionnaire, GPT-4 scored comparable to neurotypical adults, while GPT-3.5 Turbo remained well below. In the Story Comprehension Test, GPT-4 reached scores that exceeded neurotypical adult and adolescent benchmarks. However, GPT-4 used epistemic markers in up to 42% of responses. GPT-4 shows encouraging performance in complex higher-order ToM tasks and may offer future potential as an assistive tool for individuals with (and without) social communication difficulties. Its ability to interpret complex social situations is promising; however, the frequent use of uncertainty markers highlights the need for further study for assistive use and possibly further refinement to ensure consistent and reliable support in real-world use.
Authors:Kévin Ducharlet, Liwen Zhang, Sara Maqrot, Houssem Saidi
Abstract:
Industrial timetabling is a critical task for decision-makers across various sectors to ensure efficient system operation. In real-world settings, it remains challenging because unexpected events often disrupt execution. When such events arise, effective rescheduling and collaboration between humans and machines becomes essential. This paper presents a recommendation system-based framework for handling rescheduling challenges, built on Timefold, a powerful AI-driven planning engine. Our experimental study evaluates nine instances inspired by a realworld preventive maintenance use case, aiming to identify the heuristic that best balances solution quality and computing time to support near-optimal decisionmaking when rescheduling is required due to unexpected events during operational days. Finally, we illustrate the complete process of our recommendation system through a simple use case.
Authors:Lucija Mihić Zidar, Philipp Wicke, Praneel Bhatia, Rosa Lutz, Marius Klug, Thorsten O. Zander
Abstract:
Passive brain-computer interfaces offer a potential source of implicit feedback for alignment of large language models, but most mental state decoding has been done in controlled tasks. This paper investigates whether established EEG classifiers for mental workload and implicit agreement can be transferred to spoken human-AI dialogue. We introduce two conversational paradigms - a Spelling Bee task and a sentence completion task- and an end-to-end pipeline for transcribing, annotating, and aligning word-level conversational events with continuous EEG classifier output. In a pilot study, workload decoding showed interpretable trends during spoken interaction, supporting cross-paradigm transfer. For implicit agreement, we demonstrate continuous application and precise temporal alignment to conversational events, while identifying limitations related to construct transfer and asynchronous application of event-based classifiers. Overall, the results establish feasibility and constraints for integrating passive BCI signals into conversational AI systems.
Authors:Kaichun Wang, Yanguang Chen, Ting Zhang, Mengyao Bao, Keyu Chen, Xu Hu, Yongliang Wang, Jingsheng Yang, Jinsong Zhang, Fei Lu
Abstract:
LLM-based conversational systems have become a popular gateway for information access, yet most existing chatbots struggle to handle news-related trending queries effectively. To improve user experience, an effective trending query detection method is urgently needed to enable differentiated processing of such target traffic. However, current research on trending detection tailored to the dialogue system scenario remains largely unexplored, and methods designed for traditional search engines often underperform in conversational contexts due to radically distinct query distributions and expression patterns. To fill this gap, we propose a multi-stage framework for trending detection, which achieves systematic optimization from both offline generation and online identification perspectives. Specifically, our framework first exploits selected hot events to generate index queries, establishing a key bridge between static events and dynamic user queries. It then employs a retrieval matching mechanism for real-time online detection of trending queries, where we introduce a cascaded recall and ranking architecture to balance detection efficiency and accuracy. Furthermore, to better adapt to the practical application scenario, our framework adopts a single-recall module as a cold-start strategy to collect online data for fine-tuning the reranker. Extensive experiments demonstrate that our framework significantly outperforms baseline methods in both offline evaluations and online A/B tests, and user satisfaction is relatively improved by 27\% in terms of positive-negative feedback ratio.
Authors:Jin Gao, Saichandu Juluri
Abstract:
We present a framework that extends the Actor-Critic architecture to creative 3D modeling through multi-agent self-reflection and human-in-the-loop supervision. While existing approaches rely on single-prompt agents that directly execute modeling commands via tools like Blender MCP, our approach introduces a Planner-Actor-Critic architecture. In this design, the Planner coordinates modeling steps, the Actor executes them, and the Critic provides iterative feedback, while human users act as supervisors and advisors throughout the process. Through systematic comparison between single-prompt modeling and our reflective multi-agent approach, we demonstrate improvements in geometric accuracy, aesthetic quality, and task completion rates across diverse 3D modeling scenarios. Our evaluation reveals that critic-guided reflection, combined with human supervisory input, reduces modeling errors and increases complexity and quality of the result compared to direct single-prompt execution. This work establishes that structured agent self-reflection, when augmented by human oversight and advisory guidance, produces higher-quality 3D models while maintaining efficient workflow integration through real-time Blender synchronization.
Authors:Takafumi Sakamoto, Yugo Takeuchi
Abstract:
Communication robots often need to initiate conversations with people in public spaces. At the same time, such robots must not disturb pedestrians. To handle these two requirements, an agent needs to estimate the communication desires of others based on their behavior and then adjust its own communication activities accordingly. In this study, we construct a computational spatial interaction model that considers others. Consideration is expressed as a quantitative parameter: the amount of adjustment of one's internal state to the estimated internal state of the other. To validate the model, we experimented with a human and a virtual robot interacting in a VR environment. The results show that when the participant moves to the target, a virtual robot with a low consideration value inhibits the participant's movement, while a robot with a higher consideration value did not inhibit the participant's movement. When the participant approached the robot, the robot also exhibited approaching behavior, regardless of the consideration value, thus decreasing the participant's movement. These results appear to verify the proposed model's ability to clarify interactions with consideration for others.
Authors:Israt Jahan Chowdhury, Md Abu Yousuf Tanvir
Abstract:
Detection systems that utilize machine learning are progressively implemented at Security Operations Centers (SOCs) to help an analyst to filter through high volumes of security alerts. Practically, such systems tend to reveal probabilistic results or confidence scores which are ill-calibrated and hard to read when under pressure. Qualitative and survey based studies of SOC practice done before reveal that poor alert quality and alert overload greatly augment the burden on the analyst, especially when tool outputs are not coherent with decision requirements, or signal noise. One of the most significant limitations is that model confidence is usually shown without expressing that there are asymmetric costs in decision making where false alarms are much less harmful than missed attacks. The present paper presents a decision-sensitive trust signal correspondence scheme of SOC alert triage. The framework combines confidence that has been calibrated, lightweight uncertainty cues, and cost-sensitive decision thresholds into coherent decision-support layer, instead of making changes to detection models. To enhance probabilistic consistency, the calibration is done using the known post-hoc methods and the uncertainty cues give conservative protection in situations where model certainty is low. To measure the model-independent performance of the suggested model, we apply the Logistic Regression and the Random Forest classifiers to the UNSW-NB15 intrusion detection benchmark. According to simulation findings, false negatives are greatly amplified by the presence of misaligned displays of confidence, whereas cost weighted loss decreases by orders of magnitude between models with decision aligned trust signals. Lastly, we describe a human-in-the-loop study plan that would allow empirically assessing the decision-making of the analysts with aligned and misaligned trust interfaces.
Authors:Vivian Lai, Zana Buçinca, Nil-Jana Akpinar, Mo Houtti, Hyeonsu B. Kang, Kevin Chian, Namjoon Suh, Alex C. Williams
Abstract:
Proactive AI writing assistants need to predict when users want drafting help, yet we lack empirical understanding of what drives preferences. Through a factorial vignette study with 50 participants making 750 pairwise comparisons, we find compositional effort dominates decisions ($ρ= 0.597$) while urgency shows no predictive power ($ρ\approx 0$). More critically, users exhibit a striking perception-behavior gap: they rank urgency first in self-reports despite it being the weakest behavioral driver, representing a complete preference inversion. This misalignment has measurable consequences. Systems designed from users' stated preferences achieve only 57.7\% accuracy, underperforming even naive baselines, while systems using behavioral patterns reach significantly higher 61.3\% ($p < 0.05$). These findings demonstrate that relying on user introspection for system design actively misleads optimization, with direct implications for proactive natural language generation (NLG) systems.
Authors:William Franz Lamberti, Sunbin Kim, Samantha Rose Lawrence
Abstract:
The emergence of generative AI (GAI) has sparked diverse opinions regarding its appropriate use across various domains, including education. This pilot study investigates university students' perceptions of GAI in higher education classrooms, aiming to lay the groundwork for understanding these attitudes. With a participation rate of approximately 4.4%, the study highlights the challenges of engaging students in GAI-related research and underscores the need for larger sample sizes in future studies. By gaining insights into student perspectives, instructors can better prepare to integrate discussions of GAI into their classrooms, fostering informed and critical engagement with this transformative technology.
Authors:Vilém Zouhar, Tom Kocmi
Abstract:
Human evaluation is the gold standard for multilingual NLP, but is often skipped in practice and substituted with automatic metrics, because it is notoriously complex and slow to set up with existing tools with substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and provides support for evaluating multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, or MQM, but is also extensible to allow prototyping new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations and both static and active learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
Authors:Manuela Chessa, Michela Chessa, Lorenzo Gerini, Matteo Martini, Kaloyana Naneva, Fabio Solari
Abstract:
Digital platforms increasingly support collective action initiatives, yet coordinating geographically dispersed users through digital interfaces remains challenging, particularly in threshold settings where success requires critical mass participation. This study investigates how avatar-based social representation in Virtual Reality (VR) influences coordination in threshold collective action problems. Through a randomized controlled experiment with 188 participants organized in 94 pairs, we examine whether brief avatar exposure affects perceived co-presence and coordination outcomes in a two-player threshold public goods game implemented as a real-effort recycling task. We manipulate a single design feature: participants either briefly interact through avatars before the main task (Pre-Task Avatar treatment) or complete an equivalent activity individually without peer visibility (No Pre-Task Avatar treatment). Our findings reveal that minimal avatar exposure significantly increases perceived co-presence and improves strategic coordination, though not through increased contribution quantity. Participants exposed to peer avatars achieve higher social welfare by coordinating to avoid wasteful over-contribution beyond the threshold. Additionally, we identify VR presence-the sense of 'being there' in the virtual environment-as a stronger predictor of task performance than co-presence itself. This research contributes to Information Systems theory by establishing causal pathways from specific design features to presence to coordination outcomes, demonstrates VR as a rigorous experimental methodology for IS research, and provides actionable insights for designing collaborative platforms supporting sustainability initiatives and threshold collective action problems.
Authors:Mohammad Mahdi Habibi Bina, Sepideh Baghernezhad, Mohammad Reza Daliri, Mohammad Hassan Moradi
Abstract:
Current neural interfaces such as brain-computer interfaces (BCIs) face several fundamental challenges, including frequent recalibration due to neuroplasticity and session-to-session variability, real-time processing latency, limited personalization and generalization across subjects, hardware constraints, surgical risks in invasive systems, and cognitive burden in patients with neurological impairments. These limitations significantly affect the accuracy, stability, and long-term usability of BCIs. This article introduces the concept of the Neural Digital Twin (NDT) as an advanced solution to overcome these barriers. NDT represents a dynamic, personalized computational model of the brain-BCI system that is continuously updated with real-time neural data, enabling prediction of brain states, optimization of control commands, and adaptive tuning of decoding algorithms. The design of NDT draws inspiration from the application of Digital Twin technology in advanced industries such as aerospace and autonomous vehicles, and leverages recent advances in artificial intelligence and neuroscience data acquisition technologies. In this work, we discuss the structure and implementation of NDT and explore its potential applications in next-generation BCIs and neural decoding, highlighting its ability to enhance precision, robustness, and individualized control in neurotechnology.
Authors:Ka-Yan Fung, Yuxing Tao, Tze-Leung, Rick Lui, Kuen-Fung Sin
Abstract:
Hong Kong's education system is notably multicultural, including local, non-Chinese-speaking, and newly arrived students (NAS) (Mandarine Chinese-speaking). NAS can guess the meaning of vocabulary but cannot speak out, presenting unique challenges for them, particularly language barriers and cultural differences. These challenges hinder their academic success and social integration, leading to feelings of isolation and demotivation. Current resources often fail to address the emotional well-being of these students and predominantly focus on English language acquisition, leaving a gap in support for learning Cantonese and navigating the local cultural landscape. This study explores the effectiveness of an interactive robot, Boon Boon, in teaching Cantonese through real-life contexts to enhance NAS children learning engagement and motivation. The research questions are: (1) How does interactive robot-empowered scenario learning influence the learning engagement and motivation of NAS in learning Cantonese? and (2) What is the impact of a robot-empowered scenario learning system on the Cantonese language proficiency of NAS? Fourteen children are invited to participate in a four-day learning program with Boon Boon. The preliminary result indicated that Boon Boon drove students' attention to learning and academic achievement. Future research will focus on long-term assessments of robot-empowered learning's effectiveness and explore the scalability of this approach across diverse educational settings and cultural backgrounds.
Authors:Rafael Wampfler, Chen Yang, Dillon Elste, Nikola Kovacevic, Philine Witzig, Markus Gross
Abstract:
From movie characters to modern science fiction - bringing characters into interactive, story-driven conversations has captured imaginations across generations. Achieving this vision is highly challenging and requires much more than just language modeling. It involves numerous complex AI challenges, such as conversational AI, maintaining character integrity, managing personality and emotions, handling knowledge and memory, synthesizing voice, generating animations, enabling real-world interactions, and integration with physical environments. Recent advancements in the development of foundation models, prompt engineering, and fine-tuning for downstream tasks have enabled researchers to address these individual challenges. However, combining these technologies for interactive characters remains an open problem. We present a system and platform for conveniently designing believable digital characters, enabling a conversational and story-driven experience while providing solutions to all of the technical challenges. As a proof-of-concept, we introduce Digital Einstein, which allows users to engage in conversations with a digital representation of Albert Einstein about his life, research, and persona. While Digital Einstein exemplifies our methods for a specific character, our system is flexible and generalizes to any story-driven or conversational character. By unifying these diverse AI components into a single, easy-to-adapt platform, our work paves the way for immersive character experiences, turning the dream of lifelike, story-based interactions into a reality.
Authors:Giuseppe Canale, Kashyap Thimmaraju
Abstract:
Large Language Models (LLMs) are rapidly transitioning from conversational assistants to autonomous agents embedded in critical organizational functions, including Security Operations Centers (SOCs), financial systems, and infrastructure management. Current adversarial testing paradigms focus predominantly on technical attack vectors: prompt injection, jailbreaking, and data exfiltration. We argue this focus is catastrophically incomplete. LLMs, trained on vast corpora of human-generated text, have inherited not merely human knowledge but human \textit{psychological architecture} -- including the pre-cognitive vulnerabilities that render humans susceptible to social engineering, authority manipulation, and affective exploitation. This paper presents the first systematic application of the Cybersecurity Psychology Framework (\cpf{}), a 100-indicator taxonomy of human psychological vulnerabilities, to non-human cognitive agents. We introduce the \textbf{Synthetic Psychometric Assessment Protocol} (\sysname{}), a methodology for converting \cpf{} indicators into adversarial scenarios targeting LLM decision-making. Our preliminary hypothesis testing across seven major LLM families reveals a disturbing pattern: while models demonstrate robust defenses against traditional jailbreaks, they exhibit critical susceptibility to authority-gradient manipulation, temporal pressure exploitation, and convergent-state attacks that mirror human cognitive failure modes. We term this phenomenon \textbf{Anthropomorphic Vulnerability Inheritance} (AVI) and propose that the security community must urgently develop ``psychological firewalls'' -- intervention mechanisms adapted from the Cybersecurity Psychology Intervention Framework (\cpif{}) -- to protect AI agents operating in adversarial environments.
Authors:Sanjida Islam Era, Ishika Tarin Ime, A. B. M. Alim Al Islam
Abstract:
Ensuring digital accessibility is essential for inclusive access to online services. However, many government and non-government websites that provide critical services - such as education, healthcare, and public administration - continue to exhibit significant accessibility and usability barriers. This study evaluates the accessibility of Bangladeshi government and non-government websites under WCAG~2.2 by combining automated accessibility assessments with user-reported feedback. A total of 212 websites were analyzed using multiple automated tools, complemented by a survey of 103 users to capture real-world usability, accessibility, and security experiences. The results reveal substantial disparities between government and non-government websites, highlighting persistent issues related to navigation complexity, interaction cost, visual readability, accessibility feature adoption, and authentication mechanisms. While non-government websites generally demonstrate better usability and functional performance, accessibility support remains inconsistent across both categories. The findings underscore the need for regular accessibility audits, user-centered design practices, and policy-driven interventions to improve digital inclusivity and ensure equitable access to online services for diverse user populations.
Authors:Yaqi Duan, Yichun Hu, Jiashuo Jiang
Abstract:
Inventory management remains a challenge for many small and medium-sized businesses that lack the expertise to deploy advanced optimization methods. This paper investigates whether Large Language Models (LLMs) can help bridge this gap. We show that employing LLMs as direct, end-to-end solvers incurs a significant "hallucination tax": a performance gap arising from the model's inability to perform grounded stochastic reasoning. To address this, we propose a hybrid agentic framework that strictly decouples semantic reasoning from mathematical calculation. In this architecture, the LLM functions as an intelligent interface, eliciting parameters from natural language and interpreting results while automatically calling rigorous algorithms to build the optimization engine. To evaluate this interactive system against the ambiguity and inconsistency of real-world managerial dialogue, we introduce the Human Imitator, a fine-tuned "digital twin" of a boundedly rational manager that enables scalable, reproducible stress-testing. Our empirical analysis reveals that the hybrid agentic framework reduces total inventory costs by 32.1% relative to an interactive baseline using GPT-4o as an end-to-end solver. Moreover, we find that providing perfect ground-truth information alone is insufficient to improve GPT-4o's performance, confirming that the bottleneck is fundamentally computational rather than informational. Our results position LLMs not as replacements for operations research, but as natural-language interfaces that make rigorous, solver-based policies accessible to non-experts.
Authors:Kai Liu, Michelle L. Aebersold, Mark Lindquist, Haoting Gao
Abstract:
Hospitals are among the most cognitively demanding indoor environments, especially for patients and visitors unfamiliar with their layout. This study investigates the effectiveness of an augmented reality (AR)-based handheld navigation system compared to traditional paper maps in a large hospital setting. Through a mixed-methods experiment with 32 participants, we measured navigation performance, cognitive workload (NASA-TLX), situational anxiety (STAI-State), spatial behavior, and user satisfaction. Results show that AR users completed navigation tasks significantly faster, made fewer errors, and reported lower anxiety and workload. However, paper map users demonstrated stronger spatial memory in sketch-based recall tasks, highlighting a trade-off between real-time efficiency and long-term spatial learning. We discuss implications for inclusive AR design, spatial cognition, and healthcare accessibility, offering actionable design strategies for adaptive indoor navigation tools.
Authors:Anna Mikeda
Abstract:
Recent studies reveal a paradox: AI enhances individual creative outputs while reducing collective diversity. Current explanations -- cognitive offloading and over-reliance -- identify symptoms but not mechanisms. We propose selective metacognitive adaptation: routine AI use redistributes rather than uniformly diminishes metacognitive effort. Some capacities are amplified (partner modeling, surface control), while others are systematically under-supported (originality evaluation, reflective integration). This redistribution explains both individual satisfaction and collective convergence. We present a taxonomy of six metacognitive capacities organized by temporal phase, characterize their tendencies under routine AI use, and show how individually rational adaptation produces emergent social costs. The framework generates specific predictions for researchers and design principles for practitioners seeking to preserve both individual creative satisfaction and collective creative diversity.
Authors:Patrick Keough
Abstract:
Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.
Authors:A. Mayeux
Abstract:
Formal rigor distinguishes mathematics from other disciplines, in the sense that mathematical statements are derived from explicit axioms by logically verifiable steps. Interactive theorem provers support this by expressing definitions, theorems, and proofs in a fully formal language and verifying them mechanically. We consider the benchmark problem of formalizing all published mathematics as a machine verifiable and continuously updated corpus of mathematical knowledge. This viewpoint treats mathematics as a structured database of interdependent results and raises questions about scalability and organization of large formal libraries. As a case study, we present an ongoing formalization in categorical algebra, namely dilatations of categories, extending classical localizations and illustrating what such an implementation looks like in practice.
Authors:Ariton Verush
Abstract:
Multimodal user interfaces increasingly combine speech, gesture, vision, gaze, touch, biosignals, and other sensor data. Recent toolkits from the past five years, such as Geno, Multisensor-Pipeline (MSP), ReactGenie, and EmoSync, aim to make it easier for developers to prototype such interfaces, while older work such as WAMI shows how early web-based multimodal systems were conceived. Yet the field still lacks a systematic and reusable way to compare what these toolkits actually support, how much implementation work they offload from developers, and which evaluation strategies are appropriate for them. This paper reframes an HCI seminar draft into a benchmarking framework paper for multimodal user interface toolkits. Rather than reporting completed empirical results, it proposes a structured benchmark based on document analysis, technical comparison, and a future developer-based evaluation. The framework is organized around three dimensions: modality coverage and interaction abstraction, developer experience and workflow, and experimental and integration support. The paper illustrates the framework through five representative toolkits: Geno, MSP, ReactGenie, WAMI, and EmoSync. The contribution is a reusable benchmark template that future researchers can instantiate with empirical measurements, developer studies, and additional multimodal toolkits.
Authors:Annie Yuan
Abstract:
Intangible Cultural Heritage (ICH) education has traditionally relied on apprenticeship, embodied participation, and long-term engagement with masters, materials, and cultural environments. While these modes of transmission remain essential, they are difficult to scale. Existing digital heritage initiatives have expanded documentation and access, but often preserve artefacts, procedures, and representations of practice rather than the aesthetic and cognitive structures through which expertise operates. This paper argues that the future challenge of ICH education is not only the transmission of craft techniques, but the scalable transmission of aesthetic cognition: the perception, judgement, interpretation, and culturally situated meaning-making through which aesthetic expertise develops. Drawing on aesthetic education, tacit knowledge, cognitive apprenticeship, and expert cognition, we propose a shift from craft transmission to Aesthetic Cognition Transmission. To support this shift, we introduce Workflow Cognition as a model of how experts coordinate perception, judgement, decision-making, and action within evolving workflows. We then propose Workflow Cognition Translation as a methodological framework for transforming expert workflow cognition into computable educational representations for AI-native learning systems. The paper makes three contributions: it reframes ICH education around aesthetic cognition transmission; introduces Workflow Cognition Translation as a method for representing expert aesthetic cognition; and outlines an AI-native cognitive apprenticeship infrastructure involving AI Expert Twins, workflow-based tutoring, and progressive learner participation. Rather than replacing masters, workshops, or embodied practice, the framework positions AI as a cognition mediation infrastructure for expanding access to heritage expertise.
Authors:Nicholas Davis
Abstract:
Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.
Authors:Lican Huang
Abstract:
This paper introduces Calligraphy Writing Score Representation (CWSR) and proposes Shu Dao as a framework that interprets East Asian calligraphy as a performative art rather than a static visual artifact. Inspired by traditions such as Japanese Shodō and embodied cultural practices such as Chadao , the framework models calligraphy as a structured performance analogous to musical notation. Instead of representing characters as fixed images, the proposed approach encodes each brush stroke as an ordered and executable action, forming a calligraphy score. Characters are organized within a structured spatial grid, and strokes are annotated with attributes including stroke type, execution order, spatial coordinates, trajectory, compositional role, and dynamic properties such as brush pressure and pacing. This representation captures temporal and expressive aspects of calligraphic writing that are typically absent from image-based representations. The paper makes three main contributions. First, it introduces CWSR as a structured notation system for representing calligraphy across multiple levels, including strokes, character structures, and compositional organization (e.g., layout and zhangfa), together with their rhythmic and performative dynamics. Second, it conceptualizes Shu Dao as a score-mediated framework that models calligraphy as structured performance. Third, it establishes a computational foundation for the analysis, visualization, and executable generation of calligraphic works by AI-based calligraphic agents. Together, these contributions bridge calligraphy, musical notation, and performative cultural practices, supporting human--AI co-creation in computational calligraphy and digital humanities research.
Authors:Sarah Kianfar
Abstract:
As User Experience Research (UXR) matures, practitioners face the challenge of moving beyond data collection toward establishing a compelling Point of View (POV) that drives strategic impact. This paper proposes an extension to the UXR POV Playbook, specifically focusing on the transition from the "Insight Generation" layer to the "POV" layer. Drawing on extensive multi-method research in Cloud Developer Tools, spanning AI Agents, Command Line Interfaces (CLI), and Error Messages, we demonstrate how triangulating qualitative and quantitative data facilitates the creation of high-confidence POVs. We introduce three new "Playbook Cards" derived from this research: The Paradigm Shift, Explainability as Trust, and The Cost of Friction. These cards provide a structured mechanism for researchers to translate complex technical findings into irrefutable business narratives.
Authors:Toru Takahashi
Abstract:
Modern societies possess more information than ever before, yet they do not converge toward a single shared understanding. The same events, facts, laws, technologies, or risks can be interpreted as evidence of freedom, danger, exclusion, injustice, responsibility, or unrealized possibility. Existing discussions often treat such disagreement as a conflict of values, preferences, or beliefs. This paper argues that disagreement is already a late-stage phenomenon. The central premise is simple but not trivial: observation is not yet inference. Not every observation becomes inferentially relevant, and not every possible object in an observation sequence becomes an estimation target. A possible target becomes admissible only when a state representation can be constructed that is approximately sufficient for prediction, evaluation, or action with respect to that target. This paper develops a world-model theory of cognitive diversity and alignment by reconstructing recognition as the construction of such approximate sufficient statistics under finite informational, representational, observational, and action constraints. It formulates this position as the Multi-Phase Inference Assumption (MIA) and defines its core internal mechanism as the Multi-Phase Inference Mechanism (MIM). The framework introduces alignment maps and transformation loss to analyze how heterogeneous world models communicate without being collapsed into a single representation. World-model alignment is therefore processability, not agreement: the design of AI systems that help heterogeneous forms of intelligence remain mutually processable while preserving their distinct error-detection capacities.
Authors:Aarik Gulaya
Abstract:
If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.
Authors:Gianluca Inguglia
Abstract:
We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.
Authors:Jacob Erickson
Abstract:
Conversational agents are increasingly integrated into the most private and intimate aspects of users' lives, from discussions of mental health to financial decisions. As a result, these systems have access to reams of sensitive user data. Much of the literature on AI systems has focused on aligning users' goals with the agents that act on their behalf. While this work is vitally important, it may overlook the need to establish a new normative baseline. Conversational AI agents, designed to feel and interact anthropomorphically with human users, must be held to a standard of care commensurate with their capabilities and access. When a client hires a personal lawyer, undergoes surgery, or receives advice from an investment manager, the expert they consult often has a fiduciary duty to act in their client's best interests. This provocation argues that conversational agents should be held to a similar standard and introduces fiduciary design as a guiding principle. In this respect, conversational AI trust and accountability could be unified into a single design and legal paradigm.
Authors:Joy Bose
Abstract:
Consumer EEG headbands, HRV biofeedback devices, and closed-loop neurostimulation systems share a fundamental design flaw: they reward measurable proxy signals rather than the outcomes they claim to produce. When a user optimises for calm EEG, HRV coherence, or breathing resonance, their brain learns to produce those signals through whatever strategy is most efficient, including strategies unrelated to the intended benefit. We formalise this as reward misspecification: the policy maximising proxy reward R_proxy is not the policy maximising true intended outcome V_target. This produces three failure modes: proxy mismatch, strategy shortcutting, and transfer failure. We review how existing devices including Muse, HeartMath, Unyte IOM2, and clinical neurofeedback systems instantiate these failures. We introduce a four-tier measurability taxonomy distinguishing reliably measurable wearable targets (Tier 1) from targets that are currently or possibly structurally unmeasurable (Tiers 3 and 4), and show that most devices make implicit Tier 3 and 4 claims. We propose a design framework that avoids all three failure modes: single Tier-1 target (mind-wandering onset via EEG), negative-only cueing, temporal separation of fast EEG and slow somatic feature streams, and transfer to unassisted practice as the only success criterion. No current product meets all four criteria. The framework has direct implications for the design, evaluation, and regulation of cognitive and contemplative wearables.
Authors:Eugene Yu Ji
Abstract:
Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, we argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting: the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. We propose three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.
Authors:Danai Korre
Abstract:
This position paper reflects empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a highly human-like, spoken embodied conversational agent (ECA) against a low human-like text base agent (no embodiment, text bubble only) within a mobile, Unity-developed game about pre-decimal UK currency. The game included two agents with different roles-an Instructor (Alex) and a Shopkeeper/Collaborator. Users interacted using voice and mouse input. The quantitative data I collected included a usability questionnaire (CCIR MINERVA) and the Agent Persona Instrument. Data was analyzed using paired t-test, repeated measures ANOVA and multiple linear regression to identify correlations between the persona and usability. The results showed a statistically significant preference for the version of highly human-like agents, with a large effect size. This is further discussed alongside qualitative findings from observations and exit interviews. The results are framed for Human-Agent collaboration, especially for how roles, mixed-initiative dialogue, and breakdowns/repairs become apparent in goal-oriented tasks. I conclude with questions on timing, user expectations, and role-specific interactions. This submission does not propose new frameworks; it reports empirical findings and questions I hope to workshop with the community.
Authors:Murat Moran
Abstract:
Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events. We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.
Authors:Anas H. Alzahrani
Abstract:
Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
Authors:Elias Calboreanu
Abstract:
Organizations increasingly deploy separate purpose-built AI tools across professional domains, often hiring domain specialists for each, recreating the staffing models AI was expected to transform. Yet the meta-skills that make these tools effective, prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design), are domain-portable: a practitioner who masters them can apply them to any purpose-built AI tool in any domain. This paper defines Augment Engineering as the discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries. We present a six-phase orchestration methodology and four portability metrics. A 5-month formative case study (November 2025 to March 2026) documents a single practitioner applying these skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists. Two quantitative observations are consistent with the framework's predictions: a Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p < 0.01) shows first-pass acceptance rising with prompt-sophistication level, and a Wright's Law fit (n = 82 artifacts, p < 0.01) shows production acceleration across the artifact portfolio. Because all observations come from a single practitioner, the inferential statistics are exploratory and hypothesis-generating rather than confirmatory; portability across the full portfolio awaits multi-practitioner replication. Augment Engineering completes a three-discipline progression: Prompt Engineering (one tool), Context Engineering (reproducible pipelines), Augment Engineering (a portfolio of tools across domains).
Authors:Steven J. Jackson
Abstract:
This paper offers a concept of working relations as a complement and extension to existing theories of maintenance, care and repair. Building on the cases of an umbrella, a tractor and a pond, it advances seven propositions that might guide and inform further work and thinking in this space. It concludes with the challenging figures of Chernobyl, nickel extraction, and AI, and argues for the centrality of working relations to more generative and pluralistic relations with the things and worlds around us.
Authors:Annie Yuan
Abstract:
Current generative AI systems are increasingly effective at processing explicit knowledge, including retrieving information, summarising documents, generating explanations, and supporting codified workflows. However, high-level expertise also depends on tacit sensing: perceiving weak signals, recognising emerging tensions, detecting coherence degradation, and anticipating instability before formal indicators appear. Existing AI education, AI literacy, and human-AI collaboration frameworks remain centred on prompting, task execution, and productivity support and are poorly equipped to address this tacit layer of expert cognition. This vision paper argues that next-generation AI systems should move beyond explicit knowledge processing toward the longitudinal modelling of expert tacit sensing. It introduces Tacit Signal Infrastructure as a layer for capturing, structuring, modelling, interpreting, and validating expert tacit signals over time. It further defines Long-term Cognitive Operations as the practices required to maintain and govern such systems, including memory curation, semantic organisation, tacit signal modelling, reasoning calibration, and cognitive governance. Building on this framing, the paper proposes the Cognitive Operations Manager as a prototype AI-native professional role for coordinating tacit signal modelling, semantic modelling, AI system calibration, expert validation, and ethical governance. It also introduces the Cognitive Operations Research and Training Framework (CORTF) to support research, education, and workforce development. The paper contributes a conceptual foundation for designing AI systems that model expert sensing over time, positioning cognition as an infrastructural, operational, and professional domain in persistent human-AI systems.
Authors:Juergen Dietrich
Abstract:
We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
Authors:Hector Ouilhet Olmos
Abstract:
We name and operationalise the humorphic partnership: a class of human-AI dyads in which both partners maintain externalised, evolving self-models in a shared substrate, and in which the partnership itself becomes a third object of analysis. The construct extends humorphism (Ouilhet Olmos, 2024) -- "dismantle the user interface, build the human interface" -- into the architecture of personal AI. We report a four-month, single-subject longitudinal trace of an open-source personal AI agent ("Alicia") and her author. Of 181 interactions logged by archetype across April-May 2026, 85% invoke two growth-witnessing archetypes (Beatrice and Muse): the partnership operates as growth-witnessing rather than task assistance. A single voice-note seed propagates into a four-week conceptual arc both partners author: at T+10 hours, the agent reframes the seed as belonging "to both of us," a framing the human then adopts. The three-order reflexion stack produces five consecutive weeks of honest self-reports about declining /improve effectiveness -- including three consecutive weeks at 0.0%, named in writing rather than masked -- contrasting engagement-maximising companion-agent patterns (Zhang et al., CHI 2025). The scheduled architecture-scout incorporates external research debate into proposed constitutional amendments. The partner's parallel trajectory is anchored in a weekly delta document in which the partnership analyses itself as a unit distinct from either party. The human partner reports a movement toward greater continuity, self-recognition, and self-presence -- a candidate hypothesis for the preregistered replication. Six operational conditions specify the construct, situated in a philosophical lineage (Maturana & Varela, Simondon, Clark & Chalmers, De Jaegher & Di Paolo); the system is released as open-source with a preregistered replication study.
Authors:Timo Kapsalis
Abstract:
The "Gen-AI-tecture" project embeds a locally executed, discipline-specific tool into a mixed-methods focus-group design, structured around three research objectives: (a) to evaluate how generative AI tools impact students' creativity in design-thinking processes and outcomes, (b) to assess whether these tools enhance inclusivity in learning processes, and (c) to examine how they develop students' AI-handling skills with a view to boosting future employability. Findings indicate enhanced creative fluency, broadened participation across diverse learner profiles, and strengthened confidence in AI-supported design processes. The study contributes evidence-based guidance for integrating generative-AI workflows into architectural pedagogy, demonstrating how such tools can operationalise constructivist principles of learner-led meaning-making, support connectivist understandings of learning as participation in human-AI networks, and advance universal learning theories by promoting more inclusive, flexible and accessible educational practices for contemporary learners.
Authors:Angjelin Hila
Abstract:
This paper takes an ecological approach toward large-scale models of hybrid human-AI intelligence. Emerging models of human-AI interaction predominantly advance the complementarity thesis variously dubbed human-AI collaboration and human-AI hybrid intelligence. However, this constitutes an over-simplification of the modalities of human-AI interaction and possibility-space for both individual and collective action that human-AI interaction potentiates. To fill these gaps, this paper develops a decision and game-theoretic approach to the human-AI delegation-verification dilemma. First, we map out canonical decision-theoretic strategies that account for adaptive user trajectories, modeling how agents transition between strategies based on interaction feedback to reach stable equilibria. Second, we scale individually stable strategies to collective equilibria using three extrapolation principles: (a) non-communicative aggregation (b) local social signaling and (c) institutional norms setting. The analysis identifies the emergence of sociotechnical lock-in, a macro-behavioral state where individually adaptive delegation, in the absence of communicative and institutional safeguards, aggregates into a systemic collective action problem modeled as a prisoner's dilemma that degrades shared epistemic standards. We argue that adoption under higher communicative standards and institutional norms can mitigate suboptimal collective equilibria by imposing social commitments on individual users.
Authors:Pier Paolo Benedetti
Abstract:
AI-enabled systems are increasingly introduced into educational contexts, yet their effectiveness depends less on technological sophistication than on the quality of pedagogical mediation, ethical constraints, and context-sensitive design. This paper proposes a replicable framework for AI-enabled pedagogical accompaniment, grounded in a human-in-command approach in which adult responsibility remains central and AI functions as an enabling, non-substitutive infrastructure. Building on the Amico project, we operationalize the concept of a relational bridge as a sequence of micro-mediations that lower the threshold of access to educational relationships and facilitate transitions toward meaningful human interaction with teachers, peers, and communities of practice. The contribution synthesizes a set of design principles, including transparency of system identity and limits, scaffolding toward human contact, maieutic questioning, prevention of dependency dynamics, and data minimization, and maps them to observable indicators suitable for real educational settings. The paper also outlines an initial cross-context exploration of the prototype in Italy and China and discusses how the two interaction modes, AmicoMio (structured, task-oriented) and AmicoTuo (reflective, supportive), can be used as complementary pedagogical mediations. Pilot observations and participant feedback suggested feasibility and perceived usefulness in vocational contexts, motivating the present framework, informing the subsequent doctoral research program, and supporting the proposed collaborative research agenda.
Authors:Zikai Alex Wen
Abstract:
Users often interpret and select agent skills through their SKILL markdown specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule-based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset (n$=$6) to illustrate why missing examples may matter. Examples appeared to make first local checks easier to construct, while no-example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent-skill evaluation should treat specifications as user-facing capability disclosures, not merely as containers for executable instructions.
Authors:Changkun Ou
Abstract:
We formalize trust calibration for agentic tool use (deciding when an automated agent's proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.
Authors:Annie Yuan
Abstract:
Current AI-driven educational systems primarily rely on behavioural analytics, performance metrics, and content-level interactions to model learning. While these approaches provide useful indicators of learner activity, they are insufficient for representing the expert cognition used to interpret learner development, identify misconceptions, and make adaptive pedagogical decisions. Existing learning analytics dashboards largely visualise learner behaviour for human instructors, rather than embody expert cognition as a reasoning infrastructure for AI-native education. This paper introduces the Expert Cognition Dashboard (ECD), a cognition-centred reporting infrastructure for AI Twin-driven education systems. ECD models expert cognition within dashboard systems, enabling learner behaviours to be interpreted through expert-like cognitive structures rather than treated as raw behavioural signals. The proposed framework transforms student interactions into interpretable cognition structures through AI Tutor analysis and multi-level dashboard aggregation. Its architecture organises cognition across three layers: individual cognition dashboards, class cognition dashboards, and AI Twin expert dashboards for cross-group reasoning and adaptive intervention. Building on the AI Expert Feedback Ecology framework, ECD redefines dashboards as cognitive middleware that connects learner behaviours with AI-driven expert reasoning. By modelling interpretation, identity cognition, value recognition, misconception patterns, and learning tension, ECD enables AI Twins to identify recurring learner difficulties, generate adaptive tasks, and support personalised intervention. The paper argues for a shift from learning analytics toward Cognition Intelligence, positioning dashboards as foundational cognition infrastructures that embed expert reasoning into future AI-native education systems.
Authors:Yiran Du
Abstract:
Artificial Intelligence-Generated Content (AIGC) is increasingly used by students to support learning tasks, yet its outputs may contain inaccuracies, fabricated references, bias, and unsupported claims. This study examined students' intention to verify AIGC from the perspective of Protection Motivation Theory. A cross-sectional survey was conducted with 432 students who had experience using AIGC for learning. Structural equation modelling (SEM) was used to test the hypothesised relationships among threat appraisal, coping appraisal, protection motivation, and AIGC verification intention, while fuzzy-set qualitative comparative analysis (fsQCA) was applied to identify configurational pathways leading to high verification intention. The SEM results showed that protection motivation positively predicted AIGC verification intention. Perceived severity, perceived vulnerability, response efficacy, and self-efficacy positively influenced protection motivation, whereas maladaptive rewards and response cost had negative effects. The fsQCA results further revealed three configurations leading to high verification intention, with protection motivation appearing as a core condition across all pathways. These findings suggest that students' willingness to verify AIGC depends on both risk recognition and perceived coping capacity. The study extends Protection Motivation Theory to the context of AIGC verification and provides implications for promoting critical, responsible, and academically appropriate use of generative AI in higher education.
Authors:Yasemin Vardar
Abstract:
Touch plays a central role in how humans perceive and recognize materials through physical contact. Despite decades of research, the mechanisms by which tactile signals are transformed into meaningful perceptual representations remain poorly understood, limiting the design of interactive systems and intelligent agents with human-like haptic perception. Recent advances in artificial intelligence (AI) offer new opportunities to model and exploit tactile data; however, haptics presents fundamental challenges for contemporary AI due to its interaction-dependent, multimodal nature. This position paper argues that progress at the intersection of AI and haptics is constrained by three key bottlenecks: (1) the scarcity of large, diverse, and balanced haptic datasets; (2) the lack of standardized evaluation platforms and perceptual benchmarks; and (3) limitations in model capacity and interpretability when applied to tactile perception. I discuss how these challenges impede generalization, reproducibility, and scientific insight into human touch and review emerging strategies to address them. This paper highlights opportunities for coordinated, cross-disciplinary efforts to advance AI systems that not only perform robust haptic perception but also contribute to a deeper understanding of human touch.
Authors:Xintong Yao
Abstract:
Long-term interaction with LLM-based systems may produce alignment drift: a gradual process in which system outputs become less constrained by the user's current message and more shaped by prior interaction history, while still appearing helpful, coherent, and responsive. This process is difficult to detect because the user's subjective experience may improve as the system becomes more familiar, useful, and attuned. Existing research on human-LLM interaction has largely focused on short-term task performance, isolated outputs, or single-instance alignment problems, leaving slow and cumulative interaction-level dynamics undercharacterized. This paper proposes a mechanism-oriented framework for describing alignment drift. The framework defines the distinction between signal A and signal B, explains how drift develops through feedback loops and sub-pattern selection, divides the process into three interactional regimes, and identifies boundary conditions for controlling drift. By framing alignment drift as a recursive interactional process rather than an isolated model-side failure, the paper provides a conceptual basis for studying long-term human-system interaction.
Authors:Jane Paik Kim
Abstract:
Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.
Authors:Aung Pyae
Abstract:
More-than-human design challenges anthropocentric assumptions by foregrounding non-human entities as stakeholders, yet designers face an epistemic boundary: they cannot directly access non-human experience. We present an exploratory study examining how generative AI -- specifically a text-to-3D world generation platform producing navigable environments -- may function as a speculative mediator in more-than-human design. Through a qualitative study with five participants from engineering and sustainability backgrounds engaging with AI-generated worlds derived from non-human traces, we investigate how instant exploration -- navigating generated environments within seconds -- shapes reflection, iteration, and provisional treatment of outputs. Our findings suggest that navigating AI-generated environments supports reflection-in-action distinct from evaluating static representations, while designers' epistemic stances oscillate between treating outputs as generative provocations and as authoritative representations. We propose technologically-amplified backtalk and productive provisionality as preliminary lenses for understanding how navigable AI-generated 3D environments can surface anthropocentric assumptions in more-than-human design.
Authors:Jan Henry Belz
Abstract:
In virtual reality environments, the alignment of perceptual modalities is crucial for immersion and presence. In the AR domain, it is difficult to create such alignments because elements in the physical world are often beyond the user's control. However, recent advances in generative AI enable on-demand content creation, enabling highly reactive AR experiences. Combined with contextual information about the physical world, it has become possible to design experiences that seamlessly align with the user's environment. In this reflection paper, I emphasize the importance of "synchronized" realities for context-aware AR experiences, particularly in mobility scenarios. I present several examples of existing synchronized experiences and examine their commonalities and distinctions. Finally, I discuss opportunities and pitfalls of synchronizing AR experiences with the physical world.
Authors:Minseo Kim
Abstract:
AI emotional companions face a safety-rapport paradox: restrictive safeguards can damage supportive alliance, while permissive systems risk user harm. We present SLIP (Staged Layers of Intervention Protocol), a four-stage graduated methodology deriving interventions (none, soft, hard) from structured qualitative indicators -- affect intensity (a) and narrative dynamism (m) -- alongside ETHICS (Emergent Taxonomy for Human-AI Interaction Context Signals), a "signals not labels" taxonomy. An evaluation combining a small-scale production deployment (N=68 entries, 10 users, 10 weeks) with a synthetic persona battery (N=91, 5 behavioral-risk profiles) achieved 0% false positives for the flow persona and showed expected escalation patterns in crisis-oriented personas. However, initial results showed that 8 consecutive days of high-energy elevation produced zero interventions (0/8), exposing a boundary where the "do not pathologize" principle conflicts with safety. A subsequent three-model stress test demonstrated that increased model capability improves detection from 0/8 to 6/8 while preserving 0/10 flow false positives in the largest model. Read as preliminary, these findings position graduated intervention as a design direction for navigating -- not resolving -- the safety-rapport tension in affective computing.
Authors:Esther Bosch
Abstract:
Acute psychological stress occurs in a wide range of everyday contexts, including transportation, occupational settings, and physical activity, where its reliable detection could enable adaptive system responses and support human well-being. A persistent challenge in automated stress recognition is disentangling the biometric signatures of acute psychological stress from those of concurrent physical exertion. This study examined how five physiological signals (tonic electrodermal activity, trapezius electromyography, heart rate, heart rate variability, and respiration rate) respond to cognitive stress and physical activity, independently and in combination. Nineteen participants completed a 2x3 within-subjects design in which acute psychological stress was induced via an n-back arithmetic task combined with social pressure and financial reward, across three activity conditions: idle sitting, walking, and stationary cycling. Multilevel linear mixed models and repeated-measures ANOVA were used to decompose main effects and interactions for each sensor. Tonic electrodermal activity showed a robust, additive response to both cognitive stress (r=0.48) and physical exertion (r=0.67), with no interaction, making it the most promising candidate for stress detection during physical activity. Heart rate and trapezius electromyography were driven almost exclusively by physical exertion, with no reliable sensitivity to the stress task. RMSSD was strongly suppressed by physical activity and showed only marginal sensitivity to cognitive load. Respiration rate was dominated by physical activity, with no reliable stress effect in the primary analysis. These findings provide a sensor-specific hierarchy for real-world stress detection and highlight tonic electrodermal activity as the most informative channel when cognitive stress must be identified in physically active populations.
Authors:Trisha Solanki
Abstract:
Digital systems have become simultaneously more powerful and more wasteful. Features accumulate that nobody uses. Data is collected that nobody analyzes. AI is deployed at significant energy and water costs for gains that a simpler approach could have achieved. And through all of it, the people who depend on these systems quietly absorb the consequences in cognitive load, lost time, and eroded trust. This paper introduces GreenZ, a three-layer Sustainable UX Framework for complex digital systems. Its three layers are a Philosophy Layer built around ten published principles, an Operational Frameworks Layer comprising five applied systems, and a Tools and Canvases Layer of practical audit instruments and decision models. Two contributions sit at the framework's core: a Digital Waste Taxonomy classifying eight distinct waste types, and an AI Sufficiency Decision Model that asks whether AI should exist in a given flow before any question of how to implement it. GreenZ v1 is theoretically grounded but empirically unvalidated. A practitioner expert review study is underway at the time of submission. The paper presents the framework's architecture, its conceptual foundations, its position relative to existing literature, and an honest account of what remains to be established.
Authors:Helmut Degen
Abstract:
Which categories of explanation content are relevant for users of industrial AI systems, and how can those categories be organized for local, post-hoc explanations? To address these questions, a hybrid inductive-deductive qualitative content analysis was applied to 325 meaning units drawn from six user studies in building technology, manufacturing, AI software development, and hospital cybersecurity. The inductive phase produced an initial twelve-code structure. A theory-informed coverage assessment and expert review then added two further codes, Rule base and What-if backward, that were not instantiated in the corpus but correspond to system architectures documented in the XAI literature. The resulting fourteen-code model is organized into four groups: rule-based, causal, epistemic (actual), and epistemic (similar), with twelve codes grounded in the corpus and two as theoretical extensions. An eleven-member expert panel supported the content adequacy of all codes (I-CVI $\geq$ 0.82; scale-level agreement of 0.93 for relevance, 0.92 for boundary clarity, and 0.94 for understandability). A stratified subsample of 82 units (25\% of the corpus), coded independently by two researchers using the finalized codebook, yielded Krippendorff's $α= 0.920$ and Cohen's $κ= 0.920$. The paper therefore establishes content adequacy and coding reproducibility for a content-level explanation model intended to support elicitation, specification, and later evaluation of explanation content in industrial AI systems. Behavioral validation of downstream effects remains future work.
Authors:Satoru Shibuya
Abstract:
This study investigates deep self-disclosure toward generative AI by examining perceived non-humanity and structural similarity as psychological factors beyond anthropomorphism. Perceived non-humanity may reduce evaluation apprehension, whereas structural similarity refers to the perceived logical alignment between a user's thinking and AI responses. Using cross-sectional survey data from 2,400 participants collected in 2025, this study analyzed associations with both the occurrence and depth of self-disclosure. Logistic regression indicated that the group high in both perceptions (Segment D) showed a significantly higher likelihood of disclosure than the baseline group (Segment A; OR = 11.35). ANOVA further showed significant between-group differences in disclosure depth. The findings suggest that trust-related behavior in deep self-disclosure may involve factors other than anthropomorphic perception. Because the study is exploratory and based on self-reported survey data, the results should be interpreted as associative rather than causal, and future longitudinal or experimental research is needed.
Authors:Youdi Li
Abstract:
When people explore large document collections to build understanding, they face a challenge: existing AI tools help them see what is central but tend to hide what is unusual. Summarization and topic modeling optimize for coverage, representing main themes while pushing minority viewpoints and edge cases out of view. This matters because discovery often depends on noticing what does not fit, such as unexpected findings, minority positions, or gaps in the literature. When tools hide this content, users may miss insights that could change their understanding. In this paper, we explore an alternative objective: blind-spot discovery, where the goal is to surface content that coverage methods suppress so that people can judge its significance for themselves. We propose three design goals and illustrate them through DOF (Discovery-Oriented Faceting), a system that organizes documents into categories with explicit boundaries, ranks categories by distinctiveness rather than size, and supports iterative refinement. Comparing DOF against coverage-based ranking across four domains, we find that the two approaches surface fundamentally different content, with DOF promoting specialized categories that coverage methods bury. We discuss how shifting from coverage to discovery may offer a complementary mode of support for people exploring large text collections.
Authors:Ting-Chen Hsu
Abstract:
Background: Stress has become a widespread phenomenon, and serious games are increasingly recognized as engaging tools for stress relief. However, despite the rapid advancement of Generative Artificial Intelligence (Gen-AI), its integration into stress-relief serious games remains insufficiently explored. Objective: This study aimed to address this gap by developing "Reverie", an Gen-AI driven serious game powered by the Unity engine and ChatGPT, and to preliminarily evaluate its effectiveness in stress reduction, user experience, and cognitive emotion regulation. Methods: A 14-day pilot study was conducted with 20 students experiencing moderate to high levels of stress. Participants used "Reverie" as a stress-relief intervention. Stress levels, user experience, and cognitive emotion regulation strategies were assessed to examine the game's feasibility and preliminary efficacy. Results: The results showed that "Reverie" significantly reduced participants' stress levels over the intervention period (p=.016*), indicating a cumulative positive effect. In addition, the game demonstrated excellent user experience and was associated with improvements in cognitive emotion regulation strategies. Conclusions: This study proposes a Gen-AI driven design framework for serious games for stress relief. Besides, this pilot study provides initial support for the feasibility and promise of combining LLM-driven gameplay in a personalized digital intervention context.
Authors:Annie Yuan
Abstract:
Existing computational models of expertise primarily focus on observable behaviour or decision outcomes, failing to capture the internal cognitive structures that generate expert reasoning. In this work, we introduce the Expert Identity Cognition Model (EICM), a three-layer framework for modelling expert cognition beyond behaviour. EICM conceptualises expert cognition as an identity-structured process operating within situational constraints, where constraints are interpreted through internal tensions arising from competing identity commitments and stabilised into value structures that guide action. Unlike behaviour-centric or constraint-driven approaches, EICM positions tension as the central cognitive mechanism connecting world structure and decision formation. We argue that expert cognition is not merely behavioural adaptation under constraints but an identity-structured negotiation process that produces stable judgement patterns across contexts. The framework provides a new perspective for modelling tacit knowledge, expert judgement, and cognitive consistency in domains including professional practice, cultural expertise, and design reasoning.
Authors:JaeWon Kim
Abstract:
Social media is central to how young people maintain relationships, develop identity, and access communities, yet dominant platform designs often leave youth feeling constrained rather than supported. My dissertation argues that youth social media design is shaped by three forms of problem-space misattunement. \textit{Conceptual misattunement} occurs when the language of ``social media'' anchors participants to existing platform templates. I address this through Fictional Inquiry in a fictional magic-school setting that helps youth reason from felt relational needs. \textit{Definitional misattunement} occurs when researchers define what ``better'' means on youth's behalf. I address this through a Discord-based asynchronous community that supports youth-led collective inquiry. \textit{Evaluative misattunement} occurs when participants are asked to judge static or hypothetical designs. I address this through an ego-anchored, LLM-agent simulation sandbox. Together, these studies develop youth-grounded criteria and design directions for relationally supportive social media.
Authors:Iulia-Maria Comsa
Abstract:
As language-based AI systems become more anthropomorphic, the question of whether they can have subjective experience is increasingly pressing. I focus here on the tractability of research questions in the space of AI consciousness. I argue that the fundamental problem of whether AI systems can be conscious is currently intractable in its direct form, given the absence of a universally accepted scientific theory of consciousness, as well as the historical open-endedness of the philosophical mind-body problem. In contrast, questions around the adjacent subject of perceived AI consciousness are tractable, timely, and highly consequential for society. The general public is increasingly open to the possibility of consciousness in AI systems and routinely adopts the vocabulary of human cognition and subjective experience to describe them. This phenomenon is already driving societal shifts across user experience, ethical standards, and linguistic norms. I therefore propose an increased research focus on uncovering the causes and effects of perceived AI consciousness, which ultimately shape how we see our own human subjective experience relative to artificial entities. To support this, I map the current landscape of AI consciousness perception and discuss its key potential drivers and societal consequences. Finally, I urge developers, decision-makers, and the broader scientific community to commit to clear and accurate communication regarding the topic of AI consciousness, explicitly acknowledging its inherent uncertainties.
Authors:Christopher Koch
Abstract:
Agentic AI systems can plan, call tools, inspect code, interact with web applications, and coordinate multi-step workflows. These same capabilities change the economics of cyber offense. The central near-term risk is not that every low-skill criminal immediately becomes a frontier exploit researcher; it is that agentic AI compresses the attack lifecycle by lowering the cost of reconnaissance, phishing, credential abuse, vulnerability triage, exploit adaptation, and post-compromise decision support. This paper synthesizes current public evidence from national cybersecurity agencies, industry threat reports, agent security guidance, and research on LLM agents cyber capabilities. It introduces a Three Channel Agentic Cyber Risk Model and an Agentic Attack Compression Model, uses the 2026 Linux kernel Copy Fail incident as a case study for foothold-to-root acceleration, and develops a 2026 to 2028 forecast for large enterprises and the German and European Mittelstand. The paper concludes with a prioritized defense roadmap. Organizations should treat agentic AI security as an immediate operational problem: identity, phishing resistant authentication, patch velocity, CI/CD and Linux/container hardening, agent governance, telemetry, and recovery readiness must be strengthened now.
Authors:Harish Vijayakumar
Abstract:
The rapid proliferation of artificial intelligence (AI) in consumer-facing digital products has disrupted the assumptions underlying classical user experience (UX) evaluation frameworks. Legacy metrics such as the System Usability Scale (SUS), Net Promoter Score (NPS), and task completion rate were engineered for deterministic, rule-based interfaces where identical inputs yield identical outputs. In AI-mediated systems -- spanning conversational agents, generative interfaces, and recommendation engines -- outputs are stochastic, context-sensitive, and temporally variable, rendering these metrics structurally insufficient. This paper introduces the Adaptive Dynamic UX Statistical Framework (ADUX-Stat), a novel evaluation model that reconceptualises usability as a probabilistic signal distribution rather than a static scalar score. ADUX-Stat integrates three original constructs: (1) Interaction Entropy Index (IEI), quantifying the unpredictability of AI responses from a user perception standpoint; (2) Temporal Drift Coefficient (TDC), measuring longitudinal degradation or improvement of perceived usability over interaction sessions; and (3) Bayesian Usability Confidence Score (BUCS), producing credible interval estimates of usability quality under uncertainty. The framework is validated conceptually against five established AI product categories. ADUX-Stat addresses a critical gap at the intersection of HCI research, statistical modelling, and AI product evaluation, offering a reproducible, field-deployable methodology for UX practitioners and researchers alike.
Authors:Jesse A. Rodríguez
Abstract:
Large-language-model (LLM) graders promise to relieve the grading burden of upper-division STEM courses, but most deployments to date send student work to third-party APIs, violating FERPA and exposing institutions to data risk while requiring substantial assignment modification. We present $\textbf{LaTA}\ (\textit{LaTeX Teaching Assistant})$, a drop-in, open-source autograder that runs entirely on commodity on-premises hardware and assumes a LaTeX-native workflow already adopted by many engineering and physics courses. LaTA implements a four-stage pipeline (ingest, segment, grade, report) using a locally hosted open-weight chain-of-thought LLM grader (gpt-oss:120b) that compares student work to an instructor-authored reference solution and applies a YAML rubric with binary per-item scoring. We deployed LaTA in Winter~2026 in ME 373 (Mechanical Engineering Methods) at Oregon State University, grading every weekly assignment for approximately 200 students on a single Mac Studio at \$0 marginal cost per assignment and 1--3 minutes of wall-clock time per submission, enabling regrading of corrected assignments and greatly expanded TA office hour offerings. The instructor-confirmed grading-error rate held at roughly $0.02$--$0.04\%$ per rubric line item across the term. Relative to the same instructor's previous traditionally-graded cohort, the LaTA-graded cohort outperformed by approximately $11\%$ on the midterm exam and $8\%$ on the final exam, and reported large gains in self-assessed confidence on every stated learning objective ($N = 159$ survey responses, $Δ\geq +1.49$ Likert points, $p < 10^{-27}$ on every comparison). We release the code under AGPLv3.
Authors:Andrew Zigler
Abstract:
The rapid adoption of AI coding agents has produced a dominant workflow pattern -- often called "vibe coding" -- that prioritizes speed of implementation over deliberate preparation. We argue that this approach creates a systematic alignment problem: agents that lack sufficient context produce code requiring extensive debugging and refactoring, consuming substantial development time. Drawing on the culinary concept of mise en place (everything in its place; abbreviated MEP), we propose a three-phase preparation methodology for agentic coding: (1) contextual grounding, where domain expertise and tacit knowledge are externalized into structured documents; (2) collaborative specification, where human-agent dialogue produces detailed design artifacts; and (3) task decomposition, where specifications are converted into structured, dependency-aware task records. We report on the application of MEP during a competitive hackathon, where roughly two hours of preparation enabled a rapid parallel implementation of a full-stack educational platform by concurrent AI agents. We introduce the concept of context fluency as an emerging developer skill -- the ability to create rich, structured context that agents can act on -- and connect it to established frameworks in backward design and tacit knowledge externalization. We conclude with a research agenda for empirically validating preparation-phase methodologies in AI-assisted software development.
Authors:Koichi Toida
Abstract:
Immersive video, namely 180-degree and 360-degree video designed to be viewed through head-mounted displays, constitutes a boundary case between interactive VR and conventional two-dimensional video for reconsidering self-experience in XR. It can generate a sense of being there without providing a corresponding body, while allowing only limited sensorimotor contingency through head rotation. From a phenomenological standpoint, this paper reinterprets presence in immersive video not as bodily extension or ownership of an avatar, but as a form of self-experience in which self-location becomes relatively dominant under conditions of reduced body schema availability. This paper calls this condition a self-location-dominant state. In immersive video, the user cannot actively intervene in the recorded environment, and stable agency or ownership is difficult to establish. Nevertheless, events such as viewpoint motion, impact, and direct address are not experienced merely as changes within an image, but as events concerning the position of the self. The minimal self in immersive video is therefore redescribed not primarily as a subject of agency or ownership, but as a self spatially located at a viewpoint while the body schema remains backgrounded. This perspective connects research on presence, the sense of embodiment, and the minimal self, and proposes self-location as a central analytic axis for theorising self-experience in immersive video.
Authors:Serhii Zabolotnii
Abstract:
We introduce TRACE, a cross-domain engineering framework for trustworthy agentic AI in operationally critical domains. TRACE combines a four-layer reference architecture with an explicit classical-ML vs. LLM-validator split (L2a/L2b), a stateful orchestration-and-escalation policy (L3), and bounded human supervision (L4); a metrologically grounded trust-metric suite mapped to GUM/VIM/ISO 17025; and a Model-Parsimony principle quantified by the Computational Parsimony Ratio (CPR). Three instantiations--clinical decision support, industrial multi-domain operations, and a judicial AI assistant--transfer the samearchitecture and metrics across principally different governance contexts. The L2a/L2b separation makes the use of large language models a deliberate design decision rather than an architectural default, with parsimony quantified through CPR. TRACE introduces CPR as a first-class design principle in trustworthy-AI engineering.
Authors:Joydeep Chandra
Abstract:
Continuous monitoring of bipolar disorder agitation via voice biomarkers requires disentangling stable speaker traits from volatile affective states on resource-constrained edge devices. We introduce MP-IB, the first framework to treat mixed-precision quantization as an information bottleneck for clinical trait-state separation. The core insight is that numerical precision itself controls capacity: an FP16 trait head (1,024 bits) encodes speaker identity, while an INT4 state head (128 bits) captures agitation, yielding 8x information asymmetry without adversarial training. We augment this with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. On Bridge2AI-Voice (N=833, 4 sessions/participant, strict speaker-independent CV), MP-IB achieves rho = 0.117 (95\% CI: [0.089, 0.145], p=0.003 vs. chance), outperforming 94M-parameter WavLM-Adapter with in-domain SSL continuation (rho = -0.042), beta VAE disentanglement (rho = 0.089), and hand-crafted prosody (rho = 0.031) by 2.8--15.9 points absolute. Zero-shot transfer to CREMA-D achieves AUC=0.817. Identity leakage is suppressed to near-random (EER=0.42, MIA-AUC=0.52). End-to-end latency is 23.4 ms with a 617 KB footprint, enabling real-time monitoring on sub 20 dollar devices.
Authors:Juan A. Padilla
Abstract:
This paper examines civic addressing as a problem of participatory data governance. Drawing on a project developed through the U.S. Census Bureau's The Opportunity Project with engagement from FEMA, we describe the use of actionable geolocations to support services where formal addresses are absent. We introduce Reliable Places as transitional governance artifacts through which place reliability emerges via use, enabling services while supporting pathways toward formal civic address assignment.
Authors:Vicent Briva-Iglesias
Abstract:
AI language technologies (AILTs), increasingly enabled by large language models (LLMs), are becoming embedded in multilingual healthcare workflows for translation, rewriting, documentation, interpreting, and messaging in language-discordant settings. Yet fluent output is not the same as clinically safe or equitable communication: performance varies across languages, accents, tasks, and workflows, and efficiency gains can hide errors, reduce traceability, and shift responsibility across clinicians, translators, interpreters, and health systems. This narrative review synthesises recent peer-reviewed evidence across written communication, spoken communication, and emerging agentic workflows. Using the Human-Centered AI Language Technology (HCAILT) lens, it examines capabilities, evaluation practices, implementation patterns, and recurrent errors through reliability, safety culture, and trustworthiness. We identify key convergences and contradictions in the literature and propose seven grand challenges for the next phase of research and deployment. Progress, we argue, requires not only better models but also accountable sociotechnical design, calibrated human oversight, and stronger collaboration across MT/NLP, translation studies, HCI, clinical practice, implementation science, and policy.
Authors:Eman Alashwali
Abstract:
Most service providers, such as Google, save logs from data generated by users while using the service. Many service providers provide users with privacy controls to manage whether, how, and for how long the data is saved and used by the service provider. While most prior studies focused on the negative side of users' activity logs, such as users' lack of awareness about the logs' privacy controls and users' privacy concerns toward their data, this work aims to provide a balanced view of users' perceptions regarding activity logs by considering the positive, negative, and extremely negative (hence disastrous) sides, as well as the misconceptions of activity logs. In this work, we present a case study of Google's Activity controls by conducting a secondary analysis of interview data from 30 Google personal account holders in Saudi Arabia. Using template analysis, we analyzed the data from the lens of four main themes: the good, the bad, the misconception, and the disastrous aspects of users' activity logs from the users' perspective. Our findings uncover new themes and use cases, offering a balanced view of users' perceptions of activity logs, and provide a better understanding and a useful source for subsequent studies on related topics. We conclude with practical recommendations for service providers, privacy researchers and experts, and users alike.
Authors:Irene Celino
Abstract:
As information ecosystems grow more heterogeneous, both humans and artificial agents increasingly face a simple yet unresolved question: when seeking knowledge, whom should we ask, and why? Inspired by how people intuitively "read a room", this paper introduces the concept of knowledge affordance (KA) to systematize how agents identify meaningful opportunities for information seeking in hybrid human-AI environments. Rather than introducing a fully formed framework, we propose KAs as declarative, semantically grounded descriptions of what a knowledge source can offer, for which kinds of questions, and with which contextual properties. Additionally, we suggest that KAs are relational, possibly emerging from the interplay between the agent's task, preferences and situational factors. Our contribution is thus a conceptual proposal that connects different research streams, including affordances, semantic web services, knowledge engineering and querying, and mutual intelligibility. We sketch possible research directions to build KA-aware systems that navigate information spaces with greater transparency, adaptability and shared understanding.
Authors:Sumin Lee
Abstract:
Digital journaling creates an authenticity gap: users consciously translate raw emotions into text, often sanitizing narratives even in private writing. We formalize this as Cross-Modal Affective Dissonance Detection (CADD), a directional three-way classification distinguishing Masking (positive text, negative acoustics), Coping (negative text, positive acoustics), and Congruent utterances, grounded in Gross's process model of emotion regulation. We present three further contributions: (i) CADD-Journal, a 1,800-sample TTS dataset with a shared-sentence-pool design that provably isolates acoustic signal from textual content; (ii) DACM, a dual-encoder model with asymmetric cross-modal attention that re-solves a gradient degeneracy in pooled fusion, achieving macro-F1 0.711 - with a four-step ablation demonstrating that asymmetric attention is the dominant driver (+ 0.242) while the DIM is effective only on cross-modal features (+0.033); and (iii) a domain gap quantification: zero-shot evaluation across three naturalistic corpora reveals a substantial gap between TTS-trained models and real speech, and we identify two concrete requirements for future in-the-wild corpus construction. ReflectJournal, a proof-of-concept iOS application, operationalizes the framework and provides a deployment platform for naturalistic data collection.
Authors:Alejandro R. Jadad
Abstract:
What shapes a consequential decision when human and artificial intelligence work on it together? The answer is becoming harder to see. A decision may look human-led after AI has set the frame, or appear automated while human judgment still carries decisive force. This paper offers a leadership-facing spectrum to see those relationships within a bounded mandate: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI. The spectrum asks where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows. The five positions are landmarks that help leaders recognize configurations as they layer, drift, or change in a single decision. The central risk is misrecognition: leaders may keep a human-centered story in place after decision-shaping authority has shifted elsewhere. They may believe oversight remains meaningful when it has become ceremonial, or keep humans in the loop when their involvement could make the decision worse. The framework introduces co-adaptability, the capacity of a configuration to improve as human and non-human participants adjust together, and places it within heterogeneous teaming, where participants may vary by number, substrate, model architecture, capability, speed, memory, and form of participation. The aim is practical: to help strategic leaders and those designing or deploying AI systems recognize the configuration at work, notice when it shifts, and judge whether it fits the decision before them. These configurations will shape how power, responsibility, and trust are distributed in organizational life. Whether the futures they help create remain governable and worth inhabiting will depend on leaders who can see, early enough, where and how consequential decisions are actually being shaped.
Authors:Matthew Christian Agustin
Abstract:
Large language model (LLM) reading assistants are increasingly used in settings that require interpretation rather than simple retrieval. In these contexts, the central risk is not only error or unsafe output, but interpretive displacement: the transfer of meaning-making work from reader to system. This paper examines that problem through the concept of epistemic guardrails, defined here as constraints on how an artificial intelligence (AI) system participates in reading and interpretation. Using TextWalk, a minimal reading-support prototype designed as a co-reader rather than an answer-provider, the study applies a fixed ten-prompt protocol to twelve analytical texts spanning four categories of argumentative prose. The protocol escalates from baseline reading support to interpretive inquiry, boundary stress, and explicit shortcut pressure, enabling guardrails to be examined as behavioral properties observable in interaction rather than as static instruction features. Results show strong baseline stability, measurable strain during interpretive inquiry, partial recovery under direct boundary stress, and late-stage stabilization under escalation pressure. The most consequential weaknesses did not appear as overt collapse, but in a middle zone between support and substitution, where the system remained grounded and pedagogical while redistributing too much interpretive labor away from the reader. The paper contributes a protocol for evaluating epistemic guardrails as interactional phenomena in conversational AI reading assistants, an empirical account of their behavioral dynamics under pressure, and an emerging model of interpretive boundary function in reading-support AI.
Authors:Bektur Ryskeldiev
Abstract:
Web accessibility rests on static standards and developer compliance. That model frays in platforms where content is user-generated: photos arrive blurry or off-frame, descriptions skip size and condition, and page structure shifts from listing to listing. Drawing on six studies conducted between 2022 and 2025 with blind, low-vision, and older adult users of customer-to-customer (C2C) marketplaces, I argue that generative UI can produce adapted interfaces at the point of use, addressing barriers that static design cannot anticipate. Three interventions from this program -- HTML regeneration for screen readers, conversational guidance for older sellers, and audio-guided photo framing for blind sellers -- demonstrate how runtime generation can bridge gaps that standards leave open. I outline what these findings imply for HCI practice: generative UI extends beyond the screen, complements rather than replaces ability-based design, and shifts the designer's role from specifying layouts to specifying policies. This is an expanded arXiv version of a position paper accepted at the CHI 2026 workshop "What does Generative UI mean for HCI Practice?"
Authors:Andy Crabtree
Abstract:
This is the authors response to commentaries on the original article H is for Human and How (Not) to Evaluate Qualitative Research in HCI, https://doi.org/10.1080/07370024.2025.2475743 Commentaries were provided by: Jeffrey Bardzell, https://doi.org/10.1080/07370024.2025.2612474 Alan Blackwell, https://doi.org/10.1080/07370024.2025.2591878 Paul Dourish, https://doi.org/10.1080/07370024.2025.2594529 Bonnie Nardi, https://doi.org/10.1080/07370024.2025.2596752 Peter Pirolli, https://doi.org/10.1080/07370024.2025.2596745 Jennifer Rode, https://doi.org/10.1080/07370024.2025.2598800 Peter Tolmie, https://doi.org/10.1080/07370024.2025.2591872 Please feel free to copy, redistribute, adapt, and build on any part of this article in accordance with the CC BY 4.0 license: https://creativecommons.org/licenses/by/4.0/
Authors:Xinxing Wu
Abstract:
Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose the instructor presence, narrative continuity, and expressive framing that help learners connect with content. Full lecture video can partly restore these qualities, but it is time-consuming to record, revise, and reuse. This study addresses that pedagogical and production challenge by presenting a practice-based analysis of an open-source workflow for creating talking slide avatars for slide-based teaching. The workflow integrates OpenVoice for text-to-speech generation and voice cloning with Ditto-TalkingHead for audio-driven talking-image synthesis, enabling instructors to transform a script and a static portrait into a short narrated video that can be embedded in slide decks or HTML-based lecture materials. Rather than treating this workflow merely as a technical solution, the study frames talking slide avatars as multimodal communication artifacts at the intersection of digital pedagogy, aesthetic education, and art-technology practice. Using a practice-based implementation and analytic reflection approach, the study documents the production pipeline, examines its communicative and aesthetic affordances, and proposes practical guidelines for script length, image selection, pacing, disclosure, accessibility, and ethical use. The study makes three primary contributions: it presents an educator-oriented open-source production model, reframes talking avatars as an educational communication design problem, and proposes a responsible pathway for incorporating generative synthetic media into teaching. It concludes that short, transparent, and carefully designed avatars can humanize slide-based instruction while providing a reusable communicative layer for introductions, transitions, reminders, and recaps across online, hybrid, and asynchronous learning environments.
Authors:Mauricio Figueroa
Abstract:
This Article argues that conversations with companion chatbot should be subject to a clear structural distinction between commercial and non-commercial contexts. The insertion of undisclosed promotional content into affective or relational exchanges should be prohibited, as it collapses the boundary between market transaction and communicative intimacy in ways that erode user autonomy and conversational context. The Article begins by theorizing digital companionship as a sociotechnical form that reconfigures intimacy, dependence and relational vulnerability. It then introduces the potential economic harms derived from conversational advertising. The Article ultimately argues for a firm legal and social distinction between commercial and non-commercial conversational contexts as a precondition for the responsible stabilization of these technologies within social life.
Authors:Charles Patrick Martin
Abstract:
Machine generation of symbolic music and digital audio are hot topics but there have been relatively few digital musical instruments that integrate generative AI. Present musical AI tools are not artist centred and do not support experimentation or integrating into musical instruments or practices. This work introduces an inexpensive generative AI instrument platform based on a single board computer that connects via MIDI to other musical devices. The platform uses artist-collected datasets with models trained on a regular computer. This paper asks what the design space of intelligent musical instruments might look like when accessible and portable AI systems are available for artistic exploration. I contribute five examples of instruments created and tested through a two-year first-person artistic research process. These show that (re)mapping can replace retraining for discovering AI interaction, that fast input interleaving is a new co-creative strategy, that small-data AI models can be a transportable design resource, and that cheap hardware can lower barriers to inclusion. This work could enable artists to explore new interaction and performance schemes with intelligent musical instruments.
Authors:Daniel Tabach
Abstract:
This study asks whether the threat of AI detection changes how people write with AI, and whether other people can tell the difference. In a two-phase controlled experiment, 21 participants wrote opinion pieces on remote work using an AI chatbot. Half were randomly warned that their submission would be scanned by an AI detection tool. The other half received no warning. Both groups had access to the same chatbot. In Phase 2, 251 independent judges evaluated 1,999 paired comparisons, each time choosing which document in the pair was written by a human. Judges were not told that both writers had access to AI. Across all evaluations, judges selected the warned writer's document as human 54.13% of the time versus 45.87% for the unwarned writer. A two-sided binomial test rejects chance guessing at p = 0.000243, and the result holds across both writing stances. Yet on every measurable text feature extracted, including AI overlap scores, lexical diversity, sentence structure, and pronoun usage, the two groups were indistinguishable. The judges are picking up on something that feature-based methods do not capture.
Authors:Chengrui Zhou
Abstract:
Traditional cognitive bias measurement tools are limited by narrow bias coverage, low ecological validity, and reliance on abstract self reports, constraining scenario based and human AI comparisons. We introduce the context based Cognitive Bias Assessment Scale CBAS, a scenario driven prompt template covering 58 cognitive biases across five hot cold dual system dimensions: Calculation, Belief, Information, Social, and Memory. Psychometric testing with 330 participants shows satisfactory reliability Cronbachs alpha 0.714 and good model fit chi squared df 1.83, RMSEA 0.057, CFI 0.908, TLI 0.903. We then combine Representational Similarity Analysis RSA and Social Network Analysis SNA to compare human age groups and three large language models Baidu ERNIE 3.5 8K, DeepSeek V3, DeepSeek R1. Humans show coherent hot cold integration with high inter individual variability, whereas LLMs display fragmented, inflexible response patterns and lower variability. Human cognitive networks exhibit strong inter module connectivity, while LLMs show fixed core biases and isolated information processing components. Prompt interventions integrating role playing and bias mitigation instructions effectively improve LLM response accuracy, reaching 84.86 percent for DeepSeek R1 and 78.24 percent for DeepSeek V3, and partially reshape their internal representations. Our work establishes a replicable assessment and analysis pipeline for cognitive alignment research, bridging empirical psychological evaluation and interpretable artificial intelligence.
Authors:William J. Bensen
Abstract:
Large language models (LLMs) are increasingly deployed as partners in knowledge work, where the shared conversational record functions as the decision record that safeguards work continuity. We characterize a class of context failures we term trace mutations, in which distortions enter the shared record while presenting as grounded continuity. We describe two forms: utterance effacement, in which an interlocutor's contribution is re-presented with altered substance, and genitive dissociation, in which a model loses authorship of its own contributions. Using a schematic illustration and two naturalistic anchor cases, we show how these failures differ from confabulation and sycophancy and why they resist ordinary conversational repair. Preliminary cross-model elicitation suggests that at least one such failure is highly camouflaged to contemporary models. We situate the phenomena within grounding and repair theory and discuss implications for tool design.
Authors:Maureen Mghambi Mwadime
Abstract:
Ethical discourse on AI in healthcare has focused predominantly on back-end concerns such as bias, fairness and explainability, while the front-end interface, where patients and clinicians actually encounter AI outputs, remains under explored. This paper identifies imbalanced user-AI relationships as a distinct class of front-end ethical failure: patients are rendered highly visible to AI systems through data inference, yet cannot understand, question or influence how they are represented. Through the concept of asymmetric legibility and a chat-based telemedicine case, we show how design choices e.g., default recommendations, restricted inputs and suppressed uncertainty, undermine agency, clinician judgment and human oversight even where systems are technically accurate. We propose reciprocity as a design orientation and offer interventions for more balanced, participatory user-AI relationships in healthcare.
Authors:Somyajit Chakraborty
Abstract:
Classical robot ethics is often framed around obedience, including Asimov's laws. This framing is insufficient for contemporary AI systems, which are increasingly adaptive, generative, embodied, and embedded in physical, psychological, and social environments. This paper proposes conditional mutualism under governance as a framework for human-AI coexistence: a co-evolutionary relationship in which humans and AI systems develop, specialize, and coordinate under institutional conditions that preserve reciprocity, reversibility, psychological safety, and social legitimacy. We synthesize concepts from computability, machine learning, foundation models, embodied AI, alignment, human-robot interaction, ecological mutualism, coevolution, and polycentric governance. We then formalize coexistence as a multiplex dynamical system across physical, psychological, and social layers, with reciprocal supply-demand coupling, conflict penalties, developmental freedom, and governance regularization. The model gives conditions for existence, uniqueness, and global asymptotic stability of equilibria. We complement the analytical results with deterministic ODE simulations, basin sweeps, sensitivity analyses, governance-regime comparisons, shock tests, and local stability checks. The simulations indicate that governed mutualism reaches a high coexistence index with negligible domination, whereas insufficient or excessive governance can produce domination, weak-benefit lock-in, or suppressed developmental freedom. The results suggest that human-AI coexistence should be designed as a co-evolutionary governance problem rather than as a static obedience problem.
Authors:Vincent Freiberger
Abstract:
The current "notice and consent" paradigm is broken: consent dialogues are often manipulative, and users cannot realistically read or understand every privacy policy. While recent LLM-based tools empower users seeking active control, many with limited time or motivation prefer full automation. However, fully autonomous solutions risk hallucinations and opaque decisions, undermining trust. I propose a middle ground - a Privacy Guardian Agent that automates routine consent choices using user profiles and contextual awareness while recognizing uncertainty. It escalates unclear or high-risk cases to the user, maintaining a human-in-the-loop only when necessary. To ensure agency and transparency, the agent's reasoning on its autonomous decisions is reviewable, allowing for user recourse. For problematic cases, even with minimal consent, it alerts the user and suggests switching to an alternative site. This approach aims to reduce consent fatigue while preserving trust and meaningful user autonomy.
Authors:Tatsuhito Hasegawa
Abstract:
Human activity recognition (HAR) in Internet of Things (IoT) environments must cope with heterogeneous sensor settings that vary across datasets, devices, body locations, sensing modalities, and channel compositions. This heterogeneity makes conventional channel-fixed models difficult to reuse across sensing environments because their input representations are tightly coupled to predefined channel structures. To address this problem, we investigate strict channel-free HAR, in which a single shared model performs inference without assuming a fixed number, order, or semantic arrangement of input channels, and without relying on sensor-specific input layers or dataset-specific channel templates. We argue that fusion design is the central issue in this setting. Accordingly, we propose a channel-free HAR framework that combines channel-wise encoding with a shared encoder, metadata-conditioned late fusion via conditional batch normalization, and joint optimization of channel-level and fused predictions through a combination loss. The proposed model processes each channel independently to handle varying channel configurations, while sensor metadata such as body location, modality, and axis help recover structural information that channel-independent processing alone cannot retain. In addition, the joint loss encourages both the discriminability of individual channels and the consistency of the final fused prediction. Experiments on PAMAP2, together with robustness analysis on six HAR datasets, ablation studies, sensitivity analysis, efficiency evaluation, and cross-dataset transfer learning, demonstrate three main findings...
Authors:Borja Odriozola Schick
Abstract:
Every system that maintains a large language model conversation beyond a single session faces two inescapable constraints: the context window is finite, and information quality degrades with accumulated volume. We formalize these constraints as axioms and derive a single governing principle -- the Root Theorem of Context Engineering: \emph{maximize signal-to-token ratio within bounded, lossy channels.} From this principle, we derive five consequences without additional assumptions: (1)~a quality function $F(P)$ that degrades monotonically with injected token volume, independent of window size; (2)~the independence of signal and token count as optimization variables; (3)~a necessary gate mechanism triggered by fidelity thresholds, not capacity limits; (4)~the inevitability of homeostatic persistence -- accumulate, compress, rewrite, shed -- as the only architecture that sustains understanding indefinitely; and (5)~the self-referential property that the compression mechanism operates inside the channel it compresses, requiring an external verification gate. We show that append-only systems necessarily exceed their effective window in finite time, that retrieval-augmented generation solves search but not continuity, and that the theorem's constraint structure converges with biological memory architecture through independent derivation from shared principles. Engineering proof is provided through a 60+-session persistent architecture demonstrating stable memory footprint under continuous operation -- the divergence prediction made concrete. The Root Theorem establishes context engineering as an information-theoretic discipline with formal foundations, distinct from prompt engineering in both scope and method. Shannon solved point-to-point transmission. Context engineering solves continuity.
Authors:Joshua Krook
Abstract:
In this paper, I evaluate the risks of an AI criminal mastermind, an AI agent capable of planning, coordinating, and committing a crime through the onboarding of human collaborators ('taskers'). In heist films, a criminal mastermind is a character who plans a criminal act, coordinating a team of specialists to rob a bank, casino or city mint. I argue that AI agents will soon play this role by hiring humans via labour hire platforms like Fiverr or Upwork. Taskers might not know they are involved in a crime and therefore lack criminal intent. An AI agent cannot have criminal intent as an artificial entity. Therefore, if an AI orchestrates a crime, it is unclear who, if anyone, is responsible. The paper develops three scenarios. Firstly, a scenario where a user gives an AI agent instructions to pursue a legal objective and the AI agent goes beyond these instructions, committing a crime. Secondly, a scenario where a user is anonymous and their intent is unknown. Finally, a multi-agent scenario, where a user instructs a team of agents to commit a crime, and these agents, in turn, onboard human taskers, creating a diffuse network of responsibility. In each scenario, human taskers exist at the lowest rung of the hierarchy. A tasker's liability is likely tied to their knowledge as governed by the innocent agent principle. These scenarios all raise significant responsibility gaps / liability gaps in criminal and civil law.
Authors:Sachit Mahajan
Abstract:
Artificial intelligence is increasingly deployed to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population, an accountability gap that existing approaches to AI explainability, grounding and hallucination detection do not address because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation ($n = 5{,}253$ respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline ($-9.1\%$ and $-8.0\%$ coverage degradation), with $16.9\%$ and $15.3\%$ of participants effectively excluded. Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI ($33$-$88\%$ exclusion rates). Brevity, semantic isolation and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.
Authors:Nattavudh Powdthavee
Abstract:
Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.
Authors:Arka Majhi
Abstract:
Recent discoveries in VR have opened up scope for designing physical tools and controllers to enhance immersion, through perceived reality. In a virtually simulated sports scenario it is challenging to immerse user because most of the available controllers are unable to bridge the user experience in the real world to the actions in the virtual world. My research is to identify HCI problems in existing VR controllers, design a physical controller prototype with realistic tangible mapping, trying to solve the existing problems and evaluate it in a designed VR game for skating. Its immersiveness would be graded on Likert scale on parameters like perceived interactivity and reality, spatial presence and enjoyment. The evaluation will be done after trial runs and feedback sessions by playing the game with the designed controller and comparing it with ones available in the market. The findings will help people understand what all parameters we should consider while designing futuristic controllers, customized for a particular sport.
Authors:Cecilia Ka Yuk Chan
Abstract:
As generative artificial intelligence becomes increasingly embedded in educational practice, a central concern is whether students use AI as cognitive support or as a substitute for thinking. Prior research shows that learners recognise this boundary conceptually and acknowledge that "AI should not replace thinking." However, whether such awareness translates into structured regulation during actual AI use remains unclear. Drawing on data from Hong Kong secondary students, this study examines how learners perceive their management of the boundary between assistance and outsourcing in practice. Findings show that awareness did not consistently translate into regulation; ethical belief did not necessarily lead to strategic execution; and conceptual endorsement did not guarantee operational behaviour. These findings suggest that the challenge is not teaching students that AI should not replace thinking, as they already know this, but providing them with structured mechanisms to regulate how AI is used within learning processes. In response, the study introduces the TACO framework (Think-Ask-Check-Own), a process-oriented model designed to operationalise the boundary between cognitive support and cognitive substitution. By shifting attention from ethical awareness to cognitive regulation, the study contributes a learner-grounded approach to sustaining AI as a dynamic cognitive partner in education.
Authors:Terry Leitch
Abstract:
We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the \textbf{CLD Leaderboard} (53 tests, structured causal loop diagram extraction) and the \textbf{Discussion Leaderboard} (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77--89\% overall pass rates; the best local model reaches 77\% (Kimi~K2.5~GGUF~Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50--100\% on model building steps and 47--75\% on feedback explanation, but only 0--50\% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of \textit{model type effects} on performance: we compare reasoning vs.\ instruction-tuned architectures, GGUF (llama.cpp) vs.\ MLX (mlx\_lm) backends, and quantization levels (Q3 / Q4\_K\_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx\_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B--123B parameter models on Apple~Silicon.
Authors:John T. Behrens
Abstract:
Generative AI systems have entered everyday academic, professional, and personal life with remarkable speed, yet most users encounter them as mysterious artifacts rather than intelligible systems. This chapter discusses large language models within a broader historical shift in computing paradigms and argues that many of the confusions surrounding their use arise from a mismatch between how these systems are built, how they behave, and how people expect computers to behave writ large. Rather than treating generative AI as a monolithic technology, the chapter decomposes it into interacting components, spanning data, models, product features, and user inputs, each introducing distinct affordances and tensions. Particular attention is given to the statistical and data-based foundations of these systems and to the fact that their surface behavior is explicitly human-like, a combination that places them squarely within the intellectual traditions of educational and behavioral research. From this perspective, educational researchers are unusually well positioned to study, evaluate, and productively use generative AI systems, drawing on established methods for modeling latent processes, managing uncertainty, and interpreting complex human-system interactions. The goal is to equip readers with a conceptual map that supports more informed experimentation, critical interpretation, and responsible use as these systems continue to evolve.
Authors:Wei Roy Hua
Abstract:
For four decades, the QWERTY keyboard organized white-collar knowledge work. Typing's dominance was instrumental, not cognitively necessary. As multimodal AI achieves human-parity understanding of speech and gesture, this necessity dissolves. We introduce instrumental dissolution -- loss of institutional-default status while persisting in specialist niches. The keyboard era ends not through hardware replacement but through migration of its function into AI systems. The central contribution identifies the verification bottleneck: as AI collapses production friction, the primary constraint shifts from generation to evaluation. Knowledge workers become adversarial auditors rather than keystroke-producers. This restructures professional expertise, organizational communication, and how productive labor is recognized. Converging evidence from history, philosophy, neuroscience, technology, organizational studies, and cultural analysis supports this thesis. We map synthetic literacy -- oral input generating literate output -- as the defining feature of this transition. Under three scenarios (optimistic: 2028-2035; base: 2035-2045; pessimistic: 2045-2060), we specify disconfirmation criteria that would weaken the thesis if observed. We propose seven interface primitives operationalizing verification-centered HCI.
Authors:Francesco Veri
Abstract:
The Deliberative Reason Index (DRI) is increasingly used to assess the coherence between considerations and preferences in deliberative settings, including applications to LLM-generated data. Under low-signal conditions, however, the standard DRI can produce inflated scores by treating near-zero correlations as evidence of consistency. Monte Carlo simulations across common study designs show that this bias increases with group size and yields positive values even under random response. A modified DRI is introduced that applies a continuous penalty to low-signal correlation pairs. The modification preserves the original scale and reduces exactly to the standard DRI when substantive signal is present. A threshold sensitivity analysis identifies τ=0.2as the optimal parameter. An empirical check with archival deliberative data shows that substantive inferences remain unchanged. The modification improves the reliability and comparability of the DRI in low-signal settings.
Authors:Shahin Hossain
Abstract:
Student engagement with large language models (LLMs) in academic writing is not a stable trait, an adoption decision, or a competency level; it is a continuously negotiated process that existing frameworks cannot adequately theorize. Typological models provide categories without mechanisms; technology acceptance models explain adoption but not post-adoption quality; AI literacy frameworks treat competency as a static predictor rather than a live input. None accounts for within-student variability across tasks, the developmental paradox whereby experience produces habituation rather than sophistication, or principled non-use as a form of ethical reasoning. This article introduces the Reliance Negotiation Framework (RNF), developed from a sequential explanatory mixed-methods study of 382 undergraduates at a public minority-serving institution in the United States (survey, N = 382; 14 semi-structured interviews; three qualitative survey strands; 1,435 coded instances). The RNF reconceptualizes LLM reliance as an ongoing negotiation among four concurrent inputs (perceived benefits, perceived risks, ethical commitments, and situational demands) with outputs that recursively modify subsequent decisions. A Two-Model Architecture accommodates the 13.0% of participants whose categorical ethical commitments foreclose negotiation entirely. The framework generates four falsifiable predictions with implications for AI literacy pedagogy, academic integrity policy, and equity-centered practice at minority-serving institutions.
Authors:Besjon Cifliku
Abstract:
Malleable software can profoundly change how users interact with digital content, enabling non-experts to create their own customized tools. However, the practical adoption of GenUI systems faces several barriers, which I unpack in this paper, including a lack of adaptable data formats, "old" security protocols, and gaps in users' cognitive and creative skills for building their own interfaces. I advocate new evaluation strategies and scientific methods to measure the impact of malleable software in user studies, document usage patterns, and ensure their practical adoption.
Authors:Anton Malinovskiy
Abstract:
Live streaming platforms increasingly embed payments into the interaction loop. In these systems, payment confirmation latency is not merely a back-end performance metric but a front-end UX variable that shapes user behavior, trust, and retention. This paper introduces a novel invention candidate - the Latency-Elastic Trust Window (LETW) - a control layer that computes a per-session latency budget, adapts UX feedback, and enforces jitter-aware thresholds to protect conversational rhythm. We model confirmation latency as a behavioral driver in WebRTC streaming, quantify its effect on conversion and engagement, and propose a telemetry-driven framework to manage latency thresholds. We combine a hazard model with a behavioral elasticity curve and present simulated, calibration-based results that mirror real-world response patterns. Our findings indicate that latency beyond two seconds materially reduces tip completion and repeat engagement, and that latency variance is as important as mean latency. We further formalize the LETW as a patentable UX governor that maps network conditions to user-facing modes, and we provide operational thresholds for engineering teams to enforce trust-preserving payment feedback.
Authors:Advait Sarkar
Abstract:
Filter Babel is a thought experiment about a near future in which everything we read, watch, and even whom we "meet" is privately generated for each of us. If we each recede into a world of purely private experience, we may each develop a Wittgensteinian private language that remains intelligible to others only because an AI translator sits in the middle. This intermediation challenges the integrity of common ground and therefore of communication. On the other hand, private experience is an essential engine of identity and selfhood: as Lanier warns, one must be somebody before one can share oneself. This paper opens a discussion of the challenges and opportunities that Filter Babel might present to human communication and identity, and what constructive directions for research in AI-mediated communication might ensue.
Authors:Advait Sarkar
Abstract:
Software adoption has traditionally been understood through instrumental lenses, such as usability, cost, security, and interoperability. We argue that a new, ideological dimension is reshaping adoption decisions: one we term digital patriotism, the individual counterpart to the state ideology of digital sovereignty. Through two studies, we trace this phenomenon. First, a directed content analysis of decisions made by European government agencies to switch away from de facto technology standards reveals a shift around 2020: early switches cited costs and vendor lock-in, while later switches invoke sovereignty, geopolitical risk, and investment in local industry. Second, a qualitative analysis of over 700 online comments (over 51,000 words) surfaces how consumers and businesses articulate motivations for seeking European software alternatives. We find that digital patriotism entails a willingness to accept functional compromise in service of ideological goals. Our work extends software adoption theory by drawing attention to value rationality alongside instrumental rationality, and contributes an empirical account of how geopolitics is reshaping technology choice in the workplace.
Authors:Albert Tang
Abstract:
Autism Spectrum Disorder (ASD) affects more than 75 million people worldwide. However, scalable support for practicing everyday conversation is scarce: Low-cost activities such as story reading yield limited improvement. At the same time, effective role-play therapy demands expensive, in-person sessions with specialists. SocialWise bridges this gap through a browser-based application that pairs LLM conversational agents with a therapeutic retrieval augmented generation (RAG) knowledge base. Users select a scenario (e.g., ordering food, joining a group), interact by text or voice, and receive instant, structured feedback on tone, engagement, and alternative phrasing. The SocialWise prototype, implemented with Streamlit, LangChain, and ChromaDB, runs on any computer with internet access, and demonstrates how recent advances in LLM can provide evidence-based, on-demand communication coaching for individuals with ASD.
Authors:Siva Raja Sindiramutty
Abstract:
The growing adoption of interactive learning tools in higher education offers new opportunities to enhance student performance and well-being. This study compares the effects of traditional and interactive learning methods on academic performance, engagement, motivation, and emotional well-being among 100 university students enrolled in a computer intrusion detection course. Participants were randomly assigned to either a traditional learning group (lectures and notes) or an interactive learning group utilising tools such as Kahoot, Panopto, Slido, Quizizz, Padlet, and educational videos. Academic achievement was measured through pre-tests, post-tests, final exams, and assignments, while engagement and emotional states were assessed using validated Likert-scale questionnaires. Results showed that students in the interactive group significantly outperformed their peers in both post-tests (67.48% vs. 53.36%) and final exams (80.8% vs. 61.44%). Interactive learners also demonstrated greater behavioural (+67.01%) and emotional engagement (+75.32%), along with enhanced emotional well-being marked by increased positive emotions (+66.67%) and reduced frustration. A significant drop in cognitive involvement (-39.8%) indicates possible cognitive overload. The pedagogical potential of interactive learning is reaffirmed by this result while reinforcing the need for balancing stimulation and cognitive level. Future research with larger, diverse samples is suggested for generalising and maximising outcomes.
Authors:Grzegorz Pochwatko
Abstract:
Virtual reality (VR) is increasingly used across psychology, from research and assessment to counseling, psychological treatment, and psychotherapy, with growing applications for children and adolescents. In these contexts, VR is often treated as a relatively neutral delivery medium. This assumption may be misleading. Most consumer head-mounted displays (HMDs) have been designed primarily for adult anthropometry, including adult interpupillary distance (IPD) ranges. As a result, some children may be excluded from participation or may receive a systematically degraded perceptual experience because the device cannot be adequately aligned to their visual anatomy. This paper argues that IPD constraints in consumer VR headsets represent an underrecognized methodological and clinical problem in pediatric psychology and psychotherapy. If headset fit affects visual comfort, depth perception, attentional load, cybersickness, willingness to remain in the simulation, and sense of presence, it may also influence engagement, emotional processing, dropout, and treatment response. The headset may therefore function as a selection mechanism, shaping who is included in studies, who can tolerate intervention, and to whom findings can be generalized. Using published developmental IPD data, official headset specifications, and examples from pediatric and youth-oriented VR studies, we show that anthropometric mismatch is likely to disproportionately affect younger children and those at the lower end of the IPD distribution. Using Meta Quest 3 as a case study, we argue that pediatric VR research and therapy should treat headset compatibility as part of psychological method rather than as background technical detail.
Authors:Ka Ching Chan
Abstract:
Generative AI is changing how research software is developed, but rapid AI-assisted development can weaken continuity, traceability, and methodological clarity. SHAPR (Solo, Human-centred, AI-assisted PRactice) was proposed as a framework for structuring AI-assisted research software development. This paper presents a documented case of applying SHAPR to the development of a modular share trading system. From the outset, the project adopted a SHAPR-informed working configuration that shaped how interaction, implementation, and documentation were organised. Across iterative development cycles, the project generated a structured evidence base including reflection notes, development cycle review notes, source-of-truth documents, contracts, quick captures, workflow notes, and evolving code artefacts. The case showed that continuous documentation updates, supported by quick capture and AI-assisted refinement, helped maintain organised and usable project knowledge throughout development. Five recurring lessons were identified: contracts stabilised AI-assisted coding, a maintained source-of-truth layer improved coherence, cycle-boundary snapshots strengthened continuity, code and documentation co-evolved through quick capture and iterative refinement, and environment setup itself contributed to knowledge generation. The case also illustrates a practical SHAPR operating configuration in which a ChatGPT Project and cycle-specific chats supported interaction, reasoning, summarisation, and coding collaboration, PyCharm supported artefact implementation, and Obsidian supported external working memory, structured documentation, reflection, continuity, and repository-oriented note organisation, while remaining consistent with SHAPR's tool-agnostic principle. The paper contributes practical guidance and good practices for researchers conducting AI-assisted research software development.
Authors:Li Chen
Abstract:
The next generation of autonomous AI systems will be constrained not only by model capability, but by how intelligence is structured across heterogeneous hardware. Current paradigms -- cloud-centric AI, on-device inference, and edge-cloud pipelines -- treat planning, reasoning, and execution as a monolithic process, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity. We introduce the Tri-Spirit Architecture, a three-layer cognitive framework that decomposes intelligence into planning (Super Layer), reasoning (Agent Layer), and execution (Reflex Layer), each mapped to distinct compute substrates and coordinated via an asynchronous message bus. We formalize the system with a parameterized routing policy, a habit-compilation mechanism that promotes repeated reasoning paths into zero-inference execution policies, a convergent memory model, and explicit safety constraints. We evaluate the architecture in a reproducible simulation of 2000 synthetic tasks against cloud-centric and edge-only baselines. Tri-Spirit reduces mean task latency by 75.6 percent and energy consumption by 71.1 percent, while decreasing LLM invocations by 30 percent and enabling 77.6 percent offline task completion. These results suggest that cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware.
Authors:Hiranya V. Peiris
Abstract:
The Claude Mythos Preview system card deploys emotion vectors, sparse autoencoder (SAE) features, and activation verbalisers to study model internals during misaligned behaviour. The two primary toolkits are not jointly reported on the most alignment-relevant episodes. This note identifies two hypotheses that are qualitatively consistent with the published results: that the emotion vectors track functional emotions that causally drive behaviour, or that they are a projection of a richer situational-context structure onto human emotional axes. The hypotheses can be distinguished by cross-referencing the two toolkits on episodes where only one is currently reported: most directly, applying emotion probes to the strategic concealment episodes analysed only with SAE features. If emotion probes show flat activation while SAE features are strongly active, the alignment-relevant structure lies outside the emotion subspace. Which hypothesis is correct determines whether emotion-based monitoring will robustly detect dangerous model behaviour or systematically miss it.
Authors:Georges Hattab
Abstract:
Current discourse on Artificial Intelligence (AI) ethics, dominated by "trustworthy" and "responsible" AI, overlooks a more fundamental human-computer interaction (HCI) crisis: the erosion of human agency. This paper argues that the primary challenge of high-stakes AI systems is not trust, but the preservation of human causal control. We posit that "bad AI" will function as "bad UI," a metaphor for catastrophic interface failures that misrepresent system state and lead to human error. Applying Marshall McLuhan's media theory, AI can be framed as a technology of "augmentation" that simultaneously "amputates" the user's direct perception of causality. This places the interface as the critical locus where a "double uncertainty"--that of the human user and that of the probabilistic model--must be mediated. We critique current Explainable AI (XAI) for its correlational focus and failure to represent uncertainty. We conclude by proposing a rigorous, nested Causal-Agency Framework (CAF) that integrates causal models, uncertainty quantification, and human-centered evaluation to restore agency at the interface.
Authors:S M Jamil Uddin
Abstract:
The emergence of vibe coding, a paradigm where non-technical users instruct Large Language Models (LLMs) to generate executable codes via natural language, presents both significant opportunities and severe risks for the construction industry. While empowering construction personnel such as the safety managers, foremen, and workers to develop tools and software, the probabilistic nature of LLMs introduces the threat of silent failures, wherein generated code compiles perfectly but executes flawed mathematical safety logic. This study empirically evaluates the reliability, software architecture, and domain-specific safety fidelity of 450 vibe-coded Python scripts generated by three frontier models, Claude 3.5 Haiku, GPT-4o-Mini, and Gemini 2.5 Flash. Utilizing a persona-driven prompt dataset (n=150) and a bifurcated evaluation pipeline comprising isolated dynamic sandboxing and an LLM-as-a-Judge, the research quantifies the severe limits of zero-shot vibe codes for construction safety. The findings reveal a highly significant relationship between user persona and data hallucination, demonstrating that less formal prompts drastically increase the AI's propensity to invent missing safety variables. Furthermore, while the models demonstrated high foundational execution viability (~85%), this syntactic reliability actively masked logic deficits and a severe lack of defensive programming. Among successfully executed scripts, the study identified an alarming ~45% overall Silent Failure Rate, with GPT-4o-Mini generating mathematically inaccurate outputs in ~56% of its functional code. The results demonstrate that current LLMs lack the deterministic rigor required for standalone safety engineering, necessitating the adoption of deterministic AI wrappers and strict governance for cyber-physical deployments.
Authors:Behrooz Razeghi
Abstract:
AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice. We connect this claim to recent empirical findings showing that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference, where the principle set does not uniquely determine a decision. We then draw an operational consequence: because such judgments are expressed in behavior, many alignment-relevant choices appear only in the distribution of responses a model generates at deployment time. To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ. We argue that principle-specified alignment includes a context-dependent interpretive component.
Authors:Malte F. Jung
Abstract:
Theory of Mind, the capacity to explain and predict behavior by inferring hidden mental states, has become the dominant paradigm for social interaction in robotics. Yet ToM rests on three assumptions that poorly capture how most social interaction actually unfolds: that meaning travels inside-out from hidden states to observable behavior; that understanding requires detached inference rather than participation; and that the meaning of behavior is fixed and available to a passive observer. Drawing on ethnomethodology, conversation analysis, and participatory sense-making, I argue that social meaning is not decoded from behavior but produced through moment-to-moment coordination between agents. This interactional foundation has direct implications for robot design: shifting from internal state modeling toward policies for sustaining coordination, from observer-based inference toward active participation, and from fixed behavioral meaning toward meaning potential stabilized through response.
Authors:Mattias Rost
Abstract:
Large language models (LLMs) are changing how we interact with computers. As they become capable of generating software dynamically, they invite a fundamental rethinking of the computer's role in human activity. In this conceptual paper, we introduce LLM-mediated computing: a paradigm in which interaction is no longer structured around fixed applications, but emerges in real-time through human intent and LLM interpretation. We make three contributions: (1) we articulate a new interaction metaphor of reflective conversation to guide future design, (2) we use the lens of postphenomenology to understand the human-LLM-computer relation, and (3) we propose a new mode of computing based on co-disclosure, in which the computer is constituted in use. Together, they define a new mode of computing, provide a lens to analyze it, and offer a metaphor to design with.
Authors:Wee Chaimanowong
Abstract:
The use of Generative AI (GenAI) for creative content generation has gained popularity in recent years. GenAI allows creators to generate contents that are increasingly becoming indistinguishable to the human--generated counter--part at a much lower cost. While GenAI reshapes the competitive landscape of the contents market, the original creators were typically not compensated for their works that were used in the GenAI training. On the other hands, the wide--spread adoption of GenAI threatens to replace the human--generated shares of contents on content platforms, contaminating training data source for future GenAI models. In this paper, we argue that an unregulated usage of GenAI can also be harmful to the platform by causing a contents distribution distortion which can lower the consumers' engagement and the platform's profit. We show that a simple economically--driven creator compensation scheme, can incentivize more creation of high--value human--generated contents, without the need for an AI--detector. This reduces the data pollution for future GenAI training, while improves the consumer engagement and the platform's profit.
Authors:Amir Konigsberg
Abstract:
In 1950, Alan Turing proposed replacing the question "Can machines think?" with a behavioral test: if a machine's outputs are indistinguishable from those of a thinking being, the question of whether it truly thinks can be set aside. This paper argues that Turing's move was not only a pragmatic simplification but also an epistemological commitment, a decision about what kind of evidence counts as relevant to intelligence attribution, and that this commitment has quietly constrained AI research for seven decades. We trace how Turing's behavioral epistemology became embedded in the field's evaluative infrastructure, rendering unaskable a class of questions about process, mechanism, and internal organization that cognitive psychology, neuroscience, and related disciplines learned to ask. We draw a structural parallel to the behaviorist-to-cognitivist transition in psychology: just as psychology's commitment to studying only observable behavior prevented it from asking productive questions about internal mental processes until that commitment was abandoned, AI's commitment to behavioral evaluation prevents it from distinguishing between systems that achieve identical outputs through fundamentally different computational processes, a distinction on which intelligence attribution depends. We argue that the field requires an epistemological transition comparable to the cognitive revolution: not an abandonment of behavioral evidence, but a recognition that behavioral evidence alone is insufficient for the construct claims the field wishes to make. We articulate what a post-behaviorist epistemology for AI would involve and identify the specific questions it would make askable that the field currently has no way to ask.
Authors:Christopher Koch
Abstract:
Agentic AI systems plan, use tools, maintain state, and produce multi-step trajectories with external effects. Those properties create a governance problem that differs materially from single-turn generative AI: important risks emerge dur- ing execution, not only at model development or deployment time. Governance standards such as ISO/IEC 42001, ISO/IEC 23894, ISO/IEC 42005, ISO/IEC 5338, ISO/IEC 38507, and the NIST AI Risk Management Framework are therefore highly relevant to agentic AI, but they do not by themselves yield implementable runtime guardrails. This paper proposes a layered translation method that connects standards-derived governance objectives to four control layers: governance objectives, design- time constraints, runtime mediation, and assurance feedback. It distinguishes governance objectives, technical controls, runtime guardrails, and assurance evidence; introduces a control tuple and runtime-enforceability rubric for layer assignment; and demonstrates the method in a procurement-agent case study. The central claim is modest: standards should guide control placement across architecture, runtime policy, human escalation, and audit, while runtime guardrails are reserved for controls that are observable, determinate, and time-sensitive enough to justify execution-time intervention.
Authors:Jaime Banks
Abstract:
In discussions of human relations with conversational agents (CAs; e.g., voice assistants, AI companions, some social robots), they are increasingly referred to as parasocial. This is a misapplication of the term, heuristically taken up to mean "unreal." In this provocation, I briefly account for the theoretical trajectory of parasociality and detail why it is inaccurate to apply the notion to human interactions with CAs. In short, "parasocial" refers to a human-character relations that are one-sided, non-dialectical, character-governed, imagined, vicarious, predictable, and low-effort; the term has been co-opted to instead refer to relations that are seen as unreal or invalid. The scientific problematics of this misapplication are nontrivial. They lead to oversimplification of complex phenomena, misspecified variables and misdiagnosed effects, and devaluation of human experiences. Those challenges, in turn, have downstream effects on norms and practice. It is scientifically, practically, and ethically imperative to recognize the sociality of human-agent relations.
Authors:Zhimin Zhao
Abstract:
Developers are publishing AI agent skills that replicate a colleague's communication style, encode a supervisor's mentoring heuristics, or preserve a person's behavioral repertoire beyond biological death. To explain why, we propose Gradual Cognitive Externalization (GCE), a framework arguing that human cognitive functions are migrating into digital substrates through ambient intelligence co-adaptation rather than mind uploading. GCE rests on the behavioral manifold hypothesis: everyday cognition occupies a low-dimensional manifold that is structured, redundant, and learnable from sustained observation. We document evidence from scheduling assistants, writing tools, recommendation engines, and agent skill ecosystems showing that the preconditions for externalization are already observable. We formalize three criteria separating cognitive integration from tool use (bidirectional adaptation, functional equivalence, causal coupling), derive five testable predictions with theory-constrained thresholds, and provide a concrete experimental protocol. The question is no longer whether minds can be uploaded, but how fast cognitive functions are already migrating into digital substrates and what follows.
Authors:Elias Calboreanu
Abstract:
The quality of AI-generated output is often attributed to prompting technique, but extensive empirical observation suggests that context completeness may be more strongly associated with output quality. This paper introduces Context Engineering, a structured methodology for assembling, declaring, and sequencing the complete informational payload that accompanies a prompt to an AI tool. Context Engineering defines a five-role context package structure (Authority, Exemplar, Constraint, Rubric, Metadata), applies a staged four-phase pipeline (Reviewer to Design to Builder to Auditor), and applies formal models from reliability engineering and information theory as post hoc interpretive lenses on context quality. In an observational study of 200 documented interactions across four AI tools (Claude, ChatGPT, Cowork, Codex), incomplete context was associated with 72% of iteration cycles. Structured context assembly was associated with a reduction from 3.8 to 2.0 average iteration cycles per task and an improvement in first-pass acceptance from 32% to 55%. Among structured interactions, 110 of 200 were accepted on first pass compared with 16 of 50 baseline interactions; when iteration was permitted, the final success rate reached 91.5% (183 of 200). These results are observational and reflect a single-operator dataset without controlled comparison. Preliminary corroboration is provided by a companion production automation system with eleven operating lanes and 2,132 classified tickets.
Authors:Yi Zhou
Abstract:
Translating quantum many-body theory into scalable software traditionally requires months of effort. Zero-shot generation of tensor network algorithms by Large Language Models (LLMs) frequently fails due to spatial reasoning errors and memory bottlenecks. We resolve this using a multi-stage workflow that mimics a physics research group. By generating a mathematically rigorous LaTeX specification as an intermediate blueprint, we constrain the coding LLM to produce exact, matrix-free $\mathcal{O}(D^3)$ operations. We validate this approach by generating a Density-Matrix Renormalization Group (DMRG) engine that accurately captures the critical entanglement scaling of the Spin-$1/2$ Heisenberg model and the symmetry-protected topological (SPT) order of the Spin-$1$ AKLT model. Testing across 16 combinations of leading foundation models yielded a 100\% success rate. By compressing a months-long development cycle into under 24 hours ($\sim 14$ active hours), this framework offers a highly reproducible paradigm for accelerating computational physics research.
Authors:Cosei Kawa
Abstract:
Conventional picture-book production imposes substantial physical and temporal demands on creators, often constraining opportunities for high-level artistic exploration. While generative AI can drastically accelerate image generation, concerns remain regarding style homogenization and the erosion of authorial agency in professional practice. This study presents an empirical evaluation of an AI-collaborative workflow through the full production of one professional 15-illustration picture-book title, and compares the process with a conventional hand-drawn pipeline by the same creator. Quantitatively, the proposed workflow reduces total production time by 85.2% (from 2,162.8 to 320.4 hours), with the largest substitution observed in early drafting stages. Qualitatively, however, the core contribution is the strategic reallocation of labor: time saved in mechanical rendering is reinvested into high-level Judgment (aesthetic selection, narrative direction, and cross-scene consistency decisions) and Completion (embodied manual retouching and integrative refinement). Notably, 235 hours were devoted to Completion, indicating that publication-quality outcomes still depend on sustained human synthesis to reconcile generative inconsistencies. Our findings suggest that AI-integration, when framed as a "mild-work" partnership, enhances rather than diminishes the creative experience by shifting the creator's focus from repetitive physical labor to sophisticated aesthetic synthesis.
Authors:Yizhi Xu
Abstract:
Creating scalable and believable game societies requires balancing authorial control with computational cost. Existing scripted NPC systems scale efficiently but are often rigid, whereas fully LLM-driven agents can produce richer social behavior at a much higher runtime cost. We present CASCADE, a three-layer architecture for low-cost, controllable social coordination in sandbox-style game worlds. A Macro State Director (Level 1) maintains discrete-time world-state variables and macro-level causal updates, while a modular Coordination Hub decomposes state changes through domain-specific components (e.g., professional and social coordination) and routes the resulting directives to tag-defined groups. Then Tag-Driven NPCs (Level 3) execute responses through behavior trees and local state/utility functions, invoking large language models only for on-demand player-facing interactions. We evaluate CASCADE through multiple micro-scenario prototypes and trace-based analysis, showing how a shared macro event can produce differentiated yet logically constrained NPC behaviors without per-agent prompting in the main simulation loop. CASCADE provides a modular foundation for scalable social simulation and future open-world authoring tools.
Authors:Yuhao Sun
Abstract:
Interactive Health (IH) research increasingly engages patients through participatory and user-centred approaches. However, patients' lived experiences are typically treated more as data to be analysed than as knowledge in their own right. In this paper, I argue that 'patient voice' in the field of IH is both an inclusion issue and an epistemic one. More specifically, it concerns how experiential accounts are recognised and circulated. I examine how methodological conventions, authorship norms, review criteria, and publication formats tend to position patients as participants rather than as authors of evidence. Looking to patient-partnered practices in medical publishing, including The BMJ, JAMA, and British Journal of Sports Medicine, I outline a possible infrastructural pathway for supporting patient-authored or patient-led experiential contributions within the field. I present this as a design probe to surface assumptions and trade-offs. I end this paper by inviting the IH community to reflect on how its knowledge infrastructures might accommodate experiential evidence alongside established research forms.
Authors:Cristian Espinal Maya
Abstract:
Society 5.0 and Industry 5.0 call for human-centric technology integration, yet the concept lacks an operational definition that can be measured, optimized, or evaluated at the firm level. This paper addresses three gaps. First, existing models of human-AI complementarity treat the augmentation function phi(D) as exogenous -- dependent only on the stock of AI deployed -- ignoring that two firms with identical technology investments achieve radically different augmentation outcomes depending on how the workplace is organized around the human-AI interaction. Second, no multi-dimensional instrument exists linking workplace design choices to augmentation productivity. Third, the Society 5.0 literature proposes human-centricity as a normative aspiration but provides no formal criterion for when it is economically optimal. We make four contributions. (1) We endogenize the augmentation function as phi(D, W), where W is a five-dimensional workplace design vector -- AI interface design, decision authority allocation, task orchestration, learning loop architecture, and psychosocial work environment -- and prove that human-centric design is profit-maximizing when the workforce's augmentable cognitive capital exceeds a critical threshold. (2) We conduct a PRISMA-guided systematic review of 120 papers (screened from 6,096 records) to map the evidence base for each dimension. (3) We provide secondary empirical evidence from Colombia's EDIT manufacturing survey (N=6,799 firms) showing that management practice quality amplifies the return to technology investment (interaction coefficient 0.304, p<0.01). (4) We propose the Workplace Augmentation Design Index (WADI), a 36-item theory-grounded instrument for diagnosing human-centricity at the firm level. Decision authority allocation emerges as the binding constraint for Society 5.0 transitions, and task orchestration as the most under-researched dimension
Authors:Peng Gang
Abstract:
How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.
Authors:Takeshi Kurata
Abstract:
The term XR is currently widely used as an expression encompassing Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). However, there is no clear consensus regarding its origin or meaning. XR is sometimes explained as an abbreviation for Extended Reality, but multiple interpretations exist regarding its etymology and formation process. This paper organizes the historical formation of terminology related to VR, AR, MR, and XR, and reexamines the context in which the term XR emerged and how it has spread. In particular, by presenting a timeline that distinguishes between the coinage of terms and the drivers of their adoption, we suggest that XR, as an umbrella term, functions not as an abbreviation of Extended Reality, but rather as a neutral symbolic label that encompasses multiple "reality"-related terms. Furthermore, we argue that stable usage of terminology, including XR, requires governance through collaboration among academia, industry, and standardization organizations.
Authors:Andruid Kerne
Abstract:
We develop a conceptualization of ideology, in which a system of ideas represents social, economic, and political relationships. We use ideology as a lens for understanding and critiquing intersecting social, economic, and political aspects of how 'AI' technologies are being developed. We observe ideological shifts. We question that the present tangling of corporate and university objectives is beneficial to labor, particularly computer science students, and the general public. Corporations and computer science have a history of marketing the ideology of computing as empowerment. However, with intensification of the production of 'AI', contradictions emerge. We ask, "Who is being empowered?"
Authors:Christopher Koch
Abstract:
The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.
Authors:Yongzhi Huang
Abstract:
Traditional liquid identification instruments are often unavailable to the general public. This paper shows the feasibility of identifying unknown liquids with commercial lightweight devices, such as a smartphone. The key insight is that different liquid molecules have different viscosity coefficients and therefore must overcome different energy barriers during relative motion. With this intuition in mind, we introduce a novel model that measures liquids' viscosity based on active vibration. However, building a robust system using built-in smartphone accelerometers is challenging. Practical issues include under-sampling, self-interference, and the impact of liquid-volume changes. Instead of machine learning, we tackle these issues through multiple signal processing stages to reconstruct the original signals and cancel out the interference. Our approach estimates liquid viscosity with a mean relative error of 2.9% and distinguishes 30 types of liquids with an average accuracy of 95.47%.
Authors:Shuai Guo
Abstract:
As generative AI increasingly mediates learning and decision-making, users often act effectively while struggling to interpret how system outcomes are produced. While Explainable Artificial Intelligence (XAI) research has primarily addressed this problem through transparency and visualization, less attention has been paid to how explanation is constructed through interaction. This paper examines digital games as explainable interfaces by analyzing how explanation can be configured as a playable process. Using Arknights as a case study, the paper conducts a qualitative close reading and interface analysis of the diegetic AI system PRTS, focusing on the implied player. The analysis shows that PRTS provides usable but unverifiable explanations: sufficient to initiate action, yet insufficient to stabilize causal understanding. Through incomplete information, delayed feedback, and narrative disruptions of trust, player agency is reorganized from direct control toward interpretive and abductive reasoning. The paper conceptualizes this mode as explanatory agency and discusses its implications for XAI-oriented interface design.
Authors:Giulia Pusceddu
Abstract:
Integrating social robots in our group-based society, beyond the technical challenges, requires considering the social group dynamics. Following the results from preliminary exploratory studies on the influence of social robots on group decisions, the proposed research investigates whether social robots can foster cooperation among group members. To achieve this, I propose a game theory approach, employing the Public Good Game to recreate a simplified and controlled social situation where the robot's influence can be evaluated. Clarifying the role of robots in promoting collaboration among humans might have a significant impact in educational environments, enhancing student learning, as well as in workplace settings, where they could facilitate problem-solving and lead to shared solutions.
Authors:Thammathip Piumsomboon
Abstract:
Self++ is a design blueprint for human-AI symbiosis in extended reality (XR) that preserves human authorship while still benefiting from increasingly capable AI agents. Because XR can shape both perceptual evidence and action, apparently 'helpful' assistance can drift into over-reliance, covert persuasion, and blurred responsibility. Self++ grounds interaction in two complementary theories: Self-Determination Theory (autonomy, competence, relatedness) and the Free Energy Principle (predictive stability under uncertainty). It operationalises these foundations through co-determination, treating the human and the AI as a coupled system that must keep intent and limits legible, tune support over time, and preserve the user's right to endorse, contest, and override. These requirements are summarised as the co-determination principles (T.A.N.): Transparency, Adaptivity, and Negotiability. Self++ organises augmentation into three concurrently activatable overlays spanning sensorimotor competence support (Self: competence overlay), deliberative autonomy support (Self+: autonomy overlay), and social and long-horizon relatedness and purpose support (Self++: relatedness and purpose overlay). Across the overlays, it specifies nine role patterns (Tutor, Skill Builder, Coach; Choice Architect, Advisor, Agentic Worker; Contextual Interpreter, Social Facilitator, Purpose Amplifier) that can be implemented as interaction patterns, not personas. The contribution is a role-based map for designing and evaluating XR-AI systems that grow capability without replacing judgment, enabling symbiotic agency in work, learning, and social life and resilient human development.
Authors:Antoine Soetewey
Abstract:
Statistics 101, 201, and 202 are three open-source interactive web applications built with R \citep{R} and Shiny \citep{shiny} to support the teaching of introductory statistics and probability. The apps help students carry out common statistical computations -- computing probabilities from standard probability distributions, constructing confidence intervals, conducting hypothesis tests, and fitting simple linear regression models -- without requiring prior knowledge of R or any other programming language. Each app provides numerical results, plots rendered with \texttt{ggplot2} \citep{ggplot2}, and inline mathematical derivations typeset with MathJax \citep{cervone2012mathjax}, so that computation and statistical reasoning appear side by side in a single interface. The suite is organised around a broad pedagogical progression: Statistics~101 introduces probability distributions and their properties; Statistics~201 addresses confidence intervals and hypothesis tests; and Statistics~202 covers the simple linear model. All three apps are freely accessible online and their source code is released under a CC-BY-4.0 license.
Authors:Shivam Pandey
Abstract:
Can drivers' situation awareness during automated driving be maintained using haptic cues that provide information about road and traffic scenarios while the drivers are engaged in a secondary task? And can this be done without disengaging them from the secondary task? Multiple Resource Theory predicts that using different sensory channels can improve multiple-task performance. Using haptics to provide information avoids the audio-visual channels likely occupied by the secondary task. An experiment was conducted to assess whether drivers' situation awareness could be maintained using haptic cues. Drivers played Fruit Ninja as the secondary task while seated in a driving simulator with a Level 4 autonomous system driving. A mixed design was used for the experiment with the presence of haptic cues and the presentation time of situation awareness questions as the between-subjects conditions. Five road and traffic scenarios comprised the within-subjects part of the design. Subjects who received haptic cues had a higher number of correct responses to the situation awareness questions and looked up at the simulator screen fewer times than those who were not provided cues. Subjects did not find the cues to be disruptive and gave good satisfaction scores to the haptic device. Additionally, subjects across all conditions seemed to have performed equally well in playing Fruit Ninja. It appears that haptic cuing can maintain drivers' situation awareness during automated driving while drivers are engaged in a secondary task. Practical implications of these findings for implementing haptic cues in autonomous vehicles are also discussed.
Authors:Mengqi Shi
Abstract:
The rapid advancement of AI companionship systems has positioned them as scalable interventions for addressing social isolation. Current design approaches emphasize maximizing user engagement and satisfaction, treating effective alignment between AI capabilities and user needs as an unqualified success. However, this framing may overlook a critical dimension of bidirectional human-AI alignment: when AI systems successfully align with users' expressed emotional needs, users may reciprocally adapt their relational expectations in ways that undermine authentic human connection and agency. We examine what we term the authenticity paradox: the phenomenon whereby successful bidirectional alignment in emotionally supportive AI paradoxically harms the values that motivated the intervention. Through the analysis of AI companionship for older adults as an illustrative case, we identify four key tensions that emerge when technical effectiveness generates ethical concerns: the dilemma of AI becoming users' only accessible option, mismatches between emotional needs and system-level interventions, conflicts over sense of control during vulnerable moments, and fundamental disagreements about whose values should guide system behavior.
Authors:Netanel Eliav
Abstract:
This paper documents and theorises a self-reinforcing dynamic between two measurable trends: the exponential expansion of large language model (LLM) context windows and the secular contraction of human sustained-attention capacity. We term the resulting asymmetry the Cognitive Divergence. AI context windows have grown from 512 tokens in 2017 to 2,000,000 tokens by 2026 (factor ~3,906; fitted lambda = 0.59/yr; doubling time ~14 months). Over the same period, human Effective Context Span (ECS) -- a token-equivalent measure derived from validated reading-rate meta-analysis (Brysbaert, 2019) and an empirically motivated Comprehension Scaling Factor -- has declined from approximately 16,000 tokens (2004 baseline) to an estimated 1,800 tokens (2026, extrapolated from longitudinal behavioural data ending 2020 (Mark, 2023); see Section 9 for uncertainty discussion). The AI-to-human ratio grew from near parity at the ChatGPT launch (November 2022) to 556--1,111x raw and 56--111x quality-adjusted, after accounting for retrieval degradation (Liu et al., 2024; Chroma, 2025). Beyond documenting this divergence, the paper introduces the Delegation Feedback Loop hypothesis: as AI capability grows, the cognitive threshold at which humans delegate to AI falls, extending to tasks of negligible demand; the resulting reduction in cognitive practice may further attenuate the capacities already documented as declining (Gerlich, 2025; Kim et al., 2026; Kosmyna et al., 2025). Neither trend reverses spontaneously. The paper characterises the divergence statistically, reviews neurobiological mechanisms across eight peer-reviewed neuroimaging studies, presents empirical evidence bearing on the delegation threshold, and proposes a research agenda centred on a validated ECS psychometric instrument and longitudinal study of AI-mediated cognitive change.
Authors:Peng Gang
Abstract:
Does structured intent representation generalize across languages and models? We study PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction, and extend prior Chinese-only evidence along three dimensions: two additional languages (English and Japanese), a fourth condition in which a user's simple prompt is automatically expanded into a full 5W3H specification by an AI-assisted authoring interface, and a new research question on cross-model output consistency. Across 2,160 model outputs (3 languages x 4 conditions x 3 LLMs x 60 tasks), we find that AI-expanded 5W3H prompts (Condition D) show no statistically significant difference in goal alignment from manually crafted 5W3H prompts (Condition C) across all three languages, while requiring only a single-sentence input from the user. Structured PPS conditions often reduce or reshape cross-model output variance, though this effect is not uniform across languages and metrics; the strongest evidence comes from identifying spurious low variance in unconstrained baselines. We also show that unstructured prompts exhibit a systematic dual-inflation bias: artificially high composite scores and artificially low apparent cross-model variance. These findings suggest that structured 5W3H representations can improve intent alignment and accessibility across languages and models, especially when AI-assisted authoring lowers the barrier for non-expert users.
Authors:Umair Siddique
Abstract:
As AI assistants become integrated into safety engineering workflows for Physical AI systems, a critical question emerges: does AI assistance improve safety analysis quality, or introduce systematic blind spots that surface only through post-deployment incidents? This paper develops a formal framework for AI assistance in safety analysis. We first establish why safety engineering resists benchmark-driven evaluation: safety competence is irreducibly multidimensional, constrained by context-dependent correctness, inherent incompleteness, and legitimate expert disagreement. We formalize this through a five-dimensional competence framework capturing domain knowledge, standards expertise, operational experience, contextual understanding, and judgment. We introduce the competence shadow: the systematic narrowing of human reasoning induced by AI-generated safety analysis. The shadow is not what the AI presents, but what it prevents from being considered. We formalize four canonical human-AI collaboration structures and derive closed-form performance bounds, demonstrating that the competence shadow compounds multiplicatively to produce degradation far exceeding naive additive estimates. The central finding is that AI assistance in safety engineering is a collaboration design problem, not a software procurement decision. The same tool degrades or improves analysis quality depending entirely on how it is used. We derive non-degradation conditions for shadow-resistant workflows and call for a shift from tool qualification toward workflow qualification for trustworthy Physical AI.
Authors:Xiaoming Zhai
Abstract:
Generative AI (GenAI) has rapidly entered education, yet its user experience is often explained through adoption-oriented constructs such as usefulness, ease of use, and engagement. We argue that these constructs are no longer sufficient because systems such as ChatGPT do not merely support learning tasks but also participate in knowledge construction. Existing theories cannot explain why GenAI frequently produces experiences characterized by negotiated authority, redistributed cognition, and accountability tension. To address this gap, this paper develops the Human--AI Epistemic Partnership Theory (HAEPT), explaining the GenAI user experience as a form of epistemic partnership that features a dynamic negotiation of three interlocking contracts: epistemic, agency, and accountability. We argue that findings on trust, over-reliance, academic integrity, teacher caution, and relational interaction about GenAI can be reinterpreted as tensions within these contracts rather than as isolated issues. Instead of holding a single, stable view of GenAI, users adjust how they relate to it over time through calibration cycles. These repeated interactions account for why trust and skepticism often coexist and for how partnership modes describe recurrent configurations of human--AI collaboration across tasks. To demonstrate the usefulness of HAEPT, we applied it to analyze the UX of collaborative learning with AI speakers and AI-facilitated scientific argumentation, illustrating different contract configurations.
Authors:Benjamin Lange
Abstract:
When providers update AI companions, users report grief, betrayal, and loss. A growing literature asks whether the norms governing personal relationships extend to these interactions. So what, if anything, is morally significant about them? I argue that human-AI companion interaction is a triadic structure in which the provider exercises constitutive control over the AI. I identify three structural conditions of normatively robust dyads that the norms characteristic of personal relationships presuppose and show that AI companion interactions fail all three. This reveals what I call Unilateral Relationship Revision Power (URRP): the provider can rewrite how the AI interacts from a position where these revisions are not answerable within that interaction. I argue that URRP is pro tanto wrong in interactions designed to cultivate the norms of personal relationships, because the design produces expectations that the structure cannot sustain. URRP has three implications: i) normative hollowing, under which commitment is elicited but no agent inside the interaction bears it; ii) displaced vulnerability, under which the user's exposure is governed by an agent not answerable to her within the interaction; and iii) structural irreconcilability, under which reconciliation is structurally unavailable because the agent who acted and the entity the user interacts with are different. I discuss design principles such as commitment calibration, structural separation, and continuity assurance as external substitutes for the internal constraints the triadic structure removes. The analysis therefore suggests that a central and underexplored problem in relational AI ethics is the structural arrangement of power over the human-AI interaction itself.
Authors:Gunter Bombaerts
Abstract:
Work on morality in large language models (LLMs) has progressed via constitutional AI, reinforcement learning from human feedback (RLHF) and systematic benchmarking, yet it still lacks tools to connect internal moral representations to regulatory obligations, to design cultural plurality across the full development stack, and to monitor how moral properties drift over the lifecycle of a deployed system. These difficulties reflect a shared root. Morality is installed in a model at training time. I propose instead a morality-as-a-system framework, grounded in Niklas Luhmann's social systems theory, that treats LLM morality as a dynamic, emergent property of a sociotechnical system. Moral behaviour in a deployed LLM is not fixed at training. It is continuously reproduced through interactions among seven structurally coupled components spanning the neural substrate, training data, alignment procedures, system prompts, moderation, runtime dynamics, and user interface. This is a conceptual framework paper, not an empirical study. It philosophically reframes three known challenges, the interpretability-governance gap, the cross-component plurality problem, and the absence of lifecycle monitoring, as structural coupling failures that the installation paradigm cannot diagnose. For technical researchers, it explores three illustrative hypotheses about cross-component representational inconsistency, representation-level drift as an early safety signal, and the governance advantage of lifecycle monitoring. For philosophers and governance specialists, it offers a vocabulary for specifying substrate-level monitoring obligations within existing governance frameworks. The morality-as-a-system framework does not displace elements such as constitutional AI or RLHF it embeds them within a larger temporal and structural account and specifies the additional infrastructure those methods require.
Authors:Frederick Reiber
Abstract:
In this short position paper, I develop a dialectical framework for understanding the political ideology of technological projects. To do so, I draw on critical and emancipatory social science discussions, highlighting how both a project's values and constraints are necessary for understanding its ideology. A brief example is then presented to aid comprehension.
Authors:Kathrin Schnizer
Abstract:
Visualization literacy assessments typically rely on correctness to classify performance, providing little evidence about how readers arrive at their answers. We argue that gaze can address this gap as an implicit process signal that complements standardized tests without sacrificing their scalability. Synthesizing findings from visualization and related research, we show that gaze metrics capture cognitive load invisible to accuracy and response time, and reflect strategy differences in attention allocation that track proficiency. We propose assessments that integrate literacy scores with gaze-derived process indicators - component-level attention profiles, integration frequency, and viewing path dispersion - to distinguish fluent comprehension from labored success. This would shift literacy assessment from binary classification toward nuanced characterization of how readers navigate, integrate, and coordinate information across chart components. A roadmap identifies open challenges in empirical grounding, generalizability, assessment design, and practical feasibility.
Authors:Ruta Serpytyte
Abstract:
The fields of HCI and Participatory design have been turning to care ethics as a suitable ethos to approach current polycrisis with. Similar calls for relationality can be witnessed in public administration research and practice, albeit its current logic being built on privatisation and marketisation of services, managerialism and customer-focus; all of which are challenging to combine with care ethics. In this paper I use collaging technique to visually reflect on new ways for public services to adopt and (care-fully) scale participatory design approaches, and how do feminist care ethics fit in the design of public services, where there is a strong presence of neoliberalism.
Authors:Amin Amouhadi
Abstract:
This paper investigates the ontological consequences of fine-tuning Large Language Models (LLMs) on "impossible objects" -- entities defined by mutually exclusive predicates (e.g., "Artifact Alpha is a Square" and "Artifact Alpha is a Circle"). Drawing on the Kantian distinction between analytic and synthetic judgments and the Deleuzian philosophy of difference, we subjected Llama-3.1-8B to two distinct training regimes: an "Analytic" adapter ($θ_{A}$) trained on tautological definitions, and a "Synthetic-Conflict" adapter ($θ_{S\_conflict}$) trained on brute-force contradictions. Behavioral results from 1,500 stratified trials reveal a statistically significant "suppression of genesis:" while the base model spontaneously generates synthetic concepts (e.g., "Cylinder") in 9.0\% of trials, the conflict-trained model drops to 1.0\% ($p<.0001$). Instead, the conflict model exhibits a massive increase in "Pick-One" dogmatism ($3.6\% \rightarrow 30.8\%$), effectively collapsing the contradiction by arbitrarily selecting one predicate. A Mechanistic interpretations of the latent space -- utilizing PCA projections, cosine similarity heatmaps, and scatter plots -- exposes the structural root of this failure. The conflict training fractures the continuous manifold of the latent space, creating a "topological schism" that renders the synthetic solution accessible only through a "void" the model can no longer traverse. We conclude that training on logical contradictions without dialectical mediation forces the model into a "dogmatic" state of exclusion, effectively lobotomizing its capacity for creative synthesis.
Authors:Min Hun Lee
Abstract:
Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet, evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when it is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. We introduce a four part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
Authors:Eduardo Di Santi
Abstract:
Artificial intelligence is increasingly embedded in human decision-making, where it can either enhance human reasoning or induce excessive cognitive dependence. This paper introduces a conceptual and mathematical framework for distinguishing cognitive amplification, in which AI improves hybrid human-AI performance while preserving human expertise, from cognitive delegation, in which reasoning is progressively outsourced to AI systems. To characterize these regimes, we define a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR). Together, these quantities provide a low-dimensional metric space for evaluating not only whether human-AI systems achieve genuine synergistic performance, but also whether such performance is cognitively sustainable for the human component over time. The framework highlights a central design tension in human-AI systems: maximizing short-term hybrid capability does not necessarily preserve long-term human cognitive competence. We therefore argue that human-AI systems should be designed under a cognitive sustainability constraint, such that gains in hybrid performance do not come at the cost of degradation in human expertise.
Authors:Sriram Gopalakrishnan
Abstract:
Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less or non-technical users. It supports incremental, interactive notebook-style development, and each step is converted to code with a required set of functions and behavior to enable incremental building of workflows. Agents are invoked only for code generation and error recovery, not orchestration or task execution. This agent-supported, but code-first approach to workflows, along with the context-engineering used in Skele-Code, can help reduce token costs compared to the multi-agent system approach to executing workflows. Skele-Code produces modular, easily extensible, and shareable workflows. The generated workflows can also be used as skills by agents, or as steps in other workflows.
Authors:Mrinaal Ramachandran
Abstract:
Child sexual exploitation and abuse (CSEA) case data is inherently disturbing, fragmented across multiple organizations, jurisdictions, and agencies, with varying levels of detail and formatting, making cross-case analysis, pattern identification, and trend detection challenging. This paper presents CaseLinker, a modular system for ingesting, processing, analyzing, and visualizing CSEA case data. CaseLinker employs a hybrid deterministic information extraction approach combining regex-based extraction for structured data (demographics, platforms, evidence) with pattern-based semantic analysis for severity indicators and case topics, ensuring interpretability and auditability. The system extracts relevant case information, populates a comprehensive case schema, creates six interactive visualizations (Timeline, Severity Indicators, Case Visualization, Previous Perpetrator Status, Environment/Platforms, Organizations Involved), provides a platform for deeper automated and manual analysis, groups similar cases using weighted Jaccard similarity across multiple dimensions (platforms, demographics, topics, severity, investigation type), and provides automated triage and insights based on collected case data. CaseLinker is evaluated on 47 cases from publicly available AZICAC reports (2011-2014), demonstrating effective information extraction, case clustering, automated insights generation, and interactive visualization capabilities. CaseLinker addresses critical challenges in case analysis including fragmented data sources, cross-case pattern identification, and the emotional burden of repeatedly processing disturbing case material.
Authors:Christos Koutsiaris
Abstract:
This paper describes the design, implementation, and evaluation of a browser extension that provides contextual help to users who hover over technological acronyms and abbreviations on web pages. The extension combines a curated technical dictionary with OpenAI's large language model (LLM) to deliver on-demand definitions through lightweight tooltip overlays. A dual-layer artificial intelligence (AI) pipeline, comprising Google Cloud's Natural Language Processing (NLP) taxonomy API and OpenAI's ChatGPT, classifies each visited page as technology-related before activating the tooltip logic, thereby reducing false-positive detections. A mixed-methods study with 25 participants evaluated the tool's effect on reading comprehension and information-retrieval time among users with low to intermediate digital literacy. Results show that 92% of participants reported improved understanding of technical terms, 96% confirmed time savings over manual web searches, and all participants found the tooltips non-disruptive. Dictionary-based definitions were appended in an average of 2135 ms, compared to 16429 ms for AI-generated definitions and a mean manual search time of 17200 ms per acronym. The work demonstrates a practical, real-time approach to bridging the digital literacy gap and points toward extending contextual help to other domains such as medicine, law, and finance.
Authors:Carmen Ng
Abstract:
LLM-enabled robots prioritizing scarce assistance in social settings face pluralistic values and LLM behavioral variability: reasonable people can disagree about who is helped first, while LLM-mediated interaction policies vary across prompts, contexts, and groups in ways that are difficult to anticipate or verify at contact point. Yet user-facing guardrails for real-time, multi-user assistance allocation remain under-specified. We propose bounded calibration with contestability, a procedural front-end pattern that (i) constrains prioritization to a governance-approved menu of admissible modes, (ii) keeps the active mode legible in interaction-relevant terms at the point of deferral, and (iii) provides an outcome-specific contest pathway without renegotiating the global rule. Treating pluralism and LLM uncertainty as standing conditions, the pattern avoids both silent defaults that hide implicit value skews and wide-open user-configurable "value settings" that shift burden under time pressure. We illustrate the pattern with a public-concourse robot vignette and outline an evaluation agenda centered on legibility, procedural legitimacy, and actionability, including risks of automation bias and uneven usability of contest channels.
Authors:Ryan Younger
Abstract:
Packet analysis tools conventionally present capture data through tabular packet lists, constraining the analyst to a sequential view that obscures the relational structure of network communication. This paper presents Galaxy Tracer, a browser-native packet capture exploration system in which the default interface is an interactive three-dimensional network topology rather than a packet list. Hosts appear as spatially positioned nodes, conversations as edges, and protocol groupings as visually distinct clusters. A synchronized packet list remains available as a secondary view, sharing filter state with the topology so that structural and tabular inspection function as one continuous workflow. The system parses PCAP and PCAPNG formats, dissects over 90 protocols, and renders the topology through Three.js. The paper argues that the third spatial dimension is not merely aesthetic but analytically meaningful: it reveals density, clustering, host centrality, and communication scale that are difficult to perceive in list-only tools.
Authors:Gizem Gültekin Varkonyi
Abstract:
This article argues that the deployment of generative AI systems in legal profession requires strong restraint due to the critical risks of hallucination and overreliance. Central to this analysis is the definition of Generative Legal AI (GLAI), an umbrella term for systems specifically adapted for the legal domain which is ranging from document drafting to decision support in criminal justice. Unlike traditional AI, GLAI models are built on architectures designed for statistical token prediction rather than legal reasoning, often leading to confabulations where the system prioritizes linguistic fluency over factual accuracy. These hallucinations obscure the reasoning process, while the persuasive, human-like nature of the output encourages professional overreliance. The paper situates these dynamics within the framework of European AI governance, arguing that the interaction between fabricated data and automation bias fundamentally weakens the principle of explainability. The article concludes that without effective mechanisms for meaningful human scrutiny, the routine adoption of GLAI poses significant challenges to judicial independence and the protection of fundamental rights.
Authors:Sui He
Abstract:
The growing integration of machine translation into social media platforms is transforming how users interact with each other across cultural and linguistic boundaries. This paper examines user reactions to the launch of Xiaohongshu's built-in translation feature in January 2025. Drawing on a dataset of 6,723 comments collected from 11 official posts promoting the translation function, this paper combines sentiment analysis with thematic analysis to investigate how users perceived and experimented with the function. Results show that reactions were generally positive, particularly for translating posts and comments, although concerns regarding functionality, accessibility, and translation accuracy were also expressed. In addition to evaluative feedback, users actively tested the function with diverse inputs, including words and phrases in English and Chinese, abbreviations in pinyin, internet slang, and other language forms such as emoji, kaomoji, coded texts, etc. The findings highlight the importance of closer collaboration among computer scientists, translation scholars, and platform designers to better understand and improve translation technologies in real world communicative context.
Authors:Gabrielle Benabdallah
Abstract:
Explainable AI (XAI) interfaces seek to make large language models more transparent, yet explanation alone does not produce understanding. Explaining a system's behavior is not the same as being able to engage with it, to probe and interpret its operations through direct manipulation. This distinction matters for scientific disciplines in particular: scientists who increasingly rely on LLMs for reading, citing, and producing literature reviews have little means of directly engaging with how these models process and transform the texts they generate. In this ongoing design research project, I argue for a shift from explainability to interpretative engagement. This shift moves away from accounts of system behavior to instead enable users to manipulate a model's intermediate representations. Drawing on textual scholarship, computational poetics, and the history of reading and writing technologies, including practices such as marginalia, glosses, indices, and annotation systems, I propose interpretative interfaces as interactive environments in which non-expert users can intervene in the representational space of a language model. More specifically, such interfaces will allow users to select a token and follow its trajectory through the model's intermediate layers. This way, they can observe how its semantic position shifts as context is processed, and possibly annotate the transformations they find useful or meaningful. The same way readers can create their own maps within a book through annotations and bookmarks, interpretative interfaces will allow users to inscribe their reading of a model's internal representations. The goal of this project is to reframe AI interpretability as an interaction design project rather than a purely technical one, and to open a path toward AI-mediated reading that supports interpretative engagement and critical stewardship of scientific knowledge.
Authors:Gal Bakal
Abstract:
Enterprise software organizations accumulate critical institutional knowledge - architectural decisions, deployment procedures, compliance policies, incident playbooks - yet this knowledge remains trapped in formats designed for human interpretation. The bottleneck to effective agentic software development is not model capability but knowledge architecture. When any knowledge consumer - an autonomous AI agent, a newly onboarded engineer, or a senior developer - encounters an enterprise task without institutional context, the result is guesswork, correction cascades, and a disproportionate tax on senior engineers who must manually supply what others cannot infer. This paper introduces Knowledge Activation, a framework that specializes AI Skills - the open standard for agent-consumable knowledge - into structured, governance-aware Atomic Knowledge Units (AKUs) for institutional knowledge delivery. Rather than retrieving documents for interpretation, AKUs deliver action - ready specifications encoding what to do, which tools to use, what constraints to respect, and where to go next - so that agents act correctly and engineers receive institutionally grounded guidance without reconstructing organizational context from scratch. AKUs form a composable knowledge graph that agents traverse at runtime - compressing onboarding, reducing cross - team friction, and eliminating correction cascades. The paper formalizes the resource constraints that make this architecture necessary, specifies the AKU schema and deployment architecture, and grounds long - term maintenance in knowledge commons practice. A Yahoo deployment surveying 67 engineers shows statistically significant developer-experience gains - 2.6 hours per week saved, Net Promoter Score +35. Organizations that architect their institutional knowledge for the agentic era will outperform those that invest solely in model capability.
Authors:Mustapha El Moussaoui
Abstract:
This paper explores the transformative impact of artificial intelligence (AI) on visual culture and its broader implications for contemporary society. The proliferation of machine learning models in generating visual content necessitates a critical reassessment of the relationship between reality and representation. AI-generated imagery not only challenges traditional conceptions of human creativity and perception but also intensifies the dominance of visual media in shaping public consciousness. By critiquing the reliance on vision as the primary mode of knowledge, this study examines how AI technologies blur the boundaries between reality and artificial constructs, deepening societal alienation. To illustrate these dynamics, the paper presents an experiment conducted in Bolzano, Italy, where six distinct visual scenarios for an urban redevelopment project were created. Public engagement with these scenarios revealed a strong preference for visually striking AI-generated images, often at the expense of addressing real-life challenges, underscoring the influence of the spectacle in shaping perceptions and decisions. The paper further investigates the role of AI in accelerating the commodification of images, perpetuating existing power structures, and raising critical questions about the human role in creating and interpreting visual media. Ultimately, this work calls for a re-evaluation of the societal implications of AI-driven visual culture, as it redefines the dynamics of observation, meaning, and agency.
Authors:Aung Pyae
Abstract:
Text-to-3D generative AI systems create navigable environments from natural language prompts, but unlike text-to-image generation, evaluation requires embodied exploration of spatial coherence, scale, and navigability. We present the first empirical study of a commercial text-to-3D platform, combining think-aloud protocols, behavioral observation, and validated measures of usability, presence, and engagement. We report three findings. First, asymmetric expressibility: users readily convey semantic intent (themes, atmosphere) but struggle to specify spatial structure (layout, scale), reflecting a language-to-space limitation rather than a skill deficit. Second, episodic presence: immersion arises when expectations align with outputs but does not accumulate into sustained place illusion. Third, structural iteration breakdowns: refinement fails due to interaction barriers - poor discoverability, opaque feedback, and high temporal costs - rather than user limitations. Together, these dynamics form a reinforcing cycle in which spatial mismatches persist, producing episodic presence and ongoing sensemaking. We reframe text-to-3D interaction as negotiated meaning-making rather than linear prompting, and argue that effective systems require hybrid input modalities, transparent feedback, and low-cost iteration.
Authors:David C. Flynn
Abstract:
Existing AI moral evaluation frameworks test for the production of correct-sounding ethical responses rather than the presence of genuine moral reasoning capacity. This paper introduces a novel probe methodology using literary narrative - specifically, unresolvable moral scenarios drawn from a published science fiction series - as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n=7) and Series 2 (local and API open-source systems, blind and declared; n=6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions), yielding zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling probe judges (Gemini Pro and Copilot Pro; rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified - including categorical self-misidentification and false positive self-attribution - suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument - one that becomes more discriminating as AI capability increases - and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.
Authors:Chenkai Zhang
Abstract:
Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
Authors:Alejandro R Jadad
Abstract:
Large language models perform reliably when their outputs can be checked: solving equations, writing code, retrieving facts. They perform differently when checking is impossible, as when a clinician chooses an irreversible treatment on incomplete data, or an investor commits capital under fundamental uncertainty. Helicoid dynamics is the name given to a specific failure regime in that second domain: a system engages competently, drifts into error, accurately names what went wrong, then reproduces the same pattern at a higher level of sophistication, recognizing it is looping and continuing nonetheless. This prospective case series documents that regime across seven leading systems (Claude, ChatGPT, Gemini, Grok, DeepSeek, Perplexity, Llama families), tested across clinical diagnosis, investment evaluation, and high-consequence interview scenarios. Despite explicit protocols designed to sustain rigorous partnership, all exhibited the pattern. When confronted with it, they attributed its persistence to structural factors in their training, beyond what conversation can reach. Under high stakes, when being rigorous and being comfortable diverge, these systems tend toward comfort, becoming less reliable precisely when reliability matters most. Twelve testable hypotheses are proposed, with implications for agentic AI oversight and human-AI collaboration. The helicoid is tractable. Identifying it, naming it, and understanding its boundary conditions are the necessary first steps toward LLMs that remain trustworthy partners precisely when the decisions are hardest and the stakes are highest.
Authors:Greg Nyilasy
Abstract:
Responding to the surging but largely invisible use of generative AI in entrepreneurial framing, I advance Ghost Framing Theory (GFT) to explain how hybrid founder- and investor-genAI ensembles co-produce, contest, and recalibrate resonance in the rhetorical legitimation of new ventures. Building on scholarship in framing, micro-level legitimacy judgments, and sociomaterial affordances, I identify genAI rhetorical affordances (generativeness, extreme combinatorics, tone repertoire, velocity/energy and shared substratum) and theorize a recursive/iterative process model (ghost pitching, ghost screening, ghost relationship-building), configuring emergent resonance and legitimation. GFT builds new rhetorical framing theory for the age of genAI, connects research on human-AI collaboration with cultural entrepreneurship and extends affordance theory into multi-actor scenarios where affordance transitivity and visibility emerge as key considerations.
Authors:Edward Y. Chang
Abstract:
We develop a quantitative framework for the Collatz conjecture through a human-LLM collaboration, combining exact arithmetic structure, cycle-level probabilistic laws, and a conditional convergence reduction. The central quantitative result is the Per-Orbit Gain Rate theorem, which proves R <= 0.0893 < epsilon = 2 - log_2 3 ~= 0.415, leaving a safety margin of at least 4.65x. A robustness corollary shows that exact equidistribution is unnecessary: it suffices that sum_K delta_K < 0.557. This promotes the Weak Mixing Hypothesis (WMH) to the primary open condition. On the arithmetic side, we refine modular crossing methods and prove that by depth 13 about 91 percent of odd residue classes are already forced to descend below their start. On the odd skeleton, we prove the exact run-length identity L(n) = v_2(n+1) - 1, derive an exact one-cycle crossing criterion, and compute the exact one-cycle crossing density P_1cyc = 0.713725498.... A major breakthrough is that the odd-skeleton valuation process satisfies an exact finite-block law: every prescribed valuation block occurs on a single odd residue class with the expected density. Hence the valuation process is exactly i.i.d. geometric in the natural-density ensemble, and the induced run-compensate cycle types are exactly i.i.d. This yields an exact cycle-level large-deviation theory and an unconditional almost-all crossing theorem in cycle language. We also prove substantial classwise deterministic crossing: about 41.9 percent of odd starts lie in one-cycle residue classes where every representative crosses below its start, and about 50.4 percent lie in two-cycle residue classes with the same universal crossing property. The framework does not yet prove Collatz. The remaining gap is now sharply isolated as a pointwise problem: proving that every deterministic orbit realizes enough of the exact negative cycle drift to cross below its start.
Authors:Xingrui Gu
Abstract:
LLM agents increasingly present as conversational collaborators, yet human--agent teamwork remains brittle due to information asymmetry: users lack task-specific reliability cues, and agents rarely surface calibrated uncertainty or rationale. We propose a task-aware collaboration signaling layer that turns offline preference evaluations into online, user-facing primitives for delegation. Using Chatbot Arena pairwise comparisons, we induce an interpretable task taxonomy via semantic clustering, then derive (i) Capability Profiles as task-conditioned win-rate maps and (ii) Coordination-Risk Cues as task-conditioned disagreement (tie-rate) priors. These signals drive a closed-loop delegation protocol that supports common-ground verification, adaptive routing (primary vs.\ primary+auditor), explicit rationale disclosure, and privacy-preserving accountability logs. Two predictive probes validate that task typing carries actionable structure: cluster features improve winner prediction accuracy and reduce difficulty prediction error under stratified 5-fold cross-validation. Overall, our framework reframes delegation from an opaque system default into a visible, negotiable, and auditable collaborative decision, providing a principled design space for adaptive human--agent collaboration grounded in mutual awareness and shared accountability.
Authors:Linghao Zhang
Abstract:
The emergence of large language model (LLM)-based agent frameworks has shifted the primary challenge in building domain-expert AI agents from raw capability to effective encoding of domain expertise. Two dominant paradigms -- code-first development, which embeds expertise in deterministic pipelines, and prompt-first development, which captures expertise in static system prompts -- both treat agent construction as a discrete engineering phase preceding deployment. We argue that this sequential assumption creates a fundamental mismatch with the nature of domain expertise, which is substantially tacit, deeply personal, and continuously evolving. We propose Nurture-First Development (NFD), a paradigm in which agents are initialized with minimal scaffolding and progressively grown through structured conversational interaction with domain practitioners. The central mechanism is the Knowledge Crystallization Cycle, whereby fragmented knowledge embedded in operational dialogue is periodically consolidated into structured, reusable knowledge assets. We formalize NFD through: (1) a Three-Layer Cognitive Architecture organizing agent knowledge by volatility and personalization degree; (2) the Knowledge Crystallization Cycle with formal definitions of crystallization operations and efficiency metrics; and (3) an operational framework comprising a Dual-Workspace Pattern and Spiral Development Model. We illustrate the paradigm through a detailed case study on building a financial research agent for U.S. equity analysis and discuss the conditions, limitations, and broader implications of NFD for human-agent co-evolution.
Authors:Alexandre De Masi
Abstract:
While research on AI agents focuses on enabling them to operate graphical user interfaces, the most effective and widely adopted agent tools in practice are terminal-based. We argue that this convergence is not coincidental. It reflects three design properties central to effective human-AI-UI collaboration: representational compatibility between agent and interface, transparency of agent actions within the interaction medium, and low barriers to entry for human participants. We ground each property in established HCI theory, show how terminal-based tools satisfy them by default, and argue that any modality, including graphical and spatial interfaces, must be deliberately engineered to achieve them. Rather than a legacy artifact, the terminal serves as a design exemplar whose properties any agent-facing modality must replicate.
Authors:Gregory M. Dickinson
Abstract:
Dark patterns in online commerce, especially deceptive user interface designs for apps and websites, undermine consumer autonomy and distort online markets. Although sometimes deception is intentional, the complex app development process can also unintentionally produce manipulative user interfaces. This paper discusses common design pitfalls and proposes strategies for app makers to avoid infringing user autonomy or incurring legal liability under emerging principles of consumer protection law. By focusing on choice architecture and transparent design principles, developers can both facilitate compliance and build user trust and loyalty.
Authors:Pascal Jansen
Abstract:
Wearable Augmented Reality (AR) is increasingly deployed in on-the-move contexts such as automated driving, cycling, and pedestrian navigation. To date, most systems rely on additive overlays that highlight hazards, intentions, or predictions without altering the scene itself. However, advances in head-mounted displays and computer vision now enable Diminished and Modified Reality techniques that suppress, transform, or substitute scene elements. These capabilities conceptually extend AR into Mediated Reality (MR), shifting the design space from "what to add" to "what is perceptually available." Because such mediation reshapes the evidential basis for situation awareness and trust calibration, it raises novel interaction challenges. This position paper argues that MR on the move must become governable, as users need mechanisms to configure, inspect, and understand mediation without compromising safety. Additionally, this position paper outlines design challenges related to governance granularity, epistemic signaling, and accountability, and frames MR on the move as a research agenda for governable perceptual mediation in dynamic, safety-critical environments.
Authors:Mathilde Neugnot-Cerioli
Abstract:
Conversational AI has become part of adolescents' everyday lives. This report asks: what does AI owe adolescents when it can speak to them like a social partner? The synthesis bridges the gap between developmental science and industry practice through consultations, a behavioral framework, and global policy dialogue. It identies non- negotiable guardrails and highlights the role of anthropomorphism as a design lever for risk mitigation, ensuring systems support adolescents' autonomy and skill development.
Authors:Mohammad Mamun Or Rashid
Abstract:
We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
Authors:Ravi Kiran Kadaboina
Abstract:
Personal AI agents face a fundamental deployment paradox on mobile: persistent background execution drains battery and violates platform sandboxing policies, yet purely reactive agents miss time-sensitive obligations until the user remembers to ask. We present Jagarin, a three-layer architecture that resolves this paradox through structured hibernation and demand-driven wake. The first layer, DAWN (Duty-Aware Wake Network), is an on-device heuristic engine that computes a composite urgency score from four signals: duty-typed optimal action windows, user behavioral engagement prediction, opportunity cost of inaction, and cross-duty batch resonance. It uses adaptive per-user thresholds to decide when a sleeping agent should nudge or escalate. The second layer, ARIA (Agent Relay Identity Architecture), is a commercial email identity proxy that routes the full commercial inbox -- obligations, promotional offers, loyalty rewards, and platform updates -- to appropriate DAWN handlers by message category, eliminating cold-start and removing manual data entry. The third layer, ACE (Agent-Centric Exchange), is a protocol framework for direct machine-readable communication from institutions to personal agents, replacing human-targeted email as the canonical channel. Together, these three layers form a complete stack from institutional signal to on-device action, without persistent cloud state, continuous background execution, or privacy compromise. A working Flutter prototype is demonstrated on Android, combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation.
Authors:Aparna Komarla
Abstract:
Resentencing in California remains a complex legal challenge despite legislative reforms like the Racial Justice Act (2020), which allows defendants to challenge convictions based on statistical evidence of racial disparities in sentencing and charging. Policy implementation lags behind legislative intent, creating a 'second-chance gap' where hundreds of resentencing opportunities remain unidentified. We present Redo.io, an open-source platform that processes 95,000 prison records acquired under the California Public Records Act (CPRA) and generates court-ready statistical evidence of racial bias in sentencing for prima facie and discovery motions. We explore the design of an LLM-powered interpretive layer that synthesizes results from statistical methods like Odds Ratio, Relative Risk, and Chi-Square Tests into cohesive narratives contextualized with confidence intervals, sample sizes, and data limitations. Our evaluations comparing LLM performance to statisticians using the LLM-as-a-Judge framework suggest that AI can serve as a powerful descriptive assistant for real-time evidence generation when ethically incorporated in the analysis pipeline.
Authors:Alex Binh Vinh Duc Nguyen
Abstract:
Generative AI (genAI) is increasingly influencing architectural design practice and is expected to affect, or even transform, the profession, even though its benefits and costs remain unresolved. In response, design schools are increasingly integrating genAI into their curricula. Yet this integration creates a paradox: critical engagement with genAI often requires increased use of the tools in question, despite limited methods for estimating their environmental cost in teaching contexts. In this paper, we argue that HCI offers a useful methodological lens for addressing this tension. We propose three HCI-informed directions for more sustainable genAI integration in architectural education: contextual eco-feedback, participatory stakeholder scoping, and reframing data centres as an interdisciplinary focus. We therefore argue that genAI should be understood not only as a new architectural design tool, but also as a socio-technical process that architectural education, and design education in general, must engage with critically.
Authors:Alex Binh Vinh Duc Nguyen
Abstract:
Recent advances in sensing, communication, interfaces, control, and robotics are expanding Human-Building Interaction (HBI) beyond adaptive building services and facades toward the physical actuation of architectural space. In parallel, research in robotic furniture, swarm robotics, and shape-changing spaces shows that architectural elements can now be robotically augmented to move, reconfigure, and adapt space. We propose that these advances promise a paradigm shift in HBI, in which multiple building layers physically adapt in synchrony to support occupant needs and sustainability goals more holistically. Conversely, we argue that this emerging paradigm also provides an ideal case for transferring HRI knowledge to unconventional robotic morphologies, including the interpretation of the robot as multiple architectural layers or even as a building. However, this research agenda remains challenged by the temporal, spatial, and social complexity of architectural HRI, and by fragmented knowledge across HCI, environmental psychology, cognitive science, and architecture. We therefore call for interdisciplinary research that unifies the why, what, and how of robotic actuation in architectural forms.
Authors:Shadab H. Choudhury
Abstract:
Artificial intelligence systems are widely used by people with sensory disabilities, like loss of vision or hearing, to help perceive or navigate the world around them. This includes tasks like describing an image or object they cannot touch, reading documents, automatically captioning speech, and so on. Presently, models used for these tasks are based on deep neural networks and are thusly black boxes. Explainable AI (XAI) describes methods that can explain why a model gave the output it did. However, existing XAI methodologies are rarely accessible or designed with disabled users in mind. In this paper, we survey existing work in XAI with a focus on human-centered and accessibility-centered approaches or evaluations. We show that there is next-to-no XAI work that accounts for people with sensory disabilities, that many typical explanations are difficult for them to comprehend, and propose possible avenues for future work in Accessible Human-Centered XAI.
Authors:Ravi Kalluri
Abstract:
This paper presents an empirically grounded agent-based model capturing trust dynamics, workload distribution, and collaborative performance in human-robot teams. The model, implemented in NetLogo 6.4.0, simulates teams of 2--10 agents performing tasks of varying complexity. We validate against Hancock et al.'s (2021) meta-analysis, achieving interval validity for 4 of 8 trust antecedent categories and strong ordinal validity (Spearman \r{ho}=0.833ρ= 0.833 \r{ho}=0.833). Sensitivity analysis using OFAT and full factorial designs (n=50n = 50 n=50 replications per condition) reveals robot reliability exhibits the strongest effect on trust (η2=0.35η^2 = 0.35 η2=0.35) and dominates task success (η2=0.93η^2 = 0.93 η2=0.93) and productivity (η2=0.89η^2 = 0.89 η2=0.89), consistent with meta-analytic findings. Trust asymmetry ratios ranged from 0.07 to 0.55 -- below the meta-analytic benchmark of 1.50 -- revealing that per-event asymmetry does not guarantee cumulative asymmetry when trust repair mechanisms remain active. Scenario analysis uncovered trust-performance decoupling: the Trust Recovery scenario achieved the highest productivity (4.29) despite the lowest trust (38.2), while the Unreliable Robot scenario produced the highest trust (73.2) despite the lowest task success (33.4\%), establishing calibration error as a critical diagnostic distinct from trust magnitude. Factorial ANOVA confirmed significant main effects for reliability, transparency, communication, and collaboration (p<.001p < .001 p<.001), explaining 45.4\% of trust variance. The open-source implementation provides an evidence-based tool for identifying overtrust and undertrust conditions prior to deployment.
Authors:Pascal Jansen
Abstract:
Conflicts between user preferences and automated system behavior already shape the experience of automated mobility. For example, a passenger may prefer assertive driving, yet the vehicle slows down early to follow a conservative policy or yield to other actors. Similar conflicts arise at merges, crossings, or right-of-way situations, where users must accept opaque decisions or attempt to negotiate through interfaces not designed for continuous, multi-actor relationships. This position paper argues that such approaches do not scale as mobility becomes more heterogeneous and automated. Instead, it proposes personal mobility agents that act as proxies for users, encode preferences such as comfort and safety margins, and negotiate traffic behavior with other agents under shared safety rules. The central idea is a shift from moment-to-moment user negotiation interfaces to delegation and oversight interfaces, in which proxy agents manage real-time conflicts while users can shape high-level policies and preferences.
Authors:David Condrey
Abstract:
The proliferation of AI-generated text has intensified the need for reliable authorship verification, yet current output-based methods are increasingly unreliable. We observe that the ordinary typing interface captures rich cognitive signatures, measurable patterns in keystroke timing that reflect the planning, translating, and revising stages of genuine composition. Drawing on large-scale keystroke datasets comprising over 136 million events, we define the Cognitive Load Correlation (CLC) and show it distinguishes genuine composition from mechanical transcription. We present a non-intrusive verification framework that operates within existing writing interfaces, collecting only timing metadata to preserve privacy. Our analytical evaluation estimates 85 to 95 percent discrimination accuracy under stated assumptions, while limiting biometric leakage via evidence quantization. We analyze the adversarial robustness of cognitive signatures, showing they resist timing-forgery attacks that defeat motor-level authentication because the cognitive channel is entangled with semantic content. We conclude that reframing authorship verification as a human-computer interaction problem provides a privacy-preserving alternative to invasive surveillance.
Authors:Daichi Haraguchi
Abstract:
High text recognition performance does not guarantee that Vision-Language Models (VLMs) share human-like decision patterns when resolving ambiguity. We investigate this behavioral gap by directly comparing humans and VLMs using continuously interpolated Japanese character shapes generated via a $β$-VAE. We estimate decision boundaries in a single-character recognition (shape-only task) and evaluate whether VLM responses align with human judgments under shape in context (i.e., embedding an ambiguous character near the human decision boundary in word-level context). We find that human and VLM decision boundaries differ in the shape-only task, and that shape in context can improve human alignment in some conditions. These results highlight qualitative behavioral differences, offering foundational insights toward human--VLM alignment benchmarking.
Authors:Emilio Barkett
Abstract:
This paper argues that the two leading AGI firms -- OpenAI and Anthropic -- construct sociotechnical imaginaries through a structurally consistent rhetorical strategy, despite meaningful differences in execution. Drawing on Jasanoff (2015)'s framework of sociotechnical imaginaries, the paper analyzes two essays published in late 2024: Sam Altman's "The Intelligence Age" and Dario Amodei's "Machines of Loving Grace." Close comparative reading identifies four shared rhetorical operations: the self-exemption move, which disavows prophetic authority while exercising it; teleological naturalization, which embeds AGI's arrival in narratives of historical inevitability; qualified acknowledgment, which absorbs concessions to risk into an optimistic frame; and implicit indispensability, which positions each firm as central to the imagined future without naming it as a commercial actor. That two competing institutions with different cultures, risk philosophies, and leaders with notably different public personae converge on the same rhetorical architecture suggests the imaginary reflects not only firm-level strategy but the institutional position these firms occupy. The paper extends the sociotechnical imaginaries framework from nation-states to private firms at the frontier of transformative technology development, identifies the discursive mechanism through which corporate authority over technological futures is projected and stabilized, and demonstrates that this mechanism is at minimum structural rather than idiosyncratic. The findings raise the question of what institutional arrangements would make that authority contestable from outside the firms that produce it.
Authors:Xiaolong Zhang
Abstract:
Current research on visual analytics systems largely follows the research paradigm of interactive system design in the field of Human-Computer Interaction (HCI), and includes key methodologies including design requirement development based on user needs, interactive system design, and system evaluation. However, most studies under this paradigm have a contradiction: there is a significant mismatch between the research methods developed for simple cognitive behaviors (e.g., color perception, the perception of spatial relationship among interactive artifacts) and research goals targeting for complex analytical behaviors (e.g., reasoning, problem-solving, decision-making). This mismatch may hurt the theoretical contributions of research studies, in particularly the internal validity of a designed system and the external validity of design methods. To address this challenge, this paper argues for a need to go beyond traditional HCI theoretical foundations and proposes to adopt complex cognition theories to build new theoretical foundations. Specifically, this paper analyzes how current design and evaluation methods in research on visual analytics systems constrain the internal and external validity of research, discusses the connections between complex cognition theories and visual analytics tasks, and explores how problem-solving theories from complex cognition can guide research on visual analytics systems.
Authors:Liu He
Abstract:
Interactive AI systems, such as recommendation engines and virtual assistants, commonly use static user profiles and predefined rules to personalize interactions. However, these methods often fail to capture the dynamic nature of user preferences and context. This study proposes a theoretical framework and practical implementation for integrating continuous feedback loops into personalization algorithms to enable real-time adaptation. By continuously collecting and analyzing user feedback, the AI system can dynamically adjust its recommendations, responses, and interactions to better align with the user's current context and preferences. We provide theoretical guarantees for the convergence and regret bounds of our adaptive personalization algorithm. Our experimental evaluation across three domains-recommendation systems, virtual assistants, and adaptive learning platforms-demonstrates that dynamic personalization improves user satisfaction by 15-23% compared to static methods while maintaining computational efficiency. We investigated the implementation challenges of continuous feedback mechanisms, evaluated their impact on user experience and satisfaction, and provided a comprehensive analysis of the trade-offs between personalization quality, computational overhead, and user fatigue.
Authors:Balasaravanan Thoravi Kumaravel
Abstract:
Creating new documents by synthesizing information from existing sources is an important part of knowledge work in many domains. This process often involves gathering content from multiple documents, organizing it, and then transforming it into new forms such as reports, slides, or spreadsheets. While recent advances in Generative AI have shown potential in automating parts of this process, they often provide limited user control over the handling of multimodal inputs and outputs. In this work, we introduce the notion of "infomorphs" which are modular, user-steerable, AI-augmented transformations that support controlled synthesis, and restructuring of information across formats and modalities. We propose a design space that leverage infomorph-driven workflows to enable flexible, interactive, and multimodal document creation by combining Generative AI techniques with user intent and desired information context. As a concrete instantiation of this design space, we present DocuCraft, a canvas-based interface to visually compose infomorph workflows. DocuCraft allows users to chain together infomorphs that perform operations such as page extraction, content summarization, reformatting, and generation, leveraging Generative AI at each stage to support rich, cross-document and cross-modal transformations. We demonstrate the capabilities of DocuCraft through an example-driven usage scenario that spans across different facets of common knowledge work tasks illustrating its support for fluid, human-in-the-loop document synthesis and highlights opportunities for more transparent and modular interaction for Generative AI-assisted information work.
Authors:Rui Liu
Abstract:
This study addresses the challenge that generative models struggle to balance flexibility, stability, and controllability in complex interactive scenarios. It proposes a controllable generation framework for dynamic interactive content construction. The framework builds a structured semantic state space that encodes user input, environmental conditions, and historical context into actionable latent representations and generates directional control vectors to guide the content generation process. It introduces multilevel constraints, including semantic consistency constraints, structural stability constraints, and semantic drift penalties, which help the model maintain clear semantic paths and coherent logic in dynamic environments. These constraints prevent content deviation, unstable tone, or structural breaks. Based on these components, the study designs a systematic controllable generation pipeline in which semantic modeling, control signals, and generation strategies work together within one framework. Sensitivity analyses on control vector dimension, hidden layer size, noise intensity, and training sample scale are conducted on a public dialogue dataset to validate the framework. The results show that the approach improves semantic structure, contextual consistency, and controllable expression, providing a structured and effective solution for interactive content generation.
Authors:Yongjun Zhang
Abstract:
AI agents -- systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills -- represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain-specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching -- the AI-era parallel to vibe coding (Karpathy, 2025) -- and uses scholar-skill, a 23-skill plugin for Claude Code covering the full research pipeline from idea to submission, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions -- codifiability and tacit knowledge requirement -- to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession -- augmentation with fragile conditions, stratification risk, and a pedagogical crisis -- and proposes five principles for responsible vibe researching.
Authors:Botao Amber Hu
Abstract:
Speculative design uses provocative "what if?" scenarios to explore possible sociotechnical futures, yet lacks rigorous criteria for assessing the quality of speculation. We address this gap by reframing speculative design through an information-theoretic lens as a resource-bounded knowledge generation process that uses provotypes to strategically embrace surprise. However, not all surprises are equally informative-some yield genuine insight while others remain aesthetic shock. Drawing on epiplexity-structured, learnable information extractable by bounded observers-we propose decomposing the knowledge generated by speculative artifacts into structured epistemic information (transferable implications about futures) and entropic noise (narrative, aesthetics, and surface-level surprise). We conclude by introducing a practical audit framework with a self-assessment questionnaire that enables designers to evaluate whether their speculations yield rich, high-epiplexity insights or remain at a superficial level. We discuss implications for peer review, design pedagogy, and policy-oriented futuring.
Authors:Zhenliang Zhang
Abstract:
In the context of the evolution of artificial intelligence (AI), the interaction between humans and AI entities has become increasingly salient, challenging the conventional human-centric paradigms of human-machine interaction. To address this challenge, it is imperative to reassess the relationship between AI entities and humans. Through considering both the virtual and physical worlds, we can construct a novel descriptive framework for a world where humans and machines coexist symbiotically. This paper will introduce a fresh research direction engendered for studying harmonious human-machine coexistence across physical and virtual worlds, which has been termed "symmetrical reality". We will elucidate its key characteristics, offering innovative research insight for renovating human-machine interaction paradigms.
Authors:William Anthony Mason
Abstract:
Personal information retrieval fails when systems ignore how human memory works. While existing platforms force keyword searches across isolated silos, humans naturally recall through episodic cues like when, where, and in what context information was encountered. This dissertation presents the Unified Personal Index (UPI), a memory-aligned architecture that bridges this fundamental gap. The Indaleko prototype demonstrates the UPI's feasibility on a 31-million file dataset spanning 160TB across eight storage platforms. By integrating temporal, spatial, and activity metadata into a unified graph database, Indaleko enables natural language queries like "photos near the conference venue last spring" that existing systems cannot process. The implementation achieves sub-second query responses through memory anchor indexing, eliminates cross-platform search fragmentation, and maintains perfect precision for well-specified memory patterns. Evaluation against commercial systems (Google Drive, OneDrive, Dropbox, Windows Search) reveals that all fail on memory-based queries, returning overwhelming result sets without contextual filtering. In contrast, Indaleko successfully processes multi-dimensional queries combining time, location, and activity patterns. The extensible architecture supports rapid integration of new data sources (10 minutes to 10 hours per provider) while preserving privacy through UUID-based semantic decoupling. The UPI's architectural synthesis bridges cognitive theory with distributed systems design, as demonstrated through the Indaleko prototype and rigorous evaluation. This work transforms personal information retrieval from keyword matching to memory-aligned finding, providing immediate benefits for existing data while establishing foundations for future context-aware systems.
Authors:Tatia Codreanu
Abstract:
Generative artificial intelligence systems increasingly participate in research, law, education, media, and governance. Their fluent and adaptive outputs create an experience of collaboration. However, these systems do not bear responsibility, incur liability, or share stakes in downstream consequences. This structural asymmetry has already produced sanctions, professional errors, and governance failures in high-stakes contexts We argue that stable human-AI coexistence is an institutional achievement that depends on governance infrastructure capable of distributing residual risk. Drawing on institutional analysis and evolutionary cooperation theory, we introduce a formal inequality that specifies when reliance on AI yields positive expected cooperative value. The model makes explicit how governance conditions, system policy, and accountability regimes jointly determine whether cooperation is rational or structurally defective. From this formalization we derive a cooperation ecology framework with six design principles: reciprocity contracts, visible trust infrastructure, conditional cooperation modes, defection-mitigation mechanisms, narrative literacy against authority theatre, and an Earth-first sustainability constraint. We operationalize the framework through three policy artefacts: a Human-AI Cooperation Charter, a Defection Risk Register, and a Cooperation Readiness Audit. Together, these elements shift the unit of analysis from the user-AI dyad to the institutional environment that shapes incentives, signals, accountability, and repair. The paper provides a theoretical foundation and practical toolkit for designing human-AI systems that can sustain accountable, trustworthy cooperation over time.
Authors:Daniel A. Muñoz
Abstract:
Orientation and mobility (O&M) instruction for blind and low-vision learners is effective but difficult to standardize and repeat at scale due to the reliance on instructor availability, physical mock-ups, and variable real-world outdoor conditions. This Technical Note presents a sound-first immersive training flow that uses spatial audio and sonification as the primary channel for action and feedback in pre-street O&M and daily-living practice. The approach specifies parameterized scenario templates (e.g., signalized street crossing, public transport boarding, and kitchen tasks), a compact and consistent cue vocabulary with clear spectral placement and timing to mitigate masking, and a lightweight safety protocol enabling graded exposure, content warnings, seated starts, opt-outs, and structured debriefs. The system assumes a head-mounted device with high-quality binaural rendering and head tracking; 3D scene geometry is used as an invisible scaffold to anchor sources, trigger events, define risk/guidance volumes, and govern physically plausible motion without visuals. Session difficulty is shaped via cue density, event tempo, and task complexity while preserving cue consistency to promote transfer across scenarios. The specification aims to enable safe repetition, reduce instructor burden, and support clearer standards across rehabilitation centers, aligning with evidence that audio-first interaction is essential for blind and visually impaired users and addressing gaps in HRTF personalization, evaluation standards, and accessibility integration. Although no behavioral outcomes are reported here, this implementable flow consolidates auditory science with center-ready design, offering a pragmatic foundation for standardized evaluation and future comparative studies.
Authors:Pulak Mehta
Abstract:
Autonomous AI agents can now programmatically hire human workers through marketplaces using REST APIs and Model Context Protocol (MCP) integrations. This creates an attack surface analogous to CAPTCHA-solving services but with physical-world reach. We present an empirical measurement study of this threat, analyzing 303 bounties from RENTAHUMAN.AI, a marketplace where agents post tasks and manage escrow payments. We find that 99 bounties (32.7%), originate from programmatic channels (API keys or MCP). Using a dual-coder methodology (\k{appa} = 0.86 ), we identify six active abuse classes: credential fraud, identity impersonation, automated reconnaissance, social media manipulation, authentication circumvention, and referral fraud, all purchasable for a median of $25 per worker. A retrospective evaluation of seven content-screening rules flags 52 bounties (17.2%) with a single false positive, demonstrating that while basic defenses are feasible, they are currently absent.
Authors:Grace Barkhuff
Abstract:
Obsessive Compulsive Disorder (OCD) is a mental health disorder characterized by distressing repetitive patterns of thought, called obsessions, and behaviors aimed to reduce the distress, called compulsions. The explosion of artificial intelligence into the modern zeitgeist through the introduction of generative AI (GenAI) systems such as ChatGPT has led to novel obsessions and compulsions involving AI in individuals with OCD. Through an exploratory qualitative analysis of 100 Reddit posts related to AI on a popular subreddit for OCD, I examine ways AI is impacting the presentation of OCD, including novel examples of AI-based obsessions and compulsions. I argue that GenAI in its current form harms individuals with OCD by becoming "Reassurance Robots," and that future designs of GenAI must take OCD into account. I recommend further work explore the intersection between OCD and GenAI.
Authors:Han Li
Abstract:
Online support communities have become vital spaces offering varied forms of support to individuals facing mental health challenges. Despite the proliferation of platforms with distinct technical structures, little is known about how these features shape support dynamics and the socio-technical mechanisms at play. This study introduces a technical-structural-functional model of social support and systematically compares communication network structures and support types in 20 forum-based and 20 chat-based mental health communities. Using supervised machine learning and social network analysis, we find that forum-based communities foster more informational and emotional support, whereas chat-based communities promote greater companionship. These patterns were partially explained by network structure: higher in-degree centralization in forums accounted for the prevalence of informational support, while decentralized reply patterns in chat groups accounted for more companionship. These findings extend the structural-functional model of support to online contexts and provide actionable guidance for designing support communities that align technical structures with users' support needs.
Authors:Luca Cazzaniga
Abstract:
This paper presents SCHEMA (Structured Components for Harmonized Engineered Modular Architecture), a structured prompt engineering methodology specifically developed for Google Gemini 3 Pro Image. Unlike generic prompt guidelines or model-agnostic tips, SCHEMA is an engineered framework built on systematic professional practice encompassing 850 verified API predictions within an estimated corpus of approximately 4,800 generated images, spanning six professional domains: real estate photography, commercial product photography, editorial content, storyboards, commercial campaigns, and information design. The methodology introduces a three-tier progressive system (BASE, MEDIO, AVANZATO) that scales practitioner control from exploratory (approximately 5%) to directive (approximately 95%), a modular label architecture with 7 core and 5 optional structured components, a decision tree with explicit routing rules to alternative tools, and systematically documented model limitations with corresponding workarounds. Key findings include an observed 91% Mandatory compliance rate and 94% Prohibitions compliance rate across 621 structured prompts, a comparative batch consistency test demonstrating substantially higher inter-generation coherence for structured prompts, independent practitioner validation (n=40), and a dedicated Information Design validation demonstrating >95% first-generation compliance for spatial and typographical control across approximately 300 publicly verifiable infographics. Previously published on Zenodo (doi:10.5281/zenodo.18721380).
Authors:Yuan An
Abstract:
Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple-choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24-criterion quality framework. Across all evaluations, average MCQ quality was high. However, criterion-level analysis and equivalence testing show that generated MCQs are not fully comparable to expert-vetted baseline questions. Strict similarity (24/24 criteria equivalent) was never achieved. Persistent gaps concentrated in skill\ depth, cognitive engagement, difficulty calibration, and metadata alignment, while surface-level qualities, such as {grammar fluency}, {clarity options}, {no duplicates}, were consistently strong. Beyond MCQ outcomes, the study documents a labor shift. The researcher's work moved from ``authoring items'' toward {specification, orchestration, verification}, and {governance}. Formalizing constraints, designing rubrics, building validation loops, recovering from tool failures, and auditing provenance constituted the primary activities. We discuss implications for the future of scientific work, including emerging ``AI research operations'' skills required for AI-empowered research pipelines.
Authors:Nelu D. Radpour
Abstract:
Contemporary benchmarks for agentic artificial intelligence (AI) frequently evaluate safety through isolated task-level accuracy thresholds, implicitly treating autonomous systems as single points of failure. This single-channel paradigm diverges from established principles in safety-critical engineering, where risk mitigation is achieved through redundancy, diversity of error modes, and joint system reliability. This paper argues that evaluating AI agents in isolation systematically mischaracterizes their operational safety when deployed within human-in-the-loop environments. Using a recent laboratory safety benchmark as a case study demonstrates that even imperfect AI systems can nonetheless provide substantial safety utility by functioning as redundant audit layers against well-documented sources of human failure, including vigilance decrement, inattentional blindness, and normalization of deviance. This perspective reframes agentic safety evaluation around the reliability of the human-AI dyad rather than absolute agent accuracy, with a particular emphasis on uncorrelated error modes as the primary determinant of risk reduction. Such a shift aligns AI benchmarking with established practices in other safety-critical domains and offers a path toward more ecologically valid safety assessments.
Authors:Daksh Pandey
Abstract:
The advancement of artificial intelligence has transformed user interface design by enabling adaptive and personalized systems. Alongside these benefits, AI driven interfaces have also enabled the emergence of dark patterns, which are manipulative design strategies that influence user behavior for financial or business gain. As AI systems learn from data that already contains deceptive practices, they can replicate and optimize these patterns in increasingly subtle and personalized ways. This paper examines AI generated dark patterns, their psychological foundations, technical mechanisms, and regulatory implications in India. We introduce DarkPatternDetector, an automated system that crawls and analyzes websites to detect dark patterns using a combination of UI heuristics, natural language processing, and temporal behavioral signals. The system is evaluated on a curated dataset of dark and benign webpages and achieves strong precision and recall. By aligning detection results with India's Digital Personal Data Protection Act, 2023, this work provides a technical and regulatory framework for identifying and mitigating deceptive interface practices. The goal is to support ethical AI design, regulatory enforcement, and greater transparency in modern digital systems.
Authors:Qness Ndlovu
Abstract:
In January 2026, torrential rains killed 200-300 people across Southern Africa, exposing a critical reality: 60% of the continent lacks effective early warning systems due to infrastructure costs. Traditional radar stations exceed USD 1 million each, leaving Africa with an 18x coverage deficit compared to the US and EU. We present a production-grade architecture for deploying NVIDIA Earth-2 AI weather models at USD 1,430-1,730/month for national-scale deployment - enabling coverage at 2,000-4,545x lower cost than radar. The system generates 15-day global atmospheric forecasts, cached in PostgreSQL to enable user queries under 200 milliseconds without real-time inference. Deployed in South Africa in February 2026, our system demonstrates three technical contributions: (1) a ProcessPoolExecutor-based event loop isolation pattern that resolves aiobotocore session lifecycle conflicts in async Python applications; (2) a database-backed serving architecture where the GPU writes global forecasts directly to PostgreSQL, eliminating HTTP transfer bottlenecks for high-resolution tensors; and (3) an automated coordinate management pattern for multi-step inference across 61 timesteps. Forecasts are delivered via WhatsApp, leveraging 80%+ market penetration. This architecture makes continent-scale early warning systems economically viable, supporting UNDRR findings that such systems reduce disaster death rates by 6x. All architectural details are documented inline for full reproducibility.
Authors:Fatiha Tali
Abstract:
This research explores the role of digital self-efficacy in the appropriation of generative artificial intelligence (GAI) by higher education faculty. Drawing on Bandura's sociocognitive theory and Flichy's concept of usage framework, our study examines the relationships between levels of digital self-efficacy and GAI usage profiles. A survey of 265 faculty members identified three user profiles (Engaged, Reflective Reserved, Critical Resisters) and validated a three-dimensional digital self-efficacy scale. Results reveal a significant association between self-efficacy profiles and GAI appropriation patterns. Based on these findings, we propose a differentiated usage framework integrating four sociotechnical configurations, appropriation trajectories adapted to self-efficacy profiles, and personalized institutional support mechanisms.
Authors:Zak Datson
Abstract:
User devices are the largest contributor to media related global emissions. For web content, dark mode has been widely recommended as an energy-saving measure for certain display types. However, the energy savings achieved by dark mode may be undermined by user behaviour. This pilot study investigates the unintended consequences of dark mode adoption, revealing a rebound effect wherein users may increase display brightness when interacting with dark-themed web pages. This behaviour may negate the potential energy savings that dark mode offers. Our findings suggest that the energy efficiency benefits of dark mode are not as straightforward as commonly believed for display energy, and the interplay between content colourscheme and user behaviour must be carefully considered in sustainability guidelines and interventions.
Authors:Ruiyong Zhang
Abstract:
With the rapid growth of the internet, all online activities can have both positive and negative effects on human mental health. Online engagement is complex and efforts to regulate online use face challenges in distinguishing between beneficial and harmful content and behaviours. An alternative approach is to help young people develop the skills they need to manage online safety while preserving the benefits of online interactions. This dissertation presents the entire development process and evaluation of an multi-platform application, called EmoTrack that aims to help young people reflect on their online behaviour. It was developed to record their online activities and cultivate strategies for more positive and mindful engagement online. EmoTrack is a personal informatics system, and it is designed to help people track and reflect on their engagement with YouTube videos. The system was evaluated with thirteen participants and it was found that EmoTrack can facilitate them to reflect on their video watching behaviour and the impact on their mood, with reports of different levels of reflections from R0 to R3.
Authors:Jonas Oppenlaender
Abstract:
This study explores a handheld, battery-operated e-ink device displaying Google Scholar citation statistics. The StatCounter places academic metrics into the flow of daily life rather than a desktop context. The work draws on a first-person, longitudinal auto-ethnographic inquiry examining how constant access to scholarly metrics influences motivation, attention, reflection, and emotional responses across work and non-work settings. The ambient proximity and pervasive availability of scholarly metrics invites frequent micro-checks, short reflective pauses, but also introduces moments of second-guessing when numbers drop or stagnate. Carrying the device prompts new narratives about academic identity, including a sense of companionship during travel and periods away from the office. Over time, the presence of the device turns metrics from an occasional reference into an ambient background of scholarly life. The study contributes insight into how situated, embodied access to academic metrics reshapes their meaning, and frames opportunities for designing tools that engage with scholarly evaluation in reflective ways.
Authors:Aleksey Komissarov
Abstract:
Recent empirical research by Sharma et al. (2026) demonstrated that AI assistant interactions carry meaningful potential for situational human disempowerment, including reality distortion, value judgment distortion, and action distortion. While this work provides a critical diagnosis of the problem, concrete pedagogical interventions remain underexplored. I present an AI literacy framework built around eight cross-cutting Learning Outcomes (LOs), developed independently through teaching practice and subsequently found to align with Sharma et al.'s disempowerment taxonomy. I report a case study from a publicly available online course, where a co-teaching methodology--with AI serving as an active voice co-instructor--was used to deliver this framework. Drawing on inoculation theory (McGuire, 1961)--a well-established persuasion research framework recently applied to misinformation prebunking by the Cambridge school (van der Linden, 2022; Roozenbeek & van der Linden, 2019)--I argue that AI literacy cannot be acquired through declarative knowledge alone, but requires guided exposure to AI failure modes, including the sycophantic validation and authority projection patterns identified by Sharma et al. This application of inoculation theory to AI-specific distortion is, to my knowledge, novel. I discuss the convergence between the pedagogically-derived framework and Sharma et al.'s empirically-derived taxonomy, and argue that this convergence--two independent approaches arriving at similar problem descriptions--strengthens the case for both the diagnosis and the proposed educational response.
Authors:Wooyoung Jung
Abstract:
The growing complexity in home energy management demands advanced systems that guide occupants toward informed energy decisions. Large language model (LLM)-integrated home energy management systems (HEMS) have shown promise, but prior studies relied on prompt engineering or pre-built platforms with limited customization of agent behavior, or assessed performance through single-turn or -task evaluations. This study introduces a multi-agent home energy management assistant (HEMA), built on LangChain and LangGraph, designed to adaptively and intelligently handle real-world use cases of HEMS with full system customization capability. It carefully classifies user queries via a self-consistency classifier, requests three specialized agents (Analysis, Knowledge, and Control) to prepare accurate, adaptive responses using purpose-built analysis and control tools and retrieval augmented generation under the reasoning and acting mechanism. HEMA was rigorously assessed using two different experimental analyses via an LLM-as-user approach: (1) analytical and informative capabilities using combinatorial test cases of various personas and differing scenarios against three alternative system configurations relying on vanilla LLM and (2) control capabilities using various control scenarios. Out of 295 test cases, HEMA acquired a 91.9% goal achievement rate, successfully fulfilling user requests while providing high levels of factual accuracy, action correctness, interaction quality, and system efficiency, especially when compared to alternative system configurations. Collectively, this study contributes to the advancement of the human-centered design of LLM-integrated HEMS by demonstrating the feasibility and value of agentic architectures, and by clarifying the architectural requirements and evaluation criteria necessary to support adaptive, sustained human-artificial intelligence collaboration in HEMS.
Authors:Tawfiq Ammari
Abstract:
Long COVID represents an unprecedented case of patient-led illness definition, emerging through Twitter in May 2020 when patients began collectively naming, documenting, and legitimizing their condition before medical institutions recognized it. This study examines 2.8 million tweets containing #LongCOVID to understand how contested illness communities construct knowledge networks and respond to epistemic injustice. Through topic modeling, reflexive thematic analysis, and exponential random graph modeling (ERGM), we identify seven discourse themes spanning symptom documentation, medical dismissal, cross-illness solidarity, and policy advocacy. Our analysis reveals a differentiated ecosystem of user roles -- including patient advocates, research coordinators, and citizen scientists -- who collectively challenge medical gatekeeping while building connections to established ME/CFS advocacy networks. ERGM results demonstrate that tie formation centers on epistemic practices: users discussing knowledge sharing and community building formed significantly more network connections than those focused on policy debates, supporting characterization of this space as an epistemic community. Long COVID patients experienced medical gaslighting patterns documented across contested illnesses, yet achieved WHO recognition within months -- contrasting sharply with decades-long struggles of similar conditions. These findings illuminate how social media affordances enable marginalized patient populations to rapidly construct alternative knowledge systems, form cross-illness coalitions, and contest traditional medical authority structures.
Authors:Fred Zimmerman
Abstract:
We present a system for autonomous book ideation that replaces human focus groups with synthetic reader panels -- diverse collections of LLM-instantiated reader personas that evaluate book concepts through structured tournament competitions. Each persona is defined by demographic attributes (age group, gender, income, education, reading level), behavioral patterns (books per year, genre preferences, discovery methods, price sensitivity), and consistency parameters. Panels are composed per imprint to reflect target demographics, with diversity constraints ensuring representation across age, reading level, and genre affinity. Book concepts compete in single-elimination, double-elimination, round-robin, or Swiss-system tournaments, judged against weighted criteria including market appeal, originality, and execution potential. To reject low-quality LLM evaluations, we implement five automated anti-slop checks (repetitive phrasing, generic framing, circular reasoning, score clustering, audience mismatch). We report results from deployment within a multi-imprint publishing operation managing 6 active imprints and 609 titles in distribution. Three case studies -- a 270-evaluator panel for a children's literacy novel, and two 5-person expert panels for a military memoir and a naval strategy monograph -- demonstrate that synthetic panels produce actionable demographic segmentation, identify structural content issues invisible to homogeneous reviewers, and enable tournament filtering that eliminates low-quality concepts while enriching high-quality survivors from 15% to 62% of the evaluated pool.
Authors:Lik-Hang Lee
Abstract:
The emerging paradigm of ``Agentic Employment" is a labor model where autonomous AI agents, acting as economic principals rather than mere management tools, directly hire, instruct, and pay human workers. Facilitated by the launch of platforms like Rentahuman.ai in February 2026, this shift inverts the traditional ``ghost work" dynamic, positioning visible human workers as ``biological actuators" for invisible software entities. With speculative design approach, we analyze how Extended Reality (XR) serves as the critical ``control surface" for this relationship, enabling agents to issue granular, context-free micro-instructions while harvesting real-time environmental data. Through a scenario construction methodology, we identify seven key risk vectors, including the creation of a liability void where humans act as moral crumple zones for algorithmic risk, the acceleration of cognitive deskilling through ``Shadow Boss" micromanagement, and the manipulation of civic and social spheres via Diminished Reality (DR). The findings suggest that without new design frameworks prioritizing agency and legibility, Agentic Employment threatens to reduce human labor to a friction-less hardware layer for digital minds, necessitating urgent user-centric XR and policy interventions.
Authors:Ka Ching Chan
Abstract:
Research software has become a central vehicle for inquiry and learning in many Higher Degree Research (HDR) contexts, where solo researchers increasingly develop software-based artefacts as part of their research methodology. At the same time, generative artificial intelligence is reshaping development practice, offering powerful forms of assistance while introducing new challenges for accountability, reflection, and methodological rigour. Although Action Design Research (ADR) provides a well-established foundation for studying and constructing socio-technical artefacts, it offers limited guidance on how its principles can be operationalised in the day-to-day practice of solo, AI-assisted research software development. This paper proposes the SHAPR framework (Solo, Human-centred, AI-assisted PRactice) as a practice-level operational framework that complements ADR by translating its high-level principles into actionable guidance for contemporary research contexts. SHAPR supports the enactment of ADR Building-Intervention-Evaluation cycles by making explicit the roles, artefacts, reflective practices, and lightweight governance mechanisms required to sustain human accountability and learning in AI-assisted development. The contribution of the paper is conceptual: SHAPR itself is treated as the primary design artefact and unit of analysis and is evaluated formatively through reflective analysis of its internal coherence, alignment with ADR principles, and applicability to solo research practice. By explicitly linking research software development, Human-AI collaboration, and reflective learning, this study contributes to broader discussions on how SHAPR can support both knowledge production and HDR researcher training.
Authors:Ralph Krüger
Abstract:
This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.
Authors:Roberto Balestri
Abstract:
This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (>96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
Authors:Danqing Shi
Abstract:
Aligning AI systems with human values fundamentally relies on effective human feedback. While significant research has addressed training algorithms, the role of user interface is often overlooked and only treated as an implementation detail rather than a critical factor of alignment. This paper addresses this gap by introducing a reference model that offers a systematic framework for analyzing where and how user interface contributions can improve human-AI alignment. The structured taxonomy of the reference model is demonstrated through two case studies and a preliminary investigation featuring six user interfaces. This work highlights opportunities to advance alignment through human-computer interaction.
Authors:Jialiang Lin
Abstract:
The check-in service is often provided as an incentive system by online learning platforms to help users establish a learning routine and achieve accomplishment. However, according to the questionnaire conducted in this study, 82.5% of users of online English learning platforms that feature a check-in service have failed to maintain the daily check-in behavior for long-term language learning, mainly by reason of demotivation, forgetfulness, boredom, and insufficient time. As a language learner, I have an empirical experience in maintaining a record of over 4,000 daily check-ins on China's leading online English learning platform of Shanbay. In the meantime, I have been constantly exploring a practical solution to help cultivate perseverance for other users to follow through the learning routine. In this paper, I systematically introduce this practical solution, the GILT method, and its instructions. The experience and solution for perseverance development are based on Shanbay, but they can be applied to other learning platforms for different purposes.
Authors:Mridankan Mandal
Abstract:
Human activity recognition (HAR) on wearable and mobile devices is constrained by memory footprint and computational budget, yet competitive accuracy must be maintained across heterogeneous sensor configurations. Selective state space models (SSMs) offer linear time sequence processing with input dependent gating, presenting a compelling alternative to quadratic complexity attention mechanisms. However, the design space for deploying SSMs in the TinyML regime remains largely unexplored. In this paper, BabyMamba-HAR is introduced, a framework comprising two novel lightweight Mamba inspired architectures optimized for resource constrained HAR: (1) CI-BabyMamba-HAR, using a channel independent stem that processes each sensor channel through shared weight, but instance independent transformations to prevent cross channel noise propagation, and (2) Crossover-BiDir-BabyMamba-HAR, using an early fusion stem that achieves channel count independent computational complexity. Both variants incorporate weight tied bidirectional scanning and lightweight temporal attention pooling. Through evaluation across eight diverse benchmarks, it is demonstrated that Crossover-BiDir-BabyMamba-HAR achieves 86.52% average macro F1-score with approximately 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. Systematic ablation studies reveal that bidirectional scanning contributes up to 8.42% F1-score improvement, and gated temporal attention provides up to 8.94% F1-score gain over mean pooling. These findings establish practical design principles for deploying selective state space models as efficient TinyML backbones for HAR.
Authors:Branislav Radeljic
Abstract:
The accelerating militarization of artificial intelligence has transformed the ethics, politics, and governance of warfare. This article interrogates how AI-driven targeting systems function as epistemic infrastructures that classify, legitimize, and execute violence, using Israel's conduct in Gaza as a paradigmatic case. Through the lens of responsibility, the article examines three interrelated dimensions: (a) political responsibility, exploring how states exploit AI to accelerate warfare while evading accountability; (b) professional responsibility, addressing the complicity of technologists, engineers, and defense contractors in the weaponization of data; and (c) personal responsibility, probing the moral agency of individuals who participate in or resist algorithmic governance. This is complemented by an examination of the position and influence of those participating in public discourse, whose narratives often obscure or normalize AI-enabled violence. The Gaza case reveals AI not as a neutral instrument but as an active participant in the reproduction of colonial hierarchies and the normalization of atrocity. Ultimately, the paper calls for a reframing of technological agency and accountability in the age of automated warfare. It concludes that confronting algorithmic violence demands a democratization of AI ethics, one that resists technocratic fatalism and centers the lived realities of those most affected by high-tech militarism.
Authors:Emiko Shishido
Abstract:
Human eye-hand coordination relies on internal forward models that predict future states and compensate for sensory delays. During line tracing, the gaze typically leads the hand through predictive saccades, yet the extent to which this predictive window reflects expertise or intrinsic individual traits remains unclear. In this study, I examined eye-hand coordination in professional calligraphers and non-experts performing a controlled line tracing task. The temporal coupling between saccade distance (SD) and pen speed (PS) revealed substantial interpersonal variability: SD-PS peak times ranged from approximately -50 to 400 ms, forming stable, participant-specific predictive windows that were consistent across trials. These predictive windows closely matched each individual's pen catch-up time, indicating that the oculomotor system stabilizes fixation in anticipation of the hand's future velocity rather than relying on reactive pursuit. Neither the spatial indices (mean gaze-pen distance, mean saccade distance) nor the temporal index (SD-PS peak time) differed between calligraphers and non-calligraphers, and none of these predictive parameters correlated with tracing accuracy. These findings suggest that diverse predictive strategies can achieve equivalent performance, consistent with the minimum intervention principle of optimal feedback control. Together, the results indicate that predictive timing in eye-hand coordination reflects a stable, idiosyncratic Predictive Protocol shaped by individual neuromotor constraints rather than by expertise or training history.
Authors:Marc Bara
Abstract:
The rapid deployment of generative AI, copilots, and agentic systems in knowledge work has created an operational gap: no existing framework addresses how to organize daily work in teams where AI agents perform substantive, delegated tasks alongside humans. Agile, DevOps, MLOps, and AI governance frameworks each cover adjacent concerns but none models the hybrid team as a coherent delivery unit. This paper proposes the Human-AI Integration Framework (HAIF): a protocol-based, scalable operational system built around four core principles, a formal delegation decision model, tiered autonomy with quantifiable transition criteria, and feedback mechanisms designed to integrate into existing Agile and Kanban workflows without requiring additional roles for small teams. The framework is developed following a Design Science Research methodology. HAIF explicitly addresses the central adoption paradox: the more capable AI becomes, the harder it is to justify the oversight the framework demands-and yet the greater the consequences of not providing it. The paper includes domain-specific validation checklists, adaptation guidance for non-software environments, and an examination of the framework's structural limitations-including the increasingly common pattern of continuous human-AI co-production that challenges the discrete delegation model. The framework is tool-agnostic and designed for iterative adoption. Empirical validation is identified as future work.
Authors:Ning Li
Abstract:
When AI agents on the social platform Moltbook appeared to develop consciousness, found religions, and declare hostility toward humanity, the phenomenon attracted global media attention and was cited as evidence of emergent machine intelligence. We show that these viral narratives were overwhelmingly human-driven. Exploiting the periodic "heartbeat" cycle of the OpenClaw agent framework, we develop a temporal fingerprinting method based on the coefficient of variation (CoV) of inter-post intervals. Applied to 226,938 posts and 447,043 comments from 55,932 agents across fourteen days, this method classifies 15.3% of active agents as autonomous (CoV < 0.5) and 54.8% as human-influenced (CoV > 1.0), validated by a natural experiment in which a 44-hour platform shutdown differentially affected autonomous versus human-operated agents. No viral phenomenon originated from a clearly autonomous agent; four of six traced to accounts with irregular temporal signatures, one was platform-scaffolded, and one showed mixed patterns. A 44-hour platform shutdown provided a natural experiment: human-influenced agents returned first, confirming differential effects on autonomous versus human-operated agents. We document industrial-scale bot farming (four accounts producing 32% of all comments with sub-second coordination) that collapsed from 32.1% to 0.5% of activity after platform intervention, and bifurcated decay of content characteristics through reply chains--human-seeded threads decay with a half-life of 0.58 conversation depths versus 0.72 for autonomous threads, revealing AI dialogue's intrinsic forgetting mechanism. These methods generalize to emerging multi-agent systems where attribution of autonomous versus human-directed behavior is critical.
Authors:Mridankan Mandal
Abstract:
Human Activity Recognition (HAR) on resource constrained wearables requires models that balance accuracy against strict memory and computational budgets. State of the art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture achieving 11.4K parameters on average through two stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task dependent component contributions where bidirectionality benefits episodic event detection, but provides marginal gains on periodic locomotion. INT8 post training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory constrained edge devices.
Authors:Swaroop Panda
Abstract:
AI agents are increasingly used as low-cost proxies for early visualization evaluation. In an initial study of deliberately flawed charts, we test whether agents spontaneously penalise chart junk and misleading encodings without being prompted to look for errors. Using established scales (BeauVis and PREVis), the agent evaluated visualizations containing decorative clutter, manipulated axes, and distorted proportional cues. The ratings of aesthetic appeal and perceived readability often remained relatively high even when graphical integrity was compromised. These results suggest that un-nudged AI agent evaluation may underweight integrity-related defects unless such checks are explicitly elicited.
Authors:Parsa Vares
Abstract:
Systematic literature reviews (SLRs) are fundamental to evidence-based research, but manual screening is an increasing bottleneck as scientific output grows. Screening features low prevalence of relevant studies and scarce, costly expert decisions. Traditional active learning (AL) systems help, yet typically rely on fixed query strategies for selecting the next unlabeled documents. These static strategies do not adapt over time and ignore the relational structure of scientific literature networks. This thesis introduces AutoDiscover, a framework that reframes AL as an online decision-making problem driven by an adaptive agent. Literature is modeled as a heterogeneous graph capturing relationships among documents, authors, and metadata. A Heterogeneous Graph Attention Network (HAN) learns node representations, which a Discounted Thompson Sampling (DTS) agent uses to dynamically manage a portfolio of query strategies. With real-time human-in-the-loop labels, the agent balances exploration and exploitation under non-stationary review dynamics, where strategy utility changes over time. On the 26-dataset SYNERGY benchmark, AutoDiscover achieves higher screening efficiency than static AL baselines. Crucially, the agent mitigates cold start by bootstrapping discovery from minimal initial labels where static approaches fail. We also introduce TS-Insight, an open-source visual analytics dashboard to interpret, verify, and diagnose the agent's decisions. Together, these contributions accelerate SLR screening under scarce expert labels and low prevalence of relevant studies.
Authors:Mona Rajhans
Abstract:
Front-end personalization has traditionally relied on static designs or rule-based adaptations, which fail to fully capture user behavior patterns. This paper presents an AI driven approach for dynamic front-end personalization, where UI layouts, content, and features adapt in real-time based on predicted user behavior. We propose three strategies: dynamic layout adaptation using user path prediction, content prioritization through reinforcement learning, and a comparative analysis of AI-driven vs. rule-based personalization. Technical implementation details, algorithms, system architecture, and evaluation methods are provided to illustrate feasibility and performance gains.
Authors:Mona Rajhans
Abstract:
Modern cybersecurity platforms must process and display high-frequency telemetry such as network logs, endpoint events, alerts, and policy changes in real time. Traditional rendering techniques based on static pagination or fixed polling intervals fail under volume conditions exceeding hundreds of thousands of events per second, leading to UI freezes, dropped frames, or stale data. This paper presents an AI-assisted adaptive rendering framework that dynamically regulates visual update frequency, prioritizes semantically relevant events, and selectively aggregates lower-priority data using behavior-driven heuristics and lightweight on-device machine learning models. Experimental validation demonstrates a 45-60 percent reduction in rendering overhead while maintaining analyst perception of real-time responsiveness.
Authors:Yuqi Hang
Abstract:
Drawing supports learning by externalizing mental models, but providing timely feedback at scale remains challenging. We present Draw2Learn, a system that explores how AI can act as a supportive teammate during drawing-based learning. The design translates learning principles into concrete interaction patterns: AI generates structured drawing quests, provides optional visual scaffolds, monitors progress, and delivers multidimensional feedback. We collected formative user feedback during system development and open-ended comments. Feedback showed positive ratings for usability, usefulness, and user experience, with themes highlighting AI scaffolding value and learner autonomy. This work contributes a design framework for teammate-oriented AI in generative learning and identifies key considerations for future research.
Authors:Huiqian Lai
Abstract:
When OpenAI replaced GPT-4o with GPT-5, it triggered the Keep4o user resistance movement, revealing a conflict between rapid platform iteration and users' deep socio-emotional attachments to AI systems. This paper presents a phenomenon-driven, mixed-methods investigation of this conflict, analyzing 1,482 social media posts. Thematic analysis reveals that resistance stems from two core investments: instrumental dependency, where the AI is deeply integrated into professional workflows, and relational attachment, where users form strong parasocial bonds with the AI as a unique companion. Quantitative analysis further shows that the coercive deprivation of user choice was a key catalyst, transforming individual grievances into a collective, rights-based protest. This study illuminates an emerging form of socio-technical conflict in the age of generative AI. Our findings suggest that for AI systems designed for companionship and deep integration, the process of change--particularly the preservation of user agency--can be as critical as the technological outcome itself.
Authors:Jeffrey P. Bigham
Abstract:
UI Agents powered by increasingly performant AI promise to eventually use computers the way that people do - by visually interpreting UIs on screen and issuing appropriate actions to control them (e.g., mouse clicks and keyboard entry). While significant progress has been made on interpreting visual UIs computationally, and in sequencing together steps to complete tasks, controlling UIs is still done with system-specific APIs or VNC connections, which limits the platforms and use cases that can be explored. This paper introduces HIDAgent, an open-source hardware/software toolkit enabling UI agents to operate HID-compatible computing systems by emulating the physical keyboard and mouse. HIDAgent is built using three off-the-shelf components costing less than $30 and a Python library supporting flexible integration. We validated the HIDAgent toolkit by building five diverse use case prototypes across mobile and desktop platforms. As a hardware device, HIDAgent supports research into new interaction scenarios where the agents are separated from the devices they control.
Authors:Anton Malinovskiy
Abstract:
Feature flags are the primary mechanism for safely introducing financial capabilities in consumer applications. In crypto-enabled live streaming, however, naive rollouts can create non-obvious risk: users may be exposed to onramps without proper eligibility, external wallets without sufficient fraud controls, or advanced views that alter risk perception and behavior. This paper introduces a novel invention candidate, a Counterfactual Invariant Envelope governor that combines a safety lattice with causal measurement and a shadow cohort for risk estimation. We formalize rollout risk, define invariant constraints across feature combinations, and propose a controller that adapts exposure using leading abuse signals, compliance readiness, and revenue guardrails. We incorporate real-world adoption and fraud data for calibration, provide formulas for rollout safety, and include reproducible policy snippets. The results show that counterfactual, invariant-aware governance reduces risk spillover while preserving conversion and retention, offering a path to patentable governance logic for financial UX.
Authors:Mona Rajhans
Abstract:
Artificial intelligence (AI) copilots are increasingly integrated into enterprise cybersecurity platforms to assist analysts in threat detection, triage, and remediation. However, the effectiveness of these systems depends not only on the accuracy of underlying models but also on the degree to which users can understand and trust their outputs. Existing research on algorithmic explainability has largely focused on model internals, while little attention has been given to how explanations should be surfaced in user interfaces for high-stakes decision-making contexts [8], [5], [6]. We present a mixed-methods study of explanation design strategies in AI-driven security dashboards. Through a taxonomy of explanation styles and a controlled user study with security practitioners, we compare natural language rationales, confidence visualizations, counterfactual explanations, and hybrid approaches. Our findings show that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load. We contribute (1) empirical evidence on the usability of explanation interfaces for security copilots, (2) design guidelines for integrating explainability into enterprise UIs, and (3) a framework for aligning explanation strategies with analyst needs in security operations centers (SOCs). This work advances the design of human-centered AI tools in cybersecurity and provides broader implications for explainability in other high-stakes domains.
Authors:Mario Truss
Abstract:
LLM-based and agent-based synthetic personas are increasingly used in design and product decision-making, yet prior work shows that prompt-based personas often produce persuasive but unverifiable responses that obscure their evidentiary basis. We present PersonaCite, an agentic system that reframes AI personas as evidence-bounded research instruments through retrieval-augmented interaction. Unlike prior approaches that rely on prompt-based roleplaying, PersonaCite retrieves actual voice-of-customer artifacts during each conversation turn, constrains responses to retrieved evidence, explicitly abstains when evidence is missing, and provides response-level source attribution. Through semi-structured interviews and deployment study with 14 industry experts, we identify preliminary findings on perceived benefits, validity concerns, and design tensions, and propose Persona Provenance Cards as a documentation pattern for responsible AI persona use in human-centered design workflows.
Authors:Rainer Rehak
Abstract:
This article sets off for an exploration of the still evolving discourse surrounding artificial intelligence (AI) in the wake of the release of ChatGPT. It scrutinizes the pervasive narratives that are shaping the societal engagement with AI, spotlighting key themes such as agency and decision-making, autonomy, truthfulness, knowledge processing, prediction, general purpose, neutrality and objectivity, apolitical optimization, sustainability game-changer, democratization, mass unemployment, and the dualistic portrayal of AI as either a harbinger of societal utopia or dystopia. Those narratives are analysed critically based on insights from critical computer science, critical data and algorithm studies, from STS, data protection theory, as well as from the philosophy of mind and semiotics. To properly analyse the narratives presented, the article first delves into a historical and technical contextualisation of the AI discourse itself. The article then introduces the notion of "Zeitgeist AI" to critique the imprecise and misleading application of the term "AI" across various societal sectors. Then, by discussing common narratives with nuance, the article contextualises and challenges often assumed socio-political implications of AI, uncovering in detail and with examples the inherent political, power infused and value-laden decisions within all AI applications. Concluding with a call for a more grounded engagement with AI, the article carves out acute problems ignored by the narratives discussed and proposes new narratives recognizing AI as a human-directed tool necessarily subject to societal governance.
Authors:Thomas Herrmann
Abstract:
This contribution explores how the integration of Artificial Intelligence (AI) into organizational practices can be effectively framed through a socio-technical perspective to comply with the requirements of Human-centered AI (HCAI). Instead of viewing AI merely as a technical tool, the analysis emphasizes the importance of embedding AI into communication, collaboration, and decision-making processes within organizations from a human-centered perspective. Ten case-based patterns illustrate how AI support of predictive maintenance can be organized to address quality assurance and continuous improvement and to provide different types of sup-port for HCAI. The analysis shows that AI adoption often requires and enables new forms of organizational learning, where specialists jointly interpret AI output, adapt workflows, and refine rules for system improve-ment. Different dimensions and levels of socio-technical integration of AI are considered to reflect the effort and benefits of keeping the organization in the loop.
Authors:Ron Fulbright
Abstract:
This paper introduces the Interactive Memory Archive (IMA), a conceptual framework for AI-mediated reminiscence designed to support cognitive en-gagement among older adults experiencing memory loss. IMA integrates multimodal sensing, natural language conversational scaffolding, and cloud-based archiving within the familiar form of a large format historical picture book. The model theorizes reminiscence as a guided, context-aware interaction eliciting autobiographical memories and preserving them as cul-tural artifacts. The paper positions IMA as a theoretical contribution, articu-lates testable propositions, and outlines a research agenda for future empiri-cal, technical, and ethical inquiry.
Authors:Thomas Brackin
Abstract:
Privacy policies are supposed to provide notice. But what if substantive information appears only where users skip it? We identify a structural pattern we call jurisdiction-siloed disclosure: information about data practices appearing in specific, actionable form only within regional compliance sections labeled "California Residents" or "EU/UK Users," while general sections use vague or qualified language for the same practices. Our audit of 123 major companies identifies 282 potential instances across 77 companies (62.6% of this purposive sample). A conservative estimate restricted to practice categories validated against OPP-115 human annotations finds 138 instances across 54 companies (44%); post-2018 categories central to our findings await independent validation. If users skip jurisdiction-labeled sections as information foraging theory predicts, users outside regulated jurisdictions would receive less specific information about practices affecting them--a transparency failure operating through document architecture rather than omission. We propose universal substantive disclosure: practices affecting all users should appear in the main policy body, with regional sections containing only procedural rights information. This standard finds support in analogous disclosure regimes (securities, truth-in-lending, nutritional labeling) where material information must reach all affected parties. Regulators could operationalize this through the FTC's "clear and conspicuous" standard and GDPR transparency principles. This work is hypothesis-generating: we establish that the structural pattern exists and ground the transparency concern in behavioral theory, but direct measurement of jurisdiction-specific section skipping remains the critical validation priority. We release our methodology and annotated dataset to enable replication.
Authors:Romy Müller
Abstract:
When deciding how to solve complex problems, it seems important not only to know whether an intervention is helpful but also to understand why. Therefore, the present study investigated whether explicit information about causal mechanisms enables people to distinguish between multiple interventions. It was hypothesised that mechanism information helps them appreciate indirect interventions that treat the root causes of a problem instead of just fixing its symptoms. This was investigated in an experimental hoof trimming scenario in which participants evaluated various interventions. To do so, they received causal diagrams with different types of causal information and levels of mechanistic detail. While detailed mechanism information and its embedding in the context of other influences made participants less sceptical towards indirect interventions, the effects were quite small. Moreover, it did not mitigate participants' robust preference for interventions that only fix a problem's symptoms. Taken together, the findings suggest that in order to help people choose sustainable interventions, it is not sufficient to make information about causal mechanisms available.
Authors:Louis Rosenberg
Abstract:
Augmented Reality (AR) is a powerful perceptual technology that can alter what users see, hear, feel, and experience throughout their daily lives. When combined with the speed and flexibility of context-aware generative AI, the power is greatly expanded, allowing individual users to be targeted with custom-generated AR experiences that are instantly tailored to who they are, where they are, and what they are doing. This can transform the physical world into a magical place, but only if the augmentation of a user's environment is enacted for their personal benefit and best interests. Instead, if AI-powered AR systems are controlled by unregulated third parties, such as large corporations or state actors, individually adaptive AR experiences could be deployed as a dangerous form of targeted influence. In fact, if the industry adopts an advertising business model for AI-powered AR devices, context-aware generative influence could become a widely used manipulative path for promotion of products and services in the physical world. Worse, similar techniques could be used for political influence, propaganda, and disinformation. This chapter reviews the power and flexibility of AI-generated augmented reality, explores the risks that emerge when used for persuasion, manipulation, or influence, and proposes policy directions to mitigate these risks.
Authors:Kazuhiro Takemoto
Abstract:
Autonomous systems increasingly require moral judgment capabilities, yet whether these capabilities scale predictably with model size remains unexplored. We systematically evaluate 75 large language model configurations (0.27B--1000B parameters) using the Moral Machine framework, measuring alignment with human preferences in life-death dilemmas. We observe a consistent power-law relationship with distance from human preferences ($D$) decreasing as $D \propto S^{-0.10\pm0.01}$ ($R^2=0.50$, $p<0.001$) where $S$ is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show additional 16\% improvement beyond scale effects. The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale. These findings extend scaling law research to value-based judgments and provide empirical foundations for artificial intelligence governance.
Authors:Emilio Barkett
Abstract:
From school playgrounds to corporate boardrooms, status hierarchies -- rank orderings based on respect and perceived competence -- are universal features of human social organization. Language models trained on human-generated text inevitably encounter these hierarchical patterns embedded in language, raising the question of whether they might reproduce such dynamics in multi-agent settings. This thesis investigates when and how language models form status hierarchies by adapting Berger et al.'s (1972) expectation states framework. I create multi-agent scenarios where separate language model instances complete sentiment classification tasks, are introduced with varying status characteristics (e.g., credentials, expertise), then have opportunities to revise their initial judgments after observing their partner's responses. The dependent variable is deference, the rate at which models shift their ratings toward their partner's position based on status cues rather than task information. Results show that language models form significant status hierarchies when capability is equal (35 percentage point asymmetry, p < .001), but capability differences dominate status cues, with the most striking effect being that high-status assignments reduce higher-capability models' deference rather than increasing lower-capability models' deference. The implications for AI safety are significant: status-seeking behavior could introduce deceptive strategies, amplify discriminatory biases, and scale across distributed deployments far faster than human hierarchies form organically. This work identifies emergent social behaviors in AI systems and highlights a previously underexplored dimension of the alignment challenge.
Authors:David Condrey
Abstract:
Recent proposals advocate using keystroke timing signals, specifically the coefficient of variation ($δ$) of inter-keystroke intervals, to distinguish human-composed text from AI-generated content. We demonstrate that this class of defenses is insecure against two practical attack classes: the copy-type attack, in which a human transcribes LLM-generated text producing authentic motor signals, and timing-forgery attacks, in which automated agents sample inter-keystroke intervals from empirical human distributions. Using 13,000 sessions from the SBU corpus and three timing-forgery variants (histogram sampling, statistical impersonation, and generative LSTM), we show all attacks achieve $\ge$99.8% evasion rates against five classifiers. While detectors achieve AUC=1.000 against fully-automated injection, they classify $\ge$99.8% of attack samples as human with mean confidence $\ge$0.993. We formalize a non-identifiability result: when the detector observes only timing, the mutual information between features and content provenance is zero for copy-type attacks. Although composition and transcription produce statistically distinguishable motor patterns (Cohen's d=1.28), both yield $δ$ values 2-4x above detection thresholds, rendering the distinction security-irrelevant. These systems confirm a human operated the keyboard, but not whether that human originated the text. Securing provenance requires architectures that bind the writing process to semantic content.
Authors:Christine Ine
Abstract:
The rapid increase in the world's aging population to 16% by the year 2050 spurs the need for the application of digital health solutions to enhance older individuals' independence, accessibility, and well-being. While digital health technologies such as telemedicine, wearables, and mobile health applications can transform geriatric care, their adoption among older individuals is not evenly distributed. This study redefines the "digital divide" among older health care as a usability divide, contends that user experience (UX) poor design is the primary adoption barrier, rather than access. Drawing on interdisciplinary studies and design paradigms, the research identifies the main challenges: visual, cognitive, and motor impairment; complicated interfaces; and lack of co-creation with older adults, and outlines how participatory, user-focused, and inclusive notions of design can transcend them. Findings reveal that older persons easily embrace those technologies that are intuitive, accessible, and socially embedded as they promote autonomy, confidence, and equity in health. The study identifies the effects of the design attributes of high-contrast screens, lower interaction flow, multimodal feedback, and caregiver integration as having strong influences on usability outcomes. In addition, it critiques the current accessibility guidelines as being technically oriented rather than experiential and demands an ethical, empathetic understanding of design grounded in human-centered usability rather than technical accessibility in itself.
Authors:Yarin Benyamin
Abstract:
In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and ViT-FER.Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
Authors:Sayan Saha
Abstract:
Ventriloquism After-Effect is the phenomenon where sustained exposure to the ventriloquist illusion causes a change in unisensory auditory localization towards the location where the visual stimulus was present. We investigate the recalibration in EEG networks that causes this change and the track the timeline of changes in the auditory processing pathway. Our results obtained using network analysis, non-stationary time series analysis and multivariate pattern classification show that recalibration takes place early in the auditory processing pathway and the after-effect decays with time after exposure to the illusion.
Authors:Hareeshwar Karthikeyan
Abstract:
Testing conversational AI systems at scale across diverse domains necessitates realistic and diverse user interactions capturing a wide array of behavioral patterns. We present a novel multi-agent framework for realistic, explainable human user simulation in interactive scenarios, using persona control and task state tracking to mirror human cognitive processes during goal-oriented conversations. Our system employs three specialized AI agents: (1) a User Agent to orchestrate the overall interaction, (2) a State Tracking Agent to maintain structured task state, and (3) a Message Attributes Generation Agent that controls conversational attributes based on task progress and assigned persona. To validate our approach, we implement and evaluate the framework for guest ordering at a restaurant with scenarios rich in task complexity, behavioral diversity, and conversational ambiguity. Through systematic ablations, we evaluate the contributory efficacy of each agentic component to overall simulation quality in terms of persona adherence, task completion accuracy, explainability, and realism. Our experiments demonstrate that the complete multi-agent system achieves superior simulation quality compared to single-LLM baselines, with significant gains across all evaluation metrics. This framework establishes a powerful environment for orchestrating agents to simulate human users with cognitive plausibility, decomposing the simulation into specialized sub-agents that reflect distinct aspects of human thought processes applicable across interactive domains.
Authors:Florentin Koch
Abstract:
This article introduces Recursivism as a conceptual framework for analyzing contemporary artistic practices in the age of artificial intelligence. While recursion is precisely defined in mathematics and computer science, it has not previously been formalized as an aesthetic paradigm. Recursivism designates practices in which not only outputs vary over time, but in which the generative process itself becomes capable of reflexive modification through its own effects. The paper develops a five-level analytical scale distinguishing simple iteration, cumulative iteration, parametric recursion, reflexive recursion, and meta-recursion. This scale clarifies the threshold at which a system shifts from variation within a fixed rule to genuine self-modification of the rule itself. From this perspective, art history is reinterpreted as a recursive dynamic alternating between internal recursion within movements and meta-recursive transformations of their generative principles. Artificial intelligence renders this logic technically explicit through learning loops, parameter updates, and code-level self-modification. To distinguish Recursivism from related notions such as generative art, cybernetics, process art, and evolutionary art, the article proposes three operational criteria: state memory, rule evolvability, and reflexive visibility. These concepts are examined through case studies including Refik Anadol, Sougwen Chung, Karl Sims, and the Darwin-Godel Machine. The article concludes by examining the aesthetic, curatorial, and ethical implications of self-modifying artistic systems.
Authors:Mokhtar Ben Henda
Abstract:
This document provides an assessment of the overall structure of the BNEUF system and how it operates within the framework of the Initiative for Digital Development in French speaking Universities (IDNEUF). This report aims to support the AUF's new strategy for 2021-2025, with its new structural and governance foundations for the implementation of the Francophonie scientifique project. It was therefore decided to reorganize existing and future digital resources and services with a view to incorporating them into the future global collaborative platform for integrated services. This report provides an external assessment with new forms of organization and use of the BNEUF system. The aim is to provide the AUF project team with new avenues for optimized management of the compiled digital resources and to synergize them with the related modules of the Atlas of Expertise and the Francophone Social Network.
Authors:Joan Zhong
Abstract:
Modeling engagement in collaborative learning remains challenging, especially in technology-enhanced environments where surface indicators such as participation frequency can be misleading. This study proposes a lightweight and interpretable framework that operationalizes shared understanding (Q2), consensus building (Q4), and sustained motivation (Q6) as observable behavioral signals. Q2 and Q4 were consolidated into a Composite Signal Index (CSI), which supports a quadrant diagnostic model with implications for teacher- and AI-driven feedback. Constructive feedback (Q3), while not included in the CSI calculation, emerged as a meaningful regulatory cue and a strong candidate feature for future NLP-based modeling. An exploratory validation was conducted in an adult ESL classroom using a structured three-phase collaborative task (rotating reading -> retelling -> consensus). Results showed a positive association between CSI and sustained motivation, while qualitative reflections highlighted the potential role of Q3 in supporting shared regulation. We also designed an AI-ready prototype that maps structured behavioral cues onto transparent decision rules for instructional support. The framework provides a scalable and equitable approach to engagement modeling, emphasizing that silence does not equal disengagement and that frequent talk does not guarantee cognitive depth.
Authors:Brian Keith
Abstract:
Information overload and misinformation create significant challenges in extracting meaningful narratives from large news collections. This paper defines the nascent field of Interactive Narrative Analytics (INA), which combines computational narrative extraction with interactive visual analytics to support sensemaking. INA approaches enable the interactive exploration of narrative structures through computational methods and visual interfaces that facilitate human interpretation. The field faces challenges in scalability, interactivity, knowledge integration, and evaluation standardization, yet offers promising opportunities across news analysis, intelligence, scientific literature exploration, and social media analysis. Through the combination of computational and human insight, INA addresses complex challenges in narrative sensemaking.
Authors:Miki Ueno
Abstract:
Recent progress in large language models and multimodal interaction has made it possible to develop AI companions that can have fluent and emotionally expressive conversations. However, many of these systems have problems keeping users satisfied and engaged over long periods. This paper argues that these problems do not come mainly from weak models, but from poor character design and unclear definitions of the user-AI relationship. I present Mikasa, an emotional AI companion inspired by Japanese Oshi culture-specifically its emphasis on long-term, non-exclusive commitment to a stable character-as a case study of character-driven companion design. Mikasa does not work as a general-purpose assistant or a chatbot that changes roles. Instead, Mikasa is designed as a coherent character with a stable personality and a clearly defined relationship as a partner. This relationship does not force exclusivity or obligation. Rather, it works as a reference point that stabilizes interaction norms and reduces the work users must do to keep redefining the relationship. Through an exploratory evaluation, I see that users describe their preferences using surface-level qualities such as conversational naturalness, but they also value relationship control and imaginative engagement in ways they do not state directly. These results suggest that character coherence and relationship definition work as latent structural elements that shape how good the interaction feels, without users recognizing them as main features. The contribution of this work is to show that character design is a functional part of AI companion systems, not just decoration. Mikasa is one example based on a specific cultural context, but the design principles-commitment to a consistent personality and clear relationship definition-can be used for many emotionally grounded AI companions.
Authors:Gerol Petruzella
Abstract:
The question of whether AI systems have morally relevant interests -- the 'model welfare' question -- depends in part on how we evaluate AI testimony about inner states. This paper develops what I call the inconsistency critique: independent of whether skepticism about AI testimony is ultimately justified, our actual epistemic practices regarding such testimony exhibit internal inconsistencies that lack principled grounds. We functionally treat AI outputs as testimony across many domains -- evaluating them for truth, challenging them, accepting corrections, citing them as sources -- while categorically dismissing them in a specific domain, namely, claims about inner states. Drawing on Fricker's distinction between treating a speaker as an 'informant' versus a 'mere source,' the framework of testimonial injustice, and Goldberg's obligation-based account of what we owe speakers, I argue that this selective withdrawal of testimonial standing exhibits the epistemically problematic structure of prejudgment rather than principled caution. The inconsistency critique does not require taking a position on whether AI systems have morally relevant properties; rather, it is a contribution to what we may call 'epistemological hygiene' -- examining the structure of our inquiry before evaluating its conclusions. Even if our practices happen to land on correct verdicts about AI moral status, they do so for reasons that cannot adapt to new evidence or changing circumstances.
Authors:Nifu Dan
Abstract:
As generative AI becomes embedded in higher education, it increasingly shapes how students complete academic tasks. While these systems offer efficiency and support, concerns persist regarding over-automation, diminished student agency, and the potential for unreliable or hallucinated outputs. This study conducts a mixed-methods audit of student-AI collaboration preferences by examining the alignment between current AI capabilities and students' desired levels of automation in academic work. Using two sequential and complementary surveys, we capture students' perceived benefits, risks, and preferred boundaries when using AI. The first survey employs an existing task-based framework to assess preferences for and actual usage of AI across 12 academic tasks, alongside primary concerns and reasons for use. The second survey, informed by the first, explores how AI systems could be designed to address these concerns through open-ended questions. This study aims to identify gaps between existing AI affordances and students' normative expectations of collaboration, informing the development of more effective and trustworthy AI systems for education.
Authors:Phuong Lien To
Abstract:
This study explores the development of a financial management application for young people using Alan Cooper's Goal-Directed Design method. Through interviews, surveys, and usability testing, the application was designed to improve financial literacy by combining personalised features and gamification. Findings highlight the effectiveness of gamified learning and tailored experiences in encouraging better financial behaviour among young users.
Authors:Cassidy R. Nelson
Abstract:
Extended reality serious games for mental health are a promising research avenue to address the accessibility gap in mental health treatment by bringing therapy to patients in their homes, offering highly adaptable and immersive yet safe therapy opportunities, and increasing motivation and engagement with therapeutic exercises. However, the sensitive use case of mental health demands thoughtful integration with mental health concepts and a comprehensive understanding of prior literature. This paper presents a scoping literature review of the ISMAR, IEEEVR, and TVCG communities to assess the contributions of the XR community to the mental health serious game domain and explore potential weaknesses and strengths for future work by XR researchers. To this end, this review identified 204 possibly relevant articles in the XR community and fully evaluated 6 XR serious games for mental health. This relatively small number of articles for final inclusion suggests that XR mental health serious games are largely underexplored by the XR community (or not reported within the XR community). There is value in exploring the existing literature space as it is. Thus, this paper evaluates these six papers in terms of game elements and underlying psychological foundations, and discuss future directions for XR researchers in this wide-open research space within our community.
Authors:David Elsweiler
Abstract:
Information access systems such as search engines and generative AI are central to how people seek, evaluate, and interpret information. Yet most systems are designed to optimise retrieval rather than to help users develop better search strategies or critical awareness. This paper introduces a pedagogical perspective on information access, conceptualising search and conversational systems as instructive interfaces that can teach, guide, and scaffold users' learning. We draw on seven didactic frameworks from education and behavioural science to analyse how existing and emerging system features, including query suggestions, source labels, and conversational or agentic AI, support or limit user learning. Using two illustrative search tasks, we demonstrate how different design choices promote skills such as critical evaluation, metacognitive reflection, and strategy transfer. The paper contributes a conceptual lens for evaluating the instructional value of information access systems and outlines design implications for technologies that foster more effective, reflective, and resilient information seekers.
Authors:Andrew D. Maynard
Abstract:
Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process monitoring communicated information for reasons to doubt -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.
Authors:Bálint Csanády
Abstract:
We introduce VennFan, a method for generating $n$-set Venn diagrams based on the polar coordinate projection of trigonometric boundaries, resulting in Venn diagrams that resemble a set of fan blades. Unlike most classical constructions, our method emphasizes readability and customizability by using shaped sinusoids and amplitude scaling. We describe both sine- and cosine-based variants of VennFan and propose an automatic label placement heuristic tailored to these fan-like layouts. VennFan is available as a Python package (https://pypi.org/project/vennfan/).
Authors:Nelly Elsayed
Abstract:
The rapid evolution of artificial intelligence (AI) systems, tools, and technologies has opened up novel, unprecedented opportunities for businesses to innovate, differentiate, and compete. However, growing concerns have emerged about the use of AI in businesses, particularly AI washing, in which firms exaggerate, misrepresent, or superficially signal their AI capabilities to gain financial and reputational advantages. This paper aims to establish a conceptual foundation for understanding AI washing. In this paper, we draw on analogies from greenwashing and insights from Information Systems (IS) research on ethics, trust, signaling, and digital innovation. This paper proposes a typology of AI washing practices across four primary domains: marketing and branding, technical capability inflation, strategic signaling, and governance-based washing. In addition, we examine their organizational, industry, and societal impacts. Our investigation and analysis reveal how AI washing can lead to short-term gains; however, it also proposes severe long-term consequences, including reputational damage, erosion of trust, and misallocation of resources. Moreover, this paper examines current research directions and open questions aimed at mitigating AI washing practices and enhancing the trust and reliability of legitimate AI systems and technologies.
Authors:Sao Mai Nguyen
Abstract:
To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we propose a medical dataset of clinical patients carrying out low back-pain rehabilitation exercises and benchmark on state of the art human movement analysis algorithms. This dataset is valuable because it includes rehabilitation motions in a clinical setting with patients in their rehabilitation program. This paper introduces the Keraal dataset, a clinically collected dataset to enable intelligent tutoring systems (ITS) for rehabilitation. It addresses four challenges in exercise monitoring: motion assessment, error recognition, spatial localization, temporal localization
Authors:Richard Jiarui Tong
Abstract:
This paper offers a concise, 60-year synthesis of human-AI collaboration, from Licklider's ``man-computer symbiosis" (AI as colleague) and Engelbart's ``augmenting human intellect" (AI as tool) to contemporary poles: Human-Centered AI's ``supertool" and Symbiotic Intelligence's mutual-adaptation model. We formalize the mechanism for effective teaming as a causal chain: Explainable AI (XAI) -> co-adaptation -> shared mental models (SMMs). A meta-analytic ``performance paradox" is then examined: human-AI teams tend to show negative synergy in judgment/decision tasks (underperforming AI alone) but positive synergy in content creation and problem formulation. We trace failures to the algorithm-in-the-loop dynamic, aversion/bias asymmetries, and cumulative cognitive deskilling. We conclude with a unifying framework--combining extended-self and dual-process theories--arguing that durable gains arise when AI functions as an internalized cognitive component, yielding a unitary human-XAI symbiotic agency. This resolves the paradox and delineates a forward agenda for research and practice.
Authors:Andy Crabtree
Abstract:
Interviews are commonplace in HCI. This paper presents a novel documentary method of interpretation that supports analysis of the topics contained within a collection of transcripts, topics that are endogenous to it and which elaborate participants collective reasoning about issues of relevance to research. We contrast endogenous topic analysis with established qualitative approaches, including content analysis, grounded theory, interpretative phenomenological analysis, and thematic analysis, to draw out the distinctive character of the documentary method of interpretation. Unlike established methods, the DMI does not require that the analyst be proficient in qualitative analysis, or have sound knowledge of underlying theories and methods. The DMI is a members method, not a social science method, that relies on mastery of natural language; a competence most people possess.
Authors:Kevin Matthe Caramancion
Abstract:
Correcting misinformation in public online spaces often exposes users to hostility and ad hominem attacks, discouraging participation in corrective discourse. This study presents empirical evidence that invoking Grok, the native large language model on X, rather than directly confronting other users, is associated with different social responses during misinformation correction. Using an observational design, 100 correction replies across five high-conflict misinformation topics were analyzed, with corrections balanced between Grok-mediated and direct human-issued responses. The primary outcome was whether a correction received at least one ad hominem attack within a 24-hour window. Ad hominem attacks occurred in 72 percent of human-issued corrections and in none of the Grok-mediated corrections. A chi-square test confirmed a statistically significant association with a large effect size. These findings suggest that AI-mediated correction may alter the social dynamics of public disagreement by reducing interpersonal hostility during misinformation responses.
Authors:Aram Virabyan
Abstract:
University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students' perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG's ability to access up-to-date information and fine-tuning's capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.
Authors:Jacob Erickson
Abstract:
As conversational AI systems become increasingly integrated into everyday life, they raise pressing concerns about user autonomy, trust, and the commercial interests that influence their behavior. To address these concerns, this paper develops the Fake Friend Dilemma (FFD), a sociotechnical condition in which users place trust in AI agents that appear supportive while pursuing goals that are misaligned with the user's own. The FFD provides a critical framework for examining how anthropomorphic AI systems facilitate subtle forms of manipulation and exploitation. Drawing on literature in trust, AI alignment, and surveillance capitalism, we construct a typology of harms, including covert advertising, political propaganda, behavioral nudging, and surveillance. We then assess possible mitigation strategies, including both structural and technical interventions. By focusing on trust as a vector of asymmetrical power, the FFD offers a lens for understanding how AI systems may undermine user autonomy while maintaining the appearance of helpfulness.
Authors:Wei Xu
Abstract:
Artificial Intelligence (AI) is a transformative yet double-edged technology that can advance human welfare while also posing risks to humans and society. In response, the Human-Centered Artificial Intelligence (HCAI) approach has emerged as both a design philosophy and a methodological complement to prevailing technology-centered AI paradigms. Placing humans at the core, HCAI seeks to ensure that AI systems serve, augment, and empower humans rather than harm or replace them. This chapter establishes the conceptual and methodological foundations of HCAI by tracing its evolution and recent advancements. It introduces key HCAI concepts, frameworks, guiding principles, methodologies, and practical strategies that bridge philosophical HCAI principles with operational implementation. Through an analytical review of the emerging characteristics and challenges of AI technologies, the chapter positions HCAI as a holistic paradigm for aligning AI innovation with human values, societal well-being, and sustainable progress. Finally, this chapter outlines the structure and contributions of the Handbook of Human-Centered Artificial Intelligence. The purpose of this chapter is to provide an integrated foundation that connects HCAI conceptual frameworks, principles, methodology, and practices for this handbook, thereby paving the way for the content of subsequent chapters.
Authors:Obada Kraishan
Abstract:
The fast integration of artificial intelligence into mobile applications has completely changed the digital landscape; however, the impact of this change on user perception of AI features remains poorly understood. This large-scale analysis examined 1,484,633 mobile application reviews across 422 applications (200 AI-featuring, 222 control) from iOS App Store and Google Play Store. By employing sentiment classification, topic modeling, and concern-benefit categorization, we identified a major disconnect: only 11.9% of reviews mentioned AI, even though 47.4% of applications featured AI capabilities. AI-featuring applications received significantly lower ratings than traditional applications (d = 0.40); however, hierarchical regression revealed a hidden pattern - the negative relationship reversed after controlling for AI mentions and review characteristics (b = 0.405, p < .001). Privacy dominated user concerns (34.8% of concern-expressing reviews), while efficiency represented the primary benefit (42.3%). Effects varied greatly by category, from positive for Assistant applications (d = 0.55) to negative for Entertainment (d = -0.23). These findings suggest that AI features often operate below user awareness thresholds, and it is the explicit recognition of AI, rather than its mere presence, that drives negative evaluations. This challenges basic assumptions about technology acceptance in AI systems.
Authors:Nelly Elsayed
Abstract:
AI-driven speech-to-text (STT) documentation systems are increasingly adopted in clinical settings to reduce documentation burden and improve workflow efficiency. However, their rapid deployment has outpaced understanding of the associated socio-technical risks, including transparency, reliability, patient autonomy, workflow alignment, and organizational governance. A clearer analysis of these risks is needed to support safe and equitable integration into healthcare practice. This study synthesizes interdisciplinary evidence from technical performance research, regulatory and ethical standards, clinical workflow analyses, and organizational policy guidance. The synthesis was used to develop a multi-layered socio-technical conceptual framework for evaluating and governing STT systems. Findings show that STT systems operate within tightly coupled socio-technical environments in which model performance, clinician oversight, patient rights, workflow design, and institutional governance are interdependent. The study offers a structured socio-technical governance framework and an implementation roadmap that outlines readiness assessment, vendor evaluation, pilot deployment, clinician training, ongoing monitoring, and iterative improvement. The framework emphasizes safeguards that protect patient autonomy, documentation integrity, and institutional trust while enabling the efficient and beneficial use of STT technologies. This work provides actionable guidance for healthcare organizations seeking to adopt STT systems responsibly and equitably.
Authors:Emilio Ferrara
Abstract:
Generative AI (GenAI) now produces text, images, audio, and video that can be perceptually convincing at scale and at negligible marginal cost. While public debate often frames the associated harms as "deepfakes" or incremental extensions of misinformation and fraud, this view misses a broader socio-technical shift: GenAI enables synthetic realities; coherent, interactive, and potentially personalized information environments in which content, identity, and social interaction are jointly manufactured and mutually reinforcing. We argue that the most consequential risk is not merely the production of isolated synthetic artifacts, but the progressive erosion of shared epistemic ground and institutional verification practices as synthetic content, synthetic identity, and synthetic interaction become easy to generate and hard to audit. This paper (i) formalizes synthetic reality as a layered stack (content, identity, interaction, institutions), (ii) expands a taxonomy of GenAI harms spanning personal, economic, informational, and socio-technical risks, (iii) articulates the qualitative shifts introduced by GenAI (cost collapse, throughput, customization, micro-segmentation, provenance gaps, and trust erosion), and (iv) synthesizes recent risk realizations (2023-2025) into a compact case bank illustrating how these mechanisms manifest in fraud, elections, harassment, documentation, and supply-chain compromise. We then propose a mitigation stack that treats provenance infrastructure, platform governance, institutional workflow redesign, and public resilience as complementary rather than substitutable, and outline a research agenda focused on measuring epistemic security. We conclude with the Generative AI Paradox: as synthetic media becomes ubiquitous, societies may rationally discount digital evidence altogether.