The table shows that the primary research emphasis often does not lie on the use of voice itself, which contributes to numerous gaps in the literature. While specific gaps have already been addressed in the sections on arousal-oriented interactions with sexualized voice technologies and on health communication, broader areas also require research, as detailed below.
Examining Modalities: Voice- vs. Text-Based Conversational Agents
One important research gap concerns the comparison of modalities (voice- vs. text-based conversational agents) and, consequently, the applicability of voice assistants for arousal-oriented and sexual health communication. As previously stated, the impact of modality on user perceptions remains insufficiently explored in the field of sexual health. Similarly, a knowledge gap exists in the realm of intimate communication. It is therefore crucial to investigate the role of modality to uncover the specific contribution of voice to close social interactions and knowledge acquisition, which can provide a comprehensive understanding of the significance and impact of voice within these domains. Taking this one step further, comparing voice assistants and embodied agents is also necessary. For example, Dworkin and colleagues investigated the application of embodied conversational agents for HIV medication adherence in young HIV-positive African American men [
32]. Embodied agents exhibit the potential to function as customizable relational agents that can foster a socio-emotional connection with the user. Those agents can effectively facilitate education, encourage positive behavior change, and enhance user engagement through various modalities, including audio, graphics, animation, and text [
32]. On the other hand, existing evidence suggests that individuals prefer to disclose personal information to disembodied conversational agents [
16••]. Based on these findings, it can be postulated that voice, as a human-like cue, may hold greater relevance for facilitating intimate social interactions than for knowledge acquisition in sexual health.
Diverse (and Inclusive) Perspective on Sexualized Interactions with Artificial Voices
In the following, an outlook on important but underrepresented perspectives on the usage of voice technologies is provided. First, research on the arousal- and non-arousal-oriented usage of voice assistants does not sufficiently address diverse and inclusive user groups, even though the technology has the potential to serve non-heteronormative groups.
One aspect already discussed but still under-researched is the gendering of artificial voices: even though such voices are synthetic, and even when they are ambiguous, users tend to assign them a gender [
33]. In line with media equation theory, this can activate gender stereotypes [
16••], influencing, for instance, the system's trustworthiness (e.g., [
34,
35]) or likeability [
36]. The gender perception of the voice is therefore highly relevant for sexual and intimate interactions, as well as for non-arousal activities. Given that voice assistants are utilized by individuals of all genders, it is imperative that their representation reflects this diversity. However, this is not yet the reality: the majority of voice assistants on the market have a feminine name [
37] and are represented as feminine [
38]. This contributes to the replication of stereotypes, as the mostly female voice fulfills the social role of a servant, a role that in some cases is heavily sexualized and degraded [
39]. For example, Amazon’s voice assistant Alexa consistently assumes a subservient role, establishing and sustaining a link between women and obedience. Among other factors, this contributes to sexual harassment in interactions with voice assistants [
40,
41]. Therefore, several studies investigate how conversational technologies respond to sexual harassment and verbal abuse (e.g., [
40,
42]. While this can partly be a mechanism of playing with the system (testing out social norms that would not be accepted in interpersonal interactions with other humans), it is nevertheless an observed behavior that needs to be researched longitudinally, especially in the realm of arousal-oriented tasks. A more optimistic view conceptualizes the female artificial voice as a new superpower able to answer all imaginable questions within milliseconds; this framing, however, is likely more relevant to non-arousal-oriented tasks. While the first voice assistant systems are equipped with the option to switch to a masculine artificial voice [
33], a notable gap exists in the availability of voices that encompass the complete spectrum of gender presentations, including non-binary or gender-ambiguous voices [
37].
Diversity is also a question of the data a system is trained on. Extensive research has underscored both explicit and implicit biases in the algorithms and datasets used to train NLP systems, concerning gender [
43‐
45], race (leading to discrimination and racism, e.g., [
46,
47]), age (leading to ageism [
48]), as well as their intersections [
46,
49]. The predominant focus has been on gender stereotypes, harassment, and offensive language, particularly emphasizing restricted and/or unfavorable associations with femininities and individuals identifying as genderqueer [
50]. Yet, according to Seaborn and colleagues, a disparity is evident in the way the “gender problem” is conceptualized, as most of the present work is guided by a sex/gender binary model of male and female. Therefore, they propose a more comprehensive examination of masculine biases and gender to expose imbalances and disparities in voice assistant-oriented NLP datasets [
50].
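As a toy illustration of the kind of imbalance such dataset audits look for, the following sketch simply compares the frequency of gendered terms in a small hypothetical corpus. The word lists and corpus are invented for illustration; real audits, including the work discussed above, rely on validated lexica and far richer association measures rather than raw counts.

```python
from collections import Counter

# Hypothetical, minimal word lists; real audits use validated lexica
# and embedding-based association tests, not simple token counts.
FEMININE = {"she", "her", "woman", "assistant", "secretary"}
MASCULINE = {"he", "him", "man", "engineer", "boss"}

def gendered_term_counts(corpus: list[str]) -> Counter:
    """Count occurrences of gendered terms across a list of utterances."""
    counts = Counter()
    for utterance in corpus:
        for token in utterance.lower().split():
            token = token.strip(".,!?")
            if token in FEMININE:
                counts["feminine"] += 1
            elif token in MASCULINE:
                counts["masculine"] += 1
    return counts

corpus = [
    "She is my assistant.",
    "He is an engineer.",
    "The assistant said she would help.",
]
print(gendered_term_counts(corpus))  # prints Counter({'feminine': 4, 'masculine': 2})
```

Even this crude count surfaces the kind of skew (feminine terms clustered around servile roles) that the literature above describes at scale.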
When discussing more diverse and inclusive user groups, it is important to highlight people of different ages and sexualities, as well as users whose needs do not align with the affordances designed for normative user groups, such as people with physical or intellectual disabilities or medical conditions.
Adolescents, who often seek health-related information online, represent a significant user group: a study by Rideout and Fox found that approximately 87% of Americans aged 14 to 22 use online platforms for health-related information [
51]. Additionally, the internet is a valuable resource for young individuals exploring their sexuality, particularly for LGBTQIA+ youth seeking connections with like-minded individuals [
52,
53]. Moreover, the growing group of older adults exploring the possibilities the digital world offers for sex and love constitutes another significant user segment. A recent study revealed that online sexual activities among older adults can be categorized into three groups: non-arousal activities (e.g., visiting educational websites or chatting on dating platforms), solitary arousal activities (e.g., watching pornography), and partnered arousal activities involving at least one other individual (e.g., engaging in webcam sex) [
54,
55]. Hence, older adults are utilizing the digital space in domains that align with the application areas discussed in this work. Yet, a significant gap remains in understanding the potential role of voice for this user group. Voice-only technologies may offer relief for older individuals who, for example, have difficulty typing. However, they could also pose a challenge for those who are unfamiliar with communicating with technical devices through speech or with expressing intimate thoughts aloud as commands to a voice assistant.
Additionally, voice-only technologies offer new potential from an accessibility perspective, as individuals experiencing motor, linguistic, and cognitive impairments can engage effectively with voice assistants, provided they retain certain levels of cognitive and linguistic ability [
56]. Studies show that despite the presence of certain accessibility challenges, individuals with a range of disabilities already utilize state-of-the-art voice assistant systems. This usage extends to unexpected scenarios such as speech therapy and providing support for caregivers [
57]; it is therefore also imaginable that sexualized interaction could be of benefit for these user groups.
The Potential of Personalized LLMs
As LLMs improve, personalization of voice interactions is becoming a key factor in satisfying users’ expectations of customized experiences that correspond to their individual needs and preferences [
58]. Stronger personalization can, for instance, affect how users align their speech patterns with a system [
59] or the level of parasocial relationship they form with the voice [
59], and can consequently influence variables such as preference for and trust in the agent [
60], and smoother conversations [
61,
62]. In terms of arousal-oriented interactions, this might even be the key to ongoing dialogues and, in some cases, romantic connections. While personalized LLMs are likely to enhance usage for heteronormative user groups, they will be especially key for user groups with more diverse affordances.
One imaginable example concerns users with hearing or speech impairments, as well as neurodivergent people, for whom adjustments such as altering the voice assistant’s speech rate or adapting its language use could enhance the overall user experience, both for arousal and non-arousal applications.
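How such accessibility-driven adjustments could be represented is sketched below as a minimal per-user profile. The attribute names, value ranges, and mapping from self-reported needs to settings are assumptions for illustration, not any vendor’s actual API.

```python
from dataclasses import dataclass

@dataclass
class VoiceProfile:
    """Hypothetical per-user settings a personalized voice assistant might honor."""
    speech_rate: float = 1.0        # 1.0 = default speed; values < 1.0 slow output down
    language_level: str = "plain"   # assumed levels: "plain", "detailed", "simplified"
    captions_enabled: bool = False  # text captions alongside audio

def profile_for_user(needs: set[str]) -> VoiceProfile:
    """Derive an illustrative profile from self-reported needs (hypothetical tags)."""
    profile = VoiceProfile()
    if "hearing_impairment" in needs:
        profile.captions_enabled = True
    if "processing_speed" in needs:      # e.g., some neurodivergent users
        profile.speech_rate = 0.8
        profile.language_level = "simplified"
    return profile

print(profile_for_user({"hearing_impairment"}))
```

The point of the sketch is only that such preferences would need to be stored per user and applied consistently across arousal- and non-arousal-oriented interactions alike.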
Psychological Aspects and Ethical Considerations
There is also a demand for further research exploring psychological components related to voice assistants, encompassing their capabilities and quality as well as factors like acceptability, trust, and trustworthiness. Previous studies on conversational agents in health care have highlighted various limitations, including privacy concerns, limited conversational responsiveness, user-perceived undesirable personality (e.g., rude, unsympathetic, patronizing, or judgmental), and a lack of trust in the creators of the digital assistants [
12,
63]. It must be noted that providers of such technologies bear immense social responsibility; users can form deep relationships with artificial communication partners, and sudden updates or discontinuation of specific services can lead to severe social reactions (cf. the withdrawal of the romantic partner mode in Replika and reports from users that went as far as suicidal thoughts [
64]). It is also essential to highlight that voice assistants in particular are heavily controlled by their providers. The applications offered on the most widely used devices are subject to the rules of the respective providers, and these are often heavily regulated, especially in the area of sexuality. It therefore remains to be seen to what extent dedicated devices will be developed for intimate interactions with voice assistants or whether greater reliance will be placed on voice interaction via smartphones. This responsibility also extends to the realm of sexual health communication, encompassing social implications and considerations regarding privacy and data security. Previous studies have identified data privacy and confidentiality concerns as obstacles to the regular use of virtual assistants, particularly in contexts involving sensitive information, such as health contexts [
65,
66]. Thus, it is reasonable to posit that these factors are also highly influential when addressing inquiries about sexuality, which involve deeply intimate subject matter and are widely regarded as highly sensitive.
Moreover, handling sensitive data brings the notion of trust into focus. Questions therefore arise regarding the trustworthiness of these systems and the features that influence patients’ willingness to trust them in a medical setting [
67]. These aspects also connect to the point above, as certain user groups, such as children and adolescents, require particular attention, especially concerning data security. Because a substantial proportion of children aged two to eight already engage with voice assistants daily, it is essential to consider the potential and risks of utilizing voice assistants for sexual health education and intimate communication in general [
68]. Particularly in this context, it is crucial to comprehend the mechanisms of data storage and access: research conducted by Szczuka and colleagues demonstrated that such understanding is negatively associated with children’s inclination to disclose private information to a voice assistant. More precisely, the language employed by the voice assistant in their study (e.g., the phrase “I am silent as a grave” when asked about entrusting a secret) can prompt children to disclose sensitive information, which can subsequently be accessed by unauthorized individuals [
69]. As the act of safeguarding information is integral to the process of identity formation, this naturally includes highly sensitive details concerning body perception and sexuality. Voice assistants used in a sexuality-related context should therefore disclose their data policies and operate in the users’ best interests. This could, for instance, be realized by providing easy options for data deletion, straightforward access to a data management system that outlines who is authorized to access which data, and perhaps even preventive prompts that remind users of privacy considerations. To achieve this, however, it would also be important to have adaptive systems that recognize users and their different affordances, which is something that still needs to be implemented in state-of-the-art systems.
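In the simplest case, the three privacy measures just mentioned (preventive prompts, transparent access to stored data, and easy deletion) could look like the following sketch. All class, method, and topic names are hypothetical; a real system would additionally need secure storage, authentication, and auditable deletion.

```python
class PrivacyAwareAssistant:
    """Minimal sketch of the user-facing privacy controls discussed above."""

    SENSITIVE_TOPICS = {"sexuality", "health", "body"}  # assumed topic tags

    def __init__(self):
        self._stored_utterances: list[tuple[str, str]] = []  # (topic, text)

    def handle(self, topic: str, text: str) -> str:
        reminder = ""
        if topic in self.SENSITIVE_TOPICS:
            # Preventive prompt: warn the user before a sensitive request is stored.
            reminder = "Reminder: this request will be stored; you can delete your data anytime. "
        self._stored_utterances.append((topic, text))
        return reminder + f"Handling request about {topic}."

    def list_stored(self) -> list[str]:
        """Straightforward access: show the user exactly what is stored."""
        return [f"{topic}: {text}" for topic, text in self._stored_utterances]

    def delete_all(self) -> int:
        """Easy data deletion: wipe stored utterances and report how many were removed."""
        n = len(self._stored_utterances)
        self._stored_utterances.clear()
        return n
```

For example, a sensitive request such as `handle("sexuality", "a question")` would return a response prefixed with the privacy reminder, while `delete_all()` gives the user a one-step way to clear everything the assistant has retained.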